#paper

Introduction

Dynamo is used to manage the state of services that have very high reliability requirements and need tight control over the tradeoffs between availability, consistency, cost-effectiveness and performance.

Background

Many services only need to store and retrieve data by primary key and do not need the complex querying and management functionality offered by an RDBMS.

System Assumptions and Requirements

Query Model

Simple read/write operations to data uniquely identified by a key. State is stored as binary objects (blobs). No operation spans multiple data items and there is no need for a relational schema.

ACID Properties

See ACID

Set of properties that guarantee that database transactions are processed reliably.

However, systems that provide ACID guarantees tend to have poor availability; hence, Dynamo targets applications that can operate with weaker consistency.

Service Level Agreements

See SLA

Example: respond within 300 ms for 99.9% of requests at a peak client load of 500 requests/sec.

Design Considerations

Conflict resolution is performed on reads so that writes are never rejected (an "always writeable" data store).

Other key principles:

Incremental scalability

Should be able to scale out one storage host at a time.

Symmetry

Every node should have the same set of responsibilities as its peers.

Decentralization

No centralized control; favor peer-to-peer techniques.

Heterogeneity

Work distribution must be proportional to the capabilities of the individual servers.

TO READ

System Architecture

System Interface

  • get(key)
  • put(key, context, object)

The context encodes system metadata about the object, such as its version.
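A minimal sketch of this interface in Python, assuming a hypothetical DynamoClient class (the names and types below are illustrative, not from the paper):

    # Hypothetical sketch of the Dynamo client interface (names are illustrative).
    from dataclasses import dataclass

    @dataclass
    class GetResult:
        objects: list          # one object, or several causally-unrelated versions
        context: dict          # opaque metadata (e.g. vector clock) passed back on put()

    class DynamoClient:
        def get(self, key: bytes) -> GetResult:
            """Return all causally-unrelated versions of the object plus a context."""
            raise NotImplementedError

        def put(self, key: bytes, context: dict, obj: bytes) -> None:
            """Store a new version; the context ties it to the version(s) previously read."""
            raise NotImplementedError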

Partition Algorithm

Relies on consistent hashing: the output range of the hash function is treated as a fixed circular space, or "ring".

Each node is assigned a random position within this space. Each data item's key is hashed to a position on the ring, and the item is assigned to the first node found by walking the ring clockwise from that position.

Advantage: the arrival or departure of a node only affects its immediate neighbours.

Disadvantage: Non-uniform data and load distribution.

Solution: each node is assigned multiple points on the ring ("virtual nodes").

  • if a node becomes unavailable, the load is evenly dispersed across the remaining nodes.
  • when a node comes back or a new node is added, it accepts a roughly equivalent amount of load from each of the other available nodes.
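A minimal single-process sketch of consistent hashing with virtual nodes as described above; MD5 is used as the ring hash (Dynamo hashes keys with MD5), while the HashRing class and its parameters are illustrative assumptions:

    import hashlib
    from bisect import bisect_right

    def ring_hash(value: str) -> int:
        """Map a string to a position on the ring (128-bit MD5, treated as circular)."""
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, vnodes_per_node: int = 8):
            self.vnodes_per_node = vnodes_per_node
            self.ring = []          # sorted list of (position, physical node) pairs

        def add_node(self, node: str) -> None:
            # Each physical node is placed at several positions ("virtual nodes").
            for i in range(self.vnodes_per_node):
                self.ring.append((ring_hash(f"{node}#{i}"), node))
            self.ring.sort()

        def remove_node(self, node: str) -> None:
            self.ring = [(pos, n) for pos, n in self.ring if n != node]

        def coordinator(self, key: str) -> str:
            """Walk clockwise from the key's position to the first virtual node at or after it."""
            pos = ring_hash(key)
            positions = [p for p, _ in self.ring]
            idx = bisect_right(positions, pos) % len(self.ring)
            return self.ring[idx][1]

    ring = HashRing()
    for n in ("A", "B", "C"):
        ring.add_node(n)
    print(ring.coordinator("user:42"))   # physical node whose range covers this key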

Replication

Each data item is replicated at N hosts (N is configurable). Each key k is assigned to a coordinator node, which is responsible for the replication of the data items in its range: it stores the key locally and replicates it to the N - 1 clockwise successor nodes in the ring, skipping positions so that the replicas land on distinct physical nodes.

The list of nodes responsible for storing a key is called the preference list.
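A sketch of how a preference list could be derived from such a ring, skipping virtual-node positions whose physical node is already in the list so that the N replicas land on distinct hosts (the preference_list helper is hypothetical, not the paper's code):

    import hashlib
    from bisect import bisect_right

    def ring_hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def preference_list(key: str, ring: list, n: int) -> list:
        """ring: sorted list of (position, physical_node) virtual-node entries.
        Returns the N distinct physical nodes responsible for `key`:
        the coordinator plus its clockwise successors."""
        positions = [p for p, _ in ring]
        idx = bisect_right(positions, ring_hash(key)) % len(ring)
        distinct = len({node for _, node in ring})
        nodes = []
        while len(nodes) < n and len(nodes) < distinct:
            node = ring[idx][1]
            if node not in nodes:        # skip positions owned by nodes already chosen
                nodes.append(node)
            idx = (idx + 1) % len(ring)
        return nodes

    ring = sorted((ring_hash(f"{node}#{i}"), node)
                  for node in ("A", "B", "C", "D") for i in range(8))
    print(preference_list("user:42", ring, n=3))   # e.g. three distinct physical nodes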

Data Versioning

Updates are propagated to replicas asynchronously, so a read may return data that is not up to date.

Each modification is treated as a new and immutable version of the data, and multiple versions can be present at the same time. If conflicts happen, the client does the reconciliation.

Dynamo also uses vector clocks to capture causality between versions of the same object.

When a client wishes to update an object, it must specify which version it is updating by passing the context it obtained from an earlier read.
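A minimal vector-clock sketch, assuming a clock is a map from node id to counter: one version descends from (supersedes) another if every counter is greater than or equal, and if neither descends from the other the versions conflict and the client reconciles them. The function names are illustrative:

    def increment(clock: dict, node: str) -> dict:
        """Return a new clock with `node`'s counter advanced (done by the coordinator on put)."""
        new = dict(clock)
        new[node] = new.get(node, 0) + 1
        return new

    def descends(a: dict, b: dict) -> bool:
        """True if the version with clock `a` supersedes the version with clock `b`."""
        return all(a.get(node, 0) >= counter for node, counter in b.items())

    def conflict(a: dict, b: dict) -> bool:
        """Neither clock descends from the other: the versions are causally unrelated."""
        return not descends(a, b) and not descends(b, a)

    v1 = increment({}, "Sx")            # {Sx: 1}
    v2 = increment(v1, "Sx")            # {Sx: 2}, supersedes v1
    v3 = increment(v2, "Sy")            # {Sx: 2, Sy: 1}
    v4 = increment(v2, "Sz")            # {Sx: 2, Sz: 1}
    print(descends(v2, v1))             # True  -> v1 can be discarded
    print(conflict(v3, v4))             # True  -> client must reconcile v3 and v4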

Execution of get() and put() operations

Let R be the minimum number of nodes that must participate in a successful read operation and W be the minimum number of nodes that must participate in a successful write operation.

Setting R + W > N yields a quorum-like system. (Dynamo actually uses a "sloppy quorum": operations are performed on the first N healthy nodes from the preference list, which are not always the first N nodes on the ring.)

For a put() request, the coordinator generates the vector clock for the new version and writes it locally. It then sends the new version to the N highest-ranked reachable nodes; the write is considered successful once at least W - 1 of them respond.

A get() request works the same way: the coordinator requests all existing versions from the N highest-ranked reachable nodes, waits for R responses, and returns any causally unrelated versions to the client for reconciliation.
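A sketch of the coordinator's quorum logic under these N/R/W rules; replica I/O is stubbed with in-memory dictionaries so the example is self-contained, and the helper names are hypothetical:

    # Sketch of quorum-style reads and writes with stubbed replicas.
    N, R, W = 3, 2, 2                       # R + W > N gives a quorum-like system

    replicas = [dict() for _ in range(N)]   # stand-ins for the N preference-list nodes

    def coordinator_put(key, context, value):
        version = dict(context or {})
        version["coordinator"] = version.get("coordinator", 0) + 1   # toy "vector clock"
        acks = 0
        for store in replicas:              # send to the N highest-ranked reachable nodes
            store[key] = (version, value)
            acks += 1
            if acks >= W:                   # the coordinator's own write counts toward W
                return version              # success once W replicas hold the new version
        raise RuntimeError("not enough replicas acknowledged the write")

    def coordinator_get(key):
        responses = []
        for store in replicas:              # ask the N highest-ranked reachable nodes
            if key in store:
                responses.append(store[key])
            if len(responses) >= R:
                break                       # wait for R responses before returning
        return responses                    # may contain divergent versions to reconcile

    ctx = coordinator_put("user:42", None, b"cart-v1")
    print(coordinator_get("user:42"))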

Handling Failures: Hinted Handoff

If a node is down, the replica intended for it is sent to the next healthy node on the ring, tagged with a hint in its metadata identifying the intended recipient. When the original node recovers, the hinted node delivers the replica back to it and can then delete its local copy.

This ensures that read and write operations do not fail because of temporary node or network failures.
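A toy sketch of hinted handoff under these rules, with node liveness and storage modeled as plain dictionaries (all names are illustrative):

    # If a replica in the preference list is down, the write goes to a fallback
    # node with a hint naming the intended owner; it is handed back on recovery.
    stores = {node: {} for node in ("A", "B", "C", "D")}   # local storage per node
    hinted = {node: [] for node in ("A", "B", "C", "D")}   # separate store for hinted replicas
    up = {"A": True, "B": False, "C": True, "D": True}     # B is temporarily unreachable

    def write_replica(preference_list, fallback, key, value):
        for node in preference_list:
            if up[node]:
                stores[node][key] = value
            else:
                # Hand the replica to the fallback node, tagged with the intended owner.
                hinted[fallback].append((node, key, value))

    def handoff_when_recovered(fallback, node):
        """When the intended owner comes back, deliver its hinted replicas and drop them."""
        kept = []
        for owner, key, value in hinted[fallback]:
            if owner == node and up[node]:
                stores[node][key] = value
            else:
                kept.append((owner, key, value))
        hinted[fallback] = kept

    write_replica(["A", "B", "C"], fallback="D", key="user:42", value=b"cart-v1")
    up["B"] = True
    handoff_when_recovered("D", "B")
    print(stores["B"])     # the replica originally meant for B has been handed back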

Handling permanent failures: Replica synchronization

Hinted replicas may themselves become unavailable before they can be returned. To keep replicas synchronized despite such permanent failures, Dynamo implements an anti-entropy (replica synchronization) protocol.

Merkle trees - hash trees where leaves are hashes of the values of individual keys and parent nodes are hashes of their children.

Each branch of the tree can be checked independently, without downloading the entire tree or data set, which decreases the amount of data transferred.

Each node maintains a separate Merkle tree for each key range it hosts, and two nodes exchange the roots of the trees corresponding to the key ranges they host in common to detect differences.
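A minimal sketch of a Merkle-tree comparison between two replicas of a key range, using SHA-256 for illustration (not Dynamo's actual implementation); only the roots need to be exchanged first, and mismatching subtrees are then descended key by key:

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def build_tree(kv_pairs):
        """Leaves hash individual key/value pairs; parents hash their children."""
        level = [h(k.encode() + v) for k, v in sorted(kv_pairs.items())]
        tree = [level]
        while len(level) > 1:
            level = [h(b"".join(level[i:i + 2])) for i in range(0, len(level), 2)]
            tree.append(level)
        return tree    # tree[-1][0] is the root hash

    def differs(tree_a, tree_b):
        """True if the replicas disagree; comparing roots costs one hash exchange."""
        return tree_a[-1][0] != tree_b[-1][0]

    replica_a = {"k1": b"v1", "k2": b"v2", "k3": b"v3", "k4": b"v4"}
    replica_b = dict(replica_a, k3=b"stale")

    ta, tb = build_tree(replica_a), build_tree(replica_b)
    print(differs(ta, tb))   # True: replicas then descend into child hashes to find k3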

Membership and Failure Detection

Ring Membership

A gossip-based protocol is used to propagate membership changes and maintain an eventually consistent view of membership. Every second, each node contacts a peer chosen at random, and the two nodes efficiently reconcile their persisted membership change histories.
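A toy sketch of gossip-style reconciliation of membership change histories, assuming each change carries a logical timestamp and the newer entry wins (this data model is an illustrative simplification, not the paper's):

    import random

    # Each node's persisted view: node name -> (logical_timestamp, status).
    views = {
        "A": {"A": (3, "up"), "B": (1, "up")},
        "B": {"A": (2, "up"), "B": (1, "up"), "C": (5, "removed")},
        "C": {"C": (4, "up")},
    }

    def reconcile(view_a, view_b):
        """Merge two membership histories; the entry with the newer timestamp wins."""
        merged = dict(view_a)
        for node, (ts, status) in view_b.items():
            if node not in merged or ts > merged[node][0]:
                merged[node] = (ts, status)
        return merged

    def gossip_round():
        """Each node contacts one peer at random and both adopt the merged view."""
        for name in views:
            peer = random.choice([n for n in views if n != name])
            merged = reconcile(views[name], views[peer])
            views[name] = dict(merged)
            views[peer] = dict(merged)

    for _ in range(3):
        gossip_round()
    print(views["A"])   # converges toward an eventually consistent membership view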

External Discovery

To prevent logical partitions, where two groups of nodes are each unaware of the other, Dynamo uses seeds: nodes that are discovered via an external mechanism and known to all nodes.

Failure Detection

Failure detection is local: node A considers node B failed if B does not respond to A's messages, even if B is responsive to some other node C.

Adding / Removing Storage Nodes