Ddia-Trouble

Trouble with Distributed Systems

In this chapter we try to enumerate all the things that can go wrong in distributed systems. Spoiler alert: basically everything. Some of the important issues discussed, with the most practical ramifications:

Common Problems

network

Packets may be lost or arbitrarily delayed. Same goes for their replies, so there's no way to know whether a message gets through.

clock sync

A node's clock may go way out of sync with other nodes, even with NTP. It may jump forward/backward in time, and is completely unreliable.

process

A process may pause for a substantial amount of time (especially if you are not using Rust and have a garbage collector like a chump). In this case, it may be declared dead by other nodes and come back to life without knowing what happened.

System Models

Such partial failures are the defining characteristic of distributed systems. When writing an algorithm for distributed systems, it helps to first create a system model where assumptions are made formally. Common system models for timing assumptions include:

Synchronous model: assumes bounded network delay, process pauses, and clock error. Not very realistic for most systems.
Partially synchronous model: assumes bounded most of the time but sometimes exceeds those bounds. Realistic for most systems.
Asynchronous model: there are no timing assumptions and no clock. Since this model is so restrictive, few algorithms can be designed for this model.

Fault types

System models for nodes include:

Crash-stop faults: A node may stop responding forever and never come back.
Crash-recovery faults: A node may stop responding, and then come back after some unknown time. Nodes are also assumed to have stable storage preserved across crashes, with only in-memory state lost due to crash. This is probably the most useful model of the three.
Byzantine (arbitrary) faults: Nodes can do anything, even try to trick/deceive other nodes.

Algorithm Properties

When specifying the correctness properties of an algorithm, it's helpful to characterize properties into safety and liveness properties:

if a safety property is violated, we can point to exactly when it was violated in time. This violation cannot be undone.
a liveness property may not hold at some point in time, but there is always hope that it will be satisfied in the future.

Commonly, we'll require that safety properties always hold in all possible states of a system (e.g. even if all the nodes crash), but allow caveats for liveness properties (e.g. requests needs to receive response but only if a majority of nodes haven't crashed).