|
| 1 | +# Raft Developer Documentation |
| 2 | + |
| 3 | +This documentation provides a high level introduction to the `hashicorp/raft` |
| 4 | +implementation. The intended audience is anyone interested in understanding |
| 5 | +or contributing to the code. |
| 6 | + |
| 7 | +## Contents |
| 8 | + |
| 9 | +1. [Terminology](#terminology) |
| 10 | +2. [Operations](#operations) |
| 11 | + 1. [Apply](./apply.md) |
| 12 | +3. [Threads](#threads) |
| 13 | + |
| 14 | + |
| 15 | +## Terminology |
| 16 | + |
| 17 | +This documentation uses the following terms as defined. |
| 18 | + |
| 19 | +* **Cluster** - the set of peers in the raft configuration |
| 20 | +* **Peer** - a node that participates in the consensus protocol using `hashicorp/raft`. A |
| 21 | + peer may be in one of the following states: **follower**, **candidate**, or **leader**. |
| 22 | +* **Log** - the full set of log entries. |
| 23 | +* **Log Entry** - an entry in the log. Each entry has an index that is used to order it |
| 24 | + relative to other log entries. |
| 25 | + * **Committed** - A log entry is considered committed if it is safe for that entry to be |
| 26 | + applied to state machines. A log entry is committed once the leader that created the |
| 27 | + entry has replicated it on a majority of the peers. A peer has successfully |
| 28 | + replicated the entry once it is persisted. |
| 29 | + * **Applied** - log entry applied to the state machine (FSM) |
| 30 | +* **Term** - raft divides time into terms of arbitrary length. Terms are numbered with |
| 31 | + consecutive integers. Each term begins with an election, in which one or more candidates |
| 32 | + attempt to become leader. If a candidate wins the election, then it serves as leader for |
| 33 | + the rest of the term. If the election ends with a split vote, the term will end with no |
| 34 | + leader. |
| 35 | +* **FSM** - finite state machine, stores the cluster state |
| 36 | +* **Client** - the application that uses the `hashicorp/raft` library |
| 37 | + |
| 38 | +## Operations |
| 39 | + |
| 40 | +### Leader Write |
| 41 | + |
| 42 | +Most write operations must be performed on the leader. |
| 43 | + |
| 44 | +* RequestConfigChange - update the raft peer list configuration |
| 45 | +* Apply - apply a log entry to the log on a majority of peers, and the FSM. See [raft apply](apply.md) for more details. |
| 46 | +* Barrier - a special Apply that does not modify the FSM, used to wait for previous logs to be applied |
| 47 | +* LeadershipTransfer - stop accepting client requests, and tell a different peer to start a leadership election |
| 48 | +* Restore (Snapshot) - overwrite the cluster state with the contents of the snapshot (excluding cluster configuration) |
| 49 | +* VerifyLeader - send a heartbeat to all voters to confirm the peer is still the leader |
| 50 | + |
| 51 | +### Follower Write |
| 52 | + |
| 53 | +* BootstrapCluster - store the cluster configuration in the local log store |
| 54 | + |
| 55 | + |
| 56 | +### Read |
| 57 | + |
| 58 | +Read operations can be performed on a peer in any state. |
| 59 | + |
| 60 | +* AppliedIndex - get the index of the last log entry applied to the FSM |
| 61 | +* GetConfiguration - return the latest cluster configuration |
| 62 | +* LastContact - get the last time this peer made contact with the leader |
| 63 | +* LastIndex - get the index of the latest stored log entry |
| 64 | +* Leader - get the address of the peer that is currently the leader |
| 65 | +* Snapshot - snapshot the current state of the FSM into a file |
| 66 | +* State - return the state of the peer |
| 67 | +* Stats - return some stats about the peer and the cluster |
| 68 | + |
| 69 | +## Threads |
| 70 | + |
| 71 | +Raft uses the following threads to handle operations. The name of the thread is in bold, |
| 72 | +and a short description of the operation handled by the thread follows. The main thread is |
| 73 | +responsible for handling many operations. |
| 74 | + |
| 75 | +* **run** (main thread) - different behaviour based on peer state |
| 76 | + * follower |
| 77 | + * processRPC (from rpcCh) |
| 78 | + * AppendEntries |
| 79 | + * RequestVote |
| 80 | + * InstallSnapshot |
| 81 | + * TimeoutNow |
| 82 | + * liveBootstrap (from bootstrapCh) |
| 83 | + * periodic heartbeatTimer (HeartbeatTimeout) |
| 84 | + * candidate - starts an election for itself when called |
| 85 | + * processRPC (from rpcCh) - same as follower |
| 86 | + * acceptVote (from askPeerForVote) |
| 87 | + * leader - first starts replication to all peers, and applies a Noop log to ensure the new leader has committed up to the commit index |
| 88 | + * processRPC (from rpcCh) - same as follower, however we don’t actually expect to receive any RPCs other than a RequestVote |
| 89 | + * leadershipTransfer (from leadershipTransferCh) - |
| 90 | + * commit (from commitCh) - |
| 91 | + * verifyLeader (from verifyCh) - |
| 92 | + * user restore snapshot (from userRestoreCh) - |
| 93 | + * changeConfig (from configurationChangeCh) - |
| 94 | + * dispatchLogs (from applyCh) - handle client Raft.Apply requests by persisting logs to disk, and notifying replication goroutines to replicate the new logs |
| 95 | + * checkLease (periodically LeaseTimeout) - |
| 96 | +* **runFSM** - has exclusive access to the FSM, all reads and writes must send a message to this thread. Commands: |
| 97 | + * apply logs to the FSM, from the fsmMutateCh, from processLogs, from leaderLoop (leader) or appendEntries RPC (follower/candidate) |
| 98 | + * restore a snapshot to the FSM, from the fsmMutateCh, from restoreUserSnapshot (leader) or installSnapshot RPC (follower/candidate) |
| 99 | + * capture snapshot, from fsmSnapshotCh, from takeSnapshot (runSnapshot thread) |
| 100 | +* **runSnapshot** - handles the slower part of taking a snapshot. From a pointer captured by the FSM.Snapshot operation, this thread persists the snapshot by calling FSMSnapshot.Persist. Also calls compactLogs to delete old logs. |
| 101 | + * periodically (SnapshotInterval) takeSnapshot for log compaction |
| 102 | + * user snapshot, from userSnapshotCh, takeSnapshot to return to the user |
| 103 | +* **askPeerForVote (candidate only)** - short lived goroutine that synchronously sends a RequestVote RPC to all voting peers, and waits for the response. One goroutine per voting peer. |
| 104 | +* **replicate (leader only)** - long running goroutine that synchronously sends log entry AppendEntry RPCs to all peers. Also starts the heartbeat thread, and possibly the pipelineDecode thread. Runs sendLatestSnapshot when AppendEntry fails. |
| 105 | + * **heartbeat (leader only)** - long running goroutine that synchronously sends heartbeat AppendEntry RPCs to all peers. |
| 106 | + * **pipelineDecode (leader only)** |
0 commit comments