Distributed Systems Notes #4

Open · wants to merge 11 commits into base `master`

# Explanation
- *Transaction*: execution of one or more operations (e.g., SQL queries) on a shared database to perform some higher-level function. ^dae2be
- Basic unit of change in DBMS.
- Must be #atomic.
- Transaction starts with the BEGIN command and ends with either COMMIT or ABORT.
3 changes: 2 additions & 1 deletion Database Systems/Database Systems.md
# Explanation
- *Database*: organized collection of inter-related data that models some aspect of the real world.
- *Database Management System*: software that allows applications to store and analyze information in a database. ^1ec6f9
- Designed to allow definition, creation, querying, updating, and administration of databases in accordance with some data model.
- *Data Model*: collection of concepts for describing the data in a database. Examples:
- [[Relational Model]] (most common),
- [[Query Planning & Optimization]]
- [[Concurrency Control]]
- [[Crash Recovery]]
- [[Distributed Databases]]

# FAQ
- Why do we need a database management system? Why not just store data in a flat CSV file (a.k.a. the strawman system)? Because there are many issues with this approach:
26 changes: 26 additions & 0 deletions Distributed Systems/CAP Theorem.md
# Explanation

> *The CAP Theorem*: it is impossible for a distributed system to always be **Consistent**, **Available**, and **Partition-Tolerant**. Only two of these three properties can be chosen.

---

### Consistency
- [[Linearizability#^7f4541|Consistency]] is synonymous with [[Linearizability|linearizability]] for operations on all nodes.
- Once a write completes, all future reads should return that write’s value or the value of a later write.
- Additionally, once a read has returned a value, future reads should return that value or the value of a later write.
- NoSQL systems compromise this property in favor of the latter two.
- Other systems will favor this property and one of the latter two.

### Availability
- *Availability*: the concept that all nodes that are up can satisfy all requests.

### Partition-Tolerance
- *Partition tolerance*: the system can still operate correctly despite some message loss between nodes that are trying to reach consensus on values.

## Insights
- If consistency and partition-tolerance are chosen for a system, updates are not allowed until a majority of nodes are reconnected; this is the choice typically made in traditional or NewSQL DBMSs.
- There is a modern version that considers consistency vs. latency trade-offs: *PACELC Theorem*.
- In case of **network partitioning** (P) in a distributed system, one has to choose between **availability** (A) and **consistency** (C), else (E), even when the system runs normally without network partitions, one has to choose between **latency** (L) and **consistency** (C).

# Sources
- CMU 15-445 Lecture 22 - "Distributed Transactional Database Systems"
52 changes: 52 additions & 0 deletions Distributed Systems/Chain Replication.md
# Explanation
- *Chain Replication (CR)*: a form of [[Replication#Replicated State Machine|replicated state machine]] for replicating data at the [[Replication#^518c94|application level]] across multiple nodes to provide a strongly consistent storage interface.

## Mechanism
- The basic approach organizes all nodes storing an object in a chain.
- Nodes form a chain of a defined length.
- The chain tail handles all read requests, and the chain head handles all write requests.
- **Writes** propagate down the chain in order and are considered **committed** once they reach the tail.
- When a write operation is received by a node, it is propagated to the next node in the chain.
- Once the write reaches the tail node, it has been applied to all replicas in the chain, and it is considered committed.
- **Reads** are only handled by the tail node, so only committed values can be returned by a read.
- CR achieves **strong consistency**/[[Linearizability]] because all writes are committed at the tail and all reads come from the tail (sketched below).
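
Below is a minimal, single-process Python sketch of the write and read paths described above. It is illustrative only: the `ChainNode` class and its method names are assumptions, not taken from the lecture, and a real system would send these messages over the network and handle failures.

```python
class ChainNode:
    def __init__(self, name):
        self.name = name
        self.store = {}        # key -> latest value applied at this replica
        self.successor = None  # next node in the chain (None for the tail)

    def write(self, key, value):
        # Apply locally, then propagate the write down the chain toward the tail.
        self.store[key] = value
        if self.successor is not None:
            self.successor.write(key, value)
        # Once the write has reached the tail, it is applied on every replica,
        # i.e., committed.

    def read(self, key):
        # Only the tail serves reads, so only committed values are returned.
        assert self.successor is None, "reads must go to the tail"
        return self.store.get(key)


# Build a three-node chain: head -> middle -> tail.
head, middle, tail = ChainNode("head"), ChainNode("middle"), ChainNode("tail")
head.successor, middle.successor = middle, tail

head.write("x", 1)     # clients send writes to the head
print(tail.read("x"))  # clients send reads to the tail -> 1
```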

## Types
- **Basic Chain Replication**: This is the original chain replication method.
- All reads must be handled by the tail node and writes propagate down the chain from the head.
- This method has good throughput and easy recovery, but suffers from read throughput limitations and potential hotspots.
- **Chain Replication with Apportioned Queries (CRAQ)**: This method improves upon basic Chain Replication by allowing any node in the chain to handle read operations while still providing strong consistency guarantees.
- This is achieved by allowing nodes to store multiple versions of an object and using version queries to ensure consistency, as sketched below.
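
A compact sketch of how such apportioned reads could work, assuming the CRAQ scheme of dirty/clean versions and version queries to the tail; the class and method names are hypothetical, and the acknowledgements that flow back up the chain to mark versions clean are omitted for brevity.

```python
class CraqNode:
    def __init__(self):
        self.versions = {}     # key -> {version number: value}
        self.clean = {}        # key -> highest committed (clean) version
        self.successor = None  # next node toward the tail
        self.tail = None       # reference to the tail, used for version queries

    def write(self, key, version, value):
        # Store the new (dirty) version and propagate it toward the tail.
        self.versions.setdefault(key, {})[version] = value
        if self.successor:
            self.successor.write(key, version, value)
        else:
            self.clean[key] = version   # the tail commits the version
        # (A real implementation also acks back up the chain so earlier nodes
        #  can mark the version clean and garbage-collect older versions.)

    def read(self, key):
        # Any node may serve reads. If its latest version is clean, answer
        # locally; otherwise ask the tail which version is committed.
        latest = max(self.versions.get(key, {0: None}))
        if self.clean.get(key) == latest:
            return self.versions[key][latest]
        committed = self.tail.clean.get(key)
        return self.versions[key][committed] if committed else None


head, tail = CraqNode(), CraqNode()
head.successor = tail
head.tail = tail.tail = tail
head.write("x", 1, "a")
print(head.read("x"))   # a non-tail node serves the read -> "a"
```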

## Advantages
- **Strong Consistency**: CR guarantees that all reads will see the latest written value, ensuring data consistency.
- **Simplicity**: CR is relatively simple to implement compared to other strong consistency protocols.
- **Good Throughput**: Write operations are cheaper than in other protocols offering strong consistency, and can be pipelined down the chain, distributing transmission costs evenly.
- The head sends each write just once, while in Raft, the leader sends writes to all nodes.
- **Quick and Easy Recovery**: The chain structure simplifies failure recovery, making it faster and easier compared to other methods.
- Every replica knows of every committed write, and if the head or tail fails, their successor takes over.

## Disadvantages
- **Limited Read Throughput**: All reads must go to the tail node, limiting read throughput to the capacity of a single node.
- In Raft, reads involve all servers, not just one as in CR.
- **Hotspots**: The single point of read access at the tail can create performance bottlenecks.
- **Increased Latency**: Wide-area deployments can suffer from increased write latency as writes must propagate through the entire chain.
- **Not Immediately Fault-Tolerant**: All CRAQ replicas must participate for any write to commit. If a node is not reachable, CRAQ must wait. This differs from protocols like ZooKeeper and Raft, which can keep operating as long as a majority of their nodes are reachable.
- **Susceptible to Partitions**: Because CRAQ is not immediately fault-tolerant, it cannot handle partitions on its own. If a partition occurs, the system risks split brain and must wait until the network is restored.

## Handling Partitions
- To safely use a replication system that can't handle partitions, a single configuration manager must be used.
- The configuration manager is responsible for choosing the head, chain, and tail of the system.
- Servers and clients must obey the configuration manager, or stop operating, regardless of which nodes they think are alive or dead.
- A **configuration manager** is a common pattern for distributed systems.
- It is a core part of systems such as [[The Google File System]] and VMware-FT.
- Typically, Paxos, Raft, or ZooKeeper are used as the configuration service.

## Addressing Limitations
- One way to improve read throughput in CR is to split objects over multiple chains, where each server participates in multiple chains.
- This only works if the load is evenly distributed among the chains.
- However, load is often not evenly distributed, meaning that splitting objects over multiple chains may not be a viable solution.
- CRAQ offers another way to address the read throughput limitation of CR.

# Sources
- [MIT 6.824 - Lecture 9](https://www.youtube.com/watch?v=IXHzbCuADt0)
12 changes: 12 additions & 0 deletions Distributed Systems/Consistent Hashing.md
# Explanation
- *Consistent Hashing*: a hashing scheme in which
1. every node is assigned a location on some logical ring,
2. the hash of every partition key also maps to some location on the ring, and
3. the node closest to the key in the clockwise direction is responsible for that key.
- When a node is added or removed, keys are only moved between nodes adjacent to the new/removed node.
- A replication factor of $n$ means that each key is replicated at the $n$ closest nodes in the clockwise direction.

- Illustration: ![[Consistent Hashing.png]]
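
A small Python sketch of the ring logic above. The `HashRing` class, the use of MD5 as the hash function, and the node names are illustrative assumptions; real systems typically also place several virtual nodes per physical node to smooth out the load.

```python
import bisect
import hashlib

def ring_position(s: str) -> int:
    # Map a node name or partition key to a position on the logical ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, replication=1):
        self.replication = replication
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def nodes_for(self, key: str):
        # Walk clockwise from the key's position: the first node found owns
        # the key, and the next replication-1 nodes hold its replicas.
        positions = [pos for pos, _ in self.ring]
        start = bisect.bisect_right(positions, ring_position(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.replication)]

ring = HashRing(["node-A", "node-B", "node-C"], replication=2)
print(ring.nodes_for("user:42"))   # e.g. ['node-B', 'node-C']
```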

# Sources
- CMU 15-445 Lecture 21 - "Introduction to Distributed Databases"
Binary file added Distributed Systems/Consistent Hashing.png
# Explanation
- A distributed transaction accesses data at one or more partitions, which requires expensive coordination.

## Coordination Approaches
### Centralized coordinator
- The centralized coordinator acts as a global “traffic cop” that coordinates all the behavior.
- Illustration: ![[Centralized Coordinator.png]]

### Middleware
- Centralized coordinators can be used as middleware, which accepts query requests and routes queries to the correct partitions.

### Decentralized coordinator
- In a decentralized approach, nodes organize themselves.
- The client directly sends queries to one of the partitions.
- This home partition will send results back to the client.
- The home partition is in charge of communicating with other partitions and committing accordingly.

## Distributed Transactions
- *Distributed [[Concurrency Control#^dae2be|Transaction]]*: a transaction that accesses data on multiple nodes.
- Executing distributed transactions is more challenging than executing single-node transactions because, when the transaction commits, the DBMS has to make sure that all the nodes agree to commit it.
- The DBMS ensures that the database provides the same [[Concurrency Control#ACID|ACID]] guarantees as a single-node DBMS even in the case of node failures, message delays, or message loss.
- One can assume that all nodes in a distributed DBMS are well-behaved and under the same administrative domain.
- In other words, given that there is not a node failure, a node which is told to commit a transaction will commit the transaction.
- If the other nodes in a distributed DBMS cannot be trusted, then the DBMS needs to use a Byzantine fault-tolerant protocol (e.g., blockchain) for transactions.

## Atomic Commit Protocols
- When a multi-node transaction finishes, the DBMS needs to ask all of the nodes involved whether it is safe to commit.
- Depending on the protocol, a majority of the nodes or all of the nodes may be needed to commit.
- Examples:
- [[Two-Phase Commit]] (Common)
- Three-Phase Commit (Uncommon)
- Paxos (Common)
- Raft (Common)
- ZAB (Zookeeper Atomic Broadcast protocol, Apache Zookeeper)
- Viewstamped Replication (first provably correct protocol)

# Sources
- CMU 15-445 Lecture 21 - "Introduction to Distributed Databases"
- CMU 15-445 Lecture 22 - "Distributed Transactional Database Systems"
# Explanation
- A DBMS’s system architecture specifies what shared resources are directly accessible to CPUs.
- It affects how CPUs coordinate with each other and where they retrieve and store objects in the database.
- Types: ![[Database System Architectures.png]]

## Shared everything
- A single-node DBMS uses what is called a **shared everything** architecture.
- This single node executes workers on its local CPU(s), with its own local memory address space and disk.

## Shared Memory
- An alternative to shared everything architecture in distributed systems is shared memory.
- CPUs have access to common memory address space via a fast interconnect. CPUs also share the same disk.
- In practice, most DBMSs do not use this architecture, since
- it is provided at the [[Operating Systems|OS]]/kernel level, and
- it causes problems, because every process shares the same memory address space, which can be modified by multiple processes.
- Each processor has a global view of all the in-memory data structures.
- Each DBMS instance on a processor has to “know” about the other instances.

## Shared Disk
- In a shared [[Disks|disk]] architecture, all CPUs can read and write to a single logical disk directly via an interconnect, but each have their own private memories.
- The local storage on each compute node can act as caches.
- This approach is more common in cloud-based DBMSs.
- The DBMS’s execution layer can scale independently from the storage layer.
- Adding new storage nodes or execution nodes does not affect the layout or location of data in the other layer.
- Nodes must send messages to each other to learn about other nodes’ current state.
- That is, since memory is local, if data is modified, the changes must be communicated to other CPUs in case that piece of data also resides in their main memory.
- Nodes have their own buffer pool and are considered stateless.
- A node crash does not affect the state of the database since that is stored separately on the shared disk.
- The storage layer persists the state in the case of crashes.

## Shared Nothing
- In a shared nothing environment, each node has its own CPU, memory, and disk.
- Nodes only communicate with each other via network.
- Before the rise of cloud storage platforms, the shared nothing architecture used to be considered the correct way to build distributed DBMSs.
- It is more difficult to increase capacity in this architecture because the DBMS has to physically move data to new nodes.
- It is also difficult to ensure consistency across all nodes in the DBMS, since the nodes must coordinate with each other on the state of transactions.
- The advantage is that shared nothing DBMSs can potentially achieve better performance and are more efficient than other types of distributed DBMS architectures.

# Sources
- CMU 15-445 Lecture 21 - "Introduction to Distributed Databases"
# Explanation
- Distributed database systems must partition the database across multiple resources, including disks, nodes, and processors.
- This process is sometimes called sharding in NoSQL systems.
- When the DBMS receives a query,
1. it first analyzes the data that the query plan needs to access.
2. The DBMS may potentially send fragments of the query plan to different nodes,
3. then combines the results to produce a single answer.
- The goal of a partitioning scheme is to maximize single-node transactions, or transactions that only access data contained on one partition.
- This allows the DBMS to not need to coordinate the behavior of concurrent transactions running on other nodes.
- A distributed transaction accessing data at one or more partitions will require expensive, difficult coordination.

- Types:
- *Logical Partitioning*: a type of partitioning where particular nodes are in charge of accessing specific tuples from a shared disk.
- *Physical Partitioning*: a type of partitioning where each shared-nothing node reads and updates tuples it contains on its own local disk.

## Naive Table Partitioning
- The simplest way to partition tables is naive data partitioning where each node stores one table, assuming enough storage space for a given node.
- This is easy to implement because a query is just routed to a specific partition.
- This does not scale well: one partition’s resources can be exhausted if that one table is queried often, leaving the other nodes idle.
- Illustration: ![[Naive Table Partitioning.png]]

## Vertical Partitioning
- *Vertical Partitioning*: a partitioning scheme which splits a table’s attributes into separate partitions.
- Each partition must also store tuple information for reconstructing the original record.

## Horizontal Partitioning
- *Horizontal Partitioning*: a partitioning scheme which splits a table’s tuples into disjoint subsets that are equal in terms of size, load, or usage, based on some partitioning key(s).

## Hash Partitioning
- The DBMS can partition a database physically (shared nothing) or logically (shared disk) via hash partitioning or range partitioning.
- The problem with hash partitioning is that when a new node is added or removed, a lot of data needs to be shuffled around.
- The solution for this is [[Consistent Hashing]].
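
The sketch below (not from the lecture) makes the reshuffling problem concrete: with naive `hash(key) % num_nodes` partitioning, growing the cluster from 4 to 5 nodes changes the home partition of most keys, whereas consistent hashing would only move the keys adjacent to the new node on the ring.

```python
keys = [f"key-{i}" for i in range(10_000)]

def partition(key: str, num_nodes: int) -> int:
    return hash(key) % num_nodes          # naive hash partitioning

# Count keys whose partition changes when one node is added.
moved = sum(partition(k, 4) != partition(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys move when going from 4 to 5 nodes")
# Typically around 80% of keys change partition; consistent hashing would
# move only about 1/5 of them.
```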

# Sources
- CMU 15-445 Lecture 21 - "Introduction to Distributed Databases"
23 changes: 23 additions & 0 deletions Distributed Systems/Distributed Databases/Distributed Databases.md
# Explanation
- *Distributed DBMS*: a [[Database Systems#^1ec6f9|DBMS]] that divides a single logical database across multiple physical resources.
- The application is (usually) unaware that data is split across separated hardware.
- The system relies on the techniques and algorithms from single-node DBMSs to support transaction processing and query execution in a distributed environment.
- An important goal in designing a distributed DBMS is fault tolerance (i.e., avoiding a single node failure taking down the entire system).

## Parallel vs Distributed Databases
- **Parallel Database**:
- Nodes are physically close to each other.
- Nodes are connected via high-speed LAN (fast, reliable communication fabric).
- The communication cost between nodes is assumed to be small. As such, one does not need to worry about nodes crashing or packets getting dropped when designing internal protocols.
- **Distributed Database**:
- Nodes can be far from each other.
- Nodes are potentially connected via a public network, which can be slow and unreliable.
- The communication cost and connection problems cannot be ignored (i.e., nodes can crash, and packets can get dropped).

# Topics
- [[Distributed Database Architecture]]
- [[Distributed Database Partitioning]]
- [[Distributed Concurrency Control]]

# Sources
- CMU 15-445 Lecture 21 - "Introduction to Distributed Databases"
34 changes: 34 additions & 0 deletions Distributed Systems/Distributed Databases/Two-Phase Commit.md
# Explanation

## Mechanism

### Phase 1: Prepare
1. The client sends a Commit Request to the [[Distributed Concurrency Control#Coordination Approaches|coordinator]].
2. In the first phase of this protocol, the coordinator sends a Prepare message, essentially asking the participant nodes if the current transaction is allowed to commit.
3. If a given participant verifies that the given transaction is valid, they send an OK to the coordinator.
4. If the coordinator receives an OK from all the participants, the system can now go into the second phase in the protocol.
5. If anyone sends an Abort to the coordinator, the coordinator sends an Abort to the client.

### Phase 2: Commit
1. If all the participants sent an OK, the coordinator sends a Commit to all the participants, telling those nodes to commit the transaction.
2. Once the participants respond with an OK, the coordinator can tell the client that the transaction is committed.
3. If the transaction was aborted in the first phase, the participants receive an Abort from the coordinator, to which they should respond with an OK.

- **Notes**:
- Either everyone commits or no one does.
- The coordinator can also be a participant in the system.
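
A minimal Python sketch of the message flow above; the `Coordinator` and `Participant` classes are hypothetical, and a real implementation would also write each decision to a non-volatile log before sending it (see Crash Recovery below).

```python
class Participant:
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit = name, will_commit
        self.state = "idle"

    def prepare(self, txn) -> bool:
        # Phase 1: vote OK only if the transaction can commit locally.
        self.state = "prepared" if self.will_commit else "aborted"
        return self.will_commit

    def commit(self, txn):
        self.state = "committed"

    def abort(self, txn):
        self.state = "aborted"


class Coordinator:
    def __init__(self, participants):
        self.participants = participants

    def run(self, txn) -> bool:
        # Phase 1 (Prepare): ask every participant whether txn may commit.
        votes = [p.prepare(txn) for p in self.participants]
        if all(votes):
            # Phase 2 (Commit): everyone voted OK, so tell everyone to commit.
            for p in self.participants:
                p.commit(txn)
            return True                     # acknowledge success to the client
        # Any Abort vote (or a timeout) aborts the transaction everywhere.
        for p in self.participants:
            p.abort(txn)
        return False


nodes = [Participant("A"), Participant("B"), Participant("C", will_commit=False)]
print(Coordinator(nodes).run("txn-1"))      # False: C voted to abort
```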

### Crash Recovery
- In the case of a crash, all nodes keep track of a non-volatile log of the outcome of each phase.
- Nodes block until they can figure out the next course of action.
- If the coordinator crashes, the participants must decide what to do.
- A safe option is just to abort.
- Alternatively, the nodes can communicate with each other to see if they can commit without the explicit permission of the coordinator.
- If a participant crashes, the coordinator assumes that it responded with an abort if it has not sent an acknowledgement yet.

## Optimizations
- **Early Prepare Voting**: If the DBMS sends a query to a remote node that it knows will be the last one executed there, then that node will also return their vote for the prepare phase with the query result.
- **Early Acknowledgement after Prepare**: If all nodes vote to commit a transaction, the coordinator can send the client an acknowledgement that their transaction was successful before the commit phase finishes.

# Sources
- CMU 15-445 Lecture 22 - "Distributed Transactional Database Systems"