Time for a literature shootout to determine the initial architecture for sled 0.16's transactions and snapshots! If anyone has particular insights or opinions on these, please jump in here!
There are a number of separate concerns when MVCC is mentioned, that we should be careful not to conflate (taken from the paper "An Empirical Evaluation of In-Memory Multi-Version Concurrency Control" below):
- concurrency control protocols - this is the primary thing under consideration in this issue!
- version storage
- append-only storage - all versions exist in the same storage space. lock free linked list points either from oldest to newest (O2N) which suffers from cache pollution due to having to scan through old versions always, and newest to oldest (N2O) which requires an indirection table (which we have anyway). off-page storage of BLOBs can be complicated, may need reference counts.
- time-travel storage - other versions are stored in a separate table. some systems store the newest version in the master table, others store the oldest there to facilitate easy SI (SAP HANA does this). For our trade-offs, we may be fine keeping old versions in a separate, memory-only space after a transaction completes and invalidates them, and then we could use an epoch system to handle GC. This does not scale for long snapshots, but this is acceptable for the chosen tradeoffs of this system. time-travel storage can suffer from the same issues with off-page storage as append-only storage.
- delta-storage - used by mysql, oracle, and HyPer. just store the main version in storage, and then store deltas as later versions. can incur read performance hits.
- general considerations: balancing write contention and cache pollution are the major trade-offs here
- GC
- vaccuum (doesn't scale but is simple)
- cooperative cleaning - when a page is read if we notice it has stale versions, tell a global GC mechanism about it, vulnerable to "dusty corners" so hekaton does this but also occasionally vacuums. only compatible with append-only storage (so we're good). requires that the mechanism that potentially cleans up data reads the oldest entries. we can implement this by having relocation calls fully read chains, while reads only access the new part of chains that they require to keep cache pollution down. right now sled reads the entire page always, which pollutes cache. this is something to flag for later.
- transaction-level gc, track read+write sets of each transaction per epoch, when epochs end, we can clear all state generated by transactions in that epoch. simple but requires tracking the epoch mapping
- index management of secondary indices (2i). we are NOT going to be implementing 2i in 0.16, but we should be mindful not to make the path toward implementing them in the future needlessly complex
- logical pointer to primary key - grows in size with primary key length
- logical pointer to tuple ID - use a separate lock-free lookup table, just store 64 bit tuple ID
- physical pointer to storage. only usable with append-only storage, so we're good. used by hekaton, memsql. when we update the current version in storage, we update all 2i's to point to the new physical location.
The ideas in silo and tictoc particularly appeal to me due to the focus on reducing scalability barriers at high core counts, where we'll be headed in the coming years. A concern that will have a major impact on the implementation is how snapshots are handled. Sled is optimized for point reads and short-medium scans. It is acceptable for long scans to pay a higher GC penalty, especially if it simplifies implementation complexity. We value reliable worst case latency far more than crazy p0th-p50.
High-Performance Concurrency Control Mechanisms for Main-Memory Databases (2012)
- Hekaton MVCC paper. They went with the optimistic choice here.
Speedy Transactions in Multicore In-Memory Databases (2013)
(silo)
- interesting techniques for supporting many-core, decentralized systems
- epoch-based
it avoids all centralized contention points, including that of
centralized transaction ID assignment. Silo’s key contri-
bution is a commit protocol based on optimistic concur-
rency control that provides serializability while avoid-
ing all shared-memory writes for records that were only
read. Though this might seem to complicate the en-
forcement of a serial order, correct logging and recov-
ery is provided by linking periodically-updated epochs
with the commit protocol. Silo provides the same guar-
antees as any serializable database without unnecessary
scalability bottlenecks or much additional latency.
Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems (2015)
- keep old versions in undo buffers
We present a novel MVCC implementation for main-mem-
ory database systems that has very little overhead compared
to serial execution with single-version concurrency control,
even when maintaining serializability guarantees. Updating
data in-place and storing versions as before-image deltas in
undo buffers not only allows us to retain the high scan per-
formance of single-version systems but also forms the ba-
sis of our cheap and fine-grained serializability validation
mechanism. The novel idea is based on an adaptation of
precision locking and verifies that the (extensional) writes
of recently committed transactions do not intersect with the
(intensional) read predicate space of a committing transac-
tion. We experimentally show that our MVCC model allows
very fast processing of transactions with point accesses
as well as read-heavy transactions and that there is little need
to prefer SI over full serializability any longer.
High performance transactions via early write visibility (2017)
- restrict transaction abort conditions to avoid delayed write visibility
delayed write visibility stems from the fact that database systems can arbitrarily abort transactions at any point during their execution. Accordingly, we make the case for database systems which only abort transactions under a restricted set of conditions, thereby enabling a new recoverability mechanism, early write visibility, which safely makes transactions' writes visible prior to the end of their execution. We design a new serializable concurrency control protocol, piece-wise visibility (PWV), with the explicit goal of enabling early write visibility. We evaluate PWV against state-of-the-art serializable protocols and a highly optimized implementation of read committed, and find that PWV can outperform serializable protocols by an order of magnitude and read committed by 3X on high contention workloads.
Efficiently making (almost) any concurrency control mechanism serializable (2017)
We propose the serial safety net (SSN), a
serializability-enforcing certifier which can be applied on top of var-
ious CC schemes that offer higher performance but admit anomalies,
such as snapshot isolation and read committed. The underlying CC
mechanism retains control of scheduling and transactional accesses,
while SSN tracks the resulting dependencies. At commit time, SSN
performs a validation test by examining only direct dependencies
of the committing transaction to determine whether it can commit
safely or must abort to avoid a potential dependency cycle.
An Empirical Evaluation of In-Memory Multi-Version Concurrency Control (2017)
- a survey of trade-offs, mentions a few of the previous contestants here
we conduct an extensive study of the
scheme’s four key design decisions: concurrency control protocol,
version storage, garbage collection, and index management. We
implemented state-of-the-art variants of all of these in an in-memory
DBMS and evaluated them using OLTP workloads. Our analysis
identifies the fundamental bottlenecks of each design choice
Transaction Healing: Scaling Optimistic Concurrency Control on Multicores (2016)
we propose a new concurrency-control mechanism,
called transaction healing, that exploits program
semantics to scale the conventional OCC towards
dozens of cores even under highly contended workloads.
Transaction healing captures the dependencies across
operations within a transaction prior to its execution.
Instead of blindly rejecting a transaction once its validation
fails, the proposed mechanism judiciously restores any
non-serializable operation and heals inconsistent transaction
states as well as query results according to the extracted
dependencies. Transaction healing can partially update the
membership of read/write sets when processing dependent
transactions. Such overhead, however, is largely reduced by
carefully avoiding false aborts and rearranging validation
orders. We implemented the idea of transaction healing in
TheDB, a main-memory database prototype that provides
full ACID guarantee with a scalable commit protocol. By
evaluating TheDB on a 48-core machine with two widely-used
benchmarks, we confirm that transaction healing can scale
near-linearly, yielding significantly higher transaction rate than
the state-of-the-art OCC implementations.
TicToc: Time Traveling Optimistic Concurrency Control (2016)
- avoid centralized timestamp ordering
Previous research has shown that timestamp management is the key
scalability bottleneck in concurrency control algorithms. This pre-
vents the system from scaling to large numbers of cores.
In this paper we present TicToc, a new optimistic concurrency
control algorithm that avoids the scalability and concurrency bot-
tlenecks of prior T/O schemes. TicToc relies on a novel and prov-
ably correct data-driven timestamp management protocol. Instead
of assigning timestamps to transactions, this protocol assigns read
and write timestamps to data items and uses them to lazily com-
pute a valid commit timestamp for each transaction. TicToc re-
moves the need for centralized timestamp allocation, and commits
transactions that would be aborted by conventional T/O schemes.
Time for a literature shootout to determine the initial architecture for sled 0.16's transactions and snapshots! If anyone has particular insights or opinions on these, please jump in here!
There are a number of separate concerns when MVCC is mentioned, that we should be careful not to conflate (taken from the paper "An Empirical Evaluation of In-Memory Multi-Version Concurrency Control" below):
The ideas in silo and tictoc particularly appeal to me due to the focus on reducing scalability barriers at high core counts, where we'll be headed in the coming years. A concern that will have a major impact on the implementation is how snapshots are handled. Sled is optimized for point reads and short-medium scans. It is acceptable for long scans to pay a higher GC penalty, especially if it simplifies implementation complexity. We value reliable worst case latency far more than crazy p0th-p50.
High-Performance Concurrency Control Mechanisms for Main-Memory Databases (2012)
Speedy Transactions in Multicore In-Memory Databases (2013)
(silo)
Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems (2015)
High performance transactions via early write visibility (2017)
delayed write visibility stems from the fact that database systems can arbitrarily abort transactions at any point during their execution. Accordingly, we make the case for database systems which only abort transactions under a restricted set of conditions, thereby enabling a new recoverability mechanism, early write visibility, which safely makes transactions' writes visible prior to the end of their execution. We design a new serializable concurrency control protocol, piece-wise visibility (PWV), with the explicit goal of enabling early write visibility. We evaluate PWV against state-of-the-art serializable protocols and a highly optimized implementation of read committed, and find that PWV can outperform serializable protocols by an order of magnitude and read committed by 3X on high contention workloads.Efficiently making (almost) any concurrency control mechanism serializable (2017)
An Empirical Evaluation of In-Memory Multi-Version Concurrency Control (2017)
Transaction Healing: Scaling Optimistic Concurrency Control on Multicores (2016)
TicToc: Time Traveling Optimistic Concurrency Control (2016)