MVCC and transactional design shootout :gun: :bomb: :explosionbomb: :extrabigexplosion:

Time for a literature shootout to determine the initial architecture for sled 0.16's transactions and snapshots! If anyone has particular insights or opinions on these, please jump in here!

There are a number of separate concerns when MVCC is mentioned, that we should be careful not to conflate (taken from the paper "An Empirical Evaluation of In-Memory Multi-Version Concurrency Control" below):
* concurrency control protocols - this is the primary thing under consideration in this issue!
* version storage
    * append-only storage - all versions exist in the same storage space. lock free linked list points either from oldest to newest (O2N) which suffers from cache pollution due to having to scan through old versions always, and newest to oldest (N2O) which requires an indirection table (which we have anyway). off-page storage of BLOBs can be complicated, may need reference counts.
    * time-travel storage - other versions are stored in a separate table. some systems store the newest version in the master table, others store the oldest there to facilitate easy SI (SAP HANA does this). For our trade-offs, we may be fine keeping old versions in a separate, memory-only space after a transaction completes and invalidates them, and then we could use an epoch system to handle GC. This does not scale for long snapshots, but this is acceptable for the chosen tradeoffs of this system. time-travel storage can suffer from the same issues with off-page storage as append-only storage.
    * delta-storage - used by mysql, oracle, and HyPer. just store the main version in storage, and then store deltas as later versions. can incur read performance hits.
    * general considerations: balancing write contention and cache pollution are the major trade-offs here
* GC
    * vaccuum (doesn't scale but is simple)
    * cooperative cleaning - when a page is read if we notice it has stale versions, tell a global GC mechanism about it, vulnerable to "dusty corners" so hekaton does this but also occasionally vacuums. only compatible with append-only storage (so we're good). requires that the mechanism that potentially cleans up data reads the oldest entries. we can implement this by having relocation calls fully read chains, while reads only access the new part of chains that they require to keep cache pollution down. right now sled reads the entire page always, which pollutes cache. this is something to flag for later.
    * transaction-level gc, track read+write sets of each transaction per epoch, when epochs end, we can clear all state generated by transactions in that epoch. simple but requires tracking the epoch mapping
* index management of secondary indices (2i). we are **NOT** going to be implementing 2i in 0.16, but we should be mindful not to make the path toward implementing them in the future needlessly complex
    * logical pointer to primary key - grows in size with primary key length
    * logical pointer to tuple ID - use a separate lock-free lookup table, just store 64 bit tuple ID
    * physical pointer to storage. only usable with append-only storage, so we're good. used by hekaton, memsql. when we update the current version in storage, we update all 2i's to point to the new physical location.

The ideas in silo and tictoc particularly appeal to me due to the focus on reducing scalability barriers at high core counts, where we'll be headed in the coming years. A concern that will have a major impact on the implementation is how snapshots are handled. Sled is optimized for **point reads** and **short-medium scans**. It is acceptable for long scans to pay a higher GC penalty, especially if it simplifies implementation complexity. We value **reliable worst case latency** far more than crazy p0th-p50.

[High-Performance Concurrency Control Mechanisms for Main-Memory Databases (2012)](http://vldb.org/pvldb/vol5/p298_per-akelarson_vldb2012.pdf)
* Hekaton MVCC paper. They went with the optimistic choice here.

[Speedy Transactions in Multicore In-Memory Databases (2013)](http://db.csail.mit.edu/pubs/silo.pdf)
(silo)
* interesting techniques for supporting many-core, decentralized systems
* epoch-based
```
it avoids all centralized contention points, including that of
centralized transaction ID assignment. Silo’s key contri-
bution is a commit protocol based on optimistic concur-
rency  control  that  provides  serializability  while  avoid-
ing all shared-memory writes for records that were only
read.  Though  this  might  seem  to  complicate  the  en-
forcement of a serial order, correct logging and recov-
ery is provided by linking periodically-updated epochs
with the commit protocol. Silo provides the same guar-
antees as any serializable database without unnecessary
scalability bottlenecks or much additional latency. 
```

[Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems (2015)](http://15721.courses.cs.cmu.edu/spring2016/papers/p677-neumann.pdf)
* keep old versions in undo buffers
```
We present a novel MVCC implementation for main-mem-
ory database systems that has very little overhead compared
to serial execution with single-version concurrency control,
even when maintaining serializability guarantees.  Updating
data in-place and storing versions as before-image deltas in
undo buffers not only allows us to retain the high scan per-
formance  of  single-version  systems  but  also  forms  the  ba-
sis  of  our  cheap  and  fine-grained  serializability  validation
mechanism.   The  novel  idea  is  based  on  an  adaptation  of
precision  locking  and  verifies  that  the  (extensional)  writes
of recently committed transactions do not intersect with the
(intensional) read predicate space of a committing transac-
tion.  We experimentally show that our MVCC model allows
very  fast processing of  transactions with point accesses
as well as read-heavy transactions and that there is little need
to prefer SI over full serializability any longer.
```

[High performance transactions via early write visibility (2017)](http://www.vldb.org/pvldb/vol10/p613-faleiro.pdf)
* restrict transaction abort conditions to avoid delayed write visibility
```delayed write visibility stems from the fact that database systems can arbitrarily abort transactions at any point during their execution. Accordingly, we make the case for database systems which only abort transactions under a restricted set of conditions, thereby enabling a new recoverability mechanism, early write visibility, which safely makes transactions' writes visible prior to the end of their execution. We design a new serializable concurrency control protocol, piece-wise visibility (PWV), with the explicit goal of enabling early write visibility. We evaluate PWV against state-of-the-art serializable protocols and a highly optimized implementation of read committed, and find that PWV can outperform serializable protocols by an order of magnitude and read committed by 3X on high contention workloads.```

[Efficiently making (almost) any concurrency control mechanism serializable (2017)](https://arxiv.org/pdf/1605.04292.pdf)
* certifier approach
```
We propose the serial safety net (SSN), a
serializability-enforcing certifier which can be applied on top of var-
ious CC schemes that offer higher performance but admit anomalies,
such as snapshot isolation and read committed. The underlying CC
mechanism retains control of scheduling and transactional accesses,
while SSN tracks the resulting dependencies. At commit time, SSN
performs a validation test by examining only direct dependencies
of the committing transaction to determine whether it can commit
safely or must abort to avoid a potential dependency cycle.
```

[An Empirical Evaluation of In-Memory Multi-Version Concurrency Control (2017)](https://www.cs.cmu.edu/~jarulraj/papers/2017.mvcc.vldb.pdf)
* a survey of trade-offs, mentions a few of the previous contestants here
```
we conduct an extensive study of the
scheme’s four key design decisions: concurrency control protocol,
version storage, garbage collection, and index management.  We
implemented state-of-the-art variants of all of these in an in-memory
DBMS and evaluated them using OLTP workloads.  Our analysis
identifies the fundamental bottlenecks of each design choice
```
[Transaction Healing: Scaling Optimistic Concurrency Control on Multicores (2016)](http://www.comp.nus.edu.sg/~yingjun/papers/sigmod2016.pdf)
```
we propose a new concurrency-control mechanism, 
called transaction healing, that exploits program
 semantics to scale the conventional OCC towards 
dozens of cores even under highly contended workloads.
 Transaction healing captures the dependencies across 
operations within a transaction prior to its execution. 
Instead of blindly rejecting a transaction once its validation 
fails, the proposed mechanism judiciously restores any
non-serializable operation and heals inconsistent transaction 
states as well as query results according to the extracted 
dependencies. Transaction healing can partially update the 
membership of read/write sets when processing dependent 
transactions. Such overhead, however, is largely reduced by 
carefully avoiding false aborts and rearranging validation 
orders. We implemented the idea of transaction healing in 
TheDB, a main-memory database prototype that provides 
full ACID guarantee with a scalable commit protocol. By 
evaluating TheDB on a 48-core machine with two widely-used 
benchmarks, we confirm that transaction healing can scale 
near-linearly, yielding significantly higher transaction rate than 
the state-of-the-art OCC implementations.
```

[TicToc: Time Traveling Optimistic Concurrency Control (2016)](http://people.csail.mit.edu/yxy/pubs/tictoc.pdf)
* avoid centralized timestamp ordering

```
Previous research has shown that timestamp management is the key
scalability bottleneck in concurrency control algorithms. This pre-
vents the system from scaling to large numbers of cores.
In this paper we present TicToc, a new optimistic concurrency
control algorithm that avoids the scalability and concurrency bot-
tlenecks of prior T/O schemes. TicToc relies on a novel and prov-
ably correct data-driven timestamp management protocol. Instead
of assigning timestamps to transactions, this protocol assigns read
and write timestamps to data items and uses them to lazily com-
pute a valid commit timestamp for each transaction. TicToc re-
moves the need for centralized timestamp allocation, and commits
transactions that would be aborted by conventional T/O schemes.
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MVCC and transactional design shootout 🔫 💣 :explosionbomb: :extrabigexplosion: #232

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

MVCC and transactional design shootout 🔫 💣 :explosionbomb: :extrabigexplosion: #232

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions