6 months, 3 weeks

Multi-Version Concurrency Control

Multi-version concurrency control (MVCC) is currently the most popular transaction management scheme in modern database management systems (DBMSs). Although first proposed in 1978 MIT Ph.D. dissertation, it is used in almost every major relational DBMS released in the last decade. Maintaining multiple versions of data potentially increases parallelism without sacrificing serializability when processing transactions. But scaling MVCC in a multi-core and in-memory setting is non-trivial: when there are a large number of threads running in parallel, the synchronization overhead can outweigh the benefits of multi-versioning.

Transaction isolation is one of the most fundamental features offered by a database management system (DBMS). It provides the user with the illusion of being alone in the database system, even in the presence of multiple concurrent users, which greatly simplifies application development. In the background, the DBMS ensures that the resulting concurrent access patterns are safe, ideally by being serializable.

Serializability is a great concept, but it is hard to implement efficiently. A classical way to ensure serializability is to rely on a variant of Two-Phase Locking (2PL). Using 2PL, the DBMS maintains read and write locks to ensure that conflicting transactions are executed in a well-defined order, which results in serializable execution schedules. Locking, however, has several major disadvantages: First, reader and writer block each other. Second, most transactions are read-only and therefore harmless from a transaction-ordering perspective. Using a locking-based isolation mechanism, no update transaction is allowed to change a data object that has been read by a potentially long-running read transaction and thus has to wait until the read transaction finishes. This severely limits the degree of concurrency in the system. 

Multi-Version Concurrency Control (MVCC)  offers an elegant solution to this problem. Instead of updating data objects in-place, each update creates a new version of that data object, such that concurrent readers can still see the old version while the update transaction proceeds concurrently. As a consequence, read-only transactions never have to wait, and in fact, do not have to use locking at all. This is an extremely desirable property and the reason why many DBMSs implement MVCC, e.g., Oracle, Microsoft SQL Server, SAP HANA, and PostgreSQL. However, most systems that use MVCC do not guarantee serializability, but the weaker isolation level Snapshot Isolation (SI). Under SI, every transaction sees the database in a certain state (typically the last committed state at the beginning of the transaction) and the DBMS ensures that two concurrent transactions do not update the same data object. Although SI offers fairly good isolation, some non-serializable schedules are still allowed. This is often reluctantly accepted because making SI serializable tends to be prohibitively expensive. In particular, the known solutions require keeping track of the entire read set of every transaction, which creates a huge overhead for read-heavy (e.g., analytical) workloads. Still, it is desirable to detect serializability conflicts as they can lead to silent data corruption, which in turn can cause hard-to-detect bugs.

Main Benefits                                                                                           
      • Writes do not block readers.                                                                                        
      • Read-only transactions can read a consistent snapshot without acquiring locks.
      • Easily support time-travel queries.                                                                         

MVCC is more than just a “concurrency control protocol”. It completely affects how the DBMS manages transactions and the database. There are five key design decisions: 
• Concurrency Control Protocol                                                                                      
• Version Storage                                                                           
• Garbage Collection                                                                          
• Index Management                                                                                                                          
• Transaction Id Wraparound

Concurrency Control Protocol                                                                                   

The DBMS is able to use all of the same concurrency control protocols under MVCC.

Timestamp Ordering (MV-TO):                                                                    

• Use a read-ts field in the header to keep track of the timestamp of the last transaction that read it.

• Transaction is allowed to read version if the lock is unset and its Tid is between begin-ts and end-ts.

• For writes, the transaction creates a new version if no other transaction holds lock and Tid is greater than read-ts.

Version Storage                                                                                 
The DBMS uses the tuple’s pointer field to create a latch-free version chain per logical tuple. This allows the DBMS to find the version that is visible to a particular transaction at runtime. Indexes always point to  “head” of the chain.
Thread store versions in “local” memory regions to avoid contention on centralized data structures. Different storage schemes determine where/what to store for each version.
For non-inline attributes, the DBMS can reuse pointers to variable-length pool for values that do not change between versions. This requires reference counters to know when it is safe to free memory. This optimization also makes it more difficult to relocate memory from the variable-length pool.

Append-Only Storage                                                                                           
• All of the physical versions of a logical tuple are stored in the same table space (table heap).
• On update, append new tuple to same table heap in an empty slot.

• Oldest-to-Newest (O2N): Append new version to end of chain, traverse entire chain on lookup.
• Newest-to-Oldest (N2O): Have to update index pointers for every new version, but do not have to traverse chain on look ups.
Time-Travel Storage                                                                                 
• On every update, copy current version to the time-travel table, and update pointer.
• Overwrite master version in main table, and update pointer.
Delta Storage                                                                                      
• On every update, copy only the values that were modified into the delta storage and overwrite the master version.
• Transaction can recreate old versions by applying the delta in reverse order.


Garbage Collection                                              
The DBMS needs to remove reclaimable physical versions from the database over time. A version is reclaimable if (1) no active transaction in the DBMS can see that version or (2) the version was created by an aborted transaction.
Tuple Level                                                               
• Find old versions by examining tuples directly.
• Background Vacuuming: Separate threads periodically scan the table and look for reclaimable versions. Works with any version storage technique.
• Cooperative Cleaning: Worker threads identify reclaimable versions as they traverse version change. Only works with O2N version chains.
Transaction Level                                                                                        
• Transactions keep track of their old version so the DBMS does not have to scan tuples to determine visibility.
• The DBMS determines when all versions created by a finishing transaction are no longer visible.
• May still maintain multiple threads to reclaim the memory fast enough for the workload.

Index Management                                                                                       
All MVCC DBMSs keep the database’s versioning information separate from its indexes. That is, the existence of a key in an index means that some version exists with that key but the index entry does not contain information about which versions of the tuple match. We define an index entry as a key/value pair, where the key is a tuple’s indexed attribute(s) and the value is a pointer to that tuple. The DBMS follows this pointer to a tuple’s version chain and then scans the chain to locate the version that is visible for a transaction. The DBMS will never incur a false negative from an index, but it may get false positive matches because the index can point to a version for a key that may not be visible to a particular transaction.
Primary Key                                                                                                            
Primary key indexes always point to the version chain head. If a transaction updates a primary key attribute(s), then this is treated as a DELETE followed by an INSERT.
Secondary Indexes                                                                  
Approach #1: Logical Pointer                                                                 
• Use a fixed identifier per tuple that does not change
• Requires an extra indirection layer
• Secondary indexes can store Primary Key or Tuple ID
Approach #2: Physical Address                                                                       

• Pointer physical points to tuple

Transaction Id Wraparound                                         
If the DBMS reaches the max value for its timestamps, it will have to wrap around and start at zero. This will make all previous versions be in the “future” from new transactions.
Postgres Txn-ID Wraparound                                                                   
• Stop accepting new commands when the system gets close to the max transaction id.
• Set a flag in each tuple header that says that it is “frozen” in the past. Any new transaction id will always be newer than a frozen version.
• Runs the vacuum before the system gets close to this upper limit.