본문 바로가기

PostgreSQL

PostgreSQL Concurrency With MVCC

How MVCC works in PostgreSQL

Let's look deeper into how MVCC works in PostgreSQL to allow "no-locking." Each row in PostgreSQL has two transaction IDs: a creation transaction ID for the transaction that created the row, and an expiration transaction ID for the transaction that expired the row. When an UPDATE is performed, PostgreSQL creates a new row and expires the old row. It's the same row -- just different versions. PostgreSQL creates a new version of the row and also retains the old or expired version. Database systems that use row-level locking do not retain old versions of the data, hence the need for locks to maintain data consistency.

So now that you know how PostgreSQL creates versions of the data, you might be wondering how it knows which version to display.

It's quite simple. At the start of a query, PostgreSQL records two things: 1) the current transaction ID, and 2) all in-process transaction IDs. When a query is issued, PostgreSQL displays all the row versions that match the following criteria:

  • The row's creation transaction ID is a committed transaction and is less than the current transaction counter.
  • The row lacks an expiration transaction ID or its expiration transaction ID was in process at query start.

The power of MVCC is in keeping track of transaction IDs to determine the version of the data, and thereby avoid having to issue any locks. It's very logical and efficient. New users to PostgreSQL will be pleased with the performance improvements of MVCC over row-level locking, especially those running in a large multi-user environment.

Additionally, MVCC offers another advantage: hot backups. MVCC allows PostgreSQL to make a full database backup while the database is live. For many database systems, users need to shutdown the database or lock all tables to get a consistent snapshot. Not so with PostgreSQL. It simply takes a snapshot of the entire database at a point in time and dumps the output even while data is being inserted, updated or deleted.

In conclusion, MVCC ensures that readers never wait for writers and writers never wait for readers. If you don't believe that PostgreSQL's multi-version model is better than row-level locking, I challenge you to try out PostgreSQL for yourself.


PostgreSQL Concurrency With MVCC

One of the big selling points of Postgres is how it handles concurrency. The promise is simple: reads never block writes and vice versa. Postgres achieves this via a mechanism called Multi Version Concurrency Control. This technique is not unique to Postgres: there are several databases that implement some form of MVCC including Oracle, Berkeley DB, CouchDB and many more. Understanding how MVCC is implemented in Postgres is important when designing highly concurrent apps on PostgreSQL. It’s actually a very elegant and simple solution to a hard problem.

How MVCC works

Every transaction in postgres gets a transaction ID called XID. This includes single one statement transactions such as an insert, update or delete, as well as explicitely wrapping a group of statements together via BEGIN - COMMIT. When a transaction starts, Postgres increments an XID and assigns it to the current transaction. Postgres also stores transaction information on every row in the system, which is used to determine whether a row is visible to the transaction or not.

For example, when you insert a row, postgres will store the XID in the row and call it xmin. Every row that has been committed and has anxmin that is less than the current transaction’s XID is visible to the transaction. This means that you can start a transaction and insert a row, and until that transaction COMMITs that row will not be visible to other transactions. Once it commits and other transactions get created, they will be able to view the new row because they satisfy the xmin < XID condition – and the transaction that created the row has completed.

A similar mechanism occurs for DELETEs and UPDATEs, only in these cases Postgres stores an xmax value on each row in order to determine visibility. This diagram shows two concurrent transactions inserting and reading rows, and how MVCC comes into play in terms of transaction isolation.

For the following charts, assume the following DDL:

CREATE TABLE numbers (value int);

MVCC read/write isolation

While the xmin and xmax values are hidden from daily operations, you can actually just ask for them and Postgres will hapilly give them to you:

SELECT *, xmin, xmax FROM numbers;

You can also get the XID for the current transaction pretty easily:

SELECT txid_current();

Neat!

I know what you’re thinking though: what about a two transactions updating the same row at the same time? This is where transaction isolation levels come in. Postgres basically supports two models that allow you to control how this situation should be handled. The default,READ COMMITTED, reads the row after the inital transaction has completed and then executes the statement. It basically starts over if the row changed while it was waiting. For instance, if you issue anUPDATE with a WHERE clause, the WHERE clause will rerun after the initial transaction commits, and the UPDATE takes place if the WHEREclause is still satisfied. Here’s an example of two transactions modifying a row where the initial UPDATE causes the WHERE clause of the second transaction to return no rows. Therefore the second transaction does not update any rows at all:

READ COMMITTED transaction isolation

If you need finer control over this behavior, you can set the transaction isolation level to SERIALIZABLE. With this strategy the above scenario will just fail, because it says “If the row I’m modifying has been modified by another transaction, don’t even try,” and Postgres will respond with the error message ERROR: could not serialize access due to concurrent update. It’s up to your app to handle that error and try again, or to give up if that’s what makes sense.

SERIALIZABLE transaction isolation

Disadvantages of MVCC

Now that you know how MVCC and transaction isolation actualy works, you’ve added another tool for solving the kinds of problems where aSERIALIZABLE isolation level comes in handy. While the advantages of MVCC are clear it also has some disadvantages.

Because different transactions will have visibility to a different set of rows, Postgres needs to maintain potentially obsolete records. This is why an UPDATE actually creates a new row and why DELETE doesn’treally remove the row: it merely marks it as deleted and sets the XID values appropriately. As transactions complete, there will be rows in the database that cannot possibly be visible to any future transactions. These are called dead rows. Another problem that comes from MVCC is that transaction IDs can only ever grow so much – they are 32 bits and can “only” support around 4 billion transactions. When the XID reaches its max, it will wraparound and start back at zero. Suddenly allrows appear to be in future transactions, and no new transactions would have visibility into those rows.

Both dead rows and the transaction XID wraparound problem are solved with VACUUM. This should be routine maintenance, but thankfuly Postgres comes with an auto_vacuum daemon that will run at a configurable frequency. It’s important to keep an eye on this because different deployments will have different needs when it comes to vacuum frequency. You can read more about what VACUUM actually does on the Postgres docs.

참조 ) https://devcenter.heroku.com/articles/postgresql-concurrency