Course Content
ACID Properties and Query Processing
0/2
PostgreSQL Replication
0/1
PostgreSQL Upgrade
0/1
PostgreSQL Tutorial for Absolute Beginners [Administration]
About Lesson

Write Ahead Logging — WAL

In the field of computer science, WAL is an acronym of Write Ahead Logging, which is a protocol or a rule to write both changes and actions into a transaction log, whereas in PostgreSQL, WAL is an acronym of Write Ahead Log. There the term is used as synonym of transaction log, and also used to refer to an implemented mechanism related to writing action to a transaction log (WAL).

PostgreSQL : Write A Head Logs

Reliability is one of the important aspects of database management system. PostgreSQL allows reliability using WAL Files.

WAL is a Write – ahead log file which stores the transactions log information in the WAL files which helps us with durability.

Whenever a change is occurred in the database, the changed transaction is written in the wal buffers and upon subsequent commit, an operating systems wal_sync_method call is  made and the contents in the wal buffer is written to the wal file.

What is WALWritelock

  • This lock is used to write transaction log buffer data to WAL file.

After taking this lock all the transaction log buffer data upto predecided

point will get flushed.

  • Places where it get used are

– Flush of transaction log which can be due to Commit, Flush of data buffers, truncate of commit log etc.

– During Switch of Xlog.

– During get of new Xlog buffer, if all buffers are already occupied and not flushed.

  • Get the time of the last xlog segment switch

The WAL File is stored under PG_DATA/pg_wal

The first file that is created is 000000010000000000000001.

here,

00000001 is physical

00000000 is logical

00000001 is timelineid

If the first one has been filled up with the writing of XLOG records, the second one 000000010000000000000002 would be provided.

Files of successor is used in ascending order in succession, after 0000000100000000000000FF has been filled up, next one 000000010000000100000000 will be provided. In this way, whenever the last 2-digit carries over, the middle 8-digit number increases one.

Example.,

A logical xlog file has 255 segments, physical is 1 segment, i.e., after 000000FF log switches, the logical value would be incremented by 1.

testdb=# select pg_switch_xlog();
pg_switch_xlog
----------------
1/8B000078
(1 row)

Here, 1 is logical segment and 8B000078 is offset inside logical segment.

Replication:

postgres=# select pg_last_xlog_replay_location();
pg_last_xlog_replay_location
------------------------------
74B/E4D1C628
(1 row)

In the master:

postgres=# select pg_current_xlog_location();
pg_current_xlog_location
--------------------------
74B/E4D3B070
(1 row)

The logical files here are same, so E4D3B070 – E4D1C628 = 1EA48 (125512 bytes) i.e., replica is roughly 123 kB behind the master

Asynchronous Commit:

Asynchronous commit is an option that allows transactions to complete more quickly, at the cost that the most recent transactions may be lost if the database should crash. In many applications this is an acceptable trade-off. Data loss can be seen in Asynchronous Commit.

commit_delay:

Commit_delay also sounds very similar to asynchronous commit, but it is actually a synchronous commit method. commit_delay causes a delay just before a transaction flushes WAL to disk,The setting can be thought of as a way of increasing the time window in which transactions can join a group about to participate in a single flush, to amortize the cost of the flush among multiple transactions.

Synchronous Commit:

Specifies whether transaction commit will wait for WAL records to be written to disk before the command returns a “success” indication to the client.

Asynchronous commit provides behavior different from setting fsync = off. fsync is a server-wide setting that will alter the behavior of all transactions where as Asynchronous commit can be set at session level with SET command.

What happens when fsync() and Synchronous Commit is set OFF?

When fsync is OFF, the PostgreSQL server will never issue any fsync system call, leaving entirely to the operating system the decision as to which blocks to transfer from write cache to disk, and when.

When synchronous_commit is OFF, the server issues fsync call, but not necessarily immediately at commit time for each transaction. It may delay them after the commit, for a maximum of wal_writer_delay multiplied by 3. With a default configuration that would be 600ms.

WAL Configuration:

In general, the transactions are first written to wal buffers and then to the WAL file, these wal files are needed for recovery purpose on a system crash.

The data is the shared buffers is then written to permanent storage at a later point of time especially during CHECKPOINT.

At checkpoint time, all dirty data pages are flushed to disk and a special checkpoint record is written to the log file.

In the event of a crash, the crash recovery procedure looks at the latest checkpoint record to determine the point in the log (known as the redo record) from which it should start the REDO operation. Any changes made to data files before that point are guaranteed to be already on disk. Hence, after a checkpoint, log segments preceding the one containing the redo record are no longer needed and can be recycled or removed.

Excerpt from the WAL File

rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 86/E5000028, prev 86/E4000108, desc: RUNNING_XACTS nextXid 2091743 latestCompletedXid 2091742 oldestRunningXid 2091743

rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 86/E5000060, prev 86/E5000028, desc: RUNNING_XACTS nextXid 2091743 latestCompletedXid 2091742 oldestRunningXid 2091743

rmgr: XLOG len (rec/tot): 106/ 106, tx: 0, lsn: 86/E5000098, prev 86/E5000060, desc: CHECKPOINT_ONLINE redo 86/E5000060; tli 1; prev tli 1; fpw false; xid 0:2091743; oid 190334; multi 1; offset 0; oldest xid 1664 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 2091743; online

rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 86/E5000108, prev 86/E5000098, desc: RUNNING_XACTS nextXid 2091743 latestCompletedXid 2091742 oldestRunningXid 2091743

rmgr: Heap len (rec/tot): 71/ 71, tx: 2091743, lsn: 86/E5000140, prev 86/E5000108, desc: HOT_UPDATE off 48 xmax 2091743 ; new off 49 xmax 0, blkref #0: rel 1663/182014/190331 blk 1

The content highlighted in yellow and the previous log files are no longer required for recovery.

When full_page_writes is set and the system is very busy, setting wal_buffers higher will help smooth response times during the period immediately following each checkpoint.

Internals of checkpoint

  1. After a checkpoint process starts, the REDO point is stored in memory; REDO point is the location to write the XLOG record at the moment when the latest checkpoint is started, and is the starting point of database recovery.
  2. A XLOG record of this checkpoint (i.e. checkpoint record) is written to the WAL buffer. The data-portion of the record is defined by the structure CheckPoint, which contains several variables such as the REDO point stored with step (1).
  3. In addition, the location to write checkpoint record is literally called the checkpoint.
  4. All data in shared memory (e.g. the contents of the clog, etc..) are flushed to the storag
  5. All dirty pages on the shared buffer pool are written and flushed into the storage, gradually.
  6. The pg_control file is updated. This file contains the fundamental information such as the location where the checkpoint record has written (a.k.a. checkpoint location). The details of this file later.