How TLA+ Caught a Silent Data Divergence Bug in Postgres’s pg_rewind
Multigres is Supabase's high-availability and horizontal scaling project for Postgres. While modeling Postgres streaming replication with TLA+, the Multigres team at Supabase found a silent data divergence bug in pg_rewind. Here's what we found and how we propose to fix it.
Postgres HA with Streaming Replication
Postgres’s built-in high-availability story centers on streaming replication. The basic idea is that a primary server writes changes to its write-ahead log (WAL) and streams those changes to one or more standby servers in near real time. If the primary fails, a standby can be promoted to take over as the new primary. This failover process is fast, well-understood, and widely used in production.
When the old primary eventually comes back, it cannot simply rejoin as a standby. While it was down, the cluster moved on: the new primary may have written records the old primary never saw, and the old primary may have written records that nobody else ever saw. If the old primary is merely behind the new primary, it will eventually catch up; the real problem is when the old primary contains writes that are not present on the new primary. In other words, the two servers’ histories have diverged. Naively connecting the old primary as a standby would leave it in an inconsistent state: the divergent writes could reappear from the users’ perspective if the old primary is later promoted again, or show up as phantom reads if it is used as a read replica.
This is the problem pg_rewind solves. It compares the old primary’s data directory against the new primary’s, identifies the point of divergence, and rewinds the old primary’s state back to that point so it can cleanly follow the new primary’s WAL. After pg_rewind completes, the old primary rejoins as a standby with a consistent view of history.
Timelines
Central to all of this is the concept of timelines. Every time a standby is promoted, Postgres increments the timeline ID (TLI). The new primary writes a timeline history file, named 00000002.history, 00000003.history, and so on. It records the server’s lineage: which TLI it came from, the WAL location where the switch happened, and a human-readable string giving the reason for the promotion.
# 00000002.history
1 0/3000000 no recovery target specified
The reason string is generated by Postgres and is not something the user can affect. Normally the reason is just “no recovery target specified”, but in a few other scenarios it can contain other messages.
Note that each timeline history file contains the complete promotion history of that server, not just the most recent promotion. So 00000003.history records both the switch from TLI 1 to TLI 2 and the switch from TLI 2 to TLI 3, with the LSN for each switch.
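For example, a server that has been promoted twice ends up with a history file listing both switches (the LSNs below are illustrative):
# 00000003.history
1 0/3000000 no recovery target specified
2 0/5000000 no recovery target specified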
This lineage is what pg_rewind uses to find the divergence point. It fetches the timeline history from both the source (that is, the current primary) and the target (that is, the server being rewound), and finds the most recent timeline that the source and target have in common. Everything after that point on the target is rewound.
The invariant pg_rewind is supposed to enforce is simple to state: after it finishes, the target should contain no data that did not come from the source’s history. As we will see, there is a scenario where this invariant is silently violated.
The Bug: Two Crashes, Two Promotions, One Collision
The failure scenario requires two crashes in quick succession. This is unusual, but not impossible in a real production environment, especially during the kind of cascading failure that high-availability systems are designed to survive.
With synchronous replication, a write is acknowledged by a standby before the client receives its commit confirmation. There is, however, a window between when the primary writes a record to its WAL, when the record is sent to the standbys, and when it is acknowledged. If the primary crashes in that window before the WAL record is sent out, the client never receives its commit confirmation, and the record exists only in the primary’s local WAL; no other node has ever seen it. Since the entire point of high availability is to guarantee that changes do not magically disappear, this is a real problem.
This is precisely why pg_rewind is necessary: without it, those records would become phantom writes. They would be present on the old primary but unknown to the rest of the cluster. When the old primary rejoins as a standby, it would silently carry data that is not known to any other node.
Consider a group of three servers, S1 (primary), S2, and S3 (standbys), configured for high availability by promoting a standby when the primary goes down. All start on timeline 1, and each transaction must be acknowledged by one standby to ensure no transactions are lost.
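One way such a setup is typically configured on the primary is sketched below; this is illustrative only, and the names s2 and s3 are assumptions about the application_name each standby uses when connecting:
# postgresql.conf on the current primary (illustrative)
synchronous_commit = on
synchronous_standby_names = 'ANY 1 (s2, s3)'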
- S1 crashes. The group is now without a primary.
- S2 is selected as the new primary and promoted. It transitions to TLI 2 and writes a new history file.
- S2 starts accepting writes and writes a record W1 to its WAL. Since synchronous replication with one acknowledging standby is used, S2 now waits for acknowledgement from a standby before confirming the write to the client.
- S2 crashes before W1 has been streamed to any standby. S2 is now on TLI 2, carrying data that exists nowhere else.
- S3 is selected as primary and promoted. It has no knowledge of the promotion of S2, so it promotes to TLI 2 independently, from the same point in TLI 1 that S2 promoted from.
- S1 recovers, runs pg_rewind with S3 as source (which does not actually rewind anything), and connects as a standby to the now fully functional primary S3. It moves over to the TLI 2 that S3 is using.
- S3 writes some records W2 to its WAL. Since synchronous replication with one acknowledging standby is used, S3 waits for acknowledgement from a standby before confirming these writes to the client. Since S1 is now connected as a standby, the writes are acknowledged.
- The group now has two distinct versions of TLI 2: call them TLI 2.1 (on S2) and TLI 2.2 (on S3).
- S2 recovers again and needs to rejoin as a standby. You run pg_rewind with S3 (the current primary) as source to remove the divergent changes.
- Here, pg_rewind sees that both S2 and S3 are on TLI 2 and concludes there is nothing to rewind. It exits successfully. S2 rejoins, still carrying W1, data that was never replicated or acknowledged.
The cluster reports no errors. S2 looks like a healthy standby. The divergent data sits silently in its storage. If you query the node, it will report that W1 exists, and if you later promote S2 to be a primary again, the data will appear on the primary. From the user’s perspective, the data appeared from nowhere.
It is worth noting that this scenario is probably not entirely unknown to the Postgres community, but as of the time of writing, it remains unfixed in the Postgres source.
What makes this especially insidious is the structure of the timeline history files. Both S2 and S3 produced a 00000002.history file, and both files look identical to pg_rewind: the same parent TLI, the same switch LSN (the point where both diverged from the parent timeline). There is nothing in the existing file format to distinguish an independent promotion from a legitimate continuation of the same timeline.
Modeling It With TLA+
The scenario above is difficult to catch with conventional testing. It requires two crashes within a precise window: the first promoted server must advance its timeline, accept a write, and crash before any standby connects to it, and the second, independent promotion must happen before the first server rejoins. A test suite would have to know in advance that this particular interleaving is worth looking for.
This is exactly the kind of problem that TLA+ is designed to find. TLA+ is a formal specification language for describing concurrent and distributed systems, and TLC is its model checker. It exhaustively explores every possible state and transition in a model to verify that a given invariant always holds. Rather than running a system and hoping the bad interleaving shows up, TLC systematically enumerates all interleavings and reports a counterexample the moment it finds one that violates the invariant.
The TLA+ models for this work are available at github.com/mkindahl/tla-postgres. Note that they are a work in progress.
The model
The model captures the key actors and actions in a Postgres streaming replication setup. Note that a model does not need to be a fully faithful representation of the system being checked; in fact, it is a bad idea to put more into the model than necessary, since that increases the state space the model checker has to explore.
State
The state is split across two levels. At the top level, TLA+ variables track the cluster-wide view: role for each server’s current status, connected for whether each standby is streaming, and replication channels between servers:
VARIABLES
    server,          \* [Servers -> server record]
    role,            \* [Servers -> {"primary", "promoting", "standby", "down"}]
    connected,       \* [Servers -> BOOLEAN]
    pendingCommits,  \* set of [client |-> c, lsn |-> n] awaiting sync ack
    ackedCommitLsns  \* set of Nat: LSNs of commits acked to clients
    \* ... and several more
Each entry in server is a record defined in the PromotionServer module that captures the full state of a single Postgres instance: its WAL, its storage, its current LSN, and its timeline:
InitServer(tl) ==
    [ storage     |-> {},    \* set of write IDs applied to storage
      wal         |-> <<>>,  \* sequence of WAL entries
      lsn         |-> 0,     \* current Log Sequence Number
      timeline    |-> tl,    \* current timeline ID
      replayIndex |-> 0,     \* index of last replayed WAL entry
      nextWrite   |-> 1      \* next client write ID to assign
    ]
The storage field is the key one for our invariant: it is the set of write identifiers that have been applied to the server’s durable storage.
Writes are modeled as monotonically increasing integers assigned by the primary to each client write, starting at 1. Each WAL entry carries one write identifier, and they appear in the WAL in the order the writes were made. The model uses write as a simple abstract representation of data: rather than modeling the actual content of a row or page, it tracks which writes have occurred and whether they are present on each server. This is sufficient to detect the bug, because what matters is not what the data says, but whether a server holds writes that the current primary has never seen.
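To make this concrete, after three client writes the primary’s entry in server might look roughly like this (a simplified illustration; in the actual spec the WAL entries may be richer records and other fields may differ):
\* Illustrative server record after three client writes
[ storage     |-> {1, 2, 3},
  wal         |-> <<1, 2, 3>>,
  lsn         |-> 3,
  timeline    |-> 1,
  replayIndex |-> 0,
  nextWrite   |-> 4 ]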
Actions are TLA+ predicates that describe how the state may change in a single atomic step. In TLA+, a primed variable X' denotes the value of X after the action. So role' = [role EXCEPT ![p] = "down"] means “the new value of role is the old value with server p set to "down"”. Variables listed in UNCHANGED keep their current values. For example, PrimaryCrash models the primary failing: it sets its role to "down", disconnects all standbys, and clears in-flight replication channels:
PrimaryCrash ==
    /\ HasPrimary
    /\ LET p == ThePrimary
       IN /\ role' = [role EXCEPT ![p] = "down"]
          /\ connected' = [s \in Servers |-> FALSE]
          /\ pendingCommits' = {}
          /\ clientStatus' = [c \in Clients |-> "idle"]
          /\ replChan' = [s \in Servers |-> <<>>]
          /\ ackChan' = [s \in Servers |-> <<>>]
    /\ UNCHANGED << server, ackedCommitLsns, sentIdx, ackedLsn >>
The HasPrimary guard ensures the action can only fire when a primary exists. ThePrimary is a helper that returns the identity of the current primary server.
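Both helpers are one-liners; plausible definitions (the actual spec may phrase them slightly differently) are:
HasPrimary == \E s \in Servers : role[s] = "primary"
ThePrimary == CHOOSE s \in Servers : role[s] = "primary"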
The invariant we care about most is StorageConsistency. A phantom write is a write identifier that appears in a connected standby’s storage but not in the primary’s, and the invariant states that no such write exists. The formal expression is:
StorageConsistency ==
    LET primary == CHOOSE p \in Servers : role[p] = "primary"
    IN \A s \in Servers :
        ( role[s] = "standby" /\ connected[s] ) =>
            server[s].storage \subseteq server[primary].storage
In plain English: for every connected server whose role is “standby”, its storage must be a subset of the storage of the server whose role is “primary”. Any write identifier present on such a standby but absent from the primary is a phantom: data that exists nowhere in the cluster’s agreed history.
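To have TLC check the property, the invariant is listed in the model's configuration file (the .cfg file). A minimal sketch, assuming the top-level behavior specification is named Spec and omitting constant definitions such as the set of servers:
SPECIFICATION Spec
INVARIANT StorageConsistency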
The counterexample
When TLC runs against the model, it finds the counterexample quickly, after exploring about 7,000 distinct states. The trace it produces maps directly onto the sequence described above: two crashes, two independent promotions to TLI 2, and a pg_rewind run that skips the rewind entirely because it sees matching timeline identifiers. The invariant is violated: the rejoining server still holds W1, a record the current primary has never seen.
The value of the TLC counterexample is not just that it confirms the bug exists. It also gives you a precise, reproducible witness. That witness can be translated directly into a regression test, which is what the proposed fix does (more on that in the next section).
What pg_rewind does wrong
To understand why pg_rewind fails silently, it helps to look at what it actually checks when deciding whether a rewind is needed.
The same-TLI shortcut
When pg_rewind starts, one of the first things it does is compare the current timeline ID (TLI) of the source (the new primary) and the target (the server being rewound). If they match, it takes a shortcut: it concludes that both servers are on the same timeline, that no divergence has occurred, and that there is nothing to rewind. It exits successfully without touching the target’s data directory.
This shortcut is the source of the bug. In the scenario above, both S2 and S3 are genuinely on TLI 2, but they arrived there independently, via separate promotions. The shortcut has no way to distinguish this from the normal case where a server simply fell behind on the same timeline and needs no rewinding.
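In the model, this decision shows up as a simple timeline comparison. The predicate below is a sketch introduced here for illustration; in the actual spec the check is embedded in the StandbyConnect action rather than named separately:
\* pg_rewind's shortcut as mirrored by the model: skip the rewind whenever
\* the rejoining server and the primary report the same timeline ID.
NoRewindNeeded(target, source) ==
    server[target].timeline = server[source].timeline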
The root cause is that the TLI alone is not enough to identify a timeline uniquely. Two servers can independently promote to TLI 2 from the same TLI 1 base, producing history files with identical content:
# 00000002.history on S2 (TLI 2.1)
1 0/3000000 no recovery target specified
# 00000002.history on S3 (TLI 2.2)
1 0/3000000 no recovery target specified
From the perspective of pg_rewind, these files are indistinguishable. There is nothing in them to signal that the two timelines numbered 2 are the result of different promotions.
The deeper three-timeline case
The same problem appears in a subtler form when the history is longer. Suppose the target has gone through three timelines (TLI 1 → TLI 2 → TLI 3), while the source independently promoted from TLI 1 directly to its own TLI 2 (call it TLI 2’). Their history files look like this:
# Target: 00000003.history
1 0/3000000 no recovery target specified
2 0/5000000 no recovery target specified
# Source: 00000002.history
1 0/3000000 no recovery target specified
When pg_rewind walks through these timelines, it sees TLI 1 match on both sides, then sees the target’s TLI 2 match the source’s TLI 2’, since they have the same timeline number and the same begin LSN. It accepts that timeline as the common ancestor and starts its WAL scan from the target’s TLI 2 shutdown checkpoint.
But that starting point is too late. Any writes that happened on the target during its TLI 2, before the TLI 3 promotion, fall before the scan window. They are never copied back from the source, and they survive the rewind intact. This is another silent phantom write, this time buried one promotion deeper.
The Fix: Promotion UUIDs
The fix is conceptually simple: give each promotion a unique identity that survives in the timeline history file, so that two independent promotions to the same TLI can always be told apart. A UUID is a simple way to ensure a unique identifier without a central coordinator.
The full proposal is available on the pgsql-hackers mailing list at the original thread.
A UUID per promotion
Every time a standby promotes, Postgres now generates a UUIDv7 value and writes it into the timeline history file alongside the existing TLI number and switch LSN. UUIDv7 was chosen for two reasons. First, UUIDv7 values are time-ordered: the leading bits encode a millisecond-precision timestamp, which means UUIDs generated at different times sort naturally by when the promotion occurred. This helps in debugging and auditing. Second, the remaining bits are random, making collisions between two independently generated UUIDs astronomically unlikely even if two promotions happen at exactly the same millisecond. Together these properties give each promotion a unique, time-stamped identity that is cheap to generate, safe to compare, and does not require a centralized coordinator. This fits naturally into a failover scenario where servers may not be able to reach each other. The history file format becomes:
# 00000002.history (new format)
1 0/3000000 019612a3-1234-7abc-8def-000000000001 no recovery target specified
Two servers that independently promote to TLI 2 will each generate a different UUID. Their 00000002.history files now look like this, and the two timelines can be told apart because they carry different UUIDs:
# 00000002.history on S2 (TLI 2.1)
1 0/3000000 019612a3-1234-7abc-8def-000000000001 no recovery target specified
# 00000002.history on S3 (TLI 2.2)
1 0/3000000 019612a3-5678-7def-9abc-000000000002 no recovery target specified
The UUID is used in two places in pg_rewind:
- In findCommonAncestorTimeline, the function now compares UUIDs for each TLI entry as it walks the history. A UUID mismatch signals independent promotions, even when the TLI number and begin LSN agree.
- In the same-TLI shortcut, before concluding that no rewind is needed, pg_rewind now also compares the UUID of the most recent completed history entry on both sides. A mismatch forces a full rewind instead of a no-op.
Backward compatibility
Old timeline history files have no UUID field. The parser treats a missing or unrecognized UUID as all-zeros, and the comparison logic treats a zero UUID on either side as “unknown, treat as compatible”. This means the fix degrades gracefully when rewinding against a server running an older version of Postgres: the UUID check simply does not fire, and behavior is identical to before.
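Put together, the comparison rule can be summarized in TLA+ terms roughly as follows. This is an illustrative sketch, not the patch itself; the field names tli, switchLsn, and uuid are assumptions about how a history entry might be represented:
ZeroUuid == "00000000-0000-0000-0000-000000000000"

\* Two history entries describe the same promotion only if the timeline ID,
\* switch LSN, and promotion UUID all agree; a zero UUID (an entry from an
\* old-format history file) is treated as compatible.
SamePromotion(a, b) ==
    /\ a.tli = b.tli
    /\ a.switchLsn = b.switchLsn
    /\ \/ a.uuid = b.uuid
       \/ a.uuid = ZeroUuid
       \/ b.uuid = ZeroUuid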
Demonstration
The clearest way to see the impact of the fix is to walk through what happens in the same scenario with and without it, using the test cases from the patch. It also allows you to see how to set up replication for high-availability and test this yourself.
Setting up the topology
Start by creating the server that will be used as the primary: run initdb, start the server, and create a table with some data in it.
$ cd /tmp/example
$ export LANG=C
$ export LC_ALL=C
$ initdb -D primary -c port=5450 -c wal_log_hints=on -c wal_keep_size=64MB
The files belonging to this database system will be owned by user "mats".
This user must also own the server process.
...
$ pg_ctl -D primary -l primary/postgresql.log start
waiting for server to start.... done
server started
$ psql -p 5450 -d postgres
psql (18.3 (Homebrew))
Type "help" for help.
postgres=# CREATE TABLE tbl (val text);
CREATE TABLE
postgres=# INSERT INTO tbl VALUES ('some value');
INSERT 0 1
postgres=# \q
Next, set up the standbys by taking a backup of the primary and using it to bootstrap the two standbys. Since wal_log_hints = on was set, all servers can later be rewound with pg_rewind. The -m immediate flag used below simulates a crash; if you skip it, a shutdown checkpoint is written to the WAL, and the example still works but behaves slightly differently.
$ pg_basebackup -p 5450 -D standby1 -R -X stream
$ echo "port = 5451" >>standby1/postgresql.conf
$ pg_basebackup -p 5450 -D standby2 -R -X stream
$ echo "port = 5452" >>standby1/postgresql.conf
$ pg_ctl -D primary stop -m immediate
Now none of the servers are running, but the two standbys are clones of the primary, and all servers are on TLI 1. This situation mimics a crashed primary where the data has been replicated to the standbys and all servers are in sync.
First promotion of a standby
Now start one of the standbys and promote it, which moves it over to TLI 2. You can check that it is on timeline 2 by using the pg_control_checkpoint() function, which returns information from the pg_control file, including the timeline.
$ pg_ctl -D standby1 -l standby1/postgresql.log start
waiting for server to start.... done
server started
$ pg_ctl -D standby1 promote
waiting for server to promote.... done
server promoted
$ psql -p 5451 -d postgres
psql (18.3 (Homebrew))
Type "help" for help.
postgres=# SELECT timeline_id, redo_lsn FROM pg_control_checkpoint();
timeline_id | redo_lsn
-------------+-----------
2 | 0/3000098
(1 row)
Note that no replica is connected to the newly promoted server right now, so we can insert some data that will not be replicated to any standby, simulating a crash of the new primary before it can stream the changes out.
postgres=# INSERT INTO tbl VALUES ('my magic');
INSERT 0 1
postgres=# SELECT * FROM tbl;
val
------------
some value
my magic
(2 rows)
postgres=# \q
$ pg_ctl -D standby1 stop -m immediate
waiting for server to shut down.... done
server stopped
After the second crash, promote the second standby in the same way as the first one. As you can see, it also ends up on timeline 2. Then add some data that differs from the data on the first, crashed standby.
$ pg_ctl -D standby2 -l standby2/postgresql.log start
waiting for server to start.... done
server started
$ pg_ctl -D standby2 promote
waiting for server to promote.... done
server promoted
$ psql -p 5452 -d postgres
psql (18.3 (Homebrew))
Type "help" for help.
postgres=# SELECT timeline_id, redo_lsn FROM pg_control_checkpoint();
timeline_id | redo_lsn
-------------+-----------
2 | 0/40001D8
(1 row)
postgres=# INSERT INTO tbl VALUES ('here be dragons');
INSERT 0 1
postgres=# SELECT * FROM tbl;
val
-----------------
some value
here be dragons
(2 rows)
Recover the primary
So, there is now:
- A crashed primary on timeline 1.
- A crashed standby on timeline 2.
- A running standby on timeline 2’ (the duplicate) that will be the new primary.
Both timeline 2 and timeline 2’ contain additional changes, made after the respective promotions, that never reached any other server.
Everything is now set up for recovering the primary and the standby and connecting them to the now promoted standby. Simulating the recovery of the primary is straightforward, and this is where pg_rewind is needed.
$ pg_rewind -P -D primary --source-server="host=localhost port=5452 dbname=postgres"
...
pg_rewind: Done!
$ echo "port = 5450" >> primary/postgresql.conf
$ pg_ctl -D primary -l primary/postgresql.log start
Rewinding the primary makes sure that any data potentially present on the old primary but not on the promoted standby (the new primary) is removed. It is necessary to add the old primary’s port back to the configuration file, since this example runs all servers on different ports on the same machine and pg_rewind copies the configuration file from the source. If you run on separate nodes with the same port everywhere, this step is not needed.
The -P option only prints progress information while pg_rewind works. To actually point a rewound server at the new primary, the -R option (used for the standby in the next step) makes pg_rewind write the necessary replication settings into postgresql.auto.conf, which is read last and overrides any settings in postgresql.conf.
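For reference, the configuration that -R writes ends up as a primary_conninfo line in postgresql.auto.conf, derived from the --source-server string, together with a standby.signal file. Roughly like this (exact contents vary by version and environment):
# postgresql.auto.conf (written by pg_rewind -R; illustrative)
primary_conninfo = 'host=localhost port=5452 dbname=postgres'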
The next step is to add the crashed standby as a standby of the new primary as well. This time, take a look at the output of pg_rewind.
$ pg_rewind -R -D standby1 '--source-server=host=localhost port=5452 dbname=postgres'
pg_rewind: executing ...
...
pg_rewind: source and target cluster are on the same timeline
pg_rewind: no rewind required
$ echo "port = 5451" >> standby1/postgresql.conf
$ pg_ctl -D standby1 -l standby1/postgresql.log start
waiting for server to start.... done
server started
Looking at the output, you can see that pg_rewind decided that no rewind was necessary since they were on the same timeline. However, this standby is on timeline 2 and the promoted standby is on timeline 2’, so this is not correct.
Checking the contents of the two servers shows the problem quite clearly.
$ psql -p 5452 -d postgres -c 'SELECT * FROM tbl'
val
-----------------
some value
here be dragons
(2 rows)
$ psql -p 5451 -d postgres -c 'SELECT * FROM tbl'
val
------------
some value
my magic
(2 rows)
With the patched version of the server, the result of running pg_rewind is different. The output clearly shows that the mismatching timelines are detected and that the rewind goes back to an earlier, common timeline. If you try this yourself, make sure that you run both a patched pg_rewind and a patched server; using a patched pg_rewind against an unpatched server will give strange errors.
$ pg_rewind -R -D standby1 --source-server="host=localhost port=5452 dbname=postgres"
pg_rewind: servers diverged at WAL location 0/03000000 on timeline 1
pg_rewind: rewinding from last common checkpoint at 0/02000088 on timeline 1
pg_rewind: Done!
Reflections on Using TLA+ for Database Work
Finding this bug required reasoning about a precise interleaving of events across multiple nodes. Two crashes within a narrow window, interleaved with two independent promotions, is essentially impossible to trigger reliably in an integration test. You would have to know in advance that this exact sequence was worth looking for, and then engineer a test harness capable of pausing execution at just the right moments. In practice, this class of bug tends to go undetected until it surfaces in production under the wrong conditions.
TLA+ approaches the problem differently. Rather than running the system and hoping the bad interleaving shows up, you write a model of the system’s possible behaviors and let TLC enumerate them exhaustively. The model for this work is modest in size, a few hundred lines of TLA+ across two modules, but the state space it explores is large enough to cover all the relevant crash and promotion sequences. TLC found the counterexample quickly.
A few things made TLA+ well suited here:
The model stays close to the real system. The actions in the spec (PrimaryCrash, BeginPromote, StandbyConnect) map directly onto real Postgres behaviors. The StandbyConnect action even models the specific pg_rewind shortcut that contains the bug: when source and target are on the same timeline, it skips the rewind. This means the counterexample TLC produces is not an abstract state sequence. It is a concrete description of what a real Postgres cluster would do.
The invariant is precise and simple. StorageConsistency is eleven lines of TLA+. It says what we want: a connected standby’s storage must be a subset of the primary’s storage. There is no ambiguity about what “correct” means, and no room for the kind of underspecification that lets bugs hide in the gap between what a test checks and what the system actually guarantees.
The counterexample becomes a regression test. The trace TLC produces maps step-by-step onto the test cases in the patch. This is one of the most practical payoffs of formal modeling: the witness to a violation is precise enough to drive a real test, not just a conceptual understanding of the problem.
The broader lesson is that formal methods pay off most where testing is weakest: rare, timing-dependent, multi-node failure scenarios. These are the scenarios that matter most for a high-availability system, and the ones that conventional testing is least equipped to find.
Conclusion
The bug described here is a good example of how correctness issues in distributed systems can hide in plain sight. The scenario involves two crashes and two independent promotions to the same timeline ID. It is unusual enough that no existing test covered it, yet realistic enough to occur in a real production cluster under cascading failure. The consequence is a standby silently carrying phantom writes with no error from pg_rewind. This is the kind of data integrity issue that HA systems are supposed to prevent.
The fix is small and elegant: embed a UUIDv7 in each timeline history entry at promotion time, and use it to distinguish independent promotions that happen to share a timeline number. The change is fully backward-compatible, and the counterexample TLC produced translated directly into regression tests that cover both the simple and three-timeline variants.
The TLA+ models are available at github.com/mkindahl/tla-postgres, and the proposed patch and discussion can be found on the pgsql-hackers mailing list at the original proposal. Feedback on both the model and the patch is very welcome. This work is part of Supabase's broader investment in making Postgres the most reliable foundation for high-availability workloads. We contribute these findings back to the Postgres community because Postgres's reliability is our reliability.