[6X Only] Fix missing xlog files with orphaned prepared transactions after cras… #10654

paul-guo- · 2020-08-17T06:10:32Z

…h recovery

After crash recovery finishes, the startup process will create an
end-of-recovery checkpoint. The checkpoint will recycle/remove xlog files
according to orphaned prepared transaction LSNs, replication slot data, etc.
The orphaned prepared transaction LSN data (TwoPhaseState->prepXacts, etc) for
checkpoint are populated in the startup process RecoverPreparedTransactions(),
but the function is called after the end-of-recovery checkpoint creation so
xlog files with orphaned prepared transactions might be recycled/removed. This
can cause "requested WAL segment pg_xlog/000000010000000000000009 has already
been removed" kind of error when bringing up the crashed primary. For example
if you run 'gprecoverseg -a -v', you might be able to see failure with stack as
below. In more details, this happens when running the single mode postgres in
pg_rewind.
2 0x5673a1 postgres (xlogutils.c:572)
3 0x567b2f postgres read_local_xlog_page (xlogutils.c:870)
4 0x5658f3 postgres (xlogreader.c:503)
5 0x56518a postgres XLogReadRecord (xlogreader.c:226)
6 0x54d725 postgres RecoverPreparedTransactions (twophase.c:1955)
7 0x55b0b7 postgres StartupXLOG (xlog.c:7760)
8 0xb0087f postgres InitPostgres (postinit.c:704)
9 0x97a97c postgres PostgresMain (postgres.c:4829)

Fixing this by prescanning the prepared transaction data from xlogs directly
for checkpoint creation if it's still InRecovery (i.e. when
TwoPhaseState->prepXacts is not populated).

Please note that "orphaned" might be temporary (usually happens on an
environment with heavy write-operation load) or permanent unless master dtx
recovery (implies a product bug).

asimrp

This is a very good catch! What complicates the matters is Greenplum's special way of handling prepared transactions state. Let me try to summarize my different places we record this information so far:

Shared memory referenced by TwoPhaseState records transactions in prepared but not committed/aborted state.
The prepared_transaction_agg_state records the list of currently prepared transactions in WAL, as an extension of checkpoint record. Each item in the list is of the form <xid, prepare_lsn>. This is created during checkpoint. Upstream (PostgreSQL 9.4) records prepared transactions in two phase state file instead and the file is fsync during checkpoint.
The crashRecoverPostCheckpointPreparedTransactions_map_ht hash table. It is local to a process. The entries are of the form <xid, prepare_lsn>. As far as I can tell, it's used by startup process during crash recovery and once recovery is complete, it's used by the checkpointer process.
The above hash table is populated during redo of prepare WAL records.
At startup, when the checkpoint record is read for the first time, the startup process populates this hash table based on the prepared xacts recorded in the checkpoint record. See SetupCheckpointPreparedTransactionList().
On standby, the startup / replay process populates this hash table but it is not used by the checkpointer process.

This patch fixes a bug where end-of-recovery checkpoint removes WAL segments containing prepared records that are still needed. The points 4 and 5 above seem to indicate the prepared transaction information is already available in the crashRecoverPostCheckpointPreparedTransactions_map_ht hash table. The WAL pages need not be read again. What do you think?

src/test/isolation2/expected/segwalrep/checkpoint_remove_xlog.out

asimrp · 2020-08-24T09:22:01Z

src/test/isolation2/expected/segwalrep/checkpoint_remove_xlog.out

+CHECKPOINT
+
+-- shutdown primary and make sure the segment is down
+-1U: select pg_ctl((SELECT datadir from gp_segment_configuration c where c.role='p' and c.content=0), 'stop', 'immediate');


If the segment is restarted instead of stop, would it be suffice? That is, can we avoid running gprecoverseg (and then rebalance)? Crash recovery upon restart should also create the desired end-of-recovery checkpoint.

Yes, 'immediate' restart could trigger the issue also (I quickly double-confirmed by modifying the test) but I'd still keep the current test logic since the current test logic is more like what we saw in a real product environment and also I want to test a bit on gprecoverseg on this kind of scenario.

My preference is to keep tests as simple as possible, it helps understand the problem clearly and it also reduces flakiness.

src/test/isolation2/expected/segwalrep/checkpoint_remove_xlog.out

asimrp · 2020-08-24T09:33:14Z

src/test/isolation2/expected/segwalrep/checkpoint_remove_xlog.out

+-- issue a checkpoint since a new checkpoint depends on previous checkpoint.redo
+-- for xlog file recycling/removing.
+checkpoint;
+CHECKPOINT


This checkpoint should retain all WAL up to 0/2C0AB170, the oldest prepared transaction.

asimrp · 2020-08-24T09:50:13Z

src/backend/access/transam/twophase.c

+void
+getTwoPhasePreparedTransactionData(prepared_transaction_agg_state **ptas)
+{
+	if (InRecovery)


Pedantic comment: my preference is to use RecoveryInProgress() instead of InRecovery flag because this code can be executed by startup process but also by checkpointer process, but InRecovey flag is meaningful only in the startup process.

getTwoPhasePreparedTransactionDataInRecovery() is expected to run in startup only. In startup, once InRecovery is false we depend on getTwoPhasePreparedTransactionDataNotInRecovery() for checkpointer. startup code RecoverPreparedTransactions() will recover the data structure for getTwoPhasePreparedTransactionDataNotInRecovery()

Right. On a slightly different note, the name *NotInRecovery() is not very appealing. The new name can be avoided if we refactor getTwoPhasePreparedTransactionData() like so:

--- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -2211,6 +2211,12 @@ getTwoPhasePreparedTransactionData(prepared_transaction_agg_state **ptas) Assert(*ptas == NULL); + if (InRecovery) + { + getTwoPhasePreparedTransactionDataInRecovery(ptas); + return; + } + TwoPhaseAddPreparedTransactionInit(ptas, &maxCount); for (int i = 0; i < numberOfPrepareXacts; i++)

asimrp · 2020-08-24T12:11:24Z

src/backend/access/transam/twophase.c

+		/* Deconstruct header */
+		hdr = (TwoPhaseFileHeader *) XLogRecGetData(tfRecord);
+
+		TwoPhaseAddPreparedTransaction(ptas, &maxCount, hdr->xid, &tfXLogRecPtr);


To be clear, what I'm trying to suggest is why cannot we use the xid from the hash table entry. The entry is of type:

typedef struct prpt_map { TransactionId xid; XLogRecPtr xlogrecptr; } prpt_map;

Yes, you are right. Thank. I copied the code from RecoverPreparedTransactions() and did pruning but that still made things complex.

paul-guo- · 2020-08-26T03:44:18Z

This is a very good catch! What complicates the matters is Greenplum's special way of handling prepared transactions state. Let me try to summarize my different places we record this information so far:

Shared memory referenced by TwoPhaseState records transactions in prepared but not committed/aborted state.

Yes.

The prepared_transaction_agg_state records the list of currently prepared transactions in WAL, as an extension of checkpoint record. Each item in the list is of the form <xid, prepare_lsn>. This is created during checkpoint. Upstream (PostgreSQL 9.4) records prepared transactions in two phase state file instead and the file is fsync during checkpoint.

Yes.

The crashRecoverPostCheckpointPreparedTransactions_map_ht hash table. It is local to a process. The entries are of the form <xid, prepare_lsn>. As far as I can tell, it's used by startup process during crash recovery and once recovery is complete, it's used by the checkpointer process.

crashRecoverPostCheckpointPreparedTransactions_map_ht is mainly used on startup only. But checkpointer process does not use that.

They are used on primaries for prepare and commit/abort prepare also but they seems to be useless.

EndPrepare()
FinishPreparedTransaction()

The above hash table is populated during redo of prepare WAL records.

Yes. Besides checkpint xlog redo.

At startup, when the checkpoint record is read for the first time, the startup process populates this hash table based on the prepared xacts recorded in the checkpoint record. See SetupCheckpointPreparedTransactionList().

Yes.

On standby, the startup / replay process populates this hash table but it is not used by the checkpointer process.

This patch fixes a bug where end-of-recovery checkpoint removes WAL segments containing prepared records that are still needed. The points 4 and 5 above seem to indicate the prepared transaction information is already available in the crashRecoverPostCheckpointPreparedTransactions_map_ht hash table. The WAL pages need not be read again. What do you think?

crashRecoverPostCheckpointPreparedTransactions_map_ht just includes xid + recptr. If the mirror is promoted we still need the wal of orphaned prepared transactions for commit/abort prepare.

asimrp

Thank you for incorporating the feedback.

They are used on primaries for prepare and commit/abort prepare also but they seems to be useless.

Let's remove the usage of crashRecoverPostCheckpointPreparedTransactions_map_ht from EndPrepare() and FinishPreparedTransaction(). It causes unnecessary confusion. Additionally, Assert(AmStartupProcess()) can be added to all functions that make use of this hash table.

src/backend/access/transam/twophase.c

asimrp · 2020-08-26T09:38:15Z

src/backend/access/transam/twophase.c

+void
+getTwoPhasePreparedTransactionData(prepared_transaction_agg_state **ptas)
+{
+	if (InRecovery)


Right. On a slightly different note, the name *NotInRecovery() is not very appealing. The new name can be avoided if we refactor getTwoPhasePreparedTransactionData() like so:

--- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -2211,6 +2211,12 @@ getTwoPhasePreparedTransactionData(prepared_transaction_agg_state **ptas) Assert(*ptas == NULL); + if (InRecovery) + { + getTwoPhasePreparedTransactionDataInRecovery(ptas); + return; + } + TwoPhaseAddPreparedTransactionInit(ptas, &maxCount); for (int i = 0; i < numberOfPrepareXacts; i++)

asimrp · 2020-08-26T09:44:09Z

In the commit message, can we please add a note on why this is a 6X-only change?

…h recovery After crash recovery finishes, the startup process will create an end-of-recovery checkpoint. The checkpoint will recycle/remove xlog files according to orphaned prepared transaction LSNs, replication slot data, etc. The orphaned prepared transaction LSN data (TwoPhaseState->prepXacts, etc) for checkpoint are populated in the startup process RecoverPreparedTransactions(), but the function is called after the end-of-recovery checkpoint creation so xlog files with orphaned prepared transactions might be recycled/removed. This can cause "requested WAL segment pg_xlog/000000010000000000000009 has already been removed" kind of error when bringing up the crashed primary. For example if you run 'gprecoverseg -a -v', you might be able to see failure with stack as below. In more details, this happens when running the single mode postgres in pg_rewind. 2 0x5673a1 postgres <symbol not found> (xlogutils.c:572) 3 0x567b2f postgres read_local_xlog_page (xlogutils.c:870) 4 0x5658f3 postgres <symbol not found> (xlogreader.c:503) 5 0x56518a postgres XLogReadRecord (xlogreader.c:226) 6 0x54d725 postgres RecoverPreparedTransactions (twophase.c:1955) 7 0x55b0b7 postgres StartupXLOG (xlog.c:7760) 8 0xb0087f postgres InitPostgres (postinit.c:704) 9 0x97a97c postgres PostgresMain (postgres.c:4829) Fixing this by prescanning the prepared transaction data from xlogs directly for checkpoint creation if it's still InRecovery (i.e. when TwoPhaseState->prepXacts is not populated). Please note that "orphaned" might be temporary (usually happens on an environment with heavy write-operation load) or permanent unless master dtx recovery (implies a product bug). This is a gpdb 6 only issue. On gpdb 7, state files are used to store prepared transactions during checkpoint so we do not keep the wal files with orphaned prepared transactions and thus we won't encounter this issue. Reviewed-by: Asim R P <pasim@vmware.com>

paul-guo- · 2020-08-27T03:17:39Z

Pushed into 6X_STABLE. Thanks for reviewing.

hidva · 2021-03-23T10:19:26Z

How does Checkpointer in mirror obtain the TwoPhasePreparedTransactionData? When mirror is promoted to primary and fast_promoted is not available, we will invoke RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) to do the end-of-recovery checkpoint, it means that CreateCheckPoint() will be invoked in the checkpointer process of mirror, and getTwoPhasePreparedTransactionData() in the checkpointer process will return an empty set of TwoPhasePreparedTransactionData. So if the mirror that has been promoted to primary panic when it receives T message for the first time, all TwoPhasePreparedTransactionData will be lost.

paul-guo- · 2021-03-24T03:12:14Z

How does Checkpointer in mirror obtain the TwoPhasePreparedTransactionData? When mirror is promoted to primary and fast_promoted is not available, we will invoke RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) to do the end-of-recovery checkpoint, it means that CreateCheckPoint() will be invoked in the checkpointer process of mirror, and getTwoPhasePreparedTransactionData() in the checkpointer process will return an empty set of TwoPhasePreparedTransactionData. So if the mirror that has been promoted to primary panic when it receives T message for the first time, all TwoPhasePreparedTransactionData will be lost.

@hidva You are right the fix does not seem to apply to !fast_promoted (i.e. fallback_promote) case, but no worry since fallback_promote is not used in GP6. For GP7 and above, there is no such issue.

hidva · 2021-03-24T03:16:38Z

fallback_promote

Is it possible for ReadCheckpointRecord() in if (fast_promote) to return NULL?

paul-guo- · 2021-03-25T01:39:09Z

@hidva I take back my yesterday comment. I read the context of this PR and code carefully today. This patch is to fix the issue that in the single-mode postgres launched by pg_rewind, see comment in the commit message.

In more details, this happens when running the single mode postgres in
pg_rewind.
2 0x5673a1 postgres (xlogutils.c:572)
3 0x567b2f postgres read_local_xlog_page (xlogutils.c:870)
4 0x5658f3 postgres (xlogreader.c:503)
5 0x56518a postgres XLogReadRecord (xlogreader.c:226)
6 0x54d725 postgres RecoverPreparedTransactions (twophase.c:1955)
7 0x55b0b7 postgres StartupXLOG (xlog.c:7760)
8 0xb0087f postgres InitPostgres (postinit.c:704)
9 0x97a97c postgres PostgresMain (postgres.c:4829)

For this an End-Of-Recovery checkpoint is created after crash recovery finishes, so surely there is promote. I should have mentioned the scenario at the beginning of this PR.

Another fix for "requested WAL segment pg_xlog/000000010000000000000003 has already been removed" kind of issue is restartpoint(checkpoint) relevant. See 4d282cb. FYI.

paul-guo- force-pushed the 6x_checkpoint branch from 253df53 to 6e8d17e Compare August 17, 2020 13:34

asimrp reviewed Aug 24, 2020

View reviewed changes

paul-guo- force-pushed the 6x_checkpoint branch from 6e8d17e to 05b33e5 Compare August 26, 2020 05:20

asimrp approved these changes Aug 26, 2020

View reviewed changes

paul-guo- force-pushed the 6x_checkpoint branch from 05b33e5 to 451d103 Compare August 26, 2020 12:50

paul-guo- force-pushed the 6x_checkpoint branch from 451d103 to f166978 Compare August 27, 2020 00:22

paul-guo- merged commit dafe0c8 into greenplum-db:6X_STABLE Aug 27, 2020

paul-guo- deleted the 6x_checkpoint branch August 27, 2020 03:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[6X Only] Fix missing xlog files with orphaned prepared transactions after cras… #10654

[6X Only] Fix missing xlog files with orphaned prepared transactions after cras… #10654

paul-guo- commented Aug 17, 2020

asimrp left a comment

asimrp Aug 24, 2020

paul-guo- Aug 26, 2020

asimrp Aug 26, 2020

asimrp Aug 24, 2020

asimrp Aug 24, 2020

paul-guo- Aug 26, 2020

asimrp Aug 26, 2020

asimrp Aug 24, 2020

paul-guo- Aug 26, 2020

paul-guo- commented Aug 26, 2020

asimrp left a comment

asimrp Aug 26, 2020

asimrp commented Aug 26, 2020

paul-guo- commented Aug 27, 2020

hidva commented Mar 23, 2021

paul-guo- commented Mar 24, 2021

hidva commented Mar 24, 2021

paul-guo- commented Mar 25, 2021

[6X Only] Fix missing xlog files with orphaned prepared transactions after cras… #10654

[6X Only] Fix missing xlog files with orphaned prepared transactions after cras… #10654

Conversation

paul-guo- commented Aug 17, 2020

asimrp left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

paul-guo- commented Aug 26, 2020

asimrp left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

asimrp commented Aug 26, 2020

paul-guo- commented Aug 27, 2020

hidva commented Mar 23, 2021

paul-guo- commented Mar 24, 2021

hidva commented Mar 24, 2021

paul-guo- commented Mar 25, 2021