New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[6X Only] Fix missing xlog files with orphaned prepared transactions after cras… #10654
Conversation
253df53
to
6e8d17e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very good catch! What complicates the matters is Greenplum's special way of handling prepared transactions state. Let me try to summarize my different places we record this information so far:
- Shared memory referenced by
TwoPhaseState
records transactions in prepared but not committed/aborted state. - The
prepared_transaction_agg_state
records the list of currently prepared transactions in WAL, as an extension of checkpoint record. Each item in the list is of the form <xid, prepare_lsn>. This is created during checkpoint. Upstream (PostgreSQL 9.4) records prepared transactions in two phase state file instead and the file is fsync during checkpoint. - The
crashRecoverPostCheckpointPreparedTransactions_map_ht
hash table. It is local to a process. The entries are of the form <xid, prepare_lsn>. As far as I can tell, it's used by startup process during crash recovery and once recovery is complete, it's used by the checkpointer process. - The above hash table is populated during redo of prepare WAL records.
- At startup, when the checkpoint record is read for the first time, the startup process populates this hash table based on the prepared xacts recorded in the checkpoint record. See
SetupCheckpointPreparedTransactionList()
. - On standby, the startup / replay process populates this hash table but it is not used by the checkpointer process.
This patch fixes a bug where end-of-recovery checkpoint removes WAL segments containing prepared records that are still needed. The points 4 and 5 above seem to indicate the prepared transaction information is already available in the crashRecoverPostCheckpointPreparedTransactions_map_ht
hash table. The WAL pages need not be read again. What do you think?
CHECKPOINT | ||
|
||
-- shutdown primary and make sure the segment is down | ||
-1U: select pg_ctl((SELECT datadir from gp_segment_configuration c where c.role='p' and c.content=0), 'stop', 'immediate'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the segment is restarted instead of stop, would it be suffice? That is, can we avoid running gprecoverseg (and then rebalance)? Crash recovery upon restart should also create the desired end-of-recovery checkpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, 'immediate' restart could trigger the issue also (I quickly double-confirmed by modifying the test) but I'd still keep the current test logic since the current test logic is more like what we saw in a real product environment and also I want to test a bit on gprecoverseg on this kind of scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preference is to keep tests as simple as possible, it helps understand the problem clearly and it also reduces flakiness.
-- issue a checkpoint since a new checkpoint depends on previous checkpoint.redo | ||
-- for xlog file recycling/removing. | ||
checkpoint; | ||
CHECKPOINT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This checkpoint should retain all WAL up to 0/2C0AB170, the oldest prepared transaction.
void | ||
getTwoPhasePreparedTransactionData(prepared_transaction_agg_state **ptas) | ||
{ | ||
if (InRecovery) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pedantic comment: my preference is to use RecoveryInProgress()
instead of InRecovery
flag because this code can be executed by startup process but also by checkpointer process, but InRecovey
flag is meaningful only in the startup process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getTwoPhasePreparedTransactionDataInRecovery()
is expected to run in startup only. In startup, once InRecovery is false we depend on getTwoPhasePreparedTransactionDataNotInRecovery()
for checkpointer. startup code RecoverPreparedTransactions()
will recover the data structure for getTwoPhasePreparedTransactionDataNotInRecovery()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. On a slightly different note, the name *NotInRecovery()
is not very appealing. The new name can be avoided if we refactor getTwoPhasePreparedTransactionData()
like so:
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2211,6 +2211,12 @@ getTwoPhasePreparedTransactionData(prepared_transaction_agg_state **ptas)
Assert(*ptas == NULL);
+ if (InRecovery)
+ {
+ getTwoPhasePreparedTransactionDataInRecovery(ptas);
+ return;
+ }
+
TwoPhaseAddPreparedTransactionInit(ptas, &maxCount);
for (int i = 0; i < numberOfPrepareXacts; i++)
/* Deconstruct header */ | ||
hdr = (TwoPhaseFileHeader *) XLogRecGetData(tfRecord); | ||
|
||
TwoPhaseAddPreparedTransaction(ptas, &maxCount, hdr->xid, &tfXLogRecPtr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, what I'm trying to suggest is why cannot we use the xid from the hash table entry. The entry is of type:
typedef struct prpt_map
{
TransactionId xid;
XLogRecPtr xlogrecptr;
} prpt_map;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right. Thank. I copied the code from RecoverPreparedTransactions()
and did pruning but that still made things complex.
Yes.
Yes.
crashRecoverPostCheckpointPreparedTransactions_map_ht is mainly used on startup only. But checkpointer process does not use that. They are used on primaries for prepare and commit/abort prepare also but they seems to be useless.
Yes. Besides checkpint xlog redo.
Yes.
|
6e8d17e
to
05b33e5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for incorporating the feedback.
They are used on primaries for prepare and commit/abort prepare also but they seems to be useless.
Let's remove the usage of crashRecoverPostCheckpointPreparedTransactions_map_ht
from EndPrepare() and FinishPreparedTransaction(). It causes unnecessary confusion. Additionally, Assert(AmStartupProcess())
can be added to all functions that make use of this hash table.
void | ||
getTwoPhasePreparedTransactionData(prepared_transaction_agg_state **ptas) | ||
{ | ||
if (InRecovery) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. On a slightly different note, the name *NotInRecovery()
is not very appealing. The new name can be avoided if we refactor getTwoPhasePreparedTransactionData()
like so:
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2211,6 +2211,12 @@ getTwoPhasePreparedTransactionData(prepared_transaction_agg_state **ptas)
Assert(*ptas == NULL);
+ if (InRecovery)
+ {
+ getTwoPhasePreparedTransactionDataInRecovery(ptas);
+ return;
+ }
+
TwoPhaseAddPreparedTransactionInit(ptas, &maxCount);
for (int i = 0; i < numberOfPrepareXacts; i++)
In the commit message, can we please add a note on why this is a 6X-only change? |
05b33e5
to
451d103
Compare
…h recovery After crash recovery finishes, the startup process will create an end-of-recovery checkpoint. The checkpoint will recycle/remove xlog files according to orphaned prepared transaction LSNs, replication slot data, etc. The orphaned prepared transaction LSN data (TwoPhaseState->prepXacts, etc) for checkpoint are populated in the startup process RecoverPreparedTransactions(), but the function is called after the end-of-recovery checkpoint creation so xlog files with orphaned prepared transactions might be recycled/removed. This can cause "requested WAL segment pg_xlog/000000010000000000000009 has already been removed" kind of error when bringing up the crashed primary. For example if you run 'gprecoverseg -a -v', you might be able to see failure with stack as below. In more details, this happens when running the single mode postgres in pg_rewind. 2 0x5673a1 postgres <symbol not found> (xlogutils.c:572) 3 0x567b2f postgres read_local_xlog_page (xlogutils.c:870) 4 0x5658f3 postgres <symbol not found> (xlogreader.c:503) 5 0x56518a postgres XLogReadRecord (xlogreader.c:226) 6 0x54d725 postgres RecoverPreparedTransactions (twophase.c:1955) 7 0x55b0b7 postgres StartupXLOG (xlog.c:7760) 8 0xb0087f postgres InitPostgres (postinit.c:704) 9 0x97a97c postgres PostgresMain (postgres.c:4829) Fixing this by prescanning the prepared transaction data from xlogs directly for checkpoint creation if it's still InRecovery (i.e. when TwoPhaseState->prepXacts is not populated). Please note that "orphaned" might be temporary (usually happens on an environment with heavy write-operation load) or permanent unless master dtx recovery (implies a product bug). This is a gpdb 6 only issue. On gpdb 7, state files are used to store prepared transactions during checkpoint so we do not keep the wal files with orphaned prepared transactions and thus we won't encounter this issue. Reviewed-by: Asim R P <pasim@vmware.com>
451d103
to
f166978
Compare
Pushed into 6X_STABLE. Thanks for reviewing. |
How does Checkpointer in mirror obtain the TwoPhasePreparedTransactionData? When mirror is promoted to primary and |
@hidva You are right the fix does not seem to apply to !fast_promoted (i.e. fallback_promote) case, but no worry since fallback_promote is not used in GP6. For GP7 and above, there is no such issue. |
Is it possible for |
@hidva I take back my yesterday comment. I read the context of this PR and code carefully today. This patch is to fix the issue that in the single-mode postgres launched by pg_rewind, see comment in the commit message.
For this an End-Of-Recovery checkpoint is created after crash recovery finishes, so surely there is promote. I should have mentioned the scenario at the beginning of this PR. Another fix for "requested WAL segment pg_xlog/000000010000000000000003 has already been removed" kind of issue is restartpoint(checkpoint) relevant. See 4d282cb. FYI. |
…h recovery
After crash recovery finishes, the startup process will create an
end-of-recovery checkpoint. The checkpoint will recycle/remove xlog files
according to orphaned prepared transaction LSNs, replication slot data, etc.
The orphaned prepared transaction LSN data (TwoPhaseState->prepXacts, etc) for
checkpoint are populated in the startup process RecoverPreparedTransactions(),
but the function is called after the end-of-recovery checkpoint creation so
xlog files with orphaned prepared transactions might be recycled/removed. This
can cause "requested WAL segment pg_xlog/000000010000000000000009 has already
been removed" kind of error when bringing up the crashed primary. For example
if you run 'gprecoverseg -a -v', you might be able to see failure with stack as
below. In more details, this happens when running the single mode postgres in
pg_rewind.
2 0x5673a1 postgres (xlogutils.c:572)
3 0x567b2f postgres read_local_xlog_page (xlogutils.c:870)
4 0x5658f3 postgres (xlogreader.c:503)
5 0x56518a postgres XLogReadRecord (xlogreader.c:226)
6 0x54d725 postgres RecoverPreparedTransactions (twophase.c:1955)
7 0x55b0b7 postgres StartupXLOG (xlog.c:7760)
8 0xb0087f postgres InitPostgres (postinit.c:704)
9 0x97a97c postgres PostgresMain (postgres.c:4829)
Fixing this by prescanning the prepared transaction data from xlogs directly
for checkpoint creation if it's still InRecovery (i.e. when
TwoPhaseState->prepXacts is not populated).
Please note that "orphaned" might be temporary (usually happens on an
environment with heavy write-operation load) or permanent unless master dtx
recovery (implies a product bug).