Skip to content

PG-2234 Fix pg_rewind with encrypted tables#559

Open
dAdAbird wants to merge 13 commits into
percona:mainfrom
dAdAbird:fix_pg_rewind
Open

PG-2234 Fix pg_rewind with encrypted tables#559
dAdAbird wants to merge 13 commits into
percona:mainfrom
dAdAbird:fix_pg_rewind

Conversation

@dAdAbird
Copy link
Copy Markdown
Member

@dAdAbird dAdAbird commented Mar 31, 2026

The rewind might update blocks on target's file with that of the source.
In that case, encrypted files will end up with different blocks
encrypted with different internal keys, leading to data corruption.
To prevent this, we have to decryp blocks coming from the source and
re-encrypt them with the target's internal key.

Along with that, providers might have been changed, primary keys
rotated, etc., after the divergence of clusters. Generally, it's not a
problem because we would replace the keys and providers on the target
with those of the source. But it will render internal keys that we have
used for the partial blocks unreadable. To fix this, we have to
re-encrypt such internal keys with source's primary keys.

So the sequence is next:

  1. Copy source's pg_tde dir into a tmp location. All following reads
    and writes of source keys and providers happen in this tmp dir.
  2. When we have partial updates, check for the source key. And
    re-encrypt blocks if the key exists.
  3. Save the target key into the source _keys file replacning the
    existing one.
  4. Replace target's pg_tde dir with the tmp one.

See commit messages for additional info.

There is no need to do anything with WAL keys. Although rewind
may keep a segment from the destination, it might only be a segment
that was created before common history (before the replica was created);
therefore, the key for such a segment would be copied on the replica
while pg_basebackup. The rest of the segments would be wiped out from
the destination and replaced by the source's ones (along with their keys).

All changes only for pg18 so far. I'll port them to pg17 when I get approvals

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 73.75415% with 79 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.22%. Comparing base (bbafa9f) to head (feda23e).
⚠️ Report is 3 commits behind head on main.

❌ Your project status has failed because the head coverage (76.70%) is below the target coverage (90.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #559      +/-   ##
==========================================
+ Coverage   57.69%   58.22%   +0.52%     
==========================================
  Files          68       69       +1     
  Lines       10758    10997     +239     
  Branches     2650     2704      +54     
==========================================
+ Hits         6207     6403     +196     
- Misses       3278     3289      +11     
- Partials     1273     1305      +32     
Components Coverage Δ
access 82.14% <100.00%> (+0.40%) ⬆️
bin 64.15% <83.33%> (+0.38%) ⬆️
catalog 78.69% <100.00%> (+0.63%) ⬆️
common 92.30% <100.00%> (+4.07%) ⬆️
encryption 63.63% <100.00%> (+6.61%) ⬆️
keyring 65.55% <ø> (ø)
src 87.33% <ø> (ø)
smgr 90.20% <100.00%> (+0.66%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dAdAbird dAdAbird force-pushed the fix_pg_rewind branch 14 times, most recently from 83d50fb to 4d3c4ba Compare April 10, 2026 19:00
@dAdAbird dAdAbird changed the title Fix pg rewind PG-2234 Fix pg_rewind with encrypted tables Apr 10, 2026
@dAdAbird dAdAbird marked this pull request as ready for review April 10, 2026 19:04
Copy link
Copy Markdown
Collaborator

@jeltz jeltz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some quick review comments.

{
char *check_path = datasegpath(rlocator, MAIN_FORKNUM, segNo);

if (strcmp(check_path, path) != 0)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like you accidentally removed the comment explaining this part of the code.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is in inside path_rlocator where the path parsing is

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No?

	/*
	 * The sscanf tests above can match files that have extra characters at
	 * the end. To eliminate such cases, cross-check that GetRelationPath
	 * creates the exact same filename, when passed the RelFileLocator
	 * information we extracted from the filename.
	 */

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I thought you are talking about another comment. Yes, indeed, this one got disappeared somehow. I'll fix, thanks!

Comment thread fetools/pg18/pg_rewind/filemap.c Outdated
/*
* Sets rlocator and segNo based on given path. Returns false if didn't find
* a match.
*
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace.

Comment thread fetools/pg18/pg_rewind/filemap.c Outdated
if (strstr(path, ".DS_Store") != NULL)
return FILE_ACTION_NONE;

/*
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace.

Comment thread fetools/pg18/pg_rewind/pg_rewind.c Outdated
while (datapagemap_next(iter, &blkno))
{
offset = blkno * BLCKSZ;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidental extra change.

Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
static current_file_data current_tde_file =
{
0
};
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we usually write these kind of zero initializers as just {0} on the same line as the delcaration.

Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
static char tde_tmp_scource[MAXPGPATH] = "/tmp/pg_tde_rewindXXXXXX";

void
flush_current_key(void)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be named something related to tde.

Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
}

void
encrypt_block(unsigned char *buf, off_t file_offset)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be named something with tde.

Copy link
Copy Markdown
Collaborator

@artemgavrilov artemgavrilov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review comments

Comment thread src/encryption/enc_tde.c Outdated
Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
Comment thread fetools/pg18/pg_rewind/libpq_source.c Outdated
Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
Comment thread fetools/pg18/pg_rewind/local_source.c Outdated
return false;
if (S_ISDIR(st.st_mode))
return true;
return false;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be simplified a bit as return stat(dir, &st) == 0 && S_ISDIR(st.st_mode);.

{
/*
* Nothing to do, local_queue_fetch_range() copies the ranges immediately.
*/
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe write a comment replacing this.

Comment thread src/access/pg_tde_tdemap.c
Comment thread src/encryption/enc_tde.c
CalcBlockIv(forknum, blocknum, relKey->base_iv, iv);

AesEncrypt(relKey->key, relKey->key_len, iv, in, BLCKSZ, out);
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this change should be in a separate commit since this refactoring seems to make sense anyway. Turning it more int a library.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, makes sense

@jeltz
Copy link
Copy Markdown
Collaborator

jeltz commented Apr 28, 2026

The second commit message needs to be improved and the first contains some typos, e.g. "blok" instead of "block".

@@ -354,6 +364,17 @@ libpq_queue_fetch_file(rewind_source *source, const char *path, size_t len)
static void
libpq_queue_fetch_range(rewind_source *source, const char *path, off_t off,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have this wrapper instead of just changing all callsites to pass either true or false?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what rewind_source interface demands. For the "local source", queue_fetch_range and queue_fetch_range are actually two different functions. I'm not eager to change this interface tbh, as it might bring additional pain with upstream changes.

return;

create_tde_tmp_dir();
init_tde();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this change should be moved to the first commit?

source_has_tde = true;
create_tde_tmp_dir();
atexit(destroy_tde_tmp_dir);
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with this. Maybe atexti() should be in the frist commit?

Comment thread t/pg_rewind_enc_copy_blocks.pl Outdated
@@ -0,0 +1,82 @@

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this extra newline.

Comment thread t/pg_rewind_enc_copy_blocks.pl Outdated
@@ -0,0 +1,82 @@

# Copyright (c) 2021-2024, PostgreSQL Global Development Group
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally do not add copyright headers to our files.

@@ -0,0 +1,65 @@

# Copyright (c) 2021-2024, PostgreSQL Global Development Group
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about empty line and copyright notice.

Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
static bool source_has_tde = false;

static void
recrypt_fork(ForkNumber fork)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is "reencrypt".

Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
* Server can recover from wrecked VM/FSM, hence only warnings here
* and in the rest of the function
*/
pg_log_warning("could not open file for reading \"%s\": %m", srcpath);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these error messages maybe should be improved with some context.

Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
read_len = read(srcfd, buf.data, sizeof(buf));

if (read_len < 0)
pg_fatal("could not read file \"%s\": %m", srcpath);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this fatal and not jsut a warning? Doesn't PostgreSQL recreate them if they are corrupt?

Comment thread t/pg_rewind_enc_fsm.pl Outdated
@@ -0,0 +1,68 @@

# Copyright (c) 2021-2024, PostgreSQL Global Development Group
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue with empty line and copyright.

@dAdAbird dAdAbird force-pushed the fix_pg_rewind branch 6 times, most recently from 84bd5bb to 33c7ba2 Compare May 4, 2026 16:52
Copy link
Copy Markdown
Collaborator

@artemgavrilov artemgavrilov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine

Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated

if (read_len != BLCKSZ)
{
pg_log_warning("unexpected read from fork file \"%s\": %m", srcpath);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

%m here will may contain garbage as errno set only when read returns -1
https://man7.org/linux/man-pages/man2/read.2.html

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eah, a copy-paste issue) thanks

@dAdAbird dAdAbird force-pushed the fix_pg_rewind branch 3 times, most recently from 7b8a187 to 0f32ecc Compare May 6, 2026 09:48
Also add an empty sql file so that postgres won't complain about it
missing.
@dAdAbird dAdAbird force-pushed the fix_pg_rewind branch 3 times, most recently from 3b2340d to c9e2e3b Compare May 11, 2026 17:39
dutow added 3 commits May 12, 2026 11:30
Previously we only created a global principal key on the first restart
after creating a default key. This caused a basebackup taken immediately
after it, without a restart between the two to fail.

The fix is simple, we create the global principal key eagerly together
with the default key if it doesn't exists yet. This additionally results
in a better "normal" workflow since we no longer create a new principal
key during startup.
We want to pick up all bugfixes between PostgreSQL 17.8 and 17.10. The
process to create this commit is documented below.

In PSP repo:

    git checkout REL_17_8

In pg_tde repo:

    git checkout -b fetmp17.8 release-2.2.0
    tools/copy_fetools.sh ../postgresql/postgresql 17
    git add fetools/pg17
    git commit -m REL_17_8

In PSP repo:

    git checkout 25c49f3a4a7

In pg_tde repo:

    git checkout -b fetmp17.10 release-2.2.0
    tools/copy_fetools.sh ../postgresql/postgresql 17
    git add fetools/pg17
    git commit -m REL_17_STABLE
    git checkout -b newbranch release-2.2.0
    git diff fetmp17.8 fetmp17.10 -- fetools/pg17 | git apply -3
    git commit
We want to pick up all bugfixes between PostgreSQL 18.2 and 18.4. The
process to create this commit is documented below.

In PSP repo:

    git checkout REL_18_2

In pg_tde repo:

    git checkout -b fetmp18.2 release-2.2.0
    tools/copy_fetools.sh ../postgresql/postgresql 18
    git add fetools/pg18
    git commit -m REL_18_2

In PSP repo:

    git checkout f5cc81719e6

In pg_tde repo:

    git checkout -b fetmp18.4 release-2.2.0
    tools/copy_fetools.sh ../postgresql/postgresql 18
    git add fetools/pg18
    git commit -m REL_18_STABLE
    git checkout -b newbranch release-2.2.0
    git diff fetmp18.2 fetmp18.4 -- fetools/pg18 | git apply -3
    git commit
dAdAbird added 7 commits May 13, 2026 21:25
Make it possible to encrypt and decrypt relation blocks outside of the
SMGR interface. This will be needed for data re-encryption in pg_rewind.
The rewind might update blocks on the target's file with those of the
source. In that case, encrypted files will end up with different blocks
encrypted with different internal keys, leading to data corruption.
To prevent this, we have to decrypt blocks coming from the source and
re-encrypt them with the target's internal key.

Along with that, providers might have been changed, primary keys
rotated, etc., after the divergence of clusters. Generally, it's not a
problem because we would replace the keys and providers on the target
with those of the source. But it will render internal keys that we have
used for the partial blocks unreadable. To fix this, we have to
re-encrypt such internal keys with the source's primary keys.

So the sequence is next:
1. Copy the source's pg_tde dir into a tmp location. All following reads
and writes of source keys and providers happen in this tmp dir.
2. When we have partial updates, check for the source key. And
re-encrypt blocks if the key exists.
3. Save the target key into the source _keys file, replacing the
existing one.
4. Replace the target's pg_tde dir with the tmp one.
Extends changes made in the prev commit to work with the remote source.
This adds fetching `pgdata/pg_tde` content from the remote source.
Also, remote source fetches data in batches, with blocks from different
files queued and processed later. So we need to mark blocks that might
require re-encryption upon download.
Encrypted files might remain unchanged or truncated by rewind on the
target. That leaves them encrypted with the target's key wich vanishes
after the pg_tde replace. So we need to ensure keys for such files.
Meaning if this is an encrypted relation, we save its target key into
the source's keys.
We rewrite the internal key for relations when partially re-encrypting
blocks. That makes its FSM and VM fork unreadable as they are still
encrypted with the old key. To fix that, we re-encrypt such forks with
the proper key after we finish processing the main fork file.

As pg_rewind processes files in the order of operation types (see
file_action_t) and whole-file copies occur before any partial writes,
we assume that for files already in the target datadir, we rewrite
them in-place.
Before this commit, when FE tools changed PG_TDE_DATA_DIR, the WAL keys
and server principal keys cache remained the same, which is wrong, as
the new pg_tde dir means new keys. This commit fixes that. Also,
properly updating wal_key_file_path (before it was set once and for
all). Additionally, it hides pg_tde_set_data_dir from the backend, as
it is not intended for use there.
In some cases, it may happen that the replica would have totally new
WAL keys. And if rewind keeps some WAL segments on target, it makes
those segments unreadable when sice we copy keys from the source. To
fix this, we re-encrypt kept segments with the source WAL keys.
Copy link
Copy Markdown
Collaborator

@Andriciuc Andriciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you for updating the variables!

dAdAbird added 2 commits May 15, 2026 18:20
Currently, archive tools expect pg_tde dir (wal encryption) in the
workdir. But the tools that are not necessary run from the pgdata.
pg_rewind, for example, may call archive tools. So we have to look
for pg_tde in relation to the WAL segments dir instead. We expect that
the archive tool will always work with WAL segments in the
<pgdata>/pg_wal directory. Therefore we look at
DEST/SRC-NAME_DIR/../pg_tde, i.e. `<pgdata>/pg_tde`.

This commit also does some code chores. Namely uses `PG_TDE_DATA_DIR`
instead of string literals. And calls the initialisation only when we
may need it, when we process a segment file.

Fixes PG-2358
We can rewrite XLog in the past (usually during crash recovery). And
in this case, we should not use the most recent, a newly created key to
encrypt such data, but use a related key from the past. And we should
not set the most recent key range.start if the data is from the past.
But the check if to update the range.start compares only LSN's. And in
the case of a rewinded cluster, WAL rewrites might happen way back in
the past, in the previous timeline, which breaks this comparison. And
in general, it makes sense to compare WAL ranges only
with wal_location_cmp(). This commit fixes that.

Fixes PG-2363
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants