PG-2234 Fix pg_rewind with encrypted tables#559
Conversation
Codecov Report❌ Patch coverage is ❌ Your project status has failed because the head coverage (76.70%) is below the target coverage (90.00%). You can increase the head coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #559 +/- ##
==========================================
+ Coverage 57.69% 58.22% +0.52%
==========================================
Files 68 69 +1
Lines 10758 10997 +239
Branches 2650 2704 +54
==========================================
+ Hits 6207 6403 +196
- Misses 3278 3289 +11
- Partials 1273 1305 +32
🚀 New features to boost your workflow:
|
83d50fb to
4d3c4ba
Compare
jeltz
left a comment
There was a problem hiding this comment.
Just some quick review comments.
| { | ||
| char *check_path = datasegpath(rlocator, MAIN_FORKNUM, segNo); | ||
|
|
||
| if (strcmp(check_path, path) != 0) |
There was a problem hiding this comment.
Seems like you accidentally removed the comment explaining this part of the code.
There was a problem hiding this comment.
The comment is in inside path_rlocator where the path parsing is
There was a problem hiding this comment.
No?
/*
* The sscanf tests above can match files that have extra characters at
* the end. To eliminate such cases, cross-check that GetRelationPath
* creates the exact same filename, when passed the RelFileLocator
* information we extracted from the filename.
*/
There was a problem hiding this comment.
Ah, I thought you are talking about another comment. Yes, indeed, this one got disappeared somehow. I'll fix, thanks!
| /* | ||
| * Sets rlocator and segNo based on given path. Returns false if didn't find | ||
| * a match. | ||
| * |
| if (strstr(path, ".DS_Store") != NULL) | ||
| return FILE_ACTION_NONE; | ||
|
|
||
| /* |
| while (datapagemap_next(iter, &blkno)) | ||
| { | ||
| offset = blkno * BLCKSZ; | ||
|
|
| static current_file_data current_tde_file = | ||
| { | ||
| 0 | ||
| }; |
There was a problem hiding this comment.
I think we usually write these kind of zero initializers as just {0} on the same line as the delcaration.
| static char tde_tmp_scource[MAXPGPATH] = "/tmp/pg_tde_rewindXXXXXX"; | ||
|
|
||
| void | ||
| flush_current_key(void) |
There was a problem hiding this comment.
This should be named something related to tde.
| } | ||
|
|
||
| void | ||
| encrypt_block(unsigned char *buf, off_t file_offset) |
There was a problem hiding this comment.
This should be named something with tde.
artemgavrilov
left a comment
There was a problem hiding this comment.
Partial review comments
| return false; | ||
| if (S_ISDIR(st.st_mode)) | ||
| return true; | ||
| return false; |
There was a problem hiding this comment.
I think this could be simplified a bit as return stat(dir, &st) == 0 && S_ISDIR(st.st_mode);.
| { | ||
| /* | ||
| * Nothing to do, local_queue_fetch_range() copies the ranges immediately. | ||
| */ |
There was a problem hiding this comment.
Maybe write a comment replacing this.
| CalcBlockIv(forknum, blocknum, relKey->base_iv, iv); | ||
|
|
||
| AesEncrypt(relKey->key, relKey->key_len, iv, in, BLCKSZ, out); | ||
| } |
There was a problem hiding this comment.
Maybe this change should be in a separate commit since this refactoring seems to make sense anyway. Turning it more int a library.
|
The second commit message needs to be improved and the first contains some typos, e.g. "blok" instead of "block". |
| @@ -354,6 +364,17 @@ libpq_queue_fetch_file(rewind_source *source, const char *path, size_t len) | |||
| static void | |||
| libpq_queue_fetch_range(rewind_source *source, const char *path, off_t off, | |||
There was a problem hiding this comment.
Why have this wrapper instead of just changing all callsites to pass either true or false?
There was a problem hiding this comment.
This is what rewind_source interface demands. For the "local source", queue_fetch_range and queue_fetch_range are actually two different functions. I'm not eager to change this interface tbh, as it might bring additional pain with upstream changes.
| return; | ||
|
|
||
| create_tde_tmp_dir(); | ||
| init_tde(); |
There was a problem hiding this comment.
Maybe this change should be moved to the first commit?
| source_has_tde = true; | ||
| create_tde_tmp_dir(); | ||
| atexit(destroy_tde_tmp_dir); | ||
| } |
There was a problem hiding this comment.
Same with this. Maybe atexti() should be in the frist commit?
| @@ -0,0 +1,82 @@ | |||
|
|
|||
| @@ -0,0 +1,82 @@ | |||
|
|
|||
| # Copyright (c) 2021-2024, PostgreSQL Global Development Group | |||
There was a problem hiding this comment.
We generally do not add copyright headers to our files.
| @@ -0,0 +1,65 @@ | |||
|
|
|||
| # Copyright (c) 2021-2024, PostgreSQL Global Development Group | |||
There was a problem hiding this comment.
Same comment about empty line and copyright notice.
| static bool source_has_tde = false; | ||
|
|
||
| static void | ||
| recrypt_fork(ForkNumber fork) |
| * Server can recover from wrecked VM/FSM, hence only warnings here | ||
| * and in the rest of the function | ||
| */ | ||
| pg_log_warning("could not open file for reading \"%s\": %m", srcpath); |
There was a problem hiding this comment.
I think these error messages maybe should be improved with some context.
| read_len = read(srcfd, buf.data, sizeof(buf)); | ||
|
|
||
| if (read_len < 0) | ||
| pg_fatal("could not read file \"%s\": %m", srcpath); |
There was a problem hiding this comment.
Why is this fatal and not jsut a warning? Doesn't PostgreSQL recreate them if they are corrupt?
| @@ -0,0 +1,68 @@ | |||
|
|
|||
| # Copyright (c) 2021-2024, PostgreSQL Global Development Group | |||
There was a problem hiding this comment.
Same issue with empty line and copyright.
84bd5bb to
33c7ba2
Compare
|
|
||
| if (read_len != BLCKSZ) | ||
| { | ||
| pg_log_warning("unexpected read from fork file \"%s\": %m", srcpath); |
There was a problem hiding this comment.
%m here will may contain garbage as errno set only when read returns -1
https://man7.org/linux/man-pages/man2/read.2.html
There was a problem hiding this comment.
eah, a copy-paste issue) thanks
7b8a187 to
0f32ecc
Compare
Also add an empty sql file so that postgres won't complain about it missing.
3b2340d to
c9e2e3b
Compare
Previously we only created a global principal key on the first restart after creating a default key. This caused a basebackup taken immediately after it, without a restart between the two to fail. The fix is simple, we create the global principal key eagerly together with the default key if it doesn't exists yet. This additionally results in a better "normal" workflow since we no longer create a new principal key during startup.
We want to pick up all bugfixes between PostgreSQL 17.8 and 17.10. The
process to create this commit is documented below.
In PSP repo:
git checkout REL_17_8
In pg_tde repo:
git checkout -b fetmp17.8 release-2.2.0
tools/copy_fetools.sh ../postgresql/postgresql 17
git add fetools/pg17
git commit -m REL_17_8
In PSP repo:
git checkout 25c49f3a4a7
In pg_tde repo:
git checkout -b fetmp17.10 release-2.2.0
tools/copy_fetools.sh ../postgresql/postgresql 17
git add fetools/pg17
git commit -m REL_17_STABLE
git checkout -b newbranch release-2.2.0
git diff fetmp17.8 fetmp17.10 -- fetools/pg17 | git apply -3
git commit
We want to pick up all bugfixes between PostgreSQL 18.2 and 18.4. The
process to create this commit is documented below.
In PSP repo:
git checkout REL_18_2
In pg_tde repo:
git checkout -b fetmp18.2 release-2.2.0
tools/copy_fetools.sh ../postgresql/postgresql 18
git add fetools/pg18
git commit -m REL_18_2
In PSP repo:
git checkout f5cc81719e6
In pg_tde repo:
git checkout -b fetmp18.4 release-2.2.0
tools/copy_fetools.sh ../postgresql/postgresql 18
git add fetools/pg18
git commit -m REL_18_STABLE
git checkout -b newbranch release-2.2.0
git diff fetmp18.2 fetmp18.4 -- fetools/pg18 | git apply -3
git commit
Make it possible to encrypt and decrypt relation blocks outside of the SMGR interface. This will be needed for data re-encryption in pg_rewind.
The rewind might update blocks on the target's file with those of the source. In that case, encrypted files will end up with different blocks encrypted with different internal keys, leading to data corruption. To prevent this, we have to decrypt blocks coming from the source and re-encrypt them with the target's internal key. Along with that, providers might have been changed, primary keys rotated, etc., after the divergence of clusters. Generally, it's not a problem because we would replace the keys and providers on the target with those of the source. But it will render internal keys that we have used for the partial blocks unreadable. To fix this, we have to re-encrypt such internal keys with the source's primary keys. So the sequence is next: 1. Copy the source's pg_tde dir into a tmp location. All following reads and writes of source keys and providers happen in this tmp dir. 2. When we have partial updates, check for the source key. And re-encrypt blocks if the key exists. 3. Save the target key into the source _keys file, replacing the existing one. 4. Replace the target's pg_tde dir with the tmp one.
Extends changes made in the prev commit to work with the remote source. This adds fetching `pgdata/pg_tde` content from the remote source. Also, remote source fetches data in batches, with blocks from different files queued and processed later. So we need to mark blocks that might require re-encryption upon download.
Encrypted files might remain unchanged or truncated by rewind on the target. That leaves them encrypted with the target's key wich vanishes after the pg_tde replace. So we need to ensure keys for such files. Meaning if this is an encrypted relation, we save its target key into the source's keys.
We rewrite the internal key for relations when partially re-encrypting blocks. That makes its FSM and VM fork unreadable as they are still encrypted with the old key. To fix that, we re-encrypt such forks with the proper key after we finish processing the main fork file. As pg_rewind processes files in the order of operation types (see file_action_t) and whole-file copies occur before any partial writes, we assume that for files already in the target datadir, we rewrite them in-place.
Before this commit, when FE tools changed PG_TDE_DATA_DIR, the WAL keys and server principal keys cache remained the same, which is wrong, as the new pg_tde dir means new keys. This commit fixes that. Also, properly updating wal_key_file_path (before it was set once and for all). Additionally, it hides pg_tde_set_data_dir from the backend, as it is not intended for use there.
In some cases, it may happen that the replica would have totally new WAL keys. And if rewind keeps some WAL segments on target, it makes those segments unreadable when sice we copy keys from the source. To fix this, we re-encrypt kept segments with the source WAL keys.
Andriciuc
left a comment
There was a problem hiding this comment.
LGTM! Thank you for updating the variables!
Currently, archive tools expect pg_tde dir (wal encryption) in the workdir. But the tools that are not necessary run from the pgdata. pg_rewind, for example, may call archive tools. So we have to look for pg_tde in relation to the WAL segments dir instead. We expect that the archive tool will always work with WAL segments in the <pgdata>/pg_wal directory. Therefore we look at DEST/SRC-NAME_DIR/../pg_tde, i.e. `<pgdata>/pg_tde`. This commit also does some code chores. Namely uses `PG_TDE_DATA_DIR` instead of string literals. And calls the initialisation only when we may need it, when we process a segment file. Fixes PG-2358
We can rewrite XLog in the past (usually during crash recovery). And in this case, we should not use the most recent, a newly created key to encrypt such data, but use a related key from the past. And we should not set the most recent key range.start if the data is from the past. But the check if to update the range.start compares only LSN's. And in the case of a rewinded cluster, WAL rewrites might happen way back in the past, in the previous timeline, which breaks this comparison. And in general, it makes sense to compare WAL ranges only with wal_location_cmp(). This commit fixes that. Fixes PG-2363
The rewind might update blocks on target's file with that of the source.
In that case, encrypted files will end up with different blocks
encrypted with different internal keys, leading to data corruption.
To prevent this, we have to decryp blocks coming from the source and
re-encrypt them with the target's internal key.
Along with that, providers might have been changed, primary keys
rotated, etc., after the divergence of clusters. Generally, it's not a
problem because we would replace the keys and providers on the target
with those of the source. But it will render internal keys that we have
used for the partial blocks unreadable. To fix this, we have to
re-encrypt such internal keys with source's primary keys.
So the sequence is next:
and writes of source keys and providers happen in this tmp dir.
re-encrypt blocks if the key exists.
existing one.
See commit messages for additional info.
There is no need to do anything with WAL keys. Although rewind
may keep a segment from the destination, it might only be a segment
that was created before common history (before the replica was created);
therefore, the key for such a segment would be copied on the replica
while pg_basebackup. The rest of the segments would be wiped out from
the destination and replaced by the source's ones (along with their keys).
All changes only for pg18 so far. I'll port them to pg17 when I get approvals