git.postgresql.org Git - postgresql.git/log

Document the 'command' column of pg_stat_progress_repack

Commit ac58465e0618 added and documented a new progress-report view for
REPACK, but neglected to list the 'command' column in the docs. This is
my (Álvaro's) fail, as I added the column in v23 of the patch and forgot
to document it.

In passing, add a note in the docs for pg_stat_progress_cluster that it
might contain rows for sessions running REPACK, though mapping the
command name to either the older commands; and that it is for backwards-
compatibility only. (Maybe we should just remove this older view.)

Author: Noriyoshi Shinoda <noriyoshi.shinoda@hpe.com>
Discussion: https://postgr.es/m/LV8PR84MB37870F0F35EF2E8CB99768CBEE47A@LV8PR84MB3787.NAMPRD84.PROD.OUTLOOK.COM
Discussion: https://postgr.es/m/202510101352.vvp4p3p2dblu@alvherre.pgsql

Use simplehash for backend-private buffer pin refcounts.

Replace dynahash with simplehash for the per-backend PrivateRefCountHash
overflow table. Simplehash generates inlined, open-addressed lookup
code, avoiding the per-call overhead of dynahash that becomes noticeable
when many buffers are pinned with a CPU-bound workload.

Motivated by testing of the index prefetching patch, which pins many
more buffers concurrently than typical index scans.

Author: Peter Geoghegan <pg@bowt.ie>
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com

nbtree: Avoid allocating _bt_search stack.

Avoid allocating memory for an nbtree descent stack during index scans.
We only require a descent stack during inserts, when it is used to
determine where to insert a new pivot tuple/downlink into the target
leaf page's parent page in the event of a page split.  (Page deletion's
first phase also performs a _bt_search that requires a descent stack.)

This optimization improves performance by minimizing palloc churn.  It
speeds up index scans that call _bt_search frequently/descend the index
many times, especially when the cost of scanning the index dominates
(e.g., with index-only skip scans).  Testing has shown that the
underlying issue causes performance problems for an upcoming patch that
will replace btgettuple with a new btgetbatch interface to enable I/O
prefetching.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-Wzmy7NMba9k8m_VZ-XNDZJEUQBU8TeLEeL960-rAKb-+tQ@mail.gmail.com

Add pg_plan_advice contrib module.

Provide a facility that (1) can be used to stabilize certain plan choices
so that the planner cannot reverse course without authorization and
(2) can be used by knowledgeable users to insist on plan choices contrary
to what the planner believes best. In both cases, terrible outcomes are
possible: users should think twice and perhaps three times before
constraining the planner's ability to do as it thinks best; nevertheless,
there are problems that are much more easily solved with these facilities
than without them.

This patch takes the approach of analyzing a finished plan to produce
textual output, which we call "plan advice", that describes key
decisions made during plan; if that plan advice is provided during
future planning cycles, it will force those key decisions to be made in
the same way. Not all planner decisions can be controlled using advice;
for example, decisions about how to perform aggregation are currently
out of scope, as is choice of sort order. Plan advice can also be edited
by the user, or even written from scratch in simple cases, making it
possible to generate outcomes that the planner would not have produced.
Partial advice can be provided to control some planner outcomes but not
others.

Currently, plan advice is focused only on specific outcomes, such as
the choice to use a sequential scan for a particular relation, and not
on estimates that might contribute to those outcomes, such as a
possibly-incorrect selectivity estimate. While it would be useful to
users to be able to provide plan advice that affects selectivity
estimates or other aspects of costing, that is out of scope for this
commit.

Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Haibo Yan <tristan.yim@gmail.com>
Reviewed-by: Dian Fay <di@nmfay.com>
Reviewed-by: Ajay Pal <ajay.pal.k@gmail.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA+TgmoZ-Jh1T6QyWoCODMVQdhTUPYkaZjWztzP1En4=ZHoKPzw@mail.gmail.com

doc: Document variables for path substitution in SQL tests

Test suites driven by pg_regress can use the following environment
variables for path substitutions since d1029bb5a26c:
- PG_ABS_SRCDIR
- PG_ABS_BUILDDIR
- PG_DLSUFFIX
- PG_LIBDIR

These variables have never been documented, and they can be useful for
out-of-core code based on the options used by the pg_regress command
invoked by installcheck (or equivalent) to build paths to libraries for
various commands, like LOAD or CREATE FUNCTION.

Reviewed-by: Zhang Hu <kongbaik228@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/abDAWzHaesHLDFke@paquier.xyz

bloom: Optimize VACUUM and bulk-deletion with streaming read

This commit replaces the synchronous ReadBufferExtended() loops done in
blbulkdelete() and blvacuumcleanup() with the streaming read equivalent,
to improve I/O efficiency during bloom index vacuum cleanup operations.

Under the same test conditions as 6c228755add8, the runtime is proving
to gain around 30% better, with most the benefits coming from a large
reduction of the IO operation based on the stats retrieved in the
scenarios run.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VrqfbcDXqGrdLQ2xaQ=K0RzExNuw6U_GGqzSJu32wfdQ@mail.gmail.com

Use streaming read for VACUUM cleanup of GIN

This commit replace the synchronous ReadBufferExtended() loop done in
ginvacuumcleanup() with the streaming read equivalent, to improve I/O
efficiency during GIN index vacuum cleanup operations.

With dm_delay to emulate some latency and debug_io_direct=data to force
synchronous writes and force the read path to be exercised, the author
has noticed a 5x improvement in runtime, with a substantial reduction in
IO stats numbers. I have reproduced similar numbers while running
similar tests, with improvements becoming better with more tuples and
more pages manipulated.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VrqfbcDXqGrdLQ2xaQ=K0RzExNuw6U_GGqzSJu32wfdQ@mail.gmail.com

Convert NOT IN sublinks to anti-joins when safe

The planner has historically been unable to convert "x NOT IN (SELECT
y ...)" sublinks into anti-joins.  This is because standard SQL
semantics for NOT IN require that if the comparison "x = y" returns
NULL, the "NOT IN" expression evaluates to NULL (effectively false),
causing the row to be discarded.  In contrast, an anti-join preserves
the row if no match is found.  Due to this semantic mismatch regarding
NULL handling, the conversion was previously considered unsafe.

However, if we can prove that neither side of the comparison can yield
NULL values, and further that the operator itself cannot return NULL
for non-null inputs, the behavior of NOT IN and anti-join becomes
identical.  Enabling this conversion allows the planner to treat the
sublink as a first-class relation rather than an opaque SubPlan
filter.  This unlocks global join ordering optimization and permits
the selection of the most efficient join algorithm based on cost,
often yielding significant performance improvements for large
datasets.

This patch verifies that neither side of the comparison can be NULL
and that the operator is safe regarding NULL results before performing
the conversion.

To verify operator safety, we require that the operator be a member of
a B-tree or Hash operator family.  This serves as a proxy for standard
boolean behavior, ensuring the operator does not return NULL on valid
non-null inputs, as doing so would break index integrity.

For operand non-nullability, this patch makes use of several existing
mechanisms.  It leverages the outer-join-aware-Var infrastructure to
verify that a Var does not come from the nullable side of an outer
join, and consults the NOT-NULL-attnums hash table to efficiently
verify schema-level NOT NULL constraints.  Additionally, it employs
find_nonnullable_vars to identify Vars forced non-nullable by qual
clauses, and expr_is_nonnullable to deduce non-nullability for other
expression types.

The logic for verifying the non-nullability of the subquery outputs
was adapted from prior work by David Rowley and Tom Lane.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Reviewed-by: Zhang Mingli <zmlpostgres@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Discussion: https://postgr.es/m/CAMbWs495eF=-fSa5CwJS6B-BaEi3ARp0UNb4Lt3EkgUGZJwkAQ@mail.gmail.com

bufmgr: Fix use of wrong variable in GetPrivateRefCountEntrySlow()

Unfortunately, in 30df61990c67, I made GetPrivateRefCountEntrySlow() set a
wrong cache hint when moving entries from the hash table to the faster array.
There are no correctness concerns due to this, just an unnecessary loss of
performance.

Noticed while testing the index prefetching patch.

Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com

Fix use of volatile.

Commit 8185bb5347 misused volatile. Fix it. See also 6307b096e25.

Reported-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/1bb21c7d-885f-4f07-a3ed-21b60d7c92c6@eisentraut.org

Add support for altering CHECK constraint enforceability

This commit adds support for ALTER TABLE ALTER CONSTRAINT ... [NOT]
ENFORCED for CHECK constraints. Previously, only foreign key
constraints could have their enforceability altered.

When changing from NOT ENFORCED to ENFORCED, the operation not only
updates catalog information but also performs a full table scan in
Phase 3 to validate that existing data satisfies the constraint.

For partitioned tables and inheritance hierarchies, the operation
recurses to all child tables. When changing to NOT ENFORCED, we must
recurse even if the parent is already NOT ENFORCED, since child
constraints may still be ENFORCED.

Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Amul Sul <sulamul@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@cybertec.at>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CACJufxHCh_FU-FsEwsCvg9mN6-5tzR6H9ntn+0KUgTCaerDOmg@mail.gmail.com

rename alter constraint enforceability related functions

The functions AlterConstrEnforceabilityRecurse and
ATExecAlterConstrEnforceability are being renamed to
AlterFKConstrEnforceabilityRecurse and ATExecAlterFKConstrEnforceability,
respectively.

The current alter constraint functions only handle Foreign Key constraints.
Renaming them to be more explicit about the constraint type is necessary;
otherwise, it will cause confusion when we later introduce the ability to alter
the enforceability of other constraints.

Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Amul Sul <sulamul@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Robert Treat <rob@xzilla.net>
Discussion: https://postgr.es/m/CACJufxHCh_FU-FsEwsCvg9mN6-5tzR6H9ntn+0KUgTCaerDOmg@mail.gmail.com

bufmgr: Switch to standard order in MarkBufferDirtyHint()

When we were updating hint bits with just a share lock MarkBufferDirtyHint()
had to use a non-standard order of operations, i.e. WAL log the buffer before
marking the buffer dirty. This was required because the lock level used to set
hints did not conflict with the lock level that was used to flush pages, which
would have allowed flushing the page out before the WAL record. The
non-standard order in turn required preventing the checkpoint from starting
between writing the WAL record and flushing out the page.

Now that setting hints and writing out buffers use share-exclusive, we can
revert back to the normal order of operations.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d

bufmgr: Remove the, now obsolete, BM_JUST_DIRTIED

Due to the recent changes to use a share-exclusive mode for setting hint bits
and for flushing pages - instead of using share mode as before - a buffer
cannot be dirtied while the flush is ongoing. The reason we needed
JUST_DIRTIED was to handle the case where the buffer was dirtied while IO was
ongoing - which is not possible anymore.

Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d

Avoid WAL flush checks for unlogged buffers in GetVictimBuffer()

GetVictimBuffer() rejects a victim buffer if it is from a bulkread
strategy ring and reusing it would require flushing WAL. Unlogged table
buffers can have fake LSNs (e.g. unlogged GiST pages) and calling
XLogNeedsFlush() on a fake LSN is meaningless.

This is a bit of future-proofing because currently the bulkread strategy
is not used for relations with fake LSNs.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Earlier version reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/fmkqmyeyy7bdpvcgkheb6yaqewemkik3ls6aaveyi5ibmvtxnd%40nu2kvy5rq3a6

Do not lock in BufferGetLSNAtomic() on archs with 8 byte atomic reads

On platforms where we can read or write the whole LSN atomically, we do
not need to lock the buffer header to prevent torn LSNs. We can do this
only on platforms with PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY, and when the
pd_lsn field is properly aligned.

For historical reasons the PageXLogRecPtr was defined as a struct with
two uint32 fields. This replaces it with a single uint64 value, to make
the intent clearer. To prevent issues with weak typedefs the value is
still wrapped in a struct.

This also adjusts heapfuncs() in pageinspect, to ensure proper alignment
when reading the LSN from a page on alignment-sensitive hardware.

Idea by Andres Freund. Initial patch by Andreas Karlsson, improved by
Peter Geoghegan. Minor tweaks by me.

Author: Andreas Karlsson <andreas@proxel.se>
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/b6610c3b-3f59-465a-bdbb-8e9259f0abc4@proxel.se

Fix indentation from commit 29a0fb21577

Per buildfarm animal koel

Conditional locking in pgaio_worker_submit_internal

With io_method=worker, there's a single I/O submission queue. With
enough workers, the backends and workers may end up spending a lot of
time competing for the AioWorkerSubmissionQueueLock lock. This can
happen with workloads that keep the queue full, in which case it's
impossible to add requests to the queue. Increasing the number of I/O
workers increases the pressure on the lock, worsening the issue.

This change improves the situation in two ways:

* If AioWorkerSubmissionQueueLock can't be acquired without waiting,
the I/O is performed synchronously (as if the queue was full).

* When an entry can't be added to a full queue, stop trying to add more
entries. All remaining entries are handled as synchronous I/O.

The regression was reported by Alexandre Felipe. Investigation and
patch by me, based on an idea by Andres Freund.

Reported-by: Alexandre Felipe <o.alexandre.felipe@gmail.com>
Author: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAE8JnxOn4+xUAnce+M7LfZWOqfrMMxasMaEmSKwiKbQtZr65uA@mail.gmail.com

Fixes for C++ typeof implementation

This fixes two bugs in commit 1887d822f14.

First, if we are using the fallback C++ implementation of typeof, then
we need to include the C++ header <type_traits> for
std::remove_reference_t. This header is also likely to be used for
other C++ implementations of type tricks, so we'll put it into the
global includes.

Second, for the case that the C compiler supports typeof in a spelling
that is not "typeof" (for example, __typeof__), then we need to #undef
typeof in the C++ section to avoid warnings about duplicate macro
definitions.

Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/92f9750f-c7f6-42d8-9a4a-85a3cbe808f3%40eisentraut.org

Remove Int8GetDatum function

We have no uses of Int8GetDatum in our tree and did not have for a
long time (or never), and the inverse does not exist either.

Author: Kirill Reshke <reshkekirill@gmail.com>
Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/CALdSSPhFyb9qLSHee73XtZm1CBWJNo9+JzFNf-zUEWCRW5yEiQ@mail.gmail.com

Sort out table_open vs. relation_open in rewriter

table_open() is a wrapper around relation_open() that checks that the
relkind is table-like and gives a user-facing error message if not.
It is best used in directly user-facing areas to check that the user
used the right kind of command for the relkind. In internal uses
where the relkind was previously checked from the user's perspective,
table_open() is not necessary and might even be confusing if it were
to give out-of-context error messages.

In rewriteHandler.c, there were several such table_open() calls, which
this changes to relation_open(). This currently doesn't make a
difference, but there are plans to have other relkinds that could
appear in the rewriter but that shouldn't be accessible via
table-specific commands, and this clears the way for that.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6d3fef19-a420-4e11-8235-8ea534bf2080%40eisentraut.org
Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org

Require share-exclusive lock to set hint bits and to flush

At the moment hint bits can be set with just a share lock on a page (and,
until 45f658dacb9, in one case even without any lock). Because of this we need
to copy pages while writing them out, as otherwise the checksum could be
corrupted.

The need to copy the page is problematic to implement AIO writes:

1) Instead of just needing a single buffer for a copied page we need one for
   each page that's potentially undergoing I/O
2) To be able to use the "worker" AIO implementation the copied page needs to
   reside in shared memory

It also causes problems for using unbuffered/direct-IO, independent of AIO:
Some filesystems, raid implementations, ... do not tolerate the data being
written out to change during the write. E.g. they may compute internal
checksums that can be invalidated by concurrent modifications, leading e.g. to
filesystem errors (as the case with btrfs).

It also just is plain odd to allow modifications of buffers that are just
share locked.

To address these issues, this commit changes the rules so that modifications
to pages are not allowed anymore while holding a share lock. Instead the new
share-exclusive lock (introduced in fcb9c977aa5) allows at most one backend to
modify a buffer while other backends have the same page share locked. An
existing share-lock can be upgraded to a share-exclusive lock, if there are no
conflicting locks. For that BufferBeginSetHintBits()/BufferFinishSetHintBits()
and BufferSetHintBits16() have been introduced.

To prevent hint bits from being set while the buffer is being written out,
writing out buffers now requires a share-exclusive lock.

The use of share-exclusive to gate setting hint bits means that from now on
only one backend can set hint bits at a time. To allow multiple backends to
set hint bits would require more complicated locking: For setting hint bits
we'd need to store the count of backends currently setting hint bits and we
would need another lock-level for I/O conflicting with the lock-level to set
hint bits. Given that the share-exclusive lock for setting hint bits is only
held for a short time, that backends would often just set the same hint bits
and that the cost of occasionally not setting hint bits in hotly accessed
pages is fairly low, this seems like an acceptable tradeoff.

The biggest change to adapt to this is in heapam. To avoid performance
regressions for sequential scans that need to set a lot of hint bits, we need
to amortize the cost of BufferBeginSetHintBits() for cases where hint bits are
set at a high frequency. To that end HeapTupleSatisfiesMVCCBatch() uses the
new SetHintBitsExt(), which defers BufferFinishSetHintBits() until all hint
bits on a page have been set.  Conversely, to avoid regressions in cases where
we can't set hint bits in bulk (because we're looking only at individual
tuples), use BufferSetHintBits16() when setting hint bits without batching.

Several other places also need to be adapted, but those changes are
comparatively simpler.

After this we do not need to copy buffers to write them out anymore. That
change is done separately however.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m

bloom: Optimize bitmap scan path with streaming read

This commit replaces the per-page buffer read look in blgetbitmap() with
a reading stream, to improve scan efficiency, particularly useful for
large bloom indexes. Some benchmarking with a large number of rows has
shown a very nice improvement in terms of runtime and IO read reduction
with test cases up to 10M rows for a bloom index scan.

For the io_uring method, The author has reported a 3x in runtime with
io_uring while I was at close to a 7x. For the worker method with 3
workers, the author has reported better numbers than myself in runtime,
with the reduction in IO stats being appealing for all the cases
measured.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VrqfbcDXqGrdLQ2xaQ=K0RzExNuw6U_GGqzSJu32wfdQ@mail.gmail.com

Remove unused PruneState member frz_conflict_horizon

c2a23dcf9e3af1c removed use of PruneState.frz_conflict_horizon but
neglected to actually remove the member. Do that now.

Don't clear pendingRecoveryConflicts at end of transaction

Commit 17f51ea818 introduced a new pendingRecoveryConflicts field in
PGPROC to replace the various ProcSignals. The new field was cleared
in ProcArrayEndTransaction(), which makes sense for conflicts with
e.g. locks or buffer pins which are gone at end of transaction. But it
is not appropriate for conflicts on a database, or a logical slot.

Because of this, the 035_standby_logical_decoding.pl test was
occasionally getting stuck in the buildfarm. It happens if the startup
process signals recovery conflict with the logical slot just when the
walsender process using the slot calls ProcArrayEndTransaction().

To fix, don't clear pendingRecoveryConflicts in
ProcArrayEndTransaction(). We could still clear certain conflict
flags, like conflicts on locks, but we didn't try to do that before
commit 17f51ea818 either.

In the passing, fix a misspelled comment, and make
InitAuxiliaryProcess() to also clear pendingRecoveryConflicts. I don't
think aux processes can have recovery conflicts, but it seems best to
initialize the field and keep InitAuxiliaryProcess() as close to
InitProcess() as possible.

Analyzed-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://www.postgresql.org/message-id/3e07149d-060b-48a0-8f94-3d5e4946ae45@gmail.com

Use the newest to-be-frozen xid as the conflict horizon for freezing

Previously WAL records that froze tuples used OldestXmin as the snapshot
conflict horizon, or the visibility cutoff if the page would become
all-frozen. Both are newer than (or equal to) the newst XID actually
frozen on the page.

Track the newest XID that will be frozen and use that as the snapshot
conflict horizon instead. This yields an older horizon resulting in
fewer query cancellations on standbys.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAAKRu_bbaUV8OUjAfVa_iALgKnTSfB4gO3jnkfpcFgrxEpSGJQ%40mail.gmail.com

Introduce the REPACK command

REPACK absorbs the functionality of VACUUM FULL and CLUSTER in a single
command. Because this functionality is completely different from
regular VACUUM, having it separate from VACUUM makes it easier for users
to understand; as for CLUSTER, the term is heavily overloaded in the
IT world and even in Postgres itself, so it's good that we can avoid it.

We retain those older commands, but de-emphasize them in the
documentation, in favor of REPACK; the difference between VACUUM FULL
and CLUSTER (namely, the fact that tuples are written in a specific
ordering) is neatly absorbed as two different modes of REPACK.

This allows us to introduce further functionality in the future that
works regardless of whether an ordering is being applied, such as (and
especially) a concurrent mode.

Author: Antonin Houska <ah@cybertec.at>
Reviewed-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://postgr.es/m/82651.1720540558@antos
Discussion: https://postgr.es/m/202507262156.sb455angijk6@alvherre.pgsql

Fix grammar in short description of effective_wal_level.

Align with the convention of using third-person singular (e.g.,
"Shows" instead of "Show") for GUC parameter descriptions.

Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20260210.143752.1113524465620875233.horikyota.ntt@gmail.com

heapam: Don't mimic MarkBufferDirtyHint() in inplace updates

Previously heap_inplace_update_and_unlock() used an operation order similar to
MarkBufferDirty(), to reduce the number of different approaches used for
updating buffers. However, in an upcoming patch, MarkBufferDirtyHint() will
switch to using the update protocol used by most other places (enabled by hint
bits only being set while holding a share-exclusive lock).

Luckily it's pretty easy to adjust heap_inplace_update_and_unlock(). As a
comment already foresaw, we can use the normal order, with the slight change
of updating the buffer contents after WAL logging.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d

pg_dumpall: simplify coding of dropDBs()

There's no need for a StringInfo when all you want is a string
being constructed in a single pass.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reported-by: Ranier Vilela <ranier.vf@gmail.com>
Reviewed-by: Yang Yuanzhuo <1197620467@qq.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CAEudQAq2wyXZRdsh+wVHcOrungPU+_aQeQU12wbcgrmE0bQovA@mail.gmail.com

Remove duplicate initialization in initialize_brin_buildstate().

Commit dae761a added initialization of some BrinBuildState fields
in initialize_brin_buildstate(). Later, commit b437571 inadvertently
added the same initialization again.

This commit removes that redundant initialization. No behavioral
change is intended.

Author: Chao Li <lic@highgo.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2nmrca6-9SNChDvRYD6+r==fs9qg5J93kahS7vpoq8QVg@mail.gmail.com

Rename grammar nonterminal to simplify reuse

A list of expressions with optional AS-labels is useful in a few
different places. Right now, this is available as xml_attribute_list
because it was first used in the XMLATTRIBUTES construct, but it is
already used elsewhere, and there are other possible future uses. To
reduce possible confusion going forward, rename it to
labeled_expr_list (like existing expr_list plus ColLabel).

Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org

Allow extensions to mark an individual index as disabled.

Up until now, the only way for a loadable module to disable the use of a
particular index was to use build_simple_rel_hook (or, previous to
yesterday's commit, get_relation_info_hook) to remove it from the index
list. While that works, it has some disadvantages. First, the index
becomes invisible for all purposes, and can no longer be used for
optimizations such as self-join elimination or left join removal, which
can severely degrade the resulting plan.

Second, if the module attempts to compel the use of a certain index
by removing all other indexes from the index list and disabling
other scan types, but the planner is unable to use the chosen index
for some reason, it will fall back to a sequential scan, because that
is only disabled, whereas the other indexes are, from the planner's
point of view, completely gone. While this situation ideally shouldn't
occur, it's hard for a loadable module to be completely sure whether
the planner will view a certain index as usable for a certain query.
If it isn't, it may be better to fall back to a scan using a disabled
index rather than falling back to an also-disabled sequential scan.

Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA%2BTgmoYS4ZCVAF2jTce%3DbMP0Oq_db_srocR4cZyO0OBp9oUoGg%40mail.gmail.com

Switch to FATAL error for missing checkpoint record without backup_label

Crash recovery started without a backup_label previously crashed with a
PANIC if the checkpoint record could not be found.  This commit lowers
the report generated to be a FATAL instead.

With recovery methods being more imaginative these days, this should
provide more flexibility when handling PostgreSQL recovery processing in
the event of a driver error, similarly to 15f68cebdcec.  An extra
benefit of this change is that it becomes possible to add a test to
check that a FATAL is hit with an expected error message pattern.  With
the recovery code becoming more complicated over the last couple of
years, I suspect that this will be benefitial to cover in the long-term.

The original PANIC behavior has been introduced in the early days of
crash recovery, as of 4d14fe0048cf (PANIC did not exist yet, the code
used STOP).

Author: Nitin Jadhav <nitinjadhavpostgres@gmail.com>
Discussion: https://postgr.es/m/CAMm1aWZbQ-Acp_xAxC7mX9uZZMH8+NpfepY9w=AOxbBVT9E=uA@mail.gmail.com

Fix misuse of "volatile" in xml.c

What should be used is not "volatile foo *ptr" but "foo *volatile ptr",
The incorrect (former) style means that what the pointer variable points
to is volatile.  The correct (latter) style means that the pointer
variable itself needs to be treated as volatile.  The latter style is
required to ensure a consistent treatment of these variables after a
longjmp with the TRY/CATCH blocks.

Some casts can be removed thanks to this change.

Issue introduced by 2e947217474c, so no backpatch is required.  A
similar set of issues has been fixed in 93001888d85c for contrib/xml2/.

Author: ChangAo Chen <cca5507@qq.com>
Discussion: https://postgr.es/m/tencent_5BE8DAD985EE140ED62EA728C8D4E1311F0A@qq.com

pg_{dump,restore}: Refactor handling of conflicting options.

This commit makes use of the function added by commit b2898baaf7
for these applications' handling of conflicting options. It
doesn't fix any bugs, but it does trim several lines of code.

Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Steven Niu <niushiji@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CACJufxHDYn%2B3-2jR_kwYB0U7UrNP%2B0EPvAWzBBD5EfUzzr1uiw%40mail.gmail.com

Replace get_relation_info_hook with build_simple_rel_hook.

For a long time, PostgreSQL has had a get_relation_info_hook which
plugins can use to editorialize on the information that
get_relation_info obtains from the catalogs. However, this hook is
only called for baserels of type RTE_RELATION, and there is
potential utility in a similar call back for other types of
RTEs. This might have had utility even before commit
4020b370f214315b8c10430301898ac21658143f added pgs_mask to
RelOptInfo, but it certainly has utility now.

So, move the callback up one level, deleting get_relation_info_hook and
adding build_simple_rel_hook instead. The new callback is called just
slightly later than before and with slightly different arguments, but it
should be fairly straightforward to adjust existing code that currently
uses get_relation_info_hook: the values previously available as
relationObjectId and inhparent are now available via rte->relid and
rte->inh, and calls where rte->rtekind != RTE_RELATION can be ignored if
desired.

Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA%2BTgmoYg8uUWyco7Pb3HYLMBRQoO6Zh9hwgm27V39Pb6Pdf%3Dug%40mail.gmail.com

Consider startup cost as a figure of merit for partial paths.

Previously, the comments stated that there was no purpose to considering
startup cost for partial paths, but this is not the case: it's perfectly
reasonable to want a fast-start path for a plan that involves a LIMIT
(perhaps over an aggregate, so that there is enough data being processed
to justify parallel query but yet we don't want all the result rows).

Accordingly, rewrite add_partial_path and add_partial_path_precheck to
consider startup costs. This also fixes an independent bug in
add_partial_path_precheck: commit e22253467942fdb100087787c3e1e3a8620c54b2
failed to update it to do anything with the new disabled_nodes field.
That bug fix is formally separate from the rest of this patch and could
be committed separately, but I think it makes more sense to fix both
issues together, because then we can (as this commit does) just make
add_partial_path_precheck do the cost comparisons in the same way as
compare_path_costs_fuzzily, which hopefully reduces the chances of
ending up with something that's still incorrect.

This patch is based on earlier work on this topic by Tomas Vondra,
but I have rewritten a great deal of it.

Co-authored-by: Robert Haas <rhaas@postgresql.org>
Co-authored-by: Tomas Vondra <tomas@vondra.me>
Discussion: http://postgr.es/m/CA+TgmobRufbUSksBoxytGJS1P+mQY4rWctCk-d0iAUO6-k9Wrg@mail.gmail.com

Prevent restore of incremental backup from bloating VM fork.

When I (rhaas) wrote the WAL summarizer code, I incorrectly believed
that XLOG_SMGR_TRUNCATE truncates all forks to the same length. In
fact, what other parts of the code do is compute the truncation length
for the FSM and VM forks from the truncation length used for the main
fork. But, because I was confused, I coded the WAL summarizer to set the
limit block for the VM fork to the same value as for the main fork.
(Incremental backup always copies FSM forks in full, so there is no
similar issue in that case.)

Doing that doesn't directly cause any data corruption, as far as I can
see. However, it does create a serious risk of consuming a large amount
of extra disk space, because pg_combinebackup's reconstruct.c believes
that the reconstructed file should always be at least as long as the
limit block value. We might want to be smarter about that at some point
in the future, because it's always safe to omit all-zeroes blocks at the
end of the last segment of a relation, and doing so could save disk
space, but the current algorithm will rarely waste enough disk space to
worry about unless we believe that a relation has been truncated to a
length much longer than its actual length on disk, which is exactly what
happens as a result of the problem mentioned in the previous paragraph.

To fix, create a new visibilitymap helper function and use it to include
the right limit block in the summary files. Incremental backups taken
with existing summary files will still have this issue, but this should
improve the situation going forward.

Diagnosed-by: Oleg Tkachenko <oatkachenko@gmail.com>
Diagnosed-by: Amul Sul <sulamul@gmail.com>
Discussion: http://postgr.es/m/CAAJ_b97PqG89hvPNJ8cGwmk94gJ9KOf_pLsowUyQGZgJY32o9g@mail.gmail.com
Discussion: http://postgr.es/m/6897DAF7-B699-41BF-A6FB-B818FCFFD585%40gmail.com
Backpatch-through: 17

Remove trailing period from errmsg in subscriptioncmds.c.

Author: Sahitya Chandra <sahityajb@gmail.com>
Discussion: https://postgr.es/m/20260308142806.181309-1-sahityajb@gmail.com

Move comment back to better place

Commit f014b1b9bb8 inserted some new code in between existing code and
a trailing comment. Move the comment back to near the code it belongs
to.

doc: Document IF NOT EXISTS option for ALTER FOREIGN TABLE ADD COLUMN.

Commit 2cd40adb85d added the IF NOT EXISTS option to ALTER TABLE ADD COLUMN.
This also enabled IF NOT EXISTS for ALTER FOREIGN TABLE ADD COLUMN,
but the ALTER FOREIGN TABLE documentation was not updated to mention it.

This commit updates the documentation to describe the IF NOT EXISTS option for
ALTER FOREIGN TABLE ADD COLUMN.

While updating that section, also this commit clarifies that the COLUMN keyword
is optional in ALTER FOREIGN TABLE ADD/DROP COLUMN. Previously, part of
the documentation could be read as if COLUMN were required.

This commit adds regression tests covering these ALTER FOREIGN TABLE syntaxes.

Backpatch to all supported versions.

Suggested-by: Fujii Masao <masao.fujii@gmail.com>
Author: Chao Li <lic@highgo.com>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwFk=rrhrwGwPtQxBesbT4DzSZ86Q3ftcwCu3AR5bOiXLw@mail.gmail.com
Backpatch-through: 14

Fix size underestimation of DSA pagemap for odd-sized segments

When make_new_segment() creates an odd-sized segment, the pagemap was
only sized based on a number of usable_pages entries, forgetting that a
segment also contains metadata pages, and that the FreePageManager uses
absolute page indices that cover the entire segment.  This
miscalculation could cause accesses to pagemap entries to be out of
bounds.  During subsequent reuse of the allocated segment, allocations
landing on pages with indices higher than usable_pages could cause
out-of-bounds pagemap reads and/or writes.  On write, 'span' pointers
are stored into the data area, corrupting the allocated objects.  On
read (aka during a dsa_free), garbage is interpreted as a span pointer,
typically crashing the server in dsa_get_address().

The normal geometric path correctly sizes the pagemap for all pages in
the segment.  The odd-sized path needs to do the same, but it works
forward from usable_pages rather than backward from total_size.

This commit fixes the sizing of the odd-sized case by adding pagemap
entries for the metadata pages after the initial metadata_bytes
calculation, using an integer ceiling division to compute the exact
number of additional entries needed in one go, avoiding any iteration in
the calculation.

An assertion is added in the code path for odd-sized segments, ensuring
that the pagemap includes the metadata area, and that the result is
appropriately sized.

This problem would show up depending on the size requested for the
allocation of a DSA segment.  The reporter has noticed this issue when a
parallel hash join makes a DSA allocation large enough to trigger the
odd-sized segment path, but it could happen for anything that does a DSA
allocation.

A regression test is added to test_dsa, down to v17 where the test
module has been introduced.  This adds a set of cheap tests to check the
problem, the new assertion being useful for this purpose.  Sami has
proposed a test that took a longer time than what I have done here; the
test committed is faster and good enough to check the odd-sized
allocation path.

Author: Paul Bunn <paul.bunn@icloud.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/044401dcabac$fe432490$fac96db0$@icloud.com
Backpatch-through: 14

Refactor tests for catalog diff comparisons in stats_import.sql

The tests of stats_import.sql include a set of queries to do
differential checks of the three statistics catalog relations, based on
the comparison of a source relation and a target relation, used for the
copy of the stats data with the restore functions:
- pg_statistic
- pg_stats_ext
- pg_stats_ext_exprs

This commit refactors the tests to reduce the bloat of such differential
queries, by creating a set of objects that make the differential queries
smaller:
- View for a base relation type.
- First function to retrieve stats data, that returns a type based on
the view previously created.
- Second function that checks the difference, based on two calls of the
first function.

This change leads to a nice reduction of stats_import.sql, with a larger
effect on the output file.

While on it, this adds some sanity checks for the three catalogs, to
warn developers that the stats import facilities may need to be updated
if any of the three catalogs change. These are rare in practice, see
918eee0c497c as one example. Another stylistic change is the use of the
extended output format for the differential queries, so as we avoid long
lines of output if a diff is caught.

Author: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://postgr.es/m/CADkLM=eEhxJpSUP+eC=eMGZZsVOpnfKDvVkuCbsFg9CajYwDsA@mail.gmail.com

Fix typo in stats_import.sql

The test mentioned pg_stat_ext_exprs, but the correct catalog name is
pg_stats_ext_exprs.

Thinko in ba97bf9cb7b4.

Discussion: https://postgr.es/m/CADkLM=eEhxJpSUP+eC=eMGZZsVOpnfKDvVkuCbsFg9CajYwDsA@mail.gmail.com

Fix invalid boolean if-test

We were testing the truth value of the array of booleans (which is
always true) instead of the boolean element specific to the affected
table column.

This causes a binary-upgrade dump fail to omit the name of a constraint;
that is, the correct constraint name is always printed, even when it's
not needed.  The affected case is a binary-upgrade dump of a not-null
constraint in an inherited column, which must in addition have no
comment.

Another point is that in order for this to make a difference, the
constraint must have the default name in the child table.  That is, the
constraint must have been created _in the parent table_ with the name
that it would have in the child table, like so:
  CREATE TABLE parent (a int CONSTRAINT child_a_not_null NOT NULL);
  CREATE TABLE child () INHERITS (parent);
Otherwise, the correct name must be printed by binary-upgrade pg_dump
anyway, since it wouldn't match the name produced at the parent.

Moreover, when it does hit, the pre-18-compatibility code (which has to
work with a constraint that has no name) gets involved and uses the
UPDATE on pg_constraint using the conkey instead of column name ... and
so everything ends up working correctly AFAICS.

I think it might cause a problem if the table and column names are
overly long, but I didn't want to spend time investigating further.

Still, it's wrong code, and static analyzers have twice complained about
it, so fix it by adding the array index accessor that was obviously
meant.

Reported-by: Ranier Vilela <ranier.vf@gmail.com>
Reported-by: George Tarasov <george.v.tarasov@gmail.com>
Backpatch-through: 18
Discussion: https://postgr.es/m/CAEudQAo7ah=4TDheuEjtb0dsv6bHoK7uBNqv53Tsub2h-xBSJw@mail.gmail.com
Discussion: https://postgr.es/m/f3029f25-acc9-4cb9-a74f-fe93bcfb3a27@gmail.com

libpq: Introduce PQAUTHDATA_OAUTH_BEARER_TOKEN_V2

For the libpq-oauth module to eventually make use of the
PGoauthBearerRequest API, it needs some additional functionality: the
derived Issuer ID for the authorization server needs to be provided, and
error messages need to be built without relying on PGconn internals.
These features seem useful for application hooks, too, so that they
don't each have to reinvent the wheel.

The original plan was for additions to PGoauthBearerRequest to be made
without a version bump to the PGauthData type. Applications would simply
check a LIBPQ_HAS_* macro at compile time to decide whether they could
use the new features. That theoretically works for applications linked
against libpq, since it's not safe to downgrade libpq from the version
you've compiled against.

We've since found that this strategy won't work for plugins, due to a
complication first noticed during the libpq-oauth module split: it's
normal for a plugin on disk to be *newer* than the libpq that's loading
it, because you might have upgraded your installation while an
application was running. (In other words, a plugin architecture causes
the compile-time and run-time dependency arrows to point in opposite
directions, so plugins won't be able to rely on the LIBPQ_HAS_* macros
to determine what APIs are available to them.)

Instead, extend the original PGoauthBearerRequest (now retroactively
referred to as "v1" in the code) with a v2 subclass-style struct. When
an application implements and accepts PQAUTHDATA_OAUTH_BEARER_TOKEN_V2,
it may safely cast the base request pointer it receives in its callbacks
to v2 in order to make use of the new functionality. libpq will query
the application for a v2 hook first, then v1 to maintain backwards
compatibility, before giving up and using the builtin flow.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CAOYmi%2BmrGg%2Bn_X2MOLgeWcj3v_M00gR8uz_D7mM8z%3DdX1JYVbg%40mail.gmail.com

pg_dumpall: Fix handling of conflicting options.

pg_dumpall is missing checks for some conflicting options,
including those passed through to pg_dump. To fix, introduce a
new function that checks whether mutually exclusive options are
set, and use that in pg_dumpall. A similar change could likely be
made for pg_dump and pg_restore, but that is left as a future
exercise.

This is arguably a bug fix, but since this might break existing
scripts, no back-patch for now.

Author: Jian He <jian.universality@gmail.com>
Co-authored-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Wang Peng <215722532@qq.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CACJufxFf5%3DwSv2MsuO8iZOvpLZQ1-meAMwhw7JX5gNvWo5PDug%40mail.gmail.com

Use palloc_object() and palloc_array() in more areas of the logical replication.

The idea is to encourage the use of newer routines across the tree, as
these offer stronger type-safety guarantees than raw palloc().

Similar work has been done in commits 1b105f9472bd, 0c3c5c3b06a3,
31d3847a37be, and 4f7dacc5b82a. This commit extends those changes to
more locations within src/backend/replication/logical/.

Author: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAHut+Pv4N7Vpxo18+NAR1r9RGvR8b0BtwTkoeCE2PfFoXgmR6A@mail.gmail.com

Support grouping-expression references and GROUPING() in subqueries.

Until now, substitute_grouped_columns and its predecessor
check_ungrouped_columns intentionally did not cope with references
to GROUP BY expressions (anything more complex than a Var) within
subqueries of the query having GROUP BY.  Because they didn't try to
match subexpressions of subqueries to the GROUP BY list, they'd drill
down to raw Vars of the grouping level and then fail with "subquery
uses ungrouped column from outer query".  There have been remarkably
few complaints about this deficiency, so nobody ever did anything
about it.

The reason for not wanting to deal with it is that within a subquery,
Vars will have varlevelsup different from zero and will thus not be
equal() to the expressions seen in the outer query.  We recognized
this at least as far back as 96ca8ffeb, although I think the comment
I added about it then was just documenting a pre-existing deficiency.
It looks like at the time, the solutions I considered were
(1) write a version of equal() that permits an offset in varlevelsup,
or (2) dynamically apply IncrementVarSublevelsUp at each
subexpression.  (1) would require an amount of new code that seems
rather out of proportion to the benefit, while (2) would add an
exponential amount of cost to the matching process.  But rethinking
it now, what seems attractive is (3) apply IncrementVarSublevelsUp to
the groupingClause list not the subexpressions, and do so only once
per subquery depth level.  Then we can still use plain equal() to
check for matches, and we're not incurring cost proportional to some
power of the subquery's complexity.

This patch continues to use the old logic when the GROUP BY list is
all Vars.  We could discard the special comparison logic for that and
always do it the more general way, but that would be a good deal
slower.  (Micro-benchmarking just parse analysis suggests it's about
50% slower than the Vars-only path.  But we've not heard complaints
about the speed of matching within the main query, so I doubt that
applying the same matching logic within subqueries will be a problem.)
The lack of complaints suggests strongly that this is a very minority
use-case, so I don't want to make the typical case slower to fix it.

While testing that, I was surprised to discover a nearby bug:
GROUPING() within a subquery fails to match GROUP BY Vars that are
join alias Vars.  It tries to apply flatten_join_alias_vars to make
such cases work, but that fails to work inside a subquery because
varlevelsup is wrong.  Therefore, this patch invents a new entry point
flatten_join_alias_for_parser() that allows specification of a
sublevels_up offset.  (It seems cleaner to give the parser its own
entry point rather than abuse the planner's conventions even further.)

While this is pretty clearly a bug fix, I'm hesitant to take the risk
of back-patching, seeing that the existing behavior has stood for so
long with so few complaints.  Maybe we can reconsider once this patch
has baked awhile in master.

Reported-by: PALAYRET Jacques <jacques.palayret@meteo.fr>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/531183.1772058731@sss.pgh.pa.us

CREATE SUBSCRIPTION ... SERVER.

Allow CREATE SUBSCRIPTION to accept a foreign server using the SERVER
clause instead of a raw connection string using the CONNECTION clause.

  * Enables a user with sufficient privileges to create a subscription
    using a foreign server by name without specifying the connection
    details.

  * Integrates with user mappings (and other FDW infrastructure) using
    the subscription owner.

  * Provides a layer of indirection to manage multiple subscriptions
    to the same remote server more easily.

Also add CREATE FOREIGN DATA WRAPPER ... CONNECTION clause to specify
a connection_function. To be eligible for a subscription, the foreign
server's foreign data wrapper must specify a connection_function.

Add connection_function support to postgres_fdw, and bump postgres_fdw
version to 1.3.

Bump catversion.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/61831790a0a937038f78ce09f8dd4cef7de7456a.camel@j-davis.com

Don't include wait_event.h in pgstat.h

wait_event.h itself includes wait_event_types.h, which is a generated
file, so it's nice that we can avoid compiling >10% of the tree just
because that file is regenerated.

To avoid breaking too many third-party modules, we now #include
utils/wait_classes.h in storage/latch.h. Then, the very common case
of doing
WaitLatch(..., PG_WAIT_EXTENSION)
continues to work by including just storage/latch.h. (I didn't try to
determine how many modules would actually break if we don't do this, but
this seems a convenient and low-impact measure.)

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/202602181214.gcmhx2vhlxzp@alvherre.pgsql

doc: Fix capitalization of Unicode

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/2a668979-ed92-49a3-abf9-a3ec2d460ec2%40eisentraut.org

Fix Python deprecation warning

Starting with Python 3.14, contrib/unaccent/generate_unaccent_rules.py
complains

DeprecationWarning: codecs.open() is deprecated. Use open() instead.

This makes that change. This works for all Python 3.x versions.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/2a668979-ed92-49a3-abf9-a3ec2d460ec2%40eisentraut.org

Make unconstify and unvolatize use StaticAssertVariableIsOfTypeMacro

The unconstify and unvolatize macros had an almost identical assertion
as was already defined in StaticAssertVariableIsOfTypeMacro, only it
had a less useful error message and didn't have a sizeof fallback.

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/CAGECzQR21OnnKiZO_1rLWO0-16kg1JBxnVq-wymYW0-_1cUNtg@mail.gmail.com

Use typeof everywhere instead of compiler specific spellings

We define typeof ourselves as __typeof__ if it does not exist. So
let's actually use that for consistency. The meson/autoconf checks
for __builtin_types_compatible_p still use __typeof__ though, because
there we have not redefined it.

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/CAGECzQR21OnnKiZO_1rLWO0-16kg1JBxnVq-wymYW0-_1cUNtg@mail.gmail.com

Portable StaticAssertExpr

Use a different way to write StaticAssertExpr() that does not require
the GCC extension statement expressions.

For C, we put the static_assert into a struct. This appears to be a
common approach.

We still need to keep the fallback implementation to support buggy
MSVC < 19.33.

For C++, we put it into a lambda expression. (The C approach doesn't
work; it's not permitted to define a new type inside sizeof.)

Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/5fa3a9f5-eb9a-4408-9baf-403d281f8b10%40eisentraut.org

Fix publisher shutdown hang caused by logical walsender busy loop.

Previously, when logical replication was running, shutting down
the publisher could cause the logical walsender to enter a busy loop
and prevent the publisher from completing shutdown.

During shutdown, the logical walsender waits for all pending WAL
to be written out. However, some WAL records could remain unflushed,
causing the walsender to wait indefinitely.

The issue occurred because the walsender used XLogBackgroundFlush() to
flush pending WAL. This function does not guarantee that all WAL is written.
For example, WAL generated by a transaction without an assigned
transaction ID that aborts might not be flushed.

This commit fixes the bug by making the logical walsender call XLogFlush()
instead, ensuring that all pending WAL is written and preventing
the busy loop during shutdown.

Backpatch to all supported versions.

Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAO6_Xqo3co3BuUVEVzkaBVw9LidBgeeQ_2hfxeLMQcXwovB3GQ@mail.gmail.com
Backpatch-through: 14

Improve tests for recovery_target_timeline GUC.

Commit fd7d7b71913 added regression tests to verify recovery_target_timeline
settings. To confirm that invalid values are rejected, those tests started the
server with an invalid setting and then verified that startup failed. While
functionally correct, this approach was expensive because it required
setting up and starting the server for each test case.

This commit updates the tests for recovery_target_timeline to use
the simpler approach introduced by commit bffd7130 for recovery_target_xid,
using ALTER SYSTEM SET to verify that invalid settings are rejected.
This avoids the need to set up and start the server when checking invalid
recovery_target_timeline values.

Author: David Steele <david@pgbackrest.org>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwG44vZbSoBmg076G+xkR6n=Tj2=q+fVkfP7yEsyF1daFA@mail.gmail.com

Fix inconsistency with HeapTuple freeing in extended_stats_funcs.c

heap_freetuple() is a thin wrapper doing a pfree(), and the function
import_pg_statistic(), introduced by ba97bf9cb7b4, had the idea to call
directly pfree() rather than the "dedicated" heap tuple routine.

upsert_pg_statistic_ext_data already uses heap_freetuple(). This code
is harmless as-is, but let's be consistent across the board.

Reported-by: Yonghao Lee <yonghao_lee@qq.com>
Discussion: https://postgr.es/m/tencent_CA1315EE8FB9C62F742C71E95FAD72214205@qq.com

Fix order of columns in pg_stat_recovery

recovery_last_xact_time is listed before current_chunk_start_time in the
documentation, the function definition and the view definition, but
their order was reversed in the code.

Thinko in 01d485b142e4. Mea culpa.

Author: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/CAOzEurQQ1naKmPJhfE5WOUQjtf5tu08Kw3QCGY5UY=7Rt9fE=w@mail.gmail.com

Fix inconsistent elevel in pg_sync_replication_slots() retry logic.

The commit 0d2d4a0ec3 allowed pg_sync_replication_slots() to retry sync
attempts, but missed a case, when WAL prior to a slot's
confirmed_flush_lsn is not yet flushed locally.

By changing the elevel from ERROR to LOG, we allow the sync loop to
continue. This provides the opportunity for the slot to be synchronized
once the standby catches up with the necessary WAL.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAFPTHDZAA+gWDntpa5ucqKKba41=tXmoXqN3q4rpjO9cdxgQrw@mail.gmail.com

Add system view pg_stat_recovery

This commit introduces pg_stat_recovery, that exposes at SQL level the
state of recovery as tracked by XLogRecoveryCtlData in shared memory,
maintained by the startup process.  This new view includes the following
fields, that are useful for monitoring purposes on a standby, once it
has reached a consistent state (making the execution of the SQL function
possible):
- Last-successfully replayed WAL record LSN boundaries and its timeline.
- Currently replaying WAL record end LSN and its timeline.
- Current WAL chunk start time.
- Promotion trigger state.
- Timestamp of latest processed commit/abort.
- Recovery pause state.

Some of this data can already be recovered from different system
functions, but not all of it.  See pg_get_wal_replay_pause_state or
pg_last_xact_replay_timestamp.  This new view offers a stronger
consistency guarantee, by grabbing the recovery state for all fields
through one spinlock acquisition.

The system view relies on a new function, called pg_stat_get_recovery().
Querying this data requires the pg_read_all_stats privilege.  The view
returns no rows if the node is not in recovery.

This feature originates from a suggestion I have made while discussion
the addition of a CONNECTING state to the WAL receiver's shared memory
state, because we lacked access to some of the state data.  The author
has taken the time to implement it, so thanks for that.

Bump catalog version.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/CABPTF7W+Nody-+P9y4PNk37-QWuLpfUrEonHuEhrX+Vx9Kq+Kw@mail.gmail.com
Discussion: https://postgr.es/m/aW13GJn_RfTJIFCa@paquier.xyz

Refactor code retrieving string for RecoveryPauseState

This refactoring is going to be useful in an upcoming commit, to avoid
some code duplication with the function pg_get_wal_replay_pause_state(),
that returns a string for the recovery pause state.

Refactoring opportunity noticed while hacking on a different patch.

Discussion: https://postgr.es/m/CABPTF7W+Nody-+P9y4PNk37-QWuLpfUrEonHuEhrX+Vx9Kq+Kw@mail.gmail.com

Simplify creation of built-in functions with non-default ACLs.

Up to now, to create such a function, one had to make a pg_proc.dat
entry and then modify it with GRANT/REVOKE commands, which we put in
system_functions.sql.  That seems a little ugly though, because it
violates the idea of having a single source of truth about the initial
contents of pg_proc, and it results in leaving dead rows in the
initial contents of pg_proc.

This patch improves matters by allowing aclitemin to work during early
bootstrap, before pg_authid has been loaded.  On the same principle
that we use for early access to pg_type details, put a table of known
built-in role names into bootstrap.c, and use that in bootstrap mode.

To create a built-in function with a non-default ACL, one should write
the desired ACL list in its pg_proc.dat entry, using a simplified
version of aclitemout's notation: omit the grantor (if it is the
bootstrap superuser, which it pretty much always should be) and spell
the bootstrap superuser's name as POSTGRES, similarly to the notation
used elsewhere in src/include/catalog.  This results in entries like

  proacl => '{POSTGRES=X,pg_monitor=X}'

which shows that we've revoked public execute permissions and instead
granted that to pg_monitor.

In addition to fixing up pg_proc.dat entries, I got rid of some
role grants that had been stuck into system_functions.sql,
and instead put them into a new file pg_auth_members.dat;
that seems like a far less random place to put the information.

The correctness of the data changes can be verified by comparing the
initial contents of pg_proc and pg_auth_members before and after.
pg_proc should match exactly, but the OID column of pg_auth_members
will probably be different because those OIDs now get assigned a
little earlier in bootstrap.  (I forced a catversion bump out of
caution, but it wasn't really necessary.)

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/183292bb-4891-4c96-a3ca-e78b5e0e1358@dunslane.net

Be more wary of false matches in initdb's replace_token().

Do not replace the target string unless the occurrence is surrounded
by whitespace or line start/end.  This avoids potential false match
to a substring of a field.  While we've not had trouble with that
up to now, the next patch creates hazards of false matches to
POSTGRES within an ACL field.

There is one call site that needs adjustment, as it was presuming
it could write "::1" and have that match "::1/128".  For all the
others, this restriction is okay and strictly safer.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/183292bb-4891-4c96-a3ca-e78b5e0e1358@dunslane.net

Prefix PruneState->all_{visible,frozen} with set_

The PruneState had members called "all_visible" and "all_frozen" which
reflect not the current state of the page but the state it could be in
once pruning and freezing have been executed. These are then saved in
the PruneFreezeResult so the caller can set the VM accordingly.

Prefix the PruneState members as well as the corresponsding
PruneFreezeResult members with "set_" to clarify that they represent the
proposed state of the all-visible and all-frozen bits for a heap page in
the visibility map, not the current state.

Author: Melanie Plageman <melanieplageman@gmail.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk

Add PageGetPruneXid() helper

This is similar to the other page accessors in bufpage.h. It improves
readability and avoids long lines.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/BD8B69E7-26D8-4706-9164-597C6AE57812%40gmail.com

Move commonly used context into PruneState and simplify helpers

heap_page_prune_and_freeze() and many of its helpers use the heap
buffer, block number, and page. Other helpers took the heap page and
didn't use it. Initializing these values once during
prune_freeze_setup() simplifies the helpers' interfaces and avoids any
repeated calls to BufferGetBlockNumber() and BufferGetPage().

While updating PruneState, also reorganize its fields to make the layout
and member documentation more consistent.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/BD8B69E7-26D8-4706-9164-597C6AE57812%40gmail.com

Exit after fatal errors in client-side compression code.

It looks like whoever wrote the astreamer (nee bbstreamer) code
thought that pg_log_error() is equivalent to elog(ERROR), but
it's not; it just prints a message. So all these places tried to
continue on after a compression or decompression error return,
with the inevitable result being garbage output and possibly
cascading error messages. We should use pg_fatal() instead.

These error conditions are probably pretty unlikely in practice,
which no doubt accounts for the lack of field complaints.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/1531718.1772644615@sss.pgh.pa.us
Backpatch-through: 15

oauth: Add TLS support for oauth_validator tests

The oauth_validator tests don't currently support HTTPS, which makes
testing PGOAUTHCAFILE difficult. Add a localhost certificate to
src/test/ssl and make use of it in oauth_server.py.

In passing, explain the hardcoded use of IPv4 in our issuer identifier,
after intermittent failures on NetBSD led to commit 8d9d5843b. (The new
certificate is still set up for IPv6, to make it easier to improve that
behavior in the future.)

Patch by Jonathan Gonzalez V., with some additional tests and tweaks by
me.

Author: Jonathan Gonzalez V. <jonathan.abdiel@gmail.com>
Discussion: https://postgr.es/m/8a296a2c128aba924bff0ae48af2b88bf8f9188d.camel@gmail.com

libpq: Add PQgetThreadLock() to mirror PQregisterThreadLock()

Allow libpq clients to retrieve the current pg_g_threadlock pointer with
PQgetThreadLock(). Single-threaded applications could already do this in
a convoluted way:

    pgthreadlock_t tlock;

    tlock = PQregisterThreadLock(NULL);
    PQregisterThreadLock(tlock);    /* re-register the callback */

    /* use tlock */

But a generic library can't do that without potentially breaking
concurrent libpq connections.

The motivation for doing this now is the libpq-oauth plugin, which
currently relies on direct injection of pg_g_threadlock, and should
ideally not.

Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CAOYmi%2BmEU_q9sr1PMmE-4rLwFN%3DOjyndDwFZvpsMU3RNJLrM9g%40mail.gmail.com
Discussion: https://postgr.es/m/CAOYmi%2B%3DMHD%2BWKD4rsTn0v8220mYfyLGhEc5EfhmtqrAb7SmC5g%40mail.gmail.com

oauth: Report cleanup errors as warnings on stderr

Using conn->errorMessage for these "shouldn't-happen" cases will only
work if the connection itself fails. Our SSL and password callbacks
print WARNINGs when they find themselves in similar situations, so
follow their lead.

Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CAOYmi%2BmEU_q9sr1PMmE-4rLwFN%3DOjyndDwFZvpsMU3RNJLrM9g%40mail.gmail.com

Fix handling of updated tuples in the MERGE statement

This branch missed the IsolationUsesXactSnapshot() check.  That led to EPQ on
repeatable read and serializable isolation levels.  This commit fixes the
issue and provides a simple isolation check for that.  Backpatch through v15
where MERGE statement was introduced.

Reported-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAPpHfdvzZSaNYdj5ac-tYRi6MuuZnYHiUkZ3D-AoY-ny8v%2BS%2Bw%40mail.gmail.com
Author: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Backpatch-through: 15

Improve validation of recovery_target_xid GUC values.

Previously, the recovery_target_xid GUC values were not sufficiently validated.
As a result, clearly invalid inputs such as the string "bogus", a decimal value
like "1.1", or 0 (a transaction ID smaller than the minimum valid value of 3)
were unexpectedly accepted. In these cases, the value was interpreted as
transaction ID 0, which could cause recovery to behave unexpectedly.

This commit improves validation of recovery_target_xid GUC so that invalid
values are rejected with an error. This prevents recovery from proceeding
with misconfigured recovery_target_xid settings.

Also this commit updates the documentation to clarify the allowed values
for recovery_target_xid GUC.

Author: David Steele <david@pgbackrest.org>
Reviewed-by: Hüseyin Demir <huseyin.d3r@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/f14463ab-990b-4ae9-a177-998d2677aae0@pgbackrest.org

doc: Clarify that COLUMN is optional in ALTER TABLE ... ADD/DROP COLUMN.

In ALTER TABLE ... ADD/DROP COLUMN, the COLUMN keyword is optional. However,
part of the documentation could be read as if COLUMN were required, which may
mislead users about the command syntax.

This commit updates the ALTER TABLE documentation to clearly state that
COLUMN is optional for ADD and DROP.

Also this commit adds regression tests covering ALTER TABLE ... ADD/DROP
without the COLUMN keyword.

Backpatch to all supported versions.

Author: Chao Li <lic@highgo.com>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2n6ShLMOnjOtf63TjjgGbgiTVT5OMsSOFmbjGb6Xue1Bw@mail.gmail.com
Backpatch-through: 14

Move definition of XLogRecoveryCtlData to xlogrecovery.h

XLogRecoveryCtlData is the structure that stores the shared-memory state
of WAL recovery, including information such as promotion requests, the
timeline ID (TLI), and the LSNs of replayed records.

This refactoring is independently useful because it allows code outside
of core to access the recovery state in live. It will be used by an
upcoming patch that introduces a SQL function for querying this
information, that can be accessed on a standby once a consistent state
has been reached. This only moves code around, changing nothing
functionally.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/CABPTF7W+Nody-+P9y4PNk37-QWuLpfUrEonHuEhrX+Vx9Kq+Kw@mail.gmail.com

Fix rare instability in recovery TAP test 004_timeline_switch

This fixes a problem similar to ad8c86d22cbd.  In this case, the test
could fail under the following circumstances:
- The primary is stopped with teardown_node(), meaning that it may not
be able to send all its WAL records to standby_1 and standby_2.
- If standby_2 receives more records than standby_1, attempting to
reconnect standby_2 to the promoted standby_1 would fail because of a
timeline fork.

This race condition is fixed with a simple trick: instead of tearing
down the primary, it is stopped cleanly so as all the WAL records of the
primary are received and flushed by both standby_1 and standby_2.  Once
we do that, there is no need for a wait_for_catchup() before stopping
the node.  The test wants to check that a timeline jump can be achieved
when reconnecting a standby to a promoted standby in the same cluster,
hence an immediate stop of the primary is not required.

This failure is harder to reach than the previous instability of
009_twophase, still the buildfarm has been able to detect this failure
at least once.  I have tried Alexander Lakhin's test trick with the
bgwriter and very aggressive standby snapshots, but I could not
reproduce it directly.  It is reachable, as the buildfarm has proved.

Backpatch down to all supported branches, and this problem can lead to
spurious failures in the buildfarm.

Discussion: https://postgr.es/m/493401a8-063f-436a-8287-a235d9e065fc@gmail.com
Backpatch-through: 14

Change default value of default_toast_compression to "lz4", take two

The default value for default_toast_compression was "pglz".  The main
reason for this choice is that this option is always available, pglz
code being embedded in Postgres.  However, it is known that LZ4 is more
efficient than pglz: less CPU required, more compression on average.  As
of this commit, the default value of default_toast_compression becomes
"lz4", if available.  By switching to LZ4 as the default, users should
see natural speedups on TOAST data reads and/or writes.

Support for LZ4 in TOAST compression was added in Postgres v14, or 5
releases ago.  This should be long enough to consider this feature as
stable.

While at it, quotes are removed from default_toast_compression in
postgresql.conf.sample.  Quotes are not required in this case.  The
in-place value replacement done by initdb if the build supports LZ4
would not use them in the postgresql.conf file added to a
freshly-initialized cluster.

Note that this is a version lighter than 7c1849311e49, that included a
replacement of --with-lz4 by --without-lz4 in configure builds, forcing
a requirement for LZ4 in all environments.  The buildfarm did not like
it, at all.  This commit switches default_toast_compression to lz4 as
default only when --with-lz4 is defined, which should keep the buildfarm
at bay while still allowing users to benefit from LZ4 compression in
TOAST as long as the code is compiled with it.

Author: Euler Taveira <euler@eulerto.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com>
Discussion: https://posgr.es/m/435df33a-129e-4f0c-a803-f3935c5a5ecb@eisentraut.org

Revert "Change default value of default_toast_compression to "lz4""

This reverts commit 7c1849311e49, due to the fact that more than 60% of
the buildfarm members do not have lz4 installed. As we are in the last
commit fest of the development cycle, and that it could take a couple
of weeks to stabilize things, this change is reverted for now.

This commit will be reworked in a lighter version, as
default_toast_compression's default can be changed to "lz4" without the
switch from --with-lz4 to --without-lz4. This approach will keep the
buildfarm at bay, and still allow builds to take advantage of LZ4 in
TOAST by default, as long as the code is compiled with LZ4 support.

A harder requirement based on LZ4 should be achievable at some point,
but it is going to require some work from the buildfarm owners first.
Perhaps this part could be revisited at the beginning of the next
development cycle.

Discussion: https://postgr.es/m/CAOYmi+meTT0NbLbnVqOJD5OKwCtHL86PQ+RZZTrn6umfmHyWaw@mail.gmail.com

pg_restore: add --no-globals option to skip globals

This is a followup to commit 763aaa06f03 Add non-text output formats to
pg_dumpall.

Add a --no-globals option to pg_restore that skips restoring global
objects (roles and tablespaces) when restoring from a pg_dumpall
archive. When -C/--create is not specified, databases that do not
already exist on the target server are also skipped.

This is useful when restoring only specific databases from a pg_dumpall
archive without needing the global objects to be restored first.

Author: Mahendra Singh Thalor <mahi6run@gmail.com>

With small tweaks by me.

Discussion: https://postgr.es/m/CAKYtNArdcc5kx1MdTtTKFNYiauo3=zCA-NB0LmBCW-RU_kSb3A@mail.gmail.com

Improve writing map.dat preamble

Fix code from commit 763aaa06f03

Suggestion from Alvaro Herrera following a bug discovered by Coverity.

Fix casting away const-ness in pg_restore.c

This was intoduced in commit 763aaa06f03

per gripe from Peter Eistentrut.

Author: Mahendra Singh Thalor <mahi6run@gmail.com>

Slightly tweaked by me.

Discussion: https://postgr.es/m/016819c0-666e-42a8-bfc8-2b93fd8d0176@eisentraut.org

Fix estimate_hash_bucket_stats's correction for skewed data.

The previous idea was "scale up the bucketsize estimate by the ratio
of the MCV's frequency to the average value's frequency".  But we
should have been suspicious of that plan, since it frequently led to
impossible (> 1) values which we had to apply an ad-hoc clamp to.
Joel Jacobson demonstrated that it sometimes leads to making the
wrong choice about which side of the hash join should be inner.

Instead, drop the whole business of estimating average frequency, and
just clamp the bucketsize estimate to be at least the MCV's frequency.
This corresponds to the bucket size we'd get if only the MCV appears
in a bucket, and the MCV's frequency is not affected by the
WHERE-clause filters.  (We were already making the latter assumption.)
This also matches the coding used since 4867d7f62 in the case where
only a default ndistinct estimate is available.

Interestingly, this change affects no existing regression test cases.
Add one to demonstrate that it helps pick the smaller table to be
hashed when the MCV is common enough to affect the results.

This leaves estimate_hash_bucket_stats not considering the effects of
null join keys at all, which we should probably improve.  However,
I have a different patch in the queue that will change the executor's
handling of null join keys, so it seems appropriate to wait till
that's in before doing anything more here.

Reported-by: Joel Jacobson <joel@compiler.org>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Joel Jacobson <joel@compiler.org>
Discussion: https://postgr.es/m/341b723c-da45-4058-9446-1514dedb17c1@app.fastmail.com

Fix yet another bug in archive streamer with LZ4 decompression.

The code path in astreamer_lz4_decompressor_content() that updated
the output pointers when the output buffer isn't full was wrong.
It advanced next_out by bytes_written, which could include previous
decompression output not just that of the current cycle. The
correct amount to advance is out_size. While at it, make the
output pointer updates look more like the input pointer updates.

This bug is pretty hard to reach, as it requires consecutive
compression frames that are too small to fill the output buffer.
pg_dump could have produced such data before 66ec01dc4, but
I'm unsure whether any files we use astreamer with would be
likely to contain problematic data.

Author: Chao Li <lic@highgo.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/0594CC79-1544-45DD-8AA4-26270DE777A7@gmail.com
Backpatch-through: 15

Don't malloc(0) in EventTriggerCollectAlterTSConfig

Author: Florin Irion <florin.irion@enterprisedb.com>
Discussion: https://postgr.es/m/c6fff161-9aee-4290-9ada-71e21e4d84de@gmail.com

Allow table exclusions in publications via EXCEPT TABLE.

Extend CREATE PUBLICATION ... FOR ALL TABLES to support the EXCEPT TABLE
syntax. This allows one or more tables to be excluded. The publisher will
not send the data of excluded tables to the subscriber.

To support this, pg_publication_rel now includes a prexcept column to flag
excluded relations. For partitioned tables, the exclusion is applied at
the root level; specifying a root table excludes all current and future
partitions in that tree.

Follow-up work will implement ALTER PUBLICATION support for managing these
exclusions.

Author: vignesh C <vignesh21@gmail.com>
Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Reviewed-by: Ashutosh Sharma <ashu.coek88@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Discussion: https://postgr.es/m/CALDaNm3=JrucjhiiwsYQw5-PGtBHFONa6F7hhWCXMsGvh=tamA@mail.gmail.com

Add test for row-locking and multixids with prepared transactions

This is a repro for the issue fixed in commit ccae90abdb. Backpatch to
v17 like that commit, although that's a little arbitrary as this test
would work on older versions too.

Author: Sami Imseih <samimseih@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAA5RZ0twq5bNMq0r0QNoopQnAEv+J3qJNCrLs7HVqTEntBhJ=g@mail.gmail.com
Backpatch-through: 17

Skip prepared_xacts test if max_prepared_transactions < 2

This reduces maintenance overhead, as we no longer need to update the
dummy expected output file every time the .sql file changes.

Discussion: https://www.postgresql.org/message-id/1009073.1772551323@sss.pgh.pa.us
Backpatch-through: 14

Fix rare instability in recovery TAP test 009_twophase

The phase of the test where we want to check that 2PC transactions
prepared on a primary can be committed on a promoted standby relied on
an immediate stop of the primary.  This logic has a race condition: it
could be possible that some records (most likely standby snapshot
records) are generated on the primary before it finishes its shutdown,
without the promoted standby know about them.  When the primary is
recycled as new standby, the test could fail because of a timeline fork
as an effect of these extra records.

This fix takes care of the instability by doing a clean stop of the
primary instead of a teardown (aka immediate stop), so as all records
generated on the primary are sent to the promoted standby and flushed
there.  There is no need for a teardown of the primary in this test
scenario: the commit of 2PC transactions on a promoted standby do not
care about the state of the primary, only of the standby.

This race is very hard to hit in practice, even slow buildfarm members
like skink have a very low rate of reproduction.  Alexander Lakhin has
come up with a recipe to improve the reproduction rate a lot:
- Enable -DWAL_DEBUG.
- Patch the bgwriter so as standby snapshots are generated every
milliseconds.
- Run 009_twophase tests under heavy parallelism.

With this method, the failure appears after a couple of iterations.
With the fix in place, I have been able to run more than 50 iterations
of the parallel test sequence, without seeing a failure.

Issue introduced in 30820982b295, due to a copy-pasto coming from the
surrounding tests.  Thanks also to Hayato Kuroda for digging into the
details of the failure.  He has proposed a fix different than the one of
this commit.  Unfortunately, it relied on injection points, feature only
available in v17.  The solution of this commit is simpler, and can be
applied to v14~v16.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/b0102688-6d6c-c86a-db79-e0e91d245b1a@gmail.com
Backpatch-through: 14

Change default value of default_toast_compression to "lz4", when available

The default value for default_toast_compression was "pglz".  The main
reason for this choice is that this option is always available, pglz
code being embedded in Postgres.  However, it is known that LZ4 is more
efficient than pglz: less CPU required, more compression on average.  As
of this commit, the default value of default_toast_compression becomes
"lz4", if available.  By switching to LZ4 as the default, users should
see natural speedups on TOAST data reads and/or writes.

Support for LZ4 in TOAST compression was added in Postgres v14, or 5
releases ago.  This should be long enough to consider this feature as
stable.

--with-lz4 is removed, replaced by a --without-lz4 to disable LZ4 in the
builds on an option-basis, following a practice similar to readline or
ICU.  References to --with-lz4 are removed from the documentation.

While at it, quotes are removed from default_toast_compression in
postgresql.conf.sample.  Quotes are not required in this case.  The
in-place value replacement done by initdb if the build supports LZ4
would not use them in the postgresql.conf file added to a
freshly-initialized cluster.

For the reference, a similar switch has been done with ICU in
fcb21b3acdcb.  Some of the changes done in this commit are consistent
with that.

Note: this is going to create some disturbance in the buildfarm, in
environments where lz4 is not installed.

Author: Euler Taveira <euler@eulerto.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com>
Discussion: https://posgr.es/m/435df33a-129e-4f0c-a803-f3935c5a5ecb@eisentraut.org

Remove redundant restriction checks in apply_child_basequals

In apply_child_basequals, after translating a parent relation's
restriction quals for a child relation, we simplify each child qual by
calling eval_const_expressions.  Historically, the code then called
restriction_is_always_false and restriction_is_always_true to reduce
NullTest quals that are provably false or true.

However, since commit e2debb643, the planner natively performs
NullTest deduction during constant folding.  Therefore, calling
restriction_is_always_false and restriction_is_always_true immediately
afterward is redundant and wastes CPU cycles.  We can safely remove
them and simply rely on the constant folding to handle the deduction.

Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4-vLmGXaUEZyOMacN0BVfqWCt2tM-eDVWdDfJnOQaauGg@mail.gmail.com

Remove obsolete SAMESIGN macro

The SAMESIGN macro was historically used as a helper for manual
integer overflow checks. However, since commit 4d6ad3125 introduced
overflow-aware integer operations, this manual sign-checking logic is
no longer necessary.

The macro remains defined in brin_minmax_multi.c and timestamp.c, but
is not used in either file. This patch removes these definitions to
clean things up.

Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4-NL3J3hQ3LzrwV-YUkQC18P+jM7ZiegQyAHzgdZev2qg@mail.gmail.com

Add some tests for CREATE OR REPLACE VIEW with column additions

When working on an already-defined view with matching attributes, CREATE
OR REPLACE VIEW would internally generate an ALTER TABLE command with a
set of AT_AddColumnToView sub-commands, one for each attribute added.

Such a command is stored in event triggers twice:
- Once as a simple command.
- Once as an ALTER TABLE command, as it has sub-commands.

There was no test coverage to track this command pattern in terms of
event triggers and DDL deparsing:
- For the test module test_ddl_deparse, two command notices are issued.
- For event triggers, a CREATE VIEW command is logged twice, which may
look a bit weird first, but again this maps with the internal behavior
of how the commands are built, and how the event trigger code reacts in
terms of commands gathered.

While on it, this adds a test for CREATE SCHEMA with a CREATE VIEW
command embedded in it, case supported by the grammar but not covered
yet.

This hole in the test coverage has been found while digging into what
would be a similar behavior for sequences if adding attributes to them
with ALTER TABLE variants, after the initial relation creation.

Discussion: https://postgr.es/m/aaFG9bqkEn0RhLJG@paquier.xyz

Add read_stream_{pause,resume}()

Read stream users can now pause lookahead when no blocks are currently
available. After resuming, subsequent read_stream_next_buffer() calls
continue lookahead with the previous lookahead distance.

This is especially useful for read stream users with self-referential
access patterns (where consuming already-read buffers can produce
additional block numbers).

Author: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJLT2JvWLEiBXMbkSSc5so_Y7%3DN%2BS2ce7npjLw8QL3d5w%40mail.gmail.com

doc: Add restart on failure to example systemd file

The documentation previously had a systemd unit file that would not
attempt to recover from process failures such as OOM's, segfaults,
etc.  This commit adds "Restart=on-failure",` which tells systemd to
attempt to restart the process after failure.  This is the recommended
configuration per the systemd documentation: "Setting this to
on-failure is the recommended choice for long-running services".  Many
PostgreSQL users will simply copy/paste what the PostgreSQL
documentation recommends and will probably do their own research and
change the service file to restart on failure, so might as well set
this as the default in the PostgreSQL documentation.

Author: Andrew Jackson <andrewjackson947@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAKK5BkFfMpAQnv8CLs%3Di%3DrZwurtCV_gmfRb0uZi-V%2Bd6wcryqg%40mail.gmail.com

Reduce scope of for-loop-local variables to avoid shadowing

Adjust a couple of for-loops where a local variable was shadowed by
another in the same scope, by renaming it as well as reducing its scope
to the containing for-loop.

Author: Chao Li <lic@highgo.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CAEoWx2kQ2x5gMaj8tHLJ3=jfC+p5YXHkJyHrDTiQw2nn2FJTmQ@mail.gmail.com

Reduce the scope of volatile qualifiers

Commit c66a7d75e652 introduced a new "cast discards ‘volatile’"
warning (-Wcast-qual) in vac_truncate_clog().

Instead of making use of unvolatize(), remove the warning by reducing the
scope of the volatile qualifier (added in commit 2d2e40e3bef) to only
2 fields.

Also do the same for vac_update_datfrozenxid(), since the intent of
commit f65ab862e3b was to prevent the same kind of race condition that
commit 2d2e40e3bef was fixing.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Suggested-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/aZ3a%2BV82uSfEjDmD%40ip-10-97-1-34.eu-west-3.compute.internal

Add COPY (on_error set_null) option

If ON_ERROR SET_NULL is specified during COPY FROM, any data type
conversion errors will result in the affected column being set to a
null value. A column's not-null constraints are still enforced, and
attempting to set a null value in such columns will raise a constraint
violation error. This applies to a column whose data type is a domain
with a NOT NULL constraint.

Author: Jian He <jian.universality@gmail.com>
Author: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: "David G. Johnston" <david.g.johnston@gmail.com>
Reviewed-by: Yugo NAGATA <nagata@sraoss.co.jp>
Reviewed-by: torikoshia <torikoshia@oss.nttdata.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/CAKFQuwawy1e6YR4S%3Dj%2By7pXqg_Dw1WBVrgvf%3DBP3d1_aSfe_%2BQ%40mail.gmail.com

doc: Fix sentence of pg_walsummary page

Author: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Robert Treat <rob@xzilla.net>
Discussion: https://postgr.es/m/CAHut+PvfYBL-ppX-i8DPeRu7cakYCZz+QYBhrmQzicx7z_Tj5w@mail.gmail.com
Backpatch-through: 17