and doesn't run out of disk space or encounter other faults that may halt operations.
</para>
- <para>
- If one or more nodes are down in a &bdr; group then <acronym>DDL</acronym>
- locking for <xref linkend="ddl-replication"> will wait indefinitely or until
- cancelled. <acronym>DDL</acronym> locking requires <emphasis>consensus</emphasis>
- across all nodes, not just a quorum, so it must be able to reach all nodes.
- So it's important to monitor for node outages.
- </para>
+ <sect1 id="monitoring-why" xreflabel="Why monitoring matters">
+ <title>Why monitoring matters</title>
- <para>
- Global sequence chunk allocations can also be distrupted if half or more of
- the nodes are down or unreachable. See
- <xref linkend="global-sequence-voting">.
- </para>
+ <para>
+ If one or more nodes are down in a &bdr; group then <acronym>DDL</acronym>
+ locking for <xref linkend="ddl-replication"> will wait indefinitely or until
+ cancelled. <acronym>DDL</acronym> locking requires <emphasis>consensus</emphasis>
+ across all nodes, not just a quorum, so it must be able to reach all nodes.
+ So it's important to monitor for node outages.
+ </para>
- <para>
- The <literal>bdr.bdr_nodes</literal> table keeps track of a node's
- membership in a &bdr; group. A row is inserted or updated in the table
- during the node join process, and during node removal. The 'status' column
- may have the following values:
- <itemizedlist>
- <listitem>
- <para>
- i - The node is doing initial slot creation or an initial dump and load (see init_replica, above)
- </para>
- </listitem>
- <listitem>
- <para>
- c - The node is catching up to the target node and is not yet ready to participate with the &bdr; group.
- </para>
- </listitem>
- <listitem>
- <para>
- k - The node has been 'killed' or removed by the user with the function bdr.bdr_part_by_node_names.
- </para>
- </listitem>
- <listitem>
- <para>
- r - The node is fully ready. Slots may be created on this node and it can participate with the &bdr group.
- </para>
- </listitem>
- <!-- TODO: list incomplete for 0.9 -->
- </itemizedlist>
- </para>
+ <para>
+ Global sequence chunk allocations can also be disrupted if half or more of
+ the nodes are down or unreachable. See
+ <xref linkend="global-sequence-voting">.
+ </para>
- <para>
- Note that the status doesn't indicate whether the node is actually up right
- now. A node may be shut down, isolated from the network, or crashed and still
- appear as <literal>r</literal> in <literal>bdr.bdr_nodes</literal> because it's
- still conceptually part of the BDR group. Check
- <ulink url="http://www.postgresql.org/docs/current/static/monitoring-stats.html#PG-STAT-REPLICATION-VIEW">pg_stat_replication</ulink>
- and
- <ulink url="http://www.postgresql.org/docs/current/static/catalog-pg-replication-slots.html">pg_replication_slots</ulink>
- for the connection and replay status of a node.
- </para>
+ <para>
+ Because <acronym>DDL</acronym> locking and global sequence allocations
+ insert messages into the replication stream, a node that is extremely
+ behind on replay will cause similar disruption to one that is entirely
+ down.
+ </para>
- <warning>
<para>
- Do not directly modify <literal>bdr.bdr_nodes</literal>. Use the provided
- node management functions instead. See <xref linkend="functions-node-mgmt">.
+ Protracted node outages can also cause disk space exhaustion, resulting in
+ other nodes rejecting writes or performing emergency shutdowns. Because
+ every node connects to every other node there is a replication slot for every
+ downstream peer node on each node. Replication slots ensure that an upstream
+ (sending) server will retain enough write-ahead log (<acronym>WAL</acronym>)
+ in <filename>pg_xlog</filename> to resume replay from the point the downstream
+ peer (receiver) last replayed on that slot. If a peer stops consuming data on
+ a slot or falls increasingly behind on replay then the server that has that
+ slot will accumulate <acronym>WAL</acronym> until it runs out of disk space
+ in <filename>pg_xlog</filename>. This can happen even if the downstream peer
+ is online and replaying if it isn't able to receive and replay changes as
+ fast as the upstream node generates them.
</para>
- </warning>
- <para>
- Here is an example of a <literal>SELECT</literal> from
- <literal>bdr.bdr_nodes</literal> that indicates that one node is ready
- (<literal>r</literal>), one node has been removed/killed
- (<literal>k</literal>), and one node is initializing (<literal>i</literal>):
- <programlisting>
- SELECT * FROM bdr.bdr_nodes;
- node_sysid | node_timeline | node_dboid | node_status | node_name | node_local_dsn | node_init_from_dsn
- ---------------------+---------------+------------+-------------+-----------+--------------------------+--------------------------
- 6125823754033780536 | 1 | 16385 | r | node1 | port=5598 dbname=bdrdemo |
- 6125823714403985168 | 1 | 16386 | k | node2 | port=5599 dbname=bdrdemo | port=5598 dbname=bdrdemo
- 6125847382076805699 | 1 | 16386 | i | node3 | port=6000 dbname=bdrdemo | port=5598 dbname=bdrdemo
- (3 rows)
- </programlisting>
- </para>
+ <para>
+ It is therefore important to have automated monitoring in place to
+ ensure that if replication slots start falling badly behind, the
+ administrator is alerted and can take proactive action.
+ </para>
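+
+ <para>
+ As a minimal sketch of such a check (the 1 GiB threshold is purely
+ illustrative; choose a limit that suits the free space in
+ <filename>pg_xlog</filename> on your nodes), a monitoring job could run:
+ <programlisting>
+ -- Report any slot that is forcing this node to retain more than ~1 GiB of WAL
+ SELECT slot_name,
+        pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes
+ FROM pg_replication_slots
+ WHERE pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) > 1073741824;
+ </programlisting>
+ See <xref linkend="monitoring-slots"> for more detail on the underlying view.
+ </para>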
+
+ </sect1>
+
+ <sect1 id="monitoring-node-join-remove" xreflabel="Monitoring node join/removal">
+ <title>Monitoring node join/removal</title>
+
+ <para>
+ Node join and removal is asynchronous in &bdr;. The <xref
+ linkend="functions-node-mgmt"> return immediately, without first
+ ensuring the join or part operation is complete. To see when a join or
+ part operation finishes, it is necessary to check the node state indirectly
+ via <xref linkend="catalog-bdr-nodes"> or by using helper functions.
+ </para>
+
+ <para>
+ The helper function <xref
+ linkend="function-bdr-node-join-wait-for-ready">, when called, will cause
+ a PostgreSQL session to pause until outstanding node join operations
+ complete. More helpers for node status monitoring will be added over
+ time.
+ </para>
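+
+ <para>
+ For example, to make a <application>psql</application> session block until
+ any in-progress joins have finished (assuming the helper is installed as
+ <literal>bdr.bdr_node_join_wait_for_ready()</literal>, per
+ <xref linkend="function-bdr-node-join-wait-for-ready">):
+ <programlisting>
+ -- Returns once all outstanding node join operations have completed
+ SELECT bdr.bdr_node_join_wait_for_ready();
+ </programlisting>
+ </para>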
+
+ <para>
+ For other node status monitoring <xref linkend="catalog-bdr-nodes">
+ must be queried directly.
+ </para>
+
+ <para>
+ Here is an example of a <literal>SELECT</literal> from
+ <literal>bdr.bdr_nodes</literal> that indicates that one node is ready
+ (<literal>r</literal>), one node has been removed/killed
+ (<literal>k</literal>), and one node is initializing (<literal>i</literal>):
+ <programlisting>
+ SELECT * FROM bdr.bdr_nodes;
+ node_sysid | node_timeline | node_dboid | node_status | node_name | node_local_dsn | node_init_from_dsn
+ ---------------------+---------------+------------+-------------+-----------+--------------------------+--------------------------
+ 6125823754033780536 | 1 | 16385 | r | node1 | port=5598 dbname=bdrdemo |
+ 6125823714403985168 | 1 | 16386 | k | node2 | port=5599 dbname=bdrdemo | port=5598 dbname=bdrdemo
+ 6125847382076805699 | 1 | 16386 | i | node3 | port=6000 dbname=bdrdemo | port=5598 dbname=bdrdemo
+ (3 rows)
+ </programlisting>
+ </para>
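+
+ <para>
+ For automated checks it can be simpler to look only for nodes that are not
+ fully ready, e.g.:
+ <programlisting>
+ -- Any rows returned are nodes still initializing, catching up, or removed
+ SELECT node_name, node_status
+ FROM bdr.bdr_nodes
+ WHERE node_status &lt;&gt; 'r';
+ </programlisting>
+ </para>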
+
+ </sect1>
+
+ <sect1 id="monitoring-peers" xreflabel="Monitoring replication peers">
+ <title>Monitoring replication peers</title>
+
+ <para>
+ As outlined in <xref linkend="monitoring-why"> it is important to monitor
+ the state of peer nodes in a &bdr; group. There are two main views
+ used for this: <literal>pg_stat_replication</literal> to monitor for
+ actively replicating nodes, and <literal>pg_replication_slots</literal>
+ to monitor for replication slot progress.
+ </para>
+
+ <sect2 id="monitoring-connections" xreflabel="Monitoring connected peers">
+ <title>Monitoring connected peers using pg_stat_replication</title>
- <para>
- Administrators may query <literal>pg_catalog.pg_stat_replication</literal> to
- monitor actively replicating connections.
- <warning>
<para>
- This view does <emphasis>not</emphasis> show peers that have a slot but are
- not currently connected, even though such peers are still making the server
- retain WAL. It is important to monitor
- <literal>pg_replication_slots</literal> too.
+ Administrators may query
+ <ulink url="http://www.postgresql.org/docs/current/static/monitoring-stats.html#PG-STAT-REPLICATION-VIEW">pg_catalog.pg_stat_replication</ulink>
+ to monitor actively replicating connections.
+ It shows the pid of the local side of the connection (wal sender process), the
+ application name sent by the peer (for BDR, this is <literal>bdr
+ (sysid,timeline,dboid,)</literal>), and other status information:
+ <programlisting>
+ SELECT * FROM pg_stat_replication;
+ pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
+ -------+----------+---------+--------------------------------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
+ 29045 | 16385 | myadmin | bdr (6127682459268878512,1,16386,):receive | | | -1 | 2015-03-18 21:03:28.717175+00 | | streaming | 0/189D3B8 | 0/189D3B8 | 0/189D3B8 | 0/189D3B8 | 0 | async
+ 29082 | 16385 | myadmin | bdr (6127682494973391064,1,16386,):receive | | | -1 | 2015-03-18 21:03:44.665272+00 | | streaming | 0/189D3B8 | 0/189D3B8 | 0/189D3B8 | 0/189D3B8 | 0 | async
+ </programlisting>
+ This view shows all active replication connections, not just those used by
+ &bdr;. You will see connections from physical streaming replicas, other
+ logical decoding solutions, and so on, here as well.
</para>
- </warning>
- It shows the pid of the local side of the connection (wal sender process), the
- application name sent by the peer (for BDR, this is <literal>bdr
- (sysid,timeline,dboid,)</literal>), and other status information:
- <programlisting>
- SELECT * FROM pg_stat_replication;
- pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
- -------+----------+---------+--------------------------------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
- 29045 | 16385 | myadmin | bdr (6127682459268878512,1,16386,):receive | | | -1 | 2015-03-18 21:03:28.717175+00 | | streaming | 0/189D3B8 | 0/189D3B8 | 0/189D3B8 | 0/189D3B8 | 0 | async
- 29082 | 16385 | myadmin | bdr (6127682494973391064,1,16386,):receive | | | -1 | 2015-03-18 21:03:44.665272+00 | | streaming | 0/189D3B8 | 0/189D3B8 | 0/189D3B8 | 0/189D3B8 | 0 | async
- </programlisting>
- This view shows all replication connections, not just those used by &bdr;.
- </para>
- <para>
- Information about replication slots (both logical and physical) is available
- in the <literal>pg_catalog.pg_replication_slots</literal> view:
- <programlisting>
- SELECT * FROM pg_replication_slots;
- slot_name | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
- -----------------------------------------+--------+-----------+--------+----------+--------+------+--------------+-------------
- bdr_16386_6127682459268878512_1_16386__ | bdr | logical | 16386 | bdrdemo | t | | 749 | 0/191B130
- bdr_16386_6127682494973391064_1_16386__ | bdr | logical | 16386 | bdrdemo | t | | 749 | 0/191B130
- (2 rows)
- </programlisting>
- If a slot has <literal>active = t</literal>
- then there will be a corresponding <literal>pg_stat_replication</literal> entry
- for the walsender process connected to the slot.
- </para>
- <para>
- This view shows only replication peers that use a slot. Physical streaming
- replication connections that don't use slots will not show up here, only in
- <literal>pg_stat_replication</literal>.
- </para>
+ <para>
+ To tell how far behind a given active connection is, compare its
+ <literal>flush_location</literal> (the replay position up to which
+ it has committed its work) with the sending server's
+ <literal>pg_current_xlog_insert_location()</literal> using
+ <literal>pg_xlog_location_diff</literal>, e.g.:
+ <programlisting>
+ SELECT
+ pg_xlog_location_diff(pg_current_xlog_insert_location(), flush_location) AS lag_bytes,
+ pid, application_name
+ FROM pg_stat_replication;
+ </programlisting>
+ This will show lag for all replication consumers, including non-&bdr;
+ ones. To show only &bdr; peers, append
+ <literal>WHERE application_name LIKE 'bdr%'</literal>.
+ </para>
- <para>
- Performance and conflict statistics are maintained for each node by &bdr; in
- the <literal>bdr.pg_stat_bdr</literal> table. This table is <emphasis>not
- replicated</emphasis> between nodes, so each node has separate stats. Each row
- represents the &bdr; apply statistics for a different peer node.
- <programlisting>
- SELECT * FROM bdr.pg_stat_bdr;
- rep_node_id | rilocalid | riremoteid | nr_commit | nr_rollback | nr_insert | nr_insert_conflict | nr_update | nr_update_conflict | nr_delete | nr_delete_conflict | nr_disconnect
- -------------+-----------+----------------------------------------+-----------+-------------+-----------+--------------------+-----------+--------------------+-----------+--------------------+---------------
- 1 | 1 | bdr_6127682459268878512_1_16386_16386_ | 4 | 0 | 6 | 0 | 1 | 0 | 0 | 3 | 0
- 2 | 2 | bdr_6127682494973391064_1_16386_16386_ | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
- (2 rows)
- </programlisting>
- </para>
+ <warning>
+ <para>
+ <literal>pg_stat_replication</literal> does <emphasis>not</emphasis> show
+ peers that have a slot but are not currently connected, even though such
+ peers are still making the server retain WAL. It is important to monitor
+ <literal>pg_replication_slots</literal> too.
+ </para>
+ </warning>
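+
+ <para>
+ A simple check for such disconnected peers is to look for slots without an
+ active connection:
+ <programlisting>
+ -- Slots that are not currently in use but still cause WAL to be retained
+ SELECT slot_name, database
+ FROM pg_replication_slots
+ WHERE NOT active;
+ </programlisting>
+ </para>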
- <para>
- You can track conflicts that have occurred on a particular node with
- <literal>bdr.bdr_conflict_history</literal>. This catalog is not replicated
- between nodes. This is a technical limitation that may be lifted in a future
- release, but it also saves on unnecessary replication overhead.
- </para>
- <para>
- You can use the conflict history table to determine how rapidly your
- application creates conflicts and where those conflicts occur, allowing you to
- improve the application to reduce conflict rates. It also helps detect cases
- where conflict resolutions may not have produced the desired results, allowing
- you to identify places where a user defined conflict trigger or an application
- design change may be desirable.
- </para>
- <para>
- Row values may optionally be logged for row conflicts. This is controlled by
- the global database-wide option bdr.log_conflicts_to_table. There is no
- per-table control over row value logging at this time. Nor is there any limit
- applied on the number of fields a row may have, number of elements dumped in
- arrays, length of fields, etc, so it may not be wise to enable this if you
- regularly work with multi-megabyte rows that may trigger conflicts.
- </para>
- <para>
- Because the conflict history table contains data on every table in the
- database so each row's schema might be different, if row values are logged
- they are stored as json fields. The json is created with row_to_json, just
- like if you'd called it on the row yourself from SQL. There is no
- corresponding json_to_row function in PostgreSQL at this time, so you'll need
- table-specific code (pl/pgsql, pl/python, pl/perl, whatever) if you want to
- reconstruct a composite-typed tuple from the logged json.
- </para>
+ <para>
+ There is not currently any facility to report how far behind a given node
+ is in elapsed seconds of wall-clock time. So you can't easily tell that
+ node <replaceable>X</replaceable> currently has data that is
+ <replaceable>n</replaceable> seconds older than the original data on node
+ <replaceable>Y</replaceable>. If this is an application requirement the
+ application should write periodic timestamp tick records to a table and
+ check how old the newest tick for a given node is on other nodes.
+ </para>
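+
+ <para>
+ A minimal sketch of that approach, using a hypothetical
+ <literal>node_ticks</literal> table (the table and node names are purely
+ illustrative, not part of &bdr;):
+ <programlisting>
+ -- Create once and insert one row per node; replicates like any other table
+ CREATE TABLE node_ticks (
+     node_name text PRIMARY KEY,
+     last_tick timestamptz NOT NULL DEFAULT current_timestamp
+ );
+ INSERT INTO node_ticks (node_name) VALUES ('node1'), ('node2');
+
+ -- Run periodically on each node (here node1), e.g. from cron;
+ -- each node only ever updates its own row, so no conflicts arise
+ UPDATE node_ticks SET last_tick = current_timestamp WHERE node_name = 'node1';
+
+ -- On any other node, estimate how stale the data replicated from each peer is
+ SELECT node_name, current_timestamp - last_tick AS approximate_lag
+ FROM node_ticks;
+ </programlisting>
+ Note that this measures lag only as precisely as the tick interval, and
+ assumes reasonably synchronised clocks across nodes.
+ </para>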
+
+ </sect2>
+
+ <sect2 id="monitoring-slots" xreflabel="Monitoring replication slots">
+ <title>Monitoring replication slots</title>
+
+ <para>
+ Information about replication slots (both logical and physical) is available
+ in the <literal>pg_catalog.pg_replication_slots</literal> view. This view
+ shows all slots, whether or not there is an active replication connection
+ using them. It looks like:
+ <programlisting>
+ SELECT * FROM pg_replication_slots;
+ slot_name | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
+ -----------------------------------------+--------+-----------+--------+----------+--------+------+--------------+-------------
+ bdr_16386_6127682459268878512_1_16386__ | bdr | logical | 16386 | bdrdemo | t | | 749 | 0/191B130
+ bdr_16386_6127682494973391064_1_16386__ | bdr | logical | 16386 | bdrdemo | t | | 749 | 0/191B130
+ (2 rows)
+ </programlisting>
+ If a slot has <literal>active = t</literal>
+ then there will be a corresponding <literal>pg_stat_replication</literal> entry
+ for the walsender process connected to the slot.
+ </para>
+
+ <para>
+ This view shows only replication peers that use a slot. Physical streaming
+ replication connections that don't use slots will not show up here, only in
+ <literal>pg_stat_replication</literal>. &bdr; always uses slots so all
+ &bdr; peers will appear here.
+ </para>
+
+ <para>
+ To see how much extra <acronym>WAL</acronym> each &bdr; slot is asking the server
+ to keep, in bytes, use a query like:
+ <programlisting>
+ SELECT
+ slot_name, database, active,
+ pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes
+ FROM pg_replication_slots
+ WHERE plugin = 'bdr';
+ </programlisting>
+ </para>
+
+ <para>
+ Retained <acronym>WAL</acronym> isn't additive; if you have three peers, two
+ of which require 500KB of <acronym>WAL</acronym> to be retained and one that
+ requires 8MB, only 8MB is retained. It's like a dynamic version of the
+ <literal>wal_keep_segments</literal> setting (or, in 9.5,
+ <literal>min_wal_size</literal>). So you need to monitor to make sure that
+ the <emphasis>largest</emphasis> amount of retained <acronym>WAL</acronym>
+ doesn't exhaust the free space in <filename>pg_xlog</filename> on each node.
+ </para>
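+
+ <para>
+ For example, to find the single largest amount of <acronym>WAL</acronym> any
+ slot is causing this node to retain:
+ <programlisting>
+ SELECT max(pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn)) AS max_retained_bytes
+ FROM pg_replication_slots;
+ </programlisting>
+ </para>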
+
+ <para>
+ It is normal for <literal>pg_replication_slots.restart_lsn</literal> not to
+ advance as soon as <literal>pg_stat_replication.flush_location</literal>
+ advances on an active connection. The slot restart position does
+ <emphasis>not</emphasis> indicate how old the data you will see on a peer
+ node is.
+ </para>
+
+ </sect2>
+
+ </sect1>
+
+ <sect1 id="monitoring-conflict-stats" xreflabel="Monitoring conflicts">
+ <title>Monitoring conflicts</title>
+
+ <para>
+ <xref linkend="conflicts"> can arise when multiple nodes make changes
+ that affect the same tables in ways that can interact with each other.
+ The &bdr; system should be monitored to ensure that conflicts
+ are identified and, where possible, application changes are made to
+ eliminate them or make them less frequent.
+ </para>
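+
+ <para>
+ When <xref linkend="guc-bdr-log-conflicts-to-table"> is enabled, recorded
+ conflicts can be inspected directly, e.g.:
+ <programlisting>
+ SELECT * FROM bdr.bdr_conflict_history;
+ </programlisting>
+ Remember that this catalog is not replicated, so it must be checked on each
+ node.
+ </para>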
+
+ <para>
+ Not all conflicts are logged to <xref linkend="catalog-bdr-conflict-history">
+ even when <xref linkend="guc-bdr-log-conflicts-to-table"> is on. Conflicts
+ that &bdr; cannot proactively detect and handle (like 3-way
+ foreign key conflicts) will result in an <literal>ERROR</literal> message
+ in the PostgreSQL logs and an increment of
+ <xref linkend="catalog-pg-stat-bdr"><literal>.nr_rollbacks</literal>
+ on that node for the connection the conflicting transaction originated from.
+ </para>
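+
+ <para>
+ The per-connection apply and conflict counters can be checked with a query
+ such as:
+ <programlisting>
+ SELECT rep_node_id, riremoteid, nr_commit, nr_rollback,
+        nr_insert_conflict, nr_update_conflict, nr_delete_conflict
+ FROM bdr.pg_stat_bdr;
+ </programlisting>
+ These statistics are per-node and are not replicated, so each node must be
+ checked separately.
+ </para>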
+
+ <para>
+ If <literal>pg_stat_bdr.nr_rollback</literal> keeps increasing and a node
+ isn't making forward progress, then it's likely there's a divergent conflict
+ or other issue that may need administrator action. Check the log files
+ for that node for details.
+ </para>
+
+ </sect1>
+
+ <sect1 id="monitoring-postgres-stats" xreflabel="PostgreSQL statistics views">
+ <title>PostgreSQL statistics views</title>
+
+ <para>
+ Statistics on table and index usage are updated normally by the downstream
+ master. This is essential for correct function of
+ <ulink url="http://www.postgresql.org/docs/current/static/routine-vacuuming.html">autovacuum</ulink>.
+ If there are no local writes on the downstream master and stats have not
+ been reset, these two views should show matching results between upstream and
+ downstream:
+ <itemizedlist>
+ <listitem><simpara><literal>pg_stat_user_tables</literal></simpara></listitem>
+ <listitem><simpara><literal>pg_statio_user_tables</literal></simpara></listitem>
+ </itemizedlist>
+ </para>
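+
+ <para>
+ For example, the same query could be run on the upstream and downstream
+ nodes and the results compared:
+ <programlisting>
+ -- Row activity counters; should roughly match across nodes for replicated-only tables
+ SELECT relname, n_tup_ins, n_tup_upd, n_tup_del
+ FROM pg_stat_user_tables
+ ORDER BY relname;
+ </programlisting>
+ </para>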
+
+ <para>
+ Since indexes are used to apply changes, the identifying indexes on the
+ downstream side may appear more heavily used in workloads that perform
+ <literal>UPDATE</literal>s and <literal>DELETE</literal>s than
+ non-identifying indexes.
+ </para>
+
+ <para>
+ The built-in index monitoring views are:
+ <itemizedlist>
+ <listitem><simpara><literal>pg_stat_user_indexes</literal></simpara></listitem>
+ <listitem><simpara><literal>pg_statio_user_indexes</literal></simpara></listitem>
+ </itemizedlist>
+ </para>
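+
+ <para>
+ For example, to see which indexes are most heavily used on a downstream
+ node:
+ <programlisting>
+ SELECT relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
+ FROM pg_stat_user_indexes
+ ORDER BY idx_scan DESC;
+ </programlisting>
+ </para>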
+
+ <para>
+ All these views are discussed in the
+ <ulink url="http://www.postgresql.org/docs/current/static/monitoring-stats.html#MONITORING-STATS-VIEWS-TABLE">
+ PostgreSQL documentation on the statistics views</ulink>.
+ </para>
+
+ </sect1>
</chapter>