From 923fcf3165dac6c1feaac47ff9814da652aa32b7 Mon Sep 17 00:00:00 2001 From: Craig Ringer Date: Tue, 24 Mar 2015 17:49:07 +0800 Subject: [PATCH] doc: Skeletal documentation on catalogs, conflicts, more on monitoring --- doc/manual-catalogs-views.sgml | 96 +++++++- doc/manual-conflicts.sgml | 71 +++++- doc/manual-monitoring.sgml | 433 +++++++++++++++++++++------------ 3 files changed, 439 insertions(+), 161 deletions(-) diff --git a/doc/manual-catalogs-views.sgml b/doc/manual-catalogs-views.sgml index 34ab4cf3d1..bc11d04f0e 100644 --- a/doc/manual-catalogs-views.sgml +++ b/doc/manual-catalogs-views.sgml @@ -5,9 +5,99 @@ Views Monitoring - - + + bdr.bdr_nodes + + + The bdr.bdr_nodes table keeps track of a node's + membership in a &bdr; group. A row is inserted or updated in the table + during the node join process, and during node removal. + + + + The 'status' column may have the following values, which are subject to change: + + r- Ready: The node can function fully. Slots may be created on this node and it can participate with the &bdr; group. + b- Joining: The node is bootstrapping. This state occurs when a join has been initiated but the initial copy has not yet begun. + i- Joining: The node is doing initial slot creation or an initial dump and load. + c- Joining: The node is catching up to the target node and is not yet ready to participate with the &bdr; group. + o- Joining: The node has caught up with the target node and is waiting for all inbound and outbound slots to be created. + k- Parting/Parted: The node has been 'killed' or removed by the user with the function bdr.bdr_part_by_node_names. + + + + + + Note that the status doesn't indicate whether the node is actually up right + now. A node may be shut down, isolated from the network, or crashed and still + appear as r in bdr.bdr_nodes because it's + still conceptually part of the BDR group. Check + pg_stat_replication + and + pg_replication_slots + for the connection and replay status of a node. See . + + + + + Do not directly modify bdr.bdr_nodes. Use the provided + node management functions instead. See . + + + + + + + + + bdr.pg_stat_bdr + + + Performance and conflict statistics are maintained for each node by &bdr; in + the bdr.pg_stat_bdr table. This table is not + replicated between nodes, so each node has separate stats. Each row + represents the &bdr; apply statistics for a different peer node. + + + + An example listing from this table might look like: + + SELECT * FROM bdr.pg_stat_bdr; + rep_node_id | rilocalid | riremoteid | nr_commit | nr_rollback | nr_insert | nr_insert_conflict | nr_update | nr_update_conflict | nr_delete | nr_delete_conflict | nr_disconnect + -------------+-----------+----------------------------------------+-----------+-------------+-----------+--------------------+-----------+--------------------+-----------+--------------------+--------------- + 1 | 1 | bdr_6127682459268878512_1_16386_16386_ | 4 | 0 | 6 | 0 | 1 | 0 | 0 | 3 | 0 + 2 | 2 | bdr_6127682494973391064_1_16386_16386_ | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 + (2 rows) + + + + + + + + + bdr.bdr_conflict_history + + + bdr.bdr_conflict_history contains an entry for each + conflict generated in the system while + is enabled. + + + + This history table is not replicated between nodes, so + each node has separate conflict history records. This is a technical + limitation that may be lifted in a future release, but it also saves on + unnecessary replication overhead. + + + + It is safe to TRUNCATE this table to save disk space.
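+
+   Since the history table is node-local, a periodic maintenance job can
+   simply discard entries once they have been reviewed or exported. A minimal
+   sketch (discarding everything, as here, is illustrative only; keep
+   whatever history your auditing requires):
+
+    -- Node-local cleanup; this is not replicated to peers.
+    TRUNCATE TABLE bdr.bdr_conflict_history;
+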
+ + + + + - diff --git a/doc/manual-conflicts.sgml b/doc/manual-conflicts.sgml index ef66d40ff8..467a4e9400 100644 --- a/doc/manual-conflicts.sgml +++ b/doc/manual-conflicts.sgml @@ -1,13 +1,80 @@ Multi-master conflicts + Conflicts + In multi-master use of &bdr;, writes to the same or related table(s) from multiple nodes can result in data conflicts. + - + + + + Types of conflict + + + Many conflict types are possible: + + INSERT/INSERT + INSERT/UPDATE + UPDATE/DELETE + INSERT/DELETE + Transient unhandled conflicts, e.g. foreign key constraint replay order conflicts + Divergent conflicts + + + + + + + + + Conflict logging + + + To make diagnosis and handling of multi-master conflicts easier, &bdr; + supports logging of each conflict incident in a table. + + + + Conflict logging is only enabled when is + true. + + + + You can use the conflict history table to determine how rapidly your + application creates conflicts and where those conflicts occur, allowing you to + improve the application to reduce conflict rates. It also helps detect cases + where conflict resolutions may not have produced the desired results, allowing + you to identify places where a user defined conflict trigger or an application + design change may be desirable. + + + + Row values may optionally be logged for row conflicts. This is controlled by + the global database-wide option . + There is no per-table control over row value logging at this time. Nor is + there any limit applied on the number of fields a row may have, number of + elements dumped in arrays, length of fields, etc., so it may not be wise to + enable this if you regularly work with multi-megabyte rows that may trigger + conflicts. + + + + Because the conflict history table contains data on every table in the + database, each row's schema might be different, so if row values are logged + they are stored as json fields. The json is created with + row_to_json, just as if you'd called it on the row + yourself from SQL. There is no corresponding + json_to_row function in PostgreSQL at this time, so + you'll need table-specific code (pl/pgsql, pl/python, pl/perl, etc.) if + you want to reconstruct a composite-typed tuple from the logged json. + + + - diff --git a/doc/manual-monitoring.sgml b/doc/manual-monitoring.sgml index 4ca8018e0f..5aef08614c 100644 --- a/doc/manual-monitoring.sgml +++ b/doc/manual-monitoring.sgml @@ -10,174 +10,295 @@ and doesn't run out of disk space or encounter other faults that may halt operations. - - If one or more nodes are down in a &bdr; group then DDL - locking for will wait indefinitely or until - cancelled. DDL locking requires consensus - across all nodes, not just a quorum, so it must be able to reach all nodes. - So it's important to monitor for node outages. - + + Why monitoring matters - - Global sequence chunk allocations can also be distrupted if half or more of - the nodes are down or unreachable. See - . - + + If one or more nodes are down in a &bdr; group then DDL + locking for will wait indefinitely or until + cancelled. DDL locking requires consensus + across all nodes, not just a quorum, so it must be able to reach all nodes. + So it's important to monitor for node outages. + - - The bdr.bdr_nodes table keeps track of a node's - membership in a &bdr; group. A row is inserted or updated in the table - during the node join process, and during node removal.
The 'status' column - may have the following values: - - - - i - The node is doing initial slot creation or an initial dump and load (see init_replica, above) - - - - - c - The node is catching up to the target node and is not yet ready to participate with the &bdr; group. - - - - - k - The node has been 'killed' or removed by the user with the function bdr.bdr_part_by_node_names. - - - - - r - The node is fully ready. Slots may be created on this node and it can participate with the &bdr group. - - - - - + + Global sequence chunk allocations can also be disrupted if half or more of + the nodes are down or unreachable. See + . + - - Note that the status doesn't indicate whether the node is actually up right - now. A node may be shut down, isolated from the network, or crashed and still - appear as r in bdr.bdr_nodes because it's - still conceptually part of the BDR group. Check - pg_stat_replication - and - pg_replication_slots - for the connection and replay status of a node. - + + Because DDL locking and global sequence allocations + insert messages into the replication stream, a node that is extremely far + behind on replay will cause similar disruption to one that is entirely + down. + - - Do not directly modify bdr.bdr_nodes. Use the provided - node management functions instead. See . + Protracted node outages can also cause disk space exhaustion, resulting in + other nodes rejecting writes or performing emergency shutdowns. Because + every node connects to every other node, there is a replication slot for every + downstream peer node on each node. Replication slots ensure that an upstream + (sending) server will retain enough write-ahead log (WAL) + in pg_xlog to resume replay from the point the downstream + peer (receiver) last replayed on that slot. If a peer stops consuming data on + a slot or falls increasingly behind on replay, then the server that has that + slot will accumulate WAL until it runs out of disk space + in pg_xlog. This can happen even if the downstream peer + is online and replaying if it isn't able to receive and replay changes as + fast as the upstream node generates them. - - - Here is an example of a SELECT from - bdr.bdr_nodes that indicates that one node is ready - (r), one node has been removed/killed - (k), and one node is initializing (i): - - SELECT * FROM bdr.bdr_nodes; - node_sysid | node_timeline | node_dboid | node_status | node_name | node_local_dsn | node_init_from_dsn - ---------------------+---------------+------------+-------------+-----------+--------------------------+-------------------------- - 6125823754033780536 | 1 | 16385 | r | node1 | port=5598 dbname=bdrdemo | - 6125823714403985168 | 1 | 16386 | k | node2 | port=5599 dbname=bdrdemo | port=5598 dbname=bdrdemo - 6125847382076805699 | 1 | 16386 | i | node3 | port=6000 dbname=bdrdemo | port=5598 dbname=bdrdemo - (3 rows) - - + + It is therefore important to have automated monitoring in place to + ensure that if replication slots start falling badly behind, the + admin is alerted and can take proactive action. + + + + + + Monitoring node join/removal + + + Node join and removal are asynchronous in &bdr;. The node management functions return immediately, without first + ensuring the join or part operation is complete. To see when a join or + part operation finishes, it is necessary to check the node state indirectly + via or using helper functions. + + + + The helper function , when called, will cause + a PostgreSQL session to pause until outstanding node join operations + complete.
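+
+   For example, a provisioning script can block until a newly joined node is
+   fully ready (a sketch assuming the helper is
+   bdr.bdr_node_join_wait_for_ready(); check the function
+   reference for your &bdr; version):
+
+    -- Blocks the calling session until outstanding node joins complete.
+    SELECT bdr.bdr_node_join_wait_for_ready();
+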
More helpers for node status monitoring will be added over + time. + + + + For other node status monitoring, must be queried directly. + + + + Here is an example of a SELECT from + bdr.bdr_nodes that indicates that one node is ready + (r), one node has been removed/killed + (k), and one node is initializing (i): + + SELECT * FROM bdr.bdr_nodes; + node_sysid | node_timeline | node_dboid | node_status | node_name | node_local_dsn | node_init_from_dsn + ---------------------+---------------+------------+-------------+-----------+--------------------------+-------------------------- + 6125823754033780536 | 1 | 16385 | r | node1 | port=5598 dbname=bdrdemo | + 6125823714403985168 | 1 | 16386 | k | node2 | port=5599 dbname=bdrdemo | port=5598 dbname=bdrdemo + 6125847382076805699 | 1 | 16386 | i | node3 | port=6000 dbname=bdrdemo | port=5598 dbname=bdrdemo + (3 rows) + + + + + + + Monitoring replication peers + + + As outlined in , it is important to monitor + the state of peer nodes in a &bdr; group. There are two main views + used for this: pg_stat_replication to monitor for + actively replicating nodes, and pg_replication_slots + to monitor for replication slot progress. + + + + Monitoring connected peers using pg_stat_replication - - Administrators may query pg_catalog.pg_stat_replication to - monitor actively replicating connections. - - This view does not show peers that have a slot but are - not currently connected, even though such peers are still making the server - retain WAL. It is important to monitor - pg_replication_slots too. + Administrators may query + pg_catalog.pg_stat_replication + to monitor actively replicating connections. + It shows the pid of the local side of the connection (WAL sender process), the + application name sent by the peer (for BDR, this is bdr + (sysid,timeline,dboid,)), and other status information: + + SELECT * FROM pg_stat_replication; + pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state + -------+----------+---------+--------------------------------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------ + 29045 | 16385 | myadmin | bdr (6127682459268878512,1,16386,):receive | | | -1 | 2015-03-18 21:03:28.717175+00 | | streaming | 0/189D3B8 | 0/189D3B8 | 0/189D3B8 | 0/189D3B8 | 0 | async + 29082 | 16385 | myadmin | bdr (6127682494973391064,1,16386,):receive | | | -1 | 2015-03-18 21:03:44.665272+00 | | streaming | 0/189D3B8 | 0/189D3B8 | 0/189D3B8 | 0/189D3B8 | 0 | async + + This view shows all active replication connections, not just those used by + &bdr;. You will see connections from physical streaming replicas, other + logical decoding solutions, etc. here as well.
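+
+   A quick check for missing peers is to compare the number of active &bdr;
+   connections against the number of nodes that should be replicating. A
+   sketch (it assumes every peer is in state r and
+   maintains exactly one apply connection to this node):
+
+    SELECT
+      (SELECT count(*) - 1 FROM bdr.bdr_nodes WHERE node_status = 'r') AS expected_peers,
+      (SELECT count(*) FROM pg_stat_replication WHERE application_name LIKE 'bdr%') AS connected_peers;
+
+   If connected_peers is lower than
+   expected_peers, a peer is down or disconnected and
+   should be investigated.
+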
- - It shows the pid of the local side of the connection (wal sender process), the - application name sent by the peer (for BDR, this is bdr - (sysid,timeline,dboid,)), and other status information: - - SELECT * FROM pg_stat_replication; - pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state - -------+----------+---------+--------------------------------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------ - 29045 | 16385 | myadmin | bdr (6127682459268878512,1,16386,):receive | | | -1 | 2015-03-18 21:03:28.717175+00 | | streaming | 0/189D3B8 | 0/189D3B8 | 0/189D3B8 | 0/189D3B8 | 0 | async - 29082 | 16385 | myadmin | bdr (6127682494973391064,1,16386,):receive | | | -1 | 2015-03-18 21:03:44.665272+00 | | streaming | 0/189D3B8 | 0/189D3B8 | 0/189D3B8 | 0/189D3B8 | 0 | async - - This view shows all replication connections, not just those used by &bdr;. - - - Information about replication slots (both logical and physical) is available - in the pg_catalog.pg_replication_slots view: - - SELECT * FROM pg_replication_slots; - slot_name | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn - -----------------------------------------+--------+-----------+--------+----------+--------+------+--------------+------------- - bdr_16386_6127682459268878512_1_16386__ | bdr | logical | 16386 | bdrdemo | t | | 749 | 0/191B130 - bdr_16386_6127682494973391064_1_16386__ | bdr | logical | 16386 | bdrdemo | t | | 749 | 0/191B130 - (2 rows) - - If a slot has active = t - then there will be a corresponding pg_stat_replication entry - for the walsender process connected to the slot. - - - This view shows only replication peers that use a slot. Physical streaming - replication connections that don't use slots will not show up here, only in - pg_stat_replication. - + + To tell how far behind a given active connection is, compare its + flush_location (the replay position up to which + it has committed its work) with the sending server's + pg_current_xlog_insert_location() using + pg_xlog_location_diff, e.g.: + + SELECT + pg_xlog_location_diff(pg_current_xlog_insert_location(), flush_location) AS lag_bytes, + pid, application_name + FROM pg_stat_replication; + + This will show lag for all replication consumers, including non-&bdr; + ones. To show only &bdr; peers, append + WHERE application_name LIKE 'bdr%'. + - - Performance and conflict statistics are maintained for each node by &bdr; in - the bdr.pg_stat_bdr table. This table is not - replicated between nodes, so each node has separate stats. Each row - represents the &bdr; apply statistics for a different peer node.
- - SELECT * FROM bdr.pg_stat_bdr; - rep_node_id | rilocalid | riremoteid | nr_commit | nr_rollback | nr_insert | nr_insert_conflict | nr_update | nr_update_conflict | nr_delete | nr_delete_conflict | nr_disconnect - -------------+-----------+----------------------------------------+-----------+-------------+-----------+--------------------+-----------+--------------------+-----------+--------------------+--------------- - 1 | 1 | bdr_6127682459268878512_1_16386_16386_ | 4 | 0 | 6 | 0 | 1 | 0 | 0 | 3 | 0 - 2 | 2 | bdr_6127682494973391064_1_16386_16386_ | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 - (2 rows) - - - You can track conflicts that have occurred on a particular node with - bdr.bdr_conflict_history. This catalog is not replicated - between nodes. This is a technical limitation that may be lifted in a future - release, but it also saves on unnecessary replication overhead. - - - You can use the conflict history table to determine how rapidly your - application creates conflicts and where those conflicts occur, allowing you to - improve the application to reduce conflict rates. It also helps detect cases - where conflict resolutions may not have produced the desired results, allowing - you to identify places where a user defined conflict trigger or an application - design change may be desirable. - - - Row values may optionally be logged for row conflicts. This is controlled by - the global database-wide option bdr.log_conflicts_to_table. There is no - per-table control over row value logging at this time. Nor is there any limit - applied on the number of fields a row may have, number of elements dumped in - arrays, length of fields, etc, so it may not be wise to enable this if you - regularly work with multi-megabyte rows that may trigger conflicts. - - - Because the conflict history table contains data on every table in the - database so each row's schema might be different, if row values are logged - they are stored as json fields. The json is created with row_to_json, just - like if you'd called it on the row yourself from SQL. There is no - corresponding json_to_row function in PostgreSQL at this time, so you'll need - table-specific code (pl/pgsql, pl/python, pl/perl, whatever) if you want to - reconstruct a composite-typed tuple from the logged json. - + + + pg_stat_replication does not show + peers that have a slot but are not currently connected, even though such + peers are still making the server retain WAL. It is important to monitor + pg_replication_slots too. + + + + There is not currently any facility to report how far behind a given node + is in elapsed seconds of wall-clock time. So you can't easily tell that + node X currently has data that is + n seconds older than the original data on node + Y. If this is an application requirement, the + application should write periodic timestamp tick records to a table and + check how old the newest tick for a given node is on other nodes. + + + + + + Monitoring replication slots + + + Information about replication slots (both logical and physical) is available + in the pg_catalog.pg_replication_slots view. This view + shows all slots, whether or not there is an active replication connection + using them.
It looks like: + + SELECT * FROM pg_replication_slots; + slot_name | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn + -----------------------------------------+--------+-----------+--------+----------+--------+------+--------------+------------- + bdr_16386_6127682459268878512_1_16386__ | bdr | logical | 16386 | bdrdemo | t | | 749 | 0/191B130 + bdr_16386_6127682494973391064_1_16386__ | bdr | logical | 16386 | bdrdemo | t | | 749 | 0/191B130 + (2 rows) + + If a slot has active = t + then there will be a corresponding pg_stat_replication entry + for the walsender process connected to the slot. + + + + This view shows only replication peers that use a slot. Physical streaming + replication connections that don't use slots will not show up here, only in + pg_stat_replication. &bdr; always uses slots, so all + &bdr; peers will appear here. + + + + To see how much extra WAL each &bdr; slot is asking the server + to keep, in bytes, use a query like: + + SELECT + slot_name, database, active, + pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes + FROM pg_replication_slots + WHERE plugin = 'bdr'; + + + + + Retained WAL isn't additive; if you have three peers, two + of which require 500KB of WAL to be retained and one that + requires 8MB, only 8MB is retained. It's like a dynamic version of the + wal_keep_segments setting (or, in 9.5, + min_wal_size). So you need to monitor to make sure that + the largest amount of retained WAL + doesn't exhaust the free space in pg_xlog on each node. + + + + It is normal for pg_replication_slots.restart_lsn not to + advance as soon as pg_stat_replication.flush_location + advances on an active connection. The slot restart position does + not indicate how old the data you will see on a peer + node is. + + + + + + + + Monitoring conflicts + + + can arise when multiple nodes make changes + that affect the same tables in ways that can interact with each other. + The &bdr; system should be monitored to ensure that conflicts + are identified and, where possible, application changes are made to + eliminate them or make them less frequent. + + + + Not all conflicts are logged to + even when is on. Conflicts + that &bdr; cannot proactively detect and handle (like 3-way + foreign key conflicts) will result in an ERROR message + in the PostgreSQL logs and an increment of + .nr_rollback + on that node for the connection the conflicting transaction originated from. + + + + If pg_stat_bdr.nr_rollback keeps increasing and a node + isn't making forward progress, then it's likely there's a divergent conflict + or other issue that may need administrator action. Check the log files + for that node for details. + + + + + + PostgreSQL statistics views + + + Statistics on table and index usage are updated normally by the downstream + master. This is essential for correct function of + autovacuum. + If there are no local writes on the downstream master and stats have not + been reset, these two views should show matching results between upstream and + downstream: + + pg_stat_user_tables + pg_statio_user_tables + + + + + Since indexes are used to apply changes, the identifying indexes on the + downstream side may appear more heavily used with workloads that perform + UPDATEs and DELETEs than + non-identifying indexes are.
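+
+   For example, you can rank indexes by scan count to see which ones are
+   doing the work of change apply on this node (a sketch using the standard
+   pg_stat_user_indexes view listed below):
+
+    SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_fetch
+    FROM pg_stat_user_indexes
+    ORDER BY idx_scan DESC
+    LIMIT 20;
+
+   On a downstream node with little local read traffic you would expect the
+   identifying (primary key or replica identity) indexes of heavily updated
+   tables to rank near the top.
+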
+ + + + The built-in index monitoring views are: + + pg_stat_user_indexes + pg_statio_user_indexes + + + + + All these views are discussed in the + + PostgreSQL documentation on the statistics views. + + + -- 2.39.5