Fix bug mistakenly overriding global backend status right after failover.
author Tatsuo Ishii <ishii@postgresql.org>
Fri, 22 Sep 2017 02:50:28 +0000 (11:50 +0900)
committer Tatsuo Ishii <ishii@postgresql.org>
Fri, 22 Sep 2017 02:50:28 +0000 (11:50 +0900)
In [pgpool-general: 5728] it was reported that even after failover
disconnects a backend, its status can change from "down" back to "up"
under certain timing. After debugging I found that the backend status
in pgpool_status was changed to down, then changed again by the first
connection from a client after the failover. This happened in
new_connection(), which is in charge of creating a new connection to
the backend. It checks the locally cached status of the backend and,
if it is up, tries to connect to the backend. In this particular case
the failover was triggered by failover_if_affected_tuples_mismatch, so
the backend is actually alive and new_connection() succeeds in
establishing a connection to the disconnected backend. It then
overrides the global status and the pgpool_status file.

The fix is to check whether the local backend status is obsolete. If
the global status does not agree with the local status, skip the
attempt to establish the connection.
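
The check described above can be illustrated with a minimal,
standalone sketch. It is not pgpool code: the two arrays are
hypothetical stand-ins for the per-process cached status and the
status kept in shared memory, and count_connectable() is an invented
helper that only counts the backends a connection attempt would reach.
Only the status names mirror the ones used in the patch.

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for the backend status values used in the patch. */
typedef enum
{
	CON_UNUSED,
	CON_CONNECT_WAIT,
	CON_UP,
	CON_DOWN
} BACKEND_STATUS;

#define NUM_BACKENDS 2

/* Hypothetical model: one copy shared by all processes, one cached per process. */
BACKEND_STATUS global_status[NUM_BACKENDS];
BACKEND_STATUS local_status[NUM_BACKENDS];

/* A status counts as usable when it is CON_UP or CON_CONNECT_WAIT. */
static bool
status_is_usable(BACKEND_STATUS s)
{
	return s == CON_UP || s == CON_CONNECT_WAIT;
}

/*
 * Count the backends a new connection would be attempted on. A backend
 * that looks usable in the local cache is still skipped when the global
 * status says otherwise, so a stale local "up" cannot lead to a
 * connection that later overrides the global "down".
 */
int
count_connectable(void)
{
	int			n = 0;

	for (int i = 0; i < NUM_BACKENDS; i++)
	{
		if (!status_is_usable(local_status[i]))
			continue;			/* locally known to be unusable */

		if (!status_is_usable(global_status[i]))
		{
			printf("skipping backend slot %d because global status = %d\n",
				   i, (int) global_status[i]);
			continue;			/* local cache is stale */
		}
		n++;
	}
	return n;
}
```

With both local entries set to CON_UP, setting global_status[1] to
CON_DOWN (as a failover would) makes count_connectable() skip slot 1
even though the local cache still reports it as up.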

In this report the user was using native replication mode, but I think
a similar situation can happen in other modes.

src/protocol/pool_connection_pool.c

index a84134cd1f1b053bad049a17fa5d8f4148a7bc6d..320f76d17aba051cd025b9be8c777f096fecb86e 100644
@@ -845,6 +845,22 @@ static POOL_CONNECTION_POOL *new_connection(POOL_CONNECTION_POOL *p)
                        continue;
                }
 
+               /*
+                * Make sure that the global backend status in the shared memory
+                * agrees the local status checked by VALID_BACKEND. It is possible
+                * that the local status is up, while the global status has been
+                * changed to down by failover.
+                */
+               if (BACKEND_INFO(i).backend_status != CON_UP &&
+                       BACKEND_INFO(i).backend_status != CON_CONNECT_WAIT)
+               {
+                       ereport(DEBUG1,
+                                       (errmsg("creating new connection to backend"),
+                                       errdetail("skipping backend slot %d because global backend_status = %d",
+                                                  i, BACKEND_INFO(i).backend_status)));
+                       continue;
+               }
+
                s = palloc(sizeof(POOL_CONNECTION_POOL_SLOT));
 
                if (create_cp(s, i) == NULL)