Add in documentation about complex failover scenarios, e.g. the handling
of failover where a whole site is lost.
author Christopher Browne <cbbrowne@ca.afilias.info>
Fri, 16 Jan 2009 17:16:52 +0000 (17:16 +0000)
committer Christopher Browne <cbbrowne@ca.afilias.info>
Fri, 16 Jan 2009 17:16:52 +0000 (17:16 +0000)

doc/adminguide/complexenv.dia
doc/adminguide/complexenv.png
doc/adminguide/complexfail.dia [new file with mode: 0644]
doc/adminguide/complexfail.png [new file with mode: 0644]
doc/adminguide/failover.sgml
doc/adminguide/slonik_ref.sgml

index 220ae3bc0de7e82786d9b2acb6fd4dc02f538861..cc88d836d6103d17acbdecafe4759f39d74064bc 100644 (file)
Binary files a/doc/adminguide/complexenv.dia and b/doc/adminguide/complexenv.dia differ
index a55d52983a70f96b4796b5cc7de2cb53753732c8..3f729eee808a5b08b1c413c01d9585bc24dcaa7a 100644 (file)
Binary files a/doc/adminguide/complexenv.png and b/doc/adminguide/complexenv.png differ
diff --git a/doc/adminguide/complexfail.dia b/doc/adminguide/complexfail.dia
new file mode 100644 (file)
index 0000000..b0aed46
Binary files /dev/null and b/doc/adminguide/complexfail.dia differ
diff --git a/doc/adminguide/complexfail.png b/doc/adminguide/complexfail.png
new file mode 100644 (file)
index 0000000..3f729ee
Binary files /dev/null and b/doc/adminguide/complexfail.png differ
index 0fa59d5e73097c134bc57cc2ad714f95925463a2..12665b60044770a6eeda086bb5dcb21c0283e126 100644 (file)
@@ -1,4 +1,4 @@
-<!-- $Id: failover.sgml,v 1.28 2008-10-13 19:29:12 devrim Exp $ -->
+<!-- $Id: failover.sgml,v 1.29 2009-01-16 17:16:52 cbbrowne Exp $ -->
 <sect1 id="failover">
 <title>Doing switchover and failover with &slony1;</title>
 <indexterm><primary>failover</primary>
@@ -240,6 +240,81 @@ order after this change. </para> </listitem>
 
 </sect2>
 
+<sect2 id="complexfailover"> <title> Failover With Complex Node Set </title>
+
+<para> Failover is relatively <quote/simple/ if there are only two
+nodes; if a &slony1; cluster comprises many nodes, achieving a clean
+failover requires careful planning and execution. </para>
+
+<para> Consider the following diagram describing a set of six nodes at two sites.
+
+<inlinemediaobject> <imageobject> <imagedata fileref="complexenv.png">
+</imageobject> <textobject> <phrase> Symmetric Multisites </phrase>
+</textobject> </inlinemediaobject></para>
+
+<para> Let us assume that nodes 1, 2, and 3 reside at one data
+centre, and that we find ourselves needing to perform failover due to
+failure of that entire site.  Causes could range from a persistent
+loss of communications to the physical destruction of the site; the
+cause is not actually important, as what we are concerned about is how
+to get &slony1; to properly fail over to the new site.</para>
+
+<para> We will further assume that node 5 is to be the new origin,
+after failover. </para>
+
+<para> The sequence of &slony1; reconfiguration steps required to
+properly fail over this sort of node configuration is as follows:
+</para>
+
+<itemizedlist>
+
+<listitem><para> Resubscribe (using <xref linkend="stmtsubscribeset">)
+each node that is to be kept in the re-formed cluster and that is
+not already subscribed to the intended data provider.  </para>
+
+<para> In the example cluster, this means we would likely wish to
+resubscribe nodes 4 and 6 so that both point to node 5.</para>
+
+<programlisting>
+   include &lt;/tmp/failover-preamble.slonik&gt;;
+   subscribe set (id = 1, provider = 5, receiver = 4);
+   subscribe set (id = 1, provider = 5, receiver = 6);
+</programlisting>
+
+</listitem>
+<listitem><para> Drop the nodes that are being abandoned, other than the failed origin, starting with leaf nodes.</para>
+
+<para> Since nodes 1, 2, and 3 are inaccessible, we must indicate the
+<envar>EVENT NODE</envar> so that the event reaches the still-live
+portions of the cluster. </para>
+
+<programlisting>
+   include &lt;/tmp/failover-preamble.slonik&gt;;
+   drop node (id=2, event node = 4);
+   drop node (id=3, event node = 4);
+</programlisting>
+
+</listitem>
+
+<listitem><para> Now, run <command>FAILOVER</command>.</para>
+
+<programlisting>
+   include &lt;/tmp/failover-preamble.slonik&gt;;
+   failover (id = 1, backup node = 5);
+</programlisting>
+
+</listitem>
+
+<listitem><para> Finally, drop the former origin from the cluster.</para>
+
+<programlisting>
+   include &lt;/tmp/failover-preamble.slonik&gt;;
+   drop node (id=1, event node = 4);
+</programlisting>
+</listitem>
+
+</itemizedlist>
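+
+<para> Each of the scripts above includes a preamble file whose
+contents are not shown.  As a minimal sketch, assuming the preamble
+only needs to name the cluster and give connection information for the
+surviving nodes, <filename>/tmp/failover-preamble.slonik</filename>
+might look like the following.  The cluster name, database name, host
+names, and user are hypothetical placeholders; whether entries for the
+failed nodes are also needed is not addressed here. </para>
+
+<programlisting>
+   # Hypothetical preamble; every name and conninfo value is a placeholder.
+   cluster name = examplecluster;
+   # Admin connections for the nodes that survive at the remaining site.
+   node 4 admin conninfo = 'dbname=exampledb host=node4.example.com user=slony';
+   node 5 admin conninfo = 'dbname=exampledb host=node5.example.com user=slony';
+   node 6 admin conninfo = 'dbname=exampledb host=node6.example.com user=slony';
+</programlisting>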
+</sect2>
 <sect2><title> Automating <command> FAIL OVER </command> </title>
 
 <indexterm><primary>automating failover</primary></indexterm>
index abeabf6d234917dcce4bfd9142ec388252c1f3f8..b7c24a7ed697f41db8d097e75abc2d4bfc66bb0e 100644 (file)
@@ -2575,6 +2575,15 @@ FAILOVER (
     linkend="stmtmoveset"> instead, as that does
     <emphasis>not</emphasis> abandon the failed node.
     </para>
+
+    <para> If there are many nodes in a cluster, and failover includes
+    dropping additional nodes (<emphasis>e.g.</emphasis> when it
+    is necessary to treat <emphasis>all</emphasis> nodes at a site,
+    including an origin as well as subscribers, as failed), it is
+    necessary to carefully sequence the actions, as described in <xref
+    linkend="complexfailover">.
+    </para>
+
    </refsect1>
    <refsect1> <title> Version Information </title>
     <para> This command was introduced in &slony1; 1.0 </para>