Codership Oy
http://www.codership.com

DISCLAIMER

THIS SOFTWARE IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT SHALL
CODERSHIP OY BE HELD LIABLE TO ANY PARTY FOR ANY DAMAGES RESULTING DIRECTLY
OR INDIRECTLY FROM THE USE OF THIS SOFTWARE.

Trademark Information. All trademarks are the property of their respective
owners.

Licensing Information. Please see the COPYING file that came with this
distribution.

Source code can be found at http://www.codership.com/en/downloads/galera


GALERA v0.8.x

CONTENTS:
=========
1. WHAT IS GALERA
2. GALERA USE CASES
3. GALERA CONFIGURATION PARAMETERS
4. SPECIAL NOTES


1. WHAT IS GALERA

Galera is a synchronous multi-master replication engine that provides its
service through the wsrep API (https://launchpad.net/wsrep). It features
optimistic transaction execution with replication and certification of
writesets at commit time. Since it replicates only the final changes to the
database, it is transparent to triggers, stored procedures and
non-deterministic functions.

Galera nodes are connected to each other in an N-to-N fashion through a group
communication backend which provides automatic reconfiguration in the event
of a node failure or a new node being added to the cluster:

      ,-------.     ,-------.    ,--------.
      | node1 |-----| node2 |<---| client |
      `-------'  G  `-------'    `--------'
           \         /
           ,-------.    ,--------.
           | node3 |<---| client |
           `-------'    `--------'

Node states are synchronized by replicating transaction changes at commit
time. The cluster is virtually synchronous: each node commits transactions in
exactly the same order, although not necessarily at the same physical moment.
(The latter is not as important as it may seem, since in most cases the DBMS
gives no guarantee on when a transaction is actually processed.) Built-in
flow control keeps nodes within a fraction of a second of each other, which
is more than enough for most practical purposes.

Main features of a Galera database cluster:

 * Truly highly available: no committed transaction is ever lost in case of
   a node crash, and all nodes always have a consistent state.
 * True multi-master: all cluster nodes can handle WRITE load concurrently.
 * Highly transparent (see SPECIAL NOTES below).
 * Scalable even with WRITE-intensive applications.
 * Automatic synchronization of new nodes.


2. GALERA USE CASES

There are a number of ways in which Galera replication can be utilized. They
can be categorized into three groups:

1) Seeking High Availability only.

   In this case the client application connects to only one node, the rest
   serving as hot backups:

         ,-------------.
         | application |
         `-------------'
               |
               |
               |            DB backups
      ,-------.   ,-------.   ,-------.
      | node1 |   | node2 |   | node3 |
      `-------'   `-------'   `-------'
      <=======  cluster nodes  =======>

   In the case of primary node failure or maintenance shutdown the
   application can instantly switch to another node without any special
   failover procedure (see the connection sketch after use case 3 below).

2) Seeking High Availability and improved performance through uniform load
   distribution.

   If there are several client connections to the database, they can be
   uniformly distributed between cluster nodes, resulting in better
   performance. The exact degree of performance improvement depends on the
   application's load profile. Note that the transaction rollback rate may
   also increase.

         ,-------------.
         |   clients   |
         `-------------'
           |  |  |  |
         ,-------------.
         | application |
         `-------------'
           /    |    \
      ,-------.   ,-------.   ,-------.
      | node1 |   | node2 |   | node3 |
      `-------'   `-------'   `-------'
      <=======  cluster nodes  =======>

   In the case of a node failure the application can keep on using the
   remaining healthy nodes.

   In this setup the application can also be clustered, with a dedicated
   application instance per database node, thus achieving HA not only for
   the database but for the whole application stack:

         ,-------------.
         |   clients   |
         `-------------'
        //      ||      \\
      ,------.  ,------.  ,------.
      | app1 |  | app2 |  | app3 |
      `------'  `------'  `------'
         |         |         |
      ,-------.  ,-------.  ,-------.
      | node1 |  | node2 |  | node3 |
      `-------'  `-------'  `-------'
      <======  cluster nodes  ======>

3) Seeking High Availability and improved performance through smart load
   distribution.

   Uniform load distribution can cause an undesirably high rollback rate.
   Directing transactions which access the same set of tables to the same
   node can considerably improve performance by reducing the number of
   rollbacks. Also, if your application can distinguish between read/write
   and read-only transactions, the following configuration may be quite
   efficient:

       ,---------------------.
       |     application     |
       `---------------------'
       writes /    | reads    \ reads
      ,-------.   ,-------.   ,-------.
      | node1 |   | node2 |   | node3 |
      `-------'   `-------'   `-------'
      <=======  cluster nodes  =======>
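The following is a minimal sketch of the client-side connection handling
implied by use cases 1 and 3. Python with MySQL Connector/Python is only an
assumption here (Galera is reached through whatever database client the
application already uses), and the node addresses, credentials and database
name are made-up examples:

   import mysql.connector

   # node1 is preferred for writes, node2/node3 for reads (use case 3);
   # if the preferred node is down, the next one in the list is tried
   # (use case 1).
   WRITE_NODES = ["192.168.0.1", "192.168.0.2", "192.168.0.3"]
   READ_NODES  = ["192.168.0.2", "192.168.0.3", "192.168.0.1"]

   def connect(nodes):
       """Return a connection to the first reachable cluster node."""
       last_error = None
       for host in nodes:
           try:
               return mysql.connector.connect(host=host, port=3306,
                                              user="app", password="secret",
                                              database="test")
           except mysql.connector.Error as exc:
               last_error = exc   # node unreachable -- try the next one
       raise last_error           # no cluster node could be reached

   write_conn = connect(WRITE_NODES)  # route read/write transactions here
   read_conn  = connect(READ_NODES)   # route read-only transactions here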
3. GALERA CONFIGURATION PARAMETERS

3.1 Cluster URL.

Galera can use URL (RFC 3986) syntax for addressing, with optional parameters
passed in the URL query part. A Galera cluster address looks as follows:

   <backend>://[<cluster address>][?option1=value1[&option2=value2]]

e.g.:

   gcomm://192.168.0.1:4567?gmcast.listen_addr=0.0.0.0:5678

Currently Galera supports the following backends:

'dummy' - a bypass backend for debugging/profiling purposes. It does not
          connect to or replicate anything, and the rest of the URL address
          string is ignored.

'gcomm' - a proprietary Group Communication backend that provides Virtual
          Synchrony quality of service. It uses TCP for the membership
          service and TCP (and UDP multicast as of version 0.8) for data
          replication.

Normally one would use just the simplest form of the address URL:

   gcomm://            - if one wants to start a new cluster.
   gcomm://<address>   - if one wants to join an existing cluster. In that
                         case <address> is the address of one of the cluster
                         members.
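For example (the IP addresses below are illustrative; 4567 is the default
GMCast port), a three-node cluster could be brought up by starting the first
node with an empty cluster address and pointing the others at it:

   node1:  gcomm://
   node2:  gcomm://192.168.0.1:4567
   node3:  gcomm://192.168.0.1:4567

Any of the options described in section 3.2 below can be appended in the URL
query part, as in the gmcast.listen_addr example above.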
3.2 Gcomm options.

There are quite a few gcomm configuration options. Of particular interest to
the end user are the following:

To configure the gcomm listen address:

   gmcast.listen_addr

To configure how fast the cluster reacts to node failure or connection loss:

   evs.suspect_timeout
   evs.inactive_timeout
   evs.inactive_check_period
   evs.keepalive_period

To fine-tune performance (especially in high-latency networks):

   evs.user_send_window
   evs.send_window

3.2.1 GMCast option group.

All options in this group are prefixed by 'gmcast.' (see the example above).

group
   String denoting the group name. Maximum length of the string is 16.
   Peer nodes accept a GMCast connection only if the group names match.
   It is set automatically from wsrep options.

listen_addr
   Listening address for GMCast. The address is currently passed in URI
   format (for example tcp://192.168.3.1:4567) and it should be passed as
   the last configuration parameter in order to avoid confusion. If the
   parameter value is undefined, GMCast starts listening on all interfaces
   at the default port 4567.

mcast_addr
   Multicast address in dotted decimal notation; enables using multicast to
   transmit group communication messages. Defaults to none. Must have the
   same value on all nodes.

mcast_port
   Port used for UDP multicast messages. Defaults to the listen_addr port.
   Must have the same value on all nodes.

mcast_ttl
   Time to live for multicast packets. Defaults to 1.

3.2.2 EVS option group.

All options in this group are prefixed by 'evs.'. All values for the timeout
options below should follow the ISO 8601 standard for time interval
representation (e.g. 02:01:37.2 -> PT2H1M37.2S == PT121M37.2S == PT7297.2S).

suspect_timeout
   This timeout controls how long a node can remain silent until it is put
   under suspicion. If a majority of the current group agree that the node
   is under suspicion, it is discarded from the group and a new group view
   is formed immediately. If a majority of the group does not agree about
   the suspicion, inactive_timeout is waited before forming of a new group
   is attempted. Default value is 5 seconds.

inactive_timeout
   This timeout controls how long a node can remain completely silent until
   it is discarded from the group. This is a hard limit, unlike
   suspect_timeout, and the node is discarded even if it becomes live during
   the formation of the new group (so it is inclusive of suspect_timeout).
   Default value is 15 seconds.

inactive_check_period
   This period controls how often node liveness is checked. Default is
   1 second and there is no need to change this unless suspect_timeout or
   inactive_timeout is adjusted to a smaller value. Minimum is 0.1 seconds
   and maximum is suspect_timeout/2.

keepalive_period
   This timeout controls how often keepalive messages are sent into the
   network. Node liveness is determined with these keepalives, so the value
   should be considerably smaller than suspect_timeout. Default value is
   1 second, minimum is 0.1 seconds and maximum is suspect_timeout/3.

consensus_timeout
   This timeout defines how long forming of a new group is attempted. If
   there is no consensus after this time has passed since the start of the
   consensus protocol, every node discards all other nodes from the group
   and forming of a new group is attempted through singleton groups. Default
   value is 30 seconds, minimum is inactive_timeout and maximum is
   inactive_timeout*5.

join_retrans_period
   This parameter controls how often join messages are retransmitted during
   group formation. There is usually no need to adjust this value. Default
   value is 0.3 seconds, minimum is 0.1 seconds and maximum is
   suspect_timeout/3.

view_forget_timeout
   This timeout controls how long information about known group views is
   maintained. This information is needed to filter out delayed messages
   from previous views that are not live anymore. Default value is 5 minutes
   and there is usually no need to change it.

debug_log_mask
   This mask controls what debug information is printed in the logs if debug
   logging is turned on. The mask value is a bitwise OR of values from
   gcomm::evs::Proto::DebugFlags. By default only state information is
   printed.

info_log_mask
   This mask controls what info log is printed in the logs. The mask value
   is a bitwise OR of values from gcomm::evs::Proto::InfoFlags.

stats_report_period
   This parameter controls how often statistics information is printed in
   the log. It has effect only if statistics reporting is enabled via
   Conf::EvsInfoLogMask. Default value is 1 minute.

send_window
   This parameter controls how many messages the protocol layer is allowed
   to send without having received acknowledgements for any of them.
   Default value is 32.

user_send_window
   Like send_window, but for messages whose sending is initiated by a call
   from the upper layer. Default value is 16.
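As an illustration only (the address is an example and the values are
arbitrary, not tuning recommendations), a node on a high-latency link could
be started with relaxed failure-detection timeouts and larger send windows by
appending the options to the cluster address:

   gcomm://192.168.0.1:4567?evs.suspect_timeout=PT10S&evs.inactive_timeout=PT30S&evs.keepalive_period=PT3S&evs.user_send_window=512&evs.send_window=1024

Similarly, UDP multicast replication can be enabled through the gmcast
options, e.g. (the multicast address is an example):

   gcomm://192.168.0.1:4567?gmcast.mcast_addr=239.192.0.11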
4. SPECIAL NOTES

4.1 DEADLOCK ON COMMIT

In multi-master mode the transaction commit operation may return a deadlock
error. This is a consequence of writeset certification and is a fundamental
property of Galera (a minimal client-side retry sketch is given at the end
of this file). If deadlock on commit cannot be tolerated by the application,
Galera can still be used on the condition that all write operations to a
given table are performed on the same node. This still has an advantage over
"traditional" master-slave replication: write load can still be distributed
between nodes, and since replication is synchronous, failover is trivial.

4.2 "SPLIT-BRAIN" CONDITION

A Galera cluster is fully distributed and does not use any sort of
centralized arbitrator, thus having no single point of failure. However,
like any cluster of that kind it may fall into the dreaded "split-brain"
condition, where half or more of the cluster nodes suddenly disappear (e.g.
due to network failure). In the general case, having no information about
the fate of the disappeared nodes, the remaining nodes cannot continue to
process requests and modify their states. While such a situation is
generally considered negligibly probable in a multi-node cluster (normally
nodes fail one at a time), in a 2-node cluster a single node failure can
lead to it, thus making 3 nodes the minimum requirement for a
highly-available cluster.
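To illustrate the deadlock-on-commit handling described in section 4.1, here
is a minimal retry sketch. MySQL Connector/Python, the connection parameters
and the 'accounts' table are assumptions made up for this example, and the
retry limit is arbitrary:

   import mysql.connector
   from mysql.connector import errorcode

   conn = mysql.connector.connect(host="192.168.0.1", user="app",
                                  password="secret", database="test")

   def transfer(amount):
       # A certification conflict surfaces as a deadlock error at commit
       # time, so the whole transaction is re-executed, not just the COMMIT.
       for attempt in range(5):
           try:
               cur = conn.cursor()
               cur.execute("UPDATE accounts SET balance = balance - %s"
                           " WHERE id = 1", (amount,))
               cur.execute("UPDATE accounts SET balance = balance + %s"
                           " WHERE id = 2", (amount,))
               conn.commit()
               return
           except mysql.connector.Error as exc:
               conn.rollback()
               if exc.errno != errorcode.ER_LOCK_DEADLOCK:
                   raise           # unrelated error -- do not retry
       raise RuntimeError("transaction kept failing certification; giving up")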