Codership Oy
http://www.codership.com

DISCLAIMER

THIS SOFTWARE IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. IN NO
EVENT SHALL CODERSHIP OY BE HELD LIABLE TO ANY PARTY FOR ANY DAMAGES
RESULTING DIRECTLY OR INDIRECTLY FROM THE USE OF THIS SOFTWARE.

Trademark Information. All trademarks are the property of their respective
owners.

Licensing Information. Please see the COPYING file that came with this
distribution.

Source code can be found at http://www.codership.com/en/downloads/galera


GALERA v2.x

CONTENTS:
=========

1. WHAT IS GALERA
2. GALERA USE CASES
3. GALERA CONFIGURATION PARAMETERS
4. GALERA ARBITRATOR
5. WRITESET CACHE
6. INCREMENTAL STATE TRANSFER
7. SPECIAL NOTES


1. WHAT IS GALERA

Galera is a synchronous multi-master replication engine that provides its
service through the wsrep API (https://launchpad.net/wsrep). It features
optimistic transaction execution with commit-time replication and
certification of writesets. Since it replicates only the final changes to
the database, it is transparent to triggers, stored procedures and
non-deterministic functions.

Galera nodes are connected to each other in an N-to-N fashion through a
group communication backend which provides automatic reconfiguration in
the event of a node failure or a new node joining the cluster:

    ,-------.     ,-------.    ,--------.
    | node1 |-----| node2 |<---| client |
    `-------'  G  `-------'    `--------'
          \      /
          ,-------.    ,--------.
          | node3 |<---| client |
          `-------'    `--------'

Node states are synchronized by replicating transaction changes at commit
time. The cluster is virtually synchronous: each node commits transactions
in exactly the same order, although not necessarily at the same physical
moment. (The latter is not as important as it may seem, since in most
cases a DBMS gives no guarantee on when a transaction is actually
processed.) Built-in flow control keeps nodes within a fraction of a
second of each other, which is more than enough for most practical
purposes.

Main features of a Galera database cluster:

* Truly highly available: no committed transaction is ever lost in case of
  a node crash, and all nodes always have a consistent state.
* True multi-master: all cluster nodes can handle WRITE load concurrently.
* Highly transparent. (See SPECIAL NOTES below.)
* Scalable even with WRITE-intensive applications.
* Automatic synchronization of new nodes.


2. GALERA USE CASES

There are a number of ways in which Galera replication can be utilized.
They can be categorized in three groups:

1) Seeking High Availability only. In this case the client application
   connects to only one node, the rest serving as hot backups:

      ,-------------.
      | application |
      `-------------'
            |
            |             DB backups
            |
      ,-------.   ,-------.   ,-------.
      | node1 |   | node2 |   | node3 |
      `-------'   `-------'   `-------'
      <=======  cluster nodes  =======>

   In the case of primary node failure or maintenance shutdown the
   application can instantly switch to another node without any special
   failover procedure.

2) Seeking High Availability and improved performance through uniform load
   distribution. If there are several client connections to the database,
   they can be uniformly distributed between the cluster nodes, resulting
   in better performance. The exact degree of performance improvement
   depends on the application's load profile. Note that the transaction
   rollback rate may also increase.

         ,-------------.
         |   clients   |
         `-------------'
          |   |   |   |
         ,-------------.
         | application |
         `-------------'
          /     |     \
      ,-------. ,-------. ,-------.
      | node1 | | node2 | | node3 |
      `-------' `-------' `-------'
      <======  cluster nodes  ======>

   In the case of a node failure the application can keep on using the
   remaining healthy nodes.

   In this setup the application can also be clustered, with a dedicated
   application instance per database node, thus achieving HA not only for
   the database, but for the whole application stack:

         ,-------------.
         |   clients   |
         `-------------'
         //    ||    \\
      ,------. ,------. ,------.
      | app1 | | app2 | | app3 |
      `------' `------' `------'
         |       |       |
      ,-------. ,-------. ,-------.
      | node1 | | node2 | | node3 |
      `-------' `-------' `-------'
      <======  cluster nodes  ======>

3) Seeking High Availability and improved performance through smart load
   distribution. Uniform load distribution can cause an undesirably high
   rollback rate. Directing transactions which access the same set of
   tables to the same node can considerably improve performance by
   reducing the number of rollbacks. Also, if your application can
   distinguish between read/write and read-only transactions, the
   following configuration may be quite efficient:

      ,---------------------.
      |     application     |
      `---------------------'
      writes /   | reads   \ reads
      ,-------. ,-------. ,-------.
      | node1 | | node2 | | node3 |
      `-------' `-------' `-------'
      <=====  cluster nodes  =====>


3. GALERA CONFIGURATION PARAMETERS

3.1 Cluster URL.

Galera uses URL (RFC 3986) syntax for addressing, with optional parameters
passed in the URL query part. A Galera cluster address looks as follows:

   <backend>://<cluster address>[?option1=value1[&option2=value2]]

e.g.

   gcomm://192.168.0.1:4567?gmcast.listen_addr=0.0.0.0:5678

Currently Galera supports the following backends:

'dummy' - a bypass backend for debugging/profiling purposes. It does not
          connect to or replicate anything, and the rest of the URL
          address string is ignored.

'gcomm' - a proprietary Group Communication backend that provides Virtual
          Synchrony quality of service. It uses TCP for the membership
          service and TCP (and UDP multicast as of version 0.8) for data
          replication.

Normally one would use just the simplest form of the address URL:

   gcomm://               - to start a new cluster.

   gcomm://<node address> - to join an existing cluster. In that case
                            <node address> is the address of one of the
                            existing cluster members.
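For example, bootstrapping a two-node cluster amounts to starting the
first node with the empty address and pointing the second node at the
first (the address below is a placeholder):

   node1:  gcomm://
   node2:  gcomm://192.168.0.1:4567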
3.2 Galera parameters.

There are quite a few Galera configuration parameters which affect its
behavior and performance. Of particular interest to an end user are the
following.

To configure the gcomm listen address:

   gmcast.listen_addr

To configure how fast the cluster reacts on node failure or connection
loss:

   evs.suspect_timeout
   evs.inactive_timeout
   evs.inactive_check_period
   evs.keepalive_period

To fine-tune performance (especially in high-latency networks):

   evs.user_send_window
   evs.send_window

To relax or tighten replication flow control:

   gcs.fc_limit
   gcs.fc_factor

For a full parameter list please see
http://www.codership.com/wiki/doku.php?id=galera_parameters

3.2.1 GMCast parameter group.

All parameters in this group are prefixed by 'gmcast.' (see the example
above).

group
   String denoting the group name. Max length of the string is 16. Peer
   nodes accept a GMCast connection only if the group names match. It is
   set automatically from the wsrep options.

listen_addr
   Listening address for GMCast. The address is currently passed in URI
   format (for example tcp://192.168.3.1:4567) and it should be passed as
   the last configuration parameter in order to avoid confusion. If the
   parameter value is undefined, GMCast starts listening on all interfaces
   at the default port 4567.

mcast_addr
   Multicast address in dotted decimal notation; enables using multicast
   to transmit group communication messages. Defaults to none. Must have
   the same value on all nodes.

mcast_port
   Port used for UDP multicast messages. Defaults to the listen_addr port.
   Must have the same value on all nodes.

mcast_ttl
   Time to live for multicast packets. Defaults to 1.

3.2.2 EVS parameter group.

All parameters in this group are prefixed by 'evs.'. All values for the
timeout options below should follow the ISO 8601 standard for time
interval representation (e.g. 02:01:37.2 -> PT2H1M37.2S == PT121M37.2S ==
PT7297.2S).

suspect_timeout
   This timeout controls how long a node can remain silent until it is put
   under suspicion. If the majority of the current group agree that the
   node is under suspicion, it is discarded from the group and a new group
   view is formed immediately. If the majority of the group does not agree
   about the suspicion, inactive_timeout is waited until forming of a new
   group is attempted. Default value is 5 seconds.

inactive_timeout
   This timeout controls how long a node can remain completely silent
   until it is discarded from the group. This is a hard limit, unlike
   suspect_timeout, and the node is discarded even if it becomes live
   during the formation of the new group (so it is inclusive of
   suspect_timeout). Default value is 15 seconds.

inactive_check_period
   This period controls how often node liveness is checked. Default is 1
   second and there is no need to change this unless suspect_timeout or
   inactive_timeout is adjusted to a smaller value. Minimum is 0.1 seconds
   and maximum is suspect_timeout/2.

keepalive_period
   This parameter controls how often keepalive messages are sent into the
   network. Node liveness is determined with these keepalives, so the
   value should be considerably smaller than suspect_timeout. Default
   value is 1 second, minimum is 0.1 seconds and maximum is
   suspect_timeout/3.

consensus_timeout
   This timeout defines how long forming of a new group is attempted. If
   there is no consensus after this time has passed since the start of the
   consensus protocol, every node discards all other nodes from the group
   and forming of a new group is attempted through singleton groups.
   Default value is 30 seconds, minimum is inactive_timeout and maximum is
   inactive_timeout*5.

join_retrans_period
   This parameter controls how often join messages are retransmitted
   during group formation. There is usually no need to adjust this value.
   Default value is 0.3 seconds, minimum is 0.1 seconds and maximum is
   suspect_timeout/3.

view_forget_timeout
   This timeout controls how long information about known group views is
   maintained. This information is needed to filter out delayed messages
   from previous views that are not live anymore. Default value is 5
   minutes and there is usually no need to change it.

debug_log_mask
   This mask controls what debug information is printed in the logs if
   debug logging is turned on. The mask value is a bitwise OR of values
   from gcomm::evs::Proto::DebugFlags. By default only state information
   is printed.

info_log_mask
   This mask controls what info log is printed in the logs. The mask value
   is a bitwise OR of values from gcomm::evs::Proto::InfoFlags.

stats_report_period
   This parameter controls how often statistics information is printed in
   the log. It has effect only if statistics reporting is enabled via
   Conf::EvsInfoLogMask. Default value is 1 minute.

send_window
   This parameter controls how many messages the protocol layer is allowed
   to send without getting all acknowledgements for any of them. Default
   value is 32.

user_send_window
   Like send_window, but for messages whose sending is initiated by a call
   from the upper layer. Default value is 16.
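To illustrate the failure-detection timeouts in this group: a cluster on a
fast, reliable LAN could be made to react to node failures sooner by
shrinking the timeouts together, keeping the relations described above.
The address and values below are illustrative placeholders, not
recommendations:

   gcomm://192.168.0.1?evs.keepalive_period=PT0.3S&evs.suspect_timeout=PT1S&evs.inactive_timeout=PT3S&evs.inactive_check_period=PT0.3S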
3.2.3 GCS parameter group.

All parameters in this group are prefixed by 'gcs.'.

fc_debug
   Post debug statistics about SST flow control every that many writesets.
   Default: 0.

fc_factor
   Resume replication after the recv queue drops below this fraction of
   gcs.fc_limit. Default: 0.5.

fc_limit
   Pause replication if the recv queue exceeds that many writesets.
   Default: 16. For master-slave setups this can be increased
   considerably.

fc_master_slave
   Should we assume that there is only one master in the group? Default:
   NO.

max_packet_size
   All writesets exceeding that size will be fragmented. Default: 32616.

max_throttle
   How much the replication rate can be throttled during state transfer
   (to avoid running out of memory). Set it to 0.0 if stopping replication
   is acceptable for the sake of completing state transfer. Default: 0.25.

recv_q_hard_limit
   Maximum allowed size of the recv queue. This should normally be half of
   (RAM + swap). If this limit is exceeded, Galera will abort the server.
   Default: LLONG_MAX.

recv_q_soft_limit
   A fraction of gcs.recv_q_hard_limit after which the replication rate
   will be throttled. Default: 0.25. The degree of throttling is a linear
   function of the recv queue size and goes from 1.0 ("full rate") at
   gcs.recv_q_soft_limit to gcs.max_throttle at gcs.recv_q_hard_limit.
   Note that "full rate", as estimated between 0 and
   gcs.recv_q_soft_limit, is a very rough estimate of the regular
   replication rate.
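For example, in a deliberately master-slave style deployment (all writes
directed to a single node) flow control can be relaxed roughly as follows;
the address and values are illustrative placeholders:

   gcomm://192.168.0.1?gcs.fc_limit=256&gcs.fc_factor=0.9&gcs.fc_master_slave=YES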
3.2.4 Replicator parameter group.

All parameters in this group are prefixed by 'replicator.'.

commit_order
   Whether to allow Out-Of-Order committing (improves parallel applying
   performance). Possible settings:

   0 - BYPASS:     all commit order monitoring is turned off (useful for
                   measuring the performance penalty)
   1 - OOOC:       allow out-of-order committing for all transactions
   2 - LOCAL_OOOC: allow out-of-order committing only for local
                   transactions
   3 - NO_OOOC:    no out-of-order committing is allowed (strict
                   total-order committing)

   Default: 3.

3.2.5 GCache parameter group.

All parameters in this group are prefixed by 'gcache.'.

dir
   Directory where GCache should place its files. Default: the working
   directory of the process.

name
   Name of the main store file (ring buffer). Default: "galera.cache".

size
   Size of the main store file (ring buffer). This space is preallocated
   on startup. Default: 128Mb.

page_size
   Size of a page in the page store. The limit on the overall page store
   is the free disk space. Page files are prefixed by "gcache.page".
   Default: 128Mb.

keep_pages_size
   Total size of the page store pages to keep for caching purposes. If
   only page storage is enabled, one page is always present. Default: 0.

mem_size
   Size of the malloc() store (read: RAM), for configurations with spare
   RAM. Default: 0.
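For instance, a donor intended to serve incremental state transfers (see
section 6) might keep a larger ring buffer on a dedicated volume. A
sketch, where the path and size are placeholders and the 'G' size suffix
is assumed to be accepted alongside plain byte values:

   gcomm://192.168.0.1?gcache.dir=/var/lib/galera&gcache.size=1G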
3.2.6 SSL parameter group.

All parameters in this group are prefixed by 'socket.'.

ssl_cert
   Certificate file in PEM format.

ssl_key
   A private key for the certificate above, unencrypted, in PEM format.

ssl
   A boolean value to disable SSL even if the certificate and key are
   configured. Default: yes (SSL is enabled if ssl_cert and ssl_key are
   set).

To generate a private key/certificate pair the following command may be
used:

   $ openssl req -new -x509 -days 365000 -nodes -keyout key.pem -out cert.pem

Using short-lived certificates (1 year in most web examples) is not
advised, as it will lead to a complete cluster shutdown when the
certificate expires.
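The generated key and certificate can then be referenced from each node's
cluster address, for example (the file paths are placeholders):

   gcomm://192.168.0.1?socket.ssl_cert=/etc/galera/cert.pem&socket.ssl_key=/etc/galera/key.pem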
3.2.7 Incremental State Transfer parameter group.

All parameters in this group are prefixed by 'ist.'.

recv_addr
   Address to receive incremental state transfer at. Setting this
   parameter turns on incremental state transfer. IST will use SSL if SSL
   is configured as described above. No default.


4. GALERA ARBITRATOR

The Galera arbitrator found in this package is a small stateless daemon
that can serve as a lightweight cluster member in order to

* avoid the split-brain condition in a cluster with an otherwise even
  membership;
* request consistent state snapshots for backup purposes.

Example usage:

                   ,---------.
                   |  garbd  |
                   `---------'
  ,---------.           |           ,---------.
  | clients |           |           | clients |
  `---------'           |           `---------'
           \            |            /
            \         ,---.         /
             \       ('   `)       /
                     ( WAN )
             /       (.   ,)       \
            /         `---'         \
   ,---------.                 ,---------.
   |  node1  |                 |  node2  |
   `---------'                 `---------'
  Data Center 1               Data Center 2

In this example, if one of the data centers loses its WAN connection, the
node that still sees the arbitrator (and therefore sees clients) will
continue operation.

garbd accepts the same Galera options as a regular Galera node. Note that
at the moment garbd needs to see all replication traffic (although it does
not store it anywhere), so placing it in a location with poor connectivity
to the rest of the cluster may lead to cluster performance degradation.

Arbitrator failure does not affect cluster operation, and a new instance
can be reattached to the cluster at any time. There can be several
arbitrators in the cluster, although the practicality of that is
questionable.
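A minimal invocation sketch for attaching an arbitrator to a running
cluster (the address and group name are placeholders; consult garbd --help
for the exact option set shipped with your version):

   $ garbd -a gcomm://192.168.0.1:4567 -g my_cluster -d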
5. WRITESET CACHE

Starting with version 1.0 Galera stores writesets in a special cache. Its
purpose is to improve control of Galera memory usage and to offload
writeset storage to disk. The Galera cache has 3 types of stores:

1. A permanent in-memory store, where writesets are allocated by the
   default OS memory allocator. It can be useful for systems with spare
   RAM. It has a hard size limit. By default it is disabled (size set to
   0).

2. A permanent ring-buffer file which is preallocated on disk during cache
   initialization. It is intended as the main writeset store. By default
   its size is 128Mb.

3. An on-demand page store, which allocates memory-mapped page files
   during runtime as necessary. The default page size is 128Mb, but a page
   can be bigger if it needs to store a bigger writeset. The size of the
   page store is limited by the free disk space. By default page files are
   deleted when no longer in use, but a limit can be set on the total size
   of the page files to keep. When all other stores are disabled, at least
   one page file is always present on disk.

The allocation algorithm is as follows: all stores are tried in the above
order. If a store does not have enough space to allocate a writeset, the
next store is tried. The page store should always succeed, unless the
writeset is bigger than the available disk space.

By default the Galera cache allocates files in the working directory of
the process, but a dedicated location can be specified. For the
configuration parameters see the GCache group above (section 3.2.5).

NOTE: Since all cache files are memory-mapped, the process may appear to
use more memory than it actually does.


6. INCREMENTAL STATE TRANSFER (IST)

Galera 2.x introduces long-awaited functionality: incremental state
transfer. The idea is that if

a) the joining node's state UUID is the same as that of the group, and
b) all of the writesets that it missed can be found in the donor's GCache,

then instead of a whole state snapshot the joiner will receive the missing
writesets and catch up with the group by replaying them. For example:

 - the local node state is 5a76ef62-30ec-11e1-0800-dba504cf2aab:197222
 - the group state is      5a76ef62-30ec-11e1-0800-dba504cf2aab:201913
 - if writeset number 197223 is still in the donor's GCache, the donor
   will send writesets 197223-201913 to the joiner instead of the whole
   state.

IST can dramatically speed up remerging a node into the cluster. It is
also non-blocking on the donor.

The most important parameter for IST (besides 'ist.recv_addr') is the
GCache size on the donor. The bigger it is, the more writesets can be
stored in it, and the bigger seqno gaps can be closed with IST. On the
other hand, if the GCache is much bigger than the state size, serving IST
may be less efficient than sending a state snapshot.


7. SPECIAL NOTES

7.1 DEADLOCK ON COMMIT

In multi-master mode a transaction commit operation may return a deadlock
error. This is a consequence of writeset certification and is a
fundamental property of Galera. If deadlock on commit cannot be tolerated
by the application, Galera can still be used on the condition that all
write operations to a given table are performed on the same node. This
still has an advantage over "traditional" master-slave replication: write
load can still be distributed between nodes, and since replication is
synchronous, failover is trivial.

7.2 "SPLIT-BRAIN" CONDITION

A Galera cluster is fully distributed and does not use any sort of
centralized arbitrator, thus having no single point of failure. However,
like any cluster of that kind, it may fall into the dreaded "split-brain"
condition, where half of the cluster nodes suddenly disappear (e.g. due to
a network failure). In the general case, having no information about the
fate of the disappeared nodes, the remaining nodes cannot continue to
process requests and modify their states. While such a situation is
generally considered negligibly probable in a multi-node cluster (normally
nodes fail one at a time), in a 2-node cluster a single node failure can
lead to this, thus making 3 nodes the minimum requirement for a
highly-available cluster. The Galera arbitrator (see above) can serve as
an odd stateless cluster node to help avoid the possibility of an even
cluster split.