Codership Oy
http://www.codership.com
<info@codership.com>

DISCLAIMER

THIS SOFTWARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
IN NO EVENT SHALL CODERSHIP OY BE HELD LIABLE TO ANY PARTY FOR ANY DAMAGES
RESULTING DIRECTLY OR INDIRECTLY FROM THE USE OF THIS SOFTWARE.

Trademark Information.

MySQL is a trademark or registered trademark of Sun Microsystems, Inc. or
its subsidiaries in the US and other countries. Other marks are the property
of their respective owners.

Licensing Information.

Please see ./mysql/LICENSE.mysql and ./galera/LICENSE.galera

Project source code can be found at
wsrep API:    https://launchpad.net/wsrep/
MySQL patch:  https://launchpad.net/codership-mysql/
Galera libs:  https://launchpad.net/galera/

ABOUT THIS DOCUMENT

This document covers issues specific to this MySQL/Galera demo distribution.
It does not cover the use or administration of MySQL server per se. The reader
is assumed to know how to install, configure, administer and use MySQL server.


                        MYSQL/GALERA v0.8.x

CONTENTS:
=========
0. WHAT IS MYSQL/GALERA CLUSTER
1. CLUSTER SETUP
   1.1 INSTALLATION
   1.2 CLUSTER URL
   1.3 STARTING THE FIRST NODE OF A CLUSTER
   1.4 STARTING REMAINING NODES
2. USING THE CLUSTER
   2.1 LOADING DATA TO A CLUSTER
   2.2 CONNECTING APPLICATION TO A CLUSTER
   2.3 LOAD BALANCER
   2.4 ADDING NEW NODE TO A CLUSTER
   2.5 A "REFERENCE NODE"
   2.6 "SPLIT-BRAIN" CONDITION
   2.7 SWAP SIZE REQUIREMENTS
3. CONFIGURATION
   3.1 MANDATORY MYSQL OPTIONS
   3.2 OPTIONAL OPTIMIZATIONS
   3.3 WSREP OPTIONS
   3.4 CONFIGURING LOCAL MYSQL CLIENTS
   3.5 FIREWALLS
4. Using MySQL/Galera in Amazon EC2
5. LIMITATIONS


0. WHAT IS MYSQL/GALERA CLUSTER

MySQL/Galera cluster is a synchronous multi-master MySQL server cluster. Cluster
nodes are connected to each other in a N-to-N fashion through a group
communication backend which provides automatic reconfiguration in the event of
a node failure or a new node added to cluster:

     ,--------.    ,--------.    ,--------.
     | mysqld |----| mysqld |<---| client |
     `--------'    `--------'    `--------'
              \    /
            ,--------.    ,--------.
            | mysqld |<---| client |
            `--------'    `--------'

With few exceptions each cluster node can be used like a standalone MySQL server
and supports most of MySQL features including transactions, triggers and stored
procedures.

Node states are synchronized by replicating transaction changes at commit time.
The cluster is virtually synchronous: this means that each node commits
transactions in exactly the same order, although not necessarily at the same
physical moment. (The latter is not that important as it may seem, since in most
cases DBMS gives no guarantee on when the transaction is actually processed.)
Built-in flow control keeps nodes within fraction of a second from each other,
this is more than enough for most practical purposes.

Main features of MySQL/Galera cluster:

* Truly highly available: no committed transaction is ever lost in case of a
  node crash. All nodes always have consistent state.

* True multi-master: all cluster nodes can modify the same table concurrently.

* Highly transparent: the cluster is intended as a drop-in replacement for a 
  single MySQL server. (See LIMITATIONS below)

* Scalable even with WRITE-intensive applications.

This demo distribution contains all software you'll need to setup MySQL/Galera
cluster. It is essentially a self-contained MySQL server installation with its
own configuration file, data directory and preconfigured empty database.
You don't need administrative privileges or to uninstall/disable previously
installed MySQL server to use this distribution.


1. CLUSTER SETUP

To setup MySQL/Galera cluster you will need several networked computers -
one for each mysqld instance you plan to use. For best performance those
computers should be of approximately same configuration: Galera replication is
synchronous and one overloaded machine may slow down the whole cluster. This
however depends on load distribution. The node that does not handle client
connections can be considerably slower.

It takes 3 steps to set up the cluster:

1) Copy this distribution to all prospective nodes of the cluster and unpack it
   to location of your choice.

2) Start the first node to begin a new cluster.

3) Start remaining nodes pointing to the first one.

(NOTE: You can easily set up the cluster on a single computer. However this
makes little sense, as you won't see the the benefits of high availability and
scalability. Hence it is not covered by this document.)


1.1 INSTALLATION

Just copy and unpack the distribution on the prospective cluster nodes to
wherever you have privileges. The distribution was designed to be able to run
on most systems without reconfiguration. It is a self-contained MySQL 
installation and comes with its own data directory and a preconfigured empty 
database with users 'root' (password 'rootpass') and 'test' (password 
'testpass', privileges on schema 'test'). As a result default installation will 
require at least 1Gb of free space for InnoDB files (will be created on first 
start).

This requirement, as well as other MySQL and Galera options can be changed by
editing configuration file which can be found at <INST_ROOT>/mysql/etc/my.cnf.

Please see CONFIGURATION chapter for the details on editable parameters.


1.2 CLUSTER URL

Cluster URL is a connection handle that will be used by a new node to connect
to the rest of the cluster. Its form is backend-specific and backend is
determined by URL schema. Default is 'dummy' which means no replication.
This demo comes with a distributed group communication backend which schema is
'gcomm'.


1.3 STARTING THE FIRST NODE OF A CLUSTER

<INST_ROOT>/mysql-galera is a special MySQL startup script that sets proper
options (including data directory path) for mysqld. If you're running it as a
superuser, you have to make sure there is 'mysql' user in the system and it has
sufficient privileges on the installation directory (see MySQL Manual about
running mysqld as root). 

The first node of a cluster has nowhere to connect to, therefore it has to start
with an empty cluster address (note that it still initializes gcomm backend):

   <INST_ROOT>/mysql-galera -g gcomm:// start


1.4 STARTING REMAINING NODES

To add another node to the cluster it must be given the address of one of the
existing cluster nodes.

Thus, if the first cluster node has IP address 192.168.1.1, then the second will
be started like this:

   <INST_ROOT>/mysql-galera -g gcomm://192.168.1.1 start

The third node can use either the first or the second node address and so on.

It might take few minutes to start mysqld for the first time as it will have to
create required InnoDB files.

For full description of mysql-galera options and commands see:

   <INST_ROOT>/mysql-galera --help


2. USING THE CLUSTER

After you have successfully started all cluster nodes, the cluster is ready to
use.

From the client point of view each cluster node works like a usual MySQL server
- client-side application does not have to be changed in any way. Each node can
be accessed independently and asynchronously. Just direct SQL load to any one or
more of the cluster nodes. For most practical purposes you can treat
MySQL/Galera cluster as a single MySQL server listening on multiple interfaces
with the exception that you might see transaction deadlocks where you previously
didn't.


2.1 LOADING DATA TO CLUSTER

Initially distribution database is empty. You can populate it by loading the
dump of your data to any one of the nodes. It will be automatically replicated
to others. Please note that this release supports only InnoDB storage engine.


2.2 CONNECTING APPLICATION TO CLUSTER

As was mentioned above, for the client application each node looks like a normal
MySQL server and can be used independently. This creates considerable
flexibility in the way the cluster can be utilized. The approaches can be
categorized in three groups:

1) Seeking High Availability only. It is similar to traditional MySQL
   master-slave replication. In this case client application connects to only
   one node, the rest serving as hot backups / read-only slaves:

   ,-------------.
   | application |
   `-------------'
        | | |       DB backups/read-only slaves
      ,-------. ,-------. ,-------.
      | node1 | | node2 | | node3 |
      `-------' `-------' `-------'
       <===== cluster nodes =====>

   In the case of primary node failure application can instantly switch to
   another node without any preparations. This is also a most transparent mode:
   COMMITs will never return deadlocks and table locks can be used too.

2) Seeking High Availability and improved performance through uniform load
   distribution. If there are several client connections to the database, they
   can be uniformly distributed between cluster nodes resulting in better
   performance. The exact degree of performance improvement depends on
   application's SQL profile. Note, that transaction rollback rate may also 
   increase.

             ,-------------.
             |   clients   |
             `-------------'
                 | | | |
             ,-------------.
             | application |
             `-------------'
             /      |      \
      ,-------. ,-------. ,-------.
      | node1 | | node2 | | node3 |
      `-------' `-------' `-------'
       <===== cluster nodes =====>

   In the case of a node failure application can keep on using the remaining 
   healthy nodes.

   In this setup application can also be clustered with a dedicated application
   instance per database node, thus achieving HA not only for the database,
   but for the whole application stack:

             ,-------------.
             |   clients   |
             `-------------'
             //     ||     \\
      ,------.   ,------.   ,------.
      | app1 |   | app2 |   | app3 |
      `------'   `------'   `------'
         |          |          |
     ,-------.  ,-------.  ,-------.
     | node1 |  | node2 |  | node3 |
     `-------'  `-------'  `-------'
      <====== cluster nodes ======>

3) Seeking High Availability and improved performance through smart load
   distribution. Uniform load distribution can cause undesirably high rollback
   rate. Directing transactions which access the same set of tables to the
   same node can considerably improve performance by reducing the number of
   rollbacks. Also, if your application can distinguish between read/write and
   read-only transactions, the following configuration may be quite efficient:

             ,---------------------.
             |     application     |
             `---------------------'
       writes /         | reads    \ reads
      ,-------.     ,-------.     ,-------.
      | node1 |     | node2 |     | node3 |
      `-------'     `-------'     `-------'
       <========= cluster nodes =========>


2.3 LOAD BALANCER

If your application cannot utilize several database servers (most don't) you
will need to use SQL proxy or a TCP load balancer to distribute load between
the MySQL/Galera nodes. This is needed not only to increase performance, but
also for a quick switch in case of a node failure. If performance of your
application is DBMS-bound, you can run the balancer on the same machine as
application/client. Be aware, however, that SQL load balancing might be a CPU
hungry operation: usually SQL traffic consists of many small packets. For best
results we recommend to carefully examine CPU consumption of the balancer and
if needed dedicate a separate machine for it.

Unlike traditional MySQL master-slave cluster, MySQL/Galera cluster does not
require any SQL traffic filtering, it is highly transparent and plain TCP
connection balancer will suffice.

TCP connection balancers that were successfully used with MySQL/Galera:
- Pen (http://siag.nu/pen/)
- GLB (http://www.codership.com/en/downloads/glb) 


2.4 ADDING NEW NODE TO A CLUSTER

With 0.7 series of releases Codership removes the main obstacle towards using 
MySQL/Galera in production: inability to add/replace nodes in the working 
cluster. This distribution features automatic state snapshot transfer to newly 
joined node. Until node receives state snapshot it won't execute any queries. 
Detailed state snapshot transfer sequence diagram can be found in
http://www.codership.com/files/presentations/Galera_OSC_2009.pdf

The process of joining new node into the cluster consists of two phases:

1) State snapshot transfer (SST) from an established cluster member.
   Depending on the SST method neither joining node, nor SST donor can apply
   any transactions for the duration of this phase. Transactions replicated by
   other nodes are buffered in the queue.

2) Catch-up phase when both donor and joiner try to catch up with the cluster
   by applying transactions buffered in the queue. Using them as working nodes
   should be avoided. Duration of this phase depends on the load profile and
   duration of the first phase.

NOTE: Transaction buffering is currently happening in memory, so prepare
      enough swap space.

By default cluster chooses the most suitable node to receive state transfer 
from. There is also an option wsrep_sst_donor to specify desired state snapshot 
source in my.cnf or on the command line. See CONFIGURATION section for 
descriptions of all relevant configuration options. In most situations (like 
cluster on EC2) this distribution should work with default settings.

At this point there is only one state transfer method supported and it is based 
on mysqldump. Although it is relatively slow, it provides complete cloning of 
the donor state, including system tables and thus is most compliant.


2.5 A "REFERENCE NODE"

For practical purposes we recommend to reserve a "reference node" in the
cluster. A "reference node" is a node that does not receive SQL load. Having 
such node in a cluster serves several purposes:

1) Data consistency: since this node does not process any SQL load on its own,
   it has the lowest probability of transaction conflicts and therefore -
   indeterministic conflict resolution. In the event of discovered database
   inconsistencies in the cluster this node will have the most relevant
   database.

2) Data safety: since this node does not process any SQL load on its own, it
   has the lowest probability of failing with catastrophic consequences. In
   the event of total cluster failure (e.g. blackout) this will be the best
   node to restore cluster from.

3) High availability: a reference node can serve as a dedicated state snapshot
   donor. Since it does not serve any clients, they won't experience service
   interruptions and load balancer won't need reconfiguration during SST.

Even with the current TCP-based group communication the overhead of having one
extra silent node is negligible for most loads.


2.6 "SPLIT-BRAIN" CONDITION

Galera cluster is fully distributed and does not use any sort of centralized
arbitrator, thus having no single point of failure. However, like any cluster
of that kind it may fall to a dreaded "split-brain" condition where half or
more nodes of the cluster suddenly disappear (e.g. due to network failure).
In general case, having no information about the fate of disappeared nodes,
remaining nodes cannot continue to process requests and modify their states.

While such situation is generally considered negligibly probable in a multi-node
cluster (normally nodes fail one by one), in 2-node cluster a single node
failure can lead to this, thus making 3 nodes a minimum requirement for
a highly-available cluster.


2.7 SWAP SPACE REQUIREMENTS

While in normal operation MySQL/Galera server requires only slightly more
memory than stock MySQL, there are two situations when extra swap will be
needed:
  - forking processes (state transfer and notification scripts).
  - caching transactions during state transfer.
It is recommentded that swap size is at least 2 times that of RAM.

NOTE: It is a common practice to have 0 swap on Amazon EC2 instances
      by default. Make sure to allocate sufficient amount before running
      MySQL/Galera.


3. CONFIGURATION

Each MySQL/Galera node is configured just like the usual MySQL server, we just
added some configuration variables to my.cnf. In addition some options can be
passed to mysql-galera startup script (see mysql-galera --help).


3.1 MANDATORY MYSQL OPTIONS

binlog_format=ROW
   This option is required to use row-level replication as opposed to
   statement-level. For performance and consistency considerations don't change
   that. As a side effect, binlog, if turned on, can be ROW only. In future this
   option won't have special meaning.

innodb_autoinc_lock_mode=2
   This is a required parameter. Without it INSERTs into tables with
   AUTO_INCREMENT column may fail.
   autoinc lock modes 0 and 1 can cause unresolved deadlock, and make
   system unresponsive.

innodb_locks_unsafe_for_binlog=1
   This setting is required for relaiable parallel applying operation.

Mandatory options are hardcoded both in distribution's my.cnf file and in
mysql-galera script.


3.2 OPTIONAL OPTIMIZATIONS

While not required for correct operation of MySQL/Galera cluster, the
following options may be safely set due to the guarantees of synchronous
replication:

innodb_flush_log_at_trx_commit=0


3.3 WSREP OPTIONS

Here WSREP stands for Write-Set REPlication - a synchronous replication API
that Codership is developing for transactional databases. Galera is a library
that implements WSREP services. Default values are shown.

All options are optional except for wsrep_provider and wsrep_cluster_address.

wsrep_provider=none
   A full path to the library that implements WSREP interface. If none is
   specified, the server behaves almost like a normal mysqld, with slight
   overhead. mysql-galera script automatically substitutes it to point to
   Galera implementation shipped with the distribution. It can be overridden
   with WSREP environment variable.

wsrep_provider_options=
   Provider-specific option string. See http://www.codership.com/wiki for
   details.

wsrep_cluster_address="dummy://"
   Group Communication System address. Depends on the WSREP provider. This
   distribution recognizes "dummy://" and "gcomm://<address>[:port]"
   Default port is 4567.

   mysql> set global wsrep_cluster_address=<ADDRESS>;

   will (re)establish connection to ADDRESS. This can be used to change cluster
   connection in runtime.

wsrep_cluster_name="my_wsrep_cluster"
   Logical cluster name, must be the same for all nodes of the cluster.

wsrep_node_name=
   Human readable node name. Defaults to hostname.

wsrep_slave_threads=1
   Number of threads dedicated to processing of writesets from other nodes.
   For better performance we recommend few per CPU core.

wsrep_dbug_option
   Options for the built-in DBUG library (independent from what MySQL uses).
   Empty by default. Not used in 0.8.

wsrep_local_cache_size=20M
   Amount of RAM reserved for writeset cache in bytes. Excess writesets are
   stored on the hard drive in MySQL data directory

wsrep_ws_persistency=0
   Save writesets in binary files. For debugging purposes only.
   Not used in 0.8.

wsrep_debug=0
   Enable debug-level logging.

wsrep_convert_LOCK_to_trx=0
   Implicitly convert locking sessions into transactions inside mysqld. By
   itself it does not mean support for locking sessions, but it prevents the
   database from going into logically inconsistent state. Disabled by default
   because of possible memory issues with DB dumps that contain LOCK statements.

wsrep_retry_autocommit=1
   Retry autocommit queries and single statement transactions should they fail
   certification test. This is analogous to rescheduling an autocommit query
   should it go into deadlock with other transactions in the database lock
   manager.

wsrep_auto_increment_control=1
   Automatically adjust auto_increment_increment and auto_increment_offset
   variables based on the number of nodes in the cluster. Significantly reduces
   certification conflic rate for INSERTS.

wsrep_drupal_282555_workaround=1
   MySQL seems to have an obscure bug when INSERT into table with
   AUTO_INCREMENT column with NULL value for that column can fail with a 
   duplicate key error. When this option is on, it retries such INSERTs.
   Required for stable Drupal operation. Documented at:
      http://bugs.mysql.com/bug.php?id=41984
      http://drupal.org/node/282555

wsrep_sst_method=mysqldump
   What method to use to copy database state to a newly joined node. Currently
   supported methods:
   - mysqldump: generally slow (except on small datasets), but most tested
   - rsync:     the fastest method, especially on large datasets
   - rsync_wan: same as rsync, but uses deltaxfer to minimize network traffic.

wsrep_sst_receive_address=
   Address at which this node wants to receive state snapshot. Defaults to
   mysqld bind address, and if that is not specified (0.0.0.0) - to the first
   IP of eth0 + mysqld bind port.

wsrep_sst_auth=
   Authentication information needed for state transfer. Depends on the state
   transfer method. For mysqldump-based SST it is 
   <mysql_root_user>:<mysql_root_password>
   and should be the same on all nodes - it is used to authenticate with both
   state snapshot receiver and state snapshot donor. In this distribution it is
   preconfigured to "root:rootpass".

wsrep_sst_donor=
   A name of the node which should serve as state snapshot donor. This allows
   to control which node will serve state snapshot request. By default the
   most suitable node is chosen by GCS.


3.4 CONFIGURING LOCAL MYSQL CLIENTS

This MySQL/Galera distribution runs mysqld in the "sandbox". Thus mysql clients
won't find mysqld socket at default system location. Running mysql client
without explicitly specifying socket or port (via --socket or --host/--port
options) may, therefore, result in the following:

$ mysql -uroot -prootpass
ERROR 2002 (HY000): Can't connect to local MySQL server through socket
'/var/run/mysqld/mysqld.sock' (2)

Most applications that use libmysqlclient to connect to MySQL server can be
instructed to look in the correct place by adding a following section to
system-wide my.cnf file:

[client]
socket = <INST_ROOT>/mysql/var/mysqld.sock

3.5 FIREWALLS

If there are any firewalls used, they should be configured to allow connections
between the nodes at the following ports:
3306 - for mysqldump state snapshot transfer
4567 - for replication traffic

E.g. to configure iptables to allow connections from a local subnet:

iptables -A INPUT -i eth0 -p tcp -m tcp \
         --source 192.168.0.1/24 --dport 3306 -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -m tcp \
         --source 192.168.0.1/24 --dport 4567 -j ACCEPT

Substitute real values for IP address of your node and netmask. Better yet,
use VPN.

4. Using MySQL/Galera distribution in Amazon EC2

MySQL/Galera works anywhere TCP/IP works. Therefore using MySQL/Galera
distribution in Amazon EC2 environment is no different than in LAN. Just launch
several instances of your favorite AMI, copy and unpack the distribution,
and start the servers. Don't forget to use external addresses if your nodes are 
running in different accessibility zones (obviously running in different 
accessibility zones degrades performance somewhat).

NOTE: this distribution may be binary incompatible with some older Linux
      distributions. Please use CentOS 5.0 or newer.


5. LIMITATIONS

1) Currently replication works only with InnoDB storage engine. Any writes to 
   tables of other types, including system (mysql.*) tables are not replicated. 
   However, DDL statements are replicated in statement level, and changes
   to mysql.* tables will get replicated that way. 
   So, you can safely issue: CREATE USER..., 
   but issuing: INSERT INTO mysql.user..., will not be replicated.

2) DELETE operation is not supported on tables without primary keys.
   Rows in tables without primary keys may appear in different order on
   different nodes. As a result SELECT...LIMIT... may return slightly different
   sets.

3)  Unsupported queries:
    * LOAD DATA size is limited to ~1Gb
    * lock functions (GET_LOCK(), RELEASE_LOCK()... )

4) Locking sessions (LOCK TABLES...UNLOCK) are not supported in multi-master
   mode. However if there's only one node that executes locking sessions, then
   it'll work.

5) Transaction isolation level should be REPEATABLE_READ (the default).
   Galera implements implicit snapshot isolation for cluster wide
   transactions.

6) Due to cluster level optimistic concurrency control, transaction issuing
   COMMIT may still be aborted at that stage. There can be two transactions 
   writing to same rows and committing in separate cluster nodes, and only one
   of the them can successfully commit. The failing one, will be aborted.
   For cluster level aborts, MySQL/galera cluster gives back deadlock error 
   code (Error: 1213 SQLSTATE: 40001  (ER_LOCK_DEADLOCK)).

7) Query log cannot be directed to table. If you enable query logging,
   you must forward the log to a file:
       log_output = FILE
   Use general_log and general_log_file to choose query logging and the 
   log file name

8) XA transactions can not be supported due to possible rollback on commit.