Codership Oy
http://www.codership.com

DISCLAIMER

THIS SOFTWARE IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT SHALL
CODERSHIP OY BE HELD LIABLE TO ANY PARTY FOR ANY DAMAGES RESULTING DIRECTLY
OR INDIRECTLY FROM THE USE OF THIS SOFTWARE.

Trademark Information. All trademarks are the property of their respective
owners.

Licensing Information. Please see the COPYING file that came with this
distribution.

Source code can be found at http://www.codership.com/en/downloads/galera


GALERA v0.8.x

CONTENTS:
=========
1. WHAT IS GALERA
2. GALERA USE CASES
3. GALERA CONFIGURATION PARAMETERS
4. SPECIAL NOTES


1. WHAT IS GALERA

Galera is a synchronous multi-master replication engine that provides its
service through the wsrep API (https://launchpad.net/wsrep). It features
optimistic transaction execution with replication and certification of
writesets at commit time. Since it replicates only the final changes to the
database, it is transparent to triggers, stored procedures and
non-deterministic functions.

Galera nodes are connected to each other in an N-to-N fashion through a group
communication backend which provides automatic reconfiguration in the event
of a node failure or a new node being added to the cluster:

      ,-------.     ,-------.    ,--------.
      | node1 |-----| node2 |<---| client |
      `-------'  G  `-------'    `--------'
           \         /
           ,-------.    ,--------.
           | node3 |<---| client |
           `-------'    `--------'

Node states are synchronized by replicating transaction changes at commit
time. The cluster is virtually synchronous: each node commits transactions in
exactly the same order, although not necessarily at the same physical moment.
(The latter is not as important as it may seem, since in most cases the DBMS
gives no guarantee on when a transaction is actually processed.) Built-in
flow control keeps nodes within a fraction of a second of each other, which
is more than enough for most practical purposes.

Main features of a Galera database cluster:

 * Truly highly available: no committed transaction is ever lost in case of
   a node crash, and all nodes always have a consistent state.
 * True multi-master: all cluster nodes can handle WRITE load concurrently.
 * Highly transparent (see SPECIAL NOTES below).
 * Scalable even with WRITE-intensive applications.
 * Automatic synchronization of new nodes.


2. GALERA USE CASES

There are a number of ways in which Galera replication can be utilized. They
can be categorized into three groups:

1) Seeking High Availability only.

   In this case the client application connects to only one node, the rest
   serving as hot backups:

         ,-------------.
         | application |
         `-------------'
               |
               |
               |            DB backups
      ,-------.   ,-------.   ,-------.
      | node1 |   | node2 |   | node3 |
      `-------'   `-------'   `-------'
      <=======  cluster nodes  =======>

   In the case of primary node failure or maintenance shutdown the
   application can instantly switch to another node without any special
   failover procedure (see the connection sketch after use case 3 below).

2) Seeking High Availability and improved performance through uniform load
   distribution.

   If there are several client connections to the database, they can be
   uniformly distributed between cluster nodes, resulting in better
   performance. The exact degree of performance improvement depends on the
   application's load profile. Note that the transaction rollback rate may
   also increase.

         ,-------------.
         |   clients   |
         `-------------'
           |  |  |  |
         ,-------------.
         | application |
         `-------------'
           /    |    \
      ,-------.   ,-------.   ,-------.
      | node1 |   | node2 |   | node3 |
      `-------'   `-------'   `-------'
      <=======  cluster nodes  =======>

   In the case of a node failure the application can keep on using the
   remaining healthy nodes.

   In this setup the application can also be clustered, with a dedicated
   application instance per database node, thus achieving HA not only for
   the database but for the whole application stack:

         ,-------------.
         |   clients   |
         `-------------'
        //      ||      \\
      ,------.  ,------.  ,------.
      | app1 |  | app2 |  | app3 |
      `------'  `------'  `------'
         |         |         |
      ,-------.  ,-------.  ,-------.
      | node1 |  | node2 |  | node3 |
      `-------'  `-------'  `-------'
      <======  cluster nodes  ======>

3) Seeking High Availability and improved performance through smart load
   distribution.

   Uniform load distribution can cause an undesirably high rollback rate.
   Directing transactions which access the same set of tables to the same
   node can considerably improve performance by reducing the number of
   rollbacks. Also, if your application can distinguish between read/write
   and read-only transactions, the following configuration may be quite
   efficient:

       ,---------------------.
       |     application     |
       `---------------------'
       writes /    | reads    \ reads
      ,-------.   ,-------.   ,-------.
      | node1 |   | node2 |   | node3 |
      `-------'   `-------'   `-------'
      <=======  cluster nodes  =======>
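The following is a minimal sketch of the client-side connection handling
implied by use cases 1 and 3. Python with MySQL Connector/Python is only an
assumption here (Galera is reached through whatever database client the
application already uses), and the node addresses, credentials and database
name are made-up examples:

   import mysql.connector

   # node1 is preferred for writes, node2/node3 for reads (use case 3);
   # if the preferred node is down, the next one in the list is tried
   # (use case 1).
   WRITE_NODES = ["192.168.0.1", "192.168.0.2", "192.168.0.3"]
   READ_NODES  = ["192.168.0.2", "192.168.0.3", "192.168.0.1"]

   def connect(nodes):
       """Return a connection to the first reachable cluster node."""
       last_error = None
       for host in nodes:
           try:
               return mysql.connector.connect(host=host, port=3306,
                                              user="app", password="secret",
                                              database="test")
           except mysql.connector.Error as exc:
               last_error = exc   # node unreachable -- try the next one
       raise last_error           # no cluster node could be reached

   write_conn = connect(WRITE_NODES)  # route read/write transactions here
   read_conn  = connect(READ_NODES)   # route read-only transactions here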
3. GALERA CONFIGURATION PARAMETERS

3.1 Cluster URL.

Galera can use URL (RFC 3986) syntax for addressing, with optional parameters
passed in the URL query part. A Galera cluster address looks as follows:

   <backend>://[<cluster address>][?option1=value1[&option2=value2]]

e.g.:

   gcomm://192.168.0.1:4567?gmcast.listen_addr=0.0.0.0:5678

Currently Galera supports the following backends:

'dummy' - a bypass backend for debugging/profiling purposes. It does not
          connect to or replicate anything, and the rest of the URL address
          string is ignored.

'gcomm' - a proprietary Group Communication backend that provides Virtual
          Synchrony quality of service. It uses TCP for the membership
          service and TCP (and UDP multicast as of version 0.8) for data
          replication.

Normally one would use just the simplest form of the address URL:

   gcomm://            - if one wants to start a new cluster.
   gcomm://<address>   - if one wants to join an existing cluster. In that
                         case <address> is the address of one of the cluster
                         members.
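For example (the IP addresses below are illustrative; 4567 is the default
GMCast port), a three-node cluster could be brought up by starting the first
node with an empty cluster address and pointing the others at it:

   node1:  gcomm://
   node2:  gcomm://192.168.0.1:4567
   node3:  gcomm://192.168.0.1:4567

Any of the options described in section 3.2 below can be appended in the URL
query part, as in the gmcast.listen_addr example above.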
3.2 Gcomm options.

There are quite a few gcomm configuration options. Of particular interest to
the end user are the following:

To configure the gcomm listen address:

   gmcast.listen_addr

To configure how fast the cluster reacts to node failure or connection loss:

   evs.suspect_timeout
   evs.inactive_timeout
   evs.inactive_check_period
   evs.keepalive_period

To fine-tune performance (especially in high-latency networks):

   evs.user_send_window
   evs.send_window

3.2.1 GMCast option group.

All options in this group are prefixed by 'gmcast.' (see the example above).

group
   String denoting the group name. Maximum length of the string is 16.
   Peer nodes accept a GMCast connection only if the group names match.
   It is set automatically from wsrep options.

listen_addr
   Listening address for GMCast. The address is currently passed in URI
   format (for example tcp://192.168.3.1:4567) and it should be passed as
   the last configuration parameter in order to avoid confusion. If the
   parameter value is undefined, GMCast starts listening on all interfaces
   at the default port 4567.

mcast_addr
   Multicast address in dotted decimal notation; enables using multicast to
   transmit group communication messages. Defaults to none. Must have the
   same value on all nodes.

mcast_port
   Port used for UDP multicast messages. Defaults to the listen_addr port.
   Must have the same value on all nodes.

mcast_ttl
   Time to live for multicast packets. Defaults to 1.

3.2.2 EVS option group.

All options in this group are prefixed by 'evs.'. All values for the timeout
options below should follow the ISO 8601 standard for time interval
representation (e.g. 02:01:37.2 -> PT2H1M37.2S == PT121M37.2S == PT7297.2S).

suspect_timeout
   This timeout controls how long a node can remain silent until it is put
   under suspicion. If a majority of the current group agree that the node
   is under suspicion, it is discarded from the group and a new group view
   is formed immediately. If a majority of the group does not agree about
   the suspicion, inactive_timeout is waited before forming of a new group
   is attempted. Default value is 5 seconds.

inactive_timeout
   This timeout controls how long a node can remain completely silent until
   it is discarded from the group. This is a hard limit, unlike
   suspect_timeout, and the node is discarded even if it becomes live during
   the formation of the new group (so it is inclusive of suspect_timeout).
   Default value is 15 seconds.

inactive_check_period
   This period controls how often node liveness is checked. Default is
   1 second and there is no need to change this unless suspect_timeout or
   inactive_timeout is adjusted to a smaller value. Minimum is 0.1 seconds
   and maximum is suspect_timeout/2.

keepalive_period
   This timeout controls how often keepalive messages are sent into the
   network. Node liveness is determined with these keepalives, so the value
   should be considerably smaller than suspect_timeout. Default value is
   1 second, minimum is 0.1 seconds and maximum is suspect_timeout/3.

consensus_timeout
   This timeout defines how long forming of a new group is attempted. If
   there is no consensus after this time has passed since the start of the
   consensus protocol, every node discards all other nodes from the group
   and forming of a new group is attempted through singleton groups. Default
   value is 30 seconds, minimum is inactive_timeout and maximum is
   inactive_timeout*5.

join_retrans_period
   This parameter controls how often join messages are retransmitted during
   group formation. There is usually no need to adjust this value. Default
   value is 0.3 seconds, minimum is 0.1 seconds and maximum is
   suspect_timeout/3.

view_forget_timeout
   This timeout controls how long information about known group views is
   maintained. This information is needed to filter out delayed messages
   from previous views that are not live anymore. Default value is 5 minutes
   and there is usually no need to change it.

debug_log_mask
   This mask controls what debug information is printed in the logs if debug
   logging is turned on. The mask value is a bitwise OR of values from
   gcomm::evs::Proto::DebugFlags. By default only state information is
   printed.

info_log_mask
   This mask controls what info log is printed in the logs. The mask value
   is a bitwise OR of values from gcomm::evs::Proto::InfoFlags.

stats_report_period
   This parameter controls how often statistics information is printed in
   the log. It has effect only if statistics reporting is enabled via
   Conf::EvsInfoLogMask. Default value is 1 minute.

send_window
   This parameter controls how many messages the protocol layer is allowed
   to send without having received acknowledgements for any of them.
   Default value is 32.

user_send_window
   Like send_window, but for messages whose sending is initiated by a call
   from the upper layer. Default value is 16.
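As an illustration only (the address is an example and the values are
arbitrary, not tuning recommendations), a node on a high-latency link could
be started with relaxed failure-detection timeouts and larger send windows by
appending the options to the cluster address:

   gcomm://192.168.0.1:4567?evs.suspect_timeout=PT10S&evs.inactive_timeout=PT30S&evs.keepalive_period=PT3S&evs.user_send_window=512&evs.send_window=1024

Similarly, UDP multicast replication can be enabled through the gmcast
options, e.g. (the multicast address is an example):

   gcomm://192.168.0.1:4567?gmcast.mcast_addr=239.192.0.11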
4. SPECIAL NOTES

4.1 DEADLOCK ON COMMIT

In multi-master mode the transaction commit operation may return a deadlock
error. This is a consequence of writeset certification and is a fundamental
property of Galera (a minimal client-side retry sketch is given at the end
of this file). If deadlock on commit cannot be tolerated by the application,
Galera can still be used on the condition that all write operations to a
given table are performed on the same node. This still has an advantage over
"traditional" master-slave replication: write load can still be distributed
between nodes, and since replication is synchronous, failover is trivial.

4.2 "SPLIT-BRAIN" CONDITION

A Galera cluster is fully distributed and does not use any sort of
centralized arbitrator, thus having no single point of failure. However,
like any cluster of that kind it may fall into the dreaded "split-brain"
condition, where half or more of the cluster nodes suddenly disappear (e.g.
due to network failure). In the general case, having no information about
the fate of the disappeared nodes, the remaining nodes cannot continue to
process requests and modify their states. While such a situation is
generally considered negligibly probable in a multi-node cluster (normally
nodes fail one at a time), in a 2-node cluster a single node failure can
lead to it, thus making 3 nodes the minimum requirement for a
highly-available cluster.
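To illustrate the deadlock-on-commit handling described in section 4.1, here
is a minimal retry sketch. MySQL Connector/Python, the connection parameters
and the 'accounts' table are assumptions made up for this example, and the
retry limit is arbitrary:

   import mysql.connector
   from mysql.connector import errorcode

   conn = mysql.connector.connect(host="192.168.0.1", user="app",
                                  password="secret", database="test")

   def transfer(amount):
       # A certification conflict surfaces as a deadlock error at commit
       # time, so the whole transaction is re-executed, not just the COMMIT.
       for attempt in range(5):
           try:
               cur = conn.cursor()
               cur.execute("UPDATE accounts SET balance = balance - %s"
                           " WHERE id = 1", (amount,))
               cur.execute("UPDATE accounts SET balance = balance + %s"
                           " WHERE id = 2", (amount,))
               conn.commit()
               return
           except mysql.connector.Error as exc:
               conn.rollback()
               if exc.errno != errorcode.ER_LOCK_DEADLOCK:
                   raise           # unrelated error -- do not retry
       raise RuntimeError("transaction kept failing certification; giving up")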