Codership Oy
http://www.codership.com

DISCLAIMER

THIS SOFTWARE IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. IN NO
EVENT SHALL CODERSHIP OY BE HELD LIABLE TO ANY PARTY FOR ANY DAMAGES
RESULTING DIRECTLY OR INDIRECTLY FROM THE USE OF THIS SOFTWARE.

Trademark Information. All trademarks are the property of their respective
owners.

Licensing Information. Please see the COPYING file that came with this
distribution.

Source code can be found at http://www.codership.com/en/downloads/galera


GALERA v2.x

CONTENTS:
=========

1. WHAT IS GALERA
2. GALERA USE CASES
3. GALERA CONFIGURATION PARAMETERS
4. GALERA ARBITRATOR
5. WRITESET CACHE
6. INCREMENTAL STATE TRANSFER
7. SPECIAL NOTES


1. WHAT IS GALERA

Galera is a synchronous multi-master replication engine that provides its
service through the wsrep API (https://launchpad.net/wsrep). It features
optimistic transaction execution with commit-time replication and
certification of writesets. Since it replicates only the final changes to
the database, it is transparent to triggers, stored procedures and
non-deterministic functions.

Galera nodes are connected to each other in an N-to-N fashion through a
group communication backend which provides automatic reconfiguration in
the event of a node failure or a new node joining the cluster:

    ,-------.     ,-------.    ,--------.
    | node1 |-----| node2 |<---| client |
    `-------'  G  `-------'    `--------'
          \      /
          ,-------.    ,--------.
          | node3 |<---| client |
          `-------'    `--------'

Node states are synchronized by replicating transaction changes at commit
time. The cluster is virtually synchronous: each node commits transactions
in exactly the same order, although not necessarily at the same physical
moment. (The latter is not as important as it may seem, since in most
cases a DBMS gives no guarantee on when a transaction is actually
processed.) Built-in flow control keeps nodes within a fraction of a
second of each other, which is more than enough for most practical
purposes.

Main features of a Galera database cluster:

* Truly highly available: no committed transaction is ever lost in case of
  a node crash, and all nodes always have a consistent state.
* True multi-master: all cluster nodes can handle WRITE load concurrently.
* Highly transparent. (See SPECIAL NOTES below.)
* Scalable even with WRITE-intensive applications.
* Automatic synchronization of new nodes.


2. GALERA USE CASES

There are a number of ways in which Galera replication can be utilized.
They can be categorized in three groups:

1) Seeking High Availability only. In this case the client application
   connects to only one node, the rest serving as hot backups:

      ,-------------.
      | application |
      `-------------'
            |
            |             DB backups
            |
      ,-------.   ,-------.   ,-------.
      | node1 |   | node2 |   | node3 |
      `-------'   `-------'   `-------'
      <=======  cluster nodes  =======>

   In the case of primary node failure or maintenance shutdown the
   application can instantly switch to another node without any special
   failover procedure.

2) Seeking High Availability and improved performance through uniform load
   distribution. If there are several client connections to the database,
   they can be uniformly distributed between the cluster nodes, resulting
   in better performance. The exact degree of performance improvement
   depends on the application's load profile. Note that the transaction
   rollback rate may also increase.

         ,-------------.
         |   clients   |
         `-------------'
          |   |   |   |
         ,-------------.
         | application |
         `-------------'
          /     |     \
      ,-------. ,-------. ,-------.
      | node1 | | node2 | | node3 |
      `-------' `-------' `-------'
      <======  cluster nodes  ======>

   In the case of a node failure the application can keep on using the
   remaining healthy nodes.

   In this setup the application can also be clustered, with a dedicated
   application instance per database node, thus achieving HA not only for
   the database, but for the whole application stack:

         ,-------------.
         |   clients   |
         `-------------'
         //    ||    \\
      ,------. ,------. ,------.
      | app1 | | app2 | | app3 |
      `------' `------' `------'
         |       |       |
      ,-------. ,-------. ,-------.
      | node1 | | node2 | | node3 |
      `-------' `-------' `-------'
      <======  cluster nodes  ======>

3) Seeking High Availability and improved performance through smart load
   distribution. Uniform load distribution can cause an undesirably high
   rollback rate. Directing transactions which access the same set of
   tables to the same node can considerably improve performance by
   reducing the number of rollbacks. Also, if your application can
   distinguish between read/write and read-only transactions, the
   following configuration may be quite efficient:

      ,---------------------.
      |     application     |
      `---------------------'
      writes /   | reads   \ reads
      ,-------. ,-------. ,-------.
      | node1 | | node2 | | node3 |
      `-------' `-------' `-------'
      <=====  cluster nodes  =====>


3. GALERA CONFIGURATION PARAMETERS

3.1 Cluster URL.

Galera uses URL (RFC 3986) syntax for addressing, with optional parameters
passed in the URL query part. A Galera cluster address looks as follows:

   <backend>://<cluster address>[?option1=value1[&option2=value2]]

e.g.

   gcomm://192.168.0.1:4567?gmcast.listen_addr=0.0.0.0:5678

Currently Galera supports the following backends:

'dummy' - a bypass backend for debugging/profiling purposes. It does not
          connect to or replicate anything, and the rest of the URL
          address string is ignored.

'gcomm' - a proprietary Group Communication backend that provides Virtual
          Synchrony quality of service. It uses TCP for the membership
          service and TCP (and UDP multicast as of version 0.8) for data
          replication.

Normally one would use just the simplest form of the address URL:

   gcomm://               - to start a new cluster.

   gcomm://<node address> - to join an existing cluster. In that case
                            <node address> is the address of one of the
                            existing cluster members.
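For example, bootstrapping a two-node cluster amounts to starting the
first node with the empty address and pointing the second node at the
first (the address below is a placeholder):

   node1:  gcomm://
   node2:  gcomm://192.168.0.1:4567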
3.2 Galera parameters.

There are quite a few Galera configuration parameters which affect its
behavior and performance. Of particular interest to an end user are the
following.

To configure the gcomm listen address:

   gmcast.listen_addr

To configure how fast the cluster reacts on node failure or connection
loss:

   evs.suspect_timeout
   evs.inactive_timeout
   evs.inactive_check_period
   evs.keepalive_period

To fine-tune performance (especially in high-latency networks):

   evs.user_send_window
   evs.send_window

To relax or tighten replication flow control:

   gcs.fc_limit
   gcs.fc_factor

For a full parameter list please see
http://www.codership.com/wiki/doku.php?id=galera_parameters

3.2.1 GMCast parameter group.

All parameters in this group are prefixed by 'gmcast.' (see the example
above).

group
   String denoting the group name. Max length of the string is 16. Peer
   nodes accept a GMCast connection only if the group names match. It is
   set automatically from the wsrep options.

listen_addr
   Listening address for GMCast. The address is currently passed in URI
   format (for example tcp://192.168.3.1:4567) and it should be passed as
   the last configuration parameter in order to avoid confusion. If the
   parameter value is undefined, GMCast starts listening on all interfaces
   at the default port 4567.

mcast_addr
   Multicast address in dotted decimal notation; enables using multicast
   to transmit group communication messages. Defaults to none. Must have
   the same value on all nodes.

mcast_port
   Port used for UDP multicast messages. Defaults to the listen_addr port.
   Must have the same value on all nodes.

mcast_ttl
   Time to live for multicast packets. Defaults to 1.

3.2.2 EVS parameter group.

All parameters in this group are prefixed by 'evs.'. All values for the
timeout options below should follow the ISO 8601 standard for time
interval representation (e.g. 02:01:37.2 -> PT2H1M37.2S == PT121M37.2S ==
PT7297.2S).

suspect_timeout
   This timeout controls how long a node can remain silent until it is put
   under suspicion. If the majority of the current group agree that the
   node is under suspicion, it is discarded from the group and a new group
   view is formed immediately. If the majority of the group does not agree
   about the suspicion, inactive_timeout is waited until forming of a new
   group is attempted. Default value is 5 seconds.

inactive_timeout
   This timeout controls how long a node can remain completely silent
   until it is discarded from the group. This is a hard limit, unlike
   suspect_timeout, and the node is discarded even if it becomes live
   during the formation of the new group (so it is inclusive of
   suspect_timeout). Default value is 15 seconds.

inactive_check_period
   This period controls how often node liveness is checked. Default is 1
   second and there is no need to change this unless suspect_timeout or
   inactive_timeout is adjusted to a smaller value. Minimum is 0.1 seconds
   and maximum is suspect_timeout/2.

keepalive_period
   This parameter controls how often keepalive messages are sent into the
   network. Node liveness is determined with these keepalives, so the
   value should be considerably smaller than suspect_timeout. Default
   value is 1 second, minimum is 0.1 seconds and maximum is
   suspect_timeout/3.

consensus_timeout
   This timeout defines how long forming of a new group is attempted. If
   there is no consensus after this time has passed since the start of the
   consensus protocol, every node discards all other nodes from the group
   and forming of a new group is attempted through singleton groups.
   Default value is 30 seconds, minimum is inactive_timeout and maximum is
   inactive_timeout*5.

join_retrans_period
   This parameter controls how often join messages are retransmitted
   during group formation. There is usually no need to adjust this value.
   Default value is 0.3 seconds, minimum is 0.1 seconds and maximum is
   suspect_timeout/3.

view_forget_timeout
   This timeout controls how long information about known group views is
   maintained. This information is needed to filter out delayed messages
   from previous views that are not live anymore. Default value is 5
   minutes and there is usually no need to change it.

debug_log_mask
   This mask controls what debug information is printed in the logs if
   debug logging is turned on. The mask value is a bitwise OR of values
   from gcomm::evs::Proto::DebugFlags. By default only state information
   is printed.

info_log_mask
   This mask controls what info log is printed in the logs. The mask value
   is a bitwise OR of values from gcomm::evs::Proto::InfoFlags.

stats_report_period
   This parameter controls how often statistics information is printed in
   the log. It has effect only if statistics reporting is enabled via
   Conf::EvsInfoLogMask. Default value is 1 minute.

send_window
   This parameter controls how many messages the protocol layer is allowed
   to send without getting all acknowledgements for any of them. Default
   value is 32.

user_send_window
   Like send_window, but for messages whose sending is initiated by a call
   from the upper layer. Default value is 16.
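To illustrate the failure-detection timeouts in this group: a cluster on a
fast, reliable LAN could be made to react to node failures sooner by
shrinking the timeouts together, keeping the relations described above.
The address and values below are illustrative placeholders, not
recommendations:

   gcomm://192.168.0.1?evs.keepalive_period=PT0.3S&evs.suspect_timeout=PT1S&evs.inactive_timeout=PT3S&evs.inactive_check_period=PT0.3S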
3.2.3 GCS parameter group.

All parameters in this group are prefixed by 'gcs.'.

fc_debug
   Post debug statistics about SST flow control every that many writesets.
   Default: 0.

fc_factor
   Resume replication after the recv queue drops below this fraction of
   gcs.fc_limit. Default: 0.5.

fc_limit
   Pause replication if the recv queue exceeds that many writesets.
   Default: 16. For master-slave setups this can be increased
   considerably.

fc_master_slave
   Should we assume that there is only one master in the group? Default:
   NO.

max_packet_size
   All writesets exceeding that size will be fragmented. Default: 32616.

max_throttle
   How much the replication rate can be throttled during state transfer
   (to avoid running out of memory). Set it to 0.0 if stopping replication
   is acceptable for the sake of completing state transfer. Default: 0.25.

recv_q_hard_limit
   Maximum allowed size of the recv queue. This should normally be half of
   (RAM + swap). If this limit is exceeded, Galera will abort the server.
   Default: LLONG_MAX.

recv_q_soft_limit
   A fraction of gcs.recv_q_hard_limit after which the replication rate
   will be throttled. Default: 0.25. The degree of throttling is a linear
   function of the recv queue size and goes from 1.0 ("full rate") at
   gcs.recv_q_soft_limit to gcs.max_throttle at gcs.recv_q_hard_limit.
   Note that "full rate", as estimated between 0 and
   gcs.recv_q_soft_limit, is a very rough estimate of the regular
   replication rate.
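For example, in a deliberately master-slave style deployment (all writes
directed to a single node) flow control can be relaxed roughly as follows;
the address and values are illustrative placeholders:

   gcomm://192.168.0.1?gcs.fc_limit=256&gcs.fc_factor=0.9&gcs.fc_master_slave=YES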
3.2.4 Replicator parameter group.

All parameters in this group are prefixed by 'replicator.'.

commit_order
   Whether to allow Out-Of-Order committing (improves parallel applying
   performance). Possible settings:

   0 - BYPASS:     all commit order monitoring is turned off (useful for
                   measuring the performance penalty)
   1 - OOOC:       allow out-of-order committing for all transactions
   2 - LOCAL_OOOC: allow out-of-order committing only for local
                   transactions
   3 - NO_OOOC:    no out-of-order committing is allowed (strict
                   total-order committing)

   Default: 3.

3.2.5 GCache parameter group.

All parameters in this group are prefixed by 'gcache.'.

dir
   Directory where GCache should place its files. Default: the working
   directory of the process.

name
   Name of the main store file (ring buffer). Default: "galera.cache".

size
   Size of the main store file (ring buffer). This space is preallocated
   on startup. Default: 128Mb.

page_size
   Size of a page in the page store. The limit on the overall page store
   is the free disk space. Page files are prefixed by "gcache.page".
   Default: 128Mb.

keep_pages_size
   Total size of the page store pages to keep for caching purposes. If
   only page storage is enabled, one page is always present. Default: 0.

mem_size
   Size of the malloc() store (read: RAM), for configurations with spare
   RAM. Default: 0.
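For instance, a donor intended to serve incremental state transfers (see
section 6) might keep a larger ring buffer on a dedicated volume. A
sketch, where the path and size are placeholders and the 'G' size suffix
is assumed to be accepted alongside plain byte values:

   gcomm://192.168.0.1?gcache.dir=/var/lib/galera&gcache.size=1G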
3.2.6 SSL parameter group.

All parameters in this group are prefixed by 'socket.'.

ssl_cert
   Certificate file in PEM format.

ssl_key
   A private key for the certificate above, unencrypted, in PEM format.

ssl
   A boolean value to disable SSL even if the certificate and key are
   configured. Default: yes (SSL is enabled if ssl_cert and ssl_key are
   set).

To generate a private key/certificate pair the following command may be
used:

   $ openssl req -new -x509 -days 365000 -nodes -keyout key.pem -out cert.pem

Using short-lived certificates (1 year in most web examples) is not
advised, as it will lead to a complete cluster shutdown when the
certificate expires.
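The generated key and certificate can then be referenced from each node's
cluster address, for example (the file paths are placeholders):

   gcomm://192.168.0.1?socket.ssl_cert=/etc/galera/cert.pem&socket.ssl_key=/etc/galera/key.pem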
3.2.7 Incremental State Transfer parameter group.

All parameters in this group are prefixed by 'ist.'.

recv_addr
   Address to receive incremental state transfer at. Setting this
   parameter turns on incremental state transfer. IST will use SSL if SSL
   is configured as described above. No default.


4. GALERA ARBITRATOR

The Galera arbitrator found in this package is a small stateless daemon
that can serve as a lightweight cluster member in order to

* avoid the split-brain condition in a cluster with an otherwise even
  membership;
* request consistent state snapshots for backup purposes.

Example usage:

                   ,---------.
                   |  garbd  |
                   `---------'
  ,---------.           |           ,---------.
  | clients |           |           | clients |
  `---------'           |           `---------'
           \            |            /
            \         ,---.         /
             \       ('   `)       /
                     ( WAN )
             /       (.   ,)       \
            /         `---'         \
   ,---------.                 ,---------.
   |  node1  |                 |  node2  |
   `---------'                 `---------'
  Data Center 1               Data Center 2

In this example, if one of the data centers loses its WAN connection, the
node that still sees the arbitrator (and therefore sees clients) will
continue operation.

garbd accepts the same Galera options as a regular Galera node. Note that
at the moment garbd needs to see all replication traffic (although it does
not store it anywhere), so placing it in a location with poor connectivity
to the rest of the cluster may lead to cluster performance degradation.

Arbitrator failure does not affect cluster operation, and a new instance
can be reattached to the cluster at any time. There can be several
arbitrators in the cluster, although the practicality of that is
questionable.
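A minimal invocation sketch for attaching an arbitrator to a running
cluster (the address and group name are placeholders; consult garbd --help
for the exact option set shipped with your version):

   $ garbd -a gcomm://192.168.0.1:4567 -g my_cluster -d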
5. WRITESET CACHE

Starting with version 1.0 Galera stores writesets in a special cache. Its
purpose is to improve control of Galera memory usage and to offload
writeset storage to disk. The Galera cache has 3 types of stores:

1. A permanent in-memory store, where writesets are allocated by the
   default OS memory allocator. It can be useful for systems with spare
   RAM. It has a hard size limit. By default it is disabled (size set to
   0).

2. A permanent ring-buffer file which is preallocated on disk during cache
   initialization. It is intended as the main writeset store. By default
   its size is 128Mb.

3. An on-demand page store, which allocates memory-mapped page files
   during runtime as necessary. The default page size is 128Mb, but a page
   can be bigger if it needs to store a bigger writeset. The size of the
   page store is limited by the free disk space. By default page files are
   deleted when no longer in use, but a limit can be set on the total size
   of the page files to keep. When all other stores are disabled, at least
   one page file is always present on disk.

The allocation algorithm is as follows: all stores are tried in the above
order. If a store does not have enough space to allocate a writeset, the
next store is tried. The page store should always succeed, unless the
writeset is bigger than the available disk space.

By default the Galera cache allocates files in the working directory of
the process, but a dedicated location can be specified. For the
configuration parameters see the GCache group above (section 3.2.5).

NOTE: Since all cache files are memory-mapped, the process may appear to
use more memory than it actually does.


6. INCREMENTAL STATE TRANSFER (IST)

Galera 2.x introduces long-awaited functionality: incremental state
transfer. The idea is that if

a) the joining node's state UUID is the same as that of the group, and
b) all of the writesets that it missed can be found in the donor's GCache,

then instead of a whole state snapshot the joiner will receive the missing
writesets and catch up with the group by replaying them. For example:

 - the local node state is 5a76ef62-30ec-11e1-0800-dba504cf2aab:197222
 - the group state is      5a76ef62-30ec-11e1-0800-dba504cf2aab:201913
 - if writeset number 197223 is still in the donor's GCache, the donor
   will send writesets 197223-201913 to the joiner instead of the whole
   state.

IST can dramatically speed up remerging a node into the cluster. It is
also non-blocking on the donor.

The most important parameter for IST (besides 'ist.recv_addr') is the
GCache size on the donor. The bigger it is, the more writesets can be
stored in it, and the bigger seqno gaps can be closed with IST. On the
other hand, if the GCache is much bigger than the state size, serving IST
may be less efficient than sending a state snapshot.


7. SPECIAL NOTES

7.1 DEADLOCK ON COMMIT

In multi-master mode a transaction commit operation may return a deadlock
error. This is a consequence of writeset certification and is a
fundamental property of Galera. If deadlock on commit cannot be tolerated
by the application, Galera can still be used on the condition that all
write operations to a given table are performed on the same node. This
still has an advantage over "traditional" master-slave replication: write
load can still be distributed between nodes, and since replication is
synchronous, failover is trivial.

7.2 "SPLIT-BRAIN" CONDITION

A Galera cluster is fully distributed and does not use any sort of
centralized arbitrator, thus having no single point of failure. However,
like any cluster of that kind, it may fall into the dreaded "split-brain"
condition, where half of the cluster nodes suddenly disappear (e.g. due to
a network failure). In the general case, having no information about the
fate of the disappeared nodes, the remaining nodes cannot continue to
process requests and modify their states. While such a situation is
generally considered negligibly probable in a multi-node cluster (normally
nodes fail one at a time), in a 2-node cluster a single node failure can
lead to this, thus making 3 nodes the minimum requirement for a
highly-available cluster. The Galera arbitrator (see above) can serve as
an odd stateless cluster node to help avoid the possibility of an even
cluster split.