|o||"MySQL" is a registered trademark of Oracle Corporation.|
|o||"Percona XtraDB Cluster" and "Percona Server" are registered trademarks of Percona LLC.|
|o||"MariaDB" and "MariaDB Galera Cluster" are registered trademarks of Monty Program Ab.|
Galera Cluster for MySQL is a synchronous multi-master database cluster, based on synchronous replication and Oracle's MySQL/InnoDB. When Galera Cluster is used, reads and writes can be directed to any node, and any individual node can be lost without interruption in operations and without the need to handle complicated failover procedures.
On a high level, Galera Cluster consists of the standard MySQL server and the standard InnoDB storage engine, using the Galera Replication Plugin for replication. To be more specific, the MySQL replication plugin API has been extended to provide all the information and hooks required for true multi-master, synchronous replication. This API is called the Write Set Replication API, wsrep API.
Galera Cluster provides certification based replication. The idea is that the transaction to be replicated, the write set, not only contains the database rows to replicate, but also includes information on all the locks that were held by the database during the transaction. Each node certifies the replicated write set against other write sets in the applier queue and, if there are no conflicting locks, the write set can be applied. At this point, the transaction is considered committed, after which each node continues to apply it to the tablespace.
This approach is also called virtually synchronous replication as it is logically synchronous, but actual writing and committing to the tablespace happens independently and, thus, asynchronously on each node.
Galera Cluster is a significant step in terms of improving MySQL high availability. The MySQL ecosystem has offered users various ways to achieve high availability, but typically all alternatives have provided only some of the following features, but not all, thus making the choice of a high availability solution an exercise in tradeoffs.
Galera Cluster provides all of the following features:
o True multi-master—Read and write to any node at any time.
o Synchronous replication—No slave lag, no data is lost at node crash.
o Tightly coupled—All nodes hold the same state. No diverged data between nodes allowed.
o Multi-threaded slave—For better performance. For any workload.
o No master-slave failover operations or use of VIP.
o Hot standby—No downtime during failover (since there is no failover).
o Automatic node provisioning—No need to manually back up the database and copy it to the new node.
o Supports the InnoDB storage engine.
o Transparent to applications—Requires no (or minimal) changes to the application.
o No read and write splitting needed.
The result is a high availability solution that is both robust in terms of data integrity and high performance with instant failovers.
Galera Cluster Cloud Implementations
Another main benefit of Galera Cluster is good cloud support. Automatic node provisioning makes elastic scale-out and scale-in operations painless. Galera Cluster has proven to perform extremely well in the cloud, such as when using multiple small node instances, across multiple data centers (for example, AWS zones) or even over wide area networks (WANs).
Galera Arbitrator
If the expense of adding, for example, a third datacenter is too high, you can use the Galera arbitrator. An arbitrator is a member of the cluster which participates in voting, but not in actual replication.
See Galera Arbitrator for more details.
Galera Replication Plugin
Galera Replication Plugin is a general purpose replication plugin for any transactional system. It can be used to create a synchronous multi-master replication solution to achieve high availability and scale-out.
See Galera Replication Plugin for more details.
Global Transaction ID
To keep the state identical on all nodes, the wsrep API uses global transaction IDs (GTID), which are used to both:
o Identify the state change
o Identify the state itself by the ID of the last state change
The GTID consists of:
o A state UUID, which uniquely identifies the state and the sequence of changes it undergoes
o An ordinal sequence number (seqno, a 64-bit signed integer) to denote the position of the change in the sequence
See wsrep API for more details.
Incremental State Transfer (IST)
In incremental state transfer, a node receives only the missing write sets and catches up with the group by replaying them. See also the definition for State Snapshot Transfer (SST).
See Incremental State Transfer (IST) for more details.
Primary Component (PC)
In addition to single node failures, the cluster may be split into
several components due to network failure.
In such a situation, only one of the components can continue to
modify the database state to avoid history divergence. This component
is called the Primary Component (PC).
See Primary Component for more details.
Rolling Schema Upgrade
The rolling schema upgrade is a DDL processing method where the DDL is processed only locally on the node. The node is desynchronized from the cluster for the duration of the DDL processing, so that it does not block the rest of the nodes. When the DDL processing is complete, the node applies the delayed replication events and synchronizes back with the cluster.
See Rolling Schema Upgrade for more details.
State Snapshot Transfer (SST)
State Snapshot Transfer refers to a full data copy from one cluster
node (donor) to the joining node (joiner). See also the
definition for Incremental State Transfer (IST).
See State Snapshot Transfer (SST) for more details.
Total Order Isolation
By default, DDL statements are processed using the Total Order Isolation (TOI) method. In TOI, the query is replicated to the nodes in statement form before it is executed on the master. The query waits for all preceding transactions to commit and is then executed in isolation on all nodes simultaneously.
See Total Order Isolation for more details.
Writeset Cache (GCache)
Galera stores write sets in a special cache called Writeset Cache (GCache).
In short, GCache is a memory allocator for write sets, and its primary purpose is to minimize the write set footprint in RAM.
See Writeset Cache (GCache) for more details.
The wsrep API is a generic replication plugin interface for databases.
The API defines a set of application callbacks and replication plugin calls.
See wsrep API for more details.
Galera Cluster for MySQL consists of two software packages:
o Galera wsrep provider
o MySQL server patched with the wsrep API
This chapter describes the Galera Cluster
release numbering schemes and software packages.
Release Numbering Schemes
Galera Cluster software packages have their own release numbering schemes as follows:
o Galera wsrep provider—The Galera wsrep provider release numbering scheme is as follows:
<wsrep API main version>.<Galera wsrep provider version>
For example, release number 23.2.4 indicates that the Galera wsrep provider is meant to be used with wsrep API version 23.x.x and the Galera wsrep provider version is 2.4.
o MySQL server patched with the wsrep API—The patched MySQL server release numbering scheme is as follows:
<MySQL version>-<wsrep API version>
For example, release number 5.5.29-23.7.3 indicates that the patched MySQL server release contains MySQL version 5.5.29 and wsrep API version 23.7.3.
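The two schemes above can be split mechanically. The helper below is only a sketch (the function names are ours, not part of any Galera tooling):

```python
# Illustrative only: small parsers for the two release numbering
# schemes described above.

def parse_provider_release(release):
    """Split a Galera wsrep provider release like '23.2.4' into
    (wsrep API main version, provider version)."""
    api_main, provider = release.split(".", 1)
    return int(api_main), provider

def parse_mysql_wsrep_release(release):
    """Split a patched MySQL server release like '5.5.29-23.7.3' into
    (MySQL version, wsrep API version)."""
    mysql_version, wsrep_version = release.split("-")
    return mysql_version, wsrep_version

print(parse_provider_release("23.2.4"))            # (23, '2.4')
print(parse_mysql_wsrep_release("5.5.29-23.7.3"))  # ('5.5.29', '23.7.3')
```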
Galera Cluster software packages for different Linux distributions are available as follows:
o Galera wsrep provider:
  o A 32-bit binary for Debian based distributions
  o A 64-bit binary for Debian based distributions
  o A 32-bit binary for RHEL/CentOS 5
  o A 64-bit binary for RHEL/CentOS 5
  o A 32-bit binary for RHEL/CentOS 6
  o A 64-bit binary for RHEL/CentOS 6
  o A source code package
You can download these packages from: https://launchpad.net/galera/+download.
o MySQL server patched with the wsrep API:
  o A 32-bit binary for Debian based distributions
  o A 64-bit binary for Debian based distributions
  o A 32-bit binary for RHEL/CentOS 5
  o A 64-bit binary for RHEL/CentOS 5
  o A 32-bit binary for RHEL/CentOS 6
  o A 64-bit binary for RHEL/CentOS 6
  o A 32-bit generic tarball for system-wide installations
  o A 64-bit generic tarball for system-wide installations
  o The separate wsrep API patches without the MySQL server
You can download these packages from: https://launchpad.net/codership-mysql/+download.
Third Party Galera Cluster Implementations
There are two products that implement Galera Cluster:
o Percona XtraDB Cluster—Percona XtraDB Cluster is a high availability and high scalability solution for MySQL users. Percona XtraDB Cluster integrates Percona Server with the Galera library of high availability solutions in a single product package.
For more information, see http://www.percona.com/software/percona-xtradb-cluster.
o MariaDB Galera Cluster—MariaDB Galera Cluster uses the Galera library for the replication implementation. To interface with the Galera Replication Plugin, MariaDB is enhanced to support the replication API definition in the wsrep API project.
For more information, see https://kb.askmonty.org/en/what-is-mariadb-galera-cluster/.
Database replication refers to the frequent copying of data from one node, that is, a database on one server, to a node on another server. A database replication system is, in other words, a distributed database in which all nodes share the same level of information. This system is also known as a database cluster.
The database client (such as a Web browser or a computer application) does not see the database replication system in any way, but experiences close to native DBMS behavior.
Many database management systems replicate databases. The most common replication method is to create a master/slaves relationship between the original data set (master) and its copies (slaves). The master logs the updates to the data, and propagates the logs through a network to the slaves. The slaves, for their part, receive a stream of updates from the master and apply the changes accordingly.
A master/slaves relationship is depicted in the figure below:
[Figure: Master/Slave Replication]
Another common replication method is to create a multi-master replication system, where updates can be submitted to any database node, and then propagated through a network to other database nodes. In other words, all database nodes act as masters, and no logs or update success indications are sent.
A multi-master replication system is depicted in the figure below:
Asynchronous and Synchronous Replication
From a more technical perspective, asynchronous and synchronous replication protocols differ in the way they propagate database transactions to other nodes in the database cluster:
o Synchronous replication uses an approach known as eager replication. In eager replication, the nodes keep all replicas synchronized by updating all replicas in a single transaction. In other words, when a transaction commits, all nodes have the same value.
o Asynchronous replication uses an approach known as lazy replication. In lazy replication, the master database asynchronously propagates replica updates to other nodes after the replicating transaction commits. In other words, when a transaction commits, nodes can have different values at least for a short while.
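The eager/lazy distinction can be illustrated with a toy model (this is not Galera code, just a sketch of the two commit semantics):

```python
# Toy model contrasting eager (synchronous) and lazy (asynchronous)
# replication semantics.

class Node:
    def __init__(self):
        self.data = {}

def eager_commit(nodes, key, value):
    # Eager replication: all replicas are updated as part of the commit,
    # so every node sees the new value once the transaction commits.
    for node in nodes:
        node.data[key] = value

def lazy_commit(master, slaves, key, value):
    # Lazy replication: only the master is updated at commit time;
    # the change is queued and propagated afterwards.
    master.data[key] = value
    return [(slave, key, value) for slave in slaves]  # pending updates

nodes = [Node(), Node(), Node()]
eager_commit(nodes, "x", 1)
assert all(n.data["x"] == 1 for n in nodes)   # same value everywhere

master, slave = Node(), Node()
pending = lazy_commit(master, [slave], "x", 1)
assert master.data == {"x": 1} and slave.data == {}  # slave lags briefly
for node, k, v in pending:  # asynchronous propagation happens later
    node.data[k] = v
assert slave.data == {"x": 1}
```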
Theoretically, synchronous replication has several advantages over asynchronous replication:
o Synchronous replication provides highly available clusters and guarantees 24/7 service availability, as:
  o There is no data loss when one of the nodes crashes
  o Data replicas are always consistent
  o There are no complex, time-consuming failovers
o Transactions can be executed on all nodes in parallel to increase performance.
o Synchronous replication can guarantee causality across the whole cluster. For example, a SELECT S issued after transaction T will always see the effects of the transaction even if it were executed on another node.
Traditionally, eager replication protocols coordinated nodes
one operation at a time, using 2-phase commit or distributed
locking. In a system with x nodes and y operations, a
throughput of z transactions per second requires
x × y × z messages per second. When
the number of nodes increases, transaction response times,
conflict probability and deadlock rates grow exponentially.
This has led to a situation where asynchronous replication
remains the dominant means for database performance scalability
and availability. Widely adopted open-source databases such as
MySQL or PostgreSQL only offer asynchronous replication.
Solving the Issues in Synchronous Replication
To solve the problems in the traditional synchronous replication systems and approaches, researchers all around the world have, in the last couple of years, suggested alternative approaches to synchronous database replication. In addition to theory, prototype implementations have shown a lot of promise. The most important improvements these studies have brought about are:
o Group communication—Group communication is a high level abstraction that defines patterns for the communication of database nodes to guarantee the consistency of replicated data.
o Write sets—Write sets bundle database writes into a single writeset message, thus avoiding the coordination of nodes one operation at a time.
o Database State Machine—In the Database State Machine, read-only transactions are processed locally on a database site, and update transactions are first executed locally on a database site (on shadow copies), and then broadcast as a read set to the other database sites for certification and possibly commit.
o Transaction Re-ordering—Transaction re-ordering means that transactions are re-ordered before they are committed and broadcast to the other database sites. Reordering increases the number of transactions that successfully pass the certification test.
Galera Cluster's certification based replication builds on these approaches.
As we learned in the introduction, synchronous replication uses eager replication, where the nodes keep all other nodes synchronized by updating all replicas in a single transaction. In other words, when a transaction commits, all nodes have the same value. This takes place by using writeset replication over group communication.
The Galera Cluster replication architecture consists of the following software entities:
o DBMS—A Database Management System (DBMS), such as MySQL.
o wsrep API—The wsrep API defines the interface and the responsibilities for the DBMS and the replication provider. The wsrep API consists of:
  o wsrep hooks—The wsrep integration in the DBMS engine.
  o dlopen—The dlopen() function makes the wsrep provider available to the wsrep hooks.
o Galera Replication Plugin—The Galera Replication Plugin provides the wsrep service functionality.
o Group communication plugins—Galera can use various group communication systems. We have used, for example, gcomm and Spread (http://www.spread.org/).
The wsrep API is a generic replication plugin interface for databases. The API defines a set of application callbacks and replication plugin calls.
The wsrep API is used in a replication model where an application, such as a database server, has a state. In practice, the state refers to the contents of the database. When the database is used, clients modify the database contents and the database state changes. This changing of the state is represented as a series of atomic changes (transactions). In a database cluster, all nodes always have the same state, which they synchronize with each other by replicating and applying state changes in the same serial order.
From a more technical perspective, the state change process is as follows:
1. A state change takes place on the database.
2. The wsrep hooks within the database populate the write sets for the other database nodes in the cluster.
3. The wsrep provider functions are made available to the wsrep hooks through the dlopen() function.
4. The Galera Replication Plugin handles the writeset certification and replication to the other database nodes in the cluster.
At the receiving end, the write sets are applied by high-priority transactions.
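The steps above can be sketched schematically. The class names below are our own simplification, not the real wsrep API:

```python
# Schematic sketch of the state change process: wsrep hooks populate a
# write set, the provider replicates it, and each node applies it.

class WriteSet:
    def __init__(self, changes, keys):
        self.changes = changes  # row changes made by the transaction
        self.keys = keys        # primary keys touched (used for certification)

class WsrepProvider:
    """Stands in for the Galera Replication Plugin made available via dlopen()."""
    def __init__(self, nodes):
        self.nodes = nodes

    def replicate(self, ws):
        # Certify and apply the write set on every node, in the same
        # total order on each of them.
        for node in self.nodes:
            node.apply(ws)

class DBNode:
    def __init__(self):
        self.rows = {}

    def apply(self, ws):
        # On a real node this is done by a high-priority applier transaction.
        self.rows.update(ws.changes)

cluster = [DBNode(), DBNode()]
provider = WsrepProvider(cluster)
# The wsrep hooks populate the write set as the local transaction commits:
ws = WriteSet(changes={"id=1": ("alice",)}, keys=["id=1"])
provider.replicate(ws)
assert all(n.rows == {"id=1": ("alice",)} for n in cluster)
```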
To keep the state identical on all nodes, the wsrep API uses a Global Transaction ID (GTID), which is used to both:
o Identify the state change
o Identify the state itself by the ID of the last state change
The GTID consists of:
o A state UUID, which uniquely identifies the state and the sequence of changes it undergoes
o An ordinal sequence number (seqno, a 64-bit signed integer) to denote the position of the change in the sequence
By using the GTID, you can:
o Compare the application states
o Establish the order of state changes
o Determine whether the change was applied or not
o Determine whether the change is applicable at all to a given state
In human-readable form, a GTID is written as <state UUID>:<seqno>, for example: 45eec521-2f34-11e0-0800-2a36050b826b:94530586304
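As a sketch (with an illustrative UUID, not one from a real cluster), a GTID can be parsed and ordered like this:

```python
# A GTID as described above: a state UUID plus a 64-bit signed seqno,
# written in <uuid>:<seqno> notation.

from uuid import UUID

def parse_gtid(text):
    state_uuid, seqno = text.rsplit(":", 1)
    return UUID(state_uuid), int(seqno)

a = parse_gtid("45eec521-2f34-11e0-0800-2a36050b826b:94530586304")
b = parse_gtid("45eec521-2f34-11e0-0800-2a36050b826b:94530586305")

assert a[0] == b[0]  # same state UUID: same history
assert a[1] < b[1]   # the seqno establishes the order of state changes
```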
Galera Replication Plugin implements the wsrep API and operates as the wsrep provider. From a more technical perspective, it consists of:
o Certification layer—The certification layer prepares the write sets and performs the certification.
o Replication layer—The replication layer manages the replication protocol and provides the total ordering capability.
o Group communication framework—The group communication framework provides a plugin architecture for various group communication systems.
Galera Cluster is built on top of a proprietary group communication system layer which implements virtual synchrony QoS. Virtual synchrony unifies the data delivery and cluster membership service, which provides clear formalism for message delivery semantics.
Virtual synchrony guarantees consistency, but not temporal synchrony, which is required for smooth multi-master operation. For this purpose, Galera implements its own runtime-configurable temporal flow control, which keeps nodes synchronized to within a fraction of a second.
The group communication framework also provides total ordering of messages from multiple sources, which is used to build Global Transaction IDs in a multi-master cluster.
At the transport level, Galera Cluster is a symmetric undirected graph, where all database nodes are connected with each other over a TCP connection. By default, TCP is used for both message replication and the cluster membership service, but UDP multicast can also be used for replication in a LAN.
Certification based replication uses group communication and transaction ordering techniques to achieve synchronous replication. Transactions execute optimistically in a single node (or replica) and, at commit time, run a coordinated certification process to enforce global consistency. Global coordination is achieved with the help of a broadcast service that establishes a global total order among concurrent transactions.
Pre-requisites for certification based replication:
o The database is transactional (that is, it can roll back uncommitted changes).
o Each replication event changes the database atomically (that is, the series of database operations either all occur, or nothing occurs).
o Replicated events are globally ordered (that is, applied on all instances in the same order).
Certification based replication is described in the software sequence diagram below:
[Figure: Certification Based Replication]
The main idea is that a transaction is executed conventionally until the commit point, under the assumption that there will be no conflict. This is called optimistic execution. When the client issues a COMMIT command (but before the actual commit has happened), all changes made to the database by the transaction and the primary keys of changed rows are collected into a writeset. This writeset is then replicated to the rest of the nodes. After that, the writeset undergoes a deterministic certification test (using the collected primary keys) on each node (including the writeset originator node) which determines if the writeset can be applied or not.
If the certification test fails, the writeset is dropped and the original transaction is rolled back. If the test succeeds, the transaction is committed and the writeset is applied on the rest of the nodes.
The certification test implemented in Galera Cluster depends on the global ordering of transactions. Each transaction is assigned a global ordinal sequence number during replication. Thus, when a transaction reaches the commit point, the sequence number of the last transaction it did not conflict with is known. The interval between those two numbers is a zone of uncertainty: transactions in this interval have not seen the effects of each other. Therefore, all transactions in this interval are checked for primary key conflicts with the transaction in question. The certification test fails if a conflict is detected.
Since the procedure is deterministic and all replicas receive transactions in the same order, all nodes reach the same decision about the outcome of the transaction. The node that started the transaction can then notify the client application whether or not the transaction has been committed.
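The certification test described above can be sketched as follows. This is a simplification with invented helper names, not Galera's actual implementation:

```python
# Sketch of the certification test: a write set fails certification if
# any transaction in its uncertainty interval touched one of the same
# primary keys.

def certify(writeset_keys, last_seen_seqno, applied):
    """applied: list of (seqno, keys) for globally ordered transactions.
    last_seen_seqno: seqno of the last transaction the committing
    transaction did not conflict with (i.e. had already seen)."""
    for seqno, keys in applied:
        if seqno > last_seen_seqno and set(keys) & set(writeset_keys):
            return False  # primary-key conflict in the uncertainty interval
    return True           # no conflicts: the write set can be applied

applied = [(10, ["pk=1"]), (11, ["pk=7"]), (12, ["pk=3"])]
assert certify(["pk=2"], last_seen_seqno=10, applied=applied)      # no overlap
assert not certify(["pk=3"], last_seen_seqno=10, applied=applied)  # conflicts with seqno 12
```

Because every node runs this deterministic check on the same totally ordered input, every node reaches the same verdict without further communication.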
Isolation guarantees that transactions are processed in a reliable manner. To be more specific, it ensures that concurrently running transactions do not interfere with each other. In this way, isolation ensures data consistency. If the transactions were not isolated, one transaction could modify data that other transactions are reading. This would lead to data inconsistency.
The four isolation levels are, from lowest to highest:
o READ-UNCOMMITTED—At this isolation level, transactions can see changes to data made by other transactions that are not yet committed. In other words, transactions can read data that may not eventually exist, as the other transactions can always roll back the changes without committing. This is called a dirty read. READ-UNCOMMITTED provides no real isolation at all.
o READ-COMMITTED—At this isolation level, dirty reads are impossible, as uncommitted changes are invisible to other transactions until the transaction is committed. However, at this level, SELECT clauses use their own snapshots of data committed before the SELECT clause was executed. As a result, the same SELECT clause, when run multiple times within the same transaction, can return different result sets. This is called a non-repeatable read.
o REPEATABLE-READ—At this isolation level, non-repeatable reads are impossible, as the snapshot for the SELECT clause is taken the first time the SELECT clause is executed during the transaction. This snapshot is used throughout the entire transaction for this SELECT clause, and it always returns the same result set. This level does not take into account changes to data made by other transactions, regardless of whether they have been committed or not. In this way, reads are always repeatable.
o SERIALIZABLE—This isolation level places locks on all records that are accessed within a transaction. SERIALIZABLE also locks the resource so that records cannot be appended to a table being operated on by the transaction. This level prevents a phenomenon known as a phantom read. A phantom read occurs when, within a transaction, two identical queries are executed and the rows returned by the second query differ from those returned by the first.
Galera Cluster uses transaction isolation on two levels:
o Locally, that is, on each node, transaction isolation works as with native InnoDB. You can use all levels. The default isolation level for InnoDB is REPEATABLE-READ.
o At the cluster level, between transactions processing on separate nodes, Galera Cluster implements a transaction isolation level called SNAPSHOT ISOLATION. The SNAPSHOT ISOLATION level falls between the REPEATABLE-READ and SERIALIZABLE levels.
The SERIALIZABLE transaction isolation level is not supported in a multi-master use case, neither in the STATEMENT nor in the ROW format. This is because Galera replication does not carry a transaction read set. The SERIALIZABLE transaction isolation level is also vulnerable to multi-master conflicts: it holds read locks, and any replicated write to a read-locked row will cause the transaction to abort. Hence, it is recommended not to use it in Galera Cluster.
There are two different node provisioning methods:
o State Snapshot Transfer (SST), which transfers the entire node state as it is (hence "snapshot").
o Incremental State Transfer (IST), which only transfers the results of transactions missing from the joining node.
These methods are described in the chapters below.
State Snapshot Transfer (SST)
State Snapshot Transfer (SST) refers to a full data copy from one cluster node (donor) to the joining node (joiner). SST is used when a new node joins the cluster. To get synchronized with the cluster, the new node has to transfer data from a node that is already part of the cluster. In Galera replication, you can choose from two conceptually different ways to transfer a state from one MySQL server to another:
o mysqldump—mysqldump requires the receiving server to be fully initialized and ready to accept connections before the transfer.
mysqldump is a blocking method as the donor node becomes READ-ONLY while data is being copied from one node to another (SST applies the FLUSH TABLES WITH READ LOCK command for the donor).
mysqldump is also the slowest SST method and, in a loaded cluster, this can be an issue.
o rsync / rsync_wan—This approach copies data files directly, which requires that the receiving server is initialized after the transfer. Methods such as rsync, rsync_wan, and xtrabackup fall into this category.
These methods are faster than mysqldump, but they have certain limitations: they can only be used on server startup, and the receiving server must be configured very similarly to the donor (for example, the innodb_file_per_table value must be the same, and so on).
Some of these methods, for example xtrabackup, can be made non-blocking on the donor. These methods are supported through a scriptable SST interface.
SEE ALSO: Chapter Comparison of State Snapshot Transfer Methods
You can configure the state snapshot transfer method with the wsrep_sst_method variable. For example:
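A my.cnf fragment along these lines would do it (the original example is not preserved here; rsync is shown as one representative method, a sketch rather than a recommended configuration):

```ini
[mysqld]
# Choose the SST method: mysqldump, rsync, rsync_wan, xtrabackup, ...
wsrep_sst_method=rsync
```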
Incremental State Transfer (IST)
Galera Cluster supports a functionality known as incremental state transfer. Incremental state transfer means that if:
1. the joining node state UUID is the same as that of the group, and
2. all of the missed write sets can be found in the donor's GCache,
then, instead of the whole state snapshot, the node receives only the missing write sets and catches up with the group by replaying them.
For example, if the local node state is:
  5a76ef62-30ec-11e1-0800-dba504cf2aab:197222
...and the group state is:
  5a76ef62-30ec-11e1-0800-dba504cf2aab:201913
...and if writeset number 197223 is still in the donor's writeset cache (GCache), the donor sends write sets 197223-201913 to the joiner instead of the whole state.
IST can dramatically speed up the remerging of a node to the cluster. It is also non-blocking on the donor.
Perhaps the most important parameter for IST is the GCache size
on the donor. The bigger it is, the more write sets can be
stored in it, and the bigger seqno gaps can be closed with
IST. On the other hand, if the GCache is much bigger than
the state size, serving IST may be less efficient than
sending a state snapshot.
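The donor's choice between IST and SST, as described above, can be sketched as follows (the helper names are ours, not Galera's):

```python
# Sketch of the IST eligibility rule: IST is possible when the joiner
# shares the group's state UUID and every missing seqno is still held
# in the donor's GCache.

def choose_transfer(joiner_uuid, joiner_seqno, group_uuid, group_seqno,
                    gcache_lowest_seqno):
    if joiner_uuid != group_uuid:
        return "SST"   # different history: a full snapshot is needed
    first_missing = joiner_seqno + 1
    if first_missing >= gcache_lowest_seqno:
        # The donor can replay write sets first_missing..group_seqno.
        return "IST"
    return "SST"       # the gap has already rotated out of the GCache

# Joiner at 197222, group at 201913, GCache still holds seqno 197223:
assert choose_transfer("u", 197222, "u", 201913, 197223) == "IST"
# GCache has already rotated past the needed write sets:
assert choose_transfer("u", 100, "u", 201913, 197223) == "SST"
```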
Writeset Cache (GCache)
Galera Cluster stores write sets in a special cache called the Writeset Cache (GCache). In short, GCache is a memory allocator for write sets, and its primary purpose is to minimize the write set footprint in RAM. Galera Cluster also supports offloading writeset storage to disk.
GCache has three types of stores:
1. A permanent in-memory store, where write sets are allocated by the default OS memory allocator. This store can be useful in systems that have spare RAM. The store has a hard size limit. By default, it is disabled.
2. A permanent ring-buffer file, which is preallocated on disk during cache initialization. This store is intended as the main writeset store. By default, its size is 128MB.
3. An on-demand page store, which allocates memory-mapped page files during runtime as necessary. The default page size is 128MB, but it can be bigger if a bigger write set needs to be stored.
The size of the page store is limited by the free disk space. By default, page files are deleted when not in use, but a limit can be set on the total size of the page files to keep. When all other stores are disabled, at least one page file is always present on disk.
SEE ALSO: GCache related parameter descriptions in chapter Galera Parameters
The allocation algorithm attempts to store write sets in the above order. If the first store does not have enough space to allocate the writeset, the allocation algorithm attempts to store it on the next store. The page store always succeeds, unless the writeset is bigger than the available disk space.
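The store fallback order can be sketched as a toy model (this is not GCache's real allocator, just the order of attempts described above):

```python
# Toy allocator mirroring the GCache store order: in-memory store,
# then ring-buffer file, then on-demand page store.

def allocate(ws_size, mem_free, ring_free, disk_free):
    if ws_size <= mem_free:
        return "memory"       # permanent in-memory store (if enabled)
    if ws_size <= ring_free:
        return "ring-buffer"  # preallocated ring-buffer file
    if ws_size <= disk_free:
        return "page"         # memory-mapped page files, grown on demand
    # The page store only fails when the write set exceeds free disk space.
    raise MemoryError("write set larger than available disk space")

assert allocate(10, mem_free=0, ring_free=128, disk_free=10**6) == "ring-buffer"
assert allocate(500, mem_free=0, ring_free=128, disk_free=10**6) == "page"
```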
By default, GCache allocates files in the working directory of the process, but a dedicated location can be specified (see chapter Galera Parameters).
Since all cache files are memory-mapped, the process may
appear to use more memory than it actually does.
Scriptable State Snapshot Transfer
Galera Cluster has an interface to customize state snapshot transfer through an external script. The script assumes that the storage engine initialization on the receiving node takes place only after the state transfer is complete. In short, this transfer copies the contents of the source data directory to the destination data directory (with possible variations).
As of wsrep API patch level 23.7, SST parameters are named. Individual scripts can use the wsrep_sst_common.sh file, which contains common functions for parsing argument lists, logging errors, and so on. There is no constraint on the order or number of parameters. New parameters can be added and any parameter can be ignored by the script.
These parameters are always passed to any state transfer script:
o role
o address
o auth
o datadir
o defaults-file
o parent
These parameters are passed to the state transfer script by the state transfer process:
o socket—The local server (donor) socket for communications, if required.
o gtid—The Global Transaction ID, in the format <uuid>:<seqno>.
o bypass—This parameter specifies whether the actual data transfer should be skipped and only the GTID should be passed to the receiving server (to go straight to incremental state transfer).
These parameters are only passed to the wsrep_sst_mysqldump script:
o user—The MySQL user to connect to both the remote and local servers. The user must be the same on both servers.
o password—The MySQL user password.
o host—The remote server (receiver) host address.
o port—The remote server (receiver) port.
o local-port—The local server (donor) port.
Flow control allows a cluster node to instruct the other nodes in the cluster when it needs to pause replication and when it is ready to continue replication. This prevents any node in the cluster from lagging too far behind the others in applying replication.
Read also this Percona article: http://www.mysqlperformanceblog.com/2013/05/02/galera-flow-control-in-percona-xtradb-cluster-for-mysql/
This chapter describes the flow control of node states and the possible node state changes.
To ensure temporal synchrony and consistency (as opposed to logical which is provided by virtual synchrony), Galera Cluster implements several forms of flow control, depending on the node state. The node states are described in the chapters below.
OPEN and PRIMARY
In either of these states, the node is not considered to be a part of the cluster. It is not allowed to replicate, apply, or cache any write sets. No flow control is applied.
JOINER and DONOR
In general, the node cannot apply any writesets and it must cache them. There is no reasonable way to keep the node synchronized with the cluster (except for stopping all replication). However, it is possible to limit the replication rate to make sure that the writeset cache does not exceed the configured size. The writeset cache size is controlled by the following variables:
o gcs.recv_q_hard_limit sets the maximum writeset cache size (in bytes). The parameter value depends on the amount of RAM, swap size, and performance considerations. The default value is SSIZE_MAX - 2Gb on 32-bit systems. On 64-bit systems there is practically no limit for the maximum writeset cache size. If this limit is exceeded and gcs.max_throttle is not 0.0, the node will abort with an out-of-memory error. If gcs.max_throttle is 0.0, replication in the cluster will be stopped.
o gcs.max_throttle is the smallest fraction of the normal replication rate that we can tolerate in the cluster. 1.0 means that no replication rate throttling is allowed. 0.0 means that a complete replication stop is possible. The default value is 0.25.
o gcs.recv_q_soft_limit is a fraction of the gcs.recv_q_hard_limit and serves to estimate the average replication rate. When it is exceeded, the average replication rate (in bytes) during this period is calculated. After that, the replication rate is decreased linearly with the cache size in a way that at gcs.recv_q_hard_limit it reaches gcs.max_throttle × (average replication rate). The default value is 0.25.
NOTE: The average replication rate estimated here can be way off from the sustained one.
To sum up, the writeset cache grows semi-logarithmically with time after the gcs.recv_q_soft_limit is exceeded, and so does the time needed for the state transfer to complete.
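As a sketch of how these limits might be set, the options go into the wsrep_provider_options string in my.cnf; the values below are illustrative assumptions, not recommendations:

```ini
# Illustrative values only: cap the writeset cache at 4G, begin
# throttling once a quarter of the cache is used, and never throttle
# below a quarter of the normal replication rate.
[mysqld]
wsrep_provider_options="gcs.recv_q_hard_limit=4G; gcs.recv_q_soft_limit=0.25; gcs.max_throttle=0.25"
```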
JOINED
A JOINED node can apply write sets. In this state, flow control makes sure that the node can eventually catch up with the cluster, specifically that its writeset cache never grows. Thus, the cluster-wide replication rate is limited by the rate at which the node can apply the write sets. Since applying a write set is usually several times faster than processing a transaction, this hardly ever affects the performance of the cluster, except at the very beginning, when the buffer pool on the node is empty. Using parallel applying can significantly speed it up.
SYNCED
In the SYNCED state, the node flow control attempts to keep the slave queue to a minimum. This is controlled by the following configuration variables:
o gcs.fc_limit—When the slave queue exceeds this limit, replication is paused. It is essential for multi-master configurations that this limit is low, as the certification conflict rate is proportional to the slave queue length. In master-slave setups, this value can be considerably higher to reduce flow control intervention. The default value is 16.
o gcs.fc_factor—When the slave queue goes below gcs.fc_limit × gcs.fc_factor, replication is resumed. The default value is 0.5.
NOTE: While it is critical for multi-master operation to have as small a slave queue as possible, the slave queue length is not as critical for master-slave setups, since, depending on the application and hardware, even 1K of write sets may be applied in a fraction of a second. The slave queue length has no effect on master-slave failover.
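As a sketch of the arithmetic implied by these defaults (not actual Galera code), the pause and resume points work out as follows:

```shell
# Flow control pauses replication when the slave queue exceeds
# gcs.fc_limit, and resumes it when the queue drops below
# gcs.fc_limit * gcs.fc_factor (the documented defaults are used here).
fc_limit=16
fc_factor=0.5
resume=$(awk -v l="$fc_limit" -v f="$fc_factor" 'BEGIN { print l * f }')
echo "pause above $fc_limit, resume below $resume"
```

With the defaults, replication pauses at a queue of 16 write sets and resumes once the queue falls below 8.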
Since Galera nodes process transactions asynchronously with regard to each other, the amount of replication data cannot be anticipated in any way. Hence, Galera Cluster flow control is reactive, that is, it only takes effect after certain limits are exceeded. It cannot prevent exceeding these limits or make any guarantees about by how much they will be exceeded. For example, if gcs.recv_q_hard_limit is set to 100Mb, it can still be exceeded by a 1Gb write set.
Node State Changes
The Galera Cluster node state machines handle different state changes on different Galera Cluster layers. At the top layer, there are the node state changes depicted in the figure below: [image: Galera Cluster Node State Changes]
In the figure:
|1.||The node establishes a connection to a Primary Component.|
|2.||The state transfer request of the node succeeds. The node starts to cache the write sets.|
|3.||The node receives a state snapshot. Now it has all cluster data and can start applying the cached writesets. Flow control is switched on to ensure eventual slave queue decrease.|
|4.||The node completes catching up with the cluster (the slave queue is empty). Flow control is switched on to keep the slave queue empty. The MySQL wsrep_ready status variable is set to 1 and the node is allowed to process transactions.|
|5.||The node receives a state transfer request. Flow control is relaxed to JOINER. The node caches the write sets it cannot apply.|
|6.||The node completes the state transfer to JOINER.|
NOTE: To keep the graph easier to read, the following transitions were omitted from it:
|o||At any time, a cluster configuration change event can send the node to the PRIMARY or OPEN state, for example SYNCED→OPEN, when a node loses its connection to the Primary Component due to a network partition.|
|o||If the node does not need a state transfer (for example, a node restart in an idle cluster) it goes straight from the PRIMARY state to the JOINED state.|
In addition to single node failures, the cluster may be split into several components due to network failure. A component is a set of nodes, which are connected to each other, but not to nodes in other components.
In such a situation, only one of the components can continue to modify the database state to avoid history divergence. This component is called the Primary Component (PC). In normal operation, the Galera Cluster is a PC. When cluster partitioning happens, Galera Cluster invokes a special quorum algorithm to select a PC that guarantees that there is no more than one primary component in the cluster.
SEE ALSO: Chapter Galera Arbitrator
The current number of nodes in the cluster defines the current cluster size. There is no configuration setting that would define the list of all possible cluster nodes. As a result, every time a node joins the cluster, the total size of the cluster is increased and when a node leaves the cluster (gracefully) the size is decreased.
The cluster size determines the required votes to achieve quorum. A quorum vote is carried out when a node does not respond and is suspected to no longer be part of the cluster. This no response timeout is defined by the evs.suspect_timeout setting in the wsrep_provider_options (default 5 sec).
If a node is determined to be disconnected, the remaining nodes cast a quorum vote. If a majority of the nodes that were connected before the disconnect remains, that partition stays up. In the case of a network partition, there will be nodes active on both sides of the network disconnect. In this case, only the partition with quorum will continue; the partition(s) without quorum will enter the non-Primary state and attempt to connect to the Primary Component. [image]
As quorum requires a majority, you cannot have automatic failover in a two-node cluster. The failure of one node will cause the remaining node to go non-Primary. Furthermore, a cluster with an even number of nodes has a risk of a split-brain condition: if network connectivity is lost somewhere between the partitions and the number of nodes is split exactly in half, neither partition retains quorum, and both go non-Primary, as depicted in the figure below. [image]
For automatic failover, use at least three nodes. The same applies on other infrastructure levels. For example:
|o||A cluster on a single switch should have 3 nodes|
|o||A cluster spanning switches should be spread across at least 3 switches|
|o||A cluster spanning networks should be spread across at least 3 networks|
|o||A cluster spanning data centers should spread across at least 3 data centers|
To prevent the risk of a split-brain situation within a cluster that has an even number of nodes, partition the cluster in a way that one component always forms the Primary cluster section. For example (P = Primary, NP = Non-Primary):
4 -> 3(P) + 1(NP)
6 -> 4(P) + 2(NP)
6 -> 5(P) + 1(NP)
With such partitioning, it is extremely rare that the number of nodes would be split exactly in half.
Galera Cluster supports a weighted quorum, where each node can be assigned a weight in the 0 to 255 range, with which it will participate in quorum calculations.
The quorum calculation formula is:
(sum(p_i x w_i) - sum(l_i x w_i))/2 < sum(m_i x w_i)
|o||p_i—Members of the last seen primary component|
|o||l_i—Members that are known to have left gracefully|
|o||m_i—Current component members|
In other words, the quorum is preserved if (and only if) the sum weight of the nodes in a new component strictly exceeds half of that of the preceding Primary Component, minus the nodes which left gracefully.
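To illustrate the formula with concrete numbers (a minimal sketch, not Galera code): four nodes of weight 1, one of which crashes without leaving gracefully, still leave the survivors with quorum:

```shell
# Quorum holds iff (sum(p_i x w_i) - sum(l_i x w_i))/2 < sum(m_i x w_i),
# i.e. (p - l) < 2*m with integer weights.
p=4   # total weight of the last seen primary component (4 nodes, weight 1)
l=0   # weight of members that left gracefully (a crash is not graceful)
m=3   # weight of the current component (3 surviving nodes)
if [ $((p - l)) -lt $((2 * m)) ]; then echo "quorum"; else echo "no quorum"; fi
```

Had the cluster split exactly in half (m=2), the strict inequality 4 < 4 would fail on both sides and neither partition would keep quorum.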
Node weight can be customized by using the pc.weight Galera parameter. By default, the node weight is 1, which translates into the traditional "node count" behavior.
NOTE: The node weight can be changed at runtime simply by setting the pc.weight parameter. The new weight is applied when a message carrying the weight is delivered. At the moment, there is no mechanism to be notified when the new weight has been applied; it will just "eventually" happen, when the message is delivered.
WARNING: If a group partitions at the moment when the weight change message is delivered, all partitioned components that deliver weight change messages in the transitional view will become non-primary components. Partitions that deliver messages in the regular view will go through the quorum computation with the applied weight when the following transitional view is delivered. In other words, there is a corner case where the entire cluster can end up in a non-primary component, if the weight changing message is sent at the moment when the partitioning takes place.
Recovery from such a situation should be done either by waiting for a re-merge or by inspecting which partition is the most advanced and bootstrapping it as a new Primary Component.
Weighted Quorum Examples
See below for some weighted quorum examples and use cases:
o Weighted quorum for three nodes:
n1: weight 2
n2: weight 1
n3: weight 0
Killing nodes n2 and n3 simultaneously preserves primary component on n1. Killing n1 makes n2 and n3 become non-primary components.
o Weighted quorum for a simple master-slave scenario:
n1: weight 1
n2: weight 0
If master n1 dies, n2 will become a non-primary component. However, if n2 dies, n1 will continue as the primary component. If the network connection between n1 and n2 fails, n1 will continue as the primary component and n2 will become a non-primary component.
o Weighted quorum for a master and multiple slaves scenario:
n1: weight 1
n2: weight 0
n3: weight 0
...
nn: weight 0
If n1 dies, all remaining nodes end up as non-primary components. If any other node dies, the primary component is preserved. In the case of network partitioning, n1 will always remain as a primary component.
o Weighted quorum for a primary and secondary site scenario:
n1: weight 2
n2: weight 2
n3: weight 1
n4: weight 1
Site 1 has nodes n1 and n2, site 2 has nodes n3 and n4. Setting node weights as above guarantees that nodes at site 1 remain the primary component if site 2 goes down or if the network between the sites fails. Also, either n1 or n2 can crash without the rest of the nodes becoming non-primary components.
Galera Cluster guarantees node consistency regardless of where and when the query is issued. In other words, you are free to choose a load-balancing approach that best suits your purposes. If you decide to place the load balancing mechanism between the database and the application, you can consider, for example, the following tools:
o HAProxy - HAProxy is an open source TCP/HTTP load balancer.
o Pen - Pen is another open source TCP/HTTP load balancer. Pen performs better than HAProxy on SQL traffic.
o Galera Load Balancer - Galera Load Balancer was inspired by Pen, but is limited to balancing generic TCP connections only.
A single Galera Cluster node can fail to operate for a variety of reasons, such as:
o A hardware failure
o A software crash
o Loss of network connectivity
All of these causes are generalized behind the concept of a node failure.
Single Node Failure Detection
The only sign of a node failure is a loss of connection to the node process as seen by another node. The node is considered failed when it is no longer a member of the cluster Primary Component, that is, when the members of the primary component no longer see it. From the perspective of the failed node (unless it has crashed itself), it has lost connection to the primary component.
Third-party node monitoring tools, such as ping/Heartbeat/Pacemaker, may be grossly off in their estimates on the node failure, as they do not participate in the Galera Cluster group communication and are not aware of the primary component. Monitor the Galera Cluster node status only by polling the wsrep_local_state status variable or use a notification script.
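A minimal polling sketch along those lines; the numeric codes below are the standard wsrep_local_state values, and in a live check the state variable would be filled from mysql -N -e "SHOW STATUS LIKE 'wsrep_local_state'" (assumed here, since no server is involved):

```shell
# Map a wsrep_local_state value to a human-readable label.
# 1=Joining, 2=Donor/Desynced, 3=Joined, 4=Synced.
state=4   # hard-coded for the sketch; normally read from the server
case "$state" in
  1) label="Joining" ;;
  2) label="Donor/Desynced" ;;
  3) label="Joined" ;;
  4) label="Synced" ;;
  *) label="Unknown" ;;
esac
echo "$label"
```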
Node connection liveness is determined from the last time a network packet was received from the node. This is checked every evs.inactive_check_period. If the node has no messages to send for more than evs.keepalive_period, it will emit heartbeat beacons instead. If no packets were received from the node for the evs.suspect_timeout, the node is declared suspected. When all members of the component see the node as suspected, it is pronounced inactive (failed). If no messages were received from the node for more than evs.inactive_timeout, it is pronounced inactive regardless of the consensus. The component stays non-operational until all members agree on the membership. If the members cannot reach consensus on the liveness of a node, the network is too unstable for the cluster to operate.
The relation between the option values is:
evs.keepalive_period <= evs.inactive_check_period <= evs.suspect_timeout <= evs.inactive_timeout <= evs.consensus_timeout
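A quick sanity check of a candidate timeout set against this ordering; the values (in seconds) are illustrative assumptions, not recommended settings:

```shell
# Verify keepalive <= inactive_check <= suspect <= inactive <= consensus.
keepalive=1; inactive_check=2; suspect=5; inactive=15; consensus=30
ok=yes
[ "$keepalive" -le "$inactive_check" ] || ok=no
[ "$inactive_check" -le "$suspect" ] || ok=no
[ "$suspect" -le "$inactive" ] || ok=no
[ "$inactive" -le "$consensus" ] || ok=no
echo "$ok"
```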
An unresponsive node that fails to send messages or heartbeat beacons on time (for example, due to heavy swapping) may also be pronounced failed, so that it does not lock the operation of the rest of the cluster. If such behaviour is undesirable, increase the timeouts.
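For example, on an unstable WAN link the timeouts could be raised through wsrep_provider_options; the values below are illustrative (Galera's evs timeouts take ISO 8601 durations, e.g. PT10S = 10 seconds):

```ini
[mysqld]
wsrep_provider_options="evs.suspect_timeout=PT10S; evs.inactive_timeout=PT30S"
```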
Trade-Off Between Availability and Partition Tolerance
Within the CAP theorem, Galera Cluster emphasizes data safety and consistency, which leads to a trade-off between cluster availability and partition tolerance. To be more specific, in unstable networks (such as WAN) low evs.suspect_timeout/evs.inactive_timeout values may result in false node failure detections, whereas higher values will result in longer availability outages in the case of a real node failure. Essentially, the evs.suspect_timeout defines the minimum time needed to detect a failed node, during which the cluster will be unavailable due to the consistency constraint.
If one of the nodes in the cluster fails, the other nodes will continue to operate as usual. When the failed node comes up again, it automatically synchronizes with the other nodes before it is allowed back into the cluster. No data is lost when a node fails.
See chapter Node Provisioning and Recovery for more information on manual node recovery.
A split-brain situation is a cluster failure where database nodes in the cluster begin to operate autonomously from each other. Data can get irreparably corrupted as two different database nodes update the data independently.
Like any quorum-based system, Galera Cluster is subject to the split-brain condition when the quorum algorithm fails to select a primary component. This can happen, for example, in a cluster without a backup switch if the main switch fails. However, the most likely split-brain situation is when a single node fails in a two-node cluster. Thus, it is strongly advised that the minimum Galera Cluster configuration is three nodes.
Galera Cluster is designed to avoid split-brain situations. If a cluster is split into two partitions of equal size, both of the split partitions end up in a non-primary component (unless explicitly configured otherwise). In this situation, proceed as described in chapter Resetting the Quorum.
A failure in state transfer renders the receiving node unusable. If a state transfer failure is detected, the receiving node will abort.
Restarting the node after a mysqldump failure may require manual restoring of the administrative tables. The rsync method does not have this issue, since it does not need the server to be in operational state.
The Galera Replication Plugin is a synchronous multi-master replication plug-in for MySQL, MariaDB, and Percona XtraDB Cluster. Galera Replication Plugin features include, for example:
o Unconstrained parallel applying, also known as parallel replication
o Multicast replication
o Automatic node provisioning
The primary focus of the Galera Replication Plugin is data consistency: the transactions are either applied on every node, or not at all. In short, the Galera Replication Plugin keeps databases synchronized provided that they were properly configured and synchronized at the beginning.
The Galera Replication Plugin differs from the regular MySQL Replication by, for example, addressing a number of issues including write conflicts when writing on multiple masters, replication lag and slaves being out of sync with the master.
See below for an example of a typical Galera Replication cluster: [image: Galera Replication Cluster]
Applications can write to any node in a Galera Replication cluster, and transaction commits (RBR events) are then applied on all servers, through certification-based replication.
Certification-based replication is an alternative approach to synchronous database replication using group communication and transaction ordering techniques.
The minimal recommended Galera Replication cluster consists of three nodes.
This chapter lists the Galera Cluster prerequisites.
The Galera Cluster software prerequisites are:
o The MySQL server
o wsrep API
o Galera Replication Plugin
o A standard installation of a Linux distribution of your choice
o DEB and RPM specific applications are listed in chapter Installing Galera Cluster for MySQL
SEE ALSO: Chapter Versioning Information
The Galera Cluster hardware prerequisites are:
o Typical server hardware capable of running the MySQL server (for each node)
o A 100 Mbps Ethernet card (for each node)
o A switch, hub, or router to provide network connectivity (for the cluster)
NOTE: We recommend running Galera Cluster on its own subnet for security and performance reasons.
If you want to install Galera Cluster, proceed as follows:
1. Download the write set replication patches for MySQL as a binary package for your Linux distribution from https://launchpad.net/codership-mysql/+download.
2. Download the Galera Replication Plugin, a generic synchronous multi-master replication plugin for transactional applications, from https://launchpad.net/galera.
3. Verify the downloads using the MD5 sums that Launchpad generates.
4. Follow the Linux distribution specific instructions in the chapters below.
NOTE: In the examples below, MySQL authentication options are omitted for brevity.
If you want to create a more sophisticated setup right at the beginning, see chapter Configuring Galera Cluster for MySQL.
Installing Galera Cluster for MySQL on DEB-based Distributions
This chapter describes how to install Galera Cluster on Debian and Debian-derived distributions.
Upgrade from mysql-server-5.0 to mysql-wsrep is not supported. Upgrade to mysql-server-5.5.
If you are installing over an existing MySQL installation, mysql-server-wsrep will conflict with the mysql-server-5.5 package. Remove the package as follows:
$ sudo apt-get remove mysql-server-5.5 mysql-server-core-5.5
The mysql-server-wsrep package requires packages psmisc and mysql-client-5.1.47 (or later). The MySQL 5.1 packages can be found in backports repositories. For more information on configuring and using Debian or Ubuntu backports, see:
For example, the installation of the required packages on Debian Lenny proceeds as follows:
$ sudo apt-get install psmisc
$ sudo apt-get -t lenny-backports install mysql-client-5.1
To install Galera Cluster, proceed as follows:
|1.||Install the write set replication patches:|
$ sudo dpkg -i <mysql-server-wsrep DEB>
|2.||Configure the write set replication patches to use the Galera Replication Plugin as a wsrep provider:|
|3.||Start the MySQL server.|
This chapter describes how to install Galera Cluster on CentOS and similar RPM-based distributions.
If you are migrating from an existing MySQL installation, there are two options:
o If you are already using the official MySQL-server-community 5.5.x RPM from Oracle, remove it:
# rpm -e mysql-server
|o||If you are upgrading from the stock mysql-5.0.77 on CentOS:|
|1.||Make sure that the following packages are not installed:|
# rpm --nodeps --allmatches -e mysql-server mysql-test mysql-bench
|2.||Install the official MySQL-shared-compat-5.5.x from http://dev.mysql.com/downloads/mysql/5.1.html|
To install Galera Cluster, proceed as follows:
|1.||Install the write set replication patches:|
# rpm -Uvh <MySQL-server-wsrep RPM>
|2.||Configure the write set replication patches to use the Galera Replication Plugin as a wsrep provider:|
|3.||Start the MySQL server.|
If the installation fails due to missing dependencies, install the missing packages (for example, yum install perl-DBI) and retry.
Installing Additional Packages
Install also the following additional packages (if not yet installed):
o MySQL-client-community for connecting to the server and for the mysqldump-based SST (for RPM-based distributions only).
o rsync for the rsync-based SST.
o xtrabackup and nc (netcat) for the xtrabackup-based SST.
If you are upgrading a previous MySQL installation, upgrade the system tables as follows:
1. Start mysqld.
2. Run the mysql_upgrade command.
See the MySQL documentation in case of errors. The errors are usually uncritical and can be ignored unless specific functionality is required.
You can install Galera Cluster for MySQL on cloud servers in the same way as on your own hardware servers. The cloud installation principle is depicted in the figure below. For practical installation instructions, see chapter Installing Galera Cluster for MySQL. [image]
This chapter presents the mandatory and recommended settings for Galera Cluster installation and use. It may be possible to start the cluster after only setting the wsrep_provider and wsrep_cluster_address variables. However, the best results can be achieved by fine-tuning the configuration to best match the use case.
SEE ALSO: Chapter Galera Parameters.
Unless you are upgrading an already installed mysql-wsrep package, you must configure the installation to prepare the server for operation.
Edit the my.cnf configuration file as follows:
o Make sure that the system-wide my.cnf file does not bind mysqld to 127.0.0.1. To be more specific, if you have the following line in the [mysqld] section, comment it out:
#bind-address = 127.0.0.1
Make sure that the system-wide my.cnf file contains the line below:
Edit the /etc/mysql/conf.d/wsrep.cnf configuration file as follows:
When a new node joins the cluster, it will have to receive a state snapshot from one of the peers. This requires a privileged MySQL account with access from the rest of the cluster. Set the MySQL login/password pair for SST in the /etc/mysql/conf.d/wsrep.cnf configuration file as follows:
Restart the MySQL server and connect to it as root to grant privileges to the SST account. Furthermore, empty users confuse MySQL authentication matching rules. Delete them:
$ mysql -e "SET wsrep_on=OFF; DELETE FROM mysql.user WHERE user='';"
$ mysql -e "SET wsrep_on=OFF; GRANT ALL ON *.* TO wsrep_sst@'%' IDENTIFIED BY 'wspass';"
The MySQL-wsrep server must be accessible from other cluster members through its client listening socket and through the wsrep provider socket. See your distribution and wsrep provider documentation for details. For example, on CentOS you could use these settings:
# iptables --insert RH-Firewall-1-INPUT 1 --proto tcp --source <my IP>/24 --destination <my IP>/32 --dport 3306 -j ACCEPT
# iptables --insert RH-Firewall-1-INPUT 1 --proto tcp --source <my IP>/24 --destination <my IP>/32 --dport 4567 -j ACCEPT
# iptables --insert RH-Firewall-1-INPUT 1 --proto tcp --source <my IP>/24 --destination <my IP>/32 --dport 4568 -j ACCEPT
If there is a NAT firewall between the nodes, configure it to allow
direct connections between the nodes (for example, through port forwarding).
If you have SELinux enabled, it may block mysqld from carrying out the required operations. Disable SELinux or configure it to allow mysqld to run external programs and open listen sockets at unprivileged ports (that is, things that an unprivileged user can do). See SELinux documentation for more information.
To disable SELinux, proceed as follows:
1. Run setenforce 0 as root.
2. Set SELINUX=permissive in /etc/selinux/config.
AppArmor is always included in Ubuntu. It may prevent mysqld from opening additional ports or running scripts. See the AppArmor documentation for more information on its configuration.
To disable AppArmor, proceed as follows:
$ cd /etc/apparmor.d/disable/
$ sudo ln -s /etc/apparmor.d/usr.sbin.mysqld
$ sudo service apparmor restart
See below for an example my.cnf file:
[mysqld]
# 1. Mandatory settings: these settings are REQUIRED for proper cluster operation
query_cache_size=0
binlog_format=ROW
default_storage_engine=innodb
innodb_autoinc_lock_mode=2

# 2. Optional mysqld settings: your regular InnoDB tuning and such
datadir=/mnt/mysql/data
innodb_buffer_pool_size=28G
innodb_log_file_size=100M
innodb_file_per_table
innodb_flush_log_at_trx_commit=2

# 3. wsrep provider configuration: basic wsrep options
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_provider_options="gcache.size=32G; gcache.page_size=1G"
wsrep_cluster_address=gcomm://192.168.0.1,192.168.0.2,192.168.0.3
wsrep_cluster_name=my_galera_cluster
wsrep_node_address=192.168.0.2
wsrep_node_name=node2
wsrep_sst_method=xtrabackup
wsrep_sst_auth=root:rootpa$$

# 4. additional "frequently used" wsrep settings
wsrep_node_incoming_address=192.168.10.2
wsrep_sst_donor=node3
wsrep_slave_threads=16
In the example above, there are 11 wsrep configuration variables. This is usually all that is needed for clustering.
Always customize the settings in section 3 before taking the cluster into production.
You must give values to the settings below:
o query_cache_size=0—This value disables the query cache. The query cache is disabled because, in typical high-concurrency environments, InnoDB scalability outstrips the query cache. It is not recommended to enable the query cache.
o binlog_format=ROW—This variable sets the binary logging format to row-level replication, as opposed to statement-level replication. Do not change this value, as it affects performance and consistency. As a side effect, the binlog, if turned on, can be ROW only.
o default_storage_engine=InnoDB—InnoDB is a high-reliability and high-performance storage engine for MySQL. Starting with MySQL 5.5, it is the default MySQL storage engine.
o innodb_autoinc_lock_mode=2—This variable sets the lock mode to use for generating auto-increment values. Value 2 sets the interleaved lock mode. Without this parameter, INSERTs into tables with an AUTO_INCREMENT column may fail. Lock modes 0 and 1 can cause unresolved deadlocks and make the system unresponsive.
For better performance, you can give values to the settings below:
o datadir=/mnt/mysql/data—The MySQL data directory.
o innodb_buffer_pool_size=28G—The size in bytes of the buffer pool, that is, the memory area where InnoDB caches table and index data.
o innodb_log_file_size=100M—The size in bytes of each log file in a log group.
o innodb_file_per_table—When innodb_file_per_table is enabled, InnoDB stores the data and indexes for each newly created table in a separate .ibd file, rather than in the system tablespace.
o innodb_flush_log_at_trx_commit=2—This parameter defines how often the log buffer is written out to the log file and how often the log file is flushed to disk. With value 2, the log buffer is written out to the file at each commit, but the flush-to-disk operation takes place only once per second.
Compared with the default value 1, you can achieve better performance by setting the value to 2, but an operating system crash or a power outage can erase the last second of transactions. However, this risk is handled by synchronous replication—you can always recover the node from another node.
WARNING: With innodb_flush_log_at_trx_commit=2, some transactions can be lost if the entire cluster goes down, for example, due to a datacenter power outage.
The basic wsrep provider settings are:
o wsrep_provider=/usr/lib64/galera/libgalera_smm.so—The path to the Galera Replication Plugin.
o wsrep_cluster_address=gcomm://192.168.0.1,192.168.0.2,192.168.0.3—The cluster connection URL. See chapter Starting a Cluster.
o wsrep_provider_options="gcache.size=32G; gcache.page_size=1G"—A string of provider options passed directly to the provider.
o wsrep_cluster_name=my_galera_cluster—The logical cluster name. If a node tries to connect to a cluster with a different name, the connection fails.
o wsrep_node_address=192.168.0.2—An option to explicitly specify the network address of the node, if autoguessing does not produce desirable results.
o wsrep_node_name=node2—The logical node name, for convenience.
o wsrep_sst_method=xtrabackup—The method used for state snapshot transfers.
o wsrep_sst_auth=root:rootpa$$—A string with authentication information for state snapshot transfer.
For better performance, you can also give values to the settings below:
|o||wsrep_node_incoming_address=192.168.10.2—The address at which the server expects client connections. This parameter is intended for integration with load balancers.|
|o||wsrep_sst_donor=node3—The name of the server that should be used as a source for state transfer. Give the donor node name as configured with the wsrep_node_name parameter on the desired donor.|
|o||wsrep_slave_threads=16—How many threads to use for applying slave write sets.|
In normal operation, a Galera Cluster node does not consume much more memory than a regular MySQL server. Additional memory is consumed for the certification index and uncommitted write sets, but usually this is not noticeable in a typical application. However, writeset caching during state transfer makes an exception.
When a node is receiving a state transfer, it cannot process and apply incoming write sets because it has no state to apply them to yet. Depending on the state transfer mechanism (for example, mysqldump), the node that sends the state transfer may not be able to apply write sets. Instead, the node must cache the write sets for a catch-up phase. The Writeset Cache (GCache) is used to cache write sets on memory-mapped files on disk. These files are allocated as needed. In other words, the limit for the cache is the available disk space. Writing on disk reduces memory consumption.
If you want to adjust flow control during state transfer, adjust the Galera Cluster parameters below:
o gcs.recv_q_hard_limit—The maximum allowed size of the recv queue. This should normally be half of (RAM + swap). If this limit is exceeded, Galera Cluster will abort the server.
o gcs.recv_q_soft_limit—A fraction of gcs.recv_q_hard_limit after which the replication rate will be throttled.
o gcs.max_throttle—How much the replication rate can be throttled during state transfer (to avoid running out of memory).
This chapter describes how to start up a cluster with three nodes. We use host names host1, host2 and host3 in the command examples.
Before you start, ensure that you have:
o Three database hosts with Galera Cluster installed.
o No firewalls between the hosts.
o SELinux or AppArmor disabled or running in permissive mode.
o The wsrep_provider parameter value defined.
SEE ALSO: Chapter Using Galera Cluster with SELinux
Creating a MySQL Client Connection Between the Nodes
After the MySQL and Galera Cluster installations, the MySQL servers are running as three independent databases. They are not connected to each other as a cluster. Furthermore, the Galera Replication Plugin is not loaded in the configuration.
To connect the MySQL servers to each other as a cluster, you must create MySQL client connections between the nodes. In this way, the nodes can carry out state snapshot transfers with each other.
In this example, you use the default state snapshot transfer method, that is, mysqldump. Grant root privileges in the MySQL user database on all nodes to allow this, as follows:
mysql
GRANT ALL ON *.* TO root@host1;
GRANT ALL ON *.* TO root@host2;
GRANT ALL ON *.* TO root@host3;
exit
To create and bootstrap the first cluster node, you must set up the group communication structure for the cluster nodes. Proceed as follows:
1. Power up the servers that will join the cluster.
2. Start the first mysqld server with an empty cluster address URL:
/etc/init.d/mysql start --wsrep-cluster-address="gcomm://"
WARNING: Only use an empty gcomm address when you want to create a new cluster. Never use it when you want to reconnect to an existing one.
Check that the startup succeeded:
mysql -e "SHOW VARIABLES LIKE 'wsrep_cluster_address'"
Immediately after startup, open the my.cnf configuration file in a text editor and change the value of wsrep_cluster_address to point to the other two nodes:
NOTE: Do not restart MySQL at this point.
NOTE: You can also use IP addresses.
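For example, on host1 the edited line could look as follows, using the host names from this chapter:

```ini
wsrep_cluster_address="gcomm://host2,host3"
```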
5. To add the second and third nodes to the cluster, see chapter Adding Nodes to a Cluster below.
To add a new node to an existing cluster, proceed as follows:
1. Power up the server that will join the cluster. Do not start the mysqld server yet.
2. Define the wsrep provider and the host names for the other mysqld servers in the cluster. Specify these parameters in the my.cnf configuration file as follows:
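A minimal sketch of such a my.cnf fragment for the second node is shown below. The plugin path is an assumption; it varies by distribution:

```ini
[mysqld]
# Location of the Galera Replication Plugin (path is an assumption)
wsrep_provider=/usr/lib/galera/libgalera_smm.so
# The other mysqld servers in the cluster
wsrep_cluster_address="gcomm://host1,host3"
```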
NOTE: You can also use IP addresses.
This setting indicates to the starting mysqld server that there is an existing cluster to connect to.
3. (Optional) If the node will be part of a weighted quorum, set the initial node weight to zero. This guarantees that if the joining node fails before it is synchronized, it has no effect on the quorum computation that follows. See chapter Weighted Quorum.
4. Start the mysqld server:
5. The new node connects to the defined cluster members. It will automatically retrieve the cluster map and reconnect to the rest of the nodes.
Carry out the procedure above on node host3. The only difference is that you must define hosts host1 and host2 for it in step 2 as follows:
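For example, the cluster address line in the my.cnf configuration file on host3 would point to the two other hosts:

```ini
wsrep_cluster_address="gcomm://host1,host2"
```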
As soon as all cluster members agree on the membership, state exchange will be initiated. In state exchange, the new node is informed of the cluster state. If the node state differs from the cluster state (which is normally the case), the new node requests a state snapshot from the cluster and installs it. After this, the new node is ready for use.
Testing that the Cluster Works
You can test that Galera Cluster works as follows:
1. Connect to MySQL on any node:
2. Verify that all nodes have connected to each other by checking the following status variables:
show status like 'wsrep_%';
+----------------------------+--------------------------------------+
| Variable_name              | Value                                |
+----------------------------+--------------------------------------+
...
| wsrep_local_state_comment  | Synced (6)                           |
| wsrep_cluster_size         | 3                                    |
| wsrep_ready                | ON                                   |
+----------------------------+--------------------------------------+
In the example above:
- The ``wsrep_local_state_comment`` value *Synced* indicates that the node is connected to the cluster and operational.
- The ``wsrep_cluster_size`` value *3* indicates that there are three nodes in the cluster.
- The ``wsrep_ready`` value *ON* indicates that this node is connected to the cluster and able to handle transactions.
3. Create a test table and insert data. On host1, open a MySQL prompt and issue the commands:
CREATE DATABASE galeratest;
use galeratest
CREATE TABLE t (id INT PRIMARY KEY auto_increment, msg TEXT);
INSERT INTO t (msg) VALUES ("Hello my dear cluster");
INSERT INTO t (msg) VALUES ("Hello again");
4. Check that the data was replicated correctly. On host2, open a MySQL prompt and issue the commands:
use galeratest
SELECT * FROM t;
+----+-----------------------+
| id | msg                   |
+----+-----------------------+
|  3 | Hello my dear cluster |
|  6 | Hello again           |
+----+-----------------------+
5. The results above indicate that the cluster works.
You can also test Galera Cluster by simulating various failure situations on three nodes as follows:
o To simulate a crash of a single mysqld process, run the command below on one of the nodes:

  killall -9 mysqld
o To simulate a network disconnection, use iptables or netem to block all TCP/IP traffic to a node.
o To simulate an entire server crash, run each mysqld in a virtualized guest, and abruptly terminate the entire virtual instance.
If you have three or more Galera Cluster nodes, the cluster should be able to survive the simulations.
You can test Galera Cluster for split-brain situations on a two node cluster as follows:
o Disconnect the network connection between the cluster nodes. The quorum is lost, and the nodes do not serve requests.
o Re-connect the network connection between the cluster nodes. The quorum remains lost, and the nodes do not serve requests.
o Run the command below on one of the servers:
mysql> SET GLOBAL wsrep_provider_options='pc.bootstrap=1';
This command resets the quorum and the cluster is recovered.
Galera Cluster for MySQL URL
The syntax of the Galera Cluster URL address, to which the nodes connect, is shown below:
<backend schema>://<cluster address>[?option1=value1[&option2=value2]]
o <backend schema>—Refers to the Galera Cluster schema. Galera Cluster supports two schemata:
  o dummy—This schema is a pass-through backend for testing and profiling purposes. The schema does not connect anywhere. Any values given with it will be ignored.
  o gcomm—This schema is a real group communication backend used for production purposes. This backend takes an address and has a number of parameters that can be set through the option list (or through wsrep_provider_options as of version 0.8.x).
o <cluster address>—The cluster address must be:
  o an address of any current cluster member, if you want to connect the node to an existing cluster, or
  o a comma-separated list of possible cluster members, assuming that the list members can belong to no more than one Primary Component, or
  o an empty string, if you want this node to be the first in a new cluster (that is, there are no pre-existing nodes to connect to).
NOTE: As of version 2.2, Galera Cluster supports a comma-separated list of cluster members in the cluster address, such as:
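For example, using the host names from this chapter, such a cluster address could look as follows:

```
gcomm://host1,host2,host3
```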
WARNING: Only use an empty gcomm:// address when you want to create a new cluster. Never use it when your intention is to reconnect to an existing one. Furthermore, never leave it hard coded in any configuration files.
NOTE: One way to avoid editing the my.cnf configuration file to remove gcomm:// is to start all cluster nodes with the following URL:
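A sketch of such a startup URL, assuming the host names used in this chapter, combines the full member list with the pc.wait_prim option:

```
gcomm://host1,host2,host3?pc.wait_prim=no
```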
The pc.wait_prim=no option makes the node wait for a primary component indefinitely. Then bootstrap the primary component by setting pc.bootstrap=1 on any other node.
However, you can only use the pc.wait_prim=no option with mysqldump SST, as the MySQL parser must be initialized before SST, to pass the pc.bootstrap=1 option.
o options—The option list can be used to set backend parameters, such as the listen address and timeout values. In version 0.7.x, this was the only way to customize Galera Cluster behavior. The parameter values set in the option list are not durable and must be resubmitted on every connection to the cluster. As of version 0.8, customized parameters can be made durable by setting them in wsrep_provider_options.
The parameters specified in the URL take precedence over parameters specified elsewhere (for example wsrep_provider_options in the my.cnf configuration file).
Parameters that you can set through the option list are evs.*, pc.* and gmcast.*.
The option list can be optionally followed by a list of key=value queries according to the URL standard.
NOTE: If the listen address and port are not set in the parameter list, gcomm will listen on all interfaces. The listen port will be taken from the cluster address. If it is not specified in the cluster address, the default port 4567 will be used.
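For example, a cluster address that sets the listen address through the option list could look as follows (the IP addresses are illustrative):

```
gcomm://192.168.0.1:4567?gmcast.listen_addr=tcp://0.0.0.0:4567
```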
Occasionally, you may have to restart the entire Galera Cluster. This may happen, for example, in the case of a power failure where every Galera Cluster node is shut down and there is no mysqld process at all. If this happens, proceed as follows:
1. Identify the node with the most advanced node state ID. See chapter Identifying the Most Advanced Node.
2. Start the node as the first node of the cluster.
3. Start the rest of the nodes as usual.
You can identify the node with the most advanced node state ID by comparing the Global Transaction ID values on different nodes.
Open the grastate.dat file, which, by default, is stored in the MySQL datadir.
If the grastate.dat file looks like the example below, you have found the most advanced node state ID:
# GALERA saved state
version: 2.1
uuid:    5ee99582-bb8d-11e2-b8e3-23de375c1d30
seqno:   8204503945773
cert_index:
However, if the grastate.dat file looks like the example below, the node has crashed:
# GALERA saved state
version: 2.1
uuid:    5ee99582-bb8d-11e2-b8e3-23de375c1d30
seqno:   -1
cert_index:
To find the sequence number of the last committed transaction, run mysqld with the --wsrep-recover option. This option will recover the InnoDB table space to a consistent state, print the corresponding GTID into the error log and exit. In the error log, you can see something like this:
130514 18:39:13 [Note] WSREP: Recovered position: 5ee99582-bb8d-11e2-b8e3-23de375c1d30:8204503945771
This is the state ID. Edit the grastate.dat file and update the seqno field manually, or let mysqld_safe automatically recover it and pass it to mysqld the next time you start it.
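As a sketch, the manual update can be scripted. The example below creates sample files to stay self-contained; in real use, the error log and grastate.dat live under the MySQL datadir, and the paths here are assumptions:

```shell
# Demo setup: a sample error log and grastate.dat (in real use these
# live in the MySQL datadir; the /tmp paths are for illustration only)
ERRLOG=/tmp/mysqld.err
STATE=/tmp/grastate.dat
cat > "$ERRLOG" <<'EOF'
130514 18:39:13 [Note] WSREP: Recovered position: 5ee99582-bb8d-11e2-b8e3-23de375c1d30:8204503945771
EOF
cat > "$STATE" <<'EOF'
# GALERA saved state
version: 2.1
uuid:    5ee99582-bb8d-11e2-b8e3-23de375c1d30
seqno:   -1
cert_index:
EOF

# Extract the last recovered GTID from the error log and patch the
# seqno field in the saved state file
pos=$(grep -o 'Recovered position: .*' "$ERRLOG" | tail -n 1 | awk '{print $3}')
seqno=${pos##*:}
sed -i "s/^seqno: .*/seqno:   $seqno/" "$STATE"
grep '^seqno' "$STATE"   # prints: seqno:   8204503945771
```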
If the grastate.dat file looks like the example below, the node has either crashed during execution of a non-transactional operation (such as ALTER TABLE) or aborted due to a database inconsistency.
# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   -1
cert_index:
You can still recover the ID of the last committed transaction from InnoDB as described above. However, the recovery is rather meaningless as the node state is probably corrupted and may not even be functional. If there are no other nodes with a well-defined state, a thorough database recovery procedure (similar to that on a standalone MySQL server) must be performed on one of the nodes, and this node should be used as a seed node for the new cluster. If this is the case, there is no need to preserve the state ID.
Galera Cluster for MySQL consists of nodes. We recommend that you have at least three nodes in your cluster. Each cluster node is a regular MySQL server: you convert your existing MySQL server into a node and use that server as the cluster base.
Due to its synchronous multi-master qualities, the Galera Cluster can be seen as a single server listening at many interfaces. To give you an idea of what Galera Cluster is capable of, we will consider a typical N-tier application and discuss different benefits and considerations of deploying Galera Cluster in that context.
The figure below depicts an N-tier application cluster without clustering. Since the upper tiers are usually stateless, you can start up as many instances as necessary without any concern for synchronization; the instances store their data in the data tier.
This is a simple and easy-to-manage solution. However, the DBMS server at the data tier is a single point of failure and a performance bottleneck.
[image: No Clustering]
The figure below depicts an N-tier application cluster using whole stack clustering, where each stack has a dedicated database server.
[image: Whole Stack Cluster]
This is a simple and easy-to-manage solution, especially if you can install the whole stack in one physical machine. Due to the direct connection from the application tier to the DBMS, the solution has low latencies.
Whole stack clustering has the following challenges:
o DBMS failure brings down the whole stack.
o Inefficient resource usage:
  o Having a dedicated database server for each application stack is wasteful.
  o Poor resource consolidation. One server with a 7 GB buffer pool is much faster than two servers with 4 GB buffer pools.
o Increased unproductive overhead: each server duplicates the work of the others.
o Increased rollback rate due to cluster-wide conflicts.
o Inflexibility. There is no way to limit the number of master nodes or to perform intelligent load balancing.
Despite a long list of considerations, this setup can be usable for several applications.
To address the shortcomings of the whole stack cluster scheme, we can cluster the data tier separately and present it to the application as a single virtual server. The figure below depicts this solution.
[image: DBMS-tier Cluster]
In this solution, the failure of one node does not affect the rest of the cluster. Furthermore, resources are consolidated better and the setup is flexible: nodes can be assigned different roles by using intelligent load balancing.
DBMS-tier clustering has the following challenges:
o The structure is complex. A load balancer is involved and it must be backed up in case of failure. This typically means that you must have two more servers and a failover solution between them.
o Management is more complex. The load balancer must be configured and reconfigured when a DBMS node is added to or removed from the cluster.
o Connections to the DBMS servers are not direct. If you place a proxy between the application and the server, you end up with increased latencies for each query. This can easily become a performance bottleneck (load-balancing servers should be very powerful).
o An attempt to spread this setup over several datacenters may cancel the resource consolidation benefits, as each datacenter must have at least two DBMS servers.
We can improve the DBMS-tier cluster by placing a dedicated load balancer on each application server. The figure below depicts this solution.
[image: DBMS-tier Cluster with Distributed Load Balancing]
In a DBMS-tier cluster with distributed load balancing, the load balancer is no longer a single point of failure. Furthermore, the load balancer scales with the application cluster and is unlikely to become a bottleneck. Finally, the client-server communication latencies are lower.
DBMS-tier clustering with distributed load balancing has the following challenges:
o There are N load balancers to manage and reconfigure when the database cluster configuration changes.
We can also create a hybrid setup where we aggregate several application stacks to use a single DBMS server. The figure below depicts this solution.
[image: DBMS-tier Cluster with Distributed Load Balancing]
This solution improves the resource utilization of the whole stack cluster and still maintains its relative simplicity and the benefit of direct DBMS connections. This is how a DBMS-tier cluster with distributed load balancing would look if we only used one DBMS node per datacenter.
This can be a good setup for sites that are not very big but are still hosted at more than one datacenter.
Node Provisioning and Recovery
If the state of a new or failed node differs from the state of the cluster Primary Component, it needs to be synchronized. As a result, new node provisioning and failed node recovery are essentially the same process of joining a node to the cluster Primary Component.
The initial node state ID is read from the grastate.dat file in wsrep_data_dir, where it is saved every time the node is gracefully shut down. If the node crashes in Total Order Isolation mode, its database state is unknown and its initial Galera Cluster node state is undefined (00000000-0000-0000-0000-000000000000:-1). In normal transaction processing, only the seqno part of the GTID remains undefined (-1), and the UUID part remains valid. In this case, the node can be recovered through IST.
When a node joins the primary component, it compares its state ID to that of the PC and if they do not match, the node requests state transfer from the cluster.
There are two possibilities to select the state transfer donor:
o Automatic—The group communication layer determines the state donor from the available members of the primary component.
o Manual—The state donor name is specified with the wsrep_sst_donor parameter on startup.
In the latter case, if a node with that name is not a part of the primary component, state transfer fails and the joining node aborts. Use the same donor name as set in the wsrep_node_name parameter on the donor node.
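For example, a manual donor selection in the my.cnf configuration file of the joining node, using the host names from this chapter, could look as follows:

```ini
# Request host2 (its wsrep_node_name) as the state transfer donor
wsrep_sst_donor=host2
```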
NOTE: State transfer is a heavy operation not only on the joining node, but also on the donor. The state donor may not be able to serve client requests. Thus, when possible, select the donor manually, based on network proximity. Configure the load balancer to transfer client connections to the other nodes for the duration of state transfer.
During the state transfer the joining node caches writesets received from other nodes in a slave queue and applies them after the state transfer is over, to catch up with the current primary component state. Since the state snapshot always has a state ID, it is easy to determine which writesets are already contained in the snapshot and should be discarded.
During the catch-up phase, flow control ensures that the slave queue gets shorter (that is, the cluster replication rate will be limited to the writeset application rate on the catching-up node). However, there is no guarantee on how soon the node will catch up. When the node catches up, its status becomes SYNCED and it will accept client connections.
Comparison of State Snapshot Transfer Methods
There are two different node provisioning methods:
o State Snapshot Transfer (SST), which transfers the entire node state as it is (hence "snapshot").
o Incremental State Transfer (IST), which only transfers the results of transactions missing from the joining node.
You can choose the SST method (mysqldump, rsync, or xtrabackup), whereas IST will be automatically chosen by the donor node, when it is available. The SST methods are compared in this chapter.
There is no single best state snapshot transfer method; the method must be chosen depending on the situation. Fortunately, the choice only needs to be made on the receiving node; the donor will serve whatever is requested, as long as it supports the method.
See the table below for a summary of the differences between the state snapshot transfer methods:
Method      Speed    Blocks the donor?  Available on live node?  Logical/Physical  Requires root access to MySQL server?
mysqldump   slow     yes                yes                      logical           both donor and joiner
rsync       fastest  yes                no                       physical          none
xtrabackup  fast     for a short time   no                       physical          donor only
When comparing the different state snapshot transfer methods, the division between a logical state snapshot and a physical state snapshot is important, especially from the perspective of configuration:
Physical state snapshot
Pluses: Physical state snapshot is the fastest to transfer, as by definition it does not involve a server on either end. It just physically copies data from the disk at one node to the disk on the other. It does not depend on the joining node database being in a working condition: it just writes all over it. This is a good way to restore a corrupted data directory.
Minuses: A physical state snapshot requires the receiving node to have the same data directory layout and the same storage engine configuration as the donor. For example, InnoDB should have the same file-per-table, compression, log file size and similar settings. Furthermore, a server with initialized storage engines cannot receive physical state snapshots. This means that:
o The node in need of an SST must restart the server.
o The server is inaccessible to the mysql client until the SST is complete, since the server cannot perform authentication without storage engines.
Logical state snapshot
Pluses: A running server can receive a logical state transfer (in fact, only a fully initialized server can receive one). A logical state transfer does not require the receiving node to have the same configuration as the donor node, which allows you to upgrade storage engine options. You can, for example, migrate from the Antelope to the Barracuda file format, start using compression, or resize or relocate the iblog* files to another partition.
Minuses: A logical state transfer is as slow as mysqldump. The receiving server must be prepared to accept root connections from potential donor nodes and the receiving server must have a non-corrupted database.
Mysqldump requires the receiving node to have a fully functional database (which can be empty) and the same root credentials as the donor has. It also requires root access from other nodes. Mysqldump is several times slower than other methods on sizable databases, but may be faster if the database is very small (smaller than the log files, for example). It is also sensitive to the mysqldump tool version; it must be the most recent. It is not uncommon for several mysqldump binaries to be found in the system. Mysqldump can fail if an older mysqldump tool version is incompatible with the newer server.
The main advantage of mysqldump is that a state snapshot can be transferred to a working server. That is, the server can be started standalone and then be instructed to join a cluster from the MySQL client command line. It also can be used to migrate from older database formats to newer.
Sometimes mysqldump is the only option. For example, when upgrading from a MySQL 5.1 cluster with a built-in InnoDB to MySQL 5.5 with an InnoDB plugin.
The mysqldump script only runs on the sending side and pipes the mysqldump output to the MySQL client connected to the receiving server.
Rsync-based state snapshot transfer is the fastest method. It has all the pluses and minuses of a physical snapshot transfer and, in addition, it blocks the donor for the whole duration of the transfer. On terabyte-scale databases, it was found to be considerably (1.5-2 times) faster than xtrabackup, which translates to several hours. Rsync does not depend on the MySQL configuration or root access, which makes it probably the easiest method to configure.
Rsync also has the rsync-wan variant, which engages the rsync delta transfer algorithm. However, this method is more IO-intensive and should only be used when network throughput is the bottleneck, which is usually the case with wide area networks.
The rsync script runs on both sending and receiving sides. On the receiving side, it starts the rsync in server mode and waits for a connection from the sender. On the sender side, it starts the rsync in client mode and sends the contents of the MySQL data directory to the joining node.
The most frequently encountered issue with this method is having incompatible rsync versions on the donor and on the receiving server.
Xtrabackup-based state snapshot transfer is probably the most popular choice. Like rsync, it has the pluses and minuses of a physical snapshot. However, xtrabackup is a virtually non-blocking method on the donor. It only blocks the donor for a very short period of time to copy MyISAM tables, such as system tables. If these tables are small, the blocking time is very short. This naturally comes at the cost of speed: xtrabackup can be considerably slower than rsync.
As xtrabackup must copy a large amount of data in the shortest possible time, it may noticeably degrade the donor performance.
The most frequently encountered problem with xtrabackup is its configuration. xtrabackup requires that certain options be set in the my.cnf file (for example datadir) and a local root access to the donor server. Refer to the xtrabackup manual for more details.
In the case of a network failure, failure of more than half of the cluster nodes, or a split-brain situation, your nodes may no longer consider themselves part of the Primary Component of the cluster. In other words, they suspect that there is another primary component in the cluster to which they have no connection.
If this is the case, all nodes return an Unknown command error to every query.
If you know that no other nodes of your cluster play the primary component role, you can reset the quorum by rebootstrapping the primary component as follows:
1. Choose the most up-to-date node. You can check this from the output of the command below:
SHOW STATUS LIKE 'wsrep_last_committed';
Choose the node with the highest value.
2. Once you have chosen the most up-to-date node, run the command below on it:
SET GLOBAL wsrep_provider_options='pc.bootstrap=yes';
3. The quorum is reset and:
o the component with this node as a member becomes the primary component,
o all nodes in the component synchronize to the most up-to-date node, and
o the cluster starts to accept SQL requests again.
You can monitor wsrep-related status variables in the Galera Cluster by using the standard wsrep queries. As all wsrep-related status variables are prefixed with wsrep, you can query them all by using the command below:
mysql> SHOW STATUS LIKE 'wsrep_%';
You can also define the notification command wsrep_notify_cmd to be invoked when the cluster membership or node status changes. This command can also communicate the event to a monitoring agent. For more information on the wsrep_notify_cmd command, see chapter wsrep_notify_cmd.
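A minimal sketch of such a notification script is shown below. It only appends each event to a log file; the log path is an assumption, and a real script would parse the arguments (such as --status and --members) as described in chapter wsrep_notify_cmd:

```shell
#!/bin/sh
# Minimal wsrep_notify_cmd sketch: log every cluster membership or
# node status change with a timestamp (log location is an assumption)
LOG=/tmp/wsrep_events.log
echo "$(date '+%Y-%m-%d %H:%M:%S') $*" >> "$LOG"
```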
NOTE: You can also use Nagios for monitoring the Galera Cluster.
For more information, see http://www.fromdual.com/galera-cluster-nagios-plugin-en.
Checking the Cluster Integrity
When checking the cluster integrity, the first thing you want to know is whether the node belongs to the right cluster. You can check this from the value of the wsrep_cluster_state_uuid status variable.
This variable value must be the same on all cluster nodes. The nodes with different wsrep_cluster_state_uuid values are not connected to the cluster.
Once you have checked that the node belongs to the right cluster, check whether the node belongs to the same component. You can check this from the value of the wsrep_cluster_conf_id status variable.
This variable value must be the same on all cluster nodes. If the nodes have different wsrep_cluster_conf_id values, they are partitioned. This is a temporary condition and should be resolved when network connectivity between the nodes is restored.
You can also view the number of nodes in the cluster from the value of the wsrep_cluster_size status variable.
If the shown number of nodes is equal to the expected number of nodes, all cluster nodes are connected. You can check this variable on one node only.
Finally, check the primary status of the cluster component that the node is connected to from the wsrep_cluster_status status variable.
If this variable value differs from Primary, there is a partition in the cluster and this component is currently non-operational (due to multiple membership changes and the loss of quorum). A split-brain condition is also possible.
If no node in the cluster is connected to the Primary Component (that is, all nodes belong to the same component, which is a non-primary component), attempt to reset the quorum as explained in chapter Resetting the Quorum.
If you cannot reset the quorum, the cluster must be manually rebootstrapped. If this is the case, proceed as follows:
1. Shut down all nodes.
2. Restart all nodes, starting with the most advanced node. To find the most advanced node, check the wsrep_last_committed status variable.
NOTE: Manual bootstrapping has the downside that gcache contents are lost and no IST is possible, which would probably be the fastest state transfer method in this kind of case.
This situation is very unlikely. If, however, there is another primary cluster component, there is a loss of connectivity between the nodes. Troubleshoot the problem and restore connectivity. After restoration, the nodes from the non-primary component will automatically reconnect and resynchronize themselves with the primary component.
Checking the Node Status
When checking the node status, the first thing you want to know is whether the node is ready to accept SQL load. You can check this from the value of the wsrep_ready status variable.
If the value is true, the node can accept SQL load. If not, check the value of the wsrep_connected status variable.
If the value is OFF, the node has not yet connected to any of the cluster components. This may be due to misconfiguration (for example, the configuration contains an invalid wsrep_cluster_address and/or wsrep_cluster_name). Check the error log for proper diagnostics.
If the node is connected but wsrep_ready = OFF, check the value of the wsrep_local_state_comment status variable.
In a primary component, the variable value is typically Joining, Waiting for SST, Joined, Synced or Donor. If wsrep_ready = OFF and the state comment is Joining, Waiting for SST or Joined, the node is still syncing with the cluster.
In a non-primary component, the node state comment should be Initialized. Any other states are transient and momentary.
Checking the Replication Health
NOTE: The status variables in the chapters below are differential and reset on every SHOW STATUS command. To view the value for the current moment, execute two SHOW STATUS commands on the node with an interval of about 1 minute. The output of the last invocation corresponds to the current moment.
When checking the replication health, the first thing you want to know is how much slave lag is slowing down the cluster. You can check this from the value of the wsrep_flow_control_paused status variable.
The variable value ranges from 0.0 to 1.0 and indicates the fraction of time that replication was paused since the last SHOW STATUS command. A value of 1.0 refers to a complete stop. This value should be as close to 0.0 as possible. The main ways to improve the value are to increase the wsrep_slave_threads value and to exclude slow nodes from the cluster.
The optimal value for wsrep_slave_threads, for its part, is suggested by the value of the wsrep_cert_deps_distance status variable.
This variable indicates how many transactions may be applied in parallel on average. There is no reason to assign wsrep_slave_threads a value much higher than this. The value can also be quite high, even in the hundreds. Use common sense and discretion when you define the value.
Detecting Slow Network Issues
If you have a slow network, check the value of the wsrep_local_recv_queue_avg status variable.
If the variable value is high, the network link can be the bottleneck. If this is the case, the cause can be at any layer, from the physical layer to the operating system configuration.
Galera Cluster supports SSL for the encryption of replication traffic. Authentication is not supported. SSL is a cluster-wide option and must be enabled either on all of the nodes or none at all.
To use SSL, you must generate a private certificate/key pair for the cluster, for example, by the following command:
$ openssl req -new -x509 -days 365000 -nodes -keyout key.pem -out cert.pem
NOTE: It is crucial to generate a certificate/key pair that is valid for a long time. When the certificate expires, there is no way to update the cluster without a complete shutdown. Thus, use a large value for the -days parameter.
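As a sketch, you can confirm the expiry date of a generated certificate with openssl. The command below creates a throw-away pair non-interactively (the -subj value and /tmp paths are arbitrary) and prints its notAfter date:

```shell
# Generate a long-lived self-signed pair without prompts
openssl req -new -x509 -days 365000 -nodes -subj "/CN=galera-demo" \
    -keyout /tmp/key.pem -out /tmp/cert.pem 2>/dev/null

# Print the expiry date to confirm it is far in the future
openssl x509 -in /tmp/cert.pem -noout -enddate
```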
Copy this certificate/key pair to the /etc/mysql directory on all of the nodes. Copy the files over a secure channel between the nodes.
Take the certificate/key pair into use by specifying the following Galera Cluster options:
socket.ssl_cert = <path_to_cert_file>; socket.ssl_key = <path_to_key_file>
Other SSL configuration parameters include socket.ssl_compression and socket.ssl_cipher. See Galera Parameters for details.
NOTE: You cannot use a mixed cluster where some nodes have SSL and some do not. We recommend configuring SSL when you are setting up a new cluster. If you must add SSL support on a production system, you must rebootstrap the cluster and accept a brief outage.
The Galera Cluster SSL support only
covers Galera Cluster communication. Since state
snapshot transfer happens outside of Galera Cluster,
protect it separately.
You can use, for example, the internal SSL support in the MySQL
client or the stunnel program to protect rsync traffic.
By default, Galera Cluster may require all or some of the following ports to be open between the nodes:
o 3306—MySQL client connections and mysqldump SST
o 4567—Galera Cluster replication traffic
o 4568—IST
o 4444—all SSTs besides mysqldump
For example, in a LAN environment the iptables configuration on each node may look as follows:
# iptables -A INPUT -i eth0 -p tcp -m tcp --source 192.168.0.1/24 --dport 3306 -j ACCEPT
# iptables -A INPUT -i eth0 -p tcp -m tcp --source 192.168.0.1/24 --dport 4567 -j ACCEPT
# iptables -A INPUT -i eth0 -p tcp -m tcp --source 192.168.0.1/24 --dport 4568 -j ACCEPT
# iptables -A INPUT -i eth0 -p tcp -m tcp --source 192.168.0.1/24 --dport 4444 -j ACCEPT
In a WAN environment, this setup may be tedious to manage. Alternatively, with not much loss of security, you can simply open a full range of ports between trusted hosts:
# iptables -A INPUT -p tcp -s 22.214.171.124 -j ACCEPT
# iptables -A INPUT -p tcp -s 126.96.36.199 -j ACCEPT
The IP addresses in the example are for demonstration purposes only.
Use the real values from your nodes and netmask in your iptables configuration.
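The per-port rules above can be generated programmatically. The following Python sketch (the function name and the default interface are our own) produces one ACCEPT rule per Galera port for a given subnet:

```python
# Sketch: generate the per-port iptables rules for a LAN setup.
# The port list follows the documentation defaults.

GALERA_PORTS = [3306, 4567, 4568, 4444]

def iptables_rules(subnet, interface="eth0", ports=GALERA_PORTS):
    rule = ("iptables -A INPUT -i {iface} -p tcp -m tcp "
            "--source {subnet} --dport {port} -j ACCEPT")
    return [rule.format(iface=interface, subnet=subnet, port=p)
            for p in ports]

for r in iptables_rules("192.168.0.1/24"):
    print(r)
```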
Differences from a Standalone MySQL Server
Galera Cluster for MySQL has the following differences to a standalone MySQL server:
o Replication only works with the InnoDB storage engine. Writes to tables of other types, including system (mysql.*) tables, are not replicated. However, DDL statements are replicated at the statement level, and changes to mysql.* tables are replicated that way. In other words, you can safely issue CREATE USER... or GRANT..., but an INSERT INTO mysql.user... will not be replicated.
In general, non-transactional engines cannot be supported in multi-master replication.
o Rows in tables without a primary key may appear in a different order on different nodes. As a result, SELECT...LIMIT... may return slightly different sets. The DELETE operation is also unsupported on tables without a primary key.
Do not use tables without a primary key. It is always possible to add an AUTO_INCREMENT column to a table without breaking your application.
o Unsupported queries:
  o LOCK/UNLOCK TABLES cannot be supported in master-master replication.
  o Lock functions (GET_LOCK(), RELEASE_LOCK()...)
Query logs cannot be directed to a table. If you enable query logging, you must
forward the log to a file:
log_output = FILE
Use general_log and general_log_file to choose query logging and log file name.
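For example, a my.cnf fragment that enables query logging to a file might look as follows. This is a sketch only; the log file path is illustrative:

```ini
# Example my.cnf fragment: direct the general query log to a file
# (the file name is illustrative)
log_output       = FILE
general_log      = ON
general_log_file = /var/log/mysql/query.log
```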
|o||Do not use a query cache.|
|o||XA transactions cannot be supported due to possible rollback on commit.|
|o||Transaction size. While Galera Cluster does not explicitly limit the transaction size, a writeset is processed as a single memory-resident buffer and, as a result, extremely large transactions (for example, LOAD DATA) may adversely affect node performance. To avoid that, the wsrep_max_ws_rows and wsrep_max_ws_size variables limit the transaction rows to 128 K and the transaction size to 1 Gb, by default. If necessary, you can increase those limits.|
Due to cluster level optimistic concurrency control, a transaction issuing
a COMMIT may still be aborted at that stage. There can be two transactions
writing to the same rows and committing on separate cluster nodes, and only one
of them can successfully commit. The failing one will be aborted.
For cluster level aborts, Galera Cluster gives back a deadlock error:
Error: 1213 SQLSTATE: 40001 (ER_LOCK_DEADLOCK)
In this case, restart the failing transaction.
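Restarting the failing transaction is typically done with client-side retry logic. The following Python sketch shows the idea; the DeadlockError class and the txn function are stand-ins for a real driver's error 1213 and your transaction code:

```python
# Sketch of client-side retry logic for cluster-level aborts. A real
# client would catch its driver's deadlock exception (MySQL error 1213);
# here the error class and the transaction function are stand-ins.

class DeadlockError(Exception):
    """Stand-in for MySQL error 1213 (ER_LOCK_DEADLOCK)."""

def run_with_retry(txn, max_retries=3):
    for attempt in range(max_retries):
        try:
            return txn()
        except DeadlockError:
            if attempt == max_retries - 1:
                raise
            # Re-running the whole transaction is safe because the
            # aborted transaction was fully rolled back.

attempts = []
def txn():
    attempts.append(1)
    if len(attempts) < 2:
        raise DeadlockError()
    return "committed"

print(run_with_retry(txn))  # "committed" on the second attempt
```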
|o||Windows OS is not supported.|
|o||Do not use binlog-do-db and binlog-ignore-db. These binary log options are only supported for DML, but not for DDL. Using these options will create a discrepancy and replication will abort.|
|o||Do not set the character_set_server system variable to utf16, utf32 or ucs2 if you choose rsync as the state transfer method. The server will crash.|
This chapter describes three different ways to upgrade Galera Cluster.
You can carry out a rolling upgrade on a Galera Cluster by applying the following steps on each cluster node:
NOTE: Transfer all client connections from the node to be upgraded to the other nodes for the duration of the migration.
|1.||Shut down the node|
|2.||Upgrade the software|
|3.||Restart the node|
|4.||Wait until the node gets synchronized with the cluster|
TIP: If the upgraded node has been, or will be, part of a weighted quorum, set the initial node weight to zero. This guarantees that if the joining node fails before it is synchronized, it has no effect on the quorum computation that follows.
The main advantage of a rolling upgrade is that if something goes wrong with the upgrade, the other nodes are still operational and you have time to troubleshoot the problem.
However, rolling upgrade has some issues which deserve consideration:
|o||Upgrading an individual node can take some time, during which the cluster operates at a lower capacity:|
|o||Unless incremental state transfer is used, the node resorts to full state snapshot transfer, which can take a long time depending on the database size and state transfer method.|
|o||During that time, the node will accumulate a very long catch-up replication event queue, which it will have to replay to get synchronized with the cluster. At the same time, the cluster adds more and more events to the queue.|
|o||Unless the xtrabackup or rsync+LVM state transfer method is used, the state snapshot donor node will also be blocked for the duration of the state transfer. The xtrabackup and rsync+LVM state transfers do not block the donor, but may slow it down. In practice, the cluster will lack two nodes for the duration of the state transfer and one node for the duration of the catch-up phase.|
|o||If there are few nodes in the cluster and it operates close to its maximum capacity, taking out 2 nodes can lead to a situation where the cluster cannot serve all requests, or execution times may increase, making the service less available.|
|o||If there are several nodes in the cluster, it would take a long time to upgrade the whole cluster.|
|o||Depending on the load balancing mechanism, you may have to instruct the load balancer not to direct requests to the joining and donating nodes.|
|o||Every time a new node joins the cluster, cluster performance drops until the node's buffer pool warms up. Parallel applying helps here.|
A bulk upgrade upgrades all nodes in an idle cluster to avoid time-consuming state transfers. Bulk upgrade produces a short but complete service outage. You can carry out a bulk upgrade on a Galera Cluster as follows:
1. Stop all load on the cluster
2. Shut down all the nodes
3. Upgrade the software
4. Restart the nodes. The nodes will merge to the cluster without state transfers, in a matter of seconds.
5. Resume the load on the cluster
NOTE: You can carry out steps 2 to 4 on all nodes in parallel, reducing the service outage time to virtually the time needed for a single server restart.
WARNING: Always use this method for a two-node cluster upgrade, as the rolling upgrade would result in a much longer service outage.
The main advantage of the bulk upgrade is that, for huge databases, it
is faster and results in better availability than the rolling upgrade.
The main drawback of the bulk upgrade is that it relies on the assumption
that the upgrade and restart will be quick. However, shutting down InnoDB
may take a few minutes (as it flushes dirty pages), and if something goes
wrong during the upgrade, there is hardly any time to troubleshoot and fix
the problem. Therefore, do not upgrade all nodes at once; try the upgrade
first on a single node.
If only a Galera provider upgrade is required, the bulk upgrade can be further optimized to only take a few seconds. The following is an example for a 64-bit CentOS (or RHEL):
1. Issue the commands below on every node:
# rpm -e galera
# rpm -i <new galera rpm>
|2.||Stop all load on the cluster|
|3.||Issue the commands below on every node:|
mysql> SET GLOBAL wsrep_provider=none;
mysql> SET GLOBAL wsrep_provider=/usr/lib64/galera/libgalera_smm.so;
|4.||Issue the command below on node 1 (or any node):|
mysql> SET GLOBAL wsrep_cluster_address=gcomm://
|5.||Issue the command below on the other nodes:|
mysql> SET GLOBAL wsrep_cluster_address=gcomm://node1
|6.||Resume the load on the cluster|
Reloading the provider and connecting to the cluster takes typically less than 10 seconds; there is virtually no service outage.
In the provider-only upgrade, the warmed up
InnoDB buffer pool is fully preserved and the
cluster will continue to operate at full speed
as soon as the load is resumed.
A schema upgrade refers to any DDL statement run for the database. DDL statements change the database structure and are non-transactional.
DDL statements are processed in two different methods in Galera Cluster. These methods are described in the chapters below.
NOTE: See also the pt-online-schema-change command in Percona Toolkit for MySQL: http://www.percona.com/software/percona-toolkit.
By default, DDL statements are processed by using the Total Order Isolation (TOI) method. In TOI, the query is replicated to the nodes in statement form before it is executed on the master. The query waits for all preceding transactions to commit and is then executed in isolation on all nodes simultaneously. When the TOI method is used, the cluster has a part of the database locked for the duration of the DDL processing (in other words, the cluster behaves like a single server).
The isolation can take place at the following levels:
1. At the server level, where no other transactions can be applied concurrently (for CREATE SCHEMA, GRANT and similar queries).
2. At the schema level, where no transaction accessing the schema can be applied concurrently (for CREATE TABLE and similar queries).
3. At the table level, where no transaction accessing the table can be applied concurrently (for ALTER TABLE and similar queries).
TOI queries have several particularities that must be taken into consideration:
|o||From the perspective of certification:|
|o||TOI transactions never conflict with preceding transactions, since they are only executed after all preceding transactions were committed. Hence, their certification interval is of zero length. This means that TOI transactions will never fail certification and are guaranteed to be executed.|
|o||Certification takes place on a resource level. For example, for server-level isolation this means any transaction that has a TOI query in its certification interval will fail certification.|
|o||The system replicates the TOI query before execution and there is no way to know whether it succeeds or fails. Thus, error checking on TOI queries is switched off.|
|o||The method is simple, predictable and guarantees data consistency.|
|o||The disadvantage is that the cluster behaves like a single server, potentially preventing high-availability for the duration of DDL execution.|
As of wsrep patch 5.5.17-22.3, you can choose whether to use the traditional total order isolation method or the rolling schema upgrade method. You can choose the rolling schema upgrade method by using the global parameter wsrep_OSU_method.
The rolling schema upgrade is a DDL processing method where the DDL is only processed locally on the node. The node is desynchronized from the cluster for the duration of the DDL processing, so that it does not block the rest of the nodes. When the DDL processing is complete, the node applies the delayed replication events and synchronizes back with the cluster.
To upgrade a schema cluster-wide, the DDL must be manually executed at each node in turn. When the rolling schema upgrade proceeds, a part of the cluster will have the old schema structure and a part of the cluster will have the new schema structure.
WARNING: While the rolling schema upgrade has the advantage of blocking only one node at a time, it is potentially unsafe, and may fail if the new and old schema definitions are incompatible at the replication event level. Execute operations such as CREATE TABLE and DROP TABLE in TOI.
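The per-node statement sequence described above can be sketched as follows. This is an illustration only: `execute` is a placeholder for whatever mechanism your client uses to run SQL on one node, and the DDL statement is an example:

```python
# Sketch: the per-node statement sequence for a rolling schema upgrade.
# `execute` is a placeholder for whatever runs SQL on one node; repeat
# this on each node in turn.

def rolling_schema_upgrade(execute, ddl):
    execute("SET GLOBAL wsrep_OSU_method='RSU'")  # desync this node for DDL
    execute(ddl)                                  # applied locally only
    execute("SET GLOBAL wsrep_OSU_method='TOI'")  # restore the default method

issued = []
rolling_schema_upgrade(issued.append, "ALTER TABLE t ADD COLUMN c INT")
for stmt in issued:
    print(stmt)
```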
In normal operation, a Galera Cluster node does not consume much more memory than a regular MySQL server. Additional memory is consumed for the certification index and uncommitted write sets, but usually this is not noticeable in a typical application. However, writeset caching during state transfer makes an exception.
When a node is receiving a state transfer, it cannot process and apply incoming write sets because it has no state to apply them to yet. Depending on a state transfer mechanism (for example, mysqldump), the node that sends the state transfer may not be able to apply write sets. Instead, the node must cache the write sets for a catch-up phase. The Writeset Cache (GCache) is used to cache write sets on memory-mapped files on disk. These files are allocated as needed. In other words, the limit for the cache is the available disk space. Writing on disk reduces memory consumption.
SEE ALSO: Chapter Optional Memory Settings
If you must frequently handle large transactions, such as transactions caused by a DELETE command that deletes millions of rows from a table at once, we recommend using the Percona Toolkit pt-archiver command. For example:

pt-archiver --source h=dbhost,D=keystone,t=token --purge --where "expires < NOW()" --primary-key-only --sleep-coef 1.0 --txn-size 500
This tool deletes rows efficiently. For more information on the tool, see the Percona Toolkit documentation.
Backing up Cluster Data
Galera Cluster backups can be performed just as regular MySQL backups, using a backup script. Since all the cluster nodes are identical, backing up one node backs up the entire cluster.
However, such backups will have no Global Transaction ID associated with them. You can use these backups to recover data, but they cannot be used to recover a Galera Cluster node to a well-defined state. Furthermore, a blocking backup procedure may block cluster operation for the duration of the backup.
You can associate a Global Transaction ID with the backup and avoid cluster stalling by carrying out the backup in the same manner as a state snapshot transfer between the nodes. For that:
o A special backup script must be installed in the mysqld path on the node that will be the backup source.
o The backup must be initiated through the Galera Cluster replication system.
For example, the command below will cause the chosen donor node to run the wsrep_sst_backup script and pass the corresponding Global Transaction ID to it:
/usr/bin/garbd --address gcomm://<donor node address>?gmcast.listen_addr=tcp://0.0.0.0:4444 --group <wsrep_cluster_name> --donor <wsrep_node_name on donor> --sst backup
NOTE: In the command, the ?gmcast.listen_addr=tcp://0.0.0.0:4444 section is an arbitrary listen socket address that the garbd will have to open to communicate with the cluster. You only have to specify it if the default socket address (0.0.0.0:4567) is busy.
The garbd may immediately exit with confusing diagnostics
after making a successful SST request. This is not a failure.
The backup script is run by the donor mysqld. You can
monitor its progress in the donor mysqld error log and/or in
the script log.
If the expense of adding, for example, a third datacenter is too high, you can use Galera Arbitrator. Galera Arbitrator is a member of the cluster which participates in voting, but not in actual replication.
WARNING: Galera Arbitrator does not participate in actual replication, but it receives the same data as the nodes. Use a secure network connection for the arbitrator.
Galera Arbitrator serves two purposes:
|o||It helps to avoid split-brain situations by acting as an odd node in a cluster that is spread across an even number of nodes, for example, two datacenters.|
|o||It can request a consistent application state snapshot.|
Galera Arbitrator is depicted in the figure below: [Figure: Galera Arbitrator]
In the figure above, if one of the data centers fails or loses WAN connection, the node that sees the arbitrator (and therefore sees clients) will continue operation.
NOTE: garbd must see all replication traffic although it does not store it anywhere. Placing the arbitrator in a location with poor connectivity to the rest of the cluster may lead to poor cluster performance.
Arbitrator failure does not affect cluster operation and a new
instance can be reattached to the cluster at any time. There can be
several arbitrators in the cluster.
Configuring Galera Arbitrator
As a Galera Cluster member, Galera Arbitrator accepts all Galera Cluster parameters except those prefixed by replicator.
SEE ALSO: Chapters Galera Parameters and Starting Galera Arbitrator.
Galera Arbitrator is a separate daemon called garbd. You can start it manually as follows:
# /etc/init.d/garb start
You can also add configuration options to the command.
You can also automate running Galera Arbitrator by using an /etc/default/garb configuration file, such as the one below, which is sourced by the garb init script:
# Copyright (C) 2013 Codership Oy # This config file is to be sourced by garb service script.
# A space-separated list of node addresses (address[:port]) in the cluster GALERA_NODES="192.168.1.1:4567 192.168.1.2:4567"
# Galera cluster name, should be the same as on the rest of the nodes. GALERA_GROUP="example_wsrep_cluster"
# Optional Galera internal options string (e.g. SSL settings) # see http://www.codership.com/wiki/doku.php?id=galera_parameters GALERA_OPTIONS="socket.ssl_cert = /etc/galera/cert/cert.pem; socket.ssl_key = /$
# Log file for garbd. Optional, by default logs to syslog LOG_FILE="/var/log/garbd.log"
This chapter lists a number of frequently asked questions on Galera Cluster for MySQL and other related matters.

What does "commit failed for reason: 3" mean?
Occasionally, a slave thread tries to apply a replicated writeset and finds a lock conflict with a local transaction, which may already be in the commit phase. The local transaction is aborted and the slave thread can proceed. This is a consequence of an optimistic (expecting no row conflicts) transaction execution, and is expected in a multi-master configuration. You may see any of the following messages:
110906 17:45:01 [Note] WSREP: BF kill (1, seqno: 16962377), victim: (140588996478720 4) trx: 35525064 110906 17:45:01 [Note] WSREP: Aborting query: commit 110906 17:45:01 [Note] WSREP: kill trx QUERY_COMMITTING for 35525064 110906 17:45:01 [Note] WSREP: commit failed for reason: 3, seqno: -1
NOTE: The log example is taken from a debug log (wsrep_debug=1).
To avoid such conflicts, you can:
|o||Use the cluster in a master-slave configuration, that is, direct all writes to a single node|
|o||Use the same approaches as for master-slave read/write splitting|
These instructions describe how to migrate from the MyISAM storage engine to the InnoDB storage engine on Galera Cluster. They are applicable both to a standalone MySQL server and to a stock MySQL master-slave cluster that uses the MyISAM storage engine.
Proceed as follows:
1. Create a Galera Cluster. The cluster can consist of one node only, if necessary.
2. Stop all load on the MyISAM master.
3. Initialize the Galera Cluster by performing a mysqldump with --skip-create-options. After this operation, the database will by default create InnoDB tables on the cluster.
4. Resume the load on one of the cluster nodes.
5. Upgrade the mysqld on the former master to Galera Cluster software.
6. Convert the tables to the InnoDB format on the former master node.
7. Copy the grastate.dat file from one of the cluster nodes to the former master.
8. Open the grastate.dat file and change the seqno from -1 to 0.
9. Join the former master to the Galera Cluster.
The downtime for the migration is the time it takes to perform step 3.
Compiling Galera Cluster for MySQL from Source
These instructions describe how to compile Galera Cluster from source.
Build dependencies for the Galera Replication Plugin are:
* System headers
* Bash
* GNU toolchain, gcc/g++ >= 4.4
* Boost libraries >= 1.41
* Check (http://check.sourceforge.net/)
* Scons (http://www.scons.org/)
To build, proceed as follows:
1. Download and extract the Galera Replication Plugin source package from https://launchpad.net/galera/+download:
$ wget https://launchpad.net/galera/2.x/23.2.6/+download/galera-23.2.6-src.tar.gz
$ tar zxf galera-23.2.6-src.tar.gz
$ cd galera-23.2.6-src/
|2.||Run scons to build the Galera Replication Plugin:|

$ scons
|3.||When the build process is completed, the Galera provider library libgalera_smm.so can be found in the build directory root.|
Build the MySQL server as follows:
|1.||Download MySQL source code from http://dev.mysql.com/downloads/mysql/.|
|2.||Extract the source package:|
$ tar zxf mysql-5.5.31.tar.gz
|3.||Download and uncompress the wsrep patch that corresponds to the MySQL version:|
$ wget https://launchpad.net/codership-mysql/5.5/5.5.31-23.7.5/+download/mysql-5.5.31-23.7.5.patch.gz
$ gunzip mysql-5.5.31-23.7.5.patch.gz
|4.||Apply the patch:|
$ cd mysql-5.5.31
$ patch -p0 < ../mysql-5.5.31-23.7.5.patch
|5.||Build the MySQL server:|
$ cmake -DWITH_WSREP=1 -DWITH_INNODB_DISALLOW_WRITES=1
$ make
This chapter contains some advanced configuration tips.
There is no rule about how many slave threads one should configure for replication. At the same time, parallel threads do not guarantee better performance. However, parallel applying will not impair regular operation performance and will most likely speed up the synchronization of new nodes with the cluster.
Start with four slave threads per CPU core, the logic being that, in a balanced system, four slave threads can usually saturate the core. However, depending on IO performance, this figure can be increased several times (for example, you can use 32 slave threads on a single-core ThinkPad R51 with a 4200 RPM drive).
The top limit on the total number of slave threads can be obtained from the wsrep_cert_deps_distance status variable. This value essentially determines how many writesets on average can be applied in parallel. Do not use a value higher than that.
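The sizing rule above can be sketched in Python. This is an illustration only; the function name is our own, and the "four threads per core" start value is the rule of thumb stated in the text:

```python
# Sketch: pick a wsrep_slave_threads starting value from the CPU count
# and cap it by wsrep_cert_deps_distance, as described above.

def suggested_slave_threads(cpu_cores, cert_deps_distance):
    start = 4 * cpu_cores            # rule of thumb: four threads per core
    return min(start, int(cert_deps_distance))

print(suggested_slave_threads(8, 23.5))  # min(32, 23) -> 23
```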
To set four parallel slave threads, use the parameter value below:

wsrep_slave_threads=4
NOTE: Parallel applying requires the following settings:
Transient network connectivity failures are not rare in WAN configurations. Thus, you may want to increase the keepalive timeouts to avoid partitioning. The following group of my.cnf settings tolerates 30 second connectivity outages:
wsrep_provider_options = "evs.keepalive_period = PT3S; evs.inactive_check_period = PT10S; evs.suspect_timeout = PT30S; evs.inactive_timeout = PT1M; evs.install_timeout = PT1M"
Set the evs.suspect_timeout parameter value as high as possible to avoid partitions (as partitions will cause state transfers, which are very heavy). The evs.inactive_timeout parameter value must be no less than the evs.suspect_timeout parameter value and the evs.install_timeout parameter value must be no less than the evs.inactive_timeout parameter value.
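The ordering constraints above can be checked mechanically. The following Python sketch (the function names are our own) parses a wsrep_provider_options string and verifies that suspect_timeout <= inactive_timeout <= install_timeout; the period parser only covers the PT<n>S / PT<n>M forms used in these examples:

```python
# Sketch: validate the ordering constraints between the evs timeouts in
# a wsrep_provider_options string. The ISO 8601 parsing here only covers
# the PT<n>S / PT<n>M forms used in these examples.
import re

def parse_period(value):
    m = re.fullmatch(r"PT(\d+)([SM])", value)
    n, unit = int(m.group(1)), m.group(2)
    return n * 60 if unit == "M" else n    # seconds

def parse_options(options):
    pairs = (item.split("=", 1) for item in options.split(";") if item.strip())
    return {k.strip(): v.strip() for k, v in pairs}

def timeouts_consistent(options):
    opts = parse_options(options)
    suspect = parse_period(opts["evs.suspect_timeout"])
    inactive = parse_period(opts["evs.inactive_timeout"])
    install = parse_period(opts["evs.install_timeout"])
    return suspect <= inactive <= install

opts = ("evs.keepalive_period = PT3S; evs.inactive_check_period = PT10S; "
        "evs.suspect_timeout = PT30S; evs.inactive_timeout = PT1M; "
        "evs.install_timeout = PT1M")
print(timeouts_consistent(opts))  # True: 30s <= 60s <= 60s
```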
WAN links can have exceptionally high latencies. Take
Round-Trip Time (RTT) measurements between your cluster nodes
(ping RTT is a fair estimate) and make sure that all temporal
Galera Cluster settings (periods and timeouts, such as
evs.join_retrans_period) exceed the highest RTT in your cluster.
The more masters (nodes which simultaneously process writes from clients) are in the cluster, the higher the probability of certification conflict. This may cause undesirable rollbacks and performance degradation. In such a case, reduce the number of nodes used as masters.
If only one node at a time is used as a master, certain requirements, such as the slave queue size, may be relaxed. Flow control can be relaxed by using the settings below:
wsrep_provider_options = "gcs.fc_limit = 256; gcs.fc_factor = 0.99; gcs.fc_master_slave = yes"
These settings may improve replication performance by
reducing the rate of flow control events. They can also
be used in a multi-master setup, although they are
suboptimal there.
Customizing GCache Size
These configuration tips are guidelines only. You may end up using a bigger GCache than suggested here, for example, if you must avoid SST as much as possible.
The GCache size, that is, the gcache.size parameter value, should be smaller than the database size. However, in this context, the database size depends on the SST method. For example, mysqldump does not copy InnoDB log files, whereas rsync and xtrabackup do. As a rule, use the data directory size (including any possible links) minus the size of the galera.cache file.
You can also consider the speed of copying as one variable in the calculation. If you use Incremental State Transfer (IST) as your node provisioning method, you can probably copy the database five times faster through IST than through mysqldump. With xtrabackup, the factor is approximately 1.5. If this is the case, you can use a relatively big GCache size.
The database write rate indicates the tail length that will be stored in the GCache. You can calculate the write rate by using the wsrep_received_bytes status variable. Proceed as follows:
1. Read the wsrep_received_bytes1 value at time t1.
2. Read the wsrep_received_bytes2 value at time t2.
3. Calculate the write rate as follows:
(wsrep_received_bytes2 - wsrep_received_bytes1)/(t2 - t1)
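The calculation above, and the resulting GCache "tail" estimate, can be sketched in Python. The helper names and the example numbers are illustrative only:

```python
# Sketch: estimate the replication write rate and how long a tail a
# given GCache can hold, following the two-reading procedure above.

def write_rate(bytes1, t1, bytes2, t2):
    """Average received bytes per second between the two readings."""
    return (bytes2 - bytes1) / (t2 - t1)

def gcache_tail_seconds(gcache_size_bytes, rate):
    """Roughly how far back (in seconds) the GCache can serve IST."""
    return gcache_size_bytes / rate

rate = write_rate(1_000_000, 0, 7_000_000, 60)   # 100000 bytes/s
print(gcache_tail_seconds(128 * 1024**2, rate))  # ~1342 seconds
```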
If you want to use Galera Cluster with SELinux, start by running SELinux in permissive mode. In this mode, SELinux does not prohibit any Galera Cluster actions, but logs a warning for all actions that would have been prohibited. Collect these warnings and iteratively create a policy for Galera Cluster that allows the use of all the ports and files that you need. When there are no more warnings, switch back to enforcing mode.
Virtually every Linux distribution ships with a MySQL SELinux policy. You can use this policy as a starting point and extend it with the above procedure.
This chapter contains troubleshooting instructions for Galera Cluster for MySQL.
The table below lists some symptoms and solutions for troubleshooting purposes.
Symptom: If you use rsync for state transfer and a node crashes before the state transfer is over, the rsync process may hang forever, occupying the port and preventing a restart of the node. The problem shows up as port in use in the server error log.
Solution: Find the orphan rsync process and kill it manually.

Symptom: If you use mysqldump for state transfer and it fails, an SQL SYNTAX error is written in the server error log. This error is only an indication of the failure.
Solution: Read the pseudo-statement within the SQL SYNTAX error; it contains the actual error message.

Symptom: After a temporary split, if the Primary Component was still reachable and its state was modified, resynchronization occurs. In resynchronization, nodes on the other part of the cluster drop all client connections. The connections get the Unknown command error.
Solution: This situation is cleared automatically after the node resynchronizes with the Primary Component.

Symptom: Every query returns "Unknown command". This phenomenon takes place if you have explicitly specified the wsrep_provider variable, but the wsrep provider rejects service, for example, because the node is not connected to the cluster Primary Component (the wsrep_cluster_address parameter may be unset, or there can be networking issues). In this case, the node is considered to be unsynced with the global state and unable to serve SQL requests except SET and/or SHOW.
Solution: You can bypass the wsrep_provider check by switching the wsrep service off with the command:

mysql> SET wsrep_on=0;

This command instructs mysqld to ignore the wsrep_provider setting and to behave as a standalone MySQL server. This may lead to data inconsistency with the rest of the cluster, which, on the other hand, may be a desirable result for, for example, modifying "local" tables.
If you know that no other nodes of your cluster form a Primary Component, rebootstrap the Primary Component as follows:

1. Choose the most up-to-date node by checking the output of SHOW STATUS LIKE 'wsrep_last_committed'. Choose the node with the highest value.
2. Run SET GLOBAL wsrep_provider_options='pc.bootstrap=yes' on it.

The component this node is part of will become a Primary Component, and all nodes in it will synchronize to the most up-to-date one and start accepting SQL requests again.

Symptom: User (name, host, password) changes are not replicated to the cluster. You have tried to update the mysql.user table directly.
Solution: Use the GRANT command. Currently, replication only works with the InnoDB storage engine. Any writes to tables of other types, including system (mysql.*) tables, are not replicated. However, DDL statements are replicated at the statement level, and changes to mysql.* tables get replicated that way. You can safely issue commands such as CREATE USER... or GRANT..., but commands such as INSERT INTO mysql.user... will not be replicated. As a rule, non-transactional engines cannot be supported in multi-master replication.

Symptom: The cluster stalls when running the ALTER command on an unused table.
Solution: This is a side effect of a multi-master, several-appliers scenario. The system needs to control when the DDL ends in relation to other transactions, in order to deterministically detect conflicts and schedule parallel appliers. Effectively, the DDL commands must be executed in isolation. Galera Cluster for MySQL has a 65K window of tolerance where transactions can be applied in parallel, but if an ALTER command takes too long, the cluster has to wait. You cannot help this situation. However, if you can guarantee that no other session will try to modify the table AND that there are no other DDLs running, you can:

1. Set wsrep_OSU_method=RSU
2. Run the ALTER command
3. Set wsrep_OSU_method=TOI

Do this on each node in turn.
By design, the performance of the cluster cannot be higher than the performance of the slowest node on the cluster. Even if you have one node only, its performance can be considerably lower when compared with running the same server in a standalone mode (without a wsrep provider). This is particularly true for big transactions even if they were within the transaction size limits. This is why it is important to be able to detect a slow node on the cluster.
The slowest cluster node will have the highest values for the following variables:
The lower the values are, the better.
Dealing with Multi-Master Conflicts
This chapter describes how Galera Cluster deals with conflicts in multi-master database environments. In practice, the conflict scenario that must be addressed is row conflicts on different nodes. In a multi-master replication system, where updates can be submitted to any database node, different nodes may try to update the same database row with different data.
Galera Cluster can cope with a situation such as this by using Certification Based Replication.
SEE ALSO: Chapter Certification Based Replication.
You can log cluster-wide conflicts by setting wsrep_debug=1, which will log these conflicts along with plenty of other information. You may see any of the following messages:
110906 17:45:01 [Note] WSREP: BF kill (1, seqno: 16962377), victim: (140588996478720 4) trx: 35525064 110906 17:45:01 [Note] WSREP: Aborting query: commit 110906 17:45:01 [Note] WSREP: kill trx QUERY_COMMITTING for 35525064 110906 17:45:01 [Note] WSREP: commit failed for reason: 3, seqno: -1
You can also monitor the parameters below:
When a conflicting transaction is rolled back, the client application sees a deadlock error. The client application should retry the deadlocked transaction, but not all client applications have this logic built in.
Nevertheless, Galera Cluster can retry deadlocked autocommit transactions on behalf of the client application. The wsrep_retry_autocommit parameter defines how many times the transaction is retried before a deadlock error is returned.
NOTE: Retrying only applies to autocommit transactions, as retrying is not safe for multi-statement transactions.
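For client applications that lack this logic, a generic retry wrapper can be added around transactions. The sketch below is illustrative only: the DeadlockError class stands in for whatever deadlock exception (MySQL error 1213, ER_LOCK_DEADLOCK) your database driver raises, and the callback name is arbitrary.

```python
import random
import time

class DeadlockError(Exception):
    """Stand-in for the deadlock error a MySQL driver would raise (errno 1213)."""
    errno = 1213

def with_deadlock_retry(run_transaction, max_retries=3):
    """Run a transaction callback, re-submitting it when it deadlocks."""
    for attempt in range(max_retries + 1):
        try:
            return run_transaction()
        except DeadlockError:
            if attempt == max_retries:
                raise  # give up after max_retries attempts
            # brief randomized backoff before re-submitting the transaction
            time.sleep(random.uniform(0.01, 0.05) * (attempt + 1))
```

With a real driver, catch the driver's own deadlock exception (for example, by checking its errno against 1213) instead of the placeholder class.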
Galera Cluster automatically resolves multi-master conflicts. However, you can try to minimize the amount of multi-master conflicts as follows:
o Analyze the hot spot. See if you can change the application logic to catch deadlock exceptions and use retrying logic.
o Use wsrep_retry_autocommit and see if it helps.
o Limit the number of master nodes or switch to a master-slave model.
NOTE: If you can filter out the access to the hot spot table, it is enough to treat writes only to the hot spot table as master-slave.
This chapter contains reference information on Galera Cluster for MySQL.
As of version 0.8, Galera Cluster accepts parameters as semicolon-separated key-value pair lists, such as key1 = value1; key2 = value2. In this way, you can configure an arbitrary number of Galera Cluster parameters in one call. A key consists of a parameter group and a parameter name:

<group>.<name>

Where <group> roughly corresponds to some Galera module.
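As an illustration (the parameter names appear in the reference table below; the values themselves are arbitrary), a key-value pair list might look like this:

```ini
# two Galera parameters set in one call, as <group>.<name> = <value> pairs
wsrep_provider_options="gcache.size=256M; gcs.fc_limit=32"
```

Here gcache and gcs are the parameter groups, and size and fc_limit are the parameter names.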
|o||Numeric values—Galera Cluster understands the following numeric modifiers: K, M, G, T, standing for 2^10, 2^20, 2^30 and 2^40, respectively.|
|o||Boolean values—Galera Cluster accepts the following boolean values: 0, 1, yes, no, true, false, on, off.|
|o||Time periods must be expressed in the ISO8601 format. See also the examples below.|
| Parameter | Default | Introduced | Deprecated | Dynamic |
|---|---|---|---|---|
| protonet.backend | asio | 1.0 | n/a | No |
| protonet.version | 0 | 1.0 | n/a | No |
| socket.ssl_cert | | 1.0 | n/a | No |
| socket.ssl_key | | 1.0 | n/a | No |
| socket.ssl_compression | yes | 1.0 | n/a | No |
| socket.ssl_cipher | AES128-SHA | 1.0 | n/a | No |
| gmcast.listen_addr | tcp://0.0.0.0:4567 | 1.0 | n/a | No |
| gmcast.mcast_addr | | 1.0 | n/a | No |
| gmcast.mcast_ttl | 1 | 1.0 | n/a | No |
| gmcast.peer_timeout | PT3S | 1.0 | n/a | No |
| gmcast.time_wait | PT5S | 1.0 | n/a | No |
| gmcast.version | 0 | 1.0 | n/a | No |
| evs.causal_keepalive_period | | 1.0 | n/a | No |
| evs.consensus_timeout | PT30S | 1.0 | Yes, as of 2.0 | No |
| evs.debug_log_mask | 0x1 | 1.0 | n/a | Yes |
| evs.inactive_check_period | PT1S | 1.0 | n/a | No |
| evs.inactive_timeout | PT15S | 1.0 | n/a | No |
| evs.info_log_mask | 0 | 1.0 | n/a | No |
| evs.install_timeout | PT15S | 1.0 | n/a | Yes |
| evs.join_retrans_period | PT1S | 1.0 | n/a | Yes |
| evs.keepalive_period | PT1S | 1.0 | n/a | No |
| evs.max_install_timeouts | 1 | 1.0 | n/a | No |
| evs.send_window | 4 | 1.0 | n/a | Yes |
| evs.stats_report_period | PT1M | 1.0 | n/a | No |
| evs.suspect_timeout | PT5S | 1.0 | n/a | No |
| evs.use_aggregate | true | 1.0 | n/a | No |
| evs.user_send_window | 2 | 1.0 | n/a | Yes |
| evs.view_forget_timeout | PT5M | 1.0 | n/a | No |
| evs.version | 0 | 1.0 | n/a | No |
| pc.bootstrap | | 1.0 | n/a | No |
| pc.checksum | true | 1.0 | n/a | No |
| pc.ignore_sb | false | 1.0 | n/a | Yes |
| pc.ignore_quorum | false | 1.0 | n/a | Yes |
| pc.linger | PT2S | 1.0 | n/a | No |
| pc.npvo | false | 1.0 | n/a | No |
| pc.wait_prim | false | 1.0 | n/a | No |
| pc.weight | 1 | 2.4 | n/a | Yes |
| pc.version | 0 | 1.0 | n/a | No |
| gcs.fc_debug | 0 | 1.0 | n/a | No |
| gcs.fc_factor | 0.5 | 1.0 | n/a | Yes |
| gcs.fc_limit | 16 | 1.0 | n/a | Yes |
| gcs.fc_master_slave | NO | 1.0 | n/a | No |
| gcs.max_packet_size | 32616 | 1.0 | n/a | No |
| gcs.max_throttle | 0.25 | 1.0 | n/a | No |
| gcs.recv_q_hard_limit | LLONG_MAX | 1.0 | n/a | No |
| gcs.recv_q_soft_limit | 0.25 | 1.0 | n/a | No |
| gcs.sync_donor | NO | 1.0 | n/a | No |
| ist.recv_addr | | 1.0 | n/a | No |
| replicator.commit_order | 3 | 1.0 | n/a | No |
| replicator.causal_read_timeout | PT30S | 1.0 | n/a | No |
| gcache.dir | | 1.0 | n/a | No |
| gcache.name | "galera.cache" | 1.0 | n/a | No |
| gcache.size | 128Mb | 1.0 | n/a | No |
| gcache.page_size | 128Mb | 1.0 | n/a | No |
| gcache.keep_pages_size | 0 | 1.0 | n/a | No |
| gcache.mem_size | 0 | 1.0 | n/a | No |
protonet.backend
Which transport backend to use. Currently only ASIO is supported.

protonet.version
This status variable shows which transport backend protocol version is used. It is mostly used for troubleshooting purposes.

socket.ssl_cert
A path (absolute, or relative to the working directory) to an SSL certificate (in PEM format).

socket.ssl_key
A path (absolute, or relative to the working directory) to a private key for the certificate (in PEM format).

socket.ssl_compression
Whether to enable compression on SSL connections.

socket.ssl_cipher
The symmetric cipher to use. AES128 is used by default; it is considerably faster than, and no less secure than, AES256.

gmcast.listen_addr
Address at which Galera Cluster listens for connections from other nodes. By default, the port to listen on is taken from the connection address. This setting can be used to override that.

gmcast.mcast_addr
If set, UDP multicast will be used for replication, for example:
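A hedged illustration (the multicast address shown is only an example from the administratively scoped range; use one appropriate for your network):

```ini
wsrep_provider_options="gmcast.mcast_addr=239.192.0.11"
```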
The value must be the same on all nodes.
If you are planning to build a large cluster, we recommend using UDP.

gmcast.mcast_ttl
Time-to-live value for multicast packets.

gmcast.peer_timeout
Connection timeout to initiate message relaying.

gmcast.time_wait
Time to wait before allowing a peer that was declared outside of the stable view to reconnect.

gmcast.version
This status variable shows which gmcast protocol version is used. It is mostly used for troubleshooting purposes.

evs.causal_keepalive_period
For developer use only. Defaults to evs.keepalive_period.

evs.consensus_timeout
Timeout on reaching consensus about cluster membership.
Deprecated: see evs.install_timeout.

evs.debug_log_mask
Controls EVS debug logging; only effective when wsrep_debug is in use.

evs.inactive_check_period
How often to check for peer inactivity.

evs.inactive_timeout
Hard limit on the inactivity period, after which the node is pronounced dead.

evs.info_log_mask
Controls extra EVS info logging. Bits:
|o||0x1—extra view change info|
|o||0x2—extra state change info|
|o||0x8—profiling (only in builds with profiling enabled)|
evs.install_timeout
Timeout on waiting for install message acknowledgments. Successor to evs.consensus_timeout.

evs.join_retrans_period
How often to retransmit EVS join messages when forming the cluster membership.

evs.keepalive_period
How often to emit keepalive beacons (in the absence of any other traffic).

evs.max_install_timeouts
How many membership install rounds to try before giving up (the total number of rounds will be evs.max_install_timeouts + 2).

evs.send_window
Maximum number of packets in replication at a time. For WAN setups this may be set considerably higher, e.g. 512. Must be no less than evs.user_send_window. If you must use a value other than the default, we recommend using double the evs.user_send_window value.

evs.stats_report_period
Control period of EVS statistics reporting.

evs.suspect_timeout
Inactivity period after which a node is suspected to be dead. If all remaining nodes agree on that, the node is dropped from the cluster before evs.inactive_timeout is reached.

evs.use_aggregate
Aggregate small packets into one, when possible.

evs.user_send_window
Maximum number of data packets in replication at a time. For WAN setups, this value can be set considerably higher, to, for example, 512.

evs.view_forget_timeout
Drop past views from the view history after this timeout.

evs.version
This status variable shows which evs protocol version is used. It is mostly used for troubleshooting purposes.

pc.bootstrap
Setting this value to true signals the node to turn a NON-PRIMARY component into PRIMARY.

pc.checksum
Checksum replicated messages.

pc.ignore_sb
Should we allow nodes to process updates even in the case of split-brain? This is a dangerous setting in a multi-master setup, but should simplify things in a master-slave cluster (especially if only 2 nodes are used).

pc.ignore_quorum
Completely ignore quorum calculations. For example, if the master splits from several slaves, it still remains operational. Use with extreme caution even in master-slave setups, as slaves will not automatically reconnect to the master in this case.

pc.linger
The period for which the PC protocol waits for EVS termination.

pc.npvo
If set to true, the more recent primary component overrides older ones in the case of conflicting primaries.

pc.wait_prim
If set to true, the node waits for the primary component forever. Useful to bring up a non-primary component and make it primary with pc.bootstrap.

pc.weight
As of version 2.4. Node weight for quorum calculation.

pc.version
This status variable shows which pc protocol version is used. It is mostly used for troubleshooting purposes.

gcs.fc_debug
Post debug statistics about SST flow every this many writesets.

gcs.fc_factor
Resume replication after the recv queue drops below this fraction of gcs.fc_limit.

gcs.fc_limit
Pause replication if the recv queue exceeds this number of writesets. For master-slave setups this number can be increased considerably.

gcs.fc_master_slave
Should we assume that there is only one master in the group?

gcs.max_packet_size
All writesets exceeding this size will be fragmented.

gcs.max_throttle
How much to throttle the replication rate during state transfer (to avoid running out of memory). Set the value to 0.0 if stopping replication is acceptable for completing state transfer.

gcs.recv_q_hard_limit
Maximum allowed size of the recv queue. This should normally be half of (RAM + swap). If this limit is exceeded, Galera Cluster will abort the server.

gcs.recv_q_soft_limit
The fraction of gcs.recv_q_hard_limit after which the replication rate will be throttled.
The degree of throttling is a linear function of the recv queue size: it goes from 1.0 (full rate) at gcs.recv_q_soft_limit down to gcs.max_throttle at gcs.recv_q_hard_limit. Note that the full rate, as estimated between 0 and gcs.recv_q_soft_limit, is a very imprecise estimate of the regular replication rate.

gcs.sync_donor
Should the rest of the cluster keep in sync with the donor? Yes means that if the donor is blocked by state transfer, the whole cluster is blocked with it.
If you choose the value YES, it is theoretically possible that the donor node cannot keep up with the rest of the cluster due to the extra load from the SST. If the node lags behind, it may send flow control messages stalling the whole cluster. You can monitor this using the wsrep_flow_control_paused status variable.

ist.recv_addr
As of 2.0. Address to listen on for Incremental State Transfer. By default this is the <address>:<port+1> from wsrep_node_address.

replicator.commit_order
Whether to allow Out-Of-Order committing (improves parallel applying performance). Possible settings:
|o||0—BYPASS: all commit order monitoring is switched off (useful for measuring performance penalty)|
|o||1—OOOC: allow out of order committing for all transactions|
|o||2—LOCAL_OOOC: allow out of order committing only for local transactions|
|o||3—NO_OOOC: no out of order committing is allowed (strict total order committing)|
replicator.causal_read_timeout
Sometimes causal reads need to time out.

gcache.dir
Directory where GCache should place its files. Defaults to the working directory.

gcache.name
Name of the ring buffer storage file.

gcache.size
Size of the persistent on-disk ring buffer storage. This will be preallocated on startup.
The buffer file name is galera.cache.
SEE ALSO: Chapter Customizing GCache Size.

gcache.page_size
Size of the page files in page storage. The limit on overall page storage is the size of the disk. Pages are prefixed by gcache.page.

gcache.keep_pages_size
Total size of the page storage pages to keep for caching purposes. If only page storage is enabled, one page is always present.

gcache.mem_size
Maximum size of the malloc() store (that is, RAM). For setups with spare RAM.
Setting Galera Parameters in MySQL
You can set Galera Cluster parameters in the my.cnf configuration file as follows:
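A sketch of such a my.cnf entry (the specific values shown, which relax flow control for a master-slave setup, are illustrative, not recommendations):

```ini
[mysqld]
wsrep_provider_options="gcs.fc_limit=256; gcs.fc_master_slave=YES; gcs.fc_factor=0.99"
```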
This is useful in master-slave setups.
You can set Galera Cluster parameters through a MySQL client as follows:
mysql> SET GLOBAL wsrep_provider_options="evs.send_window=16";
This command would only change the evs.send_window value.
To check which parameters are used in Galera Cluster, enter the command below:
mysql> SHOW VARIABLES LIKE 'wsrep_provider_options';
These are MySQL system variables introduced by wsrep patch v0.8. All variables are global except where marked by (L).
| Option | Default | Introduced | Deprecated |
|---|---|---|---|
| wsrep_auto_increment_control | ON | 1 | n/a |
| wsrep_causal_reads (L) | OFF | 1 | n/a |
| wsrep_certify_nonPK | ON | 1 | n/a |
| wsrep_cluster_address | | 1 | n/a |
| wsrep_cluster_name | my_test_cluster | 1 | n/a |
| wsrep_convert_LOCK_to_trx | OFF | 1 | n/a |
| wsrep_data_home_dir | <mysql_real_data_home> | 1 | n/a |
| wsrep_dbug_option | | 1 | n/a |
| wsrep_debug | OFF | 1 | n/a |
| wsrep_drupal_282555_workaround | ON | 1 | n/a |
| wsrep_forced_binlog_format | NONE | 1 | n/a |
| wsrep_max_ws_rows | 128K | 1 | n/a |
| wsrep_max_ws_size | 1G | 1 | n/a |
| wsrep_node_address | <address>[:port] | 1 | n/a |
| wsrep_node_incoming_address | <address>[:mysqld_port] | 1 | n/a |
| wsrep_node_name | <hostname> | 1 | n/a |
| wsrep_notify_cmd | | 1 | n/a |
| wsrep_on (L) | ON | 1 | n/a |
| wsrep_OSU_method | TOI | Patch version 3 (5.5.17-22.3) | n/a |
| wsrep_provider | none | 1 | n/a |
| wsrep_provider_options | | 1 | n/a |
| wsrep_retry_autocommit | 1 | 1 | n/a |
| wsrep_slave_threads | 1 | 1 | n/a |
| wsrep_sst_auth | | 1 | n/a |
| wsrep_sst_donor | | 1 | n/a |
| wsrep_sst_donor_rejects_queries | OFF | 1 | n/a |
| wsrep_sst_method | mysqldump | 1 | n/a |
| wsrep_sst_receive_address | <wsrep_node_address> | 1 | n/a |
| wsrep_start_position | 00000000-0000-0000-0000-000000000000:-1 | 1 | n/a |
wsrep_auto_increment_control
Automatically adjust the auto_increment_increment and auto_increment_offset system variables when the cluster membership changes.
This parameter significantly reduces the certification conflict rate for INSERT clauses.

wsrep_causal_reads
Enforce strict cluster-wide READ COMMITTED semantics on non-transactional reads. Results in larger read latencies.

wsrep_certify_nonPK
Generate primary keys for rows without them, for the purpose of certification. This is required for parallel applying. Do not use tables without primary keys.

wsrep_cluster_address
Galera Cluster takes addresses in the URL format:
<backend schema>://<cluster address>[?option1=value1[&option2=value2]]
Changing this variable at runtime will cause the node to close the connection to the current cluster (if any) and reconnect to the new address. (However, doing this at runtime may not be possible for all SST methods.) As of Galera Cluster 23.2.2, it is possible to provide a comma-separated list of other nodes in the cluster as follows:
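For instance (the node addresses are placeholders):

```ini
wsrep_cluster_address="gcomm://192.168.0.1:4567,192.168.0.2:4567,192.168.0.3:4567"
```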
Using the string gcomm:// without any address will cause the node to start up alone, thus initializing a new cluster (that the other nodes can join).

wsrep_cluster_name
The logical cluster name. If a node tries to connect to a cluster with a different name, the connection fails. The cluster name must be the same on all cluster nodes.

wsrep_convert_LOCK_to_trx
Convert LOCK/UNLOCK TABLES statements to BEGIN/COMMIT statements. In other words, this parameter implicitly converts locking sessions into transactions within mysqld. By itself, it does not mean support for locking sessions, but it prevents the database from ending up in a logically inconsistent state.
Sometimes this parameter may help to get old applications working in a multi-master setup.
NOTE: Loading a large database dump with LOCK statements can result in abnormally large transactions and cause an out-of-memory condition.

wsrep_data_home_dir
A directory where the wsrep provider will store its files. Galera Cluster uses this parameter to store its internal state.

wsrep_dbug_option
A debug option to be passed to the provider.

wsrep_debug
Enable debug log output.

wsrep_drupal_282555_workaround
Enable a workaround for Drupal (actually MySQL/InnoDB) bug #282555 (inserting a DEFAULT value into an AUTO_INCREMENT column may return a duplicate key error).

wsrep_forced_binlog_format
Force every transaction to use the given binlog format. When this variable is set to something other than NONE, all transactions will use the given forced format, regardless of the binlog_format value specified in the client session.
Valid choices for wsrep_forced_binlog_format are: ROW, STATEMENT, MIXED and the special value NONE, meaning that there is no forced binlog format in effect.
This variable was introduced to support STATEMENT format replication during rolling schema upgrade processing. However, in most cases, ROW replication is valid for asymmetric schema replication.

wsrep_max_ws_rows
The maximum number of rows allowed in a writeset. Currently, this parameter limits the supported size of transactions and of LOAD DATA statements.

wsrep_max_ws_size
The maximum allowed writeset size. Currently, this parameter limits the supported size of transactions and of LOAD DATA statements.
The maximum allowed writeset size is 2G.

wsrep_node_address
An option to explicitly specify the network address of the node, if autoguessing for some reason does not produce desirable results (multiple network interfaces, NAT, etc.).
By default, the address of the first network interface (eth0) and the default port 4567 are used. The <address> and :port will be passed to the Galera Replication Plugin to be used as a base address in its communications. It will also be used to derive the default values for the wsrep_sst_receive_address and ist.recv_addr parameters.

wsrep_node_incoming_address
The address at which the server expects client connections. Intended for integration with load balancers. Not used for now.

wsrep_node_name
The logical node name, for convenience.

wsrep_notify_cmd
This command is run whenever the cluster membership or state of this node changes. This option can be used to (re)configure load balancers, raise alarms, and so on. The command passes on one or more of the following options:
|--status||The status of this node. The possible statuses are:|
|o||Undefined—The node has just started up and is not connected to any Primary Component|
|o||Joiner—The node is connected to a primary component and is now receiving a state snapshot.|
|o||Donor—The node is connected to a primary component and is now sending a state snapshot.|
|o||Joined—The node has a complete state and is now catching up with the cluster.|
|o||Synced—The node has synchronized itself with the cluster.|
|o||Error(<error code if available>)—The node is in an error state.|
|--uuid||The cluster state UUID.|
|--primary||Whether the current cluster component is primary or not.|
|--members||A comma-separated list of the component member UUIDs. The members are presented in the following syntax:|
|o||<node UUID>—A unique node ID. The wsrep provider automatically assigns this ID for each node.|
|o||<node name>—The node name as it is set in the wsrep_node_name option.|
|o||<incoming address>—The address for client connections as it is set in the wsrep_node_incoming_address option.|
|--index||The index of this node in the node list.|
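A minimal notification handler might parse these options as follows. This is a sketch: only the option names (--status, --uuid, --primary, --members, --index) come from the description above; everything else is illustrative.

```python
#!/usr/bin/env python
import argparse
import sys

def parse_notification(argv):
    """Parse the options that wsrep_notify_cmd passes to the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--status")            # e.g. Synced, Donor, Joined
    parser.add_argument("--uuid")              # cluster state UUID
    parser.add_argument("--primary")           # "yes" / "no"
    parser.add_argument("--members")           # comma-separated <uuid>/<name>/<address>
    parser.add_argument("--index", type=int)   # this node's index in the node list
    args, _unknown = parser.parse_known_args(argv)
    return args

if __name__ == "__main__":
    args = parse_notification(sys.argv[1:])
    # react to the change here: update a load balancer, raise an alarm, etc.
    print("node state: %s (primary: %s, index: %s)"
          % (args.status, args.primary, args.index))
```

Since not every option is passed on every invocation, the parser treats all of them as optional.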
Click this link to view an example script that updates two tables on the local node with changes taking place in the cluster.

wsrep_on
Use wsrep replication. When switched off, no changes made in this session will be replicated.

wsrep_OSU_method
Online schema upgrade method (MySQL >= 5.5.17). See also Schema Upgrades.
Online Schema Upgrade (OSU) can be performed with two alternative methods:
|o||Total Order Isolation (TOI) runs the DDL statement in all cluster nodes in the same total order sequence, locking the affected table for the duration of the operation. This may result in the whole cluster being blocked for the duration of the operation.|
|o||Rolling Schema Upgrade (RSU) executes the DDL statement only locally, thus blocking one cluster node only. During the DDL processing, the node is not replicating and may be unable to process replication events (due to a table lock). Once the DDL operation is complete, the node will catch up and sync with the cluster to become fully operational again. The DDL statement or its effects are not replicated; the user is responsible for manually performing this operation on each of the nodes.|
wsrep_provider
A path to the wsrep provider to load. If not specified, all calls to the wsrep provider will be bypassed and the server behaves like a regular mysqld server.

wsrep_provider_options
A string of provider options passed directly to the provider.
Usually, you just fine-tune:
|o||gcache.size, that is, the size of the GCache ring buffer, which is used for Incremental State Transfer, among other things. See chapter Galera Parameters.|
|o||Group communication timeouts.|
See also a list of all Galera Cluster parameters in chapter Galera Parameters.
wsrep_retry_autocommit
If an autocommit query fails the certification test due to a cluster-wide conflict, we can retry it without returning an error to the client. This option sets how many times to retry.
This option is analogous to rescheduling an autocommit query should it go into a deadlock with other transactions in the database lock manager.

wsrep_slave_threads
How many threads to use for applying slave writesets. There are two things to consider when choosing the number:
|1.||The number should be at least two times the number of CPU cores.|
|2.||Consider how many writing client connections the other nodes would have. Divide this by four and use that as the wsrep_slave_threads value.|
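As a worked illustration of the first rule of thumb (the value itself is illustrative, not a recommendation), a node with 8 CPU cores would get at least 2 × 8 = 16 applier threads:

```ini
[mysqld]
wsrep_slave_threads=16
```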
wsrep_sst_auth
A string with authentication information for state snapshot transfer. The string depends on the state transfer method. For the mysqldump state transfer, it is <username>:<password>, where username has root privileges on this server. The rsync method ignores this option.
Use the same value on all nodes. This parameter is used to authenticate with both the state snapshot receiver and the state snapshot donor.

wsrep_sst_donor
A name (given in the wsrep_node_name parameter) of the server that should be used as a source for state transfer. If not specified, Galera Cluster will choose the most appropriate one.
In this case, the group communication module monitors the node state for the purposes of flow control, state transfer and quorum calculations. The node can be a donor if it is in the SYNCED state. The first node in the SYNCED state in the index becomes the donor and is not available for requests.
If there are no free SYNCED nodes at the moment, the joining node reports:
Requesting state transfer failed: -11(Resource temporarily unavailable). Will keep retrying every 1 second(s)
and keeps on retrying the state transfer request until it succeeds. When the state transfer request succeeds, the entry below is written to log:
Node 0 (XXX) requested state transfer from *any*. Selected 1 (XXX) as donor.

wsrep_sst_donor_rejects_queries
This parameter prevents blocking client sessions on a donor if the donor is performing a blocking SST, such as mysqldump or rsync.
In these situations, all queries return error ER_UNKNOWN_COM_ERROR, "Unknown command" like a joining node does. In this case, the client (or the JDBC driver) can reconnect to another node.
NOTE: As SST is scriptable, there is no way to tell whether the requested SST method is blocking or not. You may also want to avoid querying the donor even with a non-blocking SST. Consequently, this variable rejects queries on the donor regardless of the SST method (that is, also for xtrabackup), even if the initial request concerned a blocking-only SST.
NOTE: The mysqldump SST does not work with this setting, as mysqldump must run queries on the donor and there is no way to distinguish a mysqldump session from a regular client session.

wsrep_sst_method
The method to use for state snapshot transfers. The wsrep_sst_<wsrep_sst_method> command will be called with the following arguments. For more information, see also Scriptable State Snapshot Transfer.
The supported methods are:
|o||mysqldump—This is a slow (except for small datasets), but the most tested option.|
|o||rsync—This option is much faster than mysqldump on large datasets.|
|o||rsync_wan—This option is almost the same as rsync, but uses the delta-xfer algorithm to minimize network traffic.|

NOTE: You can only use rsync when a node is starting. In other words, you cannot use rsync under a running InnoDB storage engine.

|o||xtrabackup—This option is a fast and practically non-blocking SST method based on Percona's xtrabackup tool.|
If you want to use xtrabackup, the following settings must be present in the my.cnf configuration file on all nodes:
[mysqld]
wsrep_sst_auth=root:<root password>
datadir=<path to data dir>

[client]
socket=<path to socket>
wsrep_sst_receive_address
The address at which this node expects to receive state transfers. Depends on the state transfer method. For example, for the mysqldump state transfer, it is the address and the port on which this server listens. By default this is set to the <address> part of wsrep_node_address.
NOTE: Check that your firewall allows connections to this address from other cluster nodes.

wsrep_start_position
This variable exists for the sole purpose of notifying a joining node about state transfer completion. For more information, see Scriptable State Snapshot Transfer.

wsrep_ws_persistency
Whether to store writesets locally for debugging. Not used in 0.8.
Galera Status Variables
These variables are Galera Cluster 0.8.x status variables. There are two types of wsrep-related status variables:
o Galera Cluster-specific variables exported by Galera Cluster
o Variables exported by MySQL. These variables are for the general wsrep provider.
This distinction is of importance for developers only. For convenience, all status variables are presented as a single list below.
The location (L) of the variable is presented in the second column from the left. The values are:
o G—Galera Cluster
o M—MySQL
| Status Variable | L | Example Value |
|---|---|---|
| wsrep_local_state_uuid | G | e2c9a15e-5485-11e0-0800-6bbb637e7211 |
| wsrep_last_committed | G | 409745 |
| wsrep_replicated | G | 16109 |
| wsrep_replicated_bytes | G | 6526788 |
| wsrep_received | G | 17831 |
| wsrep_received_bytes | G | 6637093 |
| wsrep_local_commits | G | 14981 |
| wsrep_local_cert_failures | G | 333 |
| wsrep_local_bf_aborts | G | 960 |
| wsrep_local_replays | G | 0 |
| wsrep_local_send_queue | G | 1 |
| wsrep_local_send_queue_avg | G | 0.145000 |
| wsrep_local_recv_queue | G | 0 |
| wsrep_local_recv_queue_avg | G | 3.348452 |
| wsrep_flow_control_paused | G | 0.184353 |
| wsrep_flow_control_sent | G | 7 |
| wsrep_flow_control_recv | G | 11 |
| wsrep_cert_deps_distance | G | 23.88889 |
| wsrep_apply_oooe | G | 0.671120 |
| wsrep_apply_oool | G | 0.195248 |
| wsrep_apply_window | G | 5.163966 |
| wsrep_commit_oooe | G | 0.000000 |
| wsrep_commit_oool | G | 0.000000 |
| wsrep_commit_window | G | 0.000000 |
| wsrep_local_state | G | 4 |
| wsrep_local_state_comment | G | Synced |
| wsrep_incoming_addresses | G | 10.0.0.1:3306,10.0.0.2:3306,undefined |
| wsrep_cluster_conf_id | M | 34 |
| wsrep_cluster_size | M | 3 |
| wsrep_cluster_state_uuid | M | e2c9a15e-5485-11e0-0800-6bbb637e7211 |
| wsrep_cluster_status | M | Primary |
| wsrep_local_index | M | 1 |
| wsrep_ready | M | ON |
| wsrep_cert_index_size | G | 30936 |
| wsrep_protocol_version | G | 4 |
| wsrep_connected | G | ON |
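All of these variables can be inspected from a MySQL client, in the same way as the provider options above; for example:

```sql
mysql> SHOW STATUS LIKE 'wsrep_%';
```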
wsrep_local_state_uuid
The UUID of the state stored on this node. See wsrep API.

wsrep_last_committed
Sequence number of the last committed transaction. See wsrep API.

wsrep_replicated
Total number of writesets replicated (sent to other nodes).

wsrep_replicated_bytes
Total size of writesets replicated.

wsrep_received
Total number of writesets received from other nodes.

wsrep_received_bytes
Total size of writesets received from other nodes.

wsrep_local_commits
Total number of local transactions committed.

wsrep_local_cert_failures
Total number of local transactions that failed the certification test.

wsrep_local_bf_aborts
Total number of local transactions that were aborted by slave transactions while in execution.

wsrep_local_replays
Total number of transaction replays due to asymmetric lock granularity.

wsrep_local_send_queue
Current (instantaneous) length of the send queue.

wsrep_local_send_queue_avg
Send queue length averaged over the interval since the last status query. Values considerably larger than 0.0 indicate replication throttling or a network throughput issue.

wsrep_local_recv_queue
Current (instantaneous) length of the recv queue.

wsrep_local_recv_queue_avg
Recv queue length averaged over the interval since the last status query. Values considerably larger than 0.0 mean that the node cannot apply writesets as fast as they are received and will generate a lot of replication throttling.

wsrep_flow_control_paused
The fraction of time since the last status query that replication was paused due to flow control.
In other words, how much the slave lag is slowing down the cluster.

wsrep_flow_control_sent
Number of FC_PAUSE events sent since the last status query.

wsrep_flow_control_recv
Number of FC_PAUSE events received since the last status query (also counts the events sent).

wsrep_cert_deps_distance
Average distance between the highest and lowest seqno that can possibly be applied in parallel (potential degree of parallelization).

wsrep_apply_oooe
How often the applier started applying a writeset out of order (parallelization efficiency).

wsrep_apply_oool
How often a writeset was so slow to apply that writesets with higher seqnos were applied earlier. Values closer to 0 refer to a greater gap between slow and fast writesets.

wsrep_apply_window
Average distance between the highest and lowest concurrently applied seqno.

wsrep_commit_oooe
How often a transaction was committed out of order.

wsrep_commit_oool
No meaning.

wsrep_commit_window
Average distance between the highest and lowest concurrently committed seqno.

wsrep_local_state
Internal Galera Cluster FSM state number. See Node State Changes.

wsrep_local_state_comment
Human-readable explanation of the state.

wsrep_incoming_addresses
Comma-separated list of incoming server addresses in the cluster component.

wsrep_cluster_conf_id
Total number of cluster membership changes that have happened.

wsrep_cluster_size
Current number of members in the cluster.

wsrep_cluster_state_uuid
See wsrep API.

wsrep_cluster_status
Status of this cluster component: PRIMARY or NON_PRIMARY.

wsrep_local_index
This node's index in the cluster (base 0).

wsrep_ready
Whether the server is ready to accept queries. If this status is OFF, almost all queries will fail with:

ERROR 1047 (08S01) Unknown Command

unless the wsrep_on session variable is set to 0.

wsrep_cert_index_size
The number of entries in the certification index.

wsrep_protocol_version
The version of the wsrep protocol used.

wsrep_connected
If the value is OFF, the node has not yet connected to any of the cluster components. This may be due to misconfiguration. Check the error log for proper diagnostics.
2013, Codership Oy
|1.0||GALERA (1)||April 03, 2016|