High Availability (HA) for OpenStack Platform Services MySQL + rabbitMQ

Oleg Gelbukh’s last blog article gave us a bird’s eye view of different aspects of high availability in OpenStack. Even though all OpenStack components are designed with high availability in mind, the software still relies on external software, such as the database and the messaging system. It is left up to the user to deploy them in a fail-proof manner.

It is very important to bear in mind that everything stateful in OpenStack is done via the message system and the database, and all other components are stateless (excluding Glance). The database and message system in OpenStack are its heart and veins. While the queue system lets numerous components communicate, the database holds the cluster state. Both of them take part in every user request, whether it’s displaying a list of instances or spawning a new vm.

The default for messaging is RabbitMQ and the default database is MySQL. They are known workhorses in the industry and in our experience they generally suffice even for large deployments in terms of scalability. In theory, any database supporting SQLAlchemy will do, but most users stick with the default. For messaging, there is no real alternative to RabbitMQ, although there are people working on a ZeroMQ driver for OpenStack.

How the messages and database work in OpenStack

Let’s first consider how the database and message system components act together in OpenStack. To do this, I will describe the flow of the most popular user request: provisioning an instance.

The user submits his request to OpenStack by interacting with the nova-api component. Nova-api processes the instance creation request by invoking the create_instance function from nova-compute API. The function does the following:

  • It verifies user input: (e.g., checks if the requested vm image, flavor, networks exist). If they are not specified, it tries to get the defaults (e.g., default flavor, network).
  • It checks the request against user quotas.
  • After positive verification of the above, it creates an entry for the instance in OpenStack db (create_db_entry_for_new_instance function).
  • It calls the _schedule_run_instance function, which passes the user request to the nova-scheduler component via the message queue, using AMQP protocol. The body of this request contains the instance parameters:
    request_spec = {
        'image': jsonutils.to_primitive(image),
        'instance_properties': base_options,
        'instance_type': instance_type,
        'num_instances': num_instances,
        'block_device_mapping': block_device_mapping,
        'security_group': security_group,
    }

    The _schedule_run_instance ends by actually sending the message to AMQP with invocation of the scheduler_rpcapi.run_instance function.

Now the scheduler takes over. It receives the message with host specs, and based on them and its scheduling policies, it tries to find an appropriate host to spawn the instance. This is an excerpt from the log files of the nova-scheduler when this is being done (in this example FilterScheduler is used):

Host filter passes for ubuntu from (pid=15493) passes_filters /opt/stack/nova/nova/scheduler/host_manager.py:163
Filtered [host 'ubuntu': free_ram_mb:1501 free_disk_mb:5120] from (pid=15493) _schedule /opt/stack/nova/nova/scheduler/filter_scheduler.py:199

It chooses the host with the least cost by applying a weighting function (there is only one host, so weighting does not change anything here):

Weighted WeightedHost host: ubuntu from (pid=15493) _schedule /opt/stack/nova/nova/scheduler/filter_scheduler.py:209

After the proper compute host has been determined to run the instance, the scheduler invokes the cast_to_compute_host function, which:

  • updates the “host” entry for the instance in the nova database (host = the compute host on which the instance will be spawned) and
  • sends a message via AMQP to nova-compute on this specific host to run the instance. The message includes the UUID of the instance to run and next action to take, which is: run_instance.

In response, nova-compute on the chosen host calls the method _run_instance, which gets the instance parameters from the db (based on the UUID it was passed) and launches an instance with the appropriate parameters. During the course of the instance setup, nova-compute also communicates via AMQP with nova-network in order to set up all the networking (including IP address assignment and DHCP server setup). At various stages of the spawning process, the state of the VM is saved to the nova db, using the function _instance_update.

We can see that when communication between different OpenStack components is involved, AMQP is used. Also, the database is updated several times to reflect the VM provisioning state. So if we lose any of these components, we would severely disrupt basic functions of the OpenStack cluster:

  • A RabbitMQ loss will render us unable to perform any user tasks. Also some resources (like VMs being spawned) will be left in disrupted state.
  • A database loss will cause even more disastrous effects: While all instances will be running, we will not be able to tell who they belong to, what host they have landed on, or what IP addresses they have. Taking into account that you might be running a cloud-scale number of instances (perhaps several thousand), this situation would be unrecoverable.

HA solutions for the database

You can prevent a database crash by careful backup and data replication. In the case of MySQL, many well-documented solutions exist, including MySQL Cluster (an “official” MySQL clustering suite), MMM (multimaster replication manager), and XtraDB from Percona.

MySQL Cluster

MySQL cluster relies on a special storage engine called NDB (Network DataBase). The engine is a cluster of servers called “data nodes,” governed by the “management node.” Data is partitioned and replicated between the data nodes and at least two replicas exist for a given piece of data. All replicas are guaranteed to reside on different data nodes. On top of the data nodes, a farm of MySQL servers runs configured with NDB storage on the backend. Each of the mysqld processes holds read/write capabilities and they can be load balanced for efficiency and high availability.

MySQL Cluster architecture

MySQL Cluster guarantees synchronous replication, which is an obvious flaw of the traditional replication mechanism. It has some limitations compared to other storage engines (this link presents a good overview of them).

XtraDB Cluster

This is a solution from the highly acclaimed company Percona. XtraDB Cluster consists of a set of nodes, each of which runs an instance of Percona XtraDB with a set of patches to support replication. The patches introduce a set of hooks into the InnoDB storage engine and allow it to build a replication system underneath, which follows the WSREP specification.

XtraDB Cluster architecture

Each cluster node runs a patched version of mysqld from Percona. Also, each of them holds a full copy of the data and is available for read/write operations. Replication is synchronous. Just as with MySQL Cluster, XtraDB Cluster also suffers from some limitations, which are described here.

MMM

MultiMaster replication Manager employs a traditional master-slave mechanism for replication. It is based on a set of MySQL servers with at least two master replicas and a set of slaves plus a dedicated monitoring node. On top of this set of hosts, an IP address pool is configured that can be dynamically moved by MMM from host to host, depending on their availability. We have two types of these addresses:

  • A “writer”: Clients can write to the database by connecting to this IP (there can be only one writer address throughout the whole cluster).
  • A “reader”: Clients can read from the database by connecting to this IP (there can be a number of them to scale reads).

The monitor node checks the MySQL servers’ availability and triggers transfer of the “reader” and “writer” IPs if there is a server failure. The check set is rather simple; it includes a test network availability, checks for presence of mysqld on the host,  presence of the replication thread, and the size of the replication backlog. Load balancing reads between “reader” IPs is left up to the user (it can be done with HAProxy or round robin DNS, etc.).

MMM architecture

MMM relies on traditional replication, which is asynchronous. This means there is always a chance  the replicas might not have caught up with the master when it failed. Currently replication is single-threaded, which in the multicore, high-volume transaction world is often not enough, especially when one deals with long-running write queries. These concerns are addressed in the upcoming MySQL version, which implements a number of features for binlog optimizations and HA.

As for OpenStack-specific tutorials regarding MySQL HA, Alessandro Tagliapietra presents an interesting approach (the post is OpenStack-specific) to ensuring MySQL availability using master-slave replication, plus Pacemaker with Percona’s Pacemaker resource agent.

HA solutions for the message queue

Due to its nature, RabbitMQ data is very volatile. Since messaging is all about speed and amounts of data, all the messages are kept in RAM unless you define queues as “durable,” which will make RabbitMQ write them to disk. This is actually supported by OpenStack with the setting rabbit_durable_queues=True in nova.conf. Even though the messages are written to disk and thus will survive a crash or restart of the RabbitMQ server, they should not be considered a real HA solution, since:

  • RabbitMQ does not perform fsync to disk upon receipt of each message, so when a server crashes, there can still be messages residing in filesystem buffers and not actually on the disk. After a reboot they will be lost.
  • RabbitMQ still resides on one node only.

RabbitMQ can be clustered and a clustered RabbitMQ is called a “broker.” Clustering itself is more about scalability than high availability. Still it suffers from a severe flaw—all virtual hosts, exchanges, users, etc., are replicated except the message queues themselves. To address this, a mirrored queues feature has been implemented. Brokering plus queue mirroring should be combined to achieve full fault tolerance of RabbitMQ.

There is also a solution based on Pacemaker but it is considered obsolete in favor of the above.

It’s still worth pointing out that none of the above clustering modes are supported directly by OpenStack; however, Mirantis has done some serious work in the field (more about this later on).

Mirantis deployment experience

At Mirantis we have deployed highly available MySQL using MMM (MultiMaster replication Manager) for a number of our clientscustomers. Although some developers have expressed concerns about MMM,there are complaints about this on the web, we saw have seen no major issues with MMM in our experience, and treat it as a ‘just enough'”good enough” an acceptable solution. However, we do know that there are people who have issues with it and we are taking a look now at architectures based on WSREP’s syncrhonous replication approach, as it by definition provides more data consistency and manageability, as well as simpler setup (e.g., Galera Cluster, XtraDB Cluster).

Below is an illustration of the setup we put together for a large OpenStack deployment:

Database HA is assured by MMM: master-master replication with one standby master (only the active one supporting writes, and both masters supporting reads151so we have one “writer” IP and two “reader” IPs). The mmm_monitor checks the availability of both masters and shuffles “reader” and “writer” IPs accordingly.

On top of MMM, HAproxy load balances read traffic between both “reader” IPs for better performance. Of course, one may also add a number of slaves with further “reader” IPs for scalability. While HAproxy is good at load balancing traffic, it does not provide high availability itself, so another instance of HAproxy is present and for both of them a resource is created in Pacemaker. So if one of the HAproxies fails, Pacemaker will handle moving the IP address from the dead “writer” to the other.

Since we can have only one “writer” IP, we do not need to load balance it and write requests go straight to it.

With this approach we can ensure scalability of write requests by adding more slaves to the DB farm plus load balancing with HAproxy and we also maintain high availability by using Pacemaker (to detect HAproxy fails) plus MMM (to detect db host fails).

As for RabbitMQ HA versus OpenStack, Mirantis has proposed a patch for nova with mirrored queues support. From the user’s perspective it adds two new options to be put into nova.conf:

  • rabbit_ha_queues=True/False – to turn queue mirroring on.
  • rabbit_addresses=["rabbit_host1","rabbit_host2"] – so users can specify a RabbitMQ HA clustered host pair.

Technically what happens is that x-ha-policy:all is added to each queue.declare call inside nova and the roundrobin logic of the cluster is connected. Setting up the RabbitMQ cluster itself is left to the user.

Further information

I’ve presented several options for ensuring high availability to the database and messaging system. Here is a reading list for further research on the subject.


http://wiki.openstack.org/HAforNovaDB
: HA for the OpenStack db
http://wiki.openstack.org/RabbitmqHA
: HA for the queue system
http://www.hastexo.com/blogs/florian/2012/03/21/high-availability-openstack
: a post describing various aspects of OpenStack HA
http://docs.openstack.org/developer/nova/devref/rpc.html
: how OpenStack messaging works
http://www.laurentluce.com/posts/openstack-nova-internals-of-instance-launching/
: a nice walkthrough of the process of instance launching
https://lists.launchpad.net/openstack/pdfGiNwMEtUBJ.pdf
: presentation on nova HA
http://openlife.cc/blogs/2011/may/different-ways-doing-ha-mysql/
: the title explains everything 😉
http://www.linuxjournal.com/article/10718
: an article on MySQL replication
http://www.mysqlperformanceblog.com/2010/10/20/mysql-limitations-part-1-single-threaded-replication/
: addresses replication performance problems
https://github.com/jayjanssen/Percona-Pacemaker-Resource-Agents/blob/master/doc/PRM-setup-guide.rst
: an article on Percona Replication Manager
http://www.rabbitmq.com/clustering.html
: RabbitMQ clustering
http://www.rabbitmq.com/ha.html
: RabbitMQ mirrored queues

10 responses to “High Availability (HA) for OpenStack Platform Services MySQL + rabbitMQ

  1. Nice blog post. Ask Solem is working on HA support for the Kombu RabbitMQ client, to help get active/active HA working for this use case.

    Regarding “how not to lose messages”. You need to turn on “publisher confirms”. The idea is that messages that have been confirmed (acked) have been flushed to disk safely, or delivered. This is described here http://www.rabbitmq.com/confirms.html

    Note that most messaging systems and databases expose ways to use fsync much as Rabbit does.

    1. Hi Alexis,
      It’s interesting to hear about HA support in Kombu, thanks, I’ll look into that.

      You’re absolutely right about publisher confirms. E.g. I recently used them to implement a system that could survive Rabbit crashes by keeping track of which messages have been confirmed and which haven’t, and resending unconfirmed ones at reconnect. It was before RabbitMQ got queue mirroring, and the system was far too performance-sensitive to justify use of queue mirroring too.

      However, I’m not sure if it makes sense to implement this in Nova – if we need H/A we can just use queue mirroring (with my patch that Piotr mentioned – hopefully I’ll get it passed through codereview soon), and then we likely don’t need publisher confirms as mirroring is probably enough.

  2. I saw that the patch was approved and merged into trunk.
    Can you provide instructions on how to use the patch with an existing Essex installation?

    I tried to replace the impl_kombu.py on the Nova compute host and changed the nova.conf by adding “rabbit_hosts” and “rabbit_ha_queues” but nova-compute wouldn’t start anymore with the error:

    ClassNotFound: Class impl_kombu could not be found: ‘module’ object has no attribute ‘impl_kombu’

    Thanks in advance for any help.

  3. Have you ever considered about adding a little bit more than just
    your articles? I mean, what you say is valuable and everything.
    However just imagine if you added some great graphics or video clips
    to give your posts more, “pop”! Your content is excellent but with pics and
    clips, this site could certainly be one of the
    best in its field. Excellent blog!

    1. Hi,
      I highly appreciate your suggestions on this blog!
      In fact, we have recently gone a bit into more multimedia 😉

      Please, check out our webcast recording on new features in Folsom
      There are plans to add more.

      Regards,
      -Piotr

  4. I would be willing to bet that ‘mysql error 1062’ will be encountered in any mysql multi-master replication fronted by a load balancer, when multiple VMs are launched simultaneously.

  5. Very interesting posts from you Piotr and your colleagues as well! Thank you.

    There is something I don’t understand with the mySQL / MMM setup.

    Where in the configuration files can you differentiate DB reads from DB writes ? The only configuration points I know of are these sql_connection strings such as the one we have in nova.conf

    sql_connection=mysql://nova:password@sqlserver/nova.

    1. Chris,
      When it comes to using MMM vs Openstack, we are currently using one IP address for reads & writes. We are not aware of any Openstack patch that could allow for differentiation of reads & writes.

      -Piotr

  6. Hi,

    can anyone pls clarify “It is very important to bear in mind that everything stateful in OpenStack is done via the message system and the database, and all other components are stateless (excluding Glance). ”

    what are stateless components and why glance is stateful ?

  7. Hello

    We have deployed the MySQL HA successfully at two physical computer.

    Then we try to deploy the MySQL HA in the same method at two VMs on openstack, but we got some trouble.

    The two VMs can be synchronized and created virtual ip , but the same domain VMs can not get response from virtual ip through ‘ping’ command .

    Could you give me some advice ?

    Thanks in advance.

Leave a Reply

Your email address will not be published. Required fields are marked *

NEWS VIA EMAIL

Recommendations

Archive

LIVE DEMO
Mirantis Cloud Platform
WEBINAR
Orchestrate Hybrid Cloud Apps with Spinnaker