Clustered RabbitMQ on Kubernetes

There are many possible approaches to setting up clustered RabbitMQ on Kubernetes. Today I’m going to talk about the clustering approach we adopted for the Fuel CCP project, but most of the pitfalls are common to all approaches to RabbitMQ clustering, so even if you want to come up with your own solution, you should find a good bit of this material meaningful.

Naming your rabbits

Running a RabbitMQ cluster inside k8s poses a series of interesting problems. The very first one is what kind of node names we should use so that the rabbits can see each other. Here are some examples of allowable names in different formats:

  • rabbit@hostname
  • rabbit@hostname.domainname
  • rabbit@172.17.0.4

Even before trying to start any rabbits, you need to be sure that containers can reach each other using the selected naming scheme – for example, a ping should work with the part that comes after the @.

The Erlang distribution (which is actively used by RabbitMQ) can run in one of two naming modes: short names or long names. The rule of thumb is that it’s “long” when it contains a period (.), and “short” otherwise. For the name examples above that means that the first one is a short name, and the second and the third are long names.

Looking at all this we see that on Kubernetes we have the following options for naming nodes:

  • Use PetSets (also known as StatefulSets) so we will have some stable DNS names. In contrast to the normal “disposable” replicas that can be easily dropped if they become unhealthy, a PetSet is a group of stateful pods with a stronger notion of identity.
  • Use IP-addresses and some sort of automatic peer discovery (such as the autocluster plugin, which automatically clusters RabbitMQ nodes in a discoverable way)

Both of these options require running in long-name mode, but the way DNS/hostnames are configured inside Kubernetes pods is incompatible with RabbitMQ versions prior to 3.6.6, so be sure to apply this fix if needed.
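With IP-based names, long-name mode boils down to a couple of environment variables. RABBITMQ_USE_LONGNAME and RABBITMQ_NODENAME are real RabbitMQ knobs; the MY_POD_IP variable (typically injected via the Kubernetes downward API) and the fallback address are illustrative assumptions:

```shell
# Put the node into long-name mode with an IP-based node name.
# MY_POD_IP is assumed to come from the Kubernetes downward API;
# the fallback IP below is purely illustrative.
export RABBITMQ_USE_LONGNAME=true
export RABBITMQ_NODENAME="rabbit@${MY_POD_IP:-172.17.0.4}"
echo "$RABBITMQ_NODENAME"
```

The node name contains a period-free hostname only when the part after the @ is a bare hostname; an IP address like the one above always counts as a long name.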

Erlang cookie

The second prerequisite for successful clustering is that RabbitMQ nodes need to share a secret cookie. By default, RabbitMQ reads this cookie from a file (and generates this file if it’s missing). Our options for making sure this cookie is the same on all nodes are as follows:

  • Create the cookie file during docker image creation. This practice isn’t recommended, because knowing this cookie gives you the full access to all RabbitMQ internals.
  • Create the cookie file in an entrypoint script, with the secret value passed as an environment variable. If we also need an entrypoint script for other reasons, this is as good as the next solution.
  • Pass additional options to RabbitMQ via environment variables:
    RABBITMQ_CTL_ERL_ARGS="-setcookie <our-cookie>"
    RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-setcookie <our-cookie>"
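The entrypoint-script option can be sketched as below. The environment variable name and the cookie path are illustrative assumptions (RabbitMQ itself reads `.erlang.cookie` from the server user’s home directory, usually `/var/lib/rabbitmq`):

```shell
# Hypothetical entrypoint sketch: write the shared cookie from an
# environment variable before handing off to the server process.
# The variable name and default path are illustrative assumptions.
write_cookie() {
    cookie_file="$1"
    umask 077                            # keep the cookie private
    printf '%s' "$RABBITMQ_ERLANG_COOKIE" > "$cookie_file"
    chmod 400 "$cookie_file"             # Erlang refuses lax permissions
}

RABBITMQ_ERLANG_COOKIE="${RABBITMQ_ERLANG_COOKIE:-change-me}"
write_cookie "${COOKIE_FILE:-./erlang.cookie}"
# A real entrypoint would now do: exec rabbitmq-server "$@"
```

Keeping the cookie out of the image and in a k8s secret (as done in the demo at the end of this post) means a leaked image doesn’t leak cluster access.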

Clustering gotchas

Another important thing to know about RabbitMQ clusters is that when a node joins a cluster, its data is lost, no matter what. In the most common case this doesn’t matter – it’s an empty node that is joining, so there is nothing to lose. But if two nodes operated independently for some time and accumulated data, there is no way to join them without losses (note that recovering a cluster after a network split or node outage is just a special case of the same thing, also with data loss). For a specific workload you can invent workarounds, such as draining (manually or automatically) the nodes that are bound to be reset, but there is just no way to make such a solution robust, automatic, and universal.

So our choice of automatic clustering solution is heavily influenced by what kinds of data losses we can tolerate for our concrete workloads.

Cluster formation

Assuming that you’ve solved all naming-related problems and can cluster your rabbits manually using rabbitmqctl, it’s time to make cluster assembly automatic. But as we saw in the previous section, there is no one-size-fits-all solution for that problem.

One such very specific solution is only suitable for workloads where we don’t care if some data is lost along with a server that went down or got disconnected. An example of this kind of workload is when you’re using RPC, where clients just retry their (preferably idempotent) requests after any error/timeout, because by the time a server recovers from a failure, the RPC request will be stale, and no longer relevant anyway. Fortunately, this is exactly what’s happening with RPC calls performed by various components of OpenStack.

Bearing all of the above in mind, we can start designing our solution. The more stateless the better, so using IP-addresses is preferable to using PetSets, and the autocluster plugin is our obvious candidate for forming a cluster from a bunch of dynamic disposable nodes.

Going through autocluster documentation, we end up setting the following configuration options:

  • {backend, etcd}: This is an almost arbitrary choice; consul or k8s would’ve worked just as well. The only reason for choosing it was that it’s easier to test. You can download the etcd binary, run it without any parameters, and just start forming clusters on localhost.
  • {autocluster_failure, stop}: A pod that failed to join the cluster is useless for us, so it should bail out and hope that the next restart will happen in a more friendly environment.
  • {cluster_cleanup, true}, {cleanup_interval, 30}, {cleanup_warn_only, false}, {etcd_ttl, 15}: The node is registered in etcd only after it has successfully joined the cluster and fully started up. This registration has a TTL that is constantly refreshed while the node is alive. If the node dies (or fails to refresh the TTL for any other reason), it’s forcefully kicked from the cluster, so even if the failed node restarts with the same IP, it’ll be able to join the cluster afresh.
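Put together, the autocluster section of the Erlang config could look like the sketch below. The options are the ones discussed above; the etcd_host/etcd_port keys are an assumption about the backend-specific settings, so check the plugin documentation for your version:

```erlang
[
  {autocluster, [
    {backend, etcd},
    {autocluster_failure, stop},
    {cluster_cleanup, true},
    {cleanup_interval, 30},
    {cleanup_warn_only, false},
    {etcd_ttl, 15},
    %% Assumed backend-specific keys; verify against your plugin version
    {etcd_host, "etcd"},
    {etcd_port, 4001}
  ]}
].
```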

Unexpected races

If you’ve tried assembling a cluster several times with the config above, you may have noticed that sometimes it assembles several different, unconnected clusters. This happens because the only protection against startup races in autocluster is a random delay during node startup – and with sufficiently bad timing, every node can decide that it is the first node (i.e. there are no other records in etcd) and just start in unclustered mode.

This problem led to the development of a pretty big patch for autocluster. It adds proper startup locking – the node acquires the startup lock early in startup, and releases it only after it has properly registered itself in the backend. Only the etcd backend is supported at the moment, but others can be added easily (by implementing just 2 new callbacks in the backend module).
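The locking idea itself is simple and can be sketched against the etcd v2 HTTP API (the key path, TTL, and endpoint below are illustrative; the real implementation lives inside the plugin’s Erlang code):

```shell
# Sketch of the startup lock using the etcd v2 API. prevExist=false turns
# the PUT into an atomic create, so only one node can hold the lock.
acquire_lock() {
    curl -fs -m 2 -X PUT "$1/v2/keys/rabbitmq/startup_lock" \
         -d value="$(hostname)" -d ttl=30 -d prevExist=false > /dev/null
}

release_lock() {
    curl -fs -m 2 -X DELETE "$1/v2/keys/rabbitmq/startup_lock" > /dev/null
}

ETCD_URL="${ETCD_URL:-http://etcd:4001}"
if acquire_lock "$ETCD_URL"; then
    echo "lock acquired: join the cluster, register in etcd, then release"
    release_lock "$ETCD_URL"
else
    echo "lock busy (or etcd unreachable): back off and retry"
fi
```

The TTL on the lock key matters: if a node crashes while holding the lock, the lock expires on its own instead of wedging the whole rollout.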

Another solution to that problem is, again, Kubernetes PetSets, as they can perform startup orchestration – with only one node starting up at any given time – but they are currently considered an alpha feature, and the patch above provides this functionality for everyone, not only Kubernetes users.

Monitoring

The only thing left to make our cluster run unattended is to add some monitoring. We need to monitor both each rabbit’s health and whether it’s properly clustered with the rest of the nodes.

You may remember the times when rabbitmqctl list_queues / rabbitmqctl list_channels were used for monitoring, but this is not ideal: they can’t distinguish between local and remote problems, and they create significant network load. To that end, meet the new and shiny rabbitmqctl node_health_check – since 3.6.4 it’s the best way to check the health of any single RabbitMQ node.

Checking whether a node is properly clustered requires several checks:

  • It should be clustered with the best node registered in the autocluster backend – the node that new nodes will attempt to join, which is currently the first alive node in alphabetical order.
  • Even when the node is properly clustered with the discovery node, its data can still have diverged. Also, this check is not transitive, so we need to check the partitions list both on the current node and on the discovery node.

All these clustering checks are implemented in separate commits and can be invoked using:

rabbitmqctl eval 'autocluster:cluster_health_check_report().'

Using this rabbitmqctl command we can detect any problem with our rabbit node and stop it immediately, so that Kubernetes has a chance to do its restarting magic.
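Both checks fit naturally into a liveness probe, sketched below. The timings are illustrative, and we assume the eval call exits non-zero when the report finds a problem, so a failing probe makes Kubernetes restart the pod:

```yaml
# Sketch of a liveness probe combining both checks; values are illustrative.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -ec
      - |
        rabbitmqctl node_health_check
        rabbitmqctl eval 'autocluster:cluster_health_check_report().'
  initialDelaySeconds: 60
  periodSeconds: 30
  timeoutSeconds: 25
```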

Make your own RabbitMQ cluster

If you want to replicate this setup yourself, you’ll need a recent version of RabbitMQ and the custom release of the autocluster plugin (as the startup locking patch is not yet accepted upstream).

You can look into how this setup was done for Fuel CCP, or use the standalone version of the same setup as a base for your own implementation.

To give you an idea of how this works, let’s assume that you have cloned the second repository, and that you have a k8s namespace named `demo` with an `etcd` server running in it, reachable via the DNS name `etcd`. You can create this setup by running the following commands:

kubectl create namespace demo
kubectl run etcd --image=microbox/etcd --port=4001 \
--namespace=demo -- --name etcd
kubectl --namespace=demo expose deployment etcd

Once that’s done, to set up RabbitMQ, follow these steps:

  1. Build a Docker image with the proper versions of RabbitMQ and autocluster, and with all the necessary configuration:

    $ docker build . -t rabbitmq-autocluster
  2. Store the Erlang cookie in the k8s secret store.

    $ kubectl create secret generic --namespace=demo erlang.cookie \
    --from-file=./erlang.cookie
  3. Create a 3-node RabbitMQ deployment. For simplicity’s sake, you can use the rabbitmq.yaml file from https://github.com/binarin/rabbit-on-k8s-standalone/blob/master/rabbitmq.yaml.

    $ kubectl create -f rabbitmq.yaml
  4. Check that clustering indeed worked.

    $ FIRST_POD=$(kubectl get pods --namespace demo -l 'app=rabbitmq' \
    -o jsonpath='{.items[0].metadata.name }')
    $ kubectl exec --namespace=demo $FIRST_POD rabbitmqctl \
    cluster_status

You should see output similar to:

Cluster status of node 'rabbit@172.17.0.3' ...
[{nodes,[{disc,['rabbit@172.17.0.3','rabbit@172.17.0.4',
                'rabbit@172.17.0.7']}]},
 {running_nodes,['rabbit@172.17.0.4','rabbit@172.17.0.7','rabbit@172.17.0.3']},
 {cluster_name,<<"rabbit@rabbitmq-deployment-861116474-cmshz">>},
 {partitions,[]},
 {alarms,[{'rabbit@172.17.0.4',[]},
          {'rabbit@172.17.0.7',[]},
          {'rabbit@172.17.0.3',[]}]}]

The important part here is that both nodes and running_nodes contain all three nodes.
