Best practices for running RabbitMQ in OpenStack

OpenStack depends on message queues, so it’s crucial that you have the best possible setup. Most deployments include RabbitMQ, so let’s take a few minutes to look at best practices for making sure it runs as efficiently as possible.

Deploy RabbitMQ on dedicated nodes

On dedicated nodes, RabbitMQ is isolated from other CPU-hungry processes, so it can sustain a heavier load.

This isolation option is available in Mirantis OpenStack starting from version 8.0. For more information, do a search for ‘Detach RabbitMQ’ on the validated plugins page.

Run RabbitMQ with HiPE

HiPE stands for High Performance Erlang. When HiPE is enabled, the Erlang application is pre-compiled into machine code before being executed. Our benchmark showed that this gives RabbitMQ a performance boost of up to 30%. (If you’re into that sort of thing, you can find the benchmark details here and the results are here.)

The drawback of doing things this way is that the application’s initial start time increases considerably while the Erlang application is compiled. With HiPE, the first RabbitMQ start takes around 2 minutes.

Another, subtler drawback we have discovered is that with HiPE enabled, debugging RabbitMQ can be hard, because HiPE can mangle error tracebacks to the point of making them unreadable.

HiPE is enabled in Mirantis OpenStack starting with version 9.0.

Do not use queue mirroring for RPC queues

Our research shows that enabling queue mirroring on a 3-node cluster makes message throughput drop by a factor of two. You can see this effect in the publicly available test reports produced by the Mirantis Scale team.

On the other hand, RPC messages become obsolete pretty quickly (after 1 minute), and if messages are lost, only the operations currently in progress fail, so overall, running RPC queues without mirroring seems to be a good tradeoff.

At Mirantis, we generally enable queue mirroring only for Ceilometer queues, where messages must be preserved. You can see how we define such a RabbitMQ policy here.
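As a rough illustration of what a selective mirroring policy looks like, the Python sketch below assembles a `rabbitmqctl set_policy` command that mirrors only the queues matching a pattern. The policy name and the `^notifications\.` pattern are assumptions made for this example and are not taken from the Mirantis configuration; `ha-mode` and `ha-sync-mode` are standard RabbitMQ policy keys.

```python
import json


def mirroring_policy_command(name, pattern, vhost="/"):
    """Build a rabbitmqctl command that mirrors queues matching `pattern`."""
    # Mirror matching queues to all cluster nodes and sync new mirrors automatically.
    definition = {"ha-mode": "all", "ha-sync-mode": "automatic"}
    return [
        "rabbitmqctl", "set_policy",
        "-p", vhost,
        "--apply-to", "queues",
        name, pattern, json.dumps(definition, sort_keys=True),
    ]


# Hypothetical policy name and queue-name pattern, for illustration only.
cmd = mirroring_policy_command("ha-notifications", r"^notifications\.")
print(" ".join(cmd))
```

Queues that do not match the pattern, such as RPC queues, get no mirroring policy and stay unmirrored.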

The option to turn off queue mirroring is available starting in Mirantis OpenStack (MOS) 8.0, and mirroring is turned off by default for RPC queues starting in version 9.0.

Use a separate RabbitMQ cluster for Ceilometer

In general, Ceilometer doesn’t send many messages through RabbitMQ. But if Ceilometer gets stuck, its queues overflow. That leads to RabbitMQ crashing, which in turn causes outages for other OpenStack services.

The ability to use a separate RabbitMQ cluster for notifications is available starting with OpenStack Mitaka (MOS 9.0), but it is not supported in MOS out of the box. The feature is not documented yet, but you can find the implementation here.
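One way to picture the split is at the service configuration level. The sketch below renders a config fragment in which RPC traffic and notifications point at different RabbitMQ clusters. It assumes the oslo.messaging `transport_url` option in the `[DEFAULT]` and `[oslo_messaging_notifications]` sections; the hostnames and credentials are placeholders, and this is not necessarily how the implementation mentioned above wires things up.

```python
import configparser
import io


def render_messaging_config(rpc_url, notification_url):
    """Render an ini fragment that sends notifications to a separate cluster."""
    # Disable interpolation so '%' characters in URLs are kept literally.
    config = configparser.ConfigParser(interpolation=None)
    config["DEFAULT"] = {"transport_url": rpc_url}
    config["oslo_messaging_notifications"] = {"transport_url": notification_url}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Placeholder hosts and credentials, for illustration only.
fragment = render_messaging_config(
    "rabbit://user:secret@rpc-rabbit.example.org:5672/",
    "rabbit://user:secret@notify-rabbit.example.org:5672/",
)
print(fragment)
```

With a layout like this, a backlog in the notification cluster cannot exhaust the resources of the cluster that carries RPC traffic.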

Reduce Ceilometer metrics volume

Another best practice when it comes to running RabbitMQ beneath OpenStack is to reduce the number of metrics sent and/or their frequency. Obviously that reduces the load on RabbitMQ, Ceilometer and MongoDB, but it also reduces the chance of messages piling up in RabbitMQ if Ceilometer/MongoDB can’t cope with their volume. In turn, messages piling up in a queue reduce overall RabbitMQ performance.
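To get a feel for how much these two knobs matter, here is a back-of-the-envelope estimate in Python. The resource count, meter counts and polling intervals are made-up numbers for illustration, not measurements, and the model (one sample per meter per resource per interval) is a simplifying assumption.

```python
def samples_per_second(resources, meters_per_resource, interval_seconds):
    """Estimate the steady-state sample rate produced by polling."""
    return resources * meters_per_resource / interval_seconds


# Hypothetical cloud: 1000 resources, 20 meters each, polled every 60 seconds.
baseline = samples_per_second(1000, 20, 60)
# Same cloud with half the meters and a 10-minute polling interval.
reduced = samples_per_second(1000, 10, 600)

print("baseline: {:.1f} samples/s".format(baseline))
print("reduced:  {:.1f} samples/s".format(reduced))
print("reduction factor: {:.0f}x".format(baseline / reduced))
```

Under these assumptions, halving the meters and polling ten times less often cuts the message rate about twentyfold, from roughly 333 to roughly 17 samples per second.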

You can also mitigate the effect of messages piling up by using RabbitMQ’s lazy queues feature (available starting with RabbitMQ 3.6.0), but as of this writing, MOS does not make use of lazy queues.
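For readers who want to experiment with lazy queues on their own, the sketch below builds (but does not send) a request to the RabbitMQ management HTTP API that would apply a `queue-mode: lazy` policy to matching queues. The host, policy name and queue-name pattern are hypothetical; the request assumes the management plugin is enabled and listening on its default port, 15672.

```python
import json
from urllib.parse import quote


def lazy_queue_policy_request(base_url, vhost, name, pattern):
    """Return (method, url, body) for a policy that makes matching queues lazy."""
    # The vhost "/" must be percent-encoded as %2F in the URL path.
    url = "{}/api/policies/{}/{}".format(
        base_url.rstrip("/"), quote(vhost, safe=""), quote(name, safe="")
    )
    body = {
        "pattern": pattern,
        "definition": {"queue-mode": "lazy"},
        "apply-to": "queues",
    }
    return "PUT", url, json.dumps(body, sort_keys=True)


# Hypothetical host, policy name and pattern, for illustration only.
method, url, body = lazy_queue_policy_request(
    "http://rabbit.example.org:15672", "/", "lazy-notifications", r"^notifications\."
)
print(method, url)
print(body)
```

Lazy queues move messages to disk as early as possible, which is why they help when consumers fall behind and a backlog builds up.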

(Carefully) consider disabling queue mirroring for Ceilometer queues

In the Mirantis OpenStack architecture, queue mirroring is the only ‘persistence’ measure used. We do not use durable queues, so do not disable queue mirroring if losing Ceilometer notifications will hurt you. For example, if notification data is used for billing, you can’t afford to lose those notifications.

The ability to disable mirroring for Ceilometer queues is available in Mirantis OpenStack starting with version 8.0, but it is off by default, so Ceilometer queues stay mirrored unless you change it.

So what do you think? Did we leave out any of your favorite tips? Let us know in the comments!

One response to “Best practices for running RabbitMQ in OpenStack”

  1. If you’ve disabled all the mirroring, might it be worth disabling clustering too? Because you’ve literally accepted that messages are not that important to keep them during a failover. So, getting rid of a cluster might add some performance and reliability (in terms of a possible downtime, cluster reassembling takes time) to the messaging, since there won’t be overhead to move messages from one RabbitMQ-node to another along with the absence of the fault intolerant “clustering” mechanism. All the messaging nodes will be standalone and for the failover you can use dead-simple haproxy, for instance. What do you think? 🙂
