Auto-remediation: making an Openstack cloud self-healing
Some of the outages will never happen again if you’ll make the proper long-term fix to the environment, but others will rear their heads again and again. Finding an automated way to handle those issues, either by preventing or fixing them, is crucial if you want to keep your environment stable and reliable.
That's where auto-remediation kicks in.
What is Auto-Remediation?Auto-Remediation, or Self-Healing, is when automation responds to alerts or events by executing actions that can prevent or fix the problem.
The simplest example of auto-remediation is cleaning up the log files of a service that has filled up the available disk space. (It happens to everybody. Admit it.) Imagine an automated action that is triggered by a monitoring system to clean the logs and prevent the service from crashing. In addition, it creates a ticket and sends a notification so the engineer can fix log rotation during business hours, and there is no need to do it in the middle of the night. Furthermore, the event-driven automation can be used for assisted troubleshooting, so when you get an alert it includes related logs, monitoring metrics/graphs, and so on.This is what an incident resolution workflow should look like:
Auto-remediation toolingFacebook, LinkedIn, Netflix, and other hyper-scale operators use event-driven automation and workflows, as described above. While looking for an open source solution, we found StackStorm, which was used by Netflix for the same purpose. Sometimes called IFTTT (If This, Then That) for ops, the StackStorm platform is built on the same principles as a famous Facebook FBAR (FaceBook AutoRemediation), with “infrastructure as code”, a scalable microservice architecture, and it's supported by a solid and responsive team. (They are now part of Brocade, but the project is accelerating.) StackStorm uses OpenStack Mistral as a workflow engine, and offers a rich set of sensors and actions that are easy to build and extend.
The auto-remediation approach can easily be applied when operating an OpenStack cloud in order to improve reliability. And it's a good thing, too, because OpenStack has many moving parts that can break. Event-driven automation can take care of a cloud when you sleep, handling not only basic operations such as restarting nova-api and cleaning ceilometer logs, but also complex actions such as rebuilding the rabbitmq cluster or fixing Galera replication.
Automation can also expedite incident resolution by “assisting” engineers with troubleshooting. For example, if monitoring detects that keystone has started to return 503 for every request, the on-call engineer can be provided with logs from every keystone node, memcached and DB state even before starting the terminal.
In building our own self-healing OpenStack cloud, we started small. Our initial POC had just 3 simple automations: cleaning logs, service restarts and cleaning rabbitmq queues. We placed them on our 1,000 node OpenStack cluster, and they run there for 3 months, taking these 3 headaches off our operators. This example showed us that we need to add more and more self-healing actions, so our on-call engineers can sleep better at night.
Here is the short list of issues that can be auto-remediated:
- Dead process
- Lack of free disk space
- Overflowed rabbitmq queues
- Corrupted rabbitmq mnesia
- Broken database replication
- Node hardware failures (e.g. triggering VM evacuation)
- Capacity issue (by adding more hypervisors)