Some practical considerations for monitoring in an OpenStack cloud
September 4, 2012
There is an old saying in IT: “Tough guys never do backups.” While this kind of thinking doesn’t prevent infrastructures from functioning, it’s definitely asking for trouble. And the same goes for monitoring.
Having gone through all the pain of polishing an OpenStack installation, I can see why it might seem like a better idea to grab a couple of beers in a local bar than spend additional hours fighting Nagios or Cacti. In the world of bare metal these tools seem to do the job—they warn about abnormal activities and provide insight into trend graphs.
With the advent of the cloud, however, proper monitoring has started getting out of hand for many IT teams. On top of the relatively predictable hardware, a complicated pile of software called “cloud middleware” has been dropped in our laps. Huge numbers of virtual resources (VMs, networks, storage) are coming and going at a rapid pace, causing unpredictable workload peaks.
Proper methods of dealing with this mess are in their infancy – and could even be worth launching a startup company around. This post is an attempt to shed some light on the various aspects of cloud monitoring, from the hardware to the user’s cloud ecosystem.
In cloud environments, we can identify three distinct areas for monitoring:
- the OpenStack services and the hardware they run on,
- the status of the user's cloud ecosystem,
- the performance of cloud resources.
Next, I’ll elaborate on these three aspects of monitoring.
Monitoring OpenStack services and hardware
A vast array of monitoring and notification software is on the market. To monitor OpenStack and platform services (nova-compute, nova-network, MySQL, RabbitMQ, etc.), Mirantis usually deploys Nagios. This table shows which Nagios checks do the trick:
You might also want to take a look at the Zenoss ZenPack for OpenStack.
Monitoring status of the user’s cloud ecosystem
When it comes to users, their primary expectation is usually about consistent feedback from OpenStack about their instance states. There is nothing more annoying than having an instance reported as “ACTIVE” by the dashboard, even though it’s been gone for several minutes.
OpenStack tries to prevent such discrepancies by running several checks in a cron-like manner (they are called “periodic tasks” in OpenStack code) on each compute node. These checks are simply methods of the ComputeManager class located in compute/manager.py or directly in drivers for different hypervisors (some of these checks are available for certain hypervisors only).
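Schematically, the mechanism works like this. The sketch below is illustrative only; the decorator and method names are made up for the example, not nova's actual API:

```python
# Minimal sketch of the "periodic tasks" pattern used by nova's
# ComputeManager. Names here are hypothetical, not nova's real code.

def periodic_task(fn):
    """Mark a method so the manager's loop will invoke it periodically."""
    fn._periodic_task = True
    return fn

class ComputeManagerSketch:
    @periodic_task
    def sync_power_states(self):
        # In nova, a check like this compares the power state recorded
        # in the database with what the hypervisor actually reports,
        # and reconciles the two.
        return "synced"

    def run_periodic_tasks(self):
        # A cron-like loop on each compute node calls this method every
        # few seconds; it runs every method marked as a periodic task.
        results = []
        for name in dir(self):
            method = getattr(self, name)
            if callable(method) and getattr(method, "_periodic_task", False):
                results.append(method())
        return results
```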
Check behavior can usually be controlled by flags. Some flags are limited to sending a notification or setting the instance to an ERROR state, while others try to take some proactive actions (rebooting, etc.). Currently only a very limited set of checks is provided.
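As an example of such flags, the nova codebase of that era exposes options for handling instances that keep running on the hypervisor after being deleted from the database. The fragment below is illustrative; verify the exact option names and defaults against your release:

```ini
# nova.conf fragment (illustrative values).
# What to do with instances still running after DB deletion:
# noop (ignore), log (warn only), or reap (shut down and clean up).
running_deleted_instance_action = log
# How often (in seconds) the periodic task looks for such instances.
running_deleted_instance_poll_interval = 1800
```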
Even though a number of checks are present for the compute node, there seems to be no periodic check for the operation of other relevant components, such as networking or block storage.
The above issues definitely require substantial improvement, as they are critical not only from a usability standpoint but also because they introduce a security risk. A good OpenStack design draft to watch is the one on resource monitoring. It deals not only with monitoring resource statuses, but also provides a notification framework to which different tenants can subscribe. As the blueprint page shows, the project has not yet been accepted by the community.
Monitoring performance of cloud resources
While monitoring farms of physical servers is a standard task even at large scale, monitoring virtual infrastructure ("cloud scale") is much more daunting. The cloud introduces a lot of dynamic resources, which can behave unpredictably and move between different pieces of hardware. So it is usually quite hard to tell which of thousands of VMs has problems (without even having root access to it) and how the problem affects other resources. Standards like sFlow try to tackle this problem by providing efficient monitoring for a high volume of events (based on sampling) and by determining relationships between different cloud resources.
I can give no better bird's-eye explanation of sFlow than the one its webpage provides.
sFlow is developed by a consortium of mainstream network device vendors, including Extreme Networks, HP, and Hitachi. Since it's embedded in their devices, it provides a consistent way to monitor traffic across different networks. From the open source standpoint, it's worth noting that it's also built into the Open vSwitch virtual switch (a more robust alternative to the Linux bridge).
To provide end-to-end packet flow analysis, sFlow agents need to be deployed on all devices across the network. The agents simply collect sample data on network devices (including packet headers and forwarding/routing table state), and send them to the sFlow Collector. The collector gathers data samples from all the sFlow agents and produces meaningful statistics out of them, including traffic characteristics and network topology (how packets traverse the network between two endpoints).
To achieve better performance in high-speed networks, the sFlow agent does sampling (picks one packet out of a given number). A more detailed walkthrough of sFlow is provided here.
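The 1-in-N sampling idea can be sketched as follows. This is illustrative only; a real sFlow agent samples in the device's datapath and also ships interface counter samples, which this toy version omits:

```python
import random

def sample_packets(packets, rate, rng=random.random):
    """sFlow-style sampling sketch: keep each packet with probability
    1/rate, then scale the observed count by rate to estimate the
    true traffic volume."""
    sampled = [p for p in packets if rng() < 1.0 / rate]
    estimated_total = len(sampled) * rate
    return sampled, estimated_total
```

Because only a fraction of packets is examined and exported, the agent's overhead stays low even on high-speed links, at the cost of statistical rather than exact counts.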
Now, since this is all about networking, why should you care if you're in the cloud business? While sFlow itself is defined as a standard for monitoring networks, it also comes with a "host sFlow agent," described on the project website. Beyond that definition, you need to know that host sFlow agents are available for mainstream hypervisors, including Xen, KVM/libvirt, and Hyper-V (with VMware support to be added soon), and can be installed on a number of operating systems (FreeBSD, Linux, Solaris, Windows) to monitor the applications running on them. For IaaS clouds based on these hypervisors, this means it's now possible to sample different metrics of an instance (including I/O, CPU, RAM, interrupts/sec, etc.) without even logging into it.
To make it even better, one can combine the "network" and "host" parts of sFlow data into a comprehensive monitoring solution for the cloud. This is done via the so-called sFlow host structure: a way of pairing host metrics with their corresponding network metrics. To relate the "host" and "network" data, a mapping of MAC addresses is used. sFlow agents can also be installed inside the instance's operating system to measure the performance of individual applications based on their TCP/UDP ports. Apart from combining host and network stats into one monitored entity, sFlow host structures also let you track instances as they migrate across the physical infrastructure (again, by tracking MAC addresses).
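The join itself is conceptually simple. Assuming hypothetical record layouts keyed by MAC address (the field names below are made up for illustration, not the actual sFlow schema), it can be sketched as:

```python
def join_by_mac(host_records, net_records):
    """Pair host metrics with network metrics for the same VM by
    matching on MAC address (illustrative field names)."""
    net_by_mac = {r["mac"]: r for r in net_records}
    joined = []
    for h in host_records:
        n = net_by_mac.get(h["mac"])
        if n is not None:
            joined.append({"mac": h["mac"],
                           "cpu_pct": h["cpu_pct"],
                           "bytes_out": n["bytes_out"]})
    return joined
```

Because the MAC address follows the VM during live migration, the same key keeps the host and network views of an instance paired even as it moves between physical servers.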
What does this all mean for OpenStack?
With the advent of Quantum in the Folsom release, the virtual network device will move from a Linux bridge to Open vSwitch. If we add KVM or Xen to the mix, we have a workable framework for monitoring both the instances themselves and their virtual network topologies.
There are a number of sFlow collectors available. The most widely used seem to be Ganglia and sFlowTrend, which are free. While Ganglia is focused mainly on monitoring the performance of clustered hosts or instance pools, sFlowTrend seems to be more robust, adding network metrics and topologies on top of that.
Since sFlow is so cool, it’s hard to believe I couldn’t find any signs of its presence in the OpenStack community (no blueprints, discussions, patches, etc.). On the other hand, sFlow folks have already noticed a possible use case for their product with OpenStack.
Cloud infrastructures have many layers, and each of them requires a different approach to monitoring. For the hardware and service layer, the tools have been there for years. And OpenStack itself follows well-established standards (web services, WSGI, REST, SQL, AMQP), which saves sysadmins from having to write their own custom probes.
On the other hand, OpenStack’s ability to track the resources it governs is rather limited now—as shown above, only the compute part is addressed and other aspects like networking and block storage seem to be left untouched. For now the user must assume that the status of the resources reported by OpenStack is unreliable. Unless this area receives more attention from the project, I would advise you to monitor cloud resources using third-party tools.
As for monitoring the cloud resources’ performance, sFlow seems to be the best path to follow. The standard is designed to scale to monitor thousands of resources without serious performance impact and all the metrics are gathered without direct interaction with tenants’ data, which ensures privacy. The relationship between the host metrics and network metrics, plus the determination of network flows allows us to see how problems on a given host affect the whole network (and possibly other hosts).
To properly monitor all the layers of the architecture, it is currently necessary to employ a number of different tools. There is a great market niche right now for startups to integrate tools such as sFlow into existing cloud platforms. Or, better yet, to provide a single, comprehensive solution that could monitor all aspects of the cloud from one console.