
The LMA toolchain is a relatively new project, but it already has contributors among the big telcos, so it’s probably no surprise that it has been designed from the ground up to be scalable, extensible, and capable of integrating with existing monitoring systems. Think of it as a composable suite of tools focusing on operational health and response for OpenStack clusters.
The main idea of this set of tools is that it collects “everything” (logs, notifications, service states, metrics, and so on) and turns it into internal structured messages. In contrast to “conventional” monitoring tools such as Zabbix, it is based on a decentralized stream-processing architecture, is opinionated about what’s important to monitor and how, and aims to deliver insightful data to consumers out of the box.
Perhaps most importantly, this toolchain was designed to scale for humans. In other words, it’s synthetic, with no alerting sprawl: we made a distinction between what is truly indicative of a critical condition that must be acted upon immediately (alert) and what can be deferred (diagnose). We knew how important this aspect of usability was because we were already using the LMA toolchain in our own Scalability Lab, where clusters of hundreds of nodes are the norm.
What we needed the LMA toolchain to do
The conventional monitoring solutions we have been using since the 1990s fall short of addressing the monitoring needs of OpenStack in a way that can scale with the growth of your cloud infrastructure.
Furthermore, conventional monitoring solutions tend to alert operators on binary conditions such as “this process has failed”, “a service is not responding”, “CPU has crossed a 95% utilisation threshold”, “the root filesystem is nearly 100% full”, and so forth. A modern monitoring solution should instead answer questions like:
- Are my services running healthy, and if not, how degraded are they?
- Will my services continue to run healthy in the (near) future?
- What happened that caused my services to stop running healthy?
- What should I do to make my services run healthy again?
The LMA team believes that conventional monitoring solutions cannot readily answer those kinds of difficult questions, because answering them requires a deep understanding of how all the moving parts of the system work, and how those moving parts relate to one another to deliver a particular service. As such, a modern monitoring solution should behave more like an expert system that can make value judgements, using multiple criteria, about what is wrong in the system, which conditions require immediate attention (versus those that can be deferred to a ticketing system for offline analysis), and so forth.
To cope with that challenge, Mirantis has created the LMA (Logging, Monitoring, Alerting) Toolchain project, which consists of both the framework itself and a number of plugins that use that framework. The LMA toolchain is a collection of finely integrated, best-of-breed open-source applications that bring the operational visibility and insights you need to effectively operate your OpenStack infrastructure. To facilitate its deployment, the LMA toolchain is packaged and delivered as Fuel plugins that you can seamlessly deploy using the Fuel graphical user interface. The LMA toolchain is also designed from the ground up to be massively scalable, easily extensible, and easy to integrate with existing IT operations tooling.
The LMA toolchain architecture and components
The toolchain comprises seven key elements that are interconnected at the interface level, as shown in the figure below. Each element is supported by best-of-breed applications that specialize in handling the task at hand in the most effective way.
How the toolchain collects and processes logs, metrics, and OpenStack notifications efficiently
The operational data is collected from a variety of sources, including log files, collectd, and RabbitMQ (for OpenStack notifications). There is one Collector per monitored node; the Collector that runs on the active controller of the control plane is called the Aggregator because it performs additional aggregation and multivariate correlation functions to compute service healthiness metrics at the cluster level. An important function of the Collector is to sanitize and transform the raw operational data into an internal message representation that uses the Heka message structure. That structure is used to match, filter, and route particular categories of messages to their destinations for processing and persistence in the backend servers.
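To make the idea of the internal structured message more concrete, here is a minimal sketch of what such a message and a routing rule might look like. The field names follow the public Heka message schema (Timestamp, Logger, Type, Severity, Hostname, Payload, Fields); the sample values and the routing function are illustrative assumptions, not the LMA Collector’s actual code.

```python
# Illustrative sketch only: the shape of a Heka-style structured message and a
# toy matcher that routes it to a backend. Field names follow the public Heka
# message schema; values and routing logic are assumptions for illustration.
import time

sample_message = {
    "Timestamp": int(time.time() * 1e9),   # nanoseconds since epoch
    "Logger": "openstack.nova",            # origin of the message (assumed value)
    "Type": "log",                          # e.g. "log", "metric", "notification"
    "Severity": 3,                          # syslog-style severity (3 = error)
    "Hostname": "node-12",                 # node where the Collector runs (assumed)
    "Payload": "AMQP server on 10.0.0.5:5672 is unreachable",
    "Fields": {"programname": "nova-compute"},
}

def route(msg):
    """Toy matcher: pick a backend based on message Type (illustrative only)."""
    if msg["Type"] == "metric":
        return "influxdb"        # time-series backend
    if msg["Type"] in ("log", "notification"):
        return "elasticsearch"   # logs/notifications backend
    return "drop"

print(route(sample_message))     # -> elasticsearch
```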
Visualization is supported by the logs analytics and metrics analytics services of the LMA Toolchain, namely the Elasticsearch-Kibana and InfluxDB-Grafana services, respectively.
The analytics can be installed internally or externally to the OpenStack environment. We often refer to them as satellite clusters. When using Fuel plugins, the satellite clusters are installed within the OpenStack environment on separate nodes.
Note: Installing the LMA Toolchain satellite clusters on an OpenStack controller node, or collocated with other Fuel plugins, is neither recommended nor supported.
You can also connect custom satellite clusters to the LMA Collector, as long as Heka supports the protocol and data serialization format.
How the toolchain supports effective alerting and can integrate with alerting systems like Nagios
The LMA toolchain doesn’t impose yet another built-in alerting and escalation solution. Instead, we think it is more effective to make the LMA Collector interoperate with an alerting and escalation solution that is already widely used across the industry, such as Nagios.
To facilitate this integration, we have created an Infrastructure Alerting Plugin based on Nagios that you can deploy along with the other LMA Toolchain plugins. The Infrastructure Alerting Plugin is configured to receive passive checks from the LMA Aggregator. These checks are turned into email notifications when the availability status of a cluster entity changes state. A cluster entity can be a node cluster such as ‘compute’ or ‘storage’, a service cluster such as ‘nova’, or a more fine-grained service cluster such as ‘nova-api’ or ‘nova-scheduler’. Nagios can also be configured with an InfluxDB driver, making it possible to create alarms for metrics by querying the time series provided by the InfluxDB-Grafana Plugin, as shown in the figure below.
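As a rough illustration of this integration pattern, the sketch below queries a time series from InfluxDB over its HTTP API and submits the result to Nagios as a passive service check. The database name, measurement, thresholds, host/service names, and command-file path are assumptions for illustration, not the plugin’s actual configuration.

```python
# Minimal sketch, assuming an InfluxDB database named "lma" with a hypothetical
# "cpu_idle" measurement, and a Nagios instance whose external command file sits
# at a typical default path. Not the Infrastructure Alerting Plugin's real code.
import time
import requests

INFLUXDB_URL = "http://influxdb.example.local:8086/query"   # assumed endpoint
NAGIOS_CMD_FILE = "/var/lib/nagios3/rw/nagios.cmd"           # typical default path

def latest_value(db, query):
    """Return the most recent value of a single-field InfluxDB query."""
    resp = requests.get(INFLUXDB_URL, params={"db": db, "q": query})
    resp.raise_for_status()
    series = resp.json()["results"][0]["series"][0]
    return series["values"][-1][1]

def submit_passive_check(host, service, code, output):
    """Append a PROCESS_SERVICE_CHECK_RESULT external command for Nagios."""
    line = "[{0}] PROCESS_SERVICE_CHECK_RESULT;{1};{2};{3};{4}\n".format(
        int(time.time()), host, service, code, output)
    with open(NAGIOS_CMD_FILE, "a") as cmd_file:
        cmd_file.write(line)

# Example: raise a CRITICAL passive check when the 5-minute average of the
# (hypothetical) idle-CPU metric on the controller cluster drops below 10%.
idle = latest_value(
    "lma",
    "SELECT mean(value) FROM cpu_idle WHERE cluster='controller' "
    "AND time > now() - 5m")
state = 2 if idle < 10 else 0   # Nagios return codes: 0 = OK, 2 = CRITICAL
submit_passive_check("controller-cluster", "cpu-idle", state,
                     "cpu_idle={0:.1f}%".format(idle))
```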
What’s next: Scaling the LMA toolchain to process millions of logs and metrics
The next version of LMA (LMA 0.9) is going to target massive scaling and clustering. At the moment, the plan is to release LMA 0.9 concomitantly with Mirantis OpenStack 8.0 (though of course the schedule of either may change). Today, the performance testing we have done on a 200-node test harness shows that LMA 0.8 already provides acceptable performance and scales fairly well using the current point-to-point connection architecture. This is because LMA is scalable by design: monitoring is distributed across all the Collectors.
The objective for LMA 0.9 is to go an order of magnitude beyond that, and also to eliminate all single points of failure. As such, the LMA Toolchain Fuel plugins will enable deployment of both InfluxDB and Elasticsearch in highly available, scale-out clusters. In addition, the interconnect from the Collectors and the Aggregator to the storage clusters will be mediated through a Kafka message broker to avoid the congestion points (and thus loss of data) that would inevitably occur with point-to-point connections in deployments of thousands of nodes.
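To give a feel for what broker-mediated transport looks like in practice, here is a hypothetical sketch of a Collector-side producer publishing a serialized metric message to a Kafka topic, with storage-side consumers reading from the same topic. The broker address, topic name, and JSON serialization are assumptions for illustration and do not describe the actual LMA 0.9 design.

```python
# Hypothetical illustration of broker-mediated transport: publish a serialized
# message to a Kafka topic instead of writing point-to-point to the backends.
# Broker address, topic name and serialization format are assumed, not actual
# LMA 0.9 implementation details.
import json
from kafka import KafkaProducer   # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka.example.local:9092",
    value_serializer=lambda msg: json.dumps(msg).encode("utf-8"))

metric = {"Type": "metric", "name": "cpu_idle", "value": 42.1,
          "Hostname": "node-12", "Timestamp": 1455612345}

# Fire-and-forget publish; a consumer feeding InfluxDB or Elasticsearch would
# read from the same topic on the other side of the broker.
producer.send("lma-metrics", metric)
producer.flush()
```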
Thanks for this informative article.
It seems that LMA and Ceilometer have a similar architecture. Can you elaborate on the differences and similarities between Ceilometer and LMA? Are they competitors or collaborators?
Please guide.
Regards,
Vadan
Hi Vadan,
Thanks for your question. As a starting point, I would not claim that LMA and Ceilometer have a similar architecture; they share some conceptual patterns, but so do many telemetry and monitoring applications. They are in fact quite different, both in implementation and in the use cases they cover. LMA is a decentralised monitoring application featuring stream processing for in-flight alarm and anomaly detection, with cluster health status inferencing capabilities. It is an operational health and response monitoring solution for the OpenStack infrastructure, featuring low-latency, high-resolution monitoring. Ceilometer, on the other hand, is more of a high-latency, low-resolution solution, and so has typically been used for collecting and processing billing determinants for virtual instances. Those are completely different requirements and use cases. I would say that LMA and Ceilometer are more complementary than opposing. In fact, a future version of LMA will rely on the Compute Agent as an additional source of telemetry to ingest virtual instance measurements into the system. Hope this helps clarify the similarities and differences.
Thanks, Petit, for your elaborate response.
Based on your response, may I say that:
1) Ceilometer is a non-real-time or near-real-time event manager, whose primary objective is usage collection for billing purposes.
2) LMA is a real-time event manager based on data stream processing, whose primary objective is to provide real-time cloud infrastructure alerts to achieve carrier-grade reliability.
One of the major concerns in achieving five nines (i.e. carrier-grade low-latency and high-availability requirements) in telecom networks is the existing non-real-time data collection by cloud infrastructure managers (e.g. OpenStack). On one side we have non-real-time cloud events, and on the other side we have real-time application alarms from VNF managers/application NMS systems. Correlating both data sets (non-real-time and real-time) for end-to-end performance dashboards is quite complicated.
OPNFV’s Doctor project aims to address this gap (https://wiki.opnfv.org/doctor).
Yes, you are correct about points #1 and #2. That’s what I meant.
Thanks, Petit.
I feel LMA somehow resonates with Cisco IOS XR’s real-time Streaming Telemetry (http://www.cisco.com/c/en/us/td/docs/iosxr/Telemetry/Telemetry-Config-Guide/Telemetry-Config-Guide_chapter_01.htmln).
Big data technologies such as Spark Streaming are being considered as options to collect and process huge volumes of performance data.
We, along with AT&T’s cloud data collection team, are working on performance data models for the same.
Nice article. End-to-end lifecycle management is critical to managing a scaled-out OpenStack environment. Is there a reason why you decided not to use the OpenStack project Monasca for the system monitoring and notification pieces? It seems like HP, Cisco, and Fujitsu have adopted Monasca as the OpenStack monitoring service, so it seems odd to see Mirantis do something that competes with other community projects.
I think part of the answer stems from the fact that we wanted something composable, lightweight, and easy to deploy and maintain at scale. The initial technical choices were founded on the fact that log management and monitoring were must-haves right from the start, which Monasca didn’t have. Last but not least, we wanted in-flight and distributed stream processing with the ability to plug into virtually anything, in and out, out of the box. Heka, as the building-block component of the toolchain, provides us with all that. The next iteration of LMA (now called StackLight) will support an integration with the Ceilometer and Aodh APIs, which, as far as I know, is THE reference Telemetry API implementation of OpenStack.
Minor thing, I think your tables in the boxes “Time-series Analytics” and “Logs Analytics” are swapped.
Looks like you are right. I’ve noted it on the diagram as well.