StackLight – the Logging, Monitoring and Alerting (LMA) toolchain of Mirantis OpenStack

Patrick Petit - May 12, 2016

In a post from December 2015, I introduced the concepts and base building blocks underpinning the so-called Logging, Monitoring and Alerting (LMA) Toolchain of Mirantis OpenStack, now officially called StackLight. The purpose of this second post is to talk about what’s new in StackLight 0.9 (compatible with Mirantis OpenStack 8.0), which you can download from the Fuel Plugins Catalog.

The depth and breadth of the new features we have added in this release are quite significant, as outlined below. The main theme of these features is resiliency and scale.

Clustering of the backend servers for high availability and scale

There are still four Fuel plugins in the toolchain, but in StackLight 0.9, those plugins can be deployed on a cluster of nodes for high availability and scale. Note that the cluster of nodes can be made of physical machines or virtual machines using the Reduced Footprint feature of Fuel.

Setting up a StackLight cluster is actually quite simple because the plugins do all the heavy lifting for you, automatically and transparently. All you need to do is assign the StackLight Fuel plugin roles to nodes in your environment and deploy, as shown in Figure 1.


Figure 1: Assigning roles to StackLight servers

One improvement over StackLight 0.8 is that StackLight 0.9 is composed of hot-pluggable plugins, which means that you can deploy your StackLight cluster after you have deployed your OpenStack environment. Note, though, that installing the Collectors on the OpenStack nodes requires a configuration change and a restart of all the OpenStack services.

InfluxDB-Grafana Plugin highlights:

The InfluxDB-Grafana Plugin also has some additional new features, including:

  • Upgrade to InfluxDB 0.10.0 with clustering support (considered beta by InfluxData).
  • The TSM storage engine is advertised by InfluxData to sustain write load of more than 350K points per sec on a fast disk (ideally an SSD).
  • The InfluxDB cluster must have at least 3 meta nodes in order to form a Raft consensus.
  • Clustering is used for HA (not scale as all time-series are replicated in the cluster) for both InfluxDB and Grafana.
  • Added configurable retention period in the plugin settings (30 days by default).
  • Fuel plugin support for InfluxDB clustering includes:
    • Deployment of InfluxDB on one or three nodes. The deployment of InfluxDB on two nodes (for data replication) is technically possible but it is not recommended (nor supported) as there may be situations where the failover will not work properly.
    • The ability to add and remove nodes after deployment via the Fuel UI.
    • All nodes are both meta nodes and data nodes.
    • The time-series are synchronously replicated on all nodes.
    • The API endpoint VIP is managed by HAProxy and Pacemaker.
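The guidance against two-node deployments follows from the majority arithmetic behind Raft consensus. The sketch below is not taken from the plugin; the function names are illustrative, and it only shows how many node failures a cluster of a given size can tolerate while a majority remains reachable.

```python
def raft_quorum(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Nodes that can fail while a majority is still reachable."""
    return nodes - raft_quorum(nodes)


# Compare the cluster sizes discussed above (names are hypothetical).
for n in (1, 2, 3):
    print(f"nodes={n} quorum={raft_quorum(n)} "
          f"tolerated_failures={tolerated_failures(n)}")
```

A two-node cluster needs both nodes to reach a majority, so it tolerates no more failures than a single node does, whereas three nodes tolerate the loss of one.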

Elasticsearch-Kibana Plugin highlights:

New features in the Elasticsearch-Kibana plugin include:

  • Upgrade to Elasticsearch 1.7.4, bringing better resiliency, new features, security fixes, clustering stability and recovery improvements.
  • Clustering, used for both scale and HA for both Elasticsearch and Kibana.
  • The cluster must have at least three nodes to avoid split-brain issues.
  • Configurable retention period in the plugin settings (30 days by default).
  • Fuel plugin support for Elasticsearch clustering includes:
    • Cluster size up to five nodes
    • All nodes store data and can be elected master
    • Five shards per index type per day
    • Data is replicated on all nodes, but this is configurable in the plugin settings
    • The ability to add and remove nodes after deployment via Fuel UI
    • The API endpoint VIP is managed by HAProxy and Pacemaker

In addition to clustering support, StackLight 0.9 comes with a number of bug fixes that are detailed in the Release Notes, which are available in the plugin documentation, as well as several other new capabilities.

Logs monitoring

A high rate of errors in the logs is often an indication that something is going wrong and should be acted upon. The good news is that an unusual error rate in the logs can now be detected thanks to a new log_messages metric that contains a logging rate value per severity level and per service. As with any other metric, the log_messages metric can be added to an alarm rule that will fire an anomaly and fault detection (AFD) metric if the logging rate, for a given severity level, such as ‘ERROR’, exceeds a threshold.
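The idea of such a rule can be sketched as follows. This is not StackLight's alarm syntax or implementation; the class names, field names, the averaging window and the sample values are all assumptions made for illustration. The sketch simply averages the most recent log_messages rates for one service and severity level, and reports whether the average exceeds the threshold.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """One metric data point (hypothetical structure)."""
    metric: str
    service: str
    severity: str
    value: float  # logging rate, e.g. messages per second


@dataclass(frozen=True)
class AlarmRule:
    """Hypothetical threshold rule on a metric."""
    metric: str
    service: str
    severity: str
    threshold: float
    window: int  # number of most recent samples to average


def rule_fires(rule: AlarmRule, samples: list[Sample]) -> bool:
    """True if the average of the last `window` matching samples
    exceeds the threshold; False if there is not enough data yet."""
    if rule.window < 1:
        raise ValueError("window must be at least 1")
    values = [s.value for s in samples
              if (s.metric, s.service, s.severity)
              == (rule.metric, rule.service, rule.severity)]
    recent = values[-rule.window:]
    if len(recent) < rule.window:
        return False
    return mean(recent) > rule.threshold


# Made-up sample data for illustration only.
rule = AlarmRule("log_messages", "nova", "ERROR", threshold=1.0, window=3)
samples = [Sample("log_messages", "nova", "ERROR", v)
           for v in (0.1, 0.2, 2.5, 3.0, 4.1)]
print(rule_fires(rule, samples))
```

Averaging over a window rather than reacting to a single data point is one way to avoid firing on a short burst of errors; the right window and threshold depend on the service.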

Worker alarms per node

Prior to StackLight 0.9, it wasn’t possible to know on which node a particular OpenStack worker was down. This information is now captured and displayed in the Grafana dashboards for all the OpenStack core services.

Libvirt Instances Monitoring

StackLight 0.9 introduces the monitoring of libvirt instances. Ceilometer is not used at this stage, so the instance metrics are not tagged with Nova metadata such as the tenant ID. Instead, a new Hypervisor Grafana dashboard was created to visualize those metrics. In this dashboard, you can view the libvirt metrics by node name, instance ID, disk, and interface name.

How to get StackLight 0.9

To get started with StackLight 0.9, first deploy Mirantis OpenStack 8.0. From there, you can go to the Fuel Plugins Catalog and search for MOS 8.0 plugins in the MONITORING category, as shown below.



We’ve created a video overview and demo of StackLight. Check it out!
