OpenStack Ceilometer, cloud performance, and hardware requirements

Whether you are considering options for billing for your OpenStack cloud services or collecting statistics from them, you will surely consider the various metering and monitoring solutions that have appeared in the ecosystem, such as Zabbix, StackTach, and Monasca. While any of these may work for you, depending on your specific requirements, only Ceilometer is offered by the OpenStack community and supported by engineers from a wide range of companies, including Mirantis, Red Hat, eNovance, Huawei, IBM, Intel, and others.

Ceilometer, known officially as the OpenStack Telemetry project, provides a single point of contact for billing systems and other applications by supporting queries for necessary information, such as average cloud load or details on various billable events.

Ceilometer was designed to collect data about events and cloud resource usage, such as how much data is being used, how many VMs are created, how much bandwidth is used, and more. Because Ceilometer itself interacts with the cloud, however, we decided to see whether running Ceilometer would affect cloud performance, and whether you need to consider special hardware requirements. In tests in two medium-sized labs, we examined the load Ceilometer put on the system and its effect on cloud performance.

In the remainder of this post, we review fundamentals of the capabilities that Ceilometer offers, examine the results obtained from the two lab scenarios, and provide details on the configurations and operations used in those tests.

Ceilometer basics

Ceilometer supports two types of data collection within an OpenStack cloud:

  • Billable event-related data. This type of data collection covers events such as "instance X was created" or "volume Z was deleted." This event-related information is collected via notifications, and the process does not significantly increase the load on the OpenStack cloud. Events that trigger the publication of notifications occur continuously, at random times, within the cloud infrastructure. The notification subsystem within Ceilometer (and the backend we used for it, MongoDB) has been optimized to accommodate them, so they do not add much load even when hundreds or thousands of VMs are being created simultaneously.

  • Data collected via polling. With this type of data collection, OpenStack services and infrastructure endpoints are subject to continuous, periodic polling, with the period determined by the polling interval. The polling interval itself is typically set to minutes or even seconds. To get the most complete and granular picture of what is going on in the cloud, we might want to set the polling interval as small as possible. The polling mechanism provides two ways of collecting information: (1) direct hypervisor or backend polling (as with most of the data collected from VMs), and (2) polling the APIs exposed by OpenStack services. (A conceptual sketch of a polling loop follows this list.)
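To make the polling model more concrete, here is a minimal, hypothetical sketch (in Python) of the shape of a compute-agent-style polling loop. This is not Ceilometer's actual code; `get_cpu_stats` and `publish` are placeholders standing in for hypervisor inspection and for handing samples off to the collector.

```python
import time

POLLING_INTERVAL = 60  # seconds; the knob discussed throughout this post


def get_cpu_stats(instance_id):
    """Placeholder for a hypervisor inspection call (e.g., via libvirt)."""
    return {"cpu_util": 0.0}


def publish(sample):
    """Placeholder for handing a sample off to the collector/storage backend."""
    print(sample)


def polling_loop(instance_ids):
    """Every cycle touches every tracked resource, so the load grows with
    both the number of resources and 1 / POLLING_INTERVAL."""
    while True:
        cycle_start = time.time()
        for instance_id in instance_ids:
            sample = {"resource_id": instance_id, "timestamp": cycle_start}
            sample.update(get_cpu_stats(instance_id))
            publish(sample)
        # Sleep for whatever is left of the interval before the next cycle.
        time.sleep(max(0, POLLING_INTERVAL - (time.time() - cycle_start)))
```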

Performance testing results summary

We performed Ceilometer benchmark tests and collected results primarily in the 20-node lab configuration. As expected, we found that the main load on the cloud (i.e., on the nodes running Ceilometer, MongoDB, and related controllers) resulted from polling. Our goal was to determine some guidelines for setting the polling interval to provide the greatest information granularity possible without imperiling overall system performance.

Polling load (and, for the most part, the overall Ceilometer load on the cloud) depends on two factors:

  • Number of resources from which metrics are collected. In our benchmark testing, we used VMs as units of measurement, and we tried 360, 1000, and 2000 VMs.

  • Polling interval. Generally speaking, the smaller the polling interval is, the bigger the load.

Together, these imply that for the purposes of our benchmark tests, we could use minimally configured VMs, since any given VM served merely as a unit for information collection. The VMs we created and polled were set up as single-CPU systems, each with 128MB of RAM.
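As a back-of-the-envelope illustration of how these two factors combine, the expected sample rate is roughly VMs × metrics-per-VM ÷ polling-interval. The sketch below uses 10 metrics per VM (the figure from our 20-node lab) purely for illustration; it is an approximation, not a measurement.

```python
def samples_per_second(num_vms, metrics_per_vm, polling_interval_s):
    """Rough estimate: every resource emits all of its metrics once per interval."""
    return num_vms * metrics_per_vm / float(polling_interval_s)


# Illustrative combinations matching the configurations discussed in this post.
for vms, interval in [(360, 5), (360, 30), (1000, 60), (2000, 60)]:
    rate = samples_per_second(vms, 10, interval)
    print("%4d VMs at %2ds interval -> ~%d samples/s" % (vms, interval, rate))
```

Real rates depend on how many meters each resource actually reports and on batching in the collectors, so treat these numbers as order-of-magnitude guidance only.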

Results and recommendations

This section summarizes some significant results and recommendations. (See the section "Lab configurations, testing processes, and data collected" for specifics of the data collected.)

Test results showed that the load from 2000 VMs with a 1-minute polling period is permissible for Ceilometer configured with MongoDB.

It's important to note two key points. First, the IO load in this case was too heavy to run the MongoDB instances on the cloud controllers (as we did); IOStat reported peak utilization close to 100% on the MongoDB devices. Second, a large number of data samples are written to the database: after only one day of running, the MongoDB cluster held 170 GB per device.

The IOStat utility output (Figure 1) illustrates a significant pattern in the results. The wide blue stripe close to the end of the chart's timeline is the period when MongoDB was dumping the indexes for the database data to disk. It's clear that during this time the IO utilization was about 100% for the simple SATA disks used on the controllers. This means that any other cloud management processes running on the controller get no disk access, resulting in multiple failures to perform as expected.

iostat-util-pollsters-2000-60.jpg

Figure 1. IOStat utility results

To avoid this problem, if you use 2000 VMs with a 1-minute polling interval (or a configuration with a similar or greater load), we recommend using separate nodes for the MongoDB instances running together as a replica set.
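As a sketch only (the host names are hypothetical, and starting each mongod with the matching --replSet name, storage, and security settings is out of scope here), initializing a three-member replica set on dedicated MongoDB nodes from pymongo could look roughly like this:

```python
from pymongo import MongoClient

# Hypothetical dedicated MongoDB hosts; the point is that they are not the controllers.
MEMBERS = ["mongo-01.example.local", "mongo-02.example.local", "mongo-03.example.local"]

client = MongoClient(MEMBERS[0], 27017)
client.admin.command("replSetInitiate", {
    "_id": "ceilometer",  # must match the --replSet value the mongod processes were started with
    "members": [{"_id": i, "host": "%s:27017" % host} for i, host in enumerate(MEMBERS)],
})
```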

If you are using 1000 VMs with 1-minute polling, the IO load is lighter. In this case, MongoDB isn't blocking other IO operations, and it works correctly alongside other services.

The configuration for such a case might keep the MongoDB cluster on the controllers, provided the hardware can store the amount of data needed. (See the section "Approximation of possible stored data volume after long Ceilometer usage.")

In the 9-node lab, we could reach a 5-second polling interval while collecting information from 360 VMs. We wanted to find out what the useful minimum polling interval would be in this case. After it turned out that a 5-second polling period was acceptable, we tried a 1-second polling interval. Not surprisingly, that crashed the cloud. It was not Ceilometer that failed, but Nova. Ceilometer, however, was the reason Nova failed: when polling Nova-related metrics, data must be collected from the hypervisor itself, by compute agents polling the hypervisors on the compute nodes, as well as from the Nova API. Processing the equivalent of a nova list on every polling cycle caused the Python REST API application to fail.
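To illustrate why the pressure lands on Nova, here is a hedged sketch of the per-cycle work using python-novaclient; the credentials and endpoint are placeholders, and the real agent does considerably more than this. The point is that every cycle issues the equivalent of nova list against Nova's Python REST API, so at a 1-second interval the API never gets a chance to recover between cycles.

```python
import time
from novaclient import client as nova_client

# Placeholder credentials and endpoint; a real agent reads these from its service config.
nova = nova_client.Client("2", "ceilometer", "PASSWORD", "services",
                          "http://keystone.example.local:5000/v2.0")

POLLING_INTERVAL = 1  # the interval that proved too aggressive in the 9-node lab

while True:
    cycle_start = time.time()
    servers = nova.servers.list()  # the per-cycle "nova list" against the Nova API
    # ... per-instance hypervisor inspection would happen here ...
    elapsed = time.time() - cycle_start
    # If listing alone takes close to (or longer than) the interval, requests pile up
    # on the Nova API until it stops responding.
    time.sleep(max(0, POLLING_INTERVAL - elapsed))
```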

In the 20-node lab, we were not able to test whether such a low polling interval would also crash the cloud.

Lab configurations, testing processes, and data collected

This section describes the 9-node and 20-node lab configurations, the processes we used, and the data collected in our tests.

Testing intervals in the 9-node lab

Testing in the 9-node lab included tests with 5-second and 30-second polling intervals.

Configuration of the 9-node lab

The 9-node lab was configured as follows:

  • 3 controllers and 6 compute nodes

  • Controller hardware was 8 CPUs, 16GB RAM, and a 0.4TB HDD

  • 360 VMs in an active state

  • The Ceilometer compute agent used 5-second and 30-second polling periods, respectively

  • MongoDB's replica set consisted of 3 MongoDB nodes running on the controllers, with a replication factor of 2

  • On every controller, 2 Ceilometer collector instances were running (6 collectors in all)

  • For the lab installation, we used Mirantis OpenStack 5.0, which includes Ceilometer Icehouse, 2014.1.1.

The lab configuration diagram is presented below:

Figure 2. Configuration diagram for the 9-node lab

Let’s start with the results for the smallest successful polling interval, 5 seconds.

Figure 3 shows MongoDB writes per second during the tests. The peak writing load is approximately 830 sample writes per second, with an average value of fewer than 300 samples per second.

mongo-writes-pollsters-360-5s.jpg

Figure 3. MongoDB writes per second

Figure 4 illustrates CPU loading by MongoDB. Keep in mind that each controller in the lab has 8 CPUs, and percentages are counted against a single CPU. That means that each controller has a maximum capacity of 800%, and when we say that the CPU load is 35%, we mean 35% of a single CPU.

mongo-cpu-pollsters-360-5s-one-cpu.jpg

Figure 4. MongoDB CPU loading (one CPU)

Figure 5 illustrates MongoDB's load measured across all 8 available CPUs. The load averages the same 35% (out of 800%) as before. At peak load, MongoDB uses only 75% of one CPU.

mongo-cpu-pollsters-360-5s-total.jpg

Figure 5. MongoDB CPU loading

Testing results for the 30-second polling interval in the 9-node lab

We collected the same statistics for the 30-second polling intervals as we collected for the 5-second polling intervals.

With 30-second polling intervals, the peak load of MongoDB sample writes was about 500 writes per second (w/s), with average peaks of approximately 400 w/s (see Figure 6). More importantly, with 30-second polling intervals, MongoDB was processing nothing half of the time, as the load was too small to keep it busy. We also observed that for both the event notification and polling methods, a noticeable load on Ceilometer appeared only when VMs were created in bulk at the start; when individual VMs were created, the notification method barely registered, and only the polling method created any significant load. After creation, the notification method produced essentially zero load.

We know it's a bit confusing to say 25% out of 800%, which may make you think that we mean 200% (800 × 0.25) when we actually do literally mean 25%. We're using this notation because that's what the tools use. Think of "percent" as "unit" and it may be a little clearer.

Figure 6. MongoDB writes per second with 30-second polling interval

The average MongoDB CPU load was about 25% of one CPU on every controller, which means that for each controller in the lab, the load was 25% out of the available 800% load. (See Figure 7 and Figure 8.)

Figure 7. MongoDB CPU loading (one CPU)

mongo-cpu-pollsters-360-30s-total.jpg

Figure 8. MongoDB CPU loading (eight CPUs)

The results of the 30-second polling interval testing indicate that when this amount of data is collected via polling, Ceilometer produces a load that does not have a negative impact on the cloud hardware.

The predicted size of the database for 360 VMs

Figure 9 shows the MongoDB storage size for the different polling intervals — 1 minute, 30 seconds, and 5 seconds — for collecting samples from 360 resources.

Figure 9. Predicted MongoDB storage size

Testing in the 20-node lab

Testing in the 20-node lab included tests with 1000 and 2000 VMs at a 60-second (1-minute) polling interval.

Configuration of the 20-node lab

The 20-node lab was configured as follows:

  • 3 controllers, 14 compute nodes and 3 Compute/Ceph nodes

  • Controller hardware was 12 CPUs, 32GB RAM, and a 1TB HDD

  • 1000 and 2000 VMs in an active state

  • The Ceilometer compute agent used a 60-second polling period, providing a set of “control” results to test against

  • Each VM produced 10 metrics at a time

  • MongoDB’s replica set consisted of 3 MongoDB nodes running on controllers with a replication factor of 2

  • 2 Ceilometer collector instances were running at every controller (6 collectors in total)

  • We used Mirantis OpenStack 5.1 (Ceilometer Icehouse, 2014.1.1) to install the lab

Testing results for 1000 VMs

We measured the number of writes per second to the MongoDB meter collection (see Figure 10).

mongo-writes-pollsters-1000-60s.jpg

Figure 10. MongoDB writes per second

Peaks of more than 900 writes per second occur because 6 Ceilometer collectors were running. The zero values shown in the figure are the periods between executions of the polling tasks (similar to what we saw with 360 VMs and 30-second intervals). These zero values indicate that a 1-minute polling interval for 1000 VMs is appropriate and is easily processed by the combination of Ceilometer and MongoDB.

On average, MongoDB writes 143 samples per second when collecting metrics from 1000 resources at a 1-minute polling interval. Each data sample is ~1.2KB, so in this case MongoDB writes ~170 KB/s to the samples table.
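The byte rate here is just the measured sample rate multiplied by the ~1.2KB sample size; the same arithmetic applies to the 2000-VM case later. A trivial sketch:

```python
SAMPLE_SIZE_KB = 1.2  # approximate size of one stored sample, as measured above


def write_rate_kb_per_s(samples_per_second):
    return samples_per_second * SAMPLE_SIZE_KB


print(write_rate_kb_per_s(143))  # ~172 KB/s for 1000 VMs at a 1-minute interval
print(write_rate_kb_per_s(367))  # ~440 KB/s for 2000 VMs, close to the ~400 KB/s cited below
```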

The CPU loading charts in Figures 11 and 12 illustrate CPU loading by the MongoDB and Ceilometer agent processes on every controller.

mongo-cpu-pollsters-1000-60.jpg

Figure 11. MongoDB CPU loading 1000 VMs, 60-second polling intervals

The peaks reach between 200% and 250% of the available 1200% per controller, although the average value is about 40% to 60% of one CPU. These peaks (and their frequency and duration) suggest that we could poll the same number of resources three times as often (every 20 seconds).

ceilometer-cpu-pollsters-1000-60.jpg

Figure 12. Ceilometer services CPU load

The combined load that Ceilometer and MongoDB put on the controller is illustrated below (see Figure 14). Note the peaks of up to 350%.

all-ceilometer-cpu-pollsters-1000-60.jpg

all-ceilometer-cpu-pollsters-total-1000-60.jpg

Figure 14. Ceilometer services and MongoDB CPU loading

The most interesting part is the IO load at the controller level:

iostat-util-pollsters-1000-60.jpg
Figure 15. IOStat utility, 1000 VMs

The IOStat utility reports IO device bandwidth utilization as the percentage of CPU time during which I/O requests were issued to the device. Device saturation occurs when this value is close to 100%. For 1000 VMs, we see no saturation, as shown in Figure 15.
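If you want to watch the same saturation signal programmatically rather than reading iostat output, the figure can be approximated from the kernel's per-disk busy time. The sketch below uses psutil on Linux; the device name sda is an assumption you would adjust for your controllers.

```python
import time

import psutil

DEVICE = "sda"   # assumed device name; pick the disk backing MongoDB's data directory
INTERVAL = 5.0   # seconds per sampling window

prev_busy_ms = psutil.disk_io_counters(perdisk=True)[DEVICE].busy_time
while True:
    time.sleep(INTERVAL)
    busy_ms = psutil.disk_io_counters(perdisk=True)[DEVICE].busy_time
    # Fraction of the window the device spent servicing requests (the same idea as iostat %util).
    util = (busy_ms - prev_busy_ms) / (INTERVAL * 1000.0) * 100.0
    print("util over last %.0fs: %.1f%%" % (INTERVAL, util))
    prev_busy_ms = busy_ms
```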

Test results for 2000 VMs

We also measured the number of writes per second to the MongoDB meter collection for 2000 VMs (see Figure 16).

mongo-writes-pollsters-2000-60s.jpg
Figure 16. MongoDB writes per second (2000 VMs)

The average number of writes per second is 367 w/s; once again, each sample is ~1.2KB, so MongoDB writes roughly 400 KB/s to the samples table. Figure 16 shows that even with 2000 resources to sample, there are periods during which MongoDB is doing nothing, so MongoDB could be loaded more.

We show the CPU loading results for MongoDB and the Ceilometer services separately (see both charts in Figure 17).

mongo-cpu-pollsters-2000-60.jpg

ceilometer-cpu-pollsters-2000-60.jpg

Figure 17. MongoDB and Ceilometer services CPUs

Both Ceilometer and MongoDB produce a 400% CPU load at maximum (with an average of about 170%).

all-ceilometer-cpu-pollsters-2000-60.jpg

Figure 18. Ceilometer and MongoDB CPU

Figure 19 shows results similar to those in Figure 18, but measured against all available CPUs on the controllers.

all-ceilometer-cpu-pollsters-total-2000-60.jpg

Figure 19. Ceilometer services and MongoDB CPU loading

IOStat utility results for 2000 VMs are illustrated in Figure 20.

iostat-util-pollsters-2000-60.jpg

Figure 20. IOStat utility, 2000 VMs

The results for the 2000 VM test are the most important. The peaks of the IO operations reach approximately 100% when MongoDB and Ceilometer are working to process 2000 VMs at 1-minute polling intervals. That was close to device saturation in the 20-node lab. In most cases this will work but, as you can see, the wide blue stripe in Figure 20, between 1474 and 1614 seconds into the test run, indicates the time during which MongoDB was dumping indexes to the disk. During this time, the system is close to disk saturation. Thus, 1-minute polling for 2000 resources is acceptable only when MongoDB has separate nodes to run on, rather than the controllers. Outside of a lab situation, the disk might be needed by other processes, which means that MongoDB data might become corrupted or that other processes won't get the disk access they need. A possible solution might involve smarter, faster storage for the nodes on which the MongoDB instances are running.

Approximation of possible stored data volume after long Ceilometer usage

Let’s look at the predicted size for MongoDB storage in the case of 1000 and 2000 VMs for different periods.

For 1000 VMs with 25 metrics per VM, MongoDB writes 300 samples per second, so:

| Period | Samples | Meter collection size, TB | Meter collection + indexes, TB | Min bound, TB | Max bound, TB |
|--------|---------|---------------------------|--------------------------------|---------------|---------------|
| Day    | 48,988,800 | 0.08 | 0.11 | 0.17 | 0.22 |
| Month  | 1,469,664,000 | 2.50 | 3.33 | 5.00 | 6.66 |
| Year   | 17,635,968,000 | 29.99 | 39.98 | 59.97 | 79.97 |

Table 1. Predicted size of MongoDB storage, 1000 VMs

For 2000 VMs with 25 metrics per VM, MongoDB writes 600 samples per second, so:

| Period | Samples | Meter collection size, TB | Meter collection + indexes, TB | Min bound, TB | Max bound, TB |
|--------|---------|---------------------------|--------------------------------|---------------|---------------|
| Day    | 97,977,600 | 0.17 | 0.22 | 0.33 | 0.44 |
| Month  | 2,939,328,000 | 5.00 | 6.66 | 10.00 | 13.33 |
| Year   | 35,271,936,000 | 59.99 | 79.97 | 119.95 | 159.93 |

Table 2. Predicted size of MongoDB storage, 2000 VMs
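The tables above come from the lab's own measurements. If you want to produce a similar projection for your own deployment, a naive linear extrapolation from a measured sample rate works well enough; the per-sample size and index overhead below are assumptions chosen only so the output lands near Table 2, since the exact per-sample on-disk cost behind the tables is not spelled out here.

```python
def projected_storage_tb(samples_per_day, days, kb_per_sample=1.7, index_overhead=1.33):
    """Naive linear projection of MongoDB meter-collection growth.

    kb_per_sample and index_overhead are illustrative assumptions, not values
    taken from the measurements above.
    """
    meter_tb = samples_per_day * days * kb_per_sample * 1000 / 1e12  # KB -> decimal TB
    return meter_tb, meter_tb * index_overhead


# 2000 VMs: roughly the 97,977,600 samples per day shown in Table 2.
for label, days in [("day", 1), ("month", 30), ("year", 360)]:
    meter, with_indexes = projected_storage_tb(97977600, days)
    print("%-5s  meter: %6.2f TB   with indexes: %6.2f TB" % (label, meter, with_indexes))
```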

In closing…

I hope this article will help people working on OpenStack installations estimate the cloud hardware and topology they need for Ceilometer usage. All of these results refer to Mirantis OpenStack 5.x (the OpenStack Icehouse release). We'll have Juno results shortly, so stay tuned!
