Nova Scheduler-Database Interactions: How to Nail Those Scalability Thwarters
October 28, 2013
OpenStack is designed to be a versatile and scalable solution for building clouds. On the face of it, this sounds like a good definition. Everybody knows that the main reason for building clouds is high scalability. But it’s a generic description, and doesn’t quantify that scalability.
If you’re an OpenStack user, you might want to know how much you can scale OpenStack, or, in other words, when to expect it to quit scaling. On the other hand, if you happen to be a developer, you might want to know the same thing but for another reason–you may want to know how to push this boundary. So, you might need to have a tool that can measure the cloud performance as a whole and also target specific components.
While the development of such tools is already underway (for example, the promising Rally project), it may take some time for them to become available. But you can still find the sore spots that stymie scalability and overall performance, and you can do so without specialized testing frameworks. I will show you a technique that addresses one specific issue, along with the results of testing it out and a suggested path to solving the problems it reveals. While rather specific, this approach can serve as a base for further hotspot research.
What has haunted me while tweaking the Nova scheduler for some time now is the way the compute node states are handled. The idea behind scheduling in Nova is simple: Nova has to find a node with enough resources to run an instance, and the scheduler must know the current state of all of the available nodes to find such a node. Therefore, compute node states are essential to scheduling and must be stored in a safe place that is accessible to an arbitrary number of schedulers. Yes, there can be many schedulers; remember the versatility and scalability from the very first line. Right now, that place is a database, which concerns me. The current model allows you to have several tables with node-specific data, and on each scheduling request, these tables are obtained from the database. This works fine for small clouds, where these tables are relatively small, but it is important to figure out how cloud size affects database performance and the related scheduling lag.
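To make the scheduling idea concrete, here is a toy sketch of "find a node with enough resources." This is an illustrative simplification, not Nova's actual FilterScheduler code; the node fields and host names are made up:

```python
# Toy illustration of the scheduling idea: pick a node with enough
# free resources for the requested instance. Not Nova's real code.

def pick_node(nodes, req_ram_mb, req_disk_gb):
    """Return the hostname of the first node that fits the request, or None."""
    for node in nodes:
        if (node["free_ram_mb"] >= req_ram_mb
                and node["free_disk_gb"] >= req_disk_gb):
            return node["host"]
    return None

# The scheduler needs an up-to-date view of every node's state --
# this is exactly the data that currently lives in database tables.
nodes = [
    {"host": "compute-1", "free_ram_mb": 512,  "free_disk_gb": 10},
    {"host": "compute-2", "free_ram_mb": 8192, "free_disk_gb": 120},
]
print(pick_node(nodes, req_ram_mb=2048, req_disk_gb=20))  # compute-2
```

The catch, of course, is not the selection logic but where that `nodes` list comes from on every request.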
I decided to do a little research, focusing on one aspect of Nova scheduler’s behavior–database access. I wanted to know how long it would take to obtain all of the compute node-related data from a database because this delay would be present in all of the instance requests. To figure this out, I developed simple tools to measure the delay. I’ll share how I did it and what I found out.
In an ideal world, you would take a server farm with some tens of thousands of machines, deploy OpenStack on it, run it for several years, record the precise timings of responses under load, and draw conclusions based on this data. Be that as it may, we don't live in an ideal world and we have no means to perform real-life experiments with complex systems like neutron stars, tectonic plates, and huge OpenStack clouds. Whether we like it or not, we have to stick to synthetic experiments and simulations, even when it comes to OpenStack performance.
To carry out this experiment, I used a dedicated server equipped with a Xeon E5-2620 CPU, 32 GB of RAM, and a 256-GB SSD. Naturally, your mileage may vary, and these results should not be considered a universal truth, but they provide a reference point and allow for rough estimates made by comparing the processing power of your own server with the one used in this experiment. I used a test MySQL database and created three tables mimicking those used to store compute node data.
Having filled the database, I measured how long it took to collect the data from all of the tables. To do so, I issued a query selecting all of the records from these tables and timed the response.
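A minimal version of that measurement can be sketched as follows. I use SQLite here only as a self-contained stand-in for the MySQL setup described above, and the table layout (`compute_nodes` with a couple of resource columns) is a guess for illustration, not the actual schema:

```python
import sqlite3
import time

def time_full_select(num_rows):
    """Fill a toy compute-node table with num_rows records and time
    how long a full SELECT over it takes."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE compute_nodes (id INTEGER PRIMARY KEY, "
                 "host TEXT, free_ram_mb INTEGER, free_disk_gb INTEGER)")
    conn.executemany(
        "INSERT INTO compute_nodes (host, free_ram_mb, free_disk_gb) "
        "VALUES (?, ?, ?)",
        ((f"node-{i}", 8192, 100) for i in range(num_rows)))
    conn.commit()

    start = time.perf_counter()
    rows = conn.execute("SELECT * FROM compute_nodes").fetchall()
    elapsed = time.perf_counter() - start
    conn.close()
    return len(rows), elapsed

count, seconds = time_full_select(10_000)
print(count, seconds)
```

Repeating this for a range of table sizes gives the kind of data shown in Figure 1.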
Figure 1. Length of the database response as a function of table size.
So far so good: the database response time grows linearly with the table size. A simple fitting shows that the access time can be expressed as a linear function of the number of records.
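Such a fit takes one line with `numpy.polyfit`. The data points below are synthetic stand-ins for the measurements (the real coefficients depend on your hardware):

```python
import numpy as np

# Synthetic (table_size, response_time) points growing linearly,
# standing in for the measured data from Figure 1.
sizes = np.array([1000, 5000, 10000, 20000, 40000], dtype=float)
times = 2e-5 * sizes + 0.01   # pretend measurements, in seconds

# Degree-1 fit recovers the linear law t(N) = slope * N + intercept.
slope, intercept = np.polyfit(sizes, times, deg=1)
print(f"t(N) ~= {slope:.2e} * N + {intercept:.3f}")
```

If the fitted residuals stayed small at every size, a linear model would be the whole story; the interesting part is what happens once updates enter the picture.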
Simulating OpenStack database behavior
To make matters a little simpler, suppose that compute nodes and the scheduler are the only sources of database load. Thus, we only have to emulate compute nodes' requests for state updates and the scheduler's requests for all compute-node-related data. Consequently, we'll obtain lower-bound values; in a real cloud, the database response is likely to be even slower. Let's start with the load caused by updates.
In a healthy cloud, compute nodes send state updates at least once a minute. Therefore, the more nodes in a cloud, the more uniform the load distribution. In other words, it is safe to assume that every second a database will receive roughly N/60 update requests, where N is the number of compute nodes in the cloud.
A simple script spawns update processes until a desired database load (in update requests per second) is reached.
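The shape of such a load generator might look like the sketch below. This is not the script used in the experiment; a shared counter stands in for the actual `UPDATE` statements, and the period is made configurable so that the principle (one updater per emulated node, firing every 60 seconds) is easy to see:

```python
import threading
import time

def node_updater(stop, counter, lock, period_s=60.0):
    """Emulate one compute node: issue a state update every period_s
    seconds. A real script would run an UPDATE against the database;
    here we just bump a shared counter."""
    while not stop.is_set():
        with lock:
            counter[0] += 1      # stand-in for the UPDATE statement
        stop.wait(period_s)

def run_load(num_nodes, duration_s, period_s):
    """Spawn one updater per emulated node, run for duration_s seconds,
    and return the total number of updates issued."""
    stop = threading.Event()
    lock = threading.Lock()
    counter = [0]
    threads = [threading.Thread(target=node_updater,
                                args=(stop, counter, lock, period_s))
               for _ in range(num_nodes)]
    for t in threads:
        t.start()
    time.sleep(duration_s)
    stop.set()
    for t in threads:
        t.join()
    return counter[0]

# 50 emulated nodes, each "updating" every 0.1 s, for about half a second.
print(run_load(num_nodes=50, duration_s=0.5, period_s=0.1))
```

With `period_s=60` and N threads, this produces the N/60 updates per second discussed above.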
Figure 2. Length of the database response as a function of table size under constant update load.
Figure 2 shows the results of this experiment. At first glance, it is clear that a database experiencing constant table updates responds slower. The question that must be answered, however, is which law this growth follows. If it's some kind of linear law, then everything is just fine. Unfortunately, my setup didn't allow me to carry out experiments with bigger table sizes, but a simple fitting of the obtained data points yields a quadratic function of the number of nodes.
Are the results bad? If yes, then how bad is bad? Figure 3 answers both questions.
Figure 3. A fitted response function for a greater number of compute nodes. The red dot marks the point at which the tested setup crosses a one-minute delay.
When you have 10K nodes, you can still pretend that you have no database access issues considering that the scheduling time is only a few seconds long. When you have 20K nodes, you have to convince yourself that 13+ second delays for scheduling actually are nothing compared with the net profit of using OpenStack. But when you get closer to 49.7K nodes, you approach an important point: by now, you are definitely getting outdated compute node states because the state acquisition lasts more than a minute, and a minute is a default time step for sending state updates. Should the inability to get accurate compute node states be considered a problem? I think so.
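To see where such a one-minute boundary comes from, you can interpolate a quadratic through the figures quoted above and solve for the 60-second crossing. The data points below are my rough reading of the text (taking "a few seconds" at 10K nodes as 3 s and "13+" as 13 s), not the actual measurements, so this shows the method rather than the article's real fit:

```python
import numpy as np

# Rough (nodes, delay-in-seconds) readings from the discussion above.
nodes = np.array([10_000, 20_000, 49_700], dtype=float)
delay = np.array([3.0, 13.0, 60.0])

# A quadratic through three points is determined exactly.
a, b, c = np.polyfit(nodes, delay, deg=2)

# Solve a*n^2 + b*n + (c - 60) = 0 for the one-minute crossing.
roots = np.roots([a, b, c - 60.0])
crossing = max(r.real for r in roots if r.real > 0)
print(round(crossing))  # ~49700, by construction of the data points
```

With the real measured points, the same two lines of `polyfit`/`roots` give the red dot in Figure 3.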
Scared yet? Yes? No? Hold on, there is more. It is possible, at least in theory, to have several schedulers in your cloud. (Hey, isn't that the primary reason for doing all of this song and dance with a database instead of simply storing everything in the scheduler's own memory, where it eventually winds up anyway?) So one question naturally arises: what will happen with access time if several schedulers try to talk to the database at the same time?
To see what happens if there are multiple schedulers, I carried out the earlier experiment with one modification: this time, several processes issued data requests simultaneously, each emulating a separate scheduler.
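The concurrent-readers setup can be sketched as below. Again, SQLite serves as a self-contained stand-in for the MySQL database, and threads stand in for the separate scheduler processes; the point is several clients issuing full-table reads at the same time:

```python
import sqlite3
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

def make_db(path, num_rows):
    """Create and fill a toy compute-node table on disk."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE compute_nodes (id INTEGER PRIMARY KEY, "
                 "host TEXT, free_ram_mb INTEGER)")
    conn.executemany(
        "INSERT INTO compute_nodes (host, free_ram_mb) VALUES (?, ?)",
        ((f"node-{i}", 8192) for i in range(num_rows)))
    conn.commit()
    conn.close()

def full_read(path):
    """One 'scheduler' pulling every compute-node record."""
    conn = sqlite3.connect(path)
    rows = conn.execute("SELECT * FROM compute_nodes").fetchall()
    conn.close()
    return len(rows)

def concurrent_reads(path, num_schedulers):
    """Issue num_schedulers simultaneous full reads and time them."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_schedulers) as pool:
        counts = list(pool.map(full_read, [path] * num_schedulers))
    return counts, time.perf_counter() - start

with tempfile.NamedTemporaryFile(suffix=".db") as f:
    make_db(f.name, 5000)
    counts, elapsed = concurrent_reads(f.name, 3)
    print(counts, elapsed)
```

Sweeping `num_schedulers` and the table size in a harness like this is what produced the curves discussed next.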
It means that even with two schedulers, the response time still follows a quadratic law, only with larger coefficients than in the single-scheduler case.
Okay, maybe this is not that bad after all? Let’s see what we can get with nine schedulers!
It is still a quadratic function, but now it grows much faster. With nine schedulers, we hit the one-minute limit at only 33K nodes. I haven't performed further experiments, but judging from what I have seen, it's safe to assume that adding more schedulers will only worsen the situation. To put the results in perspective, I plotted all of the data in a single graph, as seen in Figure 4.
Figure 4. All of the measured data in one plot. The “no load” plot corresponds to the case where the database was static; all of the other cases assume that periodic updates are present.
A possible solution
So, now that we’ve seen the data, we have to find a way to deal with it. There are two options, as I see it.
One possible solution to this problem is to discredit this database test due to a lack of database tuning and the fact that no one will ever use OpenStack for more than <insert_your_estimate> nodes.
Another way to deal with it is to think about how this problem could be mitigated.
I prefer the second approach. I am convinced that the problem stems from the need to share compute node states among multiple schedulers and having to use a database for scheduler synchronization. Therefore, you could solve the problem by devising a model of storing and sharing data that does not require shoveling huge piles of data to and from a database. For instance, you could share incremental state updates via a queue or, even better, via a distributed hash table. I did some preliminary experiments, and it appears that memcached works well as a means of synchronization. You can see the changes needed to implement one approach to this solution here. I am planning to get hard data on a memcached-based state exchange soon, but according to my preliminary study, even in the worst case, when obtaining all of the data from memcached is required, it takes less than a second to do it for tens of thousands of nodes.
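The core of the key-value idea fits in a few lines. In the sketch below a plain dict class stands in for memcached (a real deployment would use, e.g., python-memcached's `set()`/`get_multi()` against a running memcached server); the key naming scheme is made up for illustration:

```python
# Sketch of sharing compute-node states via a key-value store instead
# of database tables. FakeMemcache is a stand-in for a real memcached
# client exposing the same set()/get_multi() calls.

class FakeMemcache:
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get_multi(self, keys):
        return {k: self._data[k] for k in keys if k in self._data}

store = FakeMemcache()

# Each compute node publishes only its own small state record...
for i in range(3):
    store.set(f"state:node-{i}", {"free_ram_mb": 4096 + i})

# ...and a scheduler fetches exactly the records it needs in one
# round trip, instead of pulling whole tables out of a database.
states = store.get_multi([f"state:node-{i}" for i in range(3)])
print(sorted(states))
```

Because each node only ever writes its own key, updates are incremental by construction, and the full-table transfer that dominates the measurements above disappears.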
It is also worth noting that an idea for a smarter scheduler has been floating around for some time, and here a smarter scheduler means obtaining more compute node characteristics more frequently. In light of these new trends, scheduler performance has become a major issue that must be addressed before it's too late and you're forced to shovel a gazillion lines of code to fix it.
Alexey Ovchinnikov has successfully built brain-computer interfaces for rodents and studied the behavior of chaotic electronic generators before the complexity of OpenStack stole away his attention. Now he devotes much of his time to making OpenStack even better than it is.