Mirantis OpenStack in the real world: Building a scalability test lab
One of OpenStack's strengths is in its ability to provide scalability, but when we're talking about running in production, it's important to be certain as to how it's going to perform under various circumstances. So in 2012, when Mirantis developers noticed wide variations between test results for Mirantis OpenStack when deployed in virtual environments and actual physical deployments, we knew we had a problem.
To make sure both we and our customers would know how things really worked, we knew we needed physical test environments, and eventually a scalability test lab that provides a true reflection of the production environment where you deploy Mirantis OpenStack, ensuring that it really does scale in your enterprise environment -- before you go live with it.
The rather lofty mission of building a test scale lab has some pretty humble beginnings, starting with Mirantis engineers using a mish-mash of geographically-dispersed internal hardware test labs to see what live deployments of our own projects and initiatives looked like. As we grew, we had to consolidate and scale testing efforts, and we had to make some big decisions about how to do it.
Many of those decisions are the same ones you'll have to make when planning your own OpenStack deployments, so we wanted to share with you the challenges we ran into, the decisions we made, and how we made those decisions work to build a scalability test lab that can really tell you how your Mirantis OpenStack deployment will work in the physical environment.
We begin at the beginning - in our IT department.
VMs don’t cut it, and buying servers doesn’t scale
It was 2012, and Mirantis OpenStack developers were getting frustrated with test results using VMs on laptops. They knew that customers would ultimately deploy on real live hardware, and while virtual tests were great for some things, they weren’t providing an accurate picture of what customers really needed to know: would OpenStack scale in the real world?
The Mirantis OpenStack testers relayed the problem to Yury Koldobanov, head of Mirantis IT, who was getting additional feedback from other technical teams that testing products on VMs from a laptop didn’t paint an accurate picture of deployed software performance. “Though the product being tested would be used in a virtual environment, it would be deployed on physical hardware, so we needed to provide a real picture of deployed performance when testing,” Koldobanov explained. As a result, he started getting a lot of internal requests for dedicated servers to test company products and verify code.
Mirantis development teams began creating small physical test environments, ordering servers for individual projects, and configuring them in disparate corporate locations based on specific needs. But buying servers was only a stop-gap solution for the need to create test environments that would need to scale with growth. And with only a basic internet connection and an ordinary electrical network, Koldobanov and his team knew Mirantis didn’t have the infrastructure in place to host a scalability test lab.
So, while creating a configuration for a “starter” lab wouldn’t be especially complicated, IT still faced the danger of losing power for servers, which would then fail immediately. Not to mention that buying servers for testing was getting pricey, and as a software company, Mirantis wanted to maintain our focus on building software products, not shift our expertise to owning and maintaining a data center.
A strategic decision
The mounting bills for test servers kept hitting the desk of Oleg Goldman, Mirantis senior vice president of operations. “I could see the ad hoc purchases weren’t a viable long-term solution,” he said. “Continuing with even a small data center would have been a wildly expensive proposition for us when I considered the many infrastructure requirements for electricity, air conditioning, back-up systems, security, fire mitigation, and support, not to mention staff management,” Goldman continued.
He had to choose between complete control over a very costly internal data center or the uncertainty of a more affordable but risky solution with a host vendor. He made the strategic decision to outsource the hardware server labs, asking Koldobanov to find an external vendor to host the data center and give Mirantis a strategic, economically viable answer for test lab scalability moving forward. “It was up to Yury (Koldobanov) how to execute against the directive,” Goldman said. Koldobanov started working on the request at the end of 2012.
Evaluating scalability test lab options
Orders in hand, Koldobanov was now confronted with the challenges of outsourcing the data center, which included:
Difficulty finding a contractor that would lease and maintain servers to meet scalability requirements and wire the lab appropriately, while also supporting customized installations and switch control. Most data centers lease only servers in standard configurations with a standard network connection.
Geographically distant data centers. However, with Mirantis being located in various global locations, this detail was less important than it might have been to a business with only one location; wherever a data center was located, it was going to be far away from somebody on the team.
Slower speed of server administration. Using offsite data centers would prevent Mirantis from personally inspecting the test lab configuration, and require close communication with the data center, whose staff must be technically qualified to answer questions and solve problems.
He got to work.
An architect’s perspective
While investigating test scalability solutions, Koldobanov consulted Mirantis principal engineer Aleksandr Shaposhnikov, a member of the OpenStack Neutron networking project, who had experience building a physical test environment. “I knew he had first-hand knowledge of developing test plans run in physical environments,” said Koldobanov.
While working on Neutron, Shaposhnikov had successfully devised a network testing plan with twenty Neutron nodes and agents. With the efforts of the entire project team, the 20-node test plan proved stable, and Shaposhnikov moved on to create similar testing for OpenStack and its various deployments. He now turned his expertise to evaluating infrastructure needs for Mirantis’ test lab with Koldobanov.
“When I helped to establish testing for OpenStack, I developed linear formulas to calculate loads and basic OpenStack customer scenarios,” Shaposhnikov related. “This was extremely helpful in my work at Mirantis, and I began to evaluate equipment needs and a budget to get Mirantis OpenStack 6.0 running on first a 20-node, then 100-node cluster with a defined load,” he said.
Finding the data center that could… and would
Meanwhile, Koldobanov had evaluated a number of outside companies interested in providing a data center, working with them to build small test labs that could handle the relatively small load generated by Mirantis teams for internal product testing, and placing his first orders in January 2013. Just as important as the different companies’ ability to host the data center, Koldobanov was evaluating their willingness and ability to scale the test lab and give Mirantis enough control over the environment setup, which would require close collaboration and frequent modifications -- not something data centers are historically known to provide.
After working on small-scale projects with several candidates, Koldobanov decided to go with Host-Telecom in the Czech Republic for a proof-of-concept test lab in the summer of 2014. A young, progressive company, Host-Telecom accommodated Mirantis’ various infrastructure customizations and completed a 20-node lab in August 2014.
In that hurried timeframe when evaluating the best company to proceed with, the emphasis was on Host-Telecom’s flexibility in collaborating with Mirantis to build the 20-node lab, a factor highly in its favor. The idea of creating a lab for specific standards that would analyze higher magnitude testing, such as the performance of one Linux OS versus another side by side in the same deployment, wasn’t yet on anyone's radar; the existing test environment wouldn’t support such a deployment. Right now, the only question on everyone's mind was whether Host-Telecom could continue to scale at the rate Mirantis demanded, when it demanded.
Getting from 20 to 100 nodes
The data center vendor decision made, Koldobanov now had to execute with Host-Telecom and create a physical lab capable of testing customers’ cloud deployments, with input from Mirantis sales saying that a 100-node deployment would be a good size for testing Mirantis OpenStack cloud deployments for businesses such as small banks.
But even with a 20-node lab established, scaling to a 100-node lab was proving to be an intense challenge. The test lab architecture had to be sound, and implementing the setup and wiring in the data center had to be impeccable to ensure Mirantis was testing the exact environment the customer would be using. Getting the latter done remotely with contractors required time, good communication, and knowledgeable, cooperative staff in the data center. QA’s role was also vitally important as they tested the deployments, ensuring stable behavior under all kinds of conditions.
It was rough, and Mirantis and Host-Telecom had to machete their way through undefined communication channels and technical issues to set up a lab that provided a true picture of a customer’s Mirantis OpenStack cloud deployment.
Working with the data center and making them like it
Partnering required tight coordination between the geographically dispersed Host-Telecom team and the Mirantis team, who couldn’t get their hands on the hardware. And like Koldobanov and Shaposhnikov’s teams at Mirantis, Pavel Chernobrov, director at Host-Telecom, and his data center team also had to stretch to deliver against requirements for the first expansion to a 100-node lab.
At Mirantis, Koldobanov and Shaposhnikov were urgently pushing the project forward right after establishing the first 20-node lab in August -- when much of Europe was on vacation. Sourcing equipment was difficult, and Chernobrov needed a healthy number of qualified engineers to wire the required servers at a greatly increased scope. Chernobrov also had to acquire an even greater reserve of hardware to be able to grow as Mirantis expanded scalability testing.
Even with hardware in hand, the circuitry to stretch to 100 nodes was complex, requiring much more effort to create workable testing schemes. “When we began it was no big deal starting with 20 nodes, but in that initial timeframe, we didn’t create a formal, standards-based testing lab. That was a mistake,” Koldobanov explained.
QA feels the pain and the team responds
No one understood the challenges of expanding the test lab better than Sergey Galkin, Mirantis Senior Quality Assurance Engineer. “When we increased the lab from 20 to 100 nodes, no one’s responsibilities were documented. We had no established tool for communication with our IT specialists, the different teams involved, or with Host-Telecom’s technical people,” he noted. “No one was setting up internal meetings for the different groups within Mirantis, let alone with the Host-Telecom people, and no one really knew who should be in charge of that because we had begun the whole test lab from the grass-roots level.” In addition, Galkin said different team members interpreted the existing sparse project documentation differently. Misunderstanding caused the team to lose time and drained the budget.
With communication issues causing chaos, architect Shaposhnikov saw his carefully constructed test scalability lab plans in serious danger of never being realized, and that was unacceptable. Shaposhnikov took full control of all operations in the data center and gave clear instructions to Host-Telecom’s staff and to the Mirantis IT team. He identified roles and got the whole team using Skype to communicate and troubleshoot, allowing them to move forward more quickly.
Documentation - It’s not just words on paper
The team got fastidious about documentation, compiling tables with key data such as connections and addresses of all lab equipment. “If you want to build a scalable test lab, the documentation absolutely takes time and effort, but it really simplifies the building process,” Galkin said. “The docs were also a huge help in preparing the lab quickly. With everything outlined, the engineers were able to set up a 100-node test lab in a week, where before getting up and running had taken two and a half weeks.”
Reaching for daylight
In addition to the need for clear documentation, role definition, and defined communication processes, Galkin also identified serious technical challenges that the team faced as they expanded the lab:
The first 100-node lab lacked automation for test deployments, so the team had to invest a lot of time automating every lab process from beginning to end of a test scheme. And as noted, the poor communication between IT and engineering hindered testing at the beginning of the project.
Time to complete testing was long. When something failed, the team had to troubleshoot and then start again from the beginning, so getting results took hours.
Better communication remedied some of the issues, as did test automation efforts. Shaposhnikov also stepped in to create a set of tools to test and verify infrastructure, connectivity, and other areas to ensure proper MOS deployment. He worked with Galkin and the QA team on tools that enabled deployment and configuration of hundreds of servers and dozens of switches, as well as VLANs. In addition, Galkin used custom test tools to contact each server through another interface to investigate issues and file bugs. He directly consulted relevant team members to solve the identified problems.
After an arduous effort to get to a 100-node lab, the team now had the experience, defined roles, documentation, and processes to proceed. But some wrinkles remained.
Expanding to a 200-node lab, 300-node lab, and beyond
After the growing pains of getting to 100 nodes, Koldobanov changed his approach to building the environment when Mirantis increased the scalability test lab to 200 nodes in November 2014. “Right after we ordered resources from Host-Telecom, we discussed logical and physical testing schemes with our lead architect Aleksandr (Shaposhnikov) and the engineers, and we created testing schemes,” Koldobanov says.
Shaposhnikov added, “I worked closely with Yury (Koldobanov) and gave our IT department my requirements so they could produce the initial switch configuration.” The group then began to work with the technical staff at Host-Telecom, ensuring everyone understood the schemes and had an assigned role. Getting the working arrangement straightened out took time and caused some heartburn, but as with the first expansion, everyone worked through it, and getting to 200 nodes was not nearly as labor- and time-intensive as going from 20 to 100 nodes.
With the increase to 200 nodes, Galkin saw great improvements in QA testing. At 200 nodes, he was able to divide the scalability test lab into different components and dedicate 100 nodes to automated testing using tools such as Rally.
In addition, Galkin and his team could see differences in how environments were configured; for example, perhaps one product worked well on 100 nodes in Ubuntu, but had problems on 100 nodes in CentOS. The QA team also took advantage of the additional 100 nodes to test the backlog of other release cycle tasks that the smaller scalability test labs hadn't been able to handle.
Evaluating Mirantis OpenStack performance in enterprise environments -- for real
The jump to a 300-node scalability test lab was fast on the heels of the jump to 200 nodes. With processes established over two expansions, increasing to 300 nodes proved to be a much faster setup than going from 20 to 100 nodes, or even from 100 to 200.
With the move to a 300-node lab, Mirantis can now create an exact physical model of an enterprise environment, so users know before deploying that the cloud is going to work. One node can approximate up to 60 VMs, based on RAM size, giving Mirantis the functionality of about 18,000 VMs in the scalability test lab.
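The capacity figure follows from simple multiplication, and the same arithmetic can be reused for your own lab sizing. The sketch below is illustrative only: the function name and constants are hypothetical, and the 60-VMs-per-node figure is the upper bound quoted above, which depends on the RAM in each node.

```python
# Hypothetical helper for illustration; not part of any Mirantis tooling.
def estimated_vm_capacity(nodes: int, vms_per_node: int) -> int:
    """Return the approximate number of VMs a lab of this size can model."""
    if nodes < 0 or vms_per_node < 0:
        raise ValueError("nodes and vms_per_node must be non-negative")
    return nodes * vms_per_node


# Figures quoted in the article: 300 nodes, up to 60 VMs per node.
LAB_NODES = 300
MAX_VMS_PER_NODE = 60

capacity = estimated_vm_capacity(LAB_NODES, MAX_VMS_PER_NODE)
print(f"{LAB_NODES} nodes x {MAX_VMS_PER_NODE} VMs/node = {capacity:,} VMs")
```

Because the per-node figure is a ceiling tied to RAM size, treat the result as an upper estimate rather than a guaranteed capacity.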
Engineers were now able to certify that Mirantis OpenStack works on large deployments "out of the box", and partnering with a knowledgeable, flexible data center such as Host-Telecom was key to making the entire effort work.
In the end...
Under the influence of cloud computing and rapidly changing needs from customers such as Mirantis, hosting providers are changing their sales model. Until recently, the data center provisioned hosting service hardware at the exact level a customer initially requested. If a client needed more dedicated servers, the data center would add only what was necessary for that request, and so on.
Data centers such as Host-Telecom now tend to anticipate customer needs, buying a surplus of servers for a test cluster and selling additional power at customer request. This new process works well for customers such as Mirantis, which can add resources to its scalability test lab quickly by working with a data center whose perspective is that, “We are ready at any time to add resources to your cloud ‘on the fly.’”
Host-Telecom wasn't the only organization changed as a result of this process. The Mirantis Services organization had always provided architectural and engineering experts to enterprises deploying OpenStack, and they passed as much of that knowledge as possible back to the engineers building the actual product. But there's just something about doing it yourself; the engineers building Mirantis OpenStack gained a profound understanding of the issues customers are trying to solve with OpenStack. Customers need to know that products work the way they think they will in a physical environment, and not just in a fabricated virtualized one.
And now, when a bank, telco, or another enterprise says, "But will Mirantis OpenStack work?" they can look at the lab, and how it grew from humble origins as a 20-node internal product test lab to being able to accommodate small and medium-sized business environments, and finally the enterprise, and say, "Absolutely."