Neutron versus Chaos Monkey: Is Neutron Reliable At Scale?
What would happen if you took a perfectly good OpenStack pod and then injected a heavy dose of network chaos?
A joint team from Big Switch Networks and Mirantis ran a Hadoop performance benchmark while forcing more than 650 network failures in a 30-minute period. This “chaos monkey”-style stress test, inspired by the Netflix engineering team, was meant to help answer the question “Is Neutron Reliable Enough?”
This blog post is about what happened next.
For context, OpenStack’s Neutron network stack has traveled a long road, and the stability at scale of today’s implementations has been a source of friction. Below is a picture of the audience that assembled for Aaron Rosen and Salvatore Orlando’s talk on this topic at the Paris OpenStack Summit (slides here). I’ve included the picture only to point out that the crowd had gathered to hear not about any new features, but purely about reliability improvements. Wow.
Why has Neutron reliability brought on so much debate? The answer is, unfortunately, complicated. For readers unfamiliar with the stack, keep in mind that the name in practice spans three very distinct areas: vendor-provided plug-ins for production use, a framework codebase that hosts those plug-ins, and a reference plug-in. There are over 20 different Neutron plug-ins to choose from, with levels of effort that range from “certified for production” to “hacked together as a side project.” There is widespread use of the reference plug-in that, as many of its authors often say, was built for API reference and not for production. Even after choosing a production-grade Neutron plug-in, there are secondary design decisions about redundancy of Neutron and plug-in servers, placement of L3 agents, etc. In short, there is no easy way to characterize the reliability of Neutron in general, but we can take specific production-grade designs as examples, beat them up, and see what happens next.
For this exercise, a joint Mirantis/Big Switch team used Mirantis’ OpenStack distribution with Big Switch’s Big Cloud Fabric SDN networking software stack and corresponding bare metal switch hardware. Borrowing a large-scale Big Switch testbed with a fabric of 32 leaf switches and 6 spine switches, the team loaded three racks of compute nodes and then another 13 racks’ worth of traffic generators. The Fuel installer was used first to lay out the OpenStack distribution, followed by the Big Cloud Fabric plug-in installer scripts, which both install the Neutron plug-in and optimize Neutron configurations as documented in the joint deployment guide.
The team needed a workload that would stress the network and whose performance lent itself to network-sensitive measurement. We chose Hadoop Terasort, a Big Data benchmark. We also needed scale, along with heavy amounts of network background chatter beyond what Terasort could produce. This was achieved with a combination of three racks of worker compute nodes and another 1024 ports from a massive-scale traffic generator built from Ixia generators and bare metal switch hardware/software used as port amplifiers.
Last, we needed chaos. The team added another 42,000 MAC addresses to the network (using the traffic generator) to make sure that network state across leaf switches, spine switches, and Big Cloud Fabric SDN Controllers was heavily loaded. We then added another 750 OpenStack projects, each with its own Neutron network, to ensure that both the Neutron servers and the control channel to the Big Cloud Fabric Controllers were heavily loaded. Finally, we forced the Big Cloud Fabric Controllers to reboot every 30 seconds, forced a random switch reboot every 8 seconds, and forced a random link failure every 4 seconds. This came out to a total of more than 650 network failures over the 30-minute duration of the test runs.
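To make the failure cadence concrete, here is a minimal Python sketch that lays the three failure types out on a timeline and counts them. The intervals and the 30-minute duration come from the text above; the names (`build_schedule`, `FAILURE_INTERVALS_S`) and the assumption that every injector fires on a perfectly regular clock for the full run are illustrative, not a description of the actual test harness.

```python
from collections import Counter

TEST_DURATION_S = 30 * 60  # 30-minute test run, as stated in the post

# Failure intervals in seconds, as stated in the post.
FAILURE_INTERVALS_S = {
    "controller_reboot": 30,
    "switch_reboot": 8,
    "link_failure": 4,
}


def build_schedule(duration_s, intervals_s):
    """Return a time-ordered list of (time_s, failure_kind) events.

    Assumption: each injector fires first after one full interval and
    then keeps a perfectly regular cadence until the run ends.
    """
    events = []
    for kind, interval in intervals_s.items():
        for t in range(interval, duration_s + 1, interval):
            events.append((t, kind))
    return sorted(events)


schedule = build_schedule(TEST_DURATION_S, FAILURE_INTERVALS_S)
counts = Counter(kind for _, kind in schedule)

for kind in FAILURE_INTERVALS_S:
    print(f"{kind}: {counts[kind]}")
print(f"total: {len(schedule)}")
```

Under these idealized assumptions the schedule holds 60 controller reboots, 225 switch reboots, and 450 link failures, for 735 events. That is an upper bound on a perfectly regular cadence, and it is consistent with the "more than 650" failures reported for the real runs.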
Terasort benchmark run time (mm:ss)

First set of four runs:
Run 1: 7:02
Run 2: 8:16
Run 3: 6:58
Run 4: 7:05

Second set of four runs:
Run 1: 7:06
Run 2: 6:51
Run 3: 7:34
Run 4: 7:33
The result? Over the course of four baseline test runs and four stress test runs, there was no measurable difference in Terasort performance despite the failures and background traffic. This “no difference” was the exciting result.
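As a quick sanity check on the "no measurable difference" claim, the listed run times can be averaged per set of four. The sketch below is only illustrative: the listing does not say which set of four is the baseline and which is the stress test, so the sets are simply called `first_set` and `second_set`, and the helper names are made up for this example.

```python
from statistics import mean


def to_seconds(mmss):
    """Convert an 'm:ss' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def to_mmss(total_seconds):
    """Format a number of seconds as 'm:ss', rounded to the nearest second."""
    minutes, seconds = divmod(round(total_seconds), 60)
    return f"{minutes}:{seconds:02d}"


# Run times exactly as listed in the post, in listing order.
first_set = ["7:02", "8:16", "6:58", "7:05"]
second_set = ["7:06", "6:51", "7:34", "7:33"]

first_secs = [to_seconds(t) for t in first_set]
second_secs = [to_seconds(t) for t in second_set]

first_mean = mean(first_secs)
second_mean = mean(second_secs)

print("first set mean: ", to_mmss(first_mean))
print("second set mean:", to_mmss(second_mean))
print("difference (s): ", abs(first_mean - second_mean))
print("spread within first set (s): ", max(first_secs) - min(first_secs))
print("spread within second set (s):", max(second_secs) - min(second_secs))
```

The two means come out at roughly 7:20 and 7:16, about four seconds apart, while the run-to-run spread within each set is 78 and 43 seconds. With only four runs per set this is not a statistical test, but it shows the gap between the sets is much smaller than the variation inside either one.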
For the networking/SDN heads out there, you’ll immediately notice that a traditional spanning tree network can’t converge this quickly, which means we’re already in the “modern era” of data center networking technologies. None of the non-SDN L2 fabrics can maintain stability under these conditions, and very few (if any) of the non-SDN L3 fabrics can either.
For OpenStack aficionados, this result shows that while the Neutron name spans a vast number of different plug-in codebases and configurations, many of them dead ends, there are Neutron options that are very stable at large scale. With tests like this, “Is Neutron stable enough?” is no longer the right question to ask. The better question is “Which Neutron designs are stable enough?”
This is an exciting time to be involved in OpenStack networking.
We'll be putting together a presentation on this test for the Vancouver summit, so please vote for us. You can also get more details by emailing the team at email@example.com, or downloading the white paper.
Software Versions Used for Testing:
Hadoop Version: 1.2.1 (vanilla Hadoop)
Mirantis OpenStack Version: Icehouse on Ubuntu 12.04.4 (2014.1.1-5.1)
Big Cloud Fabric SDN Platform: BCF-2.0.1 (#43)
Authors (in alpha order by last name)
Kevin Benton, OpenStack Neutron Core Reviewer and Member of Technical Staff, Big Switch Networks
Kevin Benton is a member of the technical staff at Big Switch Networks. He is an active developer in the OpenStack community and a core reviewer of the Neutron project. He spends most of his effort improving the reliability and performance of Neutron to ensure OpenStack deployments are backed by a resilient networking infrastructure. Kevin is concurrently pursuing a PhD at Indiana University researching the security of next generation network protocols.
Kyle Forster, Founder, Big Switch Networks
Kyle Forster is the founder of Big Switch, and has led various teams at the company during its growth, including Product, Sales, Business Development and Marketing. Prior to Big Switch, Kyle spent most of his career in Product Management at Cisco. Kyle holds a BSE in Electrical Engineering from Princeton, an MS in Computer Science and an MBA from the Stanford Graduate School of Business.
Kanzhe Jiang, Member of Technical Staff, Big Switch Networks
Kanzhe Jiang is a member of the technical staff at Big Switch Networks, responsible for integration of Big Cloud Fabric with various cloud orchestration platforms. Prior to Big Switch, Kanzhe spent many years at startups developing WAN optimization technologies. He holds an MS in Electrical Engineering from Stanford.
Prashanth Padubidry, Member of Technical Staff, Big Switch Networks
Prashanth Padubidry is a member of the technical staff at Big Switch Networks, responsible for large-scale fabric testing systems and system QA for the integration of Big Cloud Fabric and various orchestration technologies. He has 15 years of networking industry experience at Juniper, Aruba, Extreme and Nortel with various networking technologies spanning switching, routing, datacenter (various) and wireless.
Jason Venner, Chief Architect, Mirantis
Jason provides guidance, advice and implementation services for strategic enterprise projects, with a specific focus on using and implementing IaaS and PaaS services for customer experience excellence, operational excellence and high velocity development practices.
Jason has more than twenty years of experience as an architect, engineer, and author. His depth of experience includes creative solutions for high-performance and highly available systems.