We’ve learned from experience that the truth will come out.
For quite a long time there has been a common misconception that Neutron is not production-ready and has performance issues. The MOS-Neutron team aspired to put an end to these rumors and perform Neutron-focused performance and scale testing. We ran a great number of control-plane, data-plane and density tests, and came to the conclusion that Neutron is ready for production.
- MOS 9.0 with Mitaka-based Neutron
- 3 hardware labs were used for testing
- The largest lab included 378 nodes
- Line-rate throughput was achieved
- Over 24500 VMs were launched on a 200-node lab
- …and yes, Neutron works at scale!
Table of contents
- Integrity test
- Density test
- Shaker tests
The purpose of this document is to describe the process and show the results of at-scale testing of MOS 9.0 (Mitaka-based) performed by the MOS Neutron team. The testing focused on the Neutron component, but because the tests are integrated, all other components of the product (Rabbit cluster, DB, Nova, Ceph, Keystone, and so on) were tested as well.
The testing was performed on three environments with different HW configurations.
- ML2 OVS
- VxLAN/L2 POP
- rootwrap-daemon ON
- ovsdb native interface OFF
- ofctl native interface OFF
- agent report interval 10s
- agent downtime 30s
The idea of this test is to create a group of resources and verify that it stays persistent no matter what other operations are performed on the environment (resource creation/deletion, heavy workloads, etc.).
Create 20 instances in two server groups, `server-group-floating` and `server-group-non-floating`, in proportion 10:10, with each server group having the anti-affinity policy. Instances from different server groups are located in different subnets plugged into a router. Instances from `server-group-floating` have assigned floating IPs while instances from `server-group-non-floating` have only fixed IPs.
For each of the instances the following connectivity checks are made:
- SSH into an instance.
- Ping an external resource (e.g., 126.96.36.199).
- Ping other VMs (by fixed or floating IPs)
The list of IPs that each VM pings is built to cover all possible combinations with minimum redundancy. Having VMs from different subnets, with and without floating IPs, ping each other and an external resource (188.8.131.52) allows us to check that all possible traffic routes are working. For example:
- From fixed IP to fixed IP in different subnets
- From floating IP to fixed IP (same path as fixed IP to fixed IP in different subnets)
- From floating IP to floating IP
- From fixed IP to floating IP
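The route coverage listed above can be sketched in a few lines of Python. This is only an illustration, not the logic of the actual `connectivity_check` tool: the `Instance` type, the `route_checks` helper, and the sample addresses are hypothetical, and the sketch assumes one subnet per server group, as described earlier.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Iterator, Optional, Tuple


@dataclass(frozen=True)
class Instance:
    name: str
    subnet: str
    fixed_ip: str
    floating_ip: Optional[str] = None  # None for the non-floating group


def route_checks(instances) -> Iterator[Tuple[str, str, str]]:
    """Yield (route label, source name, target address) for every ordered pair."""
    for src, dst in permutations(instances, 2):
        src_kind = "floating" if src.floating_ip else "fixed"
        scope = "same subnet" if src.subnet == dst.subnet else "different subnets"
        # Every instance can be pinged by its fixed IP ...
        yield f"{src_kind} -> fixed ({scope})", src.name, dst.fixed_ip
        # ... and additionally by its floating IP when it has one.
        if dst.floating_ip:
            yield f"{src_kind} -> floating", src.name, dst.floating_ip


# Hypothetical sample: two instances per server group, one subnet per group.
SAMPLE = [
    Instance("floating-0", "subnet-a", "10.0.1.3", "192.0.2.10"),
    Instance("floating-1", "subnet-a", "10.0.1.4", "192.0.2.11"),
    Instance("non-floating-0", "subnet-b", "10.0.2.3"),
    Instance("non-floating-1", "subnet-b", "10.0.2.4"),
]

if __name__ == "__main__":
    # Print each distinct route type exercised by the sample once.
    for label in sorted({label for label, _, _ in route_checks(SAMPLE)}):
        print(label)
```

Grouping the generated checks by label makes it easy to confirm that every route type is exercised at least once before trimming redundant pairs.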
Steps to set up and run the test:
- Create integrity stack using Heat template
    root@node-52:~/mos-scale-9.0# heat stack-create -f integrity_check/integrity_vm.hot -P \
        "image=894bdf67-1151-49b8-9e4b-df13a1ed03c1;flavor=m1.micro;instance_count_floating=10;instance_count_non_floating=10" \
        integrity_stack
    +--------------------------------------+-----------------+--------------------+---------------------+--------------+
    | id                                   | stack_name      | stack_status       | creation_time       | updated_time |
    +--------------------------------------+-----------------+--------------------+---------------------+--------------+
    | dfd99a76-694c-4425-8230-21b3e68be496 | integrity_stack | CREATE_IN_PROGRESS | 2016-09-07T11:34:15 | None         |
    +--------------------------------------+-----------------+--------------------+---------------------+--------------+
- Assign floating IPs to instances
    (integ) root@node-52:~/mos-scale-9.0# assign_floatingips --sg-floating nova_server_group_floating
    2016-09-07 11:35:59,983 INFO:Discovering members of group nova_server_group_floating
    2016-09-07 11:36:01,160 INFO:Created floating ip with address: 10.3.61.203
    2016-09-07 11:36:03,044 INFO:Associated floating ip 10.3.61.203 with instance c49841d6-b49b-4374-a5f5-d5cc734905bc
    2016-09-07 11:36:03,863 INFO:Created floating ip with address: 10.3.61.205
    2016-09-07 11:36:05,469 INFO:Associated floating ip 10.3.61.205 with instance 712e2d37-f9e1-426e-b21d-81cfae2fbb06
    ……
    2016-09-07 11:36:21,304 INFO:Created floating ip with address: 10.3.61.232
    2016-09-07 11:36:22,704 INFO:Associated floating ip 10.3.61.232 with instance cdefadd3-6439-4341-b93f-210c6d608963
- Run connectivity check
    root@node-52:~/mos-scale-9.0# connectivity_check -s ~/ips.json
    2016-09-07 12:02:27,008 INFO:Loading instances' ips from /root/ips.json
    2016-09-07 12:02:30,441 INFO:Check connectivity from 10.3.61.213 to 184.108.40.206 successful.
    2016-09-07 12:02:33,494 INFO:Check connectivity from 10.3.61.213 to 10.3.61.203 successful.
    2016-09-07 12:02:36,547 INFO:Check connectivity from 10.3.61.213 to 10.3.61.217 successful.
    2016-09-07 12:02:39,600 INFO:Check connectivity from 10.3.61.213 to 10.3.61.205 successful.
    2016-09-07 12:02:42,654 INFO:Check connectivity from 10.3.61.213 to 10.3.61.207 successful.
    ..........
    2016-09-07 12:12:25,761 INFO:Check connectivity from 10.3.61.226 to 220.127.116.11 successful.
    2016-09-07 12:12:28,816 INFO:Check connectivity from 10.3.61.226 to 18.104.22.168 successful.
    2016-09-07 12:12:32,210 INFO:Check connectivity from 22.214.171.124 to 126.96.36.199 successful.
    2016-09-07 12:12:35,263 INFO:Check connectivity from 188.8.131.52 to 184.108.40.206 successful.
    2016-09-07 12:12:38,658 INFO:Check connectivity from 220.127.116.11 to 18.104.22.168 successful.
    2016-09-07 12:12:38,809 INFO:Time: 0:10:11.802956
The `connectivity_check` run should be repeated between other test runs.
The idea is to boot as many VMs as possible (in batches of 200-1000 VMs) and make sure they are properly wired and have access to the external network. The test allows us to measure the maximum number of VMs which can be deployed without issues with cloud operability, etc.
The external access is checked by the external server to which VMs connect upon spawning. The server logs incoming connections from provisioned VMs which send their IPs to this server via POST requests. Instances also report the number of attempts it took to get an IP address from the metadata server and connect to the HTTP server, respectively.
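A minimal sketch of such a reporting server is shown below. The source only states that VMs POST their IPs and attempt counts, so the JSON payload shape, the field names (`ip`, `metadata_attempts`, `http_attempts`), and the port are assumptions made for illustration, not the format used in the actual test.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Reported data keyed by instance IP (hypothetical in-memory store).
REPORTS = {}


def record_report(body: bytes) -> dict:
    """Parse one JSON report and store it under the reporting instance's IP."""
    report = json.loads(body)
    REPORTS[report["ip"]] = {
        "metadata_attempts": int(report["metadata_attempts"]),
        "http_attempts": int(report["http_attempts"]),
    }
    return REPORTS[report["ip"]]


class ReportHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            record_report(self.rfile.read(length))
        except (ValueError, KeyError, TypeError):
            self.send_response(400)  # malformed or incomplete report
        else:
            self.send_response(200)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Listen on all interfaces and log reports until interrupted."""
    HTTPServer(("", port), ReportHandler).serve_forever()
```

Comparing the number of entries in `REPORTS` with the number of VMs requested in a batch gives a quick check that every instance was wired correctly and reached the external network.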
Density test overview
A Heat template was used for creating 1 network with a subnet, 1 DVR router, and 1 VM per compute node. Heat stacks were created in batches of 1 to 5 (5 most of the time), so 1 iteration effectively means 5 new networks/routers and 196 * 5 VMs. During the execution of the test we were constantly monitoring the lab’s status using the Grafana dashboard and checking agents’ status.
As a result, we were able to successfully create 125 Heat stacks, which gives a total of 24500 VMs, twice as many as we were able to create on MOS 7.0 (according to the MOS 7.0 performance test report).
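The totals follow directly from the stack layout. The short calculation below uses only figures from the text; the constant and function names are illustrative.

```python
VMS_PER_STACK = 196       # one VM per compute node in each Heat stack
STACKS_PER_ITERATION = 5  # typical batch size (batches ranged from 1 to 5)
STACKS_CREATED = 125      # stacks created successfully by the end of the test


def total_vms(stacks: int, vms_per_stack: int = VMS_PER_STACK) -> int:
    """Number of VMs produced by the given number of Heat stacks."""
    return stacks * vms_per_stack


if __name__ == "__main__":
    print(total_vms(STACKS_PER_ITERATION))  # 980 VMs per full iteration
    print(total_vms(STACKS_CREATED))        # 24500 VMs in total
```

Because every stack places one VM on each compute node, the number of stacks (125) is also the number of VMs each compute node ended up hosting.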
Here’s what the Grafana dashboard looks like during the density test:
Analysis of cluster state data from Grafana shows that average CPU consumption on controllers and computes increases gradually with the growing number of VMs. The main limitation in the density test is memory capacity of the compute nodes.
During testing, we encountered and patched several bugs, and tuned the node configuration to accommodate the growing number of VMs per node. Adjustments included:
- Increasing ARP table size on compute nodes and controllers
- Raising cpu_allocation_ratio from 8.0 to 12.0 in nova.conf to avoid hitting the Nova vCPU limit on compute nodes
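To see why the ratio matters, recall that Nova caps the vCPUs it will schedule on a host at the host's CPU count multiplied by `cpu_allocation_ratio`. The sketch below illustrates the effect; the host CPU count of 12 and the one-vCPU-per-VM flavor are assumptions chosen for the example, not the actual lab hardware.

```python
def vcpu_limit(host_cpus: int, cpu_allocation_ratio: float) -> int:
    """Maximum number of vCPUs Nova will schedule on one compute node."""
    return int(host_cpus * cpu_allocation_ratio)


HOST_CPUS = 12        # ASSUMPTION: hypothetical CPU count per compute node
VMS_PER_NODE = 125    # one VM per node per stack, 125 stacks (from the text)
VCPUS_PER_VM = 1      # ASSUMPTION: a one-vCPU flavor

if __name__ == "__main__":
    needed = VMS_PER_NODE * VCPUS_PER_VM
    for ratio in (8.0, 12.0):
        limit = vcpu_limit(HOST_CPUS, ratio)
        print(f"ratio {ratio}: limit {limit} vCPUs, fits {needed}: {needed <= limit}")
```

Under these assumed numbers the default-like ratio of 8.0 would stop scheduling before the node reached its final VM count, while 12.0 leaves headroom.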
At ~16000 VMs we reached the ARP table size limit on compute nodes, so Heat stack creation started to fail. After increasing the maximum table size, we decided to clean up the failed stacks. While doing so we ran into a Nova issue (LP #1606825 nova-compute hangs while executing a blocking call to librbd): on VM deletion, nova-compute may hang for a while executing a call to librbd and eventually be reported as down in the nova service-list output. This issue was fixed with the help of the Nova team, and the fix was applied on the lab as a patch.
After launching ~20000 VMs the cluster started experiencing problems with RabbitMQ and Ceph; when the number of VMs reached 24500, services and agents started to go down en masse. The initial failure might have been caused by an insufficient limit on allowed PIDs per OSD node (https://bugs.launchpad.net/fuel/+bug/1536271), and the resulting Ceph failure affected all services. For example, MySQL errors in the Neutron server led to agents going down and to massive resource rescheduling/resync. After the Ceph failure the cluster could not be recovered, so the density test had to be stopped before the capacity of the compute nodes was exhausted.
The Ceph team commented that 3 Ceph monitors aren’t enough for over 20000 VMs (each having 2 drives) and recommended having at least 1 monitor per ~1000 client connections. The monitors can also be moved to dedicated nodes.
Note: The connectivity check of the Integrity test passed 100% even when the cluster went crazy. That is a good illustration of control plane failures not affecting the data plane.
Final result: 24500 VMs on a cluster.
Shaker is a distributed data-plane testing tool for OpenStack. Shaker wraps around popular system network testing tools such as iperf3 and netperf (with the help of flent). Shaker is able to deploy OpenStack instances and networks in different topologies. The Shaker scenario specifies the deployment and list of tests to execute.
Shaker deploys the required topology using Heat templates and starts lightweight agents that execute tests and report the results back to the server. In the case of network testing, only master agents are involved, while slaves are used as back-ends handling incoming packets.
We tested several scenarios.
- L2 same domain
This scenario tests the bandwidth between pairs of instances in the same virtual network (L2 domain). Each instance is deployed on its own compute node. The test increases the load from 1 pair until all available instances are used.
- L3 east-west
This scenario tests the bandwidth between pairs of instances deployed in different virtual networks plugged into the same router. Each instance is deployed on its own compute node. The test increases the load from 1 pair until all available instances are used.
- L3 north-south
This scenario tests the bandwidth between pairs of instances deployed in different virtual networks. Instances with master agents are located in one network, instances with slave agents are reached via their floating IPs. Each instance is deployed on its own compute node. The test increases the load from 1 pair until all available instances are used.
Testing process and results
Our data plane performance testing started on Lab A (QA 200-node lab), deployed with DVR/VxLAN/L2pop in the standard configuration. Having run the Shaker test suite, we saw disquietingly low throughput: in east-west bi-directional tests upload/download throughput was about 561/528 Mbits/sec! These results suggested that it would be reasonable to update the MTU from the default, 1500, to 9000, which is commonly used in customer installations.
Making this change increased throughput almost sevenfold, to 3615/3844 Mbits/sec in the same test case.
Such a large difference in the results shows that performance, to a very real extent, depends on a lab’s configuration.
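The size of the gain can be checked from the reported numbers. The helper name below is illustrative; the throughput values are the Lab A east-west bi-directional results quoted above.

```python
def speedup(before_mbits: float, after_mbits: float) -> float:
    """Ratio of throughput after a change to throughput before it."""
    return after_mbits / before_mbits


# Lab A, L3 east-west, bi-directional: MTU 1500 vs MTU 9000 (Mbits/sec).
UPLOAD_SPEEDUP = speedup(561, 3615)
DOWNLOAD_SPEEDUP = speedup(528, 3844)

if __name__ == "__main__":
    print(f"upload:   x{UPLOAD_SPEEDUP:.1f}")    # x6.4
    print(f"download: x{DOWNLOAD_SPEEDUP:.1f}")  # x7.3
```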
Lab A, L3 East-West, MTU 1500
Lab A, L3 East-West, MTU 9000
Shaker results for Lab A
|MTU 1500||MTU 9000|
The other feature that matters for network performance is hardware offloads. They are especially needed once VxLAN tunneling and segmentation (with 50 bytes of overhead) come into play. VxLAN hardware offloads significantly increase throughput while reducing the load on the CPU. Not all hardware supports this feature: it requires modern NICs (Intel X540 or X710). To make full use of it, one would also have to install a fresh 4.7 kernel, which has the most recent fixes and improvements. Lab A didn’t meet any of these requirements, so we had to move on to a lab with more advanced hardware.
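A rough calculation shows why both MTU and segmentation offloads matter. The sketch below is an approximation under stated assumptions: it treats the MTU as the size of each packet on the wire, ignores Ethernet framing overhead, and uses a 10 Gbit/s link rate as an example; the function names are illustrative.

```python
VXLAN_OVERHEAD_BYTES = 50  # encapsulation overhead per packet (from the text)


def overhead_share(mtu: int) -> float:
    """Fraction of each MTU-sized packet consumed by VxLAN encapsulation."""
    return VXLAN_OVERHEAD_BYTES / mtu


def packets_per_second(rate_mbits: float, mtu: int) -> float:
    """Packets per second needed to sustain a bit rate with MTU-sized packets."""
    return rate_mbits * 1_000_000 / (mtu * 8)


if __name__ == "__main__":
    for mtu in (1500, 9000):
        share = overhead_share(mtu) * 100
        pps = packets_per_second(10_000, mtu)  # ASSUMPTION: 10 Gbit/s link
        print(f"MTU {mtu}: {share:.2f}% overhead, {pps:,.0f} packets/sec")
```

With MTU 1500 the host has to handle six times as many packets for the same bit rate, which is the per-packet work that jumbo frames reduce and that segmentation offloads move from the CPU to the NIC.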
The next lab we could get our hands on consisted of 6 nodes (3 controllers + 3 computes) with X710 NICs and running on Ubuntu 14.04 (Lab B). Here we ran Shaker tests with different lab configurations: MTU 1500/9000 and hardware offloads on/off.
Note: To disable all hardware offloads on an interface the following command is used:
ethtool -K <interface> tx off rx off tso off gso off gro off tx-udp_tnl-segmentation off
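When the offloads have to be toggled repeatedly between test runs, the same command can be assembled programmatically. This is a minimal sketch: the helper names are hypothetical, the feature list is copied from the command above, and actually running it requires root privileges and an installed `ethtool`.

```python
import subprocess

# Offload features toggled in the tests, as listed in the ethtool command above.
OFFLOAD_FEATURES = ("tx", "rx", "tso", "gso", "gro", "tx-udp_tnl-segmentation")


def offload_command(interface: str, enabled: bool = False) -> list:
    """Build the ethtool argument list that switches all listed offloads on or off."""
    state = "on" if enabled else "off"
    cmd = ["ethtool", "-K", interface]
    for feature in OFFLOAD_FEATURES:
        cmd += [feature, state]
    return cmd


def set_offloads(interface: str, enabled: bool = False) -> None:
    """Apply the setting; raises CalledProcessError if ethtool fails."""
    subprocess.run(offload_command(interface, enabled), check=True)
```

For example, `offload_command("eth0")` produces the arguments for disabling all listed offloads on a hypothetical interface `eth0`, and `offload_command("eth0", enabled=True)` produces the arguments for re-enabling them.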
Shaker results for Lab B
| |MTU 1500|MTU 9000|
|HW offloads on|dense_l2|dense_l2|
|HW offloads off|full_l2|full_l3_east_west|
As the charts below show, hardware offloads are most effective with a smaller MTU (i.e. 1500), mostly due to segmentation offloads:
- x3.5 throughput increase in bi-directional test
- x2.5 throughput increase in download/upload tests
Increasing MTU from 1500 to 9000 also gives a significant boost:
- 75% throughput increase in bi-directional test (offloads on)
- 41% throughput increase in download/upload tests (offloads on)
Lab B, L3 East-West, MTU 9000, offloads on
These results show that it makes sense to enable jumbo frames and hardware offloads in production environments whenever possible.
The next step was deploying a large-scale lab whose hardware configuration was equal to or better than that of Lab B. The new Lab C contained 378 nodes (3 controllers, 375 computes), with each node having 4 bonded 10G X710 interfaces.
Shaker results, Lab C
|MTU 9000, offloads on|
On this lab we were able to achieve near line-rate results in L2 and L3 east-west Shaker tests even with concurrency >50:
- 9800 Mbits/sec in download/upload tests
- 6100 Mbits/sec each direction in bi-directional tests
Lab C, L3 East-West, MTU 9000
The charts below compare the results that were produced on Lab A and Lab C.
Here it can be seen that running the same test on a lab with jumbo frames enabled and hardware offloads supported leads to a significant increase in throughput, which remains stable even at high concurrency.
L3 North-South performance is still far from perfect and may be further investigated and improved. The resulting throughput depends on many factors, including switch configuration, lab topology (whether nodes are situated in the same rack or not, etc.), and the MTU in the external network, which must always be assumed to be 1500. It should also be remembered that in the North-South test all the traffic goes through the controller, which may get flooded at high concurrency.
The results shown below are the most important, as in real environments there is usually traffic going in and out, and therefore it is important that throughput is stable in both directions. Here we can see that on Lab C the average throughput in both directions was almost 3 times higher than on Lab A with the same MTU 9000.
The average results shown on the graphs above are often affected by corner cases in which the channel gets stuck for various reasons and throughput drops significantly. To get a fuller understanding of what throughput is achievable, one can take a look at a chart with the most successful results.
In comparison with the results achieved on MOS 7.0 on a lab with the same configuration as Lab A and MTU 9000, the results remain stable, without any performance degradation. The throughput observed on a lab with more advanced NICs proves that Neutron DVR+VxLAN+L2pop installations are capable of very high performance, with the only bottlenecks being hardware configuration and MTU settings.
- No major issues in Neutron were found during testing (all labs, all tests).
- Issues found were either already fixed upstream and backported to Mitaka, or fixes are in progress.
- Data-plane tests showed stable performance on all hardware. High network performance can be achieved even on old hardware that doesn’t support VxLAN offloads, provided that the MTU is set properly. On servers with modern NICs throughput is almost line-rate.
- Data-plane connectivity is not lost even during serious issues with the control plane.
- Density testing clearly demonstrated that Neutron is capable of managing over 24500 VMs on 200 nodes (3 controllers) without serious performance degradation. In fact, we weren’t even able to spot significant bottlenecks in the Neutron control plane, as we had to stop the test due to issues not related to Neutron.
- Neutron is ready for large-scale production deployments on 350+ nodes.