OpenStack Neutron Performance and Scalability: Testing summary

We’ve learned from experience that the truth will come out.
Richard Feynman

For quite a long time there has been a common misconception that Neutron is not production-ready and has performance issues. The MOS Neutron team set out to put these rumors to rest by performing Neutron-focused performance and scale testing. We ran a great number of control-plane, data-plane, and density tests, and came to the conclusion that Neutron is ready for production.

This article is an excerpt from the full report; you can download the complete PDF here.

[NOTE: This report has been updated with Rally test results.]

Highlights:

  • MOS 9.0 with Mitaka-based Neutron
  • 3 hardware labs were used for testing
  • The largest lab included 378 nodes
  • Line-rate throughput was achieved
  • Over 24500 VMs were launched on a 200-node lab
  • …and yes, Neutron works at scale!


Overview

Document purpose

The purpose of this document is to describe the process and show the results of testing MOS 9.0 (Mitaka-based) at scale, performed by the MOS Neutron team. The testing was focused on the Neutron component, but due to the integrated nature of the tests, all other components of the product (RabbitMQ cluster, DB, Nova, Ceph, Keystone, and so on) were exercised as well.

Background

The testing was performed on three environments with different HW configurations.

Neutron configuration (a sketch of how these settings map onto the config files follows the list):

  • ML2 OVS
  • VxLAN/L2 POP
  • DVR
  • rootwrap-daemon ON
  • ovsdb native interface OFF
  • ofctl native interface OFF
  • agent report interval 10s
  • agent downtime 30s
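For reference, here is a rough sketch of how the settings above map onto the standard Mitaka configuration files. The file paths and option names are the stock ones and should be verified against a real deployment; the crudini utility used here is just a convenient way to set INI options and may need to be installed first.

# ML2 plugin: OVS mechanism driver, VxLAN tenant networks, L2 population
crudini --set /etc/neutron/plugins/ml2/ml2_conf.ini ml2 mechanism_drivers openvswitch,l2population
crudini --set /etc/neutron/plugins/ml2/ml2_conf.ini ml2 tenant_network_types vxlan
# OVS agent: VxLAN tunnels, L2 pop, DVR; non-native ovsdb/ofctl interfaces
crudini --set /etc/neutron/plugins/ml2/openvswitch_agent.ini agent tunnel_types vxlan
crudini --set /etc/neutron/plugins/ml2/openvswitch_agent.ini agent l2_population True
crudini --set /etc/neutron/plugins/ml2/openvswitch_agent.ini agent enable_distributed_routing True
crudini --set /etc/neutron/plugins/ml2/openvswitch_agent.ini ovs ovsdb_interface vsctl
crudini --set /etc/neutron/plugins/ml2/openvswitch_agent.ini ovs of_interface ovs-ofctl
# neutron.conf: DVR by default, rootwrap daemon, agent liveness
crudini --set /etc/neutron/neutron.conf DEFAULT router_distributed True
crudini --set /etc/neutron/neutron.conf DEFAULT agent_down_time 30
crudini --set /etc/neutron/neutron.conf agent report_interval 10
crudini --set /etc/neutron/neutron.conf agent root_helper_daemon "sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf"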

Integrity test

The idea of this test is to create a group of resources and verify that it stays persistent no matter what other operations are performed on the environment (resources creation/deletion, heavy workloads, etc.).

Test scenario


Create 20 instances in two server groups, `server-group-floating` and `server-group-non-floating`, in proportion 10:10, with each server group having the anti-affinity policy. Instances from different server groups are located in different subnets plugged into a router. Instances from `server-group-floating` have assigned floating IPs while instances from `server-group-non-floating` have only fixed IPs.

For each of the instances, the following connectivity checks are made (a minimal sketch of a single check is shown after the list):

  1. SSH into the instance.
  2. Ping an external resource (e.g., 8.8.8.8).
  3. Ping other VMs (by fixed or floating IPs).
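Here is a minimal sketch of a single connectivity check, assuming a CirrOS-like guest with password authentication over SSH; the addresses are examples taken from the log output further down, not fixed values:

VM_FIP=10.3.61.213        # floating IP of the instance under test (example)
PEER_FIXED=20.20.20.9     # fixed IP of a peer VM in another subnet (example)
# 1. SSH into the instance (default CirrOS credentials assumed)
sshpass -p 'cubswin:)' ssh -o StrictHostKeyChecking=no cirros@"$VM_FIP" uptime
# 2. Ping an external resource from inside the VM
sshpass -p 'cubswin:)' ssh -o StrictHostKeyChecking=no cirros@"$VM_FIP" ping -c 3 8.8.8.8
# 3. Ping another VM by its fixed IP from inside the first VM
sshpass -p 'cubswin:)' ssh -o StrictHostKeyChecking=no cirros@"$VM_FIP" ping -c 3 "$PEER_FIXED"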

Traffic flow during connectivity check

Lists of IPs to ping from each VM are formed in a way that checks all possible combinations with minimum redundancy. Having VMs from different subnets, with and without floating IPs, ping each other and an external resource (8.8.8.8) allows us to check that all possible traffic routes are working. For example:

  • From fixed IP to fixed IP in the same subnet
  • From fixed IP to fixed IP in different subnets


  • From floating IP to fixed IP (same path as in 2)


  • From floating IP to floating IP


  • From fixed IP to floating IP


Steps to set up and run the test:

  • Create integrity stack using Heat template
root@node-52:~/mos-scale-9.0# heat stack-create -f integrity_check/integrity_vm.hot -P \
"image=894bdf67-1151-49b8-9e4b-df13a1ed03c1;flavor=m1.micro;instance_count_floating=10;instance_count_non_floating=10" \
integrity_stack

 +--------------------------------------+-----------------+--------------------+---------------------+--------------+                                                                                                                          
 | id                                   | stack_name      | stack_status       | creation_time       | updated_time |                                                                                                                          
 +--------------------------------------+-----------------+--------------------+---------------------+--------------+                                                                                                                          
 | dfd99a76-694c-4425-8230-21b3e68be496 | integrity_stack | CREATE_IN_PROGRESS | 2016-09-07T11:34:15 | None         |                                                                                                                          
 +--------------------------------------+-----------------+--------------------+---------------------+--------------+


  • Assign floating IPs to instances
(integ) root@node-52:~/mos-scale-9.0# assign_floatingips --sg-floating nova_server_group_floating 
 2016-09-07 11:35:59,983 INFO:Discovering members of group nova_server_group_floating 
 2016-09-07 11:36:01,160 INFO:Created floating ip with address: 10.3.61.203   
 2016-09-07 11:36:03,044 INFO:Associated floating ip 10.3.61.203 with instance c49841d6-b49b-4374-a5f5-d5cc734905bc  
 2016-09-07 11:36:03,863 INFO:Created floating ip with address: 10.3.61.205 
 2016-09-07 11:36:05,469 INFO:Associated floating ip 10.3.61.205 with instance 712e2d37-f9e1-426e-b21d-81cfae2fbb06

……

2016-09-07 11:36:21,304 INFO:Created floating ip with address: 10.3.61.232
 2016-09-07 11:36:22,704 INFO:Associated floating ip 10.3.61.232 with instance cdefadd3-6439-4341-b93f-210c6d608963
  • Run connectivity check
root@node-52:~/mos-scale-9.0# connectivity_check -s ~/ips.json 
 2016-09-07 12:02:27,008 INFO:Loading instances' ips from /root/ips.json
 2016-09-07 12:02:30,441 INFO:Check connectivity from 10.3.61.213 to 8.8.8.8 successful.  
 2016-09-07 12:02:33,494 INFO:Check connectivity from 10.3.61.213 to 10.3.61.203 successful.
 2016-09-07 12:02:36,547 INFO:Check connectivity from 10.3.61.213 to 10.3.61.217 successful. 
 2016-09-07 12:02:39,600 INFO:Check connectivity from 10.3.61.213 to 10.3.61.205 successful. 
 2016-09-07 12:02:42,654 INFO:Check connectivity from 10.3.61.213 to 10.3.61.207 successful.
 ..........
 2016-09-07 12:12:25,761 INFO:Check connectivity from 10.3.61.226 to 20.20.20.8 successful.   
 2016-09-07 12:12:28,816 INFO:Check connectivity from 10.3.61.226 to 20.20.20.9 successful.  
 2016-09-07 12:12:32,210 INFO:Check connectivity from 20.20.20.8 to 8.8.8.8 successful.
 2016-09-07 12:12:35,263 INFO:Check connectivity from 20.20.20.8 to 20.20.20.9 successful.
 2016-09-07 12:12:38,658 INFO:Check connectivity from 20.20.20.9 to 8.8.8.8 successful. 
 2016-09-07 12:12:38,809 INFO:Time: 0:10:11.802956

The check_connectivity test should be performed between other test runs.

Density test

The idea is to boot as many VMs as possible (in batches of 200-1000 VMs) and make sure they are properly wired and have access to the external network. The test allows us to measure the maximum number of VMs that can be deployed without compromising cloud operability.

External access is checked using an external server to which the VMs connect upon spawning. The server logs incoming connections from provisioned VMs, which send their IPs to it via POST requests. Instances also report how many attempts it took to get an IP address from the metadata server and to connect to the HTTP server.
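The guest-side reporting logic can be sketched roughly as follows; this is not the exact script used in the test, and the report server URL is hypothetical:

REPORT_URL=http://10.3.60.10:4242/report     # external logging server (hypothetical)
attempts=0
# Retry until the metadata service hands out the instance's fixed IP
until MY_IP=$(curl -sf http://169.254.169.254/latest/meta-data/local-ipv4); do
    attempts=$((attempts + 1))
    sleep 5
done
# Report the IP and the number of attempts to the external HTTP server
curl -sf -X POST -d "ip=${MY_IP}&metadata_attempts=${attempts}" "$REPORT_URL"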

Density test overview

A Heat template was used for creating 1 network with a subnet, 1 DVR router, and 1 VM per compute node. Heat stacks were created in batches of 1 to 5 (5 most of the time), so 1 iteration effectively means 5 new networks/routers and 196 * 5 VMs. During the execution of the test we were constantly monitoring the lab’s status using the Grafana dashboard and checking agents’ status.
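Batch creation itself boils down to a simple loop over heat stack-create; the template name and parameters below are illustrative, not the exact ones used in the test:

BATCH=1    # increment for every new batch of stacks
for i in $(seq 1 5); do
    heat stack-create -f density_check/density_vm.hot \
        -P "image=<image_id>;flavor=m1.micro" "density_stack_${BATCH}_${i}" &
done
wait
heat stack-list | grep density_stack    # check that the new stacks reach CREATE_COMPLETE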

As a result we were able to successfully create 125 Heat stacks, which gives us a total of 24500 VMs, which is two times more than we were able to create on MOS 7.0 (according to the MOS 7.0 performance test report).


Iteration 1


Iteration i

Here’s what the Grafana dashboard looks like during the density test:


Analysis of cluster state data from Grafana shows that average CPU consumption on controllers and computes increases gradually with the growing number of VMs. The main limitation in the density test is memory capacity of the compute nodes.


During testing, we encountered and patched several bugs and tuned the node configuration to cope with the growing number of VMs per node. Adjustments included (a sketch of both follows the list):

  • Increasing ARP table size on compute nodes and controllers
  • Raising cpu_allocation_ratio from 8.0 to 12.0 in nova.conf to prevent hitting the Nova vCPU limit on computes
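Both adjustments boil down to a few commands; the ARP threshold values below are illustrative rather than the exact ones used on the lab:

# Larger ARP (neighbour) tables on compute nodes and controllers
sysctl -w net.ipv4.neigh.default.gc_thresh1=16384
sysctl -w net.ipv4.neigh.default.gc_thresh2=28672
sysctl -w net.ipv4.neigh.default.gc_thresh3=32768
# Allow more vCPU oversubscription on compute nodes, then restart nova-compute
crudini --set /etc/nova/nova.conf DEFAULT cpu_allocation_ratio 12.0
service nova-compute restart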

At ~16000 VMs we hit the ARP table size limit on compute nodes, so Heat stack creation started to fail. Having increased the maximum table size, we decided to clean up the failed stacks. While attempting to do so, we ran into a Nova issue (LP #1606825, nova-compute hangs while executing a blocking call to librbd): on VM deletion, nova-compute may hang for a while executing a call to librbd and eventually go down in nova service-list output. This issue was fixed with the help of the Nova team, and the fix was applied on the lab as a patch.

After launching ~20000 VMs the cluster started experiencing problems with RabbitMQ and Ceph; when the number of VMs reached 24500, services and agents started to go down en masse. The initial failure might have been caused by the lack of allowed PIDs per OSD node (https://bugs.launchpad.net/fuel/+bug/1536271), and the resulting Ceph failure affected all services. For example, MySQL errors in the Neutron server led to agents going down and massive resource rescheduling/resyncing. After the Ceph failure the cluster could not be recovered, so the density test had to be stopped before the capacity of the compute nodes was exhausted.

The Ceph team commented that 3 Ceph monitors aren't enough for over 20000 VMs (each with 2 drives) and recommended having at least 1 monitor per ~1000 client connections, as well as moving the monitors to dedicated nodes.
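The corresponding checks and the PID-limit workaround are straightforward; a sketch (the pid_max value here is simply the kernel maximum, pick whatever fits your nodes):

# On Ceph OSD nodes: raise the per-node process/thread limit
sysctl -w kernel.pid_max=4194303
# Check overall Ceph health and the monitor quorum
ceph -s
ceph mon stat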

Note: The Integrity test's connectivity check passed 100% even when the cluster went crazy. That is a good illustration of control-plane failures not affecting the data plane.

Final result: 24500 VMs on a cluster.

Rally tests

Since we're testing performance, we should also look at Rally, which is designed specifically for benchmarking.

RS0 (Basic Neutron test suite)

Let’s look at the Rally neutron test suite with default configuration. It is most useful for validating cloud operability.

The following Rally test scenarios were executed:

create-and-list-floating-ips
create-and-list-networks
create-and-list-ports
create-and-list-routers
create-and-list-security-groups
create-and-list-subnets

create-and-delete-floating-ips
create-and-delete-networks
create-and-delete-ports
create-and-delete-routers
create-and-delete-security-groups
create-and-delete-subnets

create-and-update-networks
create-and-update-ports
create-and-update-routers
create-and-update-security-groups
create-and-update-subnets


RS1

We also performed tests with an increased number of iterations and concurrency in order to create sufficient load on the control plane. In our tests we used concurrency 50-100 with 2000-5000 iterations.
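As an illustration, a single RS1-style run boils down to a small task file plus one command; the scenario arguments and context below are a sketch, not the exact task definitions we used:

cat > create_and_list_networks.json <<'EOF'
{
    "NeutronNetworks.create_and_list_networks": [
        {
            "runner": {"type": "constant", "times": 3000, "concurrency": 50},
            "context": {
                "users": {"tenants": 1, "users_per_tenant": 1},
                "quotas": {"neutron": {"network": -1}}
            }
        }
    ]
}
EOF
rally task start create_and_list_networks.json
# An HTML report with per-iteration timings can then be generated with "rally task report"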

The following Rally test scenarios were executed:

create-and-list-networks
create-and-list-ports
create-and-list-routers
create-and-list-security-groups
create-and-list-subnets

boot-and-list-server
boot-and-delete-server-with-secgroups
boot-runcommand-delete

The Rally RS1 results on Lab A and Lab C are:

 

create-and-list-networks
  Lab A: 3000 iterations / concurrency 50; avg 2.375 sec, max 7.904 sec; 1 error ("Internal server error while processing your request")
  Lab C: 5000 iterations / concurrency 50; avg 3.654 sec, max 11.669 sec; 6 errors ("Internal server error while processing your request")

create-and-list-ports
  Lab A: 1000 iterations / concurrency 50; avg 123.97 sec, max 277.977 sec; 1 error ("Internal server error while processing your request")
  Lab C: 2000 iterations / concurrency 50; avg 99.274 sec, max 270.84 sec; 0 errors

create-and-list-routers
  Lab A: 2000 iterations / concurrency 50; avg 15.59 sec, max 29.006 sec; 0 errors
  Lab C: 2000 iterations / concurrency 50; avg 12.942 sec, max 19.398 sec; 0 errors

create-and-list-security-groups
  Lab A: 50 iterations / concurrency 1; avg 210.706 sec, max 210.706 sec; 0 errors
  Lab C: 1000 iterations / concurrency 50; avg 68.712 sec, max 169.315 sec; 0 errors

create-and-list-subnets
  Lab A: 2000 iterations / concurrency 50; avg 25.973 sec, max 64.553 sec; 1 error ("Internal server error while processing your request")
  Lab C: 2000 iterations / concurrency 50; avg 17.415 sec, max 50.415 sec; 0 errors

boot-and-list-server
  Lab A: 4975 iterations / concurrency 50; avg 21.445 sec, max 40.736 sec; 0 errors
  Lab C: 1000 iterations / concurrency 50; avg 14.375 sec, max 25.21 sec; 0 errors

boot-and-delete-server-with-secgroups
  Lab A: 4975 iterations / concurrency 200; avg 190.772 sec, max 443.518 sec; 394 errors ("Server has ERROR status"; "The server didn't respond in time")
  Lab C: 1000 iterations / concurrency 100; avg 65.651 sec, max 95.651 sec; 0 errors

boot-runcommand-delete
  Lab A: 2000 iterations / concurrency 15; avg 28.39 sec, max 35.756 sec; 34 errors ("Rally tired waiting for Host ip:<ip> to become ('ICMP UP'), current status ICMP DOWN")
  Lab C: 3000 iterations / concurrency 50; avg 28.587 sec, max 85.659 sec; 1 error ("Resource <Server: s_rally_b58e9bde_Y369JdPf> has ERROR status. Deadlock found when trying to get lock.")

 

During the Rally runs, bugs affecting the boot-and-delete-server-with-secgroups and boot-runcommand-delete scenarios on Lab A were filed and fixed:

With these fixes applied on Lab C, the Rally RS1 scenarios passed successfully.

Other bugs that were encountered:

Observed trends:

  • create_and_list_networks
    • The total time spent on each iteration grows linearly


  • create_and_list_routers
    • router list operation time gradually grows from 0.12 to 1.5 sec (2000 iterations)


    • The total load duration remains linear


  • create_and_list_subnets
    • The subnet list operation time increases after 1750 iterations (from 4.5 sec at the 1700th iteration to 10.48 sec at the 1800th)


    • Subnet creation shows time peaks after 1750 iterations


  • create_and_list_secgroups
    • The secgroup list operation shows the most rapid growth, with time increasing from 0.548 sec in the first iteration to over 10 sec in the last iterations


RS2 (Neutron scale with many networks)

The aim of this test is to create a large number of networks, subnets, routers, and security groups with rules per tenant. Each network has a single VM. In our tests, 100 networks (each with a subnet, a router, and a VM) were created in each iteration.

Iterations/concurrency   avg time, sec   max time, sec   Errors
10/1                     1237.389        1294.549        0
20/3                     1298.611        1425.878        1 (HTTPConnectionPool Read time out)

 

Load graph for run with 20 iterations/concurrency 3

RS3 (Neutron scale with many servers)

 

The outline of this test is almost the same as that of RS2. The main difference is that in each iteration this test creates a large number of VMs (100 in our case) on a single network, which makes it possible to check the case with a large number of ports per subnet.

Iterations/concurrency   avg time, sec   max time, sec   Errors
10/1                     100.422         104.315         0
20/3                     119.767         147.107         0

 

Load graph for run with 20 iterations/concurrency 3

Rally results for all test suites

Lab A:
  RS0: basic Neutron suite
  RS1: create_and_list_networks, create_and_list_ports, create_and_list_routers, create_and_list_secgroups, create_and_list_subnets, boot_and_list_server, boot_and_delete_server_with_secgroups, boot_runcommand_delete
  RS2: concurrency 1, times 10; concurrency 3, times 20
  RS3: concurrency 1, times 10; concurrency 3, times 20

Lab C:
  RS1: neutron_suite, boot_and_list_server, boot_and_delete_server_with_secgroups, boot_runcommand_delete
  RS2: concurrency 1, times 10; concurrency 3, times 20
  RS3: concurrency 1, times 10; concurrency 3, times 20


Results for all Rally test suites demonstrate stable behavior on a loaded environment, with times growing almost linearly, when a lot of resources are created concurrently.

Shaker tests

Shaker is a distributed data-plane testing tool for OpenStack. Shaker wraps around popular system network testing tools such as iperf3 and netperf (with the help of flent). Shaker is able to deploy OpenStack instances and networks in different topologies. The Shaker scenario specifies the deployment and list of tests to execute.

Shaker deploys the required topology using Heat templates and starts lightweight agents that execute tests and report the results back to the server. In the case of network testing, only master agents run the tests, while slaves act as back-ends handling incoming packets.
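A typical invocation looks roughly like this; the server endpoint address is an example, and the OpenStack credentials are assumed to come from the usual OS_* environment variables:

shaker --server-endpoint 10.3.60.5:5999 \
       --scenario openstack/full_l3_east_west \
       --report full_l3_east_west.html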


Shaker architecture

Scenarios

We tested several scenarios.

  1. L2 same domain
    This scenario tests the bandwidth between pairs of instances in the same virtual network (L2 domain). Each instance is deployed on its own compute node. The test increases the load from 1 pair until all available instances are used.
  2. L3 east-west
    This scenario tests the bandwidth between pairs of instances deployed in different virtual networks plugged into the same router. Each instance is deployed on its own compute node. The test increases the load from 1 pair until all available instances are used.
  3. L3 north-south
    This scenario tests the bandwidth between pairs of instances deployed in different virtual networks. Instances with master agents are located in one network, instances with slave agents are reached via their floating IPs. Each instance is deployed on its own compute node. The test increases the load from 1 pair until all available instances are used.

Testing process and results

Our data plane performance testing started on Lab A (the 200-node QA lab) deployed with DVR/VxLAN/L2pop in the standard configuration. Having run the Shaker test suite, we saw disquietingly low throughput: in east-west bi-directional tests, upload/download throughput was about 561/528 Mbits/sec! These results suggested that it would be reasonable to update the MTU from the default of 1500 to 9000, which is commonly used in customer installations.

Making this change increased throughput almost 7 times, reaching 3615/3844 Mbits/sec in the same test case.
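Roughly, the MTU change consists of enabling jumbo frames on the interfaces that carry VxLAN traffic and telling Neutron about the underlying MTU. A sketch assuming the standard Mitaka options (the interface name is an example):

ip link set dev eth2 mtu 9000    # on every node, for the NIC carrying tunnel traffic
crudini --set /etc/neutron/neutron.conf DEFAULT global_physnet_mtu 9000
crudini --set /etc/neutron/plugins/ml2/ml2_conf.ini ml2 path_mtu 9000
# VxLAN tenant networks then get an MTU of 9000 minus the 50-byte encapsulation overhead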

Such a large difference in the results shows that performance, to a very real extent, depends on a lab’s configuration.

Lab A, L3 East-West, MTU 1500


Lab A, L3 East-West, MTU 9000


Shaker results for Lab A

  • MTU 1500: dense_l2, dense_l3_east_west, dense_l3_north_south, full_l2, full_l3_east_west, full_l3_north_south, perf_l2, perf_l3_east_west, perf_l3_north_south, udp_l2, udp_l3_east_west, udp_l3_north_south
  • MTU 9000: dense_l2, dense_l3_east_west, full_l2, full_l3_east_west, perf_l2, perf_l3_east_west, udp_l2, udp_l3_east_west

 

The other feature that is important in terms of network performance is hardware offloads. They are especially needed when VxLAN tunneling (with its 50 bytes of encapsulation overhead) comes in. VxLAN hardware offloads make it possible to significantly increase throughput while reducing load on the CPU. [6] Not all hardware supports this feature; it requires modern NICs (Intel X540 or X710). To make full use of it, one would also have to install a fresh 4.7 kernel, which has the most recent fixes and improvements. [7] Lab A didn't meet any of these requirements, so we had to move on to a lab with more advanced hardware.

The next lab we could get our hands on consisted of 6 nodes (3 controllers + 3 computes) with X710 NICs and running on Ubuntu 14.04 (Lab B). Here we ran Shaker tests with different lab configurations: MTU 1500/9000 and hardware offloads on/off.

Note: To disable all hardware offloads on an interface, the following command is used:

ethtool -K <interface> tx off rx off tso off gso off gro off tx-udp_tnl-segmentation off
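The current offload state of a NIC can be checked before and after with ethtool as well; for example (the interface name is an example):

ethtool -k eth2 | egrep 'tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload|tx-udp_tnl-segmentation'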

Shaker results for Lab B

  • HW offloads on, MTU 1500: dense_l2, dense_l3_east_west, dense_l3_north_south, full_l2, full_l3_east_west, full_l3_north_south
  • HW offloads on, MTU 9000: dense_l2, dense_l3_east_west, dense_l3_north_south, full_l2, full_l3_east_west, full_l3_north_south, perf_l2, perf_l3_east_west, perf_l3_north_south, udp_l2, udp_l3_east_west, udp_l3_north_south
  • HW offloads off: full_l2, full_l3_east_west

 

As can be seen in the charts below, hardware offloads are most effective with a smaller MTU (i.e., 1500), mostly due to segmentation offloads:

  • x3.5 throughput increase in bi-directional test
  • x2.5 throughput increase in download/upload tests

Increasing MTU from 1500 to 9000 also gives a significant boost:

  • 75% throughput increase in bi-directional test (offloads on)
  • 41% throughput increase in download/upload tests (offloads on)


Lab B, L3 East-West, MTU 9000, offloads on


These results show that it makes sense to enable jumbo frames and hardware offloads in production environments whenever possible.

The next step was deploying a large-scale lab whose hardware configuration was equal to or better than that of Lab B. The new Lab C contained 378 nodes (3 controllers, 375 computes), with each node having 4 bonded 10G X710 interfaces.

Shaker results, Lab C

  • MTU 9000, offloads on: dense_l2, dense_l3_east_west, dense_l3_north_south, full_l2, full_l3_east_west, full_l3_north_south, perf_l2, perf_l3_east_west

 

On this lab we were able to achieve near line-rate results in L2 and L3 east-west Shaker tests even with concurrency >50:

  • 9800 Mbits/sec in download/upload tests
  • 6100 Mbits/sec each direction in bi-directional tests

Lab C, L3 East-West, MTU 9000


The charts below compare the results that were produced on Lab A and Lab C.

L2: Lab A vs Lab C

L3: Lab A vs Lab C

Here it can be seen that running the same test on a lab with jumbo frames enabled and hardware offloads supported leads to a significant increase in throughput, which stays stable even at high concurrency.

L3 North-South performance is still far from perfect and may be further investigated and improved, though the resulting throughput depends on many factors, including the configuration of the switch and the lab topology (whether nodes are situated in the same rack or not, etc.), as well as the MTU in the external network, which must always be assumed to be 1500. It should also be remembered that in the North-South test all the traffic goes through the controller, which in the case of high concurrency may get flooded.

The results shown below are the most important, as in real environments there is usually traffic going in both directions, so it matters that throughput is stable both ways. Here we can see that on Lab C the average throughput in both directions was almost 3 times higher than on Lab A with the same MTU of 9000.

The average results shown on the graphs above are often affected by corner cases in which the channel gets stuck for various reasons and throughput drops significantly. To get a fuller picture of what throughput is achievable, one can look at a chart with the most successful results.


Compared with the results achieved on MOS 7.0 on a lab with the same configuration as Lab A (MTU 9000), the results remain stable, without any performance degradation. The throughput observed on a lab with more advanced NICs proves that Neutron DVR+VxLAN+L2pop installations are capable of very high performance, with the only bottlenecks being hardware configuration and MTU settings.

Outcomes

This article is an excerpt from the full report; you can download the complete PDF here.
  • No major issues in Neutron were found during testing (all labs, all tests).
  • Issues found were either already fixed upstream and backported to Mitaka, or fixes are in progress.
    • Data-plane tests showed stable performance on all hardware. It was demonstrated that high network performance can be achieved even on old hardware that doesn't support VxLAN offloads, given proper MTU settings. On servers with modern NICs, throughput is almost line-rate.
    • Data-plane connectivity is not lost even during serious issues with the control plane.
    • Density testing clearly demonstrated that Neutron is capable of managing over 24500 VMs on 200 nodes (3 controllers) without serious performance degradation. In fact, we weren't even able to spot significant bottlenecks in the Neutron control plane, as we had to stop the test due to issues not related to Neutron.
  • Neutron is ready for large-scale production deployments on 350+ nodes.

5 responses to “OpenStack Neutron Performance and Scalability: Testing summary”

  1. Could you elaborate on why these settings were chosen?
    ovsdb native interface OFF
    ofctl native interface OFF

    Thanks!

    1. Hi!

      We had plans to experiment with these options, but then decided to stick to their upstream default values: both are turned off in Mitaka.
