We installed an OpenStack cluster with close to 1000 nodes on Kubernetes. Here’s what we found out.

Late last year, we ran a series of tests that deployed close to 1000 OpenStack nodes on a pre-installed Kubernetes cluster, to find out what problems you might run into at that scale and to fix them where possible. We found several, and although we were generally able to fix them, we thought it would still be useful to go over the kinds of things you need to look for.

Overall, we deployed an OpenStack cluster of more than 900 nodes using Fuel-CCP on a Kubernetes cluster that had been deployed using Kargo. The Kargo tool is part of the Kubernetes Incubator project and uses the Large Kubernetes Cluster reference architecture as a baseline.

As we worked, we documented the issues we found and contributed fixes to both the deployment tool and the reference design document where appropriate. Here’s what we found.

The setup

We started with just over 175 bare metal machines. We allocated 3 of them to the Kubernetes control plane services (API servers, etcd, Kubernetes scheduler, etc.). Each of the others hosted 5 virtual machines, and every VM was used as a Kubernetes minion node.

Each bare metal node had the following specifications:

  • HP ProLiant DL380 Gen9
  • CPU – 2x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
  • RAM – 264G
  • Storage – 3.0T on RAID on HP Smart Array P840 Controller, HDD – 12 x HP EH0600JDYTL
  • Network – 2x Intel Corporation Ethernet 10G 2P X710

The running OpenStack cluster (as far as Kubernetes is concerned) consists of:

  1. OpenStack control plane services running on close to 150 pods over 6 nodes
  2. Close to 4500 pods spread across all of the remaining nodes, at 5 pods per minion node

One major Prometheus problem

During the experiments we used the Prometheus monitoring tool to verify resource consumption and the load put on the core system, Kubernetes, and OpenStack services. One note of caution when using Prometheus: deleting old data from Prometheus storage will indeed improve the Prometheus API speed, but it will also delete any previous cluster information, making it unavailable for post-run investigation. So make sure to document any observed issue and its debugging thoroughly!

Thankfully, we had in fact done that documentation, but going forward we’ve decided to prevent this problem by configuring Prometheus to back up data to one of the persistent time series databases it supports, such as InfluxDB, Cassandra, or OpenTSDB. By default, Prometheus is optimized for use as a real-time monitoring and alerting system, and there is an official recommendation from the Prometheus developers to keep monitoring data for only about 15 days so that the tool stays quick and responsive. By setting up the backup, we can store old data for an extended amount of time for post-processing needs.

Problems we experienced in our testing

Huge load on kube-apiserver

Symptoms

Initially, we had a setup with all nodes (including the Kubernetes control plane nodes) running in a virtualized environment, but the load was such that the API servers couldn’t function at all, so we moved them to bare metal. Even after that migration to hardware nodes, both API servers running in the Kubernetes cluster were using up to 2000% of CPU (up to 45% of the node’s total compute capacity).
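The two percentages above are related by the number of logical CPUs on a node. The following is a minimal sketch of that conversion, assuming 48 logical CPUs per host (two 12-core E5-2680 v3 sockets with Hyper-Threading enabled). That count is an assumption made for illustration, not a figure from the test report, and the helper name node_share is hypothetical.

```python
# Convert a top-style CPU percentage (100% = one logical CPU) into a share
# of the whole node. LOGICAL_CPUS is an assumption: 2 sockets x 12 cores x 2 threads.
LOGICAL_CPUS = 48


def node_share(cpu_percent: float, logical_cpus: int = LOGICAL_CPUS) -> float:
    """Return the share of total node CPU capacity, as a percentage."""
    return cpu_percent / logical_cpus


# 2000 is the figure observed before the fix; 100 and 200 are the per-server
# figures reported after the fix in the Solution section.
for cpu_percent in (2000, 100, 200):
    share = node_share(cpu_percent)
    print(f"{cpu_percent}% CPU is about {share:.1f}% of the node")
```

Under that assumption, 2000% works out to a little over 40% of the node, which is in the same range as the "up to 45%" quoted above.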

Root cause

All services that are not on Kubernetes masters (kubelet and kube-proxy on all minions) access kube-apiserver via a local NGINX proxy. Most of those requests are watch requests that sit mostly idle after they are initiated (most timeouts on them are defined to be about 5-10 minutes). NGINX was configured to cut idle connections after 3 seconds, which causes all clients to reconnect and (even worse) restart aborted SSL sessions. On the server side, this makes kube-apiserver consume up to 2000% of the CPU resources, making other requests very slow.
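To get a feel for why a 3-second idle timeout hurts so much, here is a rough back-of-the-envelope model. It assumes one idle watch connection per client and 1800 clients (one kubelet and one kube-proxy on each of roughly 900 minions). Both numbers are simplifying assumptions, and real clients typically hold more than one watch.

```python
# Toy model: how often an otherwise idle watch connection is torn down and
# re-established for a given proxy idle timeout. Client counts are assumptions.
def reconnects_per_hour(idle_timeout_s: float) -> float:
    """An idle connection is cut once per timeout period."""
    return 3600 / idle_timeout_s


CLIENTS = 900 * 2  # assumed: one kubelet and one kube-proxy per minion, one watch each

for timeout_s in (3, 600):
    per_conn = reconnects_per_hour(timeout_s)
    total_per_s = CLIENTS * per_conn / 3600
    print(
        f"timeout {timeout_s:>3}s: {per_conn:.0f} reconnects/hour per connection, "
        f"about {total_per_s:.0f} new SSL sessions/s cluster-wide"
    )
```

In this model, the 3-second timeout forces each idle connection to reconnect 1200 times an hour, against 6 times an hour with a 10-minute timeout.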

Solution

Set the proxy_timeout parameter to 10 minutes in the nginx.conf configuration file, which should be more than long enough to avoid cutting SSL connections before the requests time out by themselves. After this fix was applied, one kube-apiserver consumed only 100% of CPU (about 2% of the node’s total compute capacity), while the second one consumed about 200% (about 4% of the node’s total compute capacity), with an average response time of 200-400 ms.
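A small check like the one below could flag a proxy_timeout that is shorter than the watch timeouts. This is only a sketch: the config fragment, the upstream name, and the listen address are made up for illustration, and the parser handles only simple single-unit time values rather than everything NGINX accepts.

```python
import re

# Multipliers for the time suffixes handled here (a subset of what NGINX accepts).
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_nginx_time(value: str) -> float:
    """Convert a simple NGINX time value such as '3s' or '10m' to seconds.

    A bare number is treated as seconds. Compound values like '1h 30m'
    are deliberately not handled in this sketch.
    """
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)?", value.strip())
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_SECONDS[unit or "s"]


def find_proxy_timeouts(conf_text: str) -> list[float]:
    """Return every proxy_timeout value found in the config text, in seconds."""
    pattern = re.compile(r"^\s*proxy_timeout\s+([^;]+);", re.MULTILINE)
    return [parse_nginx_time(v) for v in pattern.findall(conf_text)]


# Hypothetical config fragment; the upstream name and port are made up.
SAMPLE_CONF = """
stream {
    server {
        listen 127.0.0.1:6443;
        proxy_pass kube_apiserver;
        proxy_timeout 3s;
    }
}
"""

MIN_TIMEOUT_S = 600  # 10 minutes, the value suggested above

for timeout in find_proxy_timeouts(SAMPLE_CONF):
    status = "ok" if timeout >= MIN_TIMEOUT_S else "too short"
    print(f"proxy_timeout = {timeout:g}s -> {status}")
```

Changing the sample line to proxy_timeout 10m; makes the check pass.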

Upstream issue status: fixed

Make the Kargo deployment tool set proxy_timeout to 10 minutes: the issue was fixed with a pull request by the Fuel CCP team.

KubeDNS cannot handle large cluster load with default settings

Symptoms

When deploying an OpenStack cluster at this scale, kubedns becomes unresponsive because of the huge load. This results in a slew of errors like the following in the logs of the dnsmasq container in the kubedns pod:

Maximum number of concurrent DNS queries reached.

Also, dnsmasq containers sometimes get restarted due to hitting the high memory limit.

Root cause

First of all, kubedns seems to fail often in this architecture, even without load. During the experiment we observed continuous kubedns container restarts even on an empty (but large enough) Kubernetes cluster. The restarts are caused by a failing liveness check, although nothing notable is observed in any logs.

Second, dnsmasq should have taken the load off kubedns, but it needs some tuning to behave as expected (or, frankly, at all) for large loads.

Solution

Fixing this problem requires several steps:

  1. Set higher limits for dnsmasq containers: they take on most of the load.
  2. Add more replicas to the kubedns replication controller (we stopped at 6 replicas, as that solved the observed issue; bigger clusters might need even more).
  3. Increase the number of parallel connections dnsmasq should handle (we used --dns-forward-max=1000, which is the recommended setting in the dnsmasq manuals).
  4. Increase the size of the dnsmasq cache: it has a hard limit of 10000 cache entries, which seems to be a reasonable amount.
  5. Fix kubedns to handle this behaviour properly.
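Steps 3 and 4 come down to two dnsmasq command-line options. The sketch below only assembles those arguments. The helper name dnsmasq_args is hypothetical, the 10000-entry cap is the one stated in the list above, and how the arguments reach the dnsmasq container depends on the deployment tool.

```python
# Build a dnsmasq argument list with the two tuned values from the list above.
# Only --dns-forward-max and --cache-size are covered; everything else about
# the container spec is deployment-specific and left out.
MAX_CACHE_ENTRIES = 10000  # hard limit quoted in the text


def dnsmasq_args(dns_forward_max: int = 1000, cache_size: int = MAX_CACHE_ENTRIES) -> list[str]:
    """Return the extra dnsmasq arguments for a large cluster."""
    if cache_size > MAX_CACHE_ENTRIES:
        raise ValueError(f"cache size is capped at {MAX_CACHE_ENTRIES} entries")
    return [f"--dns-forward-max={dns_forward_max}", f"--cache-size={cache_size}"]


print(" ".join(dnsmasq_args()))
```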

Upstream issue status: partially fixed

#1 and #2 were fixed by the Kubernetes team by making them configurable in Kargo: issue, pull request.

Others – work has not yet started.

Kubernetes scheduler needs to be deployed on a separate node

Symptoms

During the huge OpenStack cluster deployment on Kubernetes, the scheduler, controller-manager and kube-apiserver start fighting for CPU cycles, as all of them are under a large load. The scheduler is the most resource-hungry, so we need a way to deploy it separately.

Solution

We manually moved the Kubernetes scheduler to a separate node; all other schedulers were manually killed to prevent them from moving to other nodes.

Upstream issue status: reported

Issue in Kargo.

Kubernetes scheduler is ineffective with pod antiaffinity

Symptoms

It takes a significant amount of time for the scheduler to process pods that have pod antiaffinity rules specified on them. It spends about 2-3 seconds on each pod, which makes the time needed to deploy an OpenStack cluster of 900 nodes unexpectedly long (about 3h just for scheduling). The OpenStack deployment requires antiaffinity rules to prevent several OpenStack compute nodes from being launched on a single Kubernetes minion node.

Root cause

According to profiling results, most of the time is spent on creating new Selectors to match existing pods against, which triggers the validation step. Basically we have O(N^2) unnecessary validation steps (where N = the number of pods), even if we have just 5 deployment entities scheduled to most of the nodes.
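The difference between building a selector for every comparison and building it once per scheduled pod can be shown with a toy model. This is not the Kubernetes scheduler code. The Selector class and both counting functions are hypothetical, and the constructor simply stands in for the expensive validation step.

```python
# Toy model only: counts how many times an (expensive) selector is built.
class Selector:
    built = 0  # class-level counter of constructions

    def __init__(self, labels: dict):
        Selector.built += 1  # stands in for the costly validation step
        self.labels = labels

    def matches(self, pod_labels: dict) -> bool:
        return all(pod_labels.get(k) == v for k, v in self.labels.items())


def count_naive(pods: list) -> int:
    """Rebuild the selector for every (new pod, existing pod) comparison."""
    Selector.built = 0
    placed = []
    for pod in pods:
        for existing in placed:
            Selector(pod).matches(existing)
        placed.append(pod)
    return Selector.built


def count_reused(pods: list) -> int:
    """Build the selector once per scheduled pod and reuse it."""
    Selector.built = 0
    placed = []
    for pod in pods:
        selector = Selector(pod)
        for existing in placed:
            selector.matches(existing)
        placed.append(pod)
    return Selector.built


# 100 pods belonging to just 5 deployments, as in the scenario described above.
pods = [{"app": f"deployment-{i % 5}"} for i in range(100)]
print("naive:", count_naive(pods), "reused:", count_reused(pods))
```

For 100 pods the naive version builds 4950 selectors and the reused version builds 100. The cheap label comparisons are still quadratic in both versions; only the expensive constructions change.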

Solution

In this case, we needed a specific optimization that brings scheduling time down to about 300 ms per pod. That is still slow by common-sense standards (about 30m spent just on pod scheduling for a 900-node OpenStack cluster), but it is at least close to reasonable. This solution lowers the number of very expensive operations to O(N), which is better, but it still depends on the number of pods rather than the number of deployments, so there is room for future improvement.
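The totals quoted for scheduling follow from simple multiplication. The sketch below assumes about 4500 pods scheduled strictly one after another; both the pod count (taken from the setup section) and the sequential model are approximations.

```python
PODS = 4500  # approximate pod count from the setup section


def total_minutes(seconds_per_pod: float, pods: int = PODS) -> float:
    """Total scheduling time in minutes, assuming pods are handled one at a time."""
    return seconds_per_pod * pods / 60


for label, per_pod in [("before, 2 s/pod", 2.0), ("before, 3 s/pod", 3.0), ("after, 0.3 s/pod", 0.3)]:
    minutes = total_minutes(per_pod)
    print(f"{label}: {minutes:.1f} min ({minutes / 60:.2f} h)")
```

This gives roughly 2.5 to 3.75 hours before the optimization and a little over 20 minutes after it, in line with the "about 3h" and "about 30m" figures above.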

Upstream issue status: fixed

The optimization was merged into master (pull request) and backported to the 1.5 branch, and is part of the 1.5.2 release (pull request).

kube-apiserver has low default rate limit

Symptoms

Different services start receiving “429 Rate Limit Exceeded” HTTP errors, even though kube-apiservers can take more load. This problem was discovered through a scheduler bug (see below).

Solution

Raise the rate limit for the kube-apiserver process via the --max-requests-inflight option. It defaults to 400, but in our case it became workable at 2000. This number should be configurable in the Kargo deployment tool, as bigger deployments might require an even bigger increase.
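An in-flight limit of this kind rejects requests as soon as too many are already being processed, regardless of how much CPU is left. The toy limiter below illustrates the idea. The class and the burst helper are hypothetical and do not reproduce how kube-apiserver implements the limit.

```python
import threading


class InflightLimiter:
    """Answer 429 once max_inflight requests are already being processed."""

    def __init__(self, max_inflight: int):
        self.max_inflight = max_inflight
        self.inflight = 0
        self._lock = threading.Lock()

    def try_start(self) -> int:
        """Admit a request (200) or reject it immediately (429)."""
        with self._lock:
            if self.inflight >= self.max_inflight:
                return 429
            self.inflight += 1
            return 200

    def finish(self) -> None:
        """Mark one admitted request as completed."""
        with self._lock:
            self.inflight -= 1


def burst(limiter: InflightLimiter, requests: int) -> int:
    """Open a burst of requests without finishing any; return how many got 429."""
    codes = [limiter.try_start() for _ in range(requests)]
    return codes.count(429)


for limit in (400, 2000):
    rejected = burst(InflightLimiter(limit), 1000)
    print(f"limit {limit}: {rejected} of 1000 concurrent requests rejected")
```

With a burst of 1000 simultaneous requests, the limit of 400 rejects 600 of them, while the limit of 2000 rejects none.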

Upstream issue status: reported

Issue in Kargo.

Kubernetes scheduler can schedule incorrectly

Symptoms

When creating a huge number of pods (~4500 in our case) and faced with HTTP 429 errors from kube-apiserver (see above), the scheduler can schedule several pods of the same deployment on one node, in violation of the pod antiaffinity rule on them.

Root cause

See pull request below.

Upstream issue status: pull request

Fix from Mirantis team: pull request (merged, part of Kubernetes 1.6 release).

Docker sometimes becomes unresponsive

Symptoms

The Docker process sometimes hangs on several nodes, which results in timeouts in the kubelet logs. When this happens, pods cannot be spawned or terminated successfully on the affected minion node. Although many similar issues have been fixed in Docker since 1.11, we are still observing these symptoms.

Workaround

The Docker daemon logs do not contain any notable information, so we had to restart the docker service on the affected node. (During the experiments we used Docker 1.12.3, but we have observed similar symptoms in 1.13 release candidates as well.)

OpenStack services don’t handle PXC pseudo-deadlocks

Symptoms

When run in parallel, create operations for lots of resources were failing with a DBError saying that Percona XtraDB Cluster had identified a deadlock and that the transaction should be restarted.

Root cause

oslo.db is responsible for wrapping errors received from the DB into proper classes so that services can restart transactions if similar errors occur, but it didn’t expect the error in the format that is being sent by Percona. After we fixed this, however, we still experienced similar errors, because not all transactions that could be restarted were properly decorated in Nova code.
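The pattern at stake is "wrap the database error in a recognizable exception class, then retry the decorated transaction". The sketch below shows that pattern in a generic form. It is not the oslo.db or Nova implementation: DeadlockError, retry_on_deadlock, and create_resource are hypothetical names used only for illustration.

```python
import functools
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for the wrapped deadlock exception."""


def retry_on_deadlock(max_retries: int = 5, base_delay: float = 0.0):
    """Re-run the decorated function when it raises DeadlockError."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except DeadlockError:
                    if attempt == max_retries:
                        raise  # out of retries: surface the error to the caller
                    time.sleep(base_delay * (2 ** attempt))  # exponential backoff
        return wrapper
    return decorator


attempts = 0


@retry_on_deadlock(max_retries=3)
def create_resource() -> str:
    """Fails with a deadlock on the first two calls, then succeeds."""
    global attempts
    attempts += 1
    if attempts < 3:
        raise DeadlockError("deadlock detected; restart the transaction")
    return "created"


print(create_resource(), "after", attempts, "attempts")
```

Both halves have to be in place: if the error is not recognized as a deadlock, or the function is not decorated, the failure goes straight to the caller, which matches the two stages of the problem described above.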

Upstream issue status: fixed

The bug has been fixed by Roman Podolyaka’s CR and backported to Newton. It fixes Percona deadlock error detection, but there’s at least one place in Nova that still needs to be fixed.

Live migration failed with live_migration_uri configuration

Symptoms

With the live_migration_uri configuration, live migration fails because one compute host can’t connect to libvirt on another host.

Root cause

We can’t specify which IP address to use in the live_migration_uri template, so it was trying to use the address from the first interface that happened to be in the PXE network, while libvirt listens on the private network. We couldn’t use the live_migration_inbound_addr, which would solve this problem, because of a problem in upstream Nova.
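As a loose illustration of the limitation, a URI template with a single host placeholder offers no way to choose the network. The sketch below is not Nova code. The template string, host name, and addresses are assumptions chosen for the example, and build_uri is a hypothetical helper.

```python
# A URI template with one %s slot: whatever host string is substituted decides
# which network the connection uses. All values below are illustrative.
URI_TEMPLATE = "qemu+tcp://%s/system"

# Hypothetical addresses of the same destination host on two networks.
PXE_ADDR = "10.20.0.17"        # first interface, PXE network
PRIVATE_ADDR = "192.168.1.17"  # private network, where libvirt listens


def build_uri(template: str, target: str) -> str:
    """Fill the single host placeholder in the template."""
    return template % target


print("via template default:", build_uri(URI_TEMPLATE, PXE_ADDR))
print("via explicit inbound address:", build_uri(URI_TEMPLATE, PRIVATE_ADDR))
```

An explicit inbound address, such as the one live_migration_inbound_addr provides, lets the destination host state which address should be used instead of relying on whichever interface comes first.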

Upstream issue status: fixed

A bug in Nova has been fixed and backported to Newton. We switched to using live_migration_inbound_addr after that.
