Late last year, we ran a number of tests that deployed close to 1000 OpenStack nodes on a pre-installed Kubernetes cluster, as a way of finding out what problems you might run into and fixing them where possible. We found several, and although we were generally able to fix them, we thought it would still be useful to go over the types of things you need to look for.
Overall we deployed an OpenStack cluster that contained more than 900 nodes using Fuel-CCP on a Kubernetes that had been deployed using Kargo. The Kargo tool is part of the Kubernetes Incubator project and uses the Large Kubernetes Cluster reference architecture as a baseline.
As we worked, we documented issues we found, and contributed fixes to both the deployment tool and reference design document where appropriate. Here’s what we found.
We started with just over 175 bare metal machines. Three of them were allocated to Kubernetes control plane services (API servers, etcd, Kubernetes scheduler, etc.); each of the others hosted 5 virtual machines, and every VM was used as a Kubernetes minion node.
Each bare metal node had the following specifications:
- HP ProLiant DL380 Gen9
- CPU – 2x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
- RAM – 264G
- Storage – 3.0T on RAID on HP Smart Array P840 Controller, HDD – 12 x HP EH0600JDYTL
- Network – 2x Intel Corporation Ethernet 10G 2P X710
The running OpenStack cluster (as far as Kubernetes is concerned) consists of:
- OpenStack control plane services running on close to 150 pods over 6 nodes
- Close to 4500 pods spread across all of the remaining nodes, at 5 pods per minion node
One major Prometheus problem
During the experiments we used the Prometheus monitoring tool to verify resource consumption and the load put on the core system, Kubernetes, and OpenStack services. One note of caution when using Prometheus: deleting old data from Prometheus storage will indeed improve the Prometheus API speed, but it will also delete any previous cluster information, making it unavailable for post-run investigation. So make sure to document any observed issue and its debugging thoroughly!
Thankfully, we had in fact done that documentation, but going forward we’ve decided to prevent this problem by configuring Prometheus to back up data to one of the persistent time series databases it supports, such as InfluxDB, Cassandra, or OpenTSDB. By default, Prometheus is optimized to be used as a real time monitoring / alerting system, and there is an official recommendation from the Prometheus developers team to keep monitoring data retention for only about 15 days to keep the tool working in a quick and responsive manner. By setting up the backup, we can store old data for an extended amount of time for post-processing needs.
Problems we experienced in our testing
Huge load on kube-apiserver
Initially, we had a setup with all nodes (including the Kubernetes control plane nodes) running on a virtualized environment, but the load was such that the API servers couldn’t function at all so they were moved to bare metal. Still, both API servers running in the Kubernetes cluster were utilising up to 2000% of the available CPU (up to 45% of total node compute performance capacity), even after we migrated them to hardware nodes.
All services that are not on Kubernetes masters (kubelet, kube-proxy on all minions) access kube-apiserver via a local NGINX proxy. Most of those requests are watch requests that lie mostly idle after they are initiated (most timeouts on them are defined to be about 5-10 minutes). NGINX was configured to cut idle connections after 3 seconds, which causes all clients to reconnect and (even worse) restart aborted SSL sessions. On the server side, this makes kube-apiserver consume up to 2000% of the CPU resources, making other requests very slow.
Set the proxy_timeout parameter to 10 minutes in the nginx.conf configuration file, which should be more than long enough to prevent cutting SSL connections before the requests time out by themselves. After this fix was applied, one kube-apiserver consumed only 100% of CPU (about 2% of total node compute performance capacity), while the second one consumed about 200% of CPU (about 4% of total node compute performance capacity), with an average response time of 200-400 ms.
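To see why the 3 second idle timeout hurts so much, here is a rough back-of-the-envelope model in Python. It is only a sketch: it assumes a watch connection that carries no traffic at all during its lifetime, and the function and parameter names are ours, not part of NGINX or Kubernetes.

```python
def forced_reconnects_per_hour(idle_timeout_s: float, watch_lifetime_s: float) -> float:
    """Estimate how often a proxy cuts one fully idle watch connection.

    Simplifying assumption: the watch sends no data during its lifetime,
    so the proxy's idle timer is never reset by traffic.
    """
    if idle_timeout_s >= watch_lifetime_s:
        # The watch times out by itself before the proxy gives up on it.
        return 0.0
    # Otherwise the proxy drops the connection every idle_timeout_s seconds,
    # and the client has to reconnect and redo the SSL handshake each time.
    return 3600.0 / idle_timeout_s


# A 10 minute watch behind a 3 second idle timeout, then behind a 10 minute one.
before_fix = forced_reconnects_per_hour(idle_timeout_s=3, watch_lifetime_s=600)
after_fix = forced_reconnects_per_hour(idle_timeout_s=600, watch_lifetime_s=600)
print(before_fix, after_fix)
```

Under these assumptions each idle client is forced to reconnect on the order of a thousand times per hour with the old setting, and not at all once the proxy timeout is at least as long as the watch timeout. Multiply that by every kubelet and kube-proxy in the cluster and the SSL handshake load on kube-apiserver adds up quickly.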
Upstream issue status: fixed
KubeDNS cannot handle large cluster load with default settings
When deploying an OpenStack cluster on this scale, kubedns becomes unresponsive because of the huge load. This results in a slew of errors appearing in the logs of the dnsmasq container in the kubedns pod:
Maximum number of concurrent DNS queries reached.
Also, dnsmasq containers sometimes get restarted due to hitting the high memory limit.
First of all, kubedns seems to fail often in this architecture, even without load. During the experiment we observed continuous kubedns container restarts even on an empty (but large enough) Kubernetes cluster. The restarts are caused by the liveness check failing, although nothing notable is observed in any logs.
Second, dnsmasq should have taken the load off kubedns, but it needs some tuning to behave as expected (or, frankly, at all) for large loads.
Fixing this problem requires several levels of steps:
- Set higher limits for dnsmasq containers: they take on most of the load.
- Add more replicas to the kubedns replication controller (we settled on 6 replicas, as that solved the observed issue; bigger clusters might need an even higher number).
- Increase the number of parallel connections dnsmasq should handle (we used --dns-forward-max=1000, which is the recommended setting in the dnsmasq manuals).
- Increase the size of the cache in dnsmasq: it has a hard limit of 10000 cache entries, which seems to be a reasonable amount.
- Fix kubedns to handle this behaviour in a proper way.
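As an illustration of the dnsmasq tuning in the list above, the following Python sketch builds the extra command-line arguments for the dnsmasq container. The helper name and the way the arguments are assembled are hypothetical; only the --dns-forward-max and --cache-size options belong to dnsmasq itself, and the 10000 entry ceiling is the hard limit mentioned above.

```python
from typing import List

# Hard limit on dnsmasq cache entries, as described in the text above.
DNSMASQ_CACHE_HARD_LIMIT = 10000


def dnsmasq_extra_args(forward_max: int = 1000, cache_size: int = 10000) -> List[str]:
    """Build the extra dnsmasq arguments used for a large cluster (hypothetical helper)."""
    if forward_max <= 0:
        raise ValueError("forward_max must be positive")
    if not 0 <= cache_size <= DNSMASQ_CACHE_HARD_LIMIT:
        raise ValueError("cache_size must be between 0 and %d" % DNSMASQ_CACHE_HARD_LIMIT)
    return [
        "--dns-forward-max=%d" % forward_max,  # concurrent forwarded DNS queries
        "--cache-size=%d" % cache_size,        # number of cached entries
    ]


print(" ".join(dnsmasq_extra_args()))
```

How these arguments reach the container depends on how kubedns is deployed, so treat this only as a way to keep the two values and their limits in one place.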
Upstream issue status: partially fixed
Others – work has not yet started.
Kubernetes scheduler needs to be deployed on a separate node
During the huge OpenStack cluster deployment against Kubernetes, scheduler, controller-manager and kube-apiserver start fighting for CPU cycles as all of them are under a large load. Scheduler is the most resource-hungry, so we need a way to deploy it separately.
We moved the Kubernetes scheduler to a separate node manually; all other schedulers were manually killed to prevent them from moving to other nodes.
Upstream issue status: reported
Issue in Kargo.
Kubernetes scheduler is ineffective with pod antiaffinity
It takes a significant amount of time for the scheduler to process pods with pod antiaffinity rules specified on them. It is spending about 2-3 seconds on each pod, which makes the time needed to deploy an OpenStack cluster of 900 nodes unexpectedly long (about 3h for just scheduling). OpenStack deployment requires the use of antiaffinity rules to prevent several OpenStack compute nodes from being launched on a single Kubernetes minion node.
According to profiling results, most of the time is spent on creating new Selectors to match existing pods against, which triggers the validation step. Basically we have O(N^2) unnecessary validation steps (where N = the number of pods), even if we have just 5 deployment entities scheduled to most of the nodes.
In this case, we needed a specific optimization that brings scheduling time down to about 300 ms/pod. It’s still slow in terms of common sense (about 30m spent just on pod scheduling for a 900 node OpenStack cluster), but it is at least close to reasonable. This solution lowers the number of very expensive operations to O(N), which is better, but still depends on the number of pods instead of deployments, so there is room for future improvement.
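The scheduling times quoted above can be sanity-checked with some simple arithmetic. This Python sketch assumes roughly 4500 pods that pass through the scheduler one after another; the pod count comes from the cluster description earlier, while the strictly serial model and the function names are our simplification.

```python
PODS = 4500  # approximate number of pods with antiaffinity rules


def scheduling_minutes(seconds_per_pod: float, pods: int = PODS) -> float:
    """Total scheduling time in minutes if pods are scheduled strictly one by one."""
    return pods * seconds_per_pod / 60.0


def validation_steps(pods: int, quadratic: bool) -> int:
    """Rough count of expensive selector validations: O(N^2) before the fix, O(N) after."""
    return pods * pods if quadratic else pods


print("before: %.0f to %.0f minutes" % (scheduling_minutes(2.0), scheduling_minutes(3.0)))
print("after:  %.1f minutes" % scheduling_minutes(0.3))
print("validation steps: %d -> %d" % (validation_steps(PODS, True), validation_steps(PODS, False)))
```

At 2-3 seconds per pod this comes out at roughly 2.5 to 3.75 hours, and at 300 ms per pod at a bit over 20 minutes, which is in line with the "about 3h" and "about 30m" figures above.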
Upstream issue status: fixed
kube-apiserver has low default rate limit
Different services start receiving “429 Rate Limit Exceeded” HTTP errors, even though kube-apiservers can take more load. This problem was discovered through a scheduler bug (see below).
Raise the rate limit for the kube-apiserver process via the --max-requests-inflight option. It defaults to 400, but in our case it became workable at 2000. This number should be configurable in the Kargo deployment tool, as bigger deployments might require an even bigger increase.
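To make the effect of an in-flight request limit more concrete, here is a minimal Python model of one. It is not the kube-apiserver implementation, just a sketch of the general idea: a fixed number of slots, with requests that arrive while all slots are busy being rejected with HTTP 429 instead of queued. The class and method names are hypothetical.

```python
import threading


class InflightLimiter:
    """Toy model of a server-side cap on concurrently processed requests."""

    def __init__(self, max_inflight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_inflight)

    def begin(self) -> int:
        """Try to admit a request: 200 if a slot was free, 429 if the cap is reached."""
        return 200 if self._slots.acquire(blocking=False) else 429

    def end(self) -> None:
        """Mark an admitted request as finished and free its slot."""
        self._slots.release()


limiter = InflightLimiter(max_inflight=2)
print(limiter.begin(), limiter.begin(), limiter.begin())  # third request finds no free slot
limiter.end()
print(limiter.begin())  # a slot is free again
```

In this model, clients see 429 errors as soon as the number of simultaneous requests exceeds the cap, regardless of how much spare CPU the server has, which matches the symptom described above.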
Upstream issue status: reported
Issue in Kargo.
Kubernetes scheduler can schedule incorrectly
When creating a huge number of pods (~4500 in our case) and faced with HTTP 429 errors from kube-apiserver (see above), the scheduler can schedule several pods of the same deployment on one node, in violation of the pod antiaffinity rule on them.
See pull request below.
Upstream issue status: pull request
Fix from Mirantis team: pull request (merged, part of Kubernetes 1.6 release).
Docker sometimes becomes unresponsive
The Docker process sometimes hangs on several nodes, which results in timeouts in the kubelet logs. When this happens, pods cannot be spawned or terminated successfully on the affected minion node. Although many similar issues have been fixed in Docker since 1.11, we are still observing these symptoms.
The Docker daemon logs do not contain any notable information, so we had to restart the docker service on the affected node. (During the experiments we used Docker 1.12.3, but we have observed similar symptoms in 1.13 release candidates as well.)
OpenStack services don’t handle PXC pseudo-deadlocks
When run in parallel, create operations for many resources were failing with a DBError saying that Percona XtraDB Cluster had identified a deadlock and the transaction should be restarted.
oslo.db is responsible for wrapping errors received from the DB into proper classes so that services can restart transactions if similar errors occur, but it didn’t expect the error in the format that is being sent by Percona. After we fixed this, however, we still experienced similar errors, because not all transactions that could be restarted were properly decorated in Nova code.
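The pattern described above, where a database operation is wrapped so that it is restarted when the database reports a deadlock, can be sketched in plain Python as follows. This is not the oslo.db or Nova code: DeadlockError and retry_on_deadlock are hypothetical stand-ins used only to show the idea, and the retry count and delay are arbitrary.

```python
import functools
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for a 'deadlock found, restart transaction' DB error."""


def retry_on_deadlock(max_retries: int = 5, delay_s: float = 0.0):
    """Decorator that reruns a DB operation when it fails with DeadlockError."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except DeadlockError:
                    if attempt == max_retries:
                        raise  # out of retries, let the caller see the error
                    time.sleep(delay_s)
        return wrapper
    return decorator
```

The two failure modes from the text map onto this sketch directly: if the error raised by the database layer is not recognized as a deadlock, the except clause never fires, and if an operation is not decorated at all, nothing retries it even when the error is recognized.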
Upstream issue status: fixed
Live migration failed with live_migration_uri configuration
With the live_migration_uri configuration, live migrations fail because one compute host can’t connect to libvirt on another host.
We can’t specify which IP address to use in the live_migration_uri template, so it was trying to use the address from the first interface that happened to be in the PXE network, while libvirt listens on the private network. We couldn’t use the live_migration_inbound_addr, which would solve this problem, because of a problem in upstream Nova.
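A small Python sketch shows why the template alone cannot solve this. It assumes a template of the form qemu+tcp://%s/system with a single placeholder for the destination host; the template value and the host name below are illustrative assumptions, not values taken from our configuration.

```python
# Assumed example template with one placeholder for the destination host.
LIVE_MIGRATION_URI_TEMPLATE = "qemu+tcp://%s/system"


def build_migration_uri(template: str, dest_host: str) -> str:
    """Fill the single host placeholder in a live migration URI template."""
    return template % dest_host


# "compute-2" is a hypothetical destination host name.
print(build_migration_uri(LIVE_MIGRATION_URI_TEMPLATE, "compute-2"))
```

The only input is the destination host name, so which network the connection ends up on is decided by how that name resolves, not by anything in the template. That is why a separate setting for the inbound address is needed to point migrations at the private network.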