OpenStack on Kubernetes Installation with Close to 1000 Nodes: What We Discovered
This guide shares our findings to support your future OpenStack on Kubernetes integrations. It is especially useful for enterprise users interested in development projects that run multiple clouds or networks.
Overall, we deployed an OpenStack cluster of more than 900 nodes using Fuel-CCP on a Kubernetes cluster that had been deployed using Kargo. The Kargo tool is part of the Kubernetes Incubator project and uses the Large Kubernetes Cluster reference architecture as a baseline. As we worked, we documented the issues we found and contributed fixes to both the deployment tool and the reference design document where appropriate. Here's what we found.
The setup
We started with just over 175 bare metal machines. Three of them were allocated to Kubernetes control plane services (API servers, etcd, the Kubernetes scheduler, etc.), and each of the others hosted 5 virtual machines. These VMs were then used as Kubernetes minion nodes. Each bare metal node had the following specifications:
- HP ProLiant DL380 Gen9
- CPU - 2x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
- RAM - 264G
- Storage - 3.0T on RAID on HP Smart Array P840 Controller, HDD - 12 x HP EH0600JDYTL
- Network - 2x Intel Corporation Ethernet 10G 2P X710
The running OpenStack cluster consisted of:
- OpenStack control plane services running on close to 150 pods over 6 nodes
- Close to 4500 pods spread across all of the remaining nodes, at 5 pods per minion node
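As a rough consistency check on the figures above, the pod counts can be turned back into node counts. This is only a back-of-the-envelope sketch: the variable names are ours, and the inputs are the rounded numbers quoted in this section, so the results are approximate.

```python
# Back-of-the-envelope sizing from the rounded figures quoted above.
# All inputs are approximations taken from the text, not measured values.
WORKLOAD_PODS = 4500        # "close to 4500 pods"
PODS_PER_MINION = 5         # "5 pods per minion node"
VMS_PER_BARE_METAL = 5      # "5 virtual machines on each node"
CONTROL_PLANE_MACHINES = 3  # bare metal machines for the Kubernetes control plane

# Minion nodes needed to hold the workload pods.
minion_nodes = WORKLOAD_PODS // PODS_PER_MINION

# Bare metal hosts needed for those VMs (ceiling division).
vm_hosts = -(-minion_nodes // VMS_PER_BARE_METAL)

bare_metal_total = vm_hosts + CONTROL_PLANE_MACHINES

print(f"minion nodes:        {minion_nodes}")
print(f"bare metal VM hosts: {vm_hosts}")
print(f"bare metal total:    {bare_metal_total}")
```

With these inputs the estimate lands at about 900 minion nodes on roughly 180 VM hosts, which is in the same range as the "more than 900 nodes" and "just over 175 bare metal machines" mentioned earlier.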
One major Prometheus problem
During the experiments we used the Prometheus monitoring tool to verify resource consumption and the load put on the core networking system, OpenStack, and Kubernetes services. One note of caution if you're a Prometheus user: deleting old data from Prometheus storage will indeed improve the Prometheus API speed, but it will also delete any previous cluster information, making it unavailable for post-run investigation. So make sure to document any observed issue and its debugging thoroughly! Thankfully, we had done that documentation, but going forward we've decided to prevent this problem by configuring Prometheus to back up data to one of the persistent time series databases it supports, such as InfluxDB, Cassandra, or OpenTSDB. By default, Prometheus is optimized for use as a real-time monitoring and alerting system, and the Prometheus developers officially recommend keeping monitoring data for only about 15 days so that the tool stays quick and responsive. With the backup in place, we can store old data for an extended period for post-processing.
Problems we experienced in our OpenStack on Kubernetes testing
Huge load on kube-apiserver
Symptoms
Initially, all nodes (including the Kubernetes control plane nodes) ran in a virtualized environment, but under that load the API servers couldn't function at all, so they were moved to bare metal. Even after we migrated them to hardware nodes, both API servers running in the Kubernetes cluster were still utilising up to 2000% of the available CPU (up to 45% of total node compute capacity).
Root cause
All services that are not on Kubernetes masters (kubelet and kube-proxy on all minions) access kube-apiserver via a local NGINX proxy. Most of those requests are watch requests that lie mostly idle after they are initiated (most of their timeouts are defined to be about 5-10 minutes). NGINX was configured to cut idle connections after 3 seconds, which caused all clients to reconnect and (even worse) restart aborted SSL sessions. On the server side, this made kube-apiserver consume up to 2000% of the CPU resources, making other requests very slow.
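To get a feel for why a 3-second idle timeout hurts so much, here is a rough estimate of the reconnect rate it produces. This is a simplified sketch, not a measurement: it assumes about 900 minion nodes and, hypothetically, a single idle watch each from kubelet and kube-proxy (the real number of watches per node is likely higher), and the function name is ours.

```python
def reconnects_per_second(clients: int, idle_timeout_s: float) -> float:
    """Steady-state reconnect rate if every idle connection is cut after idle_timeout_s."""
    if idle_timeout_s <= 0:
        raise ValueError("idle_timeout_s must be positive")
    return clients / idle_timeout_s


# Assumption: ~900 minion nodes, one idle watch from kubelet and one from kube-proxy.
NODES = 900
IDLE_WATCHES_PER_NODE = 2
clients = NODES * IDLE_WATCHES_PER_NODE

for label, timeout_s in [("3 s proxy timeout", 3), ("10 min proxy timeout", 600)]:
    rate = reconnects_per_second(clients, timeout_s)
    print(f"{label}: ~{rate:.0f} reconnects (and SSL handshakes) per second")
```

Under these assumptions the proxy forces hundreds of new SSL sessions per second at a 3-second timeout, compared with a handful per second at 10 minutes, which fits the CPU load described above.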
Solution
Set the proxy_timeout parameter to 10 minutes in the nginx.conf configuration file, which should be more than long enough to avoid cutting SSL connections before the requests time out by themselves. After this fix was applied, one API server consumed only 100% of CPU (about 2% of total node compute capacity), while the second consumed about 200% (about 4% of total node compute capacity), with an average response time of 200-400 ms.
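A small check like the following could be used to confirm that a rendered nginx.conf keeps idle connections open longer than the longest watch. It is a minimal sketch under stated assumptions: the helper names and the sample configuration fragment are hypothetical (this is not the template Kargo generates), and only the s, m, and h time suffixes are handled.

```python
import re

# Seconds per unit for the subset of nginx time suffixes handled here.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def proxy_timeout_seconds(conf_text: str):
    """Return the first proxy_timeout value in seconds, or None if not found."""
    match = re.search(
        r"^\s*proxy_timeout\s+(\d+)(s|m|h)?\s*;", conf_text, re.MULTILINE
    )
    if match is None:
        return None
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit or "s"]  # a bare number means seconds


def outlives_watches(conf_text: str, longest_watch_s: int = 600) -> bool:
    """True if the proxy keeps idle connections at least as long as the longest watch."""
    timeout = proxy_timeout_seconds(conf_text)
    return timeout is not None and timeout >= longest_watch_s


# Hypothetical configuration fragment, for illustration only.
SAMPLE_CONF = """
stream {
    server {
        listen 127.0.0.1:443;
        proxy_pass kube_apiserver;
        proxy_timeout 10m;
    }
}
"""

print(proxy_timeout_seconds(SAMPLE_CONF))
print(outlives_watches(SAMPLE_CONF))
print(outlives_watches(SAMPLE_CONF.replace("10m", "3s")))
```

The default threshold of 600 seconds corresponds to the upper end of the 5-10 minute watch timeouts mentioned above; adjust it if your clients use longer watches.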
Upstream issue status: fixed
Make the Kargo deployment tool set proxy_timeout to 10 minutes: issue fixed with a pull request by the Fuel CCP team.
KubeDNS cannot handle large cluster load with default settings
Symptoms
When deploying OpenStack on Kubernetes clusters at this scale, kubedns becomes unresponsive because of the huge load. This ends up with a slew of errors appearing in the logs of the dnsmasq container in the kubedns pod:
Maximum number of concurrent DNS queries reached.
Also, dnsmasq containers sometimes get restarted after hitting their memory limit.
Root cause
First of all, kubedns seems to fail often in this architecture, even without load. During the experiment we observed continuous kubedns container restarts even on an empty (but large enough) Kubernetes cluster. The restarts are caused by a failing liveness check, although nothing notable shows up in any logs. Second, dnsmasq should have taken the load off kubedns, but it needs some tuning to behave as expected (or, frankly, at all) under large loads.
Solution
Fixing this problem requires changes at several levels:
- Set higher limits for dnsmasq containers: they take on most of the load.
- Add more replicas to the kubedns replication controller (we stopped at 6 replicas, as that solved the observed issue; bigger clusters may need even more).
- Increase the number of parallel connections dnsmasq should handle (we used --dns-forward-max=1000, which is the setting recommended in the dnsmasq manuals).
- Increase the size of the dnsmasq cache: it has a hard limit of 10000 cache entries, which seems to be a reasonable amount.
- Fix kubedns to handle this behaviour in the proper way.
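The dnsmasq tuning from the list above could be captured in a small helper that builds the container arguments. This is an illustrative sketch, not the actual kubedns manifest: the helper name and defaults are ours, and it covers only the two dnsmasq options discussed here (--dns-forward-max and --cache-size).

```python
# Hard limit on dnsmasq cache entries, as stated in the text above.
DNSMASQ_MAX_CACHE_SIZE = 10000


def build_dnsmasq_args(forward_max: int = 1000,
                       cache_size: int = DNSMASQ_MAX_CACHE_SIZE) -> list:
    """Build dnsmasq command-line arguments for a heavily loaded cluster."""
    if forward_max <= 0:
        raise ValueError("forward_max must be positive")
    if not 0 <= cache_size <= DNSMASQ_MAX_CACHE_SIZE:
        raise ValueError(
            f"cache_size must be between 0 and {DNSMASQ_MAX_CACHE_SIZE}"
        )
    return [
        f"--dns-forward-max={forward_max}",  # max concurrent forwarded queries
        f"--cache-size={cache_size}",        # number of cached DNS entries
    ]


print(" ".join(build_dnsmasq_args()))
```

The resulting list would be appended to the dnsmasq container's arguments in the kubedns pod spec, alongside the raised resource limits and replica count.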