In the previous two parts (part 1, part 2) we explained the evolution of OpenStack deployment, which led to containerizing everything (including the deployment workflow), and showed how to build and orchestrate OpenStack on top of Kubernetes. We showed how easily OpenStack can be built, deployed, and upgraded in 10 minutes. We got amazing feedback from the OpenStack community, which pushes our limits further.
A 2-minute start or upgrade is awesome, but how does this setup behave under real workload, at scale, and during long-term operation? High Availability of OpenStack components is easy, but how do you run a MySQL Galera or RabbitMQ cluster on Kubernetes? How do you run Kubernetes itself in HA? Should we run Kubernetes components in manifests as well? These questions must be answered before you put your workload on this setup.
This blog post is divided into 3 sections. The first introduces a new OpenStack-Salt formula for Kubernetes deployment, followed by thoughts on the underlay architecture. The second covers High Availability for OpenStack support services. Finally, we present performance testing & scaling results from various scenarios.
OpenStack-Salt Kubernetes released
Recently our new OpenStack-Salt Kubernetes formula became official. This formula should provide a stable procedure for deploying, scaling, and managing a production-ready Kubernetes underlay.
At first glance you might ask: why would I use this Salt formula instead of the official one? To make it production ready, we had to provide the following functions:
- build debian-based packages instead of downloading binaries, init scripts, or reusing existing docker containers with unknown origin
- use a single or highly available cluster setup
- option to run control services under systemd, not only in manifests
- clean up the mix of bash, salt, and features unrelated to production
- automatically generate and orchestrate kubernetes manifests
- integrate with Calico, Flannel, and OpenContrail
- automatically generate labels and namespace
- manage native SSL certificates with Salt
- pull images from private docker registry with authentication
Only with these features can you deliver a supportable, operations-ready Kubernetes to the enterprise. The Kubernetes formula, like the whole OpenStack-Salt project, offers an open way to run a production-grade environment. If you are missing some features, please join us on IRC openstack-salt.
The basic architecture was shown and explained in part 2; however, high availability for Kubernetes itself was missing. Therefore, we worked mostly on this part during the last month. As already mentioned, we build packages from source for Ubuntu Xenial. Currently these are ETCD v3 and Kubernetes 1.2.x (1.3 will follow soon).
The minimum setup should be 5 nodes – 3 controllers and 2 computes (6 if the salt master runs on a separate node). To deploy these 5 nodes we use Ubuntu MaaS for bare-metal provisioning, which very easily deploys the OS with the salt-minion agent.
For large-scale environments with hundreds of computes, Kubernetes master nodes should be separated from the pools running the OpenStack control plane. Several notes from the Kubernetes HA setup:
- do not run kubernetes itself through manifests as containers – when docker crashes, the kubernetes api, scheduler, and etcd crash, too. We ran into several breakdowns when we hit a docker bug that brought down the whole cluster control plane.
- Keepalived for the VIP and HAProxy – use as simple a setup as possible, without corosync/pacemaker or other complex cluster tooling.
- HA on the client side – do not use haproxy as a workaround for etcd read/write operations.
- separate Calico ETCD from the default container – by default Calico launches its own etcd, which is not suitable for a production environment. Reuse the existing kubernetes ETCD cluster or launch another ETCD cluster just for Calico (the preferred way).
- do not use "--api-servers" HA for kubelet – multiple kube-api server addresses in the kubelet configuration look like HA on the client side. Unfortunately, kubelet takes just the first address.
All these points are included by default in OpenStack-Salt Kubernetes formula.
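The "HA on the client side" point can be illustrated with a minimal Python sketch. It is not the code used by the formula; the endpoint addresses and the `first_healthy` helper are hypothetical, and the health probe is passed in as a function so the failover logic stays independent of any particular etcd client library.

```python
from typing import Callable, Iterable, Optional


def first_healthy(endpoints: Iterable[str],
                  probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint for which probe() succeeds, or None."""
    for endpoint in endpoints:
        try:
            if probe(endpoint):
                return endpoint
        except OSError:
            # A connection error means this member is down; try the next one.
            continue
    return None


# Hypothetical etcd member URLs, one per controller (assumption).
ETCD_ENDPOINTS = [
    "http://10.0.0.11:2379",
    "http://10.0.0.12:2379",
    "http://10.0.0.13:2379",
]
```

The client holds the full member list and moves on when a member fails, so no load balancer sits in the etcd read/write path.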
High Availability for OpenStack Support services
As already mentioned, HA for OpenStack services running on Kubernetes is not so difficult, because the APIs are stateless. The stateful services are MySQL Galera, RabbitMQ, etc. In this case we cannot simply scale pods and use native iptables DNAT balancing. Kubernetes offers advanced external load balancing when it runs on GCE or AWS. Unfortunately, default stateful balancing for on-premise deployments is missing. Therefore, we have to extend the kubernetes cluster with HAProxy, which is not natively managed by kubectl. Let's take a look at several support services.
MySQL Galera is the trickiest part, and we spent some time dealing with several scenarios, especially how to load balance, bootstrap, restart, and restore a Galera cluster.
Kubernetes 1.3 introduces the PetSet resource for stateful services like galera, cassandra, etc. This should provide a deterministic hostname, persistent storage, and a bootstrap procedure. Unfortunately, there are still several limitations, the biggest of which is load balancing. On-premise Kubernetes does not offer the advanced balancing method that galera requires. This led us to stay with standard Service/Deployment resources and use an external HAProxy with a Keepalived Virtual IP.
The following diagram shows how the structure looks. Each galera node is defined by a Deployment resource with replica 1 and a specific controller label to pin the pod to a physical controller. Then a Service resource is created with a specific IP address, which is used in the HAProxy member configuration.
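As a rough illustration of the HAProxy member configuration, the Python sketch below renders a `listen` section from a map of Service names to Service IPs. The VIP, the IPs, the section name, and the active/backup layout (one member takes traffic, the others are marked `backup`) are assumptions made for the example, not the exact configuration generated by the formula.

```python
from typing import Dict


def galera_haproxy_section(vip: str, members: Dict[str, str],
                           port: int = 3306) -> str:
    """Render an HAProxy 'listen' section for a Galera cluster."""
    lines = [
        "listen mysql_cluster",
        f"    bind {vip}:{port}",
        "    mode tcp",
        "    option tcpka",
    ]
    for index, (name, address) in enumerate(sorted(members.items())):
        # Only the first member is active; the rest are failover targets.
        backup = "" if index == 0 else " backup"
        lines.append(f"    server {name} {address}:{port} check{backup}")
    return "\n".join(lines) + "\n"


# Hypothetical Keepalived VIP and Service IPs (assumptions).
example = galera_haproxy_section(
    "10.0.0.10",
    {"pxc-node1": "10.254.1.1", "pxc-node2": "10.254.1.2", "pxc-node3": "10.254.1.3"},
)
```

Because each Service has a fixed IP, the member list stays valid even when a pod is rescheduled.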
Galera also has a special bootstrap procedure, where the first node in the cluster must be started without a connection to other peers. Here we reused the existing approach from the official kubernetes sample, which loops through the kubernetes shared environment variables and sets WSREP_CLUSTER_ADDRESS based on the number of launched cluster nodes. This also enables easy Disaster Recovery, when we have to start the cluster from any node, not just the first one.
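The loop over the shared environment variables can be sketched in Python as follows. Kubernetes exposes each Service to pods as `<NAME>_SERVICE_HOST`; the Service naming (`pxc-node1`, `pxc-node2`, ...) and the `own_ip` parameter are assumptions for this example, and the real entrypoint is a shell script rather than this code.

```python
import re
from typing import Mapping

# Assumption: Galera Services are named pxc-node1, pxc-node2, ..., so pods see
# PXC_NODE1_SERVICE_HOST, PXC_NODE2_SERVICE_HOST, ... in their environment.
SERVICE_HOST_PATTERN = re.compile(r"^PXC_NODE\d+_SERVICE_HOST$")


def wsrep_cluster_address(env: Mapping[str, str], own_ip: str = "") -> str:
    """Build a gcomm:// cluster address from the peer Services visible in env."""
    peers = sorted(
        value
        for key, value in env.items()
        if SERVICE_HOST_PATTERN.match(key) and value != own_ip
    )
    # With no peers the result is an empty gcomm://, which makes Galera
    # bootstrap a new cluster instead of joining an existing one.
    return "gcomm://" + ",".join(peers)
```

In a container this would be called with `os.environ`. Any node that starts with no visible peers bootstraps the cluster, which is what makes recovery from an arbitrary node possible.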
Persistent storage is provided by hostPath volume from physical host. Kubernetes schedules galera pod on a specific host through node selector and node label.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: pxc-node1
...
        volumeMounts:
        - name: mysql
          mountPath: /var/lib/mysql    # Container mountpoint
          readOnly: False
      volumes:
      - name: mysql
        hostPath:
          path: /var/lib/mysql         # Host mount path
...
      nodeSelector:
        openstack: controller01        # Label of controller01
A RabbitMQ cluster is easier to install on Kubernetes than Galera. Load balancing on HAProxy is not required, because OpenStack services can use the parameter rabbitmq_hosts with a list of cluster members. We had to figure out how to bootstrap the cluster and how to get a persistent hostname, because the cluster is hostname specific and we cannot rely on a randomly generated hostname.
RabbitMQ cluster bootstrap requires set_cluster_name on the first node; after that the other nodes can join this cluster. We had to put this logic into the container entrypoint. The entrypoint.sh contains a loop, which sets a role – master or slave – in the salt pillar based on the existing number of rabbitmq service hosts in the kubernetes shared environment variables. Then salt highstate sets up all other parameters required for the rabbitmq cluster, like users, HA queues, etc.
The second issue was the hostname, which must be known before the pod starts. Kubernetes metadata annotations make it possible to predefine a specific hostname for a pod. Then we can reference this hostname in the cluster setup.
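The role decision made in the entrypoint can be sketched in Python like this. The Service naming (`rabbitmq-server-node01`, ...) and the exact rule (at most one visible Service means this is the first node) are assumptions for illustration; the original logic lives in entrypoint.sh and writes the result into the salt pillar.

```python
import re
from typing import Mapping

# Assumption: RabbitMQ Services are named rabbitmq-server-node01, -node02, ...,
# which Kubernetes exposes as RABBITMQ_SERVER_NODE01_SERVICE_HOST and so on.
RABBIT_HOST_PATTERN = re.compile(r"^RABBITMQ_SERVER_NODE\d+_SERVICE_HOST$")


def rabbitmq_role(env: Mapping[str, str]) -> str:
    """Return 'master' for the first node to start and 'slave' for later ones."""
    known_hosts = [key for key in env if RABBIT_HOST_PATTERN.match(key)]
    # Assumed rule: if at most one RabbitMQ Service is visible, nobody else
    # has been launched yet, so this node creates the cluster.
    return "master" if len(known_hosts) <= 1 else "slave"
```

The returned value stands in for the pillar role that the highstate run then consumes.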
apiVersion: extensions/v1beta1
kind: Deployment
...
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: rabbitmq-server-node01
      annotations:
        pod.beta.kubernetes.io/hostname: rabbitmq-server-node01    # Hostname definition
...
Disaster Recovery for RabbitMQ does not require specific manual action like in the case of galera. So containers can be simply restarted without impact.
OpenContrail is very difficult to containerize because of its huge complexity. It contains around 15 services running in 5 supervisors – analytics, control, config, database, and webui. These services can run as microservices, as shown by this repository maintained and developed by Michael Henkel. However, it covers only building the containers and running them as a single setup without high availability.
Besides HA, the problem is that the process status event checks provided by Contrail Nodemgr are missing. Events from nodemgr are collected and displayed in the Contrail WebUI. This utility depends on specific supervisor names and in some cases on its own init scripts, such as contrail-database (a replacement for the cassandra init script). For this reason we decided to run one Kubernetes Deployment pod per OpenContrail supervisor role, even for cassandra, kafka, and zookeeper. This goes against the microservice principle of one service per container, because we have to run supervisor in the foreground with several services, as shown in the following output from the opencontrail-control pod. However, we are able to deliver green status and a fully functional OpenContrail, the same as in a standard server deployment.
root@opencontrail-control-550694332-8bf3m:/# supervisorctl -s unix:///tmp/supervisord_control.sock status
contrail-control            RUNNING    pid 164, uptime 5 days, 7:38:25
contrail-control-nodemgr    RUNNING    pid 163, uptime 5 days, 7:38:25
contrail-dns                RUNNING    pid 165, uptime 5 days, 7:38:25
contrail-named              RUNNING    pid 315, uptime 5 days, 7:38:22
In the end, we run the following list of kubernetes deployments to get OpenContrail in High Availability. There are 2 cassandra clusters, for the opencontrail config (oc-database0X) and analytics (oc-nal-database0X) databases. The Service/Deployment components follow an approach similar to the one used for the galera cluster.
root@kvm01:~# kubectl get deployment
NAME                     DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
opencontrail-collector   3         3         3            3           5d
opencontrail-config      3         3         3            3           5d
opencontrail-control     3         3         3            3           5d
opencontrail-web         1         1         1            1           5d
oc-database01            1         1         1            1           5d
oc-database02            1         1         1            1           5d
oc-database03            1         1         1            1           5d
oc-nal-database01        1         1         1            1           5d
oc-nal-database02        1         1         1            1           5d
oc-nal-database03        1         1         1            1           5d

root@kvm02:~# kubectl get services
NAME                     CLUSTER-IP      PORT(S)
opencontrail-collector   10.254.77.56    8081/TCP,8086/TCP,6379/TCP
opencontrail-config      10.254.0.19     8082/TCP,5998/TCP,8443/TCP
opencontrail-control     10.254.25.20    8083/TCP,53/TCP,5269/TCP,8092/TCP,8093/TCP
opencontrail-web         10.254.0.21     8080/TCP,8143/TCP
oc-database01            10.254.29.161   9160/TCP,9042/TCP,7000/TCP,2181/TCP,9092/TCP,2888/TCP,3888/TCP
oc-database02            10.254.107.22   9160/TCP,9042/TCP,7000/TCP,2181/TCP,9092/TCP,2888/TCP,3888/TCP
oc-database03            10.254.2.227    9160/TCP,9042/TCP,7000/TCP,2181/TCP,9092/TCP,2888/TCP,3888/TCP
oc-nal-database01        10.254.99.190   9160/TCP,9042/TCP,7000/TCP,2181/TCP,9092/TCP,2888/TCP,3888/TCP
oc-nal-database02        10.254.38.94    9160/TCP,9042/TCP,7000/TCP,2181/TCP,9092/TCP,2888/TCP,3888/TCP
oc-nal-database03        10.254.118.49   9160/TCP,9042/TCP,7000/TCP,2181/TCP,9092/TCP,2888/TCP,3888/TCP
We hope that in the future supervisor will be replaced by native systemd and nodemgr will be rewritten to be more flexible when running in docker containers.
The last modification is a manually set ifmap server URL in /etc/contrail/contrail-control.conf, pointing to the contrail-config service instead of reading it from contrail-discovery. This seems to be a bug that normally goes undiscovered, because control and config usually run on the same server.
...
[IFMAP]
server_url=https://10.254.0.19:8443    # contrail-config service ip
...
Nova Compute & Libvirt
Running libvirt and nova-compute in a container requires several configuration changes, but it is not difficult. There are a couple of blogs about this, e.g. for Atomic. The difficult part was keeping Virtual Machines persistent and running after the container crashes, and still managing them after libvirt starts again. Let’s take a look at the required configuration.
Switch the nova-compute connection to libvirt from a socket to TCP. These are separate containers and we do not want to share /var/run for the socket. This setup also makes it theoretically possible to launch nova-compute on a different host.
# Set libvirt to TCP instead of socket in entrypoint.sh
crudini --set /etc/nova/nova.conf libvirt connection_uri qemu+tcp://$NOVA_COMPUTE_LOCAL_HOST/system
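For readers without crudini at hand, the same change can be approximated with Python's standard configparser. This is only a sketch of the idea: the `set_libvirt_tcp` helper is hypothetical, configparser drops comments when it rewrites a file, and it rejects duplicate option names, so it is not a drop-in replacement for crudini on every nova.conf.

```python
import configparser
import io


def set_libvirt_tcp(nova_conf_text: str, host: str) -> str:
    """Return nova.conf content with [libvirt] connection_uri set to a TCP URI."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(nova_conf_text)
    if not parser.has_section("libvirt"):
        parser.add_section("libvirt")
    parser.set("libvirt", "connection_uri", f"qemu+tcp://{host}/system")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# The host value stands in for $NOVA_COMPUTE_LOCAL_HOST from the entrypoint.
updated = set_libvirt_tcp("[DEFAULT]\ndebug = True\n", "127.0.0.1")
```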
Localhost is used as an address for the connection to libvirt; therefore, host networking must be configured in nova-compute manifest.
...
spec:
  hostNetwork: True
...
Libvirt and nova-compute require access to multiple host directories, especially with the OpenContrail vRouter plugin. Containers have to run in privileged mode to be able to access these directories.
# Libvirt
securityContext:
  privileged: True
volumeMounts:
- name: nova-instances
  mountPath: /var/lib/nova/instances
  readOnly: False
- name: modules
  mountPath: /lib/modules
  readOnly: True
- name: libvirt
  mountPath: /var/lib/libvirt
  readOnly: False
- name: cgroups
  mountPath: /sys/fs/cgroup
  readOnly: False
- name: qemu
  mountPath: /etc/libvirt/qemu
  readOnly: False
- name: run
  mountPath: /run
  readOnly: False

# Nova Compute
securityContext:
  privileged: True
volumeMounts:
- name: nova-instances
  mountPath: /var/lib/nova/instances
  readOnly: False
- name: lib
  mountPath: /usr/lib/python2.7/dist-packages/vnc_api/
  readOnly: True
- name: cfgm
  mountPath: /usr/lib/python2.7/dist-packages/cfgm_common/
  readOnly: True
After all these configurations are done, you can run libvirt and nova-compute in a single pod with 2 containers, and instances can be provisioned. The problem is: what happens if your container crashes? The VM crashes, too. Therefore, the libvirt container has to use the host’s PID namespace. This is the --pid=host parameter in a standard docker run. The Kubernetes manifest contains the following spec:
...
spec:
  hostPID: True
...
Then you can try to kill the libvirt container and re-run it; the instance is still listed as running:
root@node098:~# docker exec -it 54d7b8ee750a virsh list
 Id    Name                           State
----------------------------------------------------
 22    instance-0000076c              running
This works because your VM runs as a qemu process in the host PID namespace.
root@node098:~# ps -ef | grep qemu
root      4444 61150  5 Aug01 ?        02:13:41 qemu-system-x86_64 -enable-kvm -name instance-0000076c -S -machine pc-i440fx-trusty ........
Now that we have a fully highly available, scaled OpenStack and Kubernetes, we can move on to performance testing & scaling.
Benchmark testing & scaling
In the previous blog posts we showed how to do a live upgrade in 2 minutes without any impact on running VMs. Now we would like to see how the environment behaves under load and at a scale of 50 compute nodes. In this section we describe the solution for collection and visualization of metrics and demonstrate a rally task for nova boot/delete of 1000 instances.
Physical infrastructure provisioning
Before that, let’s talk more about the compute node provisioning. We deploy all 50 computes with Ubuntu MaaS, with salt-minion preconfigured through curtin_userdata. Then salt automatically configures each Kubernetes node with Calico and OpenContrail. Finally, we automatically launch the nova-compute/libvirt deployment, which starts a pod with 2 docker containers. This whole procedure takes about 40 minutes for 50 computes and is fully automated.
root@kvm02:~# docker exec -it 69140b5696b6 nova-manage service list
Binary            Host                               Zone       Status    State
nova-scheduler    nova-controller-3741740494-qxkfz   internal   enabled   :-)
nova-conductor    nova-controller-3741740494-qxkfz   internal   enabled   :-)
nova-consoleauth  nova-controller-3741740494-qxkfz   internal   enabled   :-)
# 3 replicas for control services
nova-conductor    nova-controller-3741740494-0kj30   internal   enabled   :-)
nova-cert         nova-controller-3741740494-0kj30   internal   enabled   :-)
nova-consoleauth  nova-controller-3741740494-0kj30   internal   enabled   :-)
nova-scheduler    nova-controller-3741740494-0kj30   internal   enabled   :-)
nova-compute      node071                            nova       enabled   :-)
nova-compute      node062                            nova       enabled   :-)
nova-compute      node068                            nova       enabled   :-)
nova-compute      node067                            nova       enabled   :-)
nova-compute      node065                            nova       enabled   :-)
nova-compute      node070                            nova       enabled   :-)
nova-compute      node073                            nova       enabled   :-)
nova-compute      node075                            nova       enabled   :-)
# 50 computes
nova-compute      node095                            nova       enabled   :-)
nova-compute      node105                            nova       enabled   :-)
nova-compute      node102                            nova       enabled   :-)
nova-compute      node106                            nova       enabled   :-)
Cluster Performance Monitoring & Visualization
Performance monitoring and visualization have two parts – Kubernetes workload and Underlay hosts.
Kubernetes provides native support for Heapster, which enables Container Cluster Monitoring and Performance Analysis. We launch Heapster through manifests as an addon. By default, it stores metrics in the InfluxDB backend, which can run as container, too. We had some issues with upstream InfluxDB docker image, so we decided to reuse our external production InfluxDB cluster.
Underlay hosts use collectd to collect system performance statistics periodically and send them to Graphite. Graphite is another type of time series database. Host metric collection is configured automatically by OpenStack-Salt during setup.
Both sources – InfluxDB and Graphite – are visualized in Grafana, which is a very elegant and powerful way to create, explore, and share dashboards. We created 3 dashboards in Grafana to provide detailed performance analysis for our OpenStack benchmarking. All dashboards are shown in the following benchmark section with real workload.
- Overall underlay hosts – dashboard per host or global dashboard with an average load, network traffic, disk I/O, used memory, etc. These charts enable us to see the impact on physical nodes.
- K8S Cluster – predefined dashboard from upstream k8s Grafana with basic information from physical nodes. It does not provide as detailed information as the previous dashboard.
- K8S Pods – show graphs per pod with CPU usage, individual memory usage, individual network usage, and filesystem usage.
OpenStack Rally was used for benchmark testing and reports. We decided to show only one rally task in this post due to space, and to prepare an extra post just for benchmarking. We used a task for nova boot and delete of an instance with the cirros image, 1000 times, with concurrency 50, in 3 tenants with 2 users per tenant. All testing was done on the OpenStack Mitaka release with OpenContrail 3.0.2 as the Neutron plugin.
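A task with these parameters could look roughly like the following, written as a Python dictionary and dumped to the JSON format Rally accepts. This is a sketch, assuming the stock NovaServers.boot_and_delete_server scenario was used; the flavor and image names are placeholders, since the post does not state them.

```python
import json

# Runner and context values come from the text above; flavor and image
# names are assumptions made for this example.
task = {
    "NovaServers.boot_and_delete_server": [
        {
            "args": {
                "flavor": {"name": "m1.tiny"},  # assumed flavor
                "image": {"name": "cirros"},    # assumed image name
            },
            "runner": {"type": "constant", "times": 1000, "concurrency": 50},
            "context": {"users": {"tenants": 3, "users_per_tenant": 2}},
        }
    ]
}


def render_task(task_definition: dict) -> str:
    """Serialize a task definition to JSON for use as a Rally task file."""
    return json.dumps(task_definition, indent=2)


task_json = render_task(task)
```

The output of `render_task` can be saved to a file and passed to `rally task start`.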
The rally output shows that booting and deleting 1000 instances with concurrency 50 took 706 seconds, and we hit 6 failures. We attribute these failures to statistical error in rally. We do at least 10 runs with different inputs and trace the logs.
As you can see, Rally provides nice graphical reports from testing, but it does not show what the system load looks like or where the bottleneck might be. Therefore, we have several dashboards in Grafana; the first one shows performance metrics from a nova controller pod during the rally test. This pod consists of 6 docker containers. As you can see, the highest CPU usage comes from nova-conductor and nova-api.
We compare these numbers with the information from the other 2 nova pods and see that the numbers are almost the same. Mapping them against support services like the galera or rabbitmq pods also shows the relationship between them.
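To put these numbers in perspective, the short calculation below derives the success rate and throughput from the reported figures. The mean iteration time is only an estimate, assuming all 50 concurrency slots stayed busy for the whole run.

```python
def rally_summary(iterations: int, failures: int, duration_s: float,
                  concurrency: int) -> dict:
    """Derive basic rates from the totals of a Rally run."""
    successes = iterations - failures
    return {
        "success_rate_pct": 100.0 * successes / iterations,
        "iterations_per_second": iterations / duration_s,
        # Estimate only: assumes every concurrency slot was always occupied.
        "est_seconds_per_iteration": concurrency * duration_s / iterations,
    }


summary = rally_summary(iterations=1000, failures=6, duration_s=706, concurrency=50)
```

This gives a success rate of 99.4%, roughly 1.4 boot/delete iterations per second, and an estimated 35 seconds per iteration.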
The following dashboard shows CPU overall for underlay physical hosts. As you can see, the second controller (kvm02) has the highest usage of CPU.
The following screen shows the overall dashboard from the underlay hosts. These dashboards help us to identify disk I/O requests on galera and to find bottlenecks on the physical layer.
We would like to provide the detailed regression analysis and information from testing in an extra blog post very soon. Check the conclusion for more information.
In this blog, we introduced a powerful Salt formula for Kubernetes deployment and orchestration, then we shared the most important thoughts from running OpenStack on Kubernetes in High Availability. Finally, we shared how to do benchmarking with several reports.
Further details from the performance analysis and benchmarking will be part of the next post. We have a lot of information that we cannot get into in this post due to space restrictions. Therefore, we would like to invite the community to give us ideas about what they want to know or see.
Help us to bring interesting analysis and stability to OpenStack on Kubernetes. Do you want to see how long it takes us to provision 1M instances? Or what happens to reliability when we cut off one of the cluster controllers? Write down your ideas in the following etherpad https://etherpad.openstack.org/p/openstack-kubernetes-benchmark . You can also register for the community live webinar, where we can demonstrate this testing.
We will take these ideas and prepare the next awesome blog post with the results from the testing. #WeAreOpenStack
Jakub Pavlik & Marek Celoud