Making OpenStack Production Ready with Kubernetes and OpenStack-Salt – Part 3

In the previous two parts (part 1, part 2) we explained the evolution of OpenStack deployment, which led to containerizing everything (the deployment workflow), and showed how to build and orchestrate OpenStack on top of Kubernetes. We showed how easily OpenStack can be built, deployed, and upgraded in 10 minutes. We got amazing feedback from the OpenStack community, which pushes our limits further.

A 2 minute start or upgrade is awesome, but how does this setup behave under real workload, at scale, and during long-term operation? High Availability of OpenStack components is easy, but how do you run a MySQL Galera or RabbitMQ cluster on Kubernetes? How do you run Kubernetes itself in HA? Should Kubernetes components run in manifests as well? These questions must be answered before you put your workload on this setup.

This blog post is divided into 3 sections. The first introduces a new OpenStack-Salt formula for Kubernetes deployment, followed by notes on the underlay architecture. The second covers High Availability for OpenStack support services. Finally, we present performance testing & scaling results from various scenarios.

OpenStack-Salt Kubernetes released

Recently our new OpenStack-Salt Kubernetes formula became official. The formula is meant to provide a stable procedure for deploying, scaling, and managing a production-ready Kubernetes underlay.

At first glance you might ask: why would I use this Salt formula instead of the official one? To make the deployment production ready, we had to provide the following functions:

  • build Debian-based packages instead of downloading binaries, init scripts, or reusing existing Docker containers of unknown origin
  • support both single-node and highly available cluster setups
  • offer the option to run control services as host system services, not only in manifests
  • clean up the mix of bash, Salt, and features unrelated to production
  • automatically generate and orchestrate Kubernetes manifests
  • integrate with Calico, Flannel, and OpenContrail
  • automatically generate labels and namespaces
  • manage native SSL certificates with Salt
  • pull images from a private Docker registry with authentication

Only with these features can you deliver a supportable and operations-ready Kubernetes to the enterprise. The Kubernetes formula, like the whole of OpenStack-Salt, offers an open way to run a production-grade environment. If you are missing some features, please join us on the openstack-salt IRC channel.

Underlay Architecture

The basic architecture was shown and explained in part 2; however, high availability for Kubernetes itself was missing. Therefore, we worked mostly on this part over the last month. As already mentioned, we build packages from source for Ubuntu Xenial. Currently we ship etcd v3 and Kubernetes 1.2.x (1.3 will follow soon).

The minimum setup should be 5 nodes – 3 controllers and 2 computes (6 if the Salt master is separated). To deploy these 5 nodes we use Ubuntu MaaS for bare-metal provisioning, which very easily deploys the OS together with the salt-minion agent.

../_images/ha-logical-topology.png

For large-scale environments with hundreds of computes, Kubernetes master nodes should be separated from the pools that run the OpenStack control plane. Several notes from our Kubernetes HA setup:

  • do not run Kubernetes itself through manifests as containers – when Docker crashes, the Kubernetes API, scheduler, and etcd crash, too. We ran into several failures when a Docker bug broke the whole cluster control plane.
  • Keepalived for the VIP, plus HAProxy – keep the setup as simple as possible, without corosync/pacemaker or other complex cluster tooling.
  • HA on the client side – do not use HAProxy as a workaround for etcd read/write operations; pass all etcd members to the client instead. Sample: --etcd-servers=http://10.0.111.201:4001,http://10.0.111.202:4001,http://10.0.111.203:4001
  • separate Calico etcd from the default container – by default Calico launches its own etcd, which is not suitable for a production environment. Reuse the existing Kubernetes etcd cluster or launch another etcd cluster just for Calico (the preferred way).
  • do not rely on --api-servers for kubelet HA – listing multiple kube-apiserver addresses in the kubelet configuration looks like HA on the client side. Unfortunately, kubelet uses just the first address.

All these points are included by default in the OpenStack-Salt Kubernetes formula.
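As a rough illustration of the client-side HA point, the sketch below builds the --etcd-servers value from a member list and picks the first member that answers a health probe. The function names and the probe callback are hypothetical and are not part of the formula; a real client would probe over HTTP.

```python
from typing import Callable, Iterable, Sequence


def etcd_servers_flag(hosts: Sequence[str], port: int = 4001) -> str:
    """Build the comma-separated --etcd-servers value from member addresses."""
    return "--etcd-servers=" + ",".join(f"http://{host}:{port}" for host in hosts)


def first_healthy(endpoints: Iterable[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first endpoint that passes the probe (client-side failover).

    `is_healthy` is a caller-supplied probe; it is an assumption of this
    sketch, not an API of etcd or Kubernetes.
    """
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy etcd endpoint")


members = ["10.0.111.201", "10.0.111.202", "10.0.111.203"]
print(etcd_servers_flag(members))
```

Because every client knows all members, losing one etcd node needs no load balancer in the path.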

High Availability for OpenStack Support services

As already mentioned, HA for OpenStack services running on Kubernetes is not so difficult, because the APIs are stateless. The stateful services are MySQL Galera, RabbitMQ, etc. In their case we cannot simply scale pods and use the native iptables DNAT balancing. Kubernetes offers advanced external balancing when it runs on GCE or AWS. Unfortunately, default stateful balancing for on-premise deployments is missing. Therefore, we have to extend the Kubernetes cluster with HAProxy, which is not natively managed by kubectl. Let's take a look at several support services.

MySQL Galera

MySQL Galera is the trickiest part, and we spent some time working through several scenarios, especially how to load balance, bootstrap, restart, and restore a Galera cluster.

Kubernetes 1.3 introduces the PetSet resource for stateful services like galera, cassandra, etc. It should provide deterministic hostnames, persistent storage, and a bootstrap procedure. Unfortunately there are still several limitations, the biggest of which is load balancing. On-premise Kubernetes does not offer the advanced balancing method that galera requires. This led us to stay with the standard Service/Deployment resources and use an external HAProxy with a Keepalived virtual IP.

The following diagram shows how the structure looks. Each galera node is defined by a Deployment resource with 1 replica and a specific controller label that pins the pod to a physical controller. A Service resource is then created with a specific IP address, which is used in the HAProxy member configuration.

../_images/galera-svc.png
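To make the HAProxy member configuration more concrete, here is a minimal sketch that renders a listen section from the per-node Service IPs. The VIP, the Service IPs, the section name, and the choice of one active member with the rest marked backup are all assumptions for illustration; the post does not show the actual HAProxy configuration.

```python
from typing import Sequence, Tuple


def render_galera_listen(vip: str, members: Sequence[Tuple[str, str]], port: int = 3306) -> str:
    """Render an HAProxy 'listen' section for a Galera cluster.

    `members` is a list of (name, service_ip) pairs. The first member takes
    traffic and the others are 'backup' servers (an assumed policy that keeps
    writes on a single node).
    """
    lines = [
        "listen mysql_cluster",          # hypothetical section name
        f"  bind {vip}:{port}",          # VIP held by Keepalived
        "  mode tcp",
        "  option tcpka",
    ]
    for index, (name, ip) in enumerate(members):
        suffix = "" if index == 0 else " backup"
        lines.append(f"  server {name} {ip}:{port} check{suffix}")
    return "\n".join(lines) + "\n"


# Hypothetical addresses, for illustration only.
print(render_galera_listen(
    "10.0.111.200",
    [("pxc-node1", "10.254.1.1"), ("pxc-node2", "10.254.1.2"), ("pxc-node3", "10.254.1.3")],
))
```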

Galera also has a special bootstrap procedure, in which the first node in the cluster must be started without a connection to other peers. Here we reused the existing approach from the official Kubernetes sample: the entrypoint loops through the Kubernetes shared environment variables and sets WSREP_CLUSTER_ADDRESS based on the number of cluster nodes already launched. This also enables easy disaster recovery, when we have to start the cluster from any node, not just the first one.
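The bootstrap logic can be sketched roughly as follows. This is not the sample's actual entrypoint: it assumes Services named pxc-node1 to pxc-node3, for which Kubernetes injects variables of the form PXC_NODE1_SERVICE_HOST into pods started after the Service exists, and the function names are hypothetical.

```python
from typing import Mapping


def service_env_prefix(service_name: str) -> str:
    """Kubernetes upper-cases the Service name and replaces '-' with '_'."""
    return service_name.upper().replace("-", "_")


def wsrep_cluster_address(env: Mapping[str, str], own_service: str, num_nodes: int = 3) -> str:
    """Build the gcomm:// address from the peers visible in the environment.

    With no visible peers the result is the empty 'gcomm://', which tells
    Galera to bootstrap a new cluster.
    """
    own_prefix = service_env_prefix(own_service)
    peers = []
    for index in range(1, num_nodes + 1):
        prefix = service_env_prefix(f"pxc-node{index}")  # assumed Service naming
        if prefix == own_prefix:
            continue
        host = env.get(f"{prefix}_SERVICE_HOST")
        if host:
            peers.append(host)
    return "gcomm://" + ",".join(peers)


# First node: no peers are visible yet, so it bootstraps the cluster.
print(wsrep_cluster_address({}, "pxc-node1"))
```

Because the address depends only on which peers are visible, any node can be the one that starts an empty cluster, which matches the disaster recovery case described above.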

Persistent storage is provided by a hostPath volume from the physical host. Kubernetes schedules each galera pod on a specific host through a node selector and a node label.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: pxc-node1
...
          volumeMounts:
            - name: mysql
              mountPath: /var/lib/mysql  # Container mount point
              readOnly: False
      volumes:
        - name: mysql
          hostPath:
            path: /var/lib/mysql  # Host mount path
...
      nodeSelector:
        openstack: controller01  # Label of controller01
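Since each galera node needs its own Deployment pinned to a different controller, the manifests lend themselves to generation. The sketch below builds them as Python dictionaries and prints JSON, which kubectl accepts as well as YAML. The image name and the app label are placeholders, and this is not how the Salt formula generates its manifests.

```python
import json


def galera_deployment(index: int) -> dict:
    """Build one single-replica Deployment pinned to controller<index>."""
    name = f"pxc-node{index}"
    return {
        "apiVersion": "extensions/v1beta1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "template": {
                "metadata": {"labels": {"app": name}},  # hypothetical label
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": "example/pxc:latest",  # placeholder image
                        "volumeMounts": [
                            {"name": "mysql", "mountPath": "/var/lib/mysql"},
                        ],
                    }],
                    # Data lives on the physical host, so the pod must stay there.
                    "volumes": [
                        {"name": "mysql", "hostPath": {"path": "/var/lib/mysql"}},
                    ],
                    "nodeSelector": {"openstack": f"controller{index:02d}"},
                },
            },
        },
    }


for i in (1, 2, 3):
    print(json.dumps(galera_deployment(i), indent=2))
```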

RabbitMQ cluster

A RabbitMQ cluster is easier to install on Kubernetes than Galera. Load balancing through HAProxy is not required, because OpenStack services can use the rabbit_hosts option with a list of cluster members. We had to figure out how to bootstrap the cluster and how to get a persistent hostname, because the cluster is hostname specific and we cannot rely on a randomly generated hostname.

RabbitMQ cluster bootstrap requires set_cluster_name on the first node; after that, the other nodes can join the cluster. We had to put this logic into the container entrypoint. The entrypoint.sh contains a loop that sets a role – master or slave – in the Salt pillar, based on the number of rabbitmq service hosts already present in the Kubernetes shared environment variables. Salt highstate then sets up all other parameters required for the rabbitmq cluster, such as users, HA queues, etc.
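The role decision can be sketched like this. It is a guess at the shape of the logic rather than the actual entrypoint.sh: it assumes Services named rabbitmq-server-nodeNN, so peers show up as RABBITMQ_SERVER_NODENN_SERVICE_HOST variables, and it treats a node that sees no peers as the master.

```python
import re
from typing import Mapping

# Assumed naming: one Service per node, e.g. rabbitmq-server-node01.
PEER_HOST = re.compile(r"^RABBITMQ_SERVER_NODE\d+_SERVICE_HOST$")


def rabbitmq_role(env: Mapping[str, str], own_host_var: str) -> str:
    """Return 'master' when no other rabbitmq Service is visible, else 'slave'."""
    peers = [key for key in env if PEER_HOST.match(key) and key != own_host_var]
    return "slave" if peers else "master"


own = "RABBITMQ_SERVER_NODE01_SERVICE_HOST"
print(rabbitmq_role({own: "10.254.3.1"}, own))
```

The resulting role would then be written to the pillar before highstate runs; the pillar key is left out here because the post does not name it.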

The second issue was the hostname, which must be known before the pod starts. Kubernetes metadata annotations make it possible to predefine a specific hostname for a pod, which we can then reference in the cluster setup.

apiVersion: extensions/v1beta1
kind: Deployment
...
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: rabbitmq-server-node01
      annotations:
        pod.beta.kubernetes.io/hostname: rabbitmq-server-node01 #Hostname definition
...

Unlike galera, disaster recovery for RabbitMQ does not require any specific manual action, so containers can simply be restarted without impact.

OpenContrail scaling

OpenContrail is very difficult to containerize because of its huge complexity. It contains around 15 services running under 5 supervisors – analytics, control, config, database, and webui. These services can run as microservices, as shown by the repository maintained and developed by Michael Henkel. However, that solves only building and running the containers as a single setup, without high availability.

Apart from HA, the problem is the missing process status event checks provided by Contrail Nodemgr. Events from nodemgr are collected and displayed in the Contrail WebUI. This utility depends on specific supervisor names and in some cases on its own init scripts, such as contrail-database (a replacement for the cassandra init script). For this reason we decided to run one Kubernetes Deployment pod per OpenContrail supervisor role, even for cassandra, kafka, and zookeeper. This goes against the microservice principle of one service per container, because we have to run supervisor in the foreground with several services, as shown in the following output from the opencontrail-control pod. However, we are able to deliver the green status and a fully functional OpenContrail, the same as in a standard server deployment.

root@opencontrail-control-550694332-8bf3m:/# supervisorctl -s unix:///tmp/supervisord_control.sock  status
contrail-control                 RUNNING    pid 164, uptime 5 days, 7:38:25
contrail-control-nodemgr         RUNNING    pid 163, uptime 5 days, 7:38:25
contrail-dns                     RUNNING    pid 165, uptime 5 days, 7:38:25
contrail-named                   RUNNING    pid 315, uptime 5 days, 7:38:22

In the end, we run the following list of Kubernetes deployments to get OpenContrail in High Availability. There are 2 cassandra clusters, for the OpenContrail config (oc-database0X) and analytics (oc-nal-database0X) databases. The Service/Deployment components follow an approach similar to the one used for the galera cluster.

root@kvm01:~# kubectl get deployment
NAME                      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
opencontrail-collector    3         3         3            3           5d
opencontrail-config       3         3         3            3           5d
opencontrail-control      3         3         3            3           5d
opencontrail-web          1         1         1            1           5d
oc-database01             1         1         1            1           5d
oc-database02             1         1         1            1           5d
oc-database03             1         1         1            1           5d
oc-nal-database01         1         1         1            1           5d
oc-nal-database02         1         1         1            1           5d
oc-nal-database03         1         1         1            1           5d
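A listing like the one above is easy to check automatically. The following sketch parses the text output of kubectl get deployment and reports any deployment whose AVAILABLE count is below DESIRED. It relies on the column names shown in this output, which may differ in other Kubernetes versions, and the degraded row in the sample is made up.

```python
from typing import List


def unavailable_deployments(table: str) -> List[str]:
    """Return names of deployments with fewer AVAILABLE than DESIRED replicas."""
    rows = [line.split() for line in table.strip().splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    name_col = header.index("NAME")
    desired_col = header.index("DESIRED")
    available_col = header.index("AVAILABLE")
    return [
        row[name_col]
        for row in body
        if int(row[available_col]) < int(row[desired_col])
    ]


sample = """
NAME                      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
opencontrail-collector    3         3         3            3           5d
opencontrail-config       3         3         3            2           5d
oc-database01             1         1         1            1           5d
"""
print(unavailable_deployments(sample))
```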

root@kvm02:~# kubectl get services
NAME                      CLUSTER-IP     PORT(S)
opencontrail-collector    10.254.77.56   8081/TCP,8086/TCP,6379/TCP
opencontrail-config       10.254.0.19    8082/TCP,5998/TCP,8443/TCP
opencontrail-control      10.254.25.20   8083/TCP,53/TCP,5269/TCP,8092/TCP,8093/TCP
opencontrail-web          10.254.0.21    8080/TCP,8143/TCP
oc-database01             10.254.29.161  9160/TCP,9042/TCP,7000/TCP,2181/TCP,9092/TCP,2888/TCP,3888/TCP
oc-database02             10.254.107.22  9160/TCP,9042/TCP,7000/TCP,2181/TCP,9092/TCP,2888/TCP,3888/TCP
oc-database03             10.254.2.227   9160/TCP,9042/TCP,7000/TCP,2181/TCP,9092/TCP,2888/TCP,3888/TCP
oc-nal-database01         10.254.99.190  9160/TCP,9042/TCP,7000/TCP,2181/TCP,9092/TCP,2888/TCP,3888/TCP
oc-nal-database02         10.254.38.94   9160/TCP,9042/TCP,7000/TCP,2181/TCP,9092/TCP,2888/TCP,3888/TCP
oc-nal-database03         10.254.118.49  9160/TCP,9042/TCP,7000/TCP,2181/TCP,9092/TCP,2888/TCP,3888/TCP

We hope that in the future supervisor will be replaced by native system services and that nodemgr will be rewritten to be more flexible when running in Docker containers.

The last modification is a manually set ifmap server URL in /etc/contrail/contrail-control.conf, pointing to the contrail-config service instead of being read from contrail-discovery. This seems to be a bug that normally goes unnoticed, because control and config usually run on the same server.

...
[IFMAP]
server_url=https://10.254.0.19:8443 # contrail-config service ip
...

Nova Compute & Libvirt

Running libvirt and nova-compute in a container requires several configuration changes, but it is not difficult. A couple of blog posts already cover it, e.g. for Atomic. What was difficult was keeping virtual machines persistent and running after a container crash, and still managing them after libvirt starts again. Let's take a look at the required configuration.

Switch nova-compute connection to libvirt from socket to TCP. Those are separate containers and we do not want to share /var/run for socket. This setup also enables a theoretical launch of nova-compute on a different host.

#Set libvirt to TCP instead of socket in entrypoint.sh
crudini --set /etc/nova/nova.conf libvirt connection_uri qemu+tcp://$NOVA_COMPUTE_LOCAL_HOST/system
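For readers without crudini, the same change can be sketched with Python's configparser. This is only an illustration of the edit, not what the entrypoint runs. Note that configparser drops comments when it rewrites a file and rejects duplicate keys by default, so it is a poor fit for a heavily annotated or multi-valued nova.conf.

```python
import configparser
import io


def set_libvirt_tcp(conf_text: str, host: str) -> str:
    """Return conf_text with [libvirt] connection_uri set to a qemu+tcp URI."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("libvirt"):
        parser.add_section("libvirt")
    parser.set("libvirt", "connection_uri", f"qemu+tcp://{host}/system")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical minimal nova.conf content.
original = "[DEFAULT]\ndebug = false\n\n[libvirt]\nvirt_type = kvm\n"
print(set_libvirt_tcp(original, "127.0.0.1"))
```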

Localhost is used as an address for the connection to libvirt; therefore, host networking must be configured in nova-compute manifest.

...
spec:
  hostNetwork: True
...

Libvirt and nova-compute require access to multiple host directories, especially with the OpenContrail vRouter plugin. The containers have to run in privileged mode to be able to access these directories.

# Libvirt
securityContext:
  privileged: True
volumeMounts:
  - name:  nova-instances
    mountPath: /var/lib/nova/instances
    readOnly: False
  - name:  modules
    mountPath: /lib/modules
    readOnly: True
  - name:  libvirt
    mountPath: /var/lib/libvirt
    readOnly: False
  - name:  cgroups
    mountPath: /sys/fs/cgroup
    readOnly: False
  - name:  qemu
    mountPath: /etc/libvirt/qemu
    readOnly: False
  - name:  run
    mountPath: /run
    readOnly: False

# Nova Compute
securityContext:
  privileged: True
volumeMounts:
  - name:  nova-instances
    mountPath: /var/lib/nova/instances
    readOnly: False
  - name:  lib
    mountPath: /usr/lib/python2.7/dist-packages/vnc_api/
    readOnly: True
  - name:  cfgm
    mountPath: /usr/lib/python2.7/dist-packages/cfgm_common/
    readOnly: True

After all these configurations are done, you can run libvirt and nova-compute in a single pod with 2 containers, and instances can be provisioned. But what happens if the libvirt container crashes? The VM crashes, too. Therefore, the libvirt container has to use the host's PID namespace. This corresponds to the --pid=host parameter of a standard docker run. The Kubernetes manifest contains the following spec:

...
spec:
  hostPID: True
...

Then you can try to kill the libvirt container and re-run virsh list:

root@node098:~# docker exec -it 54d7b8ee750a virsh list
 Id    Name                           State
----------------------------------------------------
 22    instance-0000076c              running

The VM survives because it runs as a qemu process in the host namespace.

root@node098:~# ps -ef | grep qemu
root      4444 61150  5 Aug01 ?        02:13:41 qemu-system-x86_64 -enable-kvm -name instance-0000076c -S -machine pc-i440fx-trusty ........

Now that we have a fully highly available, scaled OpenStack and Kubernetes, we can move on to performance testing & scaling.

Benchmark testing & scaling

In the previous blog posts we showed how to do a live upgrade in 2 minutes without any impact on running VMs. Now we would like to see how the environment behaves under load and at a scale of 50 compute nodes. In this section we describe the solution for collecting and visualizing metrics and demonstrate a Rally task for nova boot/delete of 1000 instances.

Physical infrastructure provisioning

Before that, let's talk more about compute node provisioning. We deploy all 50 computes with Ubuntu MaaS, with salt-minion preconfigured through curtin_userdata. Salt then automatically configures each Kubernetes node with Calico and OpenContrail. Finally, we automatically launch the nova-compute/libvirt deployment, which starts a pod with 2 Docker containers. The whole procedure on 50 computes takes about 40 minutes and is fully automated.

root@kvm02:~# docker exec -it 69140b5696b6 nova-manage service list
Binary           Host                                 Zone             Status     State
nova-scheduler   nova-controller-3741740494-qxkfz     internal         enabled    :-)
nova-conductor   nova-controller-3741740494-qxkfz     internal         enabled    :-)
nova-consoleauth nova-controller-3741740494-qxkfz     internal         enabled    :-)
# 3 replicas for control services
nova-conductor   nova-controller-3741740494-0kj30     internal         enabled    :-)
nova-cert        nova-controller-3741740494-0kj30     internal         enabled    :-)
nova-consoleauth nova-controller-3741740494-0kj30     internal         enabled    :-)
nova-scheduler   nova-controller-3741740494-0kj30     internal         enabled    :-)
nova-compute     node071                              nova             enabled    :-)
nova-compute     node062                              nova             enabled    :-)
nova-compute     node068                              nova             enabled    :-)
nova-compute     node067                              nova             enabled    :-)
nova-compute     node065                              nova             enabled    :-)
nova-compute     node070                              nova             enabled    :-)
nova-compute     node073                              nova             enabled    :-)
nova-compute     node075                              nova             enabled    :-)
# 50 computes
nova-compute     node095                              nova             enabled    :-)
nova-compute     node105                              nova             enabled    :-)
nova-compute     node102                              nova             enabled    :-)
nova-compute     node106                              nova             enabled    :-)

Cluster Performance Monitoring & Visualization

Performance monitoring and visualization have two parts – Kubernetes workload and Underlay hosts.

Kubernetes provides native support for Heapster, which enables Container Cluster Monitoring and Performance Analysis. We launch Heapster through manifests as an addon. By default, it stores metrics in the InfluxDB backend, which can run as container, too. We had some issues with upstream InfluxDB docker image, so we decided to reuse our external production InfluxDB cluster.

Underlay hosts use collectd to collect system performance statistics periodically and send them to Graphite. Graphite is another type of time series database. Host metric collection is configured automatically by OpenStack-Salt during setup.
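For context on what reaches Graphite, the sketch below formats a data point in Graphite's plaintext protocol (metric path, value, Unix timestamp, newline) and sends it to the Carbon plaintext port, 2003 by default. The metric path and host are made up, and this only illustrates the wire format, not how collectd is configured here.

```python
import socket
import time
from typing import Iterable, Optional


def graphite_line(path: str, value: float, timestamp: Optional[float] = None) -> str:
    """Format one data point as '<path> <value> <unix-timestamp>' plus newline."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_metrics(lines: Iterable[str], host: str, port: int = 2003) -> None:
    """Send pre-formatted lines to a Carbon plaintext listener."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Hypothetical metric path; nothing is sent in this example.
print(graphite_line("collectd.kvm02.load.shortterm", 1.5, 1470000000), end="")
```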

Both sources – InfluxDB and Graphite – are visualized in Grafana, which is a very elegant and powerful way to create, explore, and share dashboards. We created 3 dashboards in Grafana to provide detailed performance analysis for our OpenStack benchmarking. All dashboards are shown in the following benchmark section with real workload.

  • Overall underlay hosts – dashboard per host or global dashboard with an average load, network traffic, disk I/O, used memory, etc. These charts enable us to see the impact on physical nodes.
  • K8S Cluster – predefined dashboard from upstream k8s Grafana with basic information from physical nodes. It does not provide as detailed information as the previous dashboard.
  • K8S Pods – show graphs per pod with CPU usage, individual memory usage, individual network usage, and filesystem usage.

Rally benchmark

OpenStack Rally was used for benchmark testing and reports. Because of limited space in this post, we decided to show only one Rally task and to prepare an extra post dedicated to benchmarking. We used a task that boots and deletes a nova instance with the cirros image 1000 times, with concurrency 50, in 3 tenants with 2 users per tenant. All testing was done on the OpenStack Mitaka release with OpenContrail 3.0.2 as the Neutron plugin.
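The post does not include the task file, so the following is a reconstruction under assumptions: it uses Rally's stock NovaServers.boot_and_delete_server scenario with a constant runner, and the flavor and image names are placeholders. Only the numbers (1000 iterations, concurrency 50, 3 tenants, 2 users per tenant) come from the text.

```python
import json


def boot_and_delete_task(times: int = 1000, concurrency: int = 50,
                         tenants: int = 3, users_per_tenant: int = 2) -> dict:
    """Build a Rally task definition for booting and deleting servers."""
    return {
        "NovaServers.boot_and_delete_server": [
            {
                "args": {
                    "flavor": {"name": "m1.tiny"},  # placeholder flavor
                    "image": {"name": "cirros"},    # placeholder image name
                },
                "runner": {
                    "type": "constant",
                    "times": times,
                    "concurrency": concurrency,
                },
                "context": {
                    "users": {
                        "tenants": tenants,
                        "users_per_tenant": users_per_tenant,
                    },
                },
            }
        ]
    }


print(json.dumps(boot_and_delete_task(), indent=2))
```

Saved to a file, a definition like this would typically be passed to rally task start.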

The Rally output shows that booting and deleting 1000 instances with concurrency 50 took 706 seconds, with 6 failures. We attribute these failures to statistical error in the Rally runs. We do at least 10 runs with different inputs and trace the logs.
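A quick back-of-the-envelope reading of these numbers is sketched below. The success rate and throughput follow directly from the figures in the text. The average time per iteration is only an estimate that assumes all 50 concurrent slots stayed busy for the whole run.

```python
def summarize(total: int, failures: int, duration_s: float, concurrency: int) -> dict:
    """Derive simple summary figures from a Rally run."""
    return {
        "success_rate": (total - failures) / total,
        "iterations_per_second": total / duration_s,
        # Estimate only: assumes every concurrent slot was busy for the full run.
        "approx_seconds_per_iteration": concurrency * duration_s / total,
    }


summary = summarize(total=1000, failures=6, duration_s=706, concurrency=50)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

With the numbers above this gives a 99.4% success rate, about 1.42 boot-and-delete iterations per second, and roughly 35 seconds per iteration.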

../_images/rally-nova-boot-delete-1000x50.png

As you can see, Rally provides nice graph reports from testing, but it does not show what the system load looks like or where the bottleneck might be. Therefore, we have several dashboards in Grafana. The first one shows performance metrics from a nova controller pod during the Rally test. This pod consists of 6 Docker containers. As you can see, the highest CPU usage comes from nova-conductor and nova-api.

../_images/nova-boot-and-delete-1000x50-controllerpod.png

We compared these numbers with the information from the other 2 nova pods and saw that they are almost the same. Mapping these metrics against support service pods such as galera or rabbitmq also shows the relationship between them.

The following dashboard shows CPU overall for underlay physical hosts. As you can see, the second controller (kvm02) has the highest usage of CPU.

../_images/nova-boot-and-delete-1000x50-cpu-cluster.png

The following screen shows the overall dashboard from the underlay hosts. These dashboards help us identify disk I/O requests on galera and bottlenecks on the physical layer.

../_images/nova-boot-and-delete-1000x50-overall.png

We would like to provide the detailed regression analysis and information from testing in an extra blog post very soon. Check the conclusion for more information.

Conclusion

In this blog, we introduced a powerful Salt formula for Kubernetes deployment and orchestration, then we shared the most important thoughts from running OpenStack on Kubernetes in High Availability. Finally, we shared how to do benchmarking with several reports.

Further details from the performance analysis and benchmarking will be part of the next post. We have a lot of information that we could not fit into this post due to space restrictions. Therefore, we would like to invite the community to tell us what they want to know or see.

Help us bring interesting analysis and stability to OpenStack on Kubernetes. Do you want to see how long it takes us to provision 1M instances? Or what happens to reliability when we cut off one of the controllers in the cluster? Write down your ideas in the following etherpad https://etherpad.openstack.org/p/openstack-kubernetes-benchmark . You can also register for the community live webinar, where we can demonstrate this testing.

We will take these ideas and prepare the next awesome blog post with the results from the testing. #WeAreOpenStack

Jakub Pavlik & Marek Celoud

tcp cloud
