OpenStack-Salt Kubernetes released
Our new OpenStack-Salt Kubernetes formula recently became official. The formula provides a stable procedure for deploying, scaling, and managing a production-ready Kubernetes underlay. At first glance you might ask: why would I use this Salt formula instead of the official one? To make it production ready, we had to provide the following functions:
- build Debian-based packages instead of downloading binaries and init scripts or reusing existing Docker containers of unknown origin
- support both single-node and highly available cluster setups
- optionally run control services as system services rather than only through manifests
- clean up the mix of bash, Salt, and unrelated features to make it suitable for production
- automatically generate and orchestrate Kubernetes manifests
- integrate with Calico, Flannel, and OpenContrail
- automatically generate labels and namespaces
- manage native SSL certificates with Salt
- pull images from a private Docker registry with authentication
Underlay Architecture
The basic architecture was shown and explained in part 2; however, high availability for Kubernetes itself was missing. Therefore, we worked mostly on this part over the last month. As already mentioned, we build packages from source for Ubuntu Xenial. Currently these are etcd v3 and Kubernetes 1.2.x (1.3 will follow soon). The minimum setup is 5 nodes: 3 controllers and 2 computes (6 if the Salt master is separated). To deploy these 5 nodes we use Ubuntu MaaS for bare-metal provisioning, which very easily deploys the OS with the salt-minion agent. For large-scale environments with hundreds of computes, Kubernetes master nodes should be separated from the pools running the OpenStack control plane. Several notes from the Kubernetes HA setup:
- do not run Kubernetes itself through manifests as containers – when Docker crashes, the Kubernetes API, scheduler, and etcd crash too. We ran into several outages when a Docker bug broke the whole cluster control plane.
- Keepalived for VIP and HAProxy – use as simple a setup as possible, without corosync/pacemaker or other complex cluster tooling.
- HA on client side – do not use HAProxy as a workaround for etcd read/write operations.
- separate Calico etcd from the default container – by default Calico launches its own etcd, which is not suitable for a production environment. Reuse the existing Kubernetes etcd cluster or launch another etcd cluster just for Calico (the preferred way).
- do not use “--api-servers” HA for kubelet – multiple kube-apiserver addresses in the kubelet configuration look like client-side HA. Unfortunately, kubelet uses just the first address.
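The "HA on client side" note above can be illustrated with a minimal Python sketch. It is not the formula's code: the function name `call_with_failover`, the injected `request` callable, and the endpoint addresses are all hypothetical. The idea is only that the client itself walks a list of etcd endpoints instead of relying on a load balancer in front of them.

```python
from typing import Callable, Iterable, Optional


def call_with_failover(endpoints: Iterable[str],
                       request: Callable[[str], str]) -> str:
    """Try each endpoint in order and return the first successful response.

    `request` is a hypothetical callable that performs one read or write
    against a single endpoint and raises OSError on connection problems.
    """
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        try:
            return request(endpoint)
        except OSError as exc:  # connection refused, timeout, and similar
            last_error = exc
    raise ConnectionError(f"all endpoints failed: {last_error}")


if __name__ == "__main__":
    # Hypothetical endpoints; the first one is simulated as being down.
    def fake_request(endpoint: str) -> str:
        if endpoint.startswith("http://10.0.0.1"):
            raise ConnectionRefusedError("endpoint down")
        return f"ok from {endpoint}"

    print(call_with_failover(
        ["http://10.0.0.1:2379", "http://10.0.0.2:2379"], fake_request))
```

Real etcd client libraries usually accept a list of endpoints and do this internally; the sketch only shows why no proxy is needed for that case.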
High Availability for OpenStack Support services
As already mentioned, HA for OpenStack services running on Kubernetes is not so difficult, because the APIs are stateless. The stateful services are MySQL Galera, RabbitMQ, etc. In this case we cannot simply scale pods and use the native iptables DNAT balancing. Kubernetes offers advanced external balancing when it runs on GCE or AWS. Unfortunately, there is no default stateful balancing for on-premise deployments. Therefore, we have to extend the Kubernetes cluster with HAProxy, which is not natively managed by kubectl. Let's take a look at several support services.
MySQL Galera
MySQL Galera is the trickiest part, and we spent some time on several scenarios, especially: how to load balance, bootstrap, restart, and restore a Galera cluster? Kubernetes 1.3 introduces the PetSet resource for stateful services like Galera, Cassandra, etc. This should provide a deterministic hostname, persistent storage, and a bootstrap procedure. Unfortunately there are still several limitations, the biggest one being load balancing. On-premise Kubernetes does not offer the advanced balancing method that Galera requires. This led us to stay with the standard Service/Deployment resources and use an external HAProxy with a Keepalived virtual IP. The following diagram shows how the structure looks. Each Galera node is defined by a Deployment resource with replica 1 and a specific controller label that pins the pod to a physical controller. Then a Service resource is created with a specific IP address, which is used in the HAProxy member configuration. Galera also has a special bootstrap procedure, where the first node in the cluster must be started without a connection to other peers. Here we reused the existing approach from the official Kubernetes sample, which loops through the Kubernetes shared environment variables and sets WSREP_CLUSTER_ADDRESS based on the number of launched cluster nodes. This also enables easy disaster recovery, when we have to start the cluster from any node, not just the first one. Persistent storage is provided by a hostPath volume from the physical host. Kubernetes schedules the Galera pod on a specific host through a node selector and node label.
    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: pxc-node1
    ...
        volumeMounts:
        - name: mysql
          mountPath: /var/lib/mysql   #Container mountpoint
          readOnly: False
      volumes:
      - name: mysql
        hostPath:
          path: /var/lib/mysql        #Host mount path
    ...
      nodeSelector:
        openstack: controller01       #Label of controller01
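The bootstrap logic described above can be sketched in Python as follows. This is an illustration under assumptions, not the entrypoint we ship: it assumes the Galera Services are named `pxc-node1`, `pxc-node2`, and so on, so that Kubernetes exposes them to pods as `PXC_NODE<N>_SERVICE_HOST` environment variables, and the function name and `own_index` parameter are hypothetical. An empty `gcomm://` address tells Galera to bootstrap a new cluster.

```python
import re
from typing import Mapping


def galera_cluster_address(environ: Mapping[str, str],
                           own_index: int,
                           prefix: str = "PXC_NODE") -> str:
    """Build a gcomm:// address from Kubernetes service environment variables.

    Peers are every <prefix><N>_SERVICE_HOST variable except our own index.
    With no peers the result is "gcomm://", which bootstraps a new cluster.
    """
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)_SERVICE_HOST$")
    peers = []
    for name, value in environ.items():
        match = pattern.match(name)
        if match and int(match.group(1)) != own_index:
            peers.append((int(match.group(1)), value))
    hosts = [host for _, host in sorted(peers)]
    return "gcomm://" + ",".join(hosts)


if __name__ == "__main__":
    # Hypothetical service IPs, for illustration only.
    env = {
        "PXC_NODE1_SERVICE_HOST": "10.254.0.11",
        "PXC_NODE2_SERVICE_HOST": "10.254.0.12",
        "PXC_NODE2_SERVICE_PORT": "3306",
    }
    print(galera_cluster_address(env, own_index=1))
```

Because a pod only sees the Services that existed when it started, whichever node starts first, whatever its index, gets the empty address and bootstraps; that is what makes recovery from any node possible.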
RabbitMQ cluster
A RabbitMQ cluster is easier to install on Kubernetes than Galera. Load balancing on HAProxy is not required, because OpenStack services can use the rabbitmq_hosts parameter with a list of cluster members. We had to figure out how to bootstrap the cluster and provide a persistent hostname, because the cluster is hostname specific and we cannot rely on a randomly generated hostname. RabbitMQ cluster bootstrap requires set_cluster_name on the first node; after that the other nodes can join the cluster. We had to put this logic into the container entrypoint. The entrypoint.sh contains a loop that sets a role, master or slave, in the Salt pillar based on the number of existing RabbitMQ service hosts in the Kubernetes shared environment variables. The Salt highstate then sets up all other parameters required for the RabbitMQ cluster, such as users, HA queues, etc. The second issue was the hostname, which must be known before the pod starts. Kubernetes metadata annotations make it possible to predefine a specific hostname for a pod. We can then reference this hostname in the cluster setup.
    apiVersion: extensions/v1beta1
    kind: Deployment
    ...
    spec:
      replicas: 1
      template:
        metadata:
          labels:
            app: rabbitmq-server-node01
          annotations:
            pod.beta.kubernetes.io/hostname: rabbitmq-server-node01   #Hostname definition
    ...
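The role selection done in the entrypoint could look roughly like this in Python. It is a sketch under assumptions: the Service names are taken to match the hostnames (`rabbitmq-server-node01` and so on), which would make Kubernetes expose them as `RABBITMQ_SERVER_NODE<NN>_SERVICE_HOST`; the function name is hypothetical, and writing the result into the Salt pillar is left out.

```python
from typing import Mapping


def rabbitmq_role(environ: Mapping[str, str],
                  prefix: str = "RABBITMQ_SERVER_NODE") -> str:
    """Return "master" for the first node of the cluster, "slave" otherwise.

    A pod only sees services that existed when it started, so a pod that
    sees at most one RabbitMQ service host is the first cluster member.
    """
    hosts = [name for name in environ
             if name.startswith(prefix) and name.endswith("_SERVICE_HOST")]
    return "master" if len(hosts) <= 1 else "slave"


if __name__ == "__main__":
    # Hypothetical service IPs, for illustration only.
    first = {"RABBITMQ_SERVER_NODE01_SERVICE_HOST": "10.254.0.31"}
    second = dict(first, RABBITMQ_SERVER_NODE02_SERVICE_HOST="10.254.0.32")
    print(rabbitmq_role(first), rabbitmq_role(second))
```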
OpenContrail scaling
OpenContrail is very difficult to containerize because of its huge complexity. It contains around 15 services running under 5 supervisors: analytics, control, config, database, and webui. These services can run as microservices, as shown by the repository maintained and developed by Michael Henkel. However, that solves only building and running the containers as a single setup, without high availability. Apart from HA, the problem is the missing process status event checks provided by Contrail Nodemgr. Events from nodemgr are collected and displayed in the Contrail WebUI. This utility depends on specific supervisor names and in some cases on its own init scripts, such as contrail-database (a replacement for the cassandra init script). For this reason we decided to run one Kubernetes Deployment pod per OpenContrail supervisor role, even for cassandra, kafka, and zookeeper. This goes against the microservice principle of one service per container, because we have to run supervisor in the foreground with several services, as shown in the following output from the opencontrail-control pod. However, we are able to deliver the green status and a fully functional OpenContrail, the same as in a standard server deployment.
    root@opencontrail-control-550694332-8bf3m:/# supervisorctl -s unix:///tmp/supervisord_control.sock status
    contrail-control           RUNNING   pid 164, uptime 5 days, 7:38:25
    contrail-control-nodemgr   RUNNING   pid 163, uptime 5 days, 7:38:25
    contrail-dns               RUNNING   pid 165, uptime 5 days, 7:38:25
    contrail-named             RUNNING   pid 315, uptime 5 days, 7:38:22
    root@kvm01:~# kubectl get deployment
    NAME                     DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
    opencontrail-collector   3         3         3            3           5d
    opencontrail-config      3         3         3            3           5d
    opencontrail-control     3         3         3            3           5d
    opencontrail-web         1         1         1            1           5d
    oc-database01            1         1         1            1           5d
    oc-database02            1         1         1            1           5d
    oc-database03            1         1         1            1           5d
    oc-nal-database01        1         1         1            1           5d
    oc-nal-database02        1         1         1            1           5d
    oc-nal-database03        1         1         1            1           5d

    root@kvm02:~# kubectl get services
    NAME                     CLUSTER-IP      PORT(S)
    opencontrail-collector   10.254.77.56    8081/TCP,8086/TCP,6379/TCP
    opencontrail-config      10.254.0.19     8082/TCP,5998/TCP,8443/TCP
    opencontrail-control     10.254.25.20    8083/TCP,53/TCP,5269/TCP,8092/TCP,8093/TCP
    opencontrail-web         10.254.0.21     8080/TCP,8143/TCP
    oc-database01            10.254.29.161   9160/TCP,9042/TCP,7000/TCP,2181/TCP,9092/TCP,2888/TCP,3888/TCP
    oc-database02            10.254.107.22   9160/TCP,9042/TCP,7000/TCP,2181/TCP,9092/TCP,2888/TCP,3888/TCP
    oc-database03            10.254.2.227    9160/TCP,9042/TCP,7000/TCP,2181/TCP,9092/TCP,2888/TCP,3888/TCP
    oc-nal-database01        10.254.99.190   9160/TCP,9042/TCP,7000/TCP,2181/TCP,9092/TCP,2888/TCP,3888/TCP
    oc-nal-database02        10.254.38.94    9160/TCP,9042/TCP,7000/TCP,2181/TCP,9092/TCP,2888/TCP,3888/TCP
    oc-nal-database03        10.254.118.49   9160/TCP,9042/TCP,7000/TCP,2181/TCP,9092/TCP,2888/TCP,3888/TCP
    ...
    [IFMAP]
    server_url=https://10.254.0.19:8443   # contrail-config service ip
    ...
Nova Compute & Libvirt
Running libvirt and nova-compute in a container requires several configuration changes, but it is not difficult. There are a couple of blog posts about this, e.g. for Atomic. The difficult part was keeping virtual machines persistent and running after a container crash, and still being able to manage them once libvirt started again. Let’s take a look at the required configuration. Switch the nova-compute connection to libvirt from socket to TCP. These are separate containers and we do not want to share /var/run for the socket. This setup also makes it theoretically possible to launch nova-compute on a different host.
    #Set libvirt to TCP instead of socket in entrypoint.sh
    crudini --set /etc/nova/nova.conf libvirt connection_uri qemu+tcp://$NOVA_COMPUTE_LOCAL_HOST/system
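For readers without crudini at hand, the same edit can be approximated with Python's standard configparser. This is only an illustrative sketch, not part of our entrypoint: the function name `set_libvirt_uri` and the sample host address are hypothetical, and unlike crudini this approach drops comments when the file is rewritten.

```python
import configparser
import io


def set_libvirt_uri(conf_text: str, host: str) -> str:
    """Return conf_text with [libvirt] connection_uri pointing at host over TCP."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(conf_text)
    if not parser.has_section("libvirt"):
        parser.add_section("libvirt")
    parser.set("libvirt", "connection_uri", f"qemu+tcp://{host}/system")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical minimal nova.conf content and host address.
    sample = "[DEFAULT]\ndebug = False\n\n[libvirt]\nvirt_type = kvm\n"
    print(set_libvirt_uri(sample, "192.0.2.10"))
```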
    ...
    spec:
      hostNetwork: True
    ...
    # Libvirt
    securityContext:
      privileged: True
    volumeMounts:
    - name: nova-instances
      mountPath: /var/lib/nova/instances
      readOnly: False
    - name: modules
      mountPath: /lib/modules
      readOnly: True
    - name: libvirt
      mountPath: /var/lib/libvirt
      readOnly: False
    - name: cgroups
      mountPath: /sys/fs/cgroup
      readOnly: False
    - name: qemu
      mountPath: /etc/libvirt/qemu
      readOnly: False
    - name: run
      mountPath: /run
      readOnly: False

    # Nova Compute
    securityContext:
      privileged: True
    volumeMounts:
    - name: nova-instances
      mountPath: /var/lib/nova/instances
      readOnly: False
    - name: lib
      mountPath: /usr/lib/python2.7/dist-packages/vnc_api/
      readOnly: True
    - name: cfgm
      mountPath: /usr/lib/python2.7/dist-packages/cfgm_common/
      readOnly: True
--pid=host parameter in a standard docker run. The Kubernetes manifest contains the following spec:
    ...
    spec:
      hostPID: True
    ...
    root@node098:~# docker exec -it 54d7b8ee750a virsh list
     Id    Name                           State
    ----------------------------------------------------
     22    instance-0000076c              running
    root@node098:~# ps -ef | grep qemu
    root      4444 61150  5 Aug01 ?        02:13:41 qemu-system-x86_64 -enable-kvm -name instance-0000076c -S -machine pc-i440fx-trusty ........
Benchmark testing & scaling
In the previous blog posts we showed how to perform a live upgrade in 2 minutes without any impact on running VMs. Now we would like to see how the environment behaves under load and at a scale of 50 compute nodes. In this section we describe the solution for collecting and visualizing metrics and demonstrate a Rally task for nova boot/delete of 1000 instances.
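As a rough idea of what such a Rally task can look like, the Python sketch below generates a task definition for the `NovaServers.boot_and_delete_server` scenario. Only the count of 1000 iterations comes from this post; the helper name, concurrency, flavor, image, and tenant settings are hypothetical placeholders, not the values we used.

```python
import json


def boot_and_delete_task(times: int, concurrency: int,
                         flavor: str, image: str) -> dict:
    """Build a Rally task that boots and deletes `times` Nova instances."""
    return {
        "NovaServers.boot_and_delete_server": [
            {
                "args": {
                    "flavor": {"name": flavor},
                    "image": {"name": image},
                },
                "runner": {
                    "type": "constant",
                    "times": times,
                    "concurrency": concurrency,
                },
                "context": {
                    "users": {"tenants": 1, "users_per_tenant": 1},
                },
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder flavor, image, and concurrency; adjust to the environment.
    print(json.dumps(boot_and_delete_task(1000, 10, "m1.tiny", "cirros"), indent=2))
```

The printed JSON can be saved to a file and passed to `rally task start`.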
Physical infrastructure provisioning
Before that, let’s talk more about compute node provisioning. We deploy all 50 computes with Ubuntu MaaS, with salt-minion preconfigured through curtin_userdata. Salt then automatically configures each Kubernetes node with Calico and OpenContrail. Finally, we automatically launch the nova-compute/libvirt deployment, which starts a pod with 2 Docker containers. The whole procedure for 50 computes takes about 40 minutes and is fully automated.
    root@kvm02:~# docker exec -it 69140b5696b6 nova-manage service list
    Binary             Host                               Zone       Status    State
    nova-scheduler     nova-controller-3741740494-qxkfz   internal   enabled   :-)
    nova-conductor     nova-controller-3741740494-qxkfz   internal   enabled   :-)
    nova-consoleauth   nova-controller-3741740494-qxkfz   internal   enabled   :-)
    # 3 replicas for control services
    nova-conductor     nova-controller-3741740494-0kj30   internal   enabled   :-)
    nova-cert          nova-controller-3741740494-0kj30   internal   enabled   :-)
    nova-consoleauth   nova-controller-3741740494-0kj30   internal   enabled   :-)
    nova-scheduler     nova-controller-3741740494-0kj30   internal   enabled   :-)
    nova-compute       node071                            nova       enabled   :-)
    nova-compute       node062                            nova       enabled   :-)
    nova-compute       node068                            nova       enabled   :-)
    nova-compute       node067                            nova       enabled   :-)
    nova-compute       node065                            nova       enabled   :-)
    nova-compute       node070                            nova       enabled   :-)
    nova-compute       node073                            nova       enabled   :-)
    nova-compute       node075                            nova       enabled   :-)
    # 50 computes
    nova-compute       node095                            nova       enabled   :-)
    nova-compute       node105                            nova       enabled   :-)
    nova-compute       node102                            nova       enabled   :-)
    nova-compute       node106                            nova       enabled   :-)
Cluster Performance Monitoring & Visualization
Performance monitoring and visualization have two parts: the Kubernetes workload and the underlay hosts. Kubernetes provides native support for Heapster, which enables container cluster monitoring and performance analysis. We launch Heapster through manifests as an addon. By default, it stores metrics in an InfluxDB backend, which can run as a container, too. We had some issues with the upstream InfluxDB Docker image, so we decided to reuse our external production InfluxDB cluster. The underlay hosts use collectd to collect system performance statistics periodically and send them to Graphite, another type of time series database. Host metric collection is configured automatically by OpenStack-Salt during setup. Both sources, InfluxDB and Graphite, are visualized in Grafana, which is a very elegant and powerful way to create, explore, and share dashboards. We created 3 dashboards in Grafana to provide detailed performance analysis for our OpenStack benchmarking. All dashboards are shown in the following benchmark section with real workload.
- Overall underlay hosts – dashboard per host or global dashboard with an average load, network traffic, disk I/O, used memory, etc. These charts enable us to see the impact on physical nodes.
- K8S Cluster – predefined dashboard from upstream k8s Grafana with basic information from physical nodes. It does not provide as detailed information as the previous dashboard.
- K8S Pods – shows graphs per pod with CPU usage, individual memory usage, individual network usage, and filesystem usage.
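To make the host-metrics path described above more concrete: Graphite's plaintext protocol accepts one metric per line in the form `path value timestamp`. The Python sketch below formats such a line; the helper name and the metric path are hypothetical examples, not the names that collectd produces in our setup.

```python
import time
from typing import Optional


def graphite_line(path: str, value: float,
                  timestamp: Optional[int] = None) -> str:
    """Format one metric for Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())  # seconds since the Unix epoch
    return f"{path} {value} {timestamp}\n"


if __name__ == "__main__":
    # Hypothetical metric path for one of the compute hosts.
    print(graphite_line("hosts.node071.load.shortterm", 0.42), end="")
```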