In the world of cloud native applications, etcd is the only stateful component of the Kubernetes control plane. This makes life simpler for an administrator, but the Kubernetes 1.6 release throws a wrench into the process of maintaining all those nines of reliability. But don’t fear, I’ll make sure you’re covered.
etcd data store format: v2 or v3?
etcd versions 3.0.0 and up support two different data stores: v2 and v3. It’s important to know which version you’re using, because it affects how you can back up your data.
In Kubernetes 1.5, the default data store format was v2, though v3 was available if you set it explicitly. In Kubernetes 1.6, the default data store is etcd v3, but you still need to think about which format the various components surrounding etcd are using.
For example, Calico, Canal, and Flannel can only write data to the etcd v2 data store, so combining their etcd data with the Kubernetes etcd data store can complicate the maintenance of etcd if Kubernetes is using v3.
Users blindly upgrading from Kubernetes v1.5 to v1.6 may be in for a surprise. (Just one reason it’s important to always read the release notes!) Kubernetes v1.6 changes the default etcd backend from v2 to v3, so before you upgrade, manually migrate the etcd data to v3. To keep the data consistent, this migration requires shutting down all kube-apiservers.
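If you do perform the migration, one common way to drive it is etcdctl’s migrate command, which copies the v2 keyspace into the v3 store while everything is offline. The following is a minimal sketch only, run as root on each etcd host; the service names and data directory are assumptions, and this is not the official Kubernetes upgrade procedure, so check the release notes and the etcd documentation for your versions.

# Hedged sketch of an offline v2 -> v3 migration (service names and paths are assumptions)
service kube-apiserver stop                               # stop ALL kube-apiservers on every master first
service etcd stop
ETCDCTL_API=3 etcdctl migrate --data-dir /var/lib/etcd    # copy the v2 keyspace into the v3 store
service etcd start
service kube-apiserver start                              # in 1.6, kube-apiserver defaults to --storage-backend=etcd3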
If you don’t want to migrate just yet, you can pin kube-apiserver back to v2 etcd with the following option:
--storage-backend=etcd2
Backing up etcd
All configuration data for Kubernetes is stored inside etcd, so in the event of an irrecoverable disaster, an operator can use an etcd backup to recover all data. Etcd creates snapshots regularly on its own, but daily backups stored on a separate host are a good strategy for disaster recovery for Kubernetes.
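As an illustration of that strategy, a daily cron entry on each etcd host might run a backup script and push the result to another machine. The script name and destination host below are hypothetical; the actual backup command inside the script depends on the v2/v3 considerations discussed next.

# Hypothetical crontab entry: back up etcd at 02:00 and ship the result to another host
0 2 * * * /usr/local/bin/etcd-backup.sh && rsync -a /backupdir/ backup-host:/srv/etcd-backups/$(hostname)/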
Backup methods
etcd has different backup methods for v2 and v3, and each has its own advantages and disadvantages. The v3 backup is much cleaner and consists of a single, compact file, but it has one major drawback: it won’t back up or recover v2 data.
This means that if you have only etcd v3 data (for example, if your network plugin doesn’t consume etcd), you can use the v3 backup. If you have any v2 data, even mixed with v3 data, you must use the v2 backup method.
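If you’re not sure whether your cluster still holds v2 data (a network plugin writing through the v2 API, for example), you can list the v2 keyspace directly; anything it returns lives in the v2 store. A minimal check, assuming etcd is reachable over plain HTTP on localhost (add your TLS flags if needed):

# Any keys listed here are v2 data, which a v3 snapshot will NOT capture
etcdctl --endpoints http://127.0.0.1:2379 ls /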
Let’s look at each of these methods.
Etcd v2 backups
The etcd v2 backup method creates a directory structure with a single WAL file. You can perform a backup online without interrupting etcd cluster operations. To back up an etcd v2+v3 data store, use the following command:
etcdctl backup --data-dir /var/lib/etcd/ --backup-dir /backupdir

You can find the official procedure for etcd v2 restore here, but here is an overview of the basic steps. The challenging part is to rebuild the cluster one node at a time.
You can see these steps in the following script:

#!/bin/bash -e
# Change as necessary
RESTORE_PATH=${RESTORE_PATH:-/tmp/member}
# Extract node data from etcd config
source /etc/etcd.env || source /etc/default/etcd

function with_retries {
  local retries=3
  set -o pipefail
  for try in $(seq 1 $retries); do
    ${@}
    [ $? -eq 0 ] && break
    if [[ "$try" == "$retries" ]]; then
      exit 1
    fi
    sleep 3
  done
  set +o pipefail
}

this_node=$ETCD_NAME
node_names=($(echo $ETCD_INITIAL_CLUSTER | \
  awk -F'[=,]' '{for (i=1;i<=NF;i+=2) { print $i }}'))
node_endpoints=($(echo $ETCD_INITIAL_CLUSTER | \
  awk -F'[=,]' '{for (i=2;i<=NF;i+=2) { print $i }}'))
node_ips=($(echo $ETCD_INITIAL_CLUSTER | \
  awk -F'://|:[0-9]' '{for (i=2;i<=NF;i+=2) { print $i }}'))
num_nodes=${#node_names[@]}

# Stop and purge etcd data
for i in `seq 0 $((num_nodes - 1))`; do
  ssh ${node_ips[$i]} sudo service etcd stop
  ssh ${node_ips[$i]} sudo docker rm -f ${node_names[$i]} \
    || : # Kargo specific
  ssh ${node_ips[$i]} sudo rm -rf /var/lib/etcd/member
done

# Restore on first node
if [[ "$this_node" == ${node_names[0]} ]]; then
  sudo cp -R $RESTORE_PATH /var/lib/etcd/
else
  rsync -vaz -e "ssh" --rsync-path="sudo rsync" \
    "$RESTORE_PATH" ${node_ips[0]}:/var/lib/etcd/
fi
ssh ${node_ips[0]} "sudo etcd --force-new-cluster 2> \
  /tmp/etcd-restore.log" &
echo "Sleeping 5s to wait for etcd up"
sleep 5

# Fix member endpoint on first node
member_id=$(with_retries ssh ${node_ips[0]} \
  ETCDCTL_ENDPOINTS=https://localhost:2379 \
  etcdctl member list | cut -d':' -f1)
ssh ${node_ips[0]} ETCDCTL_ENDPOINTS=https://localhost:2379 \
  etcdctl member update $member_id ${node_endpoints[0]}
echo "Waiting for etcd to reconfigure peer URL"
sleep 4

# Add other nodes
initial_cluster="${node_names[0]}=${node_endpoints[0]}"
for i in `seq 1 $((num_nodes -1))`; do
  echo "Adding node ${node_names[$i]} to ETCD cluster..."
  initial_cluster="$initial_cluster,${node_names[$i]}=${node_endpoints[$i]}"
  with_retries ssh ${node_ips[0]} \
    ETCDCTL_ENDPOINTS=https://localhost:2379 \
    etcdctl member add ${node_names[$i]} ${node_endpoints[$i]}
  ssh ${node_ips[$i]} \
    "sudo etcd --initial-cluster="$initial_cluster" &>/dev/null" &
  sleep 5
  with_retries ssh ${node_ips[0]} \
    ETCDCTL_ENDPOINTS=https://localhost:2379 etcdctl member list
done

echo "Restarting etcd on all nodes"
for i in `seq 0 $((num_nodes -1))`; do
  ssh ${node_ips[$i]} sudo service etcd restart
done
sleep 5

echo "Verifying cluster health"
with_retries ssh ${node_ips[0]} \
  ETCDCTL_ENDPOINTS=https://localhost:2379 etcdctl cluster-health
Etcd v3 backups
The etcd v3 backup creates a single compressed file. Remember, while v2 backups surprisingly also copy v3 data, the v3 backup cannot be used to back up etcd v2 data, so be careful before using this method. To create a v3 backup, run the command:
ETCDCTL_API=3 etcdctl snapshot save /backupdir
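In practice you’ll usually want to name the snapshot with a date, verify it, and move it off the host. A minimal sketch, assuming a TLS-enabled etcd on localhost; the endpoint, certificate paths, and backup host are assumptions to adapt to your deployment:

# Hedged sketch: take a v3 snapshot, sanity-check it, and copy it to another machine
SNAP=/backupdir/etcd-snapshot-$(date +%Y%m%d).db
ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 \
  --cacert /etc/ssl/etcd/ca.pem --cert /etc/ssl/etcd/client.pem --key /etc/ssl/etcd/client-key.pem \
  snapshot save "$SNAP"
ETCDCTL_API=3 etcdctl snapshot status "$SNAP"    # prints hash, revision, total keys, and size
rsync -a "$SNAP" backup-host:/srv/etcd-backups/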
The official procedure for etcd v3 restore is documented here, but as you can see, the general process is much simpler than it was for v2; the v3 restore process is capable of rebuilding the cluster without such granular steps.
The steps required are as follows:
1. Stop etcd on all hosts.
2. Purge /var/lib/etcd/member on all hosts.
3. Copy the backup file to each etcd host.
4. source /etc/default/etcd on each host and run the following command:
ETCDCTL_API=3 etcdctl snapshot restore BACKUP_FILE \
  --name $ETCD_NAME \
  --initial-cluster "$ETCD_INITIAL_CLUSTER" \
  --initial-cluster-token "$ETCD_INITIAL_CLUSTER_TOKEN" \
  --initial-advertise-peer-urls $ETCD_INITIAL_ADVERTISE_PEER_URLS \
  --data-dir $ETCD_DATA_DIR
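To make step 4 concrete, here is a hedged sketch of a small script you might run as root on each etcd host after copying the snapshot there. The snapshot path, service name, and environment-file location are assumptions that depend on how the cluster was deployed.

#!/bin/bash -e
# Hedged sketch: per-host v3 restore (paths and service names are assumptions)
BACKUP_FILE=/tmp/etcd-snapshot.db
source /etc/default/etcd            # provides ETCD_NAME, ETCD_INITIAL_CLUSTER, and friends
service etcd stop
rm -rf /var/lib/etcd/member
# If etcdctl complains that the data dir already exists, restore into a fresh
# directory and move it into place instead.
ETCDCTL_API=3 etcdctl snapshot restore "$BACKUP_FILE" \
  --name "$ETCD_NAME" \
  --initial-cluster "$ETCD_INITIAL_CLUSTER" \
  --initial-cluster-token "$ETCD_INITIAL_CLUSTER_TOKEN" \
  --initial-advertise-peer-urls "$ETCD_INITIAL_ADVERTISE_PEER_URLS" \
  --data-dir "$ETCD_DATA_DIR"
service etcd start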
Tuning etcd
Because etcd is used to store Kubernetes’ configuration information, its performance is crucial to the efficient operation of your cluster. Fortunately, etcd can be tuned to operate better under various deployment conditions. All write operations require synchronization between all etcd nodes, which leads us to the following functional requirements:
- etcd needs fast access to disk
- etcd needs low latency to other etcd nodes, and thus fast networking
- etcd needs to synchronize data across all etcd nodes before writing data to disk
Therefore, the following recommendations can be made:
- The etcd store should not be located on the same disk as a disk-intensive service (such as Ceph); a quick disk-latency check is sketched after this list
- etcd nodes should not be spread across datacenters or, in the case of public clouds, availability zones
- The number of etcd nodes should be 3; you need an odd number to prevent “split brain” problems, but more than 3 can be a drag on performance
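To judge whether a disk is fast enough, you can measure fsync latency on the volume that will hold etcd’s write-ahead log. The fio invocation below is a sketch (fio must be installed, and the target directory is an assumption); the fsync/fdatasync latency percentiles in its output are the numbers to watch, and a 99th percentile in the single-digit milliseconds is generally considered comfortable for etcd.

# Hedged sketch: measure fdatasync latency where the etcd WAL will live
mkdir -p /var/lib/etcd/fio-test
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-disk-check
rm -rf /var/lib/etcd/fio-test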
The default etcd settings are not ideal for environments with slow disk I/O, which are typical of test environments. For those cases, set the following values:
ETCD_ELECTION_TIMEOUT=5000   # default 1000ms
ETCD_HEARTBEAT_INTERVAL=250  # default 100ms
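Where these variables live depends on how etcd was deployed; a common place is the environment file that the etcd service sources (for example /etc/default/etcd or /etc/etcd.env, both assumptions here). They map directly onto etcd command-line flags, as in this sketch:

# In the etcd environment file (location depends on your deployment):
ETCD_HEARTBEAT_INTERVAL=250    # flag equivalent: --heartbeat-interval=250
ETCD_ELECTION_TIMEOUT=5000     # flag equivalent: --election-timeout=5000
# Then restart etcd on each node, one at a time:
service etcd restart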
Note that raising these values has a negative impact on read/write performance, and it also delays leader elections, because the cluster takes longer to notice that something is wrong. If the values are too low, however, poor network or disk latency will cause the cluster to assume there’s a problem and trigger frequent re-elections.
Troubleshooting etcd
Here are some problems we’ve run into with etcd, and the solutions we came up with to fix them.
Problem: My restore fails and I see "etcdmain: database file (/var/lib/etcd/member/snap/db) of the backend is missing" in my etcd log.
Solution: The etcd v2 backup took place while etcd was writing a snapshot file, so this backup file is not usable. The only solution is to restore from another backup file.

Problem: Why is etcd not listening on port 2379?
Solution: There are several possible reasons. First, ensure that the etcd service is running. Next, check the etcd service logs on each host for issues with elections or quorum. A majority of the cluster (the actual formula is N/2 + 1) must be online before any data can be read or written; this quorum requirement prevents split-brain situations in which different data is written on different sides of the cluster. That means a 3-node cluster must have at least 2 functional nodes. A quick health-check sketch follows this list.

Problem: Why does etcd perform so many re-elections?
Solution: Try raising ETCD_ELECTION_TIMEOUT and ETCD_HEARTBEAT_INTERVAL, and try reducing the amount of load on the host. You can find more information here.
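For the port 2379 and re-election problems above, a quick health check from one of the etcd hosts usually narrows things down. A hedged sketch; the endpoint is an assumption, and TLS flags may be needed in your deployment:

# v2-style health overview: walks every member and reports cluster health
etcdctl --endpoints https://127.0.0.1:2379 cluster-health
# v3 API: list members and check each endpoint
ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 member list
ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 endpoint health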
Your turn
So that’s our take on etcd and the issues you need to think of when it comes to Kubernetes. Do you know of any tips we left out, or did we miss your troubleshooting question? Let us know in the comments!
Comments

> ETCDCTL_API=3 etcdctl snapshot save /backupdir

By the way, the `snapshot save` argument is a filename, not a directory.
Hello Matthew. Thank you so much for this article; it has helped my understanding a lot. However, I have a couple of questions.
1. On etcd3 backups, point number 4: what do you mean by "source /etc/default/etcd on each host"? Could you elaborate on that, please?
2. After I restore the etcd cluster, my Kubernetes cluster does not list the nodes. I restarted the kube-apiserver process (container) and it doesn't help. Restarting the kubelet allows me to see the nodes again, but all of my previous k8s cluster state is gone (no namespaces, no deployments, no services, etc.).
Have you encountered this, or could you please point me in the right direction?
I am seeing something similar. After the restore, etcd comes back up and I am able to use etcdctl to get data from it. However, the k8s control plane is not able to make progress, even though kubectl commands are responsive. The control plane itself seems to be in a livelock, and the apiserver and controller manager logs are filled with mismatch messages. I am using etcd3.
etcdctl backup only saves v2 data. The reason you see v3 data is that there are probably v3 requests still in the WAL. Any v3 data that has already been snapshotted out of the WAL will not appear in the restored cluster.
If I have a 1-node cluster and want to add a second node to it, will I need the --force-new-cluster setting? Currently, I am able to add the second node successfully and the cluster reports a healthy state, but all the existing data in the cluster disappears. When I remove the new node, the data reappears after a restart.
I am using the "new" setting instead of the "force-new-cluster" setting.
Any ideas?
Hi,
My company would like to use etcd to store different configurations, but do you think that using the internal kubernetes etcd for that is a good idea?
thank you