In the world of cloud native applications, etcd is the only stateful component of the Kubernetes control plane. This makes matters for an administrator simpler, but the Kubernetes 1.6 release throws a wrench into the process of maintaining 9s of reliability. But don't fear, I'll make sure you're covered.
etcd data store format: v2 or v3?
etcd version 3.0.0 and up supports two different data stores: v2 and v3, but it's important to know what version you're using because it impacts your ability to back up your information.
In Kubernetes 1.5, the default data store format was v2, but v3 was still available if you set it explicitly. In Kubernetes 1.6, the default data store format is v3, but you still need to consider which format the components surrounding Kubernetes use.
For example, Calico, Canal, and Flannel can only write data to the etcd v2 data store, so combining their etcd data with the Kubernetes etcd data store can complicate the maintenance of etcd if Kubernetes is using v3.
Users blindly upgrading from Kubernetes v1.5 to v1.6 may be in for a surprise. (Just one reason it's important to always read the release notes!) Kubernetes v1.6 changes the default etcd backend from v2 to v3, so make sure you manually migrate etcd to v3 before you start the upgrade. Migrating manually lets you ensure data consistency, which requires shutting down all kube-apiservers while the migration runs.
If you don't want to migrate just yet, you can pin kube-apiserver back to v2 etcd with the following option:
--storage-backend=etcd2
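If you're not sure whether an existing kube-apiserver is already pinned, you can look for the flag in its command line. The snippet below is only a sketch: storage_backend_of is a hypothetical helper, not part of Kubernetes, and it simply reports "default" when the flag is absent, in which case the default for your Kubernetes version applies.

```shell
# Print the value of --storage-backend found in a kube-apiserver command line,
# or "default" when the flag is absent (hypothetical helper for illustration).
storage_backend_of() (
    for arg in $1; do
        case "$arg" in
            --storage-backend=*)
                echo "${arg#--storage-backend=}"
                exit 0
                ;;
        esac
    done
    echo "default"
)

# Example with a literal command line. On a live master you might instead
# pass the output of: ps -o args= -C kube-apiserver
storage_backend_of "kube-apiserver --storage-backend=etcd2 --secure-port=6443"
```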
Backing up etcd
All configuration data for Kubernetes is stored inside etcd, so in the event of an irrecoverable disaster, an operator can use an etcd backup to recover all data. Etcd creates snapshots regularly on its own, but daily backups stored on a separate host are a good strategy for disaster recovery for Kubernetes.
Backup methods
etcd has different backup methods for v2 and v3, and each has its own advantages and disadvantages. The v3 backup is much cleaner and consists of a single, compact file, but it has one major drawback: it won't back up or recover v2 data.
This means that if you have only etcd v3 data (for example, if your network plugin doesn't consume etcd), you can use the v3 backup, but if you have any v2 data--even if it's mixed with v3 data--you must use the v2 backup method.
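That decision rule is small enough to write down as a script. The following is a minimal sketch, assuming you already know whether the store holds any v2 data; backup_command is a hypothetical helper that only prints the matching command rather than running it, and the paths are examples.

```shell
# Print the backup command that matches the data in the store.
# $1: "yes" if the store holds any v2 data, "no" if it holds only v3 data
# $2: etcd data directory (used by the v2 method)
# $3: backup destination (a directory for v2, a file for v3)
backup_command() (
    if [ "$1" = "yes" ]; then
        # v2 or mixed v2+v3 data: use the v2 backup method
        echo "etcdctl backup --data-dir $2 --backup-dir $3"
    else
        # v3-only data: the single-file snapshot is sufficient
        echo "ETCDCTL_API=3 etcdctl snapshot save $3"
    fi
)

backup_command yes /var/lib/etcd /backupdir
backup_command no /var/lib/etcd /backupdir/snapshot.db
```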
Let's look at each of these methods.
Etcd v2 backups
The etcd v2 backup method creates a directory structure with a single WAL file. You can perform a backup online without interrupting etcd cluster operations. To back up an etcd v2+v3 data store, use the following command:
etcdctl backup --data-dir /var/lib/etcd/ --backup-dir /backupdir
You can find the official procedure for etcd v2 restore here, but here is an overview of the basic steps. The challenging part is to rebuild the cluster one node at a time.
- Stop etcd on all hosts
- Purge /var/lib/etcd/member on all hosts
- Copy the backup to /var/lib/etcd/member on the first etcd host
- Start up etcd on the first etcd host with --force-new-cluster
- Set the correct PeerURL on the first etcd host to the IP of the node instead of 127.0.0.1
- Add the next host to the cluster
- Start etcd on the next host with --initial-cluster set to existing etcd hosts + itself
- Repeat the previous two steps until all etcd nodes are joined
- Restart etcd normally (using existing settings)
You can see these steps in the following script:
#!/bin/bash -e

# Change as necessary
RESTORE_PATH=${RESTORE_PATH:-/tmp/member}

# Extract node data from etcd config
source /etc/etcd.env || source /etc/default/etcd

# Run a command up to three times, exiting if the last attempt fails
function with_retries {
  local retries=3
  set -o pipefail
  for try in $(seq 1 $retries); do
    # "cmd && break" keeps "bash -e" from aborting on a failed attempt
    "$@" && break
    if [[ "$try" == "$retries" ]]; then
      exit 1
    fi
    sleep 3
  done
  set +o pipefail
}

this_node=$ETCD_NAME
node_names=($(echo $ETCD_INITIAL_CLUSTER | \
  awk -F'[=,]' '{for (i=1;i<=NF;i+=2) { print $i }}'))
node_endpoints=($(echo $ETCD_INITIAL_CLUSTER | \
  awk -F'[=,]' '{for (i=2;i<=NF;i+=2) { print $i }}'))
node_ips=($(echo $ETCD_INITIAL_CLUSTER | \
  awk -F'://|:[0-9]' '{for (i=2;i<=NF;i+=2) { print $i }}'))
num_nodes=${#node_names[@]}

# Stop and purge etcd data
for i in $(seq 0 $((num_nodes - 1))); do
  ssh ${node_ips[$i]} sudo service etcd stop
  ssh ${node_ips[$i]} sudo docker rm -f ${node_names[$i]} \
    || : # Kargo specific
  ssh ${node_ips[$i]} sudo rm -rf /var/lib/etcd/member
done

# Restore on first node
if [[ "$this_node" == ${node_names[0]} ]]; then
  sudo cp -R $RESTORE_PATH /var/lib/etcd/
else
  rsync -vaz -e "ssh" --rsync-path="sudo rsync" \
    "$RESTORE_PATH" ${node_ips[0]}:/var/lib/etcd/
fi

ssh ${node_ips[0]} "sudo etcd --force-new-cluster 2> /tmp/etcd-restore.log" &
echo "Sleeping 5s to wait for etcd up"
sleep 5

# Fix member endpoint on first node
member_id=$(with_retries ssh ${node_ips[0]} \
  ETCDCTL_ENDPOINTS=https://localhost:2379 \
  etcdctl member list | cut -d':' -f1)
ssh ${node_ips[0]} ETCDCTL_ENDPOINTS=https://localhost:2379 \
  etcdctl member update $member_id ${node_endpoints[0]}
echo "Waiting for etcd to reconfigure peer URL"
sleep 4

# Add other nodes
initial_cluster="${node_names[0]}=${node_endpoints[0]}"
for i in $(seq 1 $((num_nodes - 1))); do
  echo "Adding node ${node_names[$i]} to ETCD cluster..."
  initial_cluster="$initial_cluster,${node_names[$i]}=${node_endpoints[$i]}"
  with_retries ssh ${node_ips[0]} \
    ETCDCTL_ENDPOINTS=https://localhost:2379 \
    etcdctl member add ${node_names[$i]} ${node_endpoints[$i]}
  ssh ${node_ips[$i]} \
    "sudo etcd --initial-cluster=$initial_cluster &>/dev/null" &
  sleep 5
  with_retries ssh ${node_ips[0]} \
    ETCDCTL_ENDPOINTS=https://localhost:2379 etcdctl member list
done

echo "Restarting etcd on all nodes"
for i in $(seq 0 $((num_nodes - 1))); do
  ssh ${node_ips[$i]} sudo service etcd restart
done

sleep 5

echo "Verifying cluster health"
with_retries ssh ${node_ips[0]} \
  ETCDCTL_ENDPOINTS=https://localhost:2379 etcdctl cluster-health
Etcd v3 backups
The etcd v3 backup creates a single snapshot file. Remember that while v2 backups, perhaps surprisingly, also copy v3 data, the v3 backup cannot back up etcd v2 data, so check what your store contains before using this method. To create a v3 backup, run the command:
ETCDCTL_API=3 etcdctl snapshot save /backupdir
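For daily backups it helps to give each snapshot its own file name and to check that the file is readable before you rely on it. Here is a minimal sketch, assuming the argument to snapshot save is a file path; the naming scheme and the snapshot_path and take_snapshot helpers are illustrative, not an etcd convention. Copying the file to a separate host is left out.

```shell
# Build a timestamped snapshot path (the naming scheme is illustrative).
# $1: backup directory
# $2: timestamp, such as the output of `date +%Y%m%d-%H%M%S`
snapshot_path() (
    echo "${1%/}/etcd-snapshot-$2.db"
)

# Save a snapshot into directory $1, then ask etcdctl to read it back.
take_snapshot() (
    file=$(snapshot_path "$1" "$(date +%Y%m%d-%H%M%S)")
    ETCDCTL_API=3 etcdctl snapshot save "$file" &&
        ETCDCTL_API=3 etcdctl snapshot status "$file"
)

# Usage on an etcd host: take_snapshot /backupdir
```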
The official procedure for etcd v3 restore is documented here, but as you can see, the general process is much simpler than it was for v2; the v3 restore process is capable of rebuilding the cluster without such granular steps.
The steps required are as follows:
- Stop etcd on all hosts
- Purge /var/lib/etcd/member on all hosts
- Copy the backup file to each etcd host
- source /etc/default/etcd on each host and run the following command:
ETCDCTL_API=3 etcdctl snapshot restore BACKUP_FILE \
  --name $ETCD_NAME --initial-cluster "$ETCD_INITIAL_CLUSTER" \
  --initial-cluster-token "$ETCD_INITIAL_CLUSTER_TOKEN" \
  --initial-advertise-peer-urls $ETCD_INITIAL_ADVERTISE_PEER_URLS \
  --data-dir $ETCD_DATA_DIR
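Because the restore command leans entirely on variables from the sourced file, an empty one can quietly produce a wrong command line. The following sketch checks them first; missing_restore_vars is a hypothetical helper, and it assumes the five variables used above are the only ones you need.

```shell
# Report any variable needed by the restore command that is unset or empty.
# Prints the missing names and returns non-zero if there are any.
missing_restore_vars() (
    missing=""
    for name in ETCD_NAME ETCD_INITIAL_CLUSTER ETCD_INITIAL_CLUSTER_TOKEN \
                ETCD_INITIAL_ADVERTISE_PEER_URLS ETCD_DATA_DIR; do
        # Indirect lookup of the variable whose name is stored in $name
        eval "value=\${$name:-}"
        [ -n "$value" ] || missing="$missing $name"
    done
    [ -z "$missing" ] && exit 0
    echo "missing:$missing"
    exit 1
)

# Usage on each host, in the same shell that will run the restore:
#   . /etc/default/etcd && missing_restore_vars
```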
Tuning etcd
Because etcd stores Kubernetes' configuration information, its performance is crucial to the efficient operation of your cluster. Fortunately, etcd can be tuned to better operate under various deployment conditions. Every write operation must be synchronized between etcd nodes before it is acknowledged, which leads us to the following functional requirements:
- etcd needs fast access to disk
- etcd needs low latency to other etcd nodes, and thus fast networking
- etcd needs to replicate each write to a majority of etcd nodes before the write is committed
Therefore, the following recommendations can be made:
- The etcd store should not be located on the same disk as a disk-intensive service (such as Ceph)
- etcd nodes should not be spread across datacenters or, in the case of public clouds, availability zones
- The number of etcd nodes should be 3; you need an odd number to prevent "split brain" problems, but more than 3 can be a drag on performance
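The arithmetic behind the odd-number advice is easy to check. The sketch below uses the quorum formula N/2 + 1 (with integer division) that appears in the troubleshooting section; the function names are hypothetical. It shows that an even-sized cluster tolerates no more failures than the odd-sized cluster one node smaller.

```shell
# Minimum number of members that must be up in a cluster of $1 members.
quorum() (
    echo $(( $1 / 2 + 1 ))
)

# Number of members that can fail while the cluster keeps quorum.
fault_tolerance() (
    echo $(( $1 - ($1 / 2 + 1) ))
)

for n in 1 2 3 4 5; do
    echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

A 4-node cluster needs 3 members up and tolerates 1 failure, which is no better than a 3-node cluster, while adding one more node to synchronize.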
The default etcd settings are not ideal for the slow disk I/O typically seen in test environments. In such environments, set the following values:
ETCD_ELECTION_TIMEOUT=5000 #default 1000ms
ETCD_HEARTBEAT_INTERVAL=250 #default 100ms
Note that raising these values has a negative impact on read/write performance. It also lengthens elections, because the system takes longer to realize something is wrong. If the values are too low, however, poor network or disk latency will lead the cluster to assume there's a problem and perform re-elections frequently.
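When you change one of these values, it's worth checking that the two still make sense together. The sketch below assumes a rule of thumb that the election timeout should be several times the heartbeat interval; the default factor of 5 is an assumption for illustration, and timeouts_ok is a hypothetical helper, so check the etcd tuning documentation for the guidance that applies to your version.

```shell
# Succeed if the election timeout is at least $3 times the heartbeat interval.
# $1: election timeout in ms
# $2: heartbeat interval in ms
# $3: minimum ratio (defaults to 5, an assumed rule of thumb)
timeouts_ok() (
    min_ratio=${3:-5}
    [ "$1" -ge $(( $2 * min_ratio )) ]
)

if timeouts_ok 5000 250; then
    echo "election timeout and heartbeat interval look consistent"
fi
```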
Troubleshooting etcd
Here are some problems we've run into with etcd, and the solutions we came up with to fix them.
Problem | Solution |
My restore fails and I see “etcdmain: database file (/var/lib/etcd/member/snap/db) of the backend is missing” in my etcd log. | The etcd v2 backup took place while etcd was writing a snapshot file. This backup file is not usable. The only solution is to restore from another backup file. |
Why is etcd not listening on port 2379? | There are several possible reasons. First, ensure that the etcd service is running. Next, check the etcd service logs on each host to see if there are issues with election and/or quorum. A majority of the cluster must be online (the actual formula is N/2 + 1, rounded down) in order for any data to be read or written. This prevents split brain problems, where different data is written across the cluster. That means a 3-node cluster must have at least 2 functional nodes. |
Why does etcd perform so many re-elections? | Try raising ETCD_ELECTION_TIMEOUT and ETCD_HEARTBEAT_INTERVAL. Also, try reducing the amount of load on the host. You can find more information here. |
Your turn
So that's our take on etcd and the issues you need to think of when it comes to Kubernetes. Do you know of any tips we left out, or did we miss your troubleshooting question? Let us know in the comments!