Everything you ever wanted to know about using etcd with Kubernetes v1.6 (but were afraid to ask)

In the world of cloud native applications, etcd is the only stateful component of the Kubernetes control plane. That makes life simpler for an administrator, but the Kubernetes 1.6 release throws a wrench into the process of maintaining multiple 9s of reliability. But don’t fear, I’ll make sure you’re covered.

etcd data store format: v2 or v3?

etcd version 3.0.0 and up supports two different data stores, v2 and v3. It’s important to know which one you’re using, because it determines how you can back up your information.

In Kubernetes 1.5, the default data store format was v2, but v3 was available if you set it explicitly. In Kubernetes v1.6, the default data store is etcd v3, but you still need to think about which format the various components that surround it use.

For example, Calico, Canal, and Flannel can only write data to the etcd v2 data store, so combining their etcd data with the Kubernetes etcd data store can complicate the maintenance of etcd if Kubernetes is using v3.

Users blindly upgrading from Kubernetes v1.5 to v1.6 may be in for a surprise. (Just one reason it’s important to always read the release notes!) Kubernetes v1.6 changes the default etcd backend from v2 to v3, so before you upgrade, make sure you manually migrate the etcd data to v3. To keep the data consistent, the migration requires shutting down all kube-apiservers first.
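
As a rough illustration, the offline migration on a single etcd member might look like the sketch below. It assumes etcd 3.0 to 3.3 (releases that ship the `etcdctl migrate` subcommand), a data directory of /var/lib/etcd, and an etcd service managed with `service`. The `migrate_member` and `run` helpers and the `DRY_RUN` switch are hypothetical names used only for this example. How you stop kube-apiserver depends on your deployment, so that part is left as a comment.

```shell
#!/bin/bash
# Sketch: offline v2 -> v3 migration for one etcd member.
# DRY_RUN=1 (the default here) only prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

migrate_member() {
  local data_dir="${1:-/var/lib/etcd}"
  # Assumption: all kube-apiservers have already been stopped at this point.
  run sudo service etcd stop
  run sudo env ETCDCTL_API=3 etcdctl migrate --data-dir "$data_dir"
  run sudo service etcd start
}

migrate_member /var/lib/etcd
```

With the default dry run, the script only prints the three commands, which makes it easy to review the plan before running it for real on each member.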

If you don’t want to migrate just yet, you can pin kube-apiserver back to v2 etcd with the following option:

--storage-backend=etcd2

Backing up etcd

All configuration data for Kubernetes is stored inside etcd, so in the event of an irrecoverable disaster, an operator can use an etcd backup to recover all data. Etcd creates snapshots regularly on its own, but daily backups stored on a separate host are a good strategy for disaster recovery for Kubernetes.

Backup methods

etcd has different backup methods for v2 and v3, and each has its own advantages and disadvantages. The v3 backup is much cleaner and consists of a single, compact file, but it has one major drawback: it won’t back up or recover v2 data.

This means that if you have only etcd v3 data (for example, if your network plugin doesn’t consume etcd), you can use the v3 backup, but if you have any v2 data–even if it’s mixed with v3 data–you must use the v2 backup method.
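
One way to apply this rule is to check whether the v2 store contains any keys at all. The sketch below assumes the v2 API listing command `etcdctl ls /` can reach your cluster without extra flags. `choose_backup_method` is a hypothetical helper name. Note that a failed connection also produces empty output, so check that etcdctl works before trusting a "v3" answer.

```shell
#!/bin/bash
# Reads a v2 key listing on stdin and prints which backup method applies.
choose_backup_method() {
  # grep -q . succeeds if at least one non-empty line was read.
  if grep -q .; then
    echo "v2"
  else
    echo "v3"
  fi
}

# Intended use on an etcd host (not run here):
#   ETCDCTL_API=2 etcdctl ls / | choose_backup_method
```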

Let’s look at each of these methods.

Etcd v2 backups

The etcd v2 backup method creates a directory structure with a single WAL file. You can perform a backup online without interrupting etcd cluster operations. To back up an etcd v2+v3 data store, use the following command:

etcdctl backup --data-dir /var/lib/etcd/ --backup-dir /backupdir
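
For the daily backups suggested earlier, a small wrapper can write each backup into its own dated directory. This is only a sketch: `backup_dir_for` and `daily_v2_backup` are hypothetical helper names, the `etcd-v2-` prefix is an arbitrary choice, and copying the result to a separate host is left out.

```shell
#!/bin/bash
# Builds a per-day backup directory name such as /backupdir/etcd-v2-2017-04-01.
backup_dir_for() {
  local base="$1"
  local day="${2:-$(date +%F)}"
  # ${base%/} drops a trailing slash so the path has no double slash.
  echo "${base%/}/etcd-v2-${day}"
}

# Runs the v2 backup into today's directory (for example from a daily cron job).
daily_v2_backup() {
  local target
  target=$(backup_dir_for "${1:-/backupdir}")
  etcdctl backup --data-dir /var/lib/etcd/ --backup-dir "$target"
}
```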

You can find the official procedure for etcd v2 restore here, but here is an overview of the basic steps. The challenging part is to rebuild the cluster one node at a time.

  1. Stop etcd on all hosts
  2. Purge /var/lib/etcd/member on all hosts
  3. Copy the backup to /var/lib/etcd/member on the first etcd host
  4. Start up etcd on the first etcd host with --force-new-cluster
  5. Set the correct PeerURL on the first etcd host to the IP of the node instead of 127.0.0.1
  6. Add the next host to the cluster
  7. Start etcd on the next host with --initial-cluster set to existing etcd hosts + itself
  8. Repeat 6 and 7 until all etcd nodes are joined
  9. Restart etcd normally (using existing settings)

You can see these steps in the following script:

#!/bin/bash -e

# Change as necessary
RESTORE_PATH=${RESTORE_PATH:-/tmp/member}

#Extract node data from etcd config
source /etc/etcd.env || source /etc/default/etcd
function with_retries {
  local retries=3
  set -o pipefail
  for try in $(seq 1 $retries); do
    # Using "&&" keeps a failed attempt from tripping "bash -e" before we retry
    "$@" && break
    if [[ "$try" == "$retries" ]]; then
      exit 1
    fi
    sleep 3
  done
  set +o pipefail
}

this_node=$ETCD_NAME
node_names=($(echo $ETCD_INITIAL_CLUSTER | \
  awk -F'[=,]' '{for (i=1;i<=NF;i+=2) { print $i }}'))
node_endpoints=($(echo $ETCD_INITIAL_CLUSTER | \
  awk -F'[=,]' '{for (i=2;i<=NF;i+=2) { print $i }}'))
node_ips=($(echo $ETCD_INITIAL_CLUSTER | \
  awk -F'://|:[0-9]' '{for (i=2;i<=NF;i+=2) { print $i }}'))
num_nodes=${#node_names[@]}

# Stop and purge etcd data
for i in `seq 0 $((num_nodes - 1))`; do
  ssh ${node_ips[$i]} sudo service etcd stop
  ssh ${node_ips[$i]} sudo docker rm -f ${node_names[$i]} \
    || : # Kargo specific
  ssh ${node_ips[$i]} sudo rm -rf /var/lib/etcd/member
done

# Restore on first node
if [[ "$this_node" == ${node_names[0]} ]]; then
  sudo cp -R $RESTORE_PATH /var/lib/etcd/
else
  rsync -vaz -e "ssh" --rsync-path="sudo rsync" \
    "$RESTORE_PATH" ${node_ips[0]}:/var/lib/etcd/
fi 

ssh ${node_ips[0]} "sudo etcd --force-new-cluster 2> \
  /tmp/etcd-restore.log" &
echo "Sleeping 5s to wait for etcd up"
sleep 5

# Fix member endpoint on first node
member_id=$(with_retries ssh ${node_ips[0]} \
  ETCDCTL_ENDPOINTS=https://localhost:2379 \
  etcdctl member list | cut -d':' -f1)
ssh ${node_ips[0]} ETCDCTL_ENDPOINTS=https://localhost:2379 \
  etcdctl member update $member_id ${node_endpoints[0]}
echo "Waiting for etcd to reconfigure peer URL"
sleep 4

# Add other nodes
initial_cluster="${node_names[0]}=${node_endpoints[0]}"
for i in `seq 1 $((num_nodes -1))`; do
  echo "Adding node ${node_names[$i]} to ETCD cluster..."
  # Append this node to the --initial-cluster list
  initial_cluster+=",${node_names[$i]}=${node_endpoints[$i]}"
  with_retries ssh ${node_ips[0]} \
    ETCDCTL_ENDPOINTS=https://localhost:2379 \
    etcdctl member add ${node_names[$i]} ${node_endpoints[$i]}
  ssh ${node_ips[$i]} \
    "sudo etcd --initial-cluster=$initial_cluster &>/dev/null" &
  sleep 5
  with_retries ssh ${node_ips[0]} \
    ETCDCTL_ENDPOINTS=https://localhost:2379 etcdctl member list
done

echo "Restarting etcd on all nodes"
for i in `seq 0 $((num_nodes -1))`; do
  ssh ${node_ips[$i]} sudo service etcd restart
done

sleep 5

echo "Verifying cluster health"
with_retries ssh ${node_ips[0]} \
  ETCDCTL_ENDPOINTS=https://localhost:2379 etcdctl cluster-health
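
To see what the three awk expressions at the top of the script extract, the sketch below runs them against a made-up three-node value of ETCD_INITIAL_CLUSTER. The node names and addresses are hypothetical, and so are the `parse_*` function names. The IP expression relies on each peer URL having the form scheme://host:port, so it would not cope with IPv6 addresses.

```shell
#!/bin/bash
# Hypothetical value; on a real host this comes from /etc/etcd.env or /etc/default/etcd.
ETCD_INITIAL_CLUSTER="etcd1=https://10.0.0.1:2380,etcd2=https://10.0.0.2:2380,etcd3=https://10.0.0.3:2380"

# Split on "=" and ",": odd fields are node names, even fields are peer URLs.
parse_names()     { echo "$1" | awk -F'[=,]' '{for (i=1;i<=NF;i+=2) print $i}'; }
parse_endpoints() { echo "$1" | awk -F'[=,]' '{for (i=2;i<=NF;i+=2) print $i}'; }
# Split on "://" and on ":" followed by a digit: even fields are the hosts.
parse_ips()       { echo "$1" | awk -F'://|:[0-9]' '{for (i=2;i<=NF;i+=2) print $i}'; }

parse_names "$ETCD_INITIAL_CLUSTER"      # etcd1, etcd2, etcd3 (one per line)
parse_endpoints "$ETCD_INITIAL_CLUSTER"  # https://10.0.0.1:2380, ... (one per line)
parse_ips "$ETCD_INITIAL_CLUSTER"        # 10.0.0.1, 10.0.0.2, 10.0.0.3 (one per line)
```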

Etcd v3 backups

The etcd v3 backup creates a single compact file. Remember, while v2 backups surprisingly also copy v3 data, the v3 backup cannot be used to back up etcd v2 data, so be careful before using this method. To create a v3 backup, run the following command, passing the path of the snapshot file to write (the file name here is only an example):

ETCDCTL_API=3 etcdctl snapshot save /backupdir/snapshot.db
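
Since `snapshot save` writes to a file path, it is worth confirming that the file was actually written before relying on it. The sketch below is one possible way to do that. It assumes etcdctl can reach the local endpoint without extra TLS or endpoint flags. `snapshot_looks_usable` and `save_and_check_snapshot` are hypothetical helper names.

```shell
#!/bin/bash
# Sketch: save a v3 snapshot and sanity-check the result.

# Succeeds only if the file exists and is not empty.
snapshot_looks_usable() {
  [ -s "$1" ]
}

save_and_check_snapshot() {
  local file="$1"
  ETCDCTL_API=3 etcdctl snapshot save "$file" || return 1
  snapshot_looks_usable "$file" || {
    echo "snapshot $file is empty or missing" >&2
    return 1
  }
  # Prints the snapshot's hash, revision, key count, and size.
  ETCDCTL_API=3 etcdctl snapshot status "$file"
}

# Intended use on an etcd host (not run here):
#   save_and_check_snapshot "/backupdir/etcd-v3-$(date +%F).db"
```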

The official procedure for etcd v3 restore is documented here, but as you can see, the general process is much simpler than it was for v2; the v3 restore process is capable of rebuilding the cluster without such granular steps.

The steps required are as follows:

  1. Stop etcd on all hosts
  2. Purge /var/lib/etcd/member on all hosts
  3. Copy the backup file to each etcd host
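  4. Run source /etc/default/etcd on each host to load the etcd settings into your shell, then run the following command:
ETCDCTL_API=3 etcdctl snapshot restore BACKUP_FILE \
--name $ETCD_NAME --initial-cluster "$ETCD_INITIAL_CLUSTER" \
--initial-cluster-token "$ETCD_INITIAL_CLUSTER_TOKEN" \
--initial-advertise-peer-urls $ETCD_INITIAL_ADVERTISE_PEER_URLS \
--data-dir $ETCD_DATA_DIR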
--initial-advertise-peer-urls $ETCD_INITIAL_ADVERTISE_PEER_URLS \
--data-dir $ETCD_DATA_DIR
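
The restore command depends on variables that `source /etc/default/etcd` loads into the current shell, and a missing one would silently produce a broken command line. A small guard like the following sketch can catch that before the restore runs. `missing_restore_vars` is a hypothetical helper name, and the variable list is simply the set used in the command above.

```shell
#!/bin/bash
# Prints the name of every required variable that is unset or empty.
missing_restore_vars() {
  local name value
  for name in ETCD_NAME ETCD_INITIAL_CLUSTER ETCD_INITIAL_CLUSTER_TOKEN \
              ETCD_INITIAL_ADVERTISE_PEER_URLS ETCD_DATA_DIR; do
    # Look up the variable whose name is stored in $name (names are fixed above).
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
}

# Intended use on an etcd host (not run here):
#   source /etc/default/etcd
#   missing=$(missing_restore_vars)
#   [ -z "$missing" ] || { echo "unset: $missing" >&2; exit 1; }
```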

Tuning etcd

Because etcd is used to store Kubernetes’ configuration information, its performance is crucial to the performance of your cluster. Fortunately, etcd can be tuned to operate better under various deployment conditions. Every write operation must be replicated to a majority of the etcd nodes before it is committed, which leads us to the following functional requirements:

  • etcd needs fast access to disk
  • etcd needs low latency to other etcd nodes, and thus fast networking
  • etcd needs to replicate each write to a majority of the etcd nodes, and persist it to disk, before the write is acknowledged

Therefore, the following recommendations can be made:

  • The etcd store should not be located on the same disk as a disk-intensive service (such as Ceph)
  • etcd nodes should not be spread across datacenters or, in the case of public clouds, availability zones
  • The number of etcd nodes should be 3; you need an odd number to prevent “split brain” problems, but more than 3 can be a drag on performance
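
The arithmetic behind the odd-number recommendation can be sketched as follows, using the quorum rule of N/2 + 1 with integer division. `quorum_size` and `failures_tolerated` are hypothetical helper names.

```shell
#!/bin/bash
# Smallest number of nodes that forms a majority of an N-node cluster.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# How many nodes can fail while a quorum is still available.
failures_tolerated() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
  echo "nodes=$n quorum=$(quorum_size "$n") tolerated_failures=$(failures_tolerated "$n")"
done
```

A 4-node cluster needs 3 nodes for quorum and so tolerates only one failure, the same as a 3-node cluster, which is why the extra even node adds replication cost without adding resilience.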

The default etcd settings are not ideal for the slow disk I/O typically seen in test environments. In such environments, consider setting the following values:

ETCD_ELECTION_TIMEOUT=5000 #default 1000ms
ETCD_HEARTBEAT_INTERVAL=250 #default 100ms

Note that raising these values has a negative impact on read/write performance. It also delays elections, because the cluster takes longer to realize something is wrong. If these values are too low, however, poor network or disk latency will lead the cluster to assume there’s a problem and perform re-elections frequently.
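
When experimenting with these two settings, it helps to keep an eye on their ratio. The sketch below assumes the rule, as I understand etcd's startup validation, that the election timeout must be at least five times the heartbeat interval. Treat that threshold as an assumption and check it against the documentation for your etcd version. `check_etcd_timing` is a hypothetical helper name.

```shell
#!/bin/bash
# Usage: check_etcd_timing HEARTBEAT_MS ELECTION_MS
check_etcd_timing() {
  local heartbeat="$1"
  local election="$2"
  # Assumed rule: election timeout >= 5 x heartbeat interval.
  if [ "$election" -lt $(( heartbeat * 5 )) ]; then
    echo "election timeout ${election}ms is less than 5x heartbeat ${heartbeat}ms"
    return 1
  fi
  echo "ok: election timeout is $(( election / heartbeat ))x the heartbeat interval"
}

check_etcd_timing 100 1000   # the defaults: 10x
check_etcd_timing 250 5000   # the values suggested above: 20x
```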

Troubleshooting etcd

Here are some problems we’ve run into with etcd, and the solutions we came up with to fix them.

Problem: My restore fails and I see “etcdmain: database file (/var/lib/etcd/member/snap/db) of the backend is missing” in my etcd log.
Solution: The etcd v2 backup took place while etcd was writing a snapshot file. This backup file is not usable. The only solution is to restore from another backup file.

Problem: Why is etcd not listening on port 2379?
Solution: There are several possible reasons. First, ensure that the etcd service is running. Next, check etcd service logs on each host to see if there are issues with election and/or quorum. More than half of the cluster must be online (the actual formula is N/2 + 1, rounded down to a whole number) in order for any data to be read or written. This prevents split brain problems, so you won’t find yourself in a situation where different data is written across the cluster. That means a 3 node cluster must have at least 2 functional nodes.

Problem: Why does etcd perform so many re-elections?
Solution: Try raising ETCD_ELECTION_TIMEOUT and ETCD_HEARTBEAT_INTERVAL. Also, try reducing the amount of load on the host. You can find more information here.

 

Your turn

So that’s our take on etcd and the issues you need to think of when it comes to Kubernetes. Do you know of any tips we left out, or did we miss your troubleshooting question? Let us know in the comments!

3 responses to “Everything you ever wanted to know about using etcd with Kubernetes v1.6 (but were afraid to ask)”

  1. Hello Matthew. Thank you so much for this article. It has helped me a lot in my understanding. However I have a couple of questions.

    1. On etcd3 backups, on point number 4, what do you mean “source /etc/default/etcd on each host” ? Could you elaborate more on that please.

    2. After I restore the etcd cluster, my kubernetes cluster does not list the nodes. I restarted the kube-apiserver process (container) and that does not help. I restarted the kubelet and that allows me to see the nodes again, but all my previous k8s cluster state is gone (no namespaces, no deployments, no services, etc.)

    Have you encountered this, or could you please point me in the right direction?

    1. I am seeing something similar. Post restore, etcd comes back up and I am able to use etcdctl to get data from etcd. However, the k8s control plane is not able to make progress, even though kubectl commands were responsive. The control plane itself seems to be in a livelock. The apiserver and controller manager logs are filled with messages of mismatch. I am using etcd3
