Shared storage for OpenStack based on DRBD
May 19, 2011
Storage is a tricky part of a cloud environment. We want it to be fast, network-accessible, and as reliable as possible. One way is to go to the shop and buy yourself a SAN solution from a prominent vendor for serious money. Another way is to take commodity hardware and use open source magic to turn it into distributed network storage. Guess what we did?
We have several primary goals. First, our storage has to be reliable: we want to survive both minor and major hardware failures, from a single HDD dying to a whole host losing power. Second, it must be flexible enough to slice quickly and easily, and to resize slices as we like. Third, we will manage and mount our storage from cloud nodes over the network. And, last but not least, we want decent performance from it.
For now, we have decided on the DRBD driver for our storage. DRBD® refers to block devices designed as a building block for high availability (HA) clusters. This is done by mirroring a whole block device via an assigned network: DRBD can be understood as network-based RAID-1. It has lots of features, has been well tested, and is reasonably stable.
DRBD has been supported by the Linux kernel since version 2.6.33. It is implemented as a kernel module and included in the mainline. We can install the DRBD driver and command line interface tools using a standard package distribution mechanism; in our case it is Fedora 14:
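On Fedora 14 this boils down to a single yum command; the kernel module itself ships with the mainline kernel, so only the userland tools are needed (the exact package name may vary between distributions):

```shell
# Install the DRBD userland tools (drbdadm, drbdsetup, drbdmeta)
yum install drbd-utils
```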
Other sections of the common configuration are usually left blank and can be redefined in per-resource configuration files. To create a usable resource, we must create a configuration file for our resource in /etc/drbd.d/drbd0.res. Basic parameters for the resource are:
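A minimal /etc/drbd.d/drbd0.res might look like the sketch below. The backing device /dev/md3 and the hostname host1 are mentioned later in this article; the second hostname, the IP addresses, and the port are illustrative assumptions:

```
resource drbd0 {
  device    /dev/drbd0;      # the DRBD device we will actually use
  disk      /dev/md3;        # underlying backing device on each node
  meta-disk internal;        # keep DRBD metadata on the same device
  on host1 {
    address 192.168.1.1:7789;   # example address
  }
  on host2 {
    address 192.168.1.2:7789;   # example address
  }
}
```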
As we need write access to the resource on both nodes, we must make it ‘primary’ on both nodes. A DRBD device in the primary role can be used unrestrictedly for read and write operations. This mode is called ‘dual-primary’ mode. Dual-primary mode requires additional configuration. In the ‘startup’ section directive, ‘become-primary-on’ is set to ‘both’. In the ‘net’ section, the following is recommended:
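A typical 'net' section for dual-primary operation, following the recommendations in the DRBD documentation (the split-brain recovery policies are the commonly suggested defaults, not something specific to this setup):

```
net {
  allow-two-primaries;
  after-sb-0pri discard-zero-changes;
  after-sb-1pri discard-secondary;
  after-sb-2pri disconnect;
}
```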
The 'allow-two-primaries' directive allows both nodes to hold the primary role and send data.
Resource configuration with all of these considerations applied will be as follows:
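Putting it together, the resource configuration could look like this (again, the peer hostname, addresses, and port are assumptions for illustration):

```
resource drbd0 {
  startup {
    become-primary-on both;
  }
  net {
    allow-two-primaries;
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
  device    /dev/drbd0;
  disk      /dev/md3;
  meta-disk internal;
  on host1 { address 192.168.1.1:7789; }
  on host2 { address 192.168.1.2:7789; }
}
```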
Enabling Resource For The First Time
After the front-end device is created, we bring the resource up:
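The sequence is roughly as follows (metadata creation is only needed on the very first activation):

```shell
# Initialize DRBD metadata on the backing device (first run only)
drbdadm create-md drbd0
# Attach the backing device, load sync parameters, connect to the peer
drbdadm attach drbd0
drbdadm syncer drbd0
drbdadm connect drbd0
```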
This command set must be executed on both nodes. We may collapse the steps drbdadm attach, drbdadm syncer, and drbdadm connect into one by using the shorthand command drbdadm up.
We must now synchronize the resource on both nodes. If we want to replicate data that is already on one of the drives, it is important to run the next command on the host that contains the data. Otherwise, it can be issued on either of the two hosts.
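With DRBD 8.3-era tools this is done by forcing the primary role and overwriting the peer:

```shell
# Run on the node whose data should be the synchronization source (host1 here)
drbdadm -- --overwrite-data-of-peer primary drbd0
```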
This command puts the node host1 in ‘primary’ mode and makes it the synchronization source. This is reflected in the status file /proc/drbd:
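An abbreviated /proc/drbd during the initial sync looks roughly like this (the counters and percentages shown are, of course, illustrative):

```
 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
    ns:548864 nr:0 dw:0 dr:548864 al:0 bm:33 lo:0 pe:0 ua:0 ap:0
    [>....................] sync'ed:  4.1% ...
```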
We can adjust the syncer rate to make the initial and background synchronization faster. To speed up the initial sync, the drbdsetup command is used:
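A temporary rate override via drbdsetup could look like this (110M is roughly the practical limit of Gigabit Ethernet):

```shell
# Temporarily raise the resync rate for the initial sync
drbdsetup /dev/drbd0 syncer -r 110M
# Once the initial sync is finished, revert to the configured rate
drbdadm adjust drbd0
```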
This allows us to consume almost all bandwidth of Gigabit Ethernet. The background syncer rate is configured in the corresponding config file section:
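The persistent background rate goes into the 'syncer' section of the resource configuration; 33M here follows the 30%-of-Gigabit-Ethernet rule of thumb discussed below:

```
syncer {
  rate 33M;
}
```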
The exact rate depends on available bandwidth and should be set to about 30% of the maximum throughput of the slowest I/O subsystem involved (network or disk): DRBD throttles resynchronization when it interferes with the application data flow.
LVM Over DRBD Configuration
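First, the DRBD device is initialized as an LVM Physical Volume:

```shell
pvcreate /dev/drbd0
```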
This command writes the LVM Physical Volume signature onto the drbd0 device, and therefore also onto the underlying md3 device. This can pose a problem, as LVM's default behavior is to scan all block devices for PV signatures: two devices with the same UUID will be detected and an error issued. This can be avoided by excluding /dev/md3 from scanning in the /etc/lvm/lvm.conf file using the 'filter' parameter:
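A filter that rejects the backing device and accepts everything else might look like this:

```
filter = [ "r|^/dev/md3$|", "a|.*|" ]
```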
The vgscan command must be executed after the file is changed. It forces LVM to discard its configuration cache and re-scan the devices for PV signatures.
It is also necessary to disable the LVM write cache:
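In /etc/lvm/lvm.conf:

```
write_cache_state = 0
```

After disabling the write cache, it is also wise to delete the stale cache file (its default location is /etc/lvm/cache/.cache).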
These steps must be repeated on the peer node. Now we can create a Volume Group using the configured PV /dev/drbd0, and a Logical Volume in this VG. Execute these commands on one of the nodes:
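For example (the names vg0 and lv0 and the size are illustrative):

```shell
vgcreate vg0 /dev/drbd0
lvcreate -n lv0 -L 10G vg0
```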
To make use of this VG and LV on the peer node, we must activate them there:
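On the peer node (again assuming the example name vg0):

```shell
# Rescan for PV/VG metadata and activate the volume group
vgscan
vgchange -a y vg0
```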
When the new PV is configured, it is possible to proceed to adding it to the Volume Group or creating a new one from it. This VG can be used to create Logical Volumes as usual.
With DRBD, we can survive I/O errors on either node. DRBD's internal error handling can be configured to mask such errors and switch to diskless mode: all I/O operations are transparently redirected from the failed node to its peer, giving us time to restore the faulty disk subsystem.
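This behavior is configured in the 'disk' section of the resource; 'detach' is the standard DRBD policy for masking local I/O errors:

```
disk {
  on-io-error detach;   # drop the local disk and continue in diskless mode
}
```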
If we have a major system crash, we still have all of the data on the second node and can use it to restore or replace the failed system. A network failure can put us into a 'split brain' situation, where data differs between the hosts. This is dangerous, but DRBD has rather powerful mechanisms to deal with this kind of problem as well.