Recently, SwiftStack published an interesting overview of their approach to multi-regional OpenStack Swift Object Storage clusters. This approach is perfectly aligned with a design we have been working on with Webex for a geographically distributed Swift cluster with a reduced number of replicas (3+1 instead of 3+3, for example). I’d like to go over our approach and elaborate on the implementation plan and proposed changes to the Swift code.
State-of-the-art with OpenStack Swift
Let me start with a brief overview of the current Swift algorithms to make it clearer exactly what we’re doing to get the multi-region cluster working.
The standard Swift ring is a data structure that lets you divide storage devices into buckets or zones. The Essex Swift ring builder makes sure that no object replicas end up in the same zone.
Ring structure includes the following components:
- The list of devices: This includes all storage devices (disks) known to the ring. Each element of this list is a dictionary (aka “hash”), which includes ID, device name, zone ID, node IP, port, weight, and metadata.
- The partition assignment list: The array with the number of arrays equal to the number of replicas. Each element of the array is a device ID (from the list of devices) where partition replicas are identified by the location of the array index.
- The partition shift number: The number of bits from the object’s path MD5 sum that identifies the partition.
In the Folsom release, changes to the ring file format were introduced that drastically improve the processing efficiency and redefine the ring balancing algorithm. A strict condition that required distribution of replicas to different zones was replaced by a much more flexible algorithm that organizes zones, nodes, and devices into tiers.
In our proposal, the region is specified in a distinctive field in the devs dictionary.
The proxy server exposes Swift Public API to clients, performing basic operations on objects, containers, and accounts, including writing with a PUT request and reading with a GET request.
While serving PUT requests, the proxy server follows roughly this algorithm:
- Calculate the MD5 sum of the /account[/container[/object]] path string.
- Calculate the partition number by taking the first N bits of the MD5 sum.
- Select devices from the partitions assignment list that stores replicas of the calculated partition.
- Select the node IP address and port from the devices list for all devices found at step #3.
- Try to establish connections to all nodes on appropriate ports, but if at least half of nodes can’t be contacted, refuse the PUT request.
- Try to upload the object (or create an account or container) on all nodes that were connected successfully, but if at least half of uploads were not successful, refuse the PUT request.
- If the upload completed successfully on more then half of the nodes found in step #4, confirm the successful PUT request to the client.
While serving GET requests, the proxy server follows roughly this algorithm:
- Repeat steps 1-4 from the PUT request algorithm and identify the list of nodes that store object replicas.
- Shuffle the list of nodes and connect to the first.
- If a connection can’t be established, pop the next node from the list.
- If a connection is established, start streaming data to the client in response to the request.
Replication in Swift operates on partitions, not individual objects. The replicator process is started in periodic configurable intervals. By default, the interval is 30 seconds.
The replicator roughly follows this algorithm:
- Create a replication job. That is, scan all devices on the node, scan through all found devices and create a list of all partitions, and for each partition create a dictionary of replication actions.
’nodes’: [ replica_node1, replica_node2, … ],
- <path_to_partition> is a file system path to partition (/srv/node/<device>/objects/<partition>)
- [ replica_node1, replica_node2, … ] is a list of nodes that store partition replicas. This list is imported from the object ring.
- ‘delete’ is set to “true” if the number of partition replicas exceeds the configured number of replicas in the cluster.
- <integer> is a partition ID number.
- Process each partition in accordance with the replication job. That is:
- If the partition is marked for deletion, the replicator pushes every suffix subdir of the
job[‘path’]directory to all nodes from the job[‘nodes’] list, sends a REPLICATE request to each node holding a replica, and removes the directory job[‘path’].
- If the partition is not marked for deletion, the replicator calculates hashes of the contents of all subdirectories of job[‘path’] (i.e., the account/container databases and object files in the partition). The replicator issues a REPLICATE request to all replicas of the partition job[‘partition’] and receives hash maps of remote partition subdirs in response. It compares hash maps and uses rsync to push the changed suffix subdirs to remote nodes. Successful replication is verified by resending the REPLICATE request.
- If the partition is marked for deletion, the replicator pushes every suffix subdir of the
- If the replica is not accessible, the replicator uses the
get_more_nodemethod of the ring class. This method uses a sophisticated algorithm to determine the set of nodes that should be used for hand-off replication of the current partition. The algorithm identifies the zone the failed device belongs to, and selects a device from another zone for hand-off. If this device is not accessible either, another node from yet another zone is selected, and the cycle continues until the zones are depleted.
Proposed changes for OpenStack Swift
Introduce regions in the ring
We are proposing to add a region field to the devices list. This parameter must be used by the RingBuilder class when balancing the ring in the fashion described below. The region parameter represents an additional level of tiering, or a group of zones, so all the devices that belong to zones constituting a single region must belong to this region.
Alternatively, regions may be added to the ring as an additional structure—a dictionary with regions as keys and a list of zones as a value, for example:
|Key (region)||Value (zones list)|
Note that every zone must belong to only one region.
In this case, regions are used similarly to their previous uses, but the ring class has to include additional code to parse the region zones assignment dictionary and identify the region the particular device belongs to.
Default region zone assignment must assign all zones to a single default region to reproduce standard Swift behavior.
Tweak the RingBuilder balancing algorithm
The RingBuilder balancing algorithm must recognize the region parameter in the device list. The algorithm could be pluggable to allow different distributions of replicas. See the algorithm implementation proposed below.
Devices should be assigned to replicas of partition under the following conditions:
- No pair of devices assigned to a particular partition’s replicas can belong to a single tier, i.e., zone/node/device inside the region (standard Swift constraint).
- For N replicas and M regions (groups of zones), the number of replicas that goes to each region is equal to the integer part of the quotient of N/M. The remainder of replicas are added to a single region (called primary for this partition).
- A region cannot accommodate more replicas than the number of zones in the region.
For example, if N = 3 and M = 2, with this algorithm we’ll have a ring where one replica goes to every region (integer of 3/2 is 1), and the remaining one replica goes to one of two regions, selected randomly. The following scheme depicts variants of distributing replicas across regions in the example above.
A direct PUT from the proxy server to the storage node in a remote region is not that simple: We’re not going to have access to the internal cluster network from the outside in most cases. So, for the initial implementation, we assume that only local replicas are written on the PUT, and remote region replicas are created by the replication process.
In the default case, the number of replicas is three, and number of regions is one. This case should reproduce the standard Swift configuration and ring balance algorithm.
Get_more_nodes one more time
We propose changes to the get_more_nodes method of the Ring class that will recognize regions when selecting hand-off zones. The algorithm should sort candidates for hand-off zones so that nodes from the region that contained the lost replica are selected first. If no zones in the local region are accessible (e.g., the network connection between regions was cut), the algorithm will return the node that belongs to a zone from one of the foreign regions. The following two schemes describe the algorithm for two corner cases.
The regional proxy server
We propose making the proxy server aware of the region it belongs to. Basically, this should be as simple as adding a region parameter to the
[DEFAULT] section of the proxy server configuration file (
proxy-server.conf), for example:
region = san-jose
This parameter must be used by the proxy server for ring reading operations, and also while selecting nodes for serving GET requests. Our aim is to make the proxy server prefer reading from nodes from local zones (i.e., zones that belong to the same region as the proxy server).
This feature is referred in the SwiftStack article as proxy affinity.
The proxy server should not read from nodes that belong to a foreign region if a local replica is available to reduce the load on inter-region network links.
We then replace the shuffle operation at step #2 of the GET request handling algorithm (see above) with a procedure that will order nodes in a way that nodes belonging to the local region of the proxy server go first in the list. After such sorting, lists of local-region and foreign-regions nodes are shuffled independently, and then the list of foreign-region nodes is attached to the list of local-region ones.
Some final thoughts on OpenStack Swift replication
Replication between geographically distributed locations works for regions basically just as it does for a single-region cluster. However, this process can generate a huge number of REPLICATE requests between clusters over a WAN connection. This can pose a problem when the connection is relatively slow.
A simple workaround for this issue might be adding a counter to the replicator so partitions are pushed to remote region devices on every Nth replication run. A more sophisticated solution might include dedicated replicator gateways in peer regions.