Configuring a multi-region cluster for cloud object storage with OpenStack Swift
October 11, 2012
Recently, SwiftStack published an interesting overview of their approach to multi-regional OpenStack Swift Object Storage clusters. This approach is perfectly aligned with a design we have been working on with Webex for a geographically distributed Swift cluster with a reduced number of replicas (3+1 instead of 3+3, for example). I’d like to go over our approach and elaborate on the implementation plan and proposed changes to the Swift code.
State-of-the-art with OpenStack Swift
Let me start with a brief overview of the current Swift algorithms to make it clearer exactly what we’re doing to get the multi-region cluster working.
The standard Swift ring is a data structure that lets you divide storage devices into buckets or zones. The Essex Swift ring builder makes sure that no object replicas end up in the same zone.
The ring structure includes the following components:
In the Folsom release, changes to the ring file format were introduced that drastically improve the processing efficiency and redefine the ring balancing algorithm. A strict condition that required distribution of replicas to different zones was replaced by a much more flexible algorithm that organizes zones, nodes, and devices into tiers.
The ring balancer then tries to put replicas as far away from each other as possible; preferably to different zones, but if only one zone is available, then to different nodes, and if only one node is available, then to different devices on that node. This as-far-as-possible algorithm has the potential to support a geographically distributed cluster, as SwiftStack outlined in their blog. This can be achieved by adding another tier to the picture: a region tier. A region is essentially a group of zones sharing a location, whether it’s a rack or a data center.
In our proposal, the region is specified in a distinctive field in the devs dictionary.
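As a rough illustration of how a region field could fit into the tiered view of a device, here is a minimal sketch. The function name `tiers_for_device`, the `"default"` fallback, and the sample device values are assumptions made for this example; they are not taken from the Swift code base.

```python
def tiers_for_device(dev):
    """Return the nested tiers a device belongs to, widest tier first.

    Hypothetical helper: region -> zone -> node -> device.
    A device without a region falls back to a single default region.
    """
    region = dev.get("region", "default")
    node = "{}:{}".format(dev["ip"], dev["port"])
    return (
        (region,),
        (region, dev["zone"]),
        (region, dev["zone"], node),
        (region, dev["zone"], node, dev["id"]),
    )


# Illustrative entry of the devs list with the proposed region field.
device = {
    "id": 0,
    "region": "san-jose",
    "zone": 1,
    "ip": "10.0.0.1",
    "port": 6000,
    "device": "sdb1",
    "weight": 100.0,
}

for tier in tiers_for_device(device):
    print(tier)
```

Two devices are "far apart" when their tiers diverge early in this sequence, which is the property an as-far-as-possible balancer would try to maximize.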
The proxy server exposes the Swift public API to clients and performs basic operations on objects, containers, and accounts, including writing with a PUT request and reading with a GET request.
While serving PUT requests, the proxy server follows roughly this algorithm:
While serving GET requests, the proxy server follows roughly this algorithm:
Replication in Swift operates on partitions, not individual objects. The replicator process is started periodically at a configurable interval. By default, the interval is 30 seconds.
The replicator roughly follows this algorithm:
Proposed changes for OpenStack Swift
Introduce regions in the ring
We are proposing to add a region field to the devices list. This parameter must be used by the RingBuilder class when balancing the ring in the fashion described below. The region parameter represents an additional level of tiering, or a group of zones, so all the devices that belong to zones constituting a single region must belong to this region.
Alternatively, regions may be added to the ring as an additional structure—a dictionary with regions as keys and a list of zones as a value, for example:
Note that every zone must belong to only one region.
In this case, regions are used similarly to their previous uses, but the ring class has to include additional code to parse the region zones assignment dictionary and identify the region the particular device belongs to.
Default region zone assignment must assign all zones to a single default region to reproduce standard Swift behavior.
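The alternative region-to-zones dictionary and the default assignment could be handled along these lines. This is only a sketch: the helper names, the `"default"` region name, and the sample regions and zone numbers are hypothetical.

```python
DEFAULT_REGION = "default"  # assumed name for the single default region


def build_zone_to_region(region_zones):
    """Invert {region: [zones]} into {zone: region}.

    Raises ValueError if a zone is listed under more than one region,
    since every zone must belong to only one region.
    """
    zone_to_region = {}
    for region, zones in region_zones.items():
        for zone in zones:
            if zone in zone_to_region:
                raise ValueError(
                    "zone %r is assigned to both %r and %r"
                    % (zone, zone_to_region[zone], region)
                )
            zone_to_region[zone] = region
    return zone_to_region


def region_for_device(dev, zone_to_region):
    """Look up a device's region; unassigned zones use the default region."""
    return zone_to_region.get(dev["zone"], DEFAULT_REGION)


# Illustrative assignment: two regions with two zones each.
regions = {"san-jose": [1, 2], "region-2": [3, 4]}
mapping = build_zone_to_region(regions)
print(region_for_device({"id": 7, "zone": 3}, mapping))
```

With an empty dictionary every device resolves to the default region, which corresponds to the standard single-region behavior.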
Tweak the RingBuilder balancing algorithm
The RingBuilder balancing algorithm must recognize the region parameter in the device list. The algorithm could be pluggable to allow different distributions of replicas. See the algorithm implementation proposed below.
Devices should be assigned to replicas of a partition under the following conditions:
For example, if N = 3 replicas and M = 2 regions, this algorithm produces a ring where one replica goes to every region (the integer part of 3/2 is 1), and the one remaining replica goes to one of the two regions, selected randomly. The following scheme depicts variants of distributing replicas across regions in the example above.
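The arithmetic in this example can be sketched as follows. The function name and the region names are hypothetical, and the code covers only the per-region replica counts, not the selection of devices within a region.

```python
import random


def replicas_per_region(replicas, regions, rng=random):
    """Split N replicas across M regions.

    Every region gets the integer part of N / M; the remaining
    N mod M replicas go to distinct, randomly selected regions.
    """
    base, extra = divmod(replicas, len(regions))
    counts = {region: base for region in regions}
    for region in rng.sample(list(regions), extra):
        counts[region] += 1
    return counts


# N = 3, M = 2: one region ends up with two replicas, the other with one.
print(replicas_per_region(3, ["san-jose", "region-2"]))
```

With a single region, all replicas land in that region, so the default case keeps the standard replica count per cluster.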
A direct PUT from the proxy server to the storage node in a remote region is not that simple: We’re not going to have access to the internal cluster network from the outside in most cases. So, for the initial implementation, we assume that only local replicas are written on the PUT, and remote region replicas are created by the replication process.
In the default case, the number of replicas is three and the number of regions is one. This case should reproduce the standard Swift configuration and ring balancing algorithm.
region = san-jose
This parameter must be used by the proxy server for ring reading operations, and also while selecting nodes for serving GET requests. Our aim is to make the proxy server prefer reading from nodes from local zones (i.e., zones that belong to the same region as the proxy server).
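To show how such a setting might be read, here is a small sketch using Python's standard `configparser`. Placing the `region` option in the `[app:proxy-server]` section is an assumption of this example, as are the helper name and the `"default"` fallback.

```python
import configparser

# Illustrative configuration text; the region option is the proposed one.
SAMPLE_CONF = """
[app:proxy-server]
use = egg:swift#proxy
region = san-jose
"""


def read_proxy_region(conf_text, default="default"):
    """Return the proxy server's region, or a default if it is not set."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.get("app:proxy-server", "region", fallback=default)


print(read_proxy_region(SAMPLE_CONF))
```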
This feature is referred to in the SwiftStack article as proxy affinity.
To reduce the load on inter-region network links, the proxy server should not read from nodes that belong to a foreign region if a local replica is available.
We then replace the shuffle operation at step #2 of the GET request handling algorithm (see above) with a procedure that orders the nodes so that those belonging to the proxy server's local region come first in the list. The local-region and foreign-region nodes are shuffled independently, and then the list of foreign-region nodes is appended to the list of local-region nodes.
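The ordering procedure just described might look like the sketch below. The function name and the node dictionaries are illustrative, and the code assumes each node entry carries a `region` key.

```python
import random


def order_nodes_for_get(nodes, local_region, rng=random):
    """Order nodes so that local-region nodes are tried first.

    Local and foreign nodes are shuffled independently, then the
    foreign-region list is appended to the local-region list.
    """
    local = [n for n in nodes if n.get("region") == local_region]
    foreign = [n for n in nodes if n.get("region") != local_region]
    rng.shuffle(local)
    rng.shuffle(foreign)
    return local + foreign


nodes = [
    {"id": 0, "region": "region-2"},
    {"id": 1, "region": "san-jose"},
    {"id": 2, "region": "san-jose"},
]
for node in order_nodes_for_get(nodes, "san-jose"):
    print(node["id"], node["region"])
```

Shuffling within each group keeps the load spread across local replicas, while a foreign node is only reached after every local node has been tried.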
Some final thoughts on OpenStack Swift replication
Replication between geographically distributed locations works for regions basically just as it does for a single-region cluster. However, this process can generate a huge number of REPLICATE requests between clusters over a WAN connection. This can pose a problem when the connection is relatively slow.
A simple workaround for this issue might be adding a counter to the replicator so partitions are pushed to remote region devices on every Nth replication run. A more sophisticated solution might include dedicated replicator gateways in peer regions.
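The counter-based workaround could be sketched like this. The class and method names are hypothetical and do not correspond to the actual Swift replicator; the sketch only shows the decision of whether a given run should push to a remote region.

```python
class RemoteRegionThrottle:
    """Allow pushes to remote regions only on every Nth replication run."""

    def __init__(self, every_nth):
        if every_nth < 1:
            raise ValueError("every_nth must be at least 1")
        self.every_nth = every_nth
        self.run_count = 0

    def start_run(self):
        """Call once at the beginning of each replication run."""
        self.run_count += 1

    def should_push(self, local_region, target_region):
        """Local targets are always replicated; remote ones every Nth run."""
        if target_region == local_region:
            return True
        return self.run_count % self.every_nth == 0


throttle = RemoteRegionThrottle(every_nth=3)
for _ in range(6):
    throttle.start_run()
    print(throttle.run_count, throttle.should_push("san-jose", "region-2"))
```

Setting `every_nth` to 1 reproduces the unthrottled behavior, where remote devices are treated like local ones on every run.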