Recently, SwiftStack published an interesting overview of their approach to multi-regional OpenStack Swift Object Storage clusters. This approach is perfectly aligned with a design we have been working on with Webex for a geographically distributed Swift cluster with a reduced number of replicas (3+1 instead of 3+3, for example). I’d like to go over our approach and elaborate on the implementation plan and proposed changes to the Swift code.
Let me start with a brief overview of the current Swift algorithms to make it clearer exactly what we’re doing to get the multi-region cluster working.
The standard Swift ring is a data structure that lets you divide storage devices into buckets or zones. The Essex Swift ring builder makes sure that no object replicas end up in the same zone.
Ring structure includes the following components:
In the Folsom release, changes to the ring file format were introduced that drastically improve the processing efficiency and redefine the ring balancing algorithm. A strict condition that required distribution of replicas to different zones was replaced by a much more flexible algorithm that organizes zones, nodes, and devices into tiers.
In our proposal, the region is specified in a distinctive field in the devs dictionary.
The proxy server exposes Swift Public API to clients, performing basic operations on objects, containers, and accounts, including writing with a PUT request and reading with a GET request.
While serving PUT requests, the proxy server follows roughly this algorithm:
While serving GET requests, the proxy server follows roughly this algorithm:
Replication in Swift operates on partitions, not individual objects. The replicator process is started in periodic configurable intervals. By default, the interval is 30 seconds.
The replicator roughly follows this algorithm:
’nodes’: [ replica_node1, replica_node2, ... ],
job[‘path’]directory to all nodes from the job[‘nodes’] list, sends a REPLICATE request to each node holding a replica, and removes the directory job[‘path’].
get_more_nodemethod of the ring class. This method uses a sophisticated algorithm to determine the set of nodes that should be used for hand-off replication of the current partition. The algorithm identifies the zone the failed device belongs to, and selects a device from another zone for hand-off. If this device is not accessible either, another node from yet another zone is selected, and the cycle continues until the zones are depleted.
We are proposing to add a region field to the devices list. This parameter must be used by the RingBuilder class when balancing the ring in the fashion described below. The region parameter represents an additional level of tiering, or a group of zones, so all the devices that belong to zones constituting a single region must belong to this region.
Alternatively, regions may be added to the ring as an additional structure—a dictionary with regions as keys and a list of zones as a value, for example:
|Key (region)||Value (zones list)|
Note that every zone must belong to only one region.
In this case, regions are used similarly to their previous uses, but the ring class has to include additional code to parse the region zones assignment dictionary and identify the region the particular device belongs to.
Default region zone assignment must assign all zones to a single default region to reproduce standard Swift behavior.
The RingBuilder balancing algorithm must recognize the region parameter in the device list. The algorithm could be pluggable to allow different distributions of replicas. See the algorithm implementation proposed below.
Devices should be assigned to replicas of partition under the following conditions:
For example, if N = 3 and M = 2, with this algorithm we’ll have a ring where one replica goes to every region (integer of 3/2 is 1), and the remaining one replica goes to one of two regions, selected randomly. The following scheme depicts variants of distributing replicas across regions in the example above.
A direct PUT from the proxy server to the storage node in a remote region is not that simple: We’re not going to have access to the internal cluster network from the outside in most cases. So, for the initial implementation, we assume that only local replicas are written on the PUT, and remote region replicas are created by the replication process.
In the default case, the number of replicas is three, and number of regions is one. This case should reproduce the standard Swift configuration and ring balance algorithm.
Get_more_nodesone more time
We propose changes to the get_more_nodes method of the Ring class that will recognize regions when selecting hand-off zones. The algorithm should sort candidates for hand-off zones so that nodes from the region that contained the lost replica are selected first. If no zones in the local region are accessible (e.g., the network connection between regions was cut), the algorithm will return the node that belongs to a zone from one of the foreign regions. The following two schemes describe the algorithm for two corner cases.
We propose making the proxy server aware of the region it belongs to. Basically, this should be as simple as adding a region parameter to the
[DEFAULT] section of the proxy server configuration file (
proxy-server.conf), for example:
region = san-jose
This parameter must be used by the proxy server for ring reading operations, and also while selecting nodes for serving GET requests. Our aim is to make the proxy server prefer reading from nodes from local zones (i.e., zones that belong to the same region as the proxy server).
This feature is referred in the SwiftStack article as proxy affinity.
The proxy server should not read from nodes that belong to a foreign region if a local replica is available to reduce the load on inter-region network links.
We then replace the shuffle operation at step #2 of the GET request handling algorithm (see above) with a procedure that will order nodes in a way that nodes belonging to the local region of the proxy server go first in the list. After such sorting, lists of local-region and foreign-regions nodes are shuffled independently, and then the list of foreign-region nodes is attached to the list of local-region ones.
Replication between geographically distributed locations works for regions basically just as it does for a single-region cluster. However, this process can generate a huge number of REPLICATE requests between clusters over a WAN connection. This can pose a problem when the connection is relatively slow.
A simple workaround for this issue might be adding a counter to the replicator so partitions are pushed to remote region devices on every Nth replication run. A more sophisticated solution might include dedicated replicator gateways in peer regions.4 comments
[...] Recently, SwiftStack published an interesting overview of their approach to multi-regional OpenStack Swift Object Storage clusters. This approach is perfectly aligned with a design we have been wor… [...]December 26, 201217:19