Under the hood of Swift. The Ring
February 14, 2012
This is the first post in series that summarizes our analysis of Swift architecture. We’ve tried to highlight some points that are not clear enough in the official documentation. Our primary base was an in-depth look into the source code. The Ring is the vital part of Swift architecture. This half database, half configuration file keeps track of where all data resides in the cluster. For each possible path to any stored entity in the cluster, the Ring points to the particular device on the particular physical node.
There are three types of entities that Swift recognizes: accounts, containers and objects. Each type has the ring of its own, but all three rings are put up the same way. Swift services use the same source code to create and query all three rings. Two Swift classes are responsible for this tasks: RingBuilder and Ring respectively.
Ring data structure
Some device management can be performed using values in the list. First, for the removed devices, the ‘id’ value is set to ‘None’. Device IDs are generally not reused. Second, setting ‘weight’ to 0.0 disables the device temporarily, as no partitions will be assigned to that device.
Partitions assignment list
The ring takes a configurable number of bits from a path’s MD5 hash and converts it to the long integer number. This number is used as an index into the array(‘H’). This index points to the array element that designates an ID of the device to which the partition is mapped. Number of bits kept from the hash is known as the partition power, and 2 to the partition power indicates the partition count.
For a given partition number, each replica’s device will not be in the same zone as any other replica’s device. Zones can be used to group devices based on physical locations, power separations, network separations, or any other attribute that could make multiple replicas unavailable at the same time.
Partition Shift Value
account.ring.gz container.ring.gz object.ring.gz
There are no internal Swift mechanisms that can guarantee that the ring is consistent, i.e. gzip file is not corrupt and can be read. Swift services have no way to tell if all nodes have the same version of rings. Maintenance of ring files is administrator’s responsibility. These tasks can be automated by means external to the Swift itself, of course.
The Ring allows any Swift service to identify which Storage node to query for the particular storage entity. Method Ring.get_nodes(account, container=None, obj=None) is used for identification of target Storage node for the given path (/account[/container[/object]]). It returns the tuple of partition and dictionary of nodes. The partition is used for constructing the local path to object file or account/container database. Nodes dictionary elements have the same structure as the devices in list of devices (see above).
When the temporary builder file structure is created, administrator should add devices to the Ring. For each device, required values are zone number, IP address of the Storage node, port on which server is listening, device name (e.g. sdb1), optional device meta-data (e.g., model name, installation date or anything else) and device weight:
Device weight is used to distribute partitions between the devices. More the device weight, more partitions are going to be assigned to that device. Recommended initial approach is to use the same size devices across the cluster and set weight 100.0 to each device. For devices added later, weight should be proportional to the capacity. At this point, all devices that will initially be in the cluster, should be added to the Ring. Consistency of the builder file can be verified before creating actual Ring file:
In case of successful verification, the next step is to distribute partitions between devices and create actual Ring file. It is called ‘rebalance’ the Ring. This process is designed to move as few partitions as possible to minimize the data exchange between nodes, so it is important that all necessary changes to the Ring are made before rebalancing it:
The whole procedure must be repeated for all three rings: account, container and object. The resulting .ring.gz files should be pushed to all nodes in cluster. Builder files are also needed for the future changes to rings, so they should be backed up and kept in safe place. One of approaches is to put them to the Swift storage as ordinary objects.
Physical disk usage
The partition mapped to the storage device is actually a directory in structure under/srv/node/<dev_name>. The disk space used by this directory may vary from partition to partition, depending on size of objects that have been placed to this partition by mapping hash of object path to the Ring.
In conclusion it should be said that the Swift Ring is a beautiful structure, though it lacks a degree of automation and synchronization between nodes. I’m going to write about how to solve these problems in one of the following posts.