Large Objects in Cloud Storage
When using cloud storage such as OpenStack Swift or Amazon S3, at some point you might need to upload relatively large files (a couple of gigabytes or more). This creates difficulties for both the client and the server. From the client's point of view, uploading such a large file in one go is cumbersome, especially over a poor connection. From the storage point of view, handling very large objects is not trivial either.
For that reason, both Swift and Amazon S3 limit the maximum size of an object uploaded in a single request. The limit is 5 GB for Amazon S3 and also 5 GB for Swift by default (in Swift it's configurable). To upload larger files, one should use a special API feature that allows uploading a file in parts. In Amazon S3 it's called 'Multipart Upload'. In the Swift world it doesn't seem to have a special name; the documentation refers to it as 'Large Object Support'.
Let's see what this API looks like for both S3 and Swift.
Amazon S3 API
As mentioned above, the feature is called 'Multipart Upload'. It's a set of API calls dedicated to managing multipart uploads.
These calls are:
- Initiate Multipart Upload: This operation tells the service that you're about to start a multipart upload. It returns an upload ID, which is later used to identify the upload when uploading parts.
- Upload Part: This operation uploads a part. The user should specify the part number and the upload ID (obtained earlier from the Initiate Multipart Upload call). The part number can be from 1 to 10000, and it specifies the order in which the parts will be concatenated.
- List Parts: This shows a list of the parts already uploaded and associated with the specified upload ID.
- Complete Multipart Upload: This is called when all parts are uploaded, to concatenate them into the final object.
- Abort Multipart Upload: This operation terminates the multipart upload and also removes the parts that have been uploaded and associated with the specified upload ID.
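To make the lifecycle of these calls more concrete, here is a minimal in-memory sketch in Python. It is not the S3 API or any client library: the class ToyMultipartStore and its method names are hypothetical and only mimic the semantics described above (an upload ID, numbered parts, concatenation in part-number order, abort discarding the parts). As a simplification, completing an upload in this sketch simply uses every part uploaded so far.

```python
import uuid


class ToyMultipartStore:
    """Hypothetical in-memory model of the multipart upload lifecycle.

    It only mimics the semantics described in the text; it does not talk to S3.
    """

    MAX_PART_NUMBER = 10000  # part numbers range from 1 to 10000

    def __init__(self):
        self.uploads = {}  # upload ID -> (object key, {part number: bytes})
        self.objects = {}  # object key -> assembled bytes

    def initiate(self, key):
        """Start an upload and return the ID that identifies it."""
        upload_id = uuid.uuid4().hex
        self.uploads[upload_id] = (key, {})
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        """Store one part under its part number."""
        if not 1 <= part_number <= self.MAX_PART_NUMBER:
            raise ValueError("part number must be between 1 and 10000")
        _key, parts = self.uploads[upload_id]
        parts[part_number] = data

    def list_parts(self, upload_id):
        """Return the part numbers uploaded so far, in ascending order."""
        _key, parts = self.uploads[upload_id]
        return sorted(parts)

    def complete(self, upload_id):
        """Concatenate the parts in part-number order into the final object."""
        key, parts = self.uploads.pop(upload_id)
        self.objects[key] = b"".join(parts[n] for n in sorted(parts))
        return key

    def abort(self, upload_id):
        """Drop the upload together with all of its parts."""
        self.uploads.pop(upload_id)


store = ToyMultipartStore()
upload_id = store.initiate("backup.tar")
store.upload_part(upload_id, 2, b"world")   # parts may arrive out of order
store.upload_part(upload_id, 1, b"hello ")
store.complete(upload_id)
print(store.objects["backup.tar"])  # prints b'hello world'
```

Note that the order of the Upload Part calls doesn't matter here; only the part numbers decide how the final object is assembled.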
As you can see, this API is fairly extensive but straightforward. Let's see what it looks like in Swift.
Swift API
The Swift API for multipart upload is significantly more minimalistic. It doesn't differentiate object parts from objects themselves. To upload a large object to Swift, you upload each of its parts as a regular object using the Create/Upload Object call, with the only difference being that you name the parts like object/00001, object/00002, where 00001 and 00002 are part numbers and / is a separator. You may use any numbering and any separator you want. When all the parts are uploaded, you create a manifest by making a PUT request with an empty body and an X-Object-Manifest header whose value is set to the common prefix of the uploaded parts, including the container name. In our case it would be X-Object-Manifest: container/object/.
The native Swift client doesn't support multipart upload explicitly, but thanks to the simplicity of the interface, it's easy to implement on the user's side.
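The Swift flow described above can be sketched as a small Python function that only plans the requests and sends nothing. The function name plan_large_object_upload, the dict layout, and the paths (shown relative to the account URL) are assumptions made for illustration, not part of any Swift client. The part numbers are zero-padded so that the part names sort in the intended order.

```python
def plan_large_object_upload(container, name, data, segment_size):
    """Return the PUT requests for a segmented upload as plain dicts.

    Nothing is sent over the network; each dict just describes one request.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    prefix = f"{name}/"  # common prefix shared by all parts
    requests = []
    # One regular object upload per part: object/00001, object/00002, ...
    for index, offset in enumerate(range(0, len(data), segment_size), start=1):
        requests.append({
            "method": "PUT",
            "path": f"/{container}/{prefix}{index:05d}",
            "headers": {},
            "body": data[offset:offset + segment_size],
        })
    # The manifest: an empty body plus a header pointing at the common prefix.
    requests.append({
        "method": "PUT",
        "path": f"/{container}/{name}",
        "headers": {"X-Object-Manifest": f"{container}/{prefix}"},
        "body": b"",
    })
    return requests


for request in plan_large_object_upload("container", "object", b"abcdefghij", 4):
    print(request["method"], request["path"], request["headers"])
```

For the ten-byte payload and a part size of four bytes, the plan consists of three part uploads followed by the manifest request.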
State of the multi-API libraries:
The status is given as of libcloud 0.10.1 and jclouds 1.3. However, we have implemented multipart upload for Swift in jclouds, so it will be available in the 1.5 release. We have also implemented multipart upload support for Swift in libcloud, which made its way into version 0.10.1.
As you can see, the Amazon S3 API is more high-level, while the Swift API for large objects is pretty raw. Swift doesn't make a distinction between objects and object parts. This means it's the user's duty to take care of the parts. For example, you should make sure that the prefix in the manifest doesn't match other objects by mistake. If you want to delete an object, you have to remove its parts as well, and so on.
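Both pitfalls are easy to see on a made-up container listing. The following is only an illustrative sketch: the helper names names_under_prefix and deletion_plan and the object names are hypothetical, and the code works on a plain list of names rather than on a real Swift container.

```python
def names_under_prefix(listing, prefix):
    """Return the object names in a listing that start with the given prefix."""
    return sorted(name for name in listing if name.startswith(prefix))


def deletion_plan(listing, manifest_name, prefix):
    """Return the names to delete for one large object: its parts, then the manifest."""
    return names_under_prefix(listing, prefix) + [manifest_name]


# Hypothetical container contents: one manifest ("video"), its two parts,
# and two unrelated objects with similar names.
listing = ["video", "video/00001", "video/00002", "video-old/00001", "video2"]

# A prefix without the trailing separator also matches the unrelated objects.
print(names_under_prefix(listing, "video"))

# With the separator, only the intended parts match.
print(names_under_prefix(listing, "video/"))

# Deleting the large object means deleting the parts as well as the manifest.
print(deletion_plan(listing, "video", "video/"))
```

Choosing a prefix that ends with the separator is a simple way to keep the manifest from picking up neighbouring objects.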
It feels like the S3 API is designed to be more user-facing, while the Swift API is a good base for building more specific things on top of it.
Just for the record: Swift has the swift3 middleware, which adds support for the S3 API on top of the native Swift API. Unfortunately, it doesn't support multipart upload.
In the next post we will go more deeply into the usage of multipart from libcloud and jclouds, look at some code and discuss performance implications of using multipart uploads.