
Travel less, save more: introducing the OpenStack volume affinity filter

It is a common desire to have some storage space associated with an instance running in the cloud, and to have access to that storage be as fast as possible. One obvious way to achieve this is to place the instance on the same host to which the volume of interest physically belongs. What is not so obvious is how to achieve this objective with OpenStack.

For better or worse, OpenStack doesn’t provide a way to fine-tune this particular option out of the box. Fortunately, it’s easy to extend OpenStack to provide almost any option you can imagine. In this blog post, I am going to discuss how we have implemented just such an extension, what roadblocks have already been encountered, and what problems one may encounter when using it.

Let’s begin by narrowing our goal, focusing for now on how to place an instance on a specific host.

As you may know, nova-scheduler is responsible for placing instances on a particular host, so to achieve what we want, we need to somehow tweak its behavior.  We can do that using filters, which can affect the scheduler’s choice of host.  Those filters can in turn be affected by command-line options provided via the CLI client.  These command-line options specify characteristic features of the desired host – such as a given volume belonging to it.

Several built-in filters already exist (see OpenStack Docs for reference). If none of them suffice, it’s possible to implement your own. That’s the approach we will take in this post, with a discussion about implementing a filter to choose hosts containing specified volumes.

A few words about filters in OpenStack

The idea behind filtered scheduling is pretty simple: you specify characteristics a host should meet, and the scheduler selects a set of hosts that satisfy this specification. From there, an instance can be scheduled to use one of those hosts based on each host’s load, as well as some other properties. That last part of the process is called “weighting,” but it’s not important right now. Instead, we are interested in the filtering stage because, as you can guess from its name, that is where filters actually come into play.

It is common to have several filters available, and to use them simultaneously. It works like this: the filter scheduler applies each of the filters specified in the config to the set of available hosts, each time reducing the set to only those hosts which pass the current filter. Put simply, a filter’s only job is to decide whether a given host meets certain criteria or not. So let’s have a closer look at how exactly filters do that.
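
As a rough illustration of that reduction, here is a minimal, self-contained sketch. It does not use nova itself: HostState, MinRamFilter, NamePrefixFilter and apply_filters are hypothetical stand-ins invented for this example, and only the host_passes(host_state, filter_properties) shape mirrors the filters discussed in this post.

```python
class HostState:
    """Hypothetical stand-in for the scheduler's view of one host."""

    def __init__(self, host, free_ram_mb):
        self.host = host
        self.free_ram_mb = free_ram_mb


class MinRamFilter:
    """Toy filter: keep hosts with at least the requested free RAM."""

    def host_passes(self, host_state, filter_properties):
        return host_state.free_ram_mb >= filter_properties["requested_ram_mb"]


class NamePrefixFilter:
    """Toy filter: keep hosts whose name starts with a given prefix."""

    def host_passes(self, host_state, filter_properties):
        return host_state.host.startswith(filter_properties["prefix"])


def apply_filters(filters, hosts, filter_properties):
    """Run every filter in order; each one shrinks the candidate list."""
    for host_filter in filters:
        hosts = [h for h in hosts
                 if host_filter.host_passes(h, filter_properties)]
    return hosts


hosts = [HostState("rack1-a", 512), HostState("rack1-b", 4096),
         HostState("rack2-a", 8192)]
props = {"requested_ram_mb": 2048, "prefix": "rack1"}
survivors = apply_filters([MinRamFilter(), NamePrefixFilter()], hosts, props)
print([h.host for h in survivors])
```

Only the host that passes both toy filters remains in the final list; a host rejected by an earlier filter is never shown to a later one.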

Each active filter is an object of one of the filter classes, with at least one method, host_passes(), defined. The host_passes() method must accept the host state and the filter properties, and must return True or False. All filter classes must inherit from BaseHostFilter, defined in nova.scheduler.filters. When the scheduler is started, it imports all modules specified in the list of available filters. Then, when the user issues a command to boot an instance, the scheduler instantiates each of the filters and uses them successively to sieve out unfit hosts. It’s important to note that each filter instance exists only during the actual scheduling session.

When the filter object is in use, the scheduler calls its host_passes() method and passes the host state and filter properties to it.  The filter properties specify the criteria used to weed out unsuitable hosts. Based on the host properties and the filter properties, host_passes() returns either True or False.

Consider, for example, the RAM filter. It is a standard filter that ships with nova by default. Its structure is quite characteristic and gives you a good idea of how to build a filter:

class RamFilter(filters.BaseHostFilter):
    """Ram Filter with over subscription flag"""

    def host_passes(self, host_state, filter_properties):
        """Only return hosts with sufficient available RAM."""
        instance_type = filter_properties.get('instance_type')
        requested_ram = instance_type['memory_mb']
        free_ram_mb = host_state.free_ram_mb
        total_usable_ram_mb = host_state.total_usable_ram_mb

        memory_mb_limit = total_usable_ram_mb * FLAGS.ram_allocation_ratio
        used_ram_mb = total_usable_ram_mb - free_ram_mb
        usable_ram = memory_mb_limit - used_ram_mb
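        if not usable_ram >= requested_ram:
            # The message placeholders are filled from the local variables
            # (host_state, requested_ram, usable_ram).
            LOG.debug(_("%(host_state)s does not have %(requested_ram)s MB "
                    "usable ram, it only has %(usable_ram)s MB usable ram."),
                    locals())
            return False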

        # save oversubscription limit for compute node to test against:
        host_state.limits['memory_mb'] = memory_mb_limit
        return True

To know whether a host is suitable for an instance (i.e., a VM), the filter needs to know how much memory is available on the host at that moment, and how much is required by the instance. If the host has less usable memory than is required, host_passes() returns False, effectively rejecting it. Note that all the knowledge about the host to be tested resides in the host_state argument, and the knowledge needed to make the decision resides mainly in filter_properties. Constants which reflect some general scheduling strategy, such as ram_allocation_ratio, may be stored externally, but this is not a strict rule. Almost everything a filter needs to know to make the right decision can be transferred to it via a simple mechanism called “scheduler hints”.
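
To make the arithmetic in the RAM filter concrete, here is a small standalone sketch of the same calculation. The function name usable_ram_mb and the sample numbers are made up for illustration; only the formula is taken from the filter above.

```python
def usable_ram_mb(total_usable_ram_mb, free_ram_mb, ram_allocation_ratio):
    """Repeat the RAM filter's arithmetic outside of nova."""
    # Oversubscription limit: how much RAM may be promised in total.
    memory_mb_limit = total_usable_ram_mb * ram_allocation_ratio
    # RAM already consumed on the host.
    used_ram_mb = total_usable_ram_mb - free_ram_mb
    # What is still available under the oversubscription limit.
    return memory_mb_limit - used_ram_mb


# Illustrative numbers: an 8 GB host with 2 GB free and a ratio of 1.5.
usable = usable_ram_mb(8192, 2048, 1.5)
print(usable)          # 6144.0
print(usable >= 4096)  # True: a 4 GB request passes despite 2 GB free
```

With a ratio of 1.0 the same host would offer only its 2048 MB of free RAM, so the 4 GB request would be rejected.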

So what are you hinting at?

As it happens, scheduler hints are nothing more than a dictionary of key:value pairs. Whenever you issue the nova boot … command, the resulting request contains this dictionary. If you do nothing to populate it, it is empty and nothing happens. If, on the other hand, you decide to pass a hint, you can do it easily using the --hint switch, as in nova boot … --hint your_hint_name=desired_value. That’s it. The hints dictionary is no longer empty; it contains this pair. Now, if there is an extension that recognizes this hint, you have just passed it a value to take into consideration. If there isn’t, then nothing happens, which is the trivial case. Today we are interested in the case where somebody is waiting to receive a particular hint.

Obtaining hints is straightforward: you simply get everything stored under the scheduler_hints key in filter_properties, and then pick the property you’re looking for. The following snippet is largely self-explanatory:

        scheduler_hints = filter_properties['scheduler_hints']
        important_hint = scheduler_hints.get('important_hint', False)

In nova, the scheduler_hints key is always there, so you won’t get any surprises from it, but it is better to be cautious when trying to pick anything else, which is why the hint itself is read with get() and a default value. Now you have your hint and can process it however you need to.
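
If you prefer not to rely on the key being present at all, a slightly more defensive variant is sketched below. The helper name get_hint is hypothetical, and filter_properties is modeled here as a plain dictionary rather than the object nova actually builds.

```python
def get_hint(filter_properties, name, default=None):
    """Return a single scheduler hint, tolerating a missing or empty dict."""
    # "or {}" covers both a missing key and an explicit None value.
    scheduler_hints = filter_properties.get("scheduler_hints") or {}
    return scheduler_hints.get(name, default)


# Roughly what a request booted with "--hint important_hint=42" would carry.
with_hint = {"scheduler_hints": {"important_hint": "42"}}
without_hint = {"scheduler_hints": None}

print(get_hint(with_hint, "important_hint", False))
print(get_hint(without_hint, "important_hint", False))
```

The first call yields the hint value as a string, and the second falls back to the default instead of raising an exception.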

Approaching volume affinity with OpenStack

Armed with this knowledge, it’s quite easy to come up with a high-level design for a filter which lets you schedule instances on hosts containing specific volumes. Obviously, we need some way to identify the volume we want to use, and the volume id string is a natural choice, as it is unique inside a pool and won’t produce any ambiguity. From the volume id, we have to deduce the name of the host to which the volume belongs, and then compare this hostname with the hostnames of all hosts in the pool. Both tasks can and should be performed by the filter; to make it all work, we can use the hints mechanism to inform the filter about that specific volume.

We already know how to pass a value to the filter through the “hints” mechanism, so we can use it to pass the volume id, and reserve the same_host_volume_id hint name for this purpose. What is not so obvious is how to extract the host association data from this value. Unfortunately, there seems to be no simple and direct way to do it, so we need to address our question directly to the entity responsible for volumes: cinder.

To do this, we may use the appropriate API call to retrieve data associated with the provided volume id, and then extract the hostname from it, but today we will use a simpler approach. We use cinderclient’s ability to issue proper API calls and use the object it returns:

volume = cinder.cinderclient(context).volumes.get(volume_id)
vol_host = getattr(volume, 'os-vol-host-attr:host', None)

Please note: this approach only works in Grizzly and later releases, as an extension it relies upon first appeared in that release.
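
The getattr() call in that snippet is not just a stylistic choice: the attribute name contains hyphens and a colon, so it cannot be written with ordinary dot access. The sketch below shows the pattern on a fake object; FakeVolume and the host name compute-1 are invented for illustration and are not part of cinderclient.

```python
class FakeVolume:
    """Hypothetical stand-in for the volume object cinderclient returns."""


HOST_ATTR = "os-vol-host-attr:host"

# A volume that carries the extension attribute.
volume = FakeVolume()
setattr(volume, HOST_ATTR, "compute-1")

# A volume without it, as on a release that lacks the extension.
bare_volume = FakeVolume()

# The name is not a valid Python identifier, so getattr() is required;
# the None default avoids an AttributeError when the attribute is absent.
print(getattr(volume, HOST_ATTR, None))
print(getattr(bare_volume, HOST_ATTR, None))
```

A filter built on this lookup has to decide what to do when the result is None; rejecting every host and ignoring the hint are both defensible choices.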

From here on it’s a straight shot: for each candidate host, we compare vol_host with that host’s name and return True only when they are equal. Congratulate yourself: you’ve managed to create a fine-tuning extension for OpenStack, and now can delve into optimizing instance distribution over the network.
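
Putting the pieces together, the filter might look roughly like the sketch below. This is not the code from the package or the blueprint: it runs without nova or cinder, HostState is a stand-in, and the lookup function is injected through the constructor purely so the example can be executed (a real filter would call cinder directly). Passing every host when no hint is given is also an assumption.

```python
class HostState:
    """Hypothetical stand-in for the scheduler's host state object."""

    def __init__(self, host):
        self.host = host


class VolumeAffinityFilter:
    """Sketch: pass only the host that holds the hinted volume."""

    def __init__(self, lookup_volume_host):
        # lookup_volume_host(volume_id) -> host name or None.
        # Assumption: stands in for the cinderclient call shown earlier.
        self._lookup = lookup_volume_host

    def host_passes(self, host_state, filter_properties):
        hints = filter_properties.get("scheduler_hints") or {}
        volume_id = hints.get("same_host_volume_id")
        if not volume_id:
            # Assumption: without the hint the filter does not restrict.
            return True
        return self._lookup(volume_id) == host_state.host


# Fake "cinder": a dictionary mapping volume ids to host names.
volume_hosts = {"vol-1": "compute-2"}
affinity = VolumeAffinityFilter(volume_hosts.get)
props = {"scheduler_hints": {"same_host_volume_id": "vol-1"}}

hosts = [HostState("compute-1"), HostState("compute-2")]
print([h.host for h in hosts if affinity.host_passes(h, props)])
```

Note that this naive version performs one lookup per candidate host, which is exactly the cost discussed in the next section.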

For implementation details, you can try either the standalone package for Grizzly, or consider the implementation shown in our blueprint.

Can we do better?

Definitely; the proposed approach is not optimal. Currently the filter makes multiple calls to cinder per boot, which are expensive, and it performs excessive comparisons with hosts other than the one containing the desired volume. Both are minor problems for small-scale OpenStack clusters, but can cause latencies to accumulate in larger clouds. This is especially true for the multiple calls to cinder. To improve things we have to add some extra tweaks: a cache for the hostname, which will allow us to make only one cinder call per boot, and a flag which will allow us to skip the entire test once we’ve found a host that satisfies the search criteria.
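
One possible shape for those two tweaks is sketched below. It is only an illustration, not the planned implementation: the class and attribute names are hypothetical, the cinder lookup is replaced by an injected function that counts its calls, and the sketch assumes that a volume lives on exactly one host and that the filter object serves a single scheduling session, as described earlier.

```python
class HostState:
    """Hypothetical stand-in for the scheduler's host state object."""

    def __init__(self, host):
        self.host = host


class CachingVolumeAffinityFilter:
    """Sketch: one lookup per volume, early exit after the first match."""

    def __init__(self, lookup_volume_host):
        self._lookup = lookup_volume_host
        self._host_cache = {}  # volume id -> host name
        self._found = set()    # volume ids whose host has already passed

    def host_passes(self, host_state, filter_properties):
        hints = filter_properties.get("scheduler_hints") or {}
        volume_id = hints.get("same_host_volume_id")
        if not volume_id:
            return True
        if volume_id in self._found:
            # The flag: the only matching host was already seen.
            return False
        if volume_id not in self._host_cache:
            # The cache: the expensive call happens once per volume.
            self._host_cache[volume_id] = self._lookup(volume_id)
        if self._host_cache[volume_id] == host_state.host:
            self._found.add(volume_id)
            return True
        return False


calls = []

def fake_lookup(volume_id):
    """Fake cinder call that records how often it is invoked."""
    calls.append(volume_id)
    return "compute-2"


caching = CachingVolumeAffinityFilter(fake_lookup)
props = {"scheduler_hints": {"same_host_volume_id": "vol-1"}}
hosts = [HostState("compute-1"), HostState("compute-2"), HostState("compute-3")]
results = [caching.host_passes(h, props) for h in hosts]
print(results, len(calls))
```

Three hosts are tested here, yet the fake lookup runs only once, and the third host is rejected without any comparison.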

Harnessing the power of affinity is still a work in progress, and the current version is definitely not the final one. There is still plenty of room to perfect it.

A short afterword

The example we worked through here shows how to build a filter for the nova scheduler with a single distinctive feature separating it from other existing filters: it uses another major component’s API to get its job done. While this allows greater flexibility, the approach may be a serious performance hit, as the target component may be far removed from the scheduler. A possible solution to this problem is to merge all service schedulers into one entity having access to all cloud characteristics; at the moment, however, there seems to be no simple and direct way to do this.
