How to use GPU virtualization with Mirantis OpenStack for Kubernetes

Pavlo Shchelokovskyy - February 08, 2022

Many high-performance and deep learning workloads, such as virtual reality gaming and Artificial Intelligence/Machine Learning (AI/ML), benefit immensely from running on GPUs instead of general-purpose CPUs. GPUs are highly parallel by design, which gives them far greater computational throughput for this kind of work. To better serve these use cases, Mirantis OpenStack for Kubernetes now includes support for virtual GPUs in OpenStack Nova, enabling virtual machines to efficiently leverage GPU resources. This post explains what GPU virtualization is, how Nova approaches GPU virtualization, and how you can enable vGPU support in Mirantis OpenStack for Kubernetes.


What is GPU virtualization?

  • GPU - Graphics processing unit, general term for anything that facilitates the graphics output
  • pGPU - Physical GPU, actual physical extension card plugged into the computer
  • vGPU - Virtual GPU, an ephemeral slice of a pGPU that can be used as an independent GPU
  • MOSK - Mirantis OpenStack for Kubernetes
  • TryMOSK - Single node minimal installation of MOSK for evaluation

In virtualized environments, the most straightforward way to give an application running inside a virtual machine access to a host GPU is PCI passthrough, which dedicates the whole PCI device exclusively to a single VM. However, this is sub-optimal from a resource utilization and density perspective, as you can only run as many GPU-requiring instances as there are physical GPU PCI cards attached to the server. To get the full return on investment from such a setup, each instance would have to keep its dedicated GPU running at 100% capacity all of the time.


In recent years, however, a new type of technology has emerged that aims to serve the virtualization/cloud computing use case - Virtual GPU, or vGPU. As a natural extension of the SR-IOV approach that has already been implemented in NICs for many years, this technology allows you to slice one physical GPU (pGPU) into multiple virtual ones (vGPUs). Each vGPU is then allocated to a separate virtual machine or another application to consume, with all the gory details of resource sharing and isolation (e.g., GPU processing time and memory) hidden inside the pGPU.


To gain a better understanding of this implementation, please view the upstream OpenStack documentation, which describes the general concepts and current limitations of this feature in OpenStack Nova.


From the Linux OS point of view, the kernel supports vGPUs via Virtual Function I/O (VFIO) mediated devices.


When configured to do so, the OpenStack compute service (Nova) tracks available resources (pGPUs and vGPUs) and provides them to the Placement service as nested resource providers. When a need arises to create a vGPU instance, Nova creates a new mediated device (or re-uses a suitable existing one) and passes it directly to a QEMU instance via the libvirt XML domain definition.  The main prerequisite for this is a graphics card that can be used in vGPU mode, and appropriate drivers that enable and expose this functionality.
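To make the mediated-device mechanism more concrete, here is a minimal sketch of what creating an mdev by hand looks like through the kernel's sysfs interface, where writing a UUID into a supported type's `create` file asks the driver for a new device. This only illustrates the kernel interface and is not Nova's actual code; the function name `create_mdev` and the `MDEV_BUS_DIR` override are assumptions introduced for this example.

```shell
# Sketch: create a mediated device (mdev) by hand through sysfs.
# MDEV_BUS_DIR is a hypothetical override so the function can be tried
# against a scratch directory instead of the real sysfs tree.
MDEV_BUS_DIR="${MDEV_BUS_DIR:-/sys/class/mdev_bus}"

# create_mdev <pci-address> <mdev-type> [uuid]
# Writes a UUID into the type's "create" file and prints the UUID used.
create_mdev() {
    local pci_addr="$1" mdev_type="$2" uuid="${3:-}"
    local type_dir="$MDEV_BUS_DIR/$pci_addr/mdev_supported_types/$mdev_type"

    if [ ! -d "$type_dir" ]; then
        echo "type $mdev_type is not supported by $pci_addr" >&2
        return 1
    fi
    # Generate a random UUID when none is given (Linux exposes one here).
    if [ -z "$uuid" ]; then
        uuid="$(cat /proc/sys/kernel/random/uuid)"
    fi
    echo "$uuid" > "$type_dir/create" || return 1
    echo "$uuid"
}
```

On a real host this would be run as root, for example `create_mdev 0000:18:00.0 nvidia-222`. In normal operation you do not need to do this yourself, since Nova creates the devices on demand.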

Getting vGPU drivers

This is a somewhat gray area, covered by corporate enterprise policies and NDAs. With NVIDIA, for example, you need NVIDIA GRID drivers, which expose the VFIO mdev interface to the kernel. However, these drivers are not freely available - you need to either be an existing NVIDIA customer or apply for a trial license. The vGPU functionality also requires a specific license type, and you need to deploy and operate a separate license server in order to enforce the licenses. In the absence of a license, vGPU performance will be artificially degraded over time. For more information, please refer to the NVIDIA documentation.

Discovering your vGPU types

Before configuring the compute service, you must first discover the vGPU types supported by your system’s graphics cards. Assuming appropriate NVIDIA drivers are installed, enumerate the devices on the mdev bus to find the supported vGPU types:


ls /sys/class/mdev_bus/*/mdev_supported_types

Example output for a single pGPU card:


nvidia-222 nvidia-223 nvidia-224 nvidia-225 nvidia-226 nvidia-227 nvidia-228 nvidia-229 nvidia-230 nvidia-231 nvidia-232 nvidia-233 nvidia-234 nvidia-252 nvidia-319 nvidia-320 nvidia-321

In order to check the meaning of those types, you can display each type’s description. For a card with PCI address 0000:18:00.0 you can use the following command:


for d in /sys/class/mdev_bus/0000:18:00.0/mdev_supported_types/*; do echo "${d##*/}"; cat "$d/description"; done

Example output:


nvidia-222
num_heads=4, frl_config=45, framebuffer=1024M, max_resolution=5120x2880, max_instance=16
nvidia-223
num_heads=4, frl_config=45, framebuffer=2048M, max_resolution=5120x2880, max_instance=8
nvidia-224
num_heads=4, frl_config=45, framebuffer=2048M, max_resolution=5120x2880, max_instance=8
nvidia-225
num_heads=1, frl_config=60, framebuffer=1024M, max_resolution=1280x1024, max_instance=16
nvidia-226
num_heads=1, frl_config=60, framebuffer=2048M, max_resolution=1280x1024, max_instance=8
nvidia-227
num_heads=1, frl_config=60, framebuffer=4096M, max_resolution=1280x1024, max_instance=4
nvidia-228
num_heads=1, frl_config=60, framebuffer=8192M, max_resolution=1280x1024, max_instance=2
nvidia-229
num_heads=1, frl_config=60, framebuffer=16384M, max_resolution=1280x1024, max_instance=1
nvidia-230
num_heads=4, frl_config=60, framebuffer=1024M, max_resolution=5120x2880, max_instance=16
nvidia-231
num_heads=4, frl_config=60, framebuffer=2048M, max_resolution=7680x4320, max_instance=8
nvidia-232
num_heads=4, frl_config=60, framebuffer=4096M, max_resolution=7680x4320, max_instance=4
nvidia-233
num_heads=4, frl_config=60, framebuffer=8192M, max_resolution=7680x4320, max_instance=2
nvidia-234
num_heads=4, frl_config=60, framebuffer=16384M, max_resolution=7680x4320, max_instance=1
nvidia-252
num_heads=4, frl_config=45, framebuffer=1024M, max_resolution=5120x2880, max_instance=16
nvidia-319
num_heads=1, frl_config=60, framebuffer=4096M, max_resolution=4096x2160, max_instance=4
nvidia-320
num_heads=1, frl_config=60, framebuffer=8192M, max_resolution=4096x2160, max_instance=2
nvidia-321
num_heads=1, frl_config=60, framebuffer=16384M, max_resolution=4096x2160, max_instance=1


Consult the NVIDIA documentation for the meaning of these parameters, so you can choose the appropriate type for your workloads. One of the most important parameters for Nova is max_instance, which describes how many vGPUs of the given type can be created on this particular pGPU. Also note a current driver limitation: the driver cannot create vGPUs of different types on the same pGPU.
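When comparing types, it can help to see max_instance next to the number of vGPUs that can still be created. The sketch below prints both for every type of one pGPU. It assumes the description text has the NVIDIA format shown above and reads `available_instances`, a standard mdev sysfs attribute; the function name `list_vgpu_capacity` and the `MDEV_BUS_DIR` override are hypothetical.

```shell
# Sketch: summarize vGPU capacity per supported type for one pGPU.
# MDEV_BUS_DIR is a hypothetical override for trying this outside sysfs.
MDEV_BUS_DIR="${MDEV_BUS_DIR:-/sys/class/mdev_bus}"

# list_vgpu_capacity <pci-address>
# Prints one line per type: "<type> max_instance=<n> available=<m>".
list_vgpu_capacity() {
    local pci_addr="$1" type_dir max_inst avail
    for type_dir in "$MDEV_BUS_DIR/$pci_addr/mdev_supported_types"/*; do
        [ -d "$type_dir" ] || continue
        # max_instance is parsed from the vendor-specific description text.
        max_inst="$(sed -n 's/.*max_instance=\([0-9][0-9]*\).*/\1/p' "$type_dir/description")"
        # available_instances tells how many more mdevs of this type fit.
        avail="$(cat "$type_dir/available_instances")"
        echo "${type_dir##*/} max_instance=${max_inst:-unknown} available=$avail"
    done
}
```

For the card above you would call it as `list_vgpu_capacity 0000:18:00.0`.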

Enabling vGPU support in Mirantis OpenStack for Kubernetes

The examples in this section were run in the following environment:

  • AMI: Ubuntu Server 18.04 LTS (HVM)
  • Instance type: g4dn.metal
  • OpenStack Victoria deployed by TryMOSK

For a minimal working case, you only need to tell Nova which vGPU type to create on each host.

Below is an example configuration for Nova to create vGPUs of the type nvidia-222:

kind: OpenStackDeployment
...
spec:
  services:
    compute:
      nova:
        values:
          conf:
            nova:
              devices:
                enabled_vgpu_types: nvidia-222

This vGPU type will be applied to all compute services, so make sure you verify that this vGPU type is supported by all pGPUs on all compute hosts.  Apply these changes and wait for pods of the nova-compute daemonset to restart.  Afterwards, you can check the resource providers in the placement service to validate that the vGPU is discovered and ready for consumption.
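One way to do that check from the command line is sketched below. It assumes the `openstack` client with the placement plugin (osc-placement) is installed and configured with admin credentials; the function name `show_vgpu_inventory` is made up for this example.

```shell
# Sketch: print "<provider name> <total VGPU>" for every resource
# provider that reports a VGPU inventory in the Placement service.
show_vgpu_inventory() {
    openstack resource provider list -f value -c uuid -c name |
    while read -r rp_uuid rp_name; do
        # stdin is redirected so the inner call cannot eat the outer list.
        openstack resource provider inventory list "$rp_uuid" \
            -f value -c resource_class -c total < /dev/null |
        while read -r res_class total; do
            if [ "$res_class" = "VGPU" ]; then
                echo "$rp_name $total"
            fi
        done
    done
}
```

Each pGPU should show up as its own nested provider, with a total that matches the max_instance of the configured vGPU type.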


Now create or modify an appropriate flavor in Nova that will declare a need for one unit of the VGPU resource class:


openstack flavor set vgpu_1 --property "resources:VGPU=1"
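If the flavor does not exist yet, it has to be created first. A minimal sketch follows; the CPU, RAM, and disk sizes are placeholders rather than recommendations, and the function name `ensure_vgpu_flavor` is hypothetical.

```shell
# Sketch: make sure a flavor exists and requests exactly one vGPU.
# ensure_vgpu_flavor <flavor-name>
ensure_vgpu_flavor() {
    local flavor="$1"
    # Create the flavor only when it is missing; the sizes are placeholders.
    if ! openstack flavor show "$flavor" > /dev/null 2>&1; then
        openstack flavor create --vcpus 2 --ram 4096 --disk 20 "$flavor" || return 1
    fi
    # Declare the need for one unit of the VGPU resource class.
    openstack flavor set "$flavor" --property "resources:VGPU=1"
}
```

Calling `ensure_vgpu_flavor vgpu_1` leaves you with the same flavor as the command above.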

This flavor can be used to create an instance as usual. You can also observe the placement of instances and their resource consumption of pGPU with NVIDIA tools. 


Examples (running on compute host):

root@ip-172-31-12-209:~# nvidia-smi vgpu
Thu Sep 16 16:46:30 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63                 Driver Version: 470.63                    |
|---------------------------------+------------------------------+------------+
| GPU  Name                       | Bus-Id                       | GPU-Util   |
|      vGPU ID     Name           | VM ID     VM Name            | vGPU-Util  |
|=================================+==============================+============|
|   0  Tesla T4                   | 00000000:18:00.0             |   0%       |
+---------------------------------+------------------------------+------------+
|   1  Tesla T4                   | 00000000:19:00.0             |   0%       |
+---------------------------------+------------------------------+------------+
|   2  Tesla T4                   | 00000000:35:00.0             |   0%       |
+---------------------------------+------------------------------+------------+
|   3  Tesla T4                   | 00000000:36:00.0             |   0%       |
+---------------------------------+------------------------------+------------+
|   4  Tesla T4                   | 00000000:E7:00.0             |   0%       |
+---------------------------------+------------------------------+------------+
|   5  Tesla T4                   | 00000000:E8:00.0             |   0%       |
+---------------------------------+------------------------------+------------+
|   6  Tesla T4                   | 00000000:F4:00.0             |   0%       |
|      3251634254  GRID T4-16C    | dcae...  instance-00000004   |   0%       |
+---------------------------------+------------------------------+------------+
|   7  Tesla T4                   | 00000000:F5:00.0             |   0%       |
+---------------------------------+------------------------------+------------+

Enabling support for multiple vGPU types

The steps to enable several types of vGPUs are as follows:


  1. List the PCI addresses for available pGPUs on the host
  2. Choose which vGPU types you want to assign to which pGPUs (remember, a single pGPU can only spawn vGPU instances of the same type at a time)
  3. Configure OpenStackDeployment as in the following example, in which we enable 2 vGPU types, and from 8 available pGPUs we assign half to one vGPU type and half to the other:

     spec:
       services:
         compute:
           nova:
             values:
               conf:
                 nova:
                   devices:
                     enabled_vgpu_types: nvidia-319,nvidia-320
                   vgpu_nvidia-319:
                     device_addresses: 0000:18:00.0,0000:19:00.0,0000:35:00.0,0000:36:00.0
                   vgpu_nvidia-320:
                     device_addresses: 0000:e7:00.0,0000:e8:00.0,0000:f4:00.0,0000:f5:00.0

  4. Apply changes to the OpenStackDeployment resource, and wait for compute pods to restart
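Collecting the PCI addresses and splitting them between two types can be scripted. The sketch below lists the pGPUs found on the mdev bus, gives the first half to one type and the rest to the other, and prints the result in nova.conf form so it can be copied into the OpenStackDeployment values. The function name `print_vgpu_groups` and the `MDEV_BUS_DIR` override are assumptions made for this example, and the half-and-half split is just one possible policy.

```shell
# Sketch: split the pGPUs of one host between two vGPU types.
# MDEV_BUS_DIR is a hypothetical override for trying this outside sysfs.
MDEV_BUS_DIR="${MDEV_BUS_DIR:-/sys/class/mdev_bus}"

# print_vgpu_groups <type-a> <type-b>
print_vgpu_groups() {
    local type_a="$1" type_b="$2" dev total=0 i=0 list_a="" list_b=""
    # First pass: count the pGPUs registered on the mdev bus.
    for dev in "$MDEV_BUS_DIR"/*; do
        [ -e "$dev" ] || continue
        total=$((total + 1))
    done
    # Second pass: the first half goes to type-a, the rest to type-b.
    for dev in "$MDEV_BUS_DIR"/*; do
        [ -e "$dev" ] || continue
        if [ "$i" -lt $(( (total + 1) / 2 )) ]; then
            list_a="${list_a:+$list_a,}${dev##*/}"
        else
            list_b="${list_b:+$list_b,}${dev##*/}"
        fi
        i=$((i + 1))
    done
    printf '[devices]\nenabled_vgpu_types = %s,%s\n' "$type_a" "$type_b"
    printf '[vgpu_%s]\ndevice_addresses = %s\n' "$type_a" "$list_a"
    printf '[vgpu_%s]\ndevice_addresses = %s\n' "$type_b" "$list_b"
}
```

Running `print_vgpu_groups nvidia-319 nvidia-320` on the demo host should give the same grouping as the configuration above, provided both types are supported by every pGPU involved.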

Note: This is a very simple example, since we are running the TryMOSK demo, which provides only one compute host. For real-world deployments, the required configuration will be much more complex, since the PCI addresses will most likely differ from node to node, so you will have to resort to node overrides for every single compute node.


This is why we generally advise configuring only one vGPU type per compute node. If different vGPU types are needed in the cloud, split the compute hosts into groups, with each group having the same vGPU type.


Changing the vGPU type

Be advised that changing the vGPU type is not an easy task. Remember that a single pGPU can only support one type of vGPU at a time. So if there’s an instance using the old vGPU type, you cannot spin up an instance with a new vGPU type on the same pGPU.

Removing orphan mdevs after changing the vGPU type

During provisioning of instances with vGPU, Nova creates a new mediated device if there aren’t any existing mediated devices free (i.e., not attached to instances), and then passes this device to libvirt/QEMU (in the libvirt domain XML). However, Nova never deletes any mediated devices - Nova only includes code for creating mdevs, not for deleting them.


While this probably speeds things up during provisioning, it poses a problem when changing the enabled vGPU types. A pGPU only allows you to expose a single vGPU type at a time, and it seems to consider the vGPU type as unchangeable as long as any mdev device of this type has been created. Thus, if even an unused mdev of the previous vGPU type exists, Nova cannot allocate a new type of vGPU from this particular pGPU. What’s more, Nova is not even aware of this limitation until it tries to create a new type of vGPU, leaving the success rate at the mercy of scheduling choices and how many such mdevs were created but not allocated to instances.


The problem manifests itself as NoValidHost errors from Nova when you try to boot an instance with a vGPU after the vGPU type was changed in Nova, even when vGPU slots appear available.


Additionally, when the new vGPU type has a different max_instance value than the previous one, you can detect this situation by checking the corresponding resource provider in the Placement service: after the type change, its capacity still shows the old maximum (calculated from the previous vGPU type) instead of the one corresponding to the new vGPU type.
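This capacity check can be scripted as a rough sketch, assuming the `openstack` client with the placement plugin (osc-placement) is available and you already know the UUID of the pGPU's resource provider. The function name `vgpu_capacity_matches` is hypothetical.

```shell
# Sketch: compare the VGPU total reported by Placement with the
# max_instance value expected for the newly configured vGPU type.
# vgpu_capacity_matches <provider-uuid> <expected-total>
vgpu_capacity_matches() {
    local rp_uuid="$1" expected="$2" actual
    actual="$(openstack resource provider inventory show "$rp_uuid" VGPU \
        -f value -c total)" || return 2
    if [ "$actual" = "$expected" ]; then
        echo "OK: VGPU total is $actual"
    else
        echo "STALE: VGPU total is $actual, expected $expected"
        return 1
    fi
}
```

A STALE result after a type change suggests that mdevs of the previous type still exist on that pGPU.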

A workaround to this situation is to apply the following sequence:

  1. Disable nova-compute on the affected host
  2. Find all instances on the affected node that use mdevs, for example by looking for mdev devices in the dumped libvirt domain XMLs (you can also write a Python script that does the same via libvirt, similar to how Nova itself does it).
  3. Compare with the list of created mdevs and find mdevs created but not attached to any instance
    ls /sys/bus/mdev/devices/
  4. Remove those orphan mdevs that are not attached to any instance
    echo 1 > /sys/bus/mdev/devices/$mdev_UUID/remove
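Steps 2 and 3 can be sketched in shell as follows. The sketch assumes the domain XMLs have already been dumped into one directory (for example with `virsh dumpxml`, wherever libvirt is reachable in your deployment) and uses a simple text match on `<address uuid='...'/>`, which is a heuristic rather than a real XML parser. The function names and the `MDEV_DEVICES_DIR` override are made up for this example.

```shell
# Sketch: find mdevs that exist in sysfs but are not used by any domain.
# MDEV_DEVICES_DIR is a hypothetical override for trying this outside sysfs.
MDEV_DEVICES_DIR="${MDEV_DEVICES_DIR:-/sys/bus/mdev/devices}"

# used_mdevs <xml-dir>
# Prints the UUIDs of mdevs referenced by the dumped domain XMLs.
used_mdevs() {
    cat "$1"/*.xml 2>/dev/null |
        sed -n "s/.*<address uuid=['\"]\([0-9a-fA-F-]*\)['\"].*/\1/p" |
        sort -u
}

# orphan_mdevs <xml-dir>
# Prints the UUIDs of mdevs that no dumped domain refers to.
orphan_mdevs() {
    local xml_dir="$1" used_file all_file
    used_file="$(mktemp)"
    all_file="$(mktemp)"
    used_mdevs "$xml_dir" > "$used_file"
    ls "$MDEV_DEVICES_DIR" | sort -u > "$all_file"
    # comm -23 keeps the lines that appear only in the first sorted file.
    comm -23 "$all_file" "$used_file"
    rm -f "$used_file" "$all_file"
}
```

Review the list printed by `orphan_mdevs` before removing anything, then apply the remove command from step 4 to each UUID.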

Conclusion

GPU resources are an important part of many compute workloads for modern data center applications - and with Mirantis OpenStack for Kubernetes you can take advantage of these resources in a virtualized environment.   If you would like to experience the power that this functionality and others within Mirantis OpenStack for Kubernetes can bring to your infrastructure, we welcome you to take it for a spin with the free trial.

Pavlo Shchelokovskyy

Pavlo Shchelokovskyy is a Principal Software Engineer at Mirantis.
