Mirantis OpenStack 7.0: NFVI Deployment Guide -- Huge pages

Evgeniy Korekin - January 25, 2016

Memory addressing on contemporary computers is done in terms of blocks of contiguous virtual memory addresses known as pages. Historically, memory pages on x86 systems have had a fixed size of 4 kilobytes, but today this parameter is configurable to some degree: the x86_32 architecture, for example, supports 4 KB and 4 MB pages, while the x86_64 architecture supports pages of 4 KB, 2 MB, and, more recently, 1 GB.

Pages larger than the default size are referred to as "huge pages" or "large pages" (the terms are frequently capitalized). We’ll call them "huge pages" in this document.

Processes work with virtual memory addresses. Each time a process accesses memory, the virtual address has to be translated to a physical one by looking it up in the page table, a special memory area maintained by the kernel where virtual-to-physical mappings are stored. A hardware cache on the CPU, called the translation lookaside buffer (TLB), is used to speed up these lookups.

The TLB can typically store only a small fraction of the virtual-to-physical page mappings. By increasing the memory page size we reduce the total number of pages that need to be addressed, thus increasing the TLB hit rate. This can lead to significant performance gains when a process does many memory operations. Also, the page table itself may require a significant amount of memory when it needs to store many references to small memory pages. In extreme cases, memory savings from using huge pages may amount to several gigabytes. (For example, see http://kevinclosson.net/2009/07/28/quantifying-hugepages-memory-savings-with-oracle-database-11g.)
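As a rough illustration of the numbers involved, the following sketch counts how many pages are needed to map 64 GiB of memory at each x86_64 page size, and how much memory the corresponding last-level page table entries would occupy at 8 bytes per entry. The function names `pages_for` and `pte_kib` are hypothetical, and the calculation ignores the upper levels of the page table, so treat the result as an order-of-magnitude estimate rather than a measurement.

```shell
#!/bin/sh
# pages_for MEM_GIB PAGE_KIB
# Number of pages of PAGE_KIB KiB needed to map MEM_GIB GiB.
pages_for() {
    echo $(( $1 * 1024 * 1024 / $2 ))
}

# pte_kib MEM_GIB PAGE_KIB
# KiB occupied by last-level page table entries (8 bytes each on x86_64),
# rounded down to a whole KiB.
pte_kib() {
    echo $(( $(pages_for "$1" "$2") * 8 / 1024 ))
}

# Compare the three x86_64 page sizes: 4 KiB, 2 MiB, 1 GiB.
for page_kib in 4 2048 1048576; do
    echo "page=${page_kib}KiB pages=$(pages_for 64 "$page_kib") entries_kib=$(pte_kib 64 "$page_kib")"
done
```

With 4 KiB pages this comes to 16777216 pages and 128 MiB of entries for a single mapping. Page tables are kept per process, so many processes mapping the same large shared region multiply that figure.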

On the other hand, when the page size is large but a process doesn’t use all the page memory, unused memory is effectively lost as it cannot be used by other processes. So there is usually a tradeoff between performance and more efficient memory utilization.

In the case of virtualization, a second level of address translation (from guest physical addresses to host physical addresses) causes additional overhead. Using huge pages on the host OS lets us greatly reduce this overhead.

It’s preferable to give a virtual machine with NFV workloads exclusive access to a predetermined amount of memory. No other process can use that memory anyway, so there is no tradeoff in using huge pages. Huge pages are thus the natural option for NFV workloads.

For more information on page tables and the translation process, see https://en.wikipedia.org/wiki/Page_table

General recommendations on using huge pages on OpenStack

There are two ways to use huge pages on Linux in general:

Explicit - an application is enabled to use huge pages by changing its source code
Implicit - via automatic aggregation of default-sized pages to huge pages by the transparent huge pages (THP) mechanism in the kernel

THP is turned on by default in MOS 7.0, but explicit huge pages potentially provide greater performance gains if an application supports them.

Although we tend to think of the hypervisor as KVM, KVM is really just the kernel module; the actual hypervisor is QEMU. That means that QEMU performance is crucial for NFV. Fortunately, QEMU supports explicit usage of huge pages via hugetlbfs, so we don’t really need THP here. Moreover, THP can lead to side effects with unpredictable results, sometimes lowering performance instead of raising it.

Also be aware that when the kernel needs to swap out a THP, the aggregate huge page is first split into standard 4 KB pages. Explicit huge pages are never swapped to disk, which is perfectly fine for typical NFV workloads.

Huge pages can generally be reserved at boot or at runtime (though 1 GB huge pages can only be allocated at boot). Memory gets fragmented on a running system, so the kernel may not be able to reserve as many contiguous memory blocks at runtime as it can at boot.

For general NFV workloads we recommend using dedicated compute nodes with the major part of their memory reserved as explicit huge pages at boot time. NFV workload instances should be configured to use huge pages. We also recommend disabling THP on these compute nodes. The preferred huge page size depends on the needs of the specific workload. Generally, 1 GB pages can be slightly faster, but 2 MB huge pages provide more granularity.

For more information on explicit huge pages, see:

Summary in the Debian Wiki: https://wiki.debian.org/Hugepages
Good general introductory article: http://linuxgazette.net/155/krishnakumar.html
Series of in-depth articles starting with http://lwn.net/Articles/374424/

For more information on THP, see:

General introduction: https://lwn.net/Articles/423584/
Articles on THP performance impact:
https://blogs.oracle.com/linuxkernel/entry/performance_impact_of_transparent_huge
https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge
https://en.wikipedia.org/wiki/Second_Level_Address_Translation
http://developerblog.redhat.com/2014/03/10/examining-huge-pages-or-transparent-huge-pages-performance/

Huge pages and physical topology

All contemporary multiprocessor x86_64 systems have a non-uniform memory access (NUMA) architecture. NUMA-related settings will be described in the following sections of this guide, but there are some subtle characteristics of NUMA that affect huge page allocation on multi-CPU hosts that you should be aware of when configuring OpenStack.

As a rule, some amount of memory is reserved in the lower range of the memory address space. This memory is used for memory-mapped I/O, and it is usually reserved on the first NUMA cell (the one corresponding to the first CPU) before huge pages are allocated. When allocating huge pages, however, the kernel tries to spread them evenly across all NUMA cells. If there’s not enough contiguous memory in one of the NUMA cells, the kernel will try to compensate by allocating more memory on the remaining cells. As a result, when the amount of memory used by huge pages is close to the total amount of free memory, you end up with an uneven huge page distribution across NUMA cells. This is more likely to happen when using 1 GB pages.

Here is an example from a host with 64 gigabytes of memory and two CPUs:

      # grep "Memory.*reserved" /var/log/dmesg

      [    0.000000] Memory: 65843012K/67001792K available (7396K kernel code, 1146K rwdata, 3416K rodata, 1336K init, 1448K bss, 1158780K reserved)

We can see that the kernel reserves more than 1 GB of memory.

Now, if we try to reserve sixty 1 GB pages, the result will be:

     # grep . /sys/devices/system/node/node*/hugepages/hugepages*kB/nr_hugepages
     /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages:29
     /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages:31

This might lead to negative consequences. For example, suppose we use a VM flavor that requires 30 GB of memory in one NUMA cell (or 60 GB in two). One might think that the number of huge pages on this host is enough to run two instances with 30 GB of memory each, or one two-cell instance with 60 GB. In reality, only one 30 GB instance will start: the other will be one 1 GB page short. A 60 GB, two-cell instance will fail to start altogether with this distribution of huge pages, because Nova will look for a physical host with two NUMA cells having 30 GB of memory each, and one of the cells has insufficient memory.
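The arithmetic behind this example can be sketched as follows. This is a simplified model, not Nova's actual placement logic: it assumes every instance must fit entirely inside one NUMA cell and looks only at huge page counts. The function name `fit_single_cell` is hypothetical.

```shell
#!/bin/sh
# fit_single_cell PAGES_PER_INSTANCE NODE0_PAGES [NODE1_PAGES ...]
# Prints how many single-cell instances fit, given the number of
# huge pages available on each NUMA cell.
fit_single_cell() {
    need=$1
    shift
    total=0
    for free in "$@"; do
        # Integer division: a cell one page short contributes nothing.
        total=$(( total + free / need ))
    done
    echo "$total"
}

# Distribution from the example above: 29 pages on node0, 31 on node1.
fit_single_cell 30 29 31
# The even split one might have expected.
fit_single_cell 30 30 30
```

The first call prints 1 and the second prints 2, which matches the behavior described above: the same 60 pages host either one or two 30 GB instances depending on how they are spread.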

You may want to use an option such as 'Socket Interleave Below 4GB' or similar if your BIOS supports it to avoid this situation. This option maps lower address space evenly between the NUMA cells, in effect splitting reserved memory between NUMA nodes.

In conclusion, you should always test to verify the real allocation of huge pages and plan accordingly, based on the results.

Enabling huge pages on MOS 7.0

To enable huge pages you need to configure every compute node where you plan to run instances that will use them. You also need to configure nova aggregates and flavors before launching huge-page-backed instances.

Compute hosts configuration

Below we provide an example of how to configure huge pages on one of the compute nodes. All the commands in this section should be run on the compute nodes that will handle huge pages workloads.

We will only describe steps required for boot time configuration. For information on runtime huge pages allocation, please refer to kernel documentation (https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt).

Check that your compute node supports huge pages:

# grep -m1 "pse\|pdpe1gb" /proc/cpuinfo

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid

The pse and pdpe1gb flags in the output indicate that the hardware supports ‘standard’ huge pages (2 or 4 megabytes, depending on hardware architecture) and 1 GB huge pages, respectively.
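Note that a plain substring match on "pse" also matches the unrelated pse36 flag. A minimal sketch of a stricter check, which matches whole flag names only, might look like this. The function name `hugepage_support` is hypothetical, and it expects the contents of /proc/cpuinfo on standard input.

```shell
#!/bin/sh
# hugepage_support: reads /proc/cpuinfo text on stdin and reports
# whether the pse and pdpe1gb flags are present as whole words.
# Usage: hugepage_support < /proc/cpuinfo
hugepage_support() {
    # Take the flag list of the first CPU only.
    flags=$(grep -m1 '^flags' | cut -d: -f2)
    for f in pse pdpe1gb; do
        # Pad with spaces so that "pse" does not match "pse36".
        case " $flags " in
            *" $f "*) echo "$f: yes" ;;
            *)        echo "$f: no" ;;
        esac
    done
}
```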

Upgrade QEMU to 2.4 to use huge pages (see the Appendix A1 “Installing qemu 2.4”).
Add huge pages allocation parameters to the list of kernel arguments in /etc/default/grub. Note that the examples below also disable Transparent Huge Pages, because we are relying on explicit huge pages, which are never swapped. Add the following to the end of /etc/default/grub:
```
GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX hugepagesz=<size of hugepages> hugepages=<number of hugepages> transparent_hugepage=never"
```
Note that <size of hugepages> is either 2M or 1G.

You can also use both sizes simultaneously:
```
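GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX hugepagesz=2M hugepages=<number of 2M hugepages> hugepagesz=1G hugepages=<number of 1G hugepages> transparent_hugepage=never"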
```
In the following example we preallocate 30000 2 MB pages:
```
GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX hugepagesz=2M hugepages=30000 transparent_hugepage=never"
```
Caution: be careful when deciding on the number of huge pages to reserve. You should leave enough memory for host OS processes (including memory for Ceph processes if your compute shares the Ceph OSD role) or risk unpredictable results.
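One way to arrive at a starting value is to subtract a host reserve from the total memory and divide by the page size. The sketch below does just that. The function name `hugepages_to_reserve` and the 8 GiB reserve used in the example are assumptions for illustration only; the right reserve depends on what else runs on the node.

```shell
#!/bin/sh
# hugepages_to_reserve TOTAL_MIB HOST_RESERVE_MIB PAGE_KIB
# Number of huge pages that fit after leaving HOST_RESERVE_MIB
# for the host OS and its services (rounded down).
hugepages_to_reserve() {
    echo $(( ($1 - $2) * 1024 / $3 ))
}

# A 64 GiB host (65536 MiB) with a hypothetical 8 GiB (8192 MiB) reserve:
hugepages_to_reserve 65536 8192 2048      # 2 MB pages
hugepages_to_reserve 65536 8192 1048576   # 1 GB pages
```

For these inputs the function prints 28672 and 56. The result is only an upper bound to test against, since the kernel may not be able to satisfy the full request.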

Note: You can’t allocate different amounts of memory to each NUMA cell via kernel parameters. If you need to do so, you have to use the command line or startup scripts. Here is an example in which we allocate ten 1 GB pages on the first NUMA cell and thirty on the second one:
```
echo 10 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
echo 30 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
```
Change the value of KVM_HUGEPAGES in /etc/default/qemu-kvm from 0 to 1 to make QEMU aware of huge pages:
```
KVM_HUGEPAGES=1
```
Update the bootloader and reboot for these parameters to take effect:
```
# update-grub
# reboot
```

After rebooting, don’t forget to verify that the pages are reserved according to the settings specified:

# grep Huge /proc/meminfo

AnonHugePages:         0 kB
HugePages_Total:   30000
HugePages_Free:    30000
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

# grep . /sys/devices/system/node/node*/hugepages/hugepages*kB/nr_hugepages

/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:15000
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:15000
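This check is easy to script. The following sketch compares the HugePages_Total value against the number you requested on the kernel command line. The function name `check_hugepages` is hypothetical, and it reads the contents of /proc/meminfo from standard input.

```shell
#!/bin/sh
# check_hugepages EXPECTED
# Reads /proc/meminfo text on stdin and compares HugePages_Total
# with EXPECTED. Returns non-zero on a mismatch.
# Usage: check_hugepages 30000 < /proc/meminfo
check_hugepages() {
    expected=$1
    actual=$(awk '/^HugePages_Total:/ { print $2 }')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $actual huge pages reserved"
    else
        echo "MISMATCH: expected $expected, found ${actual:-none}"
        return 1
    fi
}
```

A non-zero return value makes the function usable in a startup script that should refuse to continue when the reservation came up short.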

Nova configuration

To use huge pages, you need to launch instances whose flavor has the extra specification hw:mem_page_size.

By default, nothing prevents normal instances, whose flavors don’t have this extra spec, from starting on compute nodes with reserved huge pages. To avoid this situation, you’ll need to create nova aggregates for compute nodes with and without huge pages, create a new flavor for huge pages-enabled instances, update all the other flavors with an aggregate extra spec, and reconfigure the nova scheduler service to check extra specs when scheduling instances. Follow the steps below:

From the command line, create aggregates for compute nodes with and without huge pages:

# nova aggregate-create hpgs-aggr
# nova aggregate-set-metadata hpgs-aggr hpgs=true
# nova aggregate-create normal-aggr
# nova aggregate-set-metadata normal-aggr hpgs=false

Add one or more hosts to them:

# nova aggregate-add-host hpgs-aggr node-9.domain.tld
# nova aggregate-add-host normal-aggr node-10.domain.tld

Create a new flavor for instances with huge pages:

# nova flavor-create m1.small.hpgs auto 2000 20 2
# nova flavor-key m1.small.hpgs set hw:mem_page_size=2048
# nova flavor-key m1.small.hpgs set aggregate_instance_extra_specs:hpgs=true

Update all other flavors so they will start only on hosts without huge pages support:

# openstack flavor list -f csv|grep -v hpgs|cut -f1 -d,| tail -n +2| \
xargs -I% -n 1 nova flavor-key % \
set aggregate_instance_extra_specs:hpgs=false

On every controller add the value AggregateInstanceExtraSpecsFilter to the scheduler_default_filters parameter in /etc/nova/nova.conf:

scheduler_default_filters=RetryFilter,AvailabilityZoneFilter,RamFilter,DiskFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,AggregateInstanceExtraSpecsFilter
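If you script this change, it helps to make it idempotent so that running it twice does not add the filter twice. A minimal sketch of the string handling, assuming a non-empty, comma-separated filter list, is shown below. The function name `add_filter` is hypothetical, and the sketch deliberately leaves out reading and writing nova.conf itself.

```shell
#!/bin/sh
# add_filter CURRENT_LIST FILTER
# Prints CURRENT_LIST with FILTER appended, unless it is already present.
# Assumes CURRENT_LIST is a non-empty comma-separated list.
add_filter() {
    # Wrap in commas so only whole filter names match.
    case ",$1," in
        *",$2,"*) echo "$1" ;;
        *)        echo "$1,$2" ;;
    esac
}

add_filter "RetryFilter,RamFilter" AggregateInstanceExtraSpecsFilter
```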

Restart nova scheduler service on all controllers:
```
# restart nova-scheduler
```

Using huge pages on MOS 7.0

Now that OpenStack is configured for huge pages, you're ready to use it as follows:

Create an instance with the huge pages flavor:
# nova boot --image TestVM --nic net-id=`openstack network show net04 -f value | head -n1` --flavor m1.small.hpgs hpgs-test

Verify that the instance has been successfully created:

# nova list --name hpgs-test
+--------------------------------------+-----------+--------+------------+-------------+----------------------+
| ID                                   | Name      | Status | Task State | Power State | Networks             |
+--------------------------------------+-----------+--------+------------+-------------+----------------------+
| 593d461e-3ef2-46cc-a88d-5f147eb2a14e | hpgs-test | ACTIVE | -          | Running     | net04=192.168.111.15 |
+--------------------------------------+-----------+--------+------------+-------------+----------------------+

If the status is ‘ERROR’, check the log files for lines containing this instance ID. The easiest way to do that is to run the following command on the Fuel Master node:

# grep -Ri <Instance ID> /var/log/docker-logs/remote/node-*

If you encounter the error:

libvirtError: internal error: process exited while connecting to monitor: os_mem_prealloc: failed to preallocate pages

… it means there is not enough free memory available inside one NUMA cell to satisfy instance requirements. Check that the VM’s NUMA topology fits inside the host’s.

This error:

libvirtError: unsupported configuration: Per-node memory binding is not supported with this QEMU

… means that you are using QEMU 2.0 packages. You need to upgrade QEMU to 2.4; see Appendix A1 for instructions on how to upgrade QEMU packages.

Verify that the instance uses huge pages (all commands below should be run from a controller). Locate the part of the instance configuration that is relevant to huge pages:
```
# hypervisor=`nova show hpgs-test | grep OS-EXT-SRV-ATTR:host | cut -d\| -f3`
# instance=`nova show hpgs-test | grep OS-EXT-SRV-ATTR:instance_name | cut -d\| -f3`
# ssh $hypervisor virsh dumpxml $instance | awk '/memoryBacking/ {p=1}; p; /\/numatune/ {p=0}'
```
```
<memoryBacking>
  <hugepages>
    <page size='2048' unit='KiB' nodeset='0'/>
  </hugepages>
</memoryBacking>
<vcpu placement='static'>2</vcpu>
<cputune>
  <shares>2048</shares>
  <vcpupin vcpu='0' cpuset='0-5,12-17'/>
  <vcpupin vcpu='1' cpuset='0-5,12-17'/>
  <emulatorpin cpuset='0-5,12-17'/>
</cputune>
<numatune>
  <memory mode='strict' nodeset='0'/>
  <memnode cellid='0' mode='strict' nodeset='0'/>
</numatune>
```
The ‘memoryBacking’ section should show that this instance’s memory is backed by huge pages. You may also see that the ‘cputune’ section reveals so-called ‘pinning’ of this instance’s vCPUs. This means the instance will only run on physical CPU cores that have direct access to this instance’s memory and comes as a bonus from hypervisor awareness of the host physical topology. We will discuss instance CPU pinning in the next section.

You may also look at the QEMU process arguments and make sure they contain relevant options, such as:
```
# ssh $hypervisor pgrep -af $instance | grep -Po "memory[^\s]+"
```
```
memory-backend-file,prealloc=yes,mem-path=/run/hugepages/kvm/libvirt/qemu,size=2000M,id=ram-node0,host-nodes=0,policy=bind
```
… or directly examine the kernel huge pages stats:
```
# ssh $hypervisor "grep huge /proc/\`pgrep -of $instance\`/numa_maps"
```
```
2aaaaac00000 bind:0 file=/run/hugepages/kvm/libvirt/qemu/qemu_back_mem._objects_ram-node0.VveFxP\040(deleted) huge anon=1000 dirty=1000 N0=1000
```
We can see that the instance uses 1000 huge pages (since this flavor’s memory is 2000 MB and we are using 2048 KB huge pages).
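When an instance spans several mappings or NUMA cells, adding up the per-node counters by hand gets tedious. The sketch below sums the `N<node>=<pages>` fields of all huge page mappings in numa_maps output. The function name `count_huge_pages` is hypothetical, and the input is expected on standard input in the format shown above.

```shell
#!/bin/sh
# count_huge_pages: reads numa_maps text on stdin and prints the total
# number of pages in mappings flagged "huge", summed over all NUMA nodes.
count_huge_pages() {
    awk '/ huge / {
        for (i = 1; i <= NF; i++)
            # Per-node page counts look like N0=1000, N1=512, ...
            if ($i ~ /^N[0-9]+=[0-9]+$/) {
                split($i, kv, "=")
                total += kv[2]
            }
    } END { print total + 0 }'
}

# Expected page count for a 2000 MB flavor backed by 2048 KB pages.
echo $(( 2000 * 1024 / 2048 ))
```

The final line prints 1000, the value to compare against the output of `count_huge_pages` when it is fed the numa_maps content of the QEMU process.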

Note: It’s possible to use more than one NUMA host cell for a single instance with the flavor key hw:numa_nodes, but you should be aware that multi-cell instances may show worse performance than single-cell instances when the processes inside them aren’t aware of their NUMA topology. See more on this subject in the section about NUMA CPU pinning.

Some useful commands

Here are some commands for obtaining huge pages-related diagnostics.

To obtain information about the hardware Translation Lookaside Buffer (run ‘apt-get install cpuid’ beforehand):

     # cpuid -1 | awk '/^   \w/ { p=0 } /TLB information/ { p=1; } p;'
           cache and TLB information (2):
           0x63: data TLB: 1G pages, 4-way, 4 entries
           0x03: data TLB: 4K pages, 4-way, 64 entries
           0x76: instruction TLB: 2M/4M pages, fully, 8 entries
           0xff: cache data is in CPUID 4
           0xb5: instruction TLB: 4K, 8-way, 64 entries
           0xf0: 64 byte prefetching
           0xc1: L2 TLB: 4K/2M pages, 8-way, 1024 entries

To show how much memory is used for Page Tables:

     # grep PageTables /proc/meminfo

     PageTables:      1244880 kB

To show current huge pages statistics:

     # grep Huge /proc/meminfo
             AnonHugePages:    606208 kB
             HugePages_Total:   15000
             HugePages_Free:    15000
             HugePages_Rsvd:        0
             HugePages_Surp:        0
             Hugepagesize:       2048 kB

To show huge pages distribution between NUMA nodes:

     # grep . /sys/devices/system/node/node*/hugepages/hugepages*kB/nr_hugepages
     /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages:29
     /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:15845
     /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages:31
     /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:15899
