Made of Bugs

It's software. It's made of bugs.

Disable Transparent Hugepages

tl;dr  🔗︎

“Transparent Hugepages” is a Linux kernel feature intended to improve performance by making more efficient use of your processor’s memory-mapping hardware. It is enabled ("enabled=always") by default in most Linux distributions.

Transparent Hugepages gives some applications a small performance improvement (~ 10% at best, 0-3% more typically), but can cause significant performance problems, or even apparent memory leaks at worst.

To avoid these problems, you should set enabled=madvise on your servers by running

echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

and setting transparent_hugepage=madvise on your kernel command line (e.g. in /etc/default/grub).

This change will allow applications that are optimized for transparent hugepages to obtain the performance benefits, and prevent the associated problems otherwise.

Read on for more details.

What are transparent hugepages?  🔗︎

What are hugepages?  🔗︎

For decades now, processors and operating systems have collaborated to use virtual memory to provide a layer of indirection between memory as seen by applications (the “virtual address space”), and the underlying physical memory of the hardware. This indirection protects applications from each other, and enables a whole host of powerful features.

x86 processors, like many others, implement virtual memory by a page table scheme that stores the mapping as a large table in memory 1. Traditionally, on x86 processors, each table entry controls the mapping of a single 4KB “page” of memory.

While these page tables are themselves stored in memory, the processor caches a subset of the page table entries in a cache on the processor itself, called the TLB. A look through the output of cpuid(1) on my laptop reveals that its lowest-level TLB contains 64 entries for 4KB data pages. 64*4KB is only a quarter-megabyte, much smaller than the working memory of most useful applications in 2017. This size mismatch means that applications accessing large amounts of memory may regularly “miss” the TLB, requiring expensive fetches from main memory just to locate their data in memory

Primarily in an effort to improve TLB efficiency, therefore, x86 (and other) processors have long supported creating “huge pages”, in which a single page-table entry maps a larger segment of address space to physical memory. Depending on how the OS configures it, most recent chips can map 2MB, 4MB, or even 1GB pages. Using large pages means more data fits into the TLB, which means better performance for certain workloads.

What are transparent hugepages?  🔗︎

The existence of multiple flavors of page table management means that the operating system needs to determine how to map address space to physical memory. Because application memory management interfaces (like mmap(2)) have historically been based on the smallest 4KB pages, the kernel must always support mapping data in 4KB increments. The simplest and most flexible (in terms of supported memory layouts) solution, therefore, is to just always use 4KB pages, and not benefit from hugepages for application memory mappings. And for a long time this has been the strategy adopted by the general-purpose memory management code in the kernel.

For applications (such as certain databases or scientific computing programs) that are known to require large amounts of memory and be performance-sensitive, the kernel introduced the hugetlbfs feature, which allows administrators to explicitly configure certain applications to use hugepages.

Transparent Hugepages (“THP” for short), as the name suggests, intended to bring hugepage support automatically to applications, without requiring custom configuration. Transparent hugepage support works by scanning memory mappings in the background (via the “khugepaged” kernel thread), attempting to find or create (by moving memory around) contiguous 2MB ranges of 4KB mappings, that can be replaced with a single hugepage.

What goes wrong?  🔗︎

When transparent hugepage support works well, it can garner up to about a 10% performance improvement on certain benchmarks. However, it also comes with at least two serious failure modes:

Memory Leaks  🔗︎

THP attempts to create 2MB mappings. However, it’s overly greedy in doing so, and too unwilling to break them back up if necessary. If an application maps a large range but only touches the first few bytes, it would traditionally consume only a single 4KB page of physical memory. With THP enabled, khugepaged can come and extend that 4KB page into a 2MB page, effectively bloating memory usage by 512x (An example reproducer on this bug report actually demonstrates the 512x worst case!).

This behavior isn’t hypothetical; Go’s GC had to include an explicit workaround for it, and Digital Ocean documented their woes with Redis, THP, and the jemalloc allocator.

Pauses and CPU usage  🔗︎

In steady-state usage by applications with fairly static memory allocation, the work done by khugepaged is minimal. However, on certain workloads that involve aggressive memory remapping or short-lived processes, khugepaged can end up doing huge amounts of work to merge and/or split memory regions, which ends up being entirely short-lived and useless. This manifests as excessive CPU usage, and can also manifest as long pauses, as the kernel is forced to break up a 2MB page back into 4KB pages before performing what would otherwise have been a fast operation on a single page.

Several applications have seen 30% performance degradations or worse with THP enabled, for these reasons.

So what now?  🔗︎

The THP authors were aware of the potential downsides of transparent hugepages (although, with hindsight, we might argue that they underestimated them). They therefore opted to make the behavior configurable via the /sys/kernel/mm/transparent_hugepage/enabled sysfs file.

Even more importantly, they implemented an “opt-in” mode for transparent hugepages. With the madvise setting in /sys/kernel/mm/transparent_hugepage/enabled, khugepaged will leave memory alone by default, but applications can use the madvise system call to specifically request THP behavior for selected ranges of memory.

Since – for the most part – only a few specialized applications receive substantial benefits from hugepage support, this option gives us the best of both worlds. The applications of those applications can opt-in using madvise, and the rest of us can remain free from the undesirable side-effects of transparent hugepages.

Thus, I recommend that all users set their transparent hugepage setting to madvise, as described in the tl;dr section at the top. I also hope to persuade the major distributions to disable them by default, to save numerous more administrators and operators and developers from having to rediscover these failure modes for themselves.

Postscript  🔗︎

This post is also available in <a href=" rel=“nofollow”>translation into Russian

  1. It’s actually a tree structure of varying depth, but it’s equivalent to a large sparse table. ↩︎