Introduction
This week, I ran into a particularly subtle issue that took me several days to find and fix. Although the fix was simple, diagnosing the problem was difficult. This post includes a short tutorial on how to use Intel's AEP DIMMs, as well as descriptions of some issues that I encountered while using them.
I'm on a system with some Intel® Optane™ Persistent Memory DIMMs, which provide large capacities of non-volatile memory. From here on out, I'll refer to them as "AEP", short for "Apache Pass", the codename of the hardware under the hood. The three most common ways of making this memory available to the operating system are:
- 2LM mode, enabled in the BIOS, which uses a number of DDR4 DIMMs as a cache for the AEP DIMMs.
- Filesystems, in which you first enable 1LM mode, then install a filesystem onto the AEP DIMMs to take advantage of their persistence.
- NUMA mode, in which you enable 1LM mode, then bind the AEP DIMMs’ memory regions to a driver which makes them available as a NUMA node.
For my research, I’m particularly interested in the third option, since most of my tools assume that memory tiers will be accessible as NUMA nodes, and onlining a set of AEP DIMMs as a NUMA node would allow me to use those tools without modification.
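Before changing anything, it can help to take stock of how the machine currently looks. A minimal sketch, assuming `ipmctl` and `numactl` are already installed:

```
# Inventory the AEP DIMMs and how they're currently provisioned.
ipmctl show -dimm
ipmctl show -memoryresources

# See which NUMA nodes the OS exposes right now (only the DDR nodes, at this point).
numactl --hardware
```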
Preparation
In general, these are the steps to prepare to go into NUMA mode (a consolidated sketch follows the list):
- Ensure you've got a new enough kernel. It needs to be at least version `5.1.0-rc4`.
- Change the AEP to `1LM` mode in the BIOS.
- Use `ipmctl` to put each individual AEP DIMM into AppDirect mode. To do this to all AEP on socket 0, run `ipmctl create -goal -socket 0 PersistentMemoryType=AppDirect`, then reboot.
- Now the DIMMs that you chose in the previous command will show up as one distinct memory region in `ipmctl`. However, you need `ndctl` and `daxctl` to bind this region to the appropriate kernel driver: `git clone https://github.com/pmem/ndctl.git`
- Build the project.
- Migrate the device model (`daxctl migrate-device-model`). This writes to the file `/etc/modprobe.d/daxctl.conf`. If it doesn't, then you didn't globally install `daxctl` to your system (it hardcodes some paths, so running this command from the source directory doesn't work). Reboot again.
- View the memory regions that you created with `ipmctl`, using `ndctl`: `ndctl list -R`
- Create a namespace consisting of that region: `ndctl create-namespace --region region0 -m dax`
- Ensure you don't have `dax_pmem_compat` loaded: `lsmod | grep "dax_pmem_compat"`
- Ensure you do have `kmem` loaded: `lsmod | grep kmem`
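Pulled together, the provisioning and verification steps look roughly like this. It's a sketch rather than a turnkey script: it assumes socket 0, a single resulting region named `region0`, globally installed `ipmctl`/`ndctl`/`daxctl`, and it glosses over the reboots that have to happen in between.

```
# Provision all AEP on socket 0 as AppDirect (reboot afterwards for it to take effect).
ipmctl create -goal -socket 0 PersistentMemoryType=AppDirect

# After the reboot, the region should be visible to ndctl.
ndctl list -R

# Carve a DAX namespace out of that region.
ndctl create-namespace --region region0 -m dax

# dax_pmem_compat must not be loaded, and kmem must be.
lsmod | grep dax_pmem_compat && echo "dax_pmem_compat is loaded; unload it first"
lsmod | grep -q kmem && echo "kmem is loaded"
```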
The Manual Way
The next step is where I got confused. Initially, my version of the `daxctl` command didn't include the subcommands needed for what comes next; that is, from here on out, you had to manually bind the namespace to the proper kernel driver. To do this manually (again, for a single memory region):
- Find the device that you want to bind. Examples include `dax0.0`, `dax1.0`, etc. This will list the namespaces, and within them should be the device that each is associated with: `ndctl list -N`
- Unbind from the `device_dax` driver: `echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind`
- Bind to the `kmem` driver: `echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id`
Once that's done, check `numastat -m` to see if you've got a newly-created NUMA node. If it shows the capacity that you expect, you're done. A consolidated sketch of these manual steps follows.
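Here the device name `dax0.0` is only an example; substitute whatever `ndctl list -N` reports, and run the writes as root.

```
DEV=dax0.0    # the device reported by `ndctl list -N`

# Detach the device from device_dax...
echo "$DEV" | sudo tee /sys/bus/dax/drivers/device_dax/unbind

# ...and hand it to kmem, which exposes its capacity as a NUMA node.
echo "$DEV" | sudo tee /sys/bus/dax/drivers/kmem/new_id

# Check whether a new node shows up with the expected capacity.
numastat -m
```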
The Automatic Way
However, I once encountered a situation in which this wasn't the case. Upon checking `numastat`, the node had a size of 0 bytes. Internet searches didn't seem to give any hints as to why. As it turns out, getting the newest version of `daxctl` (which includes the `online-memory`, `offline-memory`, and `reconfigure-device` commands) fixed my issue, but with a caveat.
Using a new enough `daxctl`, the last three steps can be replaced by a simpler:
`daxctl reconfigure-device --mode=system-ram all`
This will unbind the device from the old driver and rebind it to the new driver. Crucially, though, it also onlines the memory regions, which is what I was missing when I encountered the 0-size NUMA node issue. Using this new method, the node now shows up with the appropriate capacity.
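If you'd rather reconfigure one device at a time rather than `all`, the per-device form looks like this (again, `dax0.0` is just an example name):

```
# Rebind dax0.0 to kmem and online its memory blocks in one step.
daxctl reconfigure-device --mode=system-ram dax0.0

# Confirm the mode, then check the new node's capacity.
daxctl list
numastat -m
```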
More Problems
On my system, I have two NUMA nodes of DDR, 0 and 1, each of which is 96GB. Node 2 is the AEP on socket 0, while node 3 is the AEP on socket 1. So, to bind an application to the memory on socket 1 (while preferring DDR and spilling onto AEP), I do:
`numactl --preferred=1 numactl --membind=1,3 --cpunodebind=1 ./a.out`
For applications that use less than approximately 200GB of RAM, this works just fine. However, upon scaling them to use more than 200GB, I start encountering issues: the OOM killer is suddenly killing my process, and if I set `vm.overcommit_memory` to `2` (thus forcing `malloc` to return `NULL` rather than allocating more memory than is available), the applications fail to allocate memory above ~200GB.
Searching for this issue doesn’t return many results, either, and I can’t seem to figure out why those allocations are failing. If I ignore the DDR node, binding only to the AEP, all allocations succeed, and I can scale my application to use a peak RSS of more than 700GB. However, immediately upon preferring the DDR and spilling onto one of the AEP nodes, the kernel OOMs upon reaching around 200GB, despite there being nearly 600GB of free memory available on node 3.
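An easy way to reproduce this without a full application is `memhog`, which on most distributions ships alongside `numactl` (that packaging detail is an assumption; any allocate-and-touch test behaves the same way):

```
# Comfortably under the limit: succeeds.
numactl --preferred=1 numactl --membind=1,3 --cpunodebind=1 memhog 150g

# Past ~200GB: per the behavior described above, this gets OOM-killed
# (or fails its allocations with vm.overcommit_memory=2), even though
# node 3 still has hundreds of gigabytes free.
numactl --preferred=1 numactl --membind=1,3 --cpunodebind=1 memhog 300g
```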
Memory Zones
Nearly giving up, I finally check each of the memory blocks that make up the AEP NUMA nodes. For node 3, checking the first gigabyte of memory looks like:
`cat /sys/bus/node/devices/node3/memory1000/state`
The value was `online`. However, upon checking `valid_zones` in the same directory, it seems that the memory is in `ZONE_MOVABLE`, not `ZONE_NORMAL`.
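Rather than poking at one block at a time, a quick loop shows the state and zone of every memory block on the node (node 3 here is specific to my setup):

```
# Print each memory block's hotplug state and the zone(s) it can live in.
for d in /sys/bus/node/devices/node3/memory*; do
    echo "$d: state=$(cat $d/state) valid_zones=$(cat $d/valid_zones)"
done
```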
Reading up on this, it explains why I could manually bind to specifically that node and use its full capacity, but not fault memory onto it. Since this memory was onlined as `ZONE_MOVABLE`, the most that the kernel can access is the minimum of what the other nodes have: that is, the more-than-700GB node of AEP can only have ~96GB faulted onto it, so I get an OOM once allocations approach 200GB of memory (96GB of DDR, plus ~96GB of AEP which I can fault to). This also explains why binding directly to that node succeeds: userspace applications can still use the full capacity of the node just fine, and `numastat -m` shows the full amount.
Searching further, I find out why the nodes were `ZONE_MOVABLE`: the `daxctl` command onlines them that way, by doing e.g.
`echo online_movable > /sys/bus/node/devices/node3/memory1000/state`
to each of the `memoryXXXX` directories for a particular node. While this is fine for an application that binds directly to that node, a subtle issue is that the node cannot be fully faulted onto if it has more capacity than the minimum capacity of all of your other NUMA nodes (as will usually be the case for AEP).
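You can also see which zones a node's memory landed in from `/proc/zoneinfo`, e.g. for node 3:

```
# Lists the zones (Normal, Movable, ...) that node 3's memory is split across.
grep "Node 3, zone" /proc/zoneinfo
```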
The Solutions
The first and simplest solution is to modify `daxctl`. I chose this one, as it was the quickest to implement. For release version `v66`, I edited line 1095 of `daxctl/lib/libdaxctl.c` from `online_movable` to `online`, then recompiled, rebooted, and re-onlined the memory with this new version.
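If you'd rather not find the line by hand, something like the following should have the same effect; the string and location are as described for `v66`, so inspect the diff before rebuilding in case `online_movable` appears anywhere else:

```
# In the daxctl source tree: switch the onlining mode daxctl requests,
# then rebuild and reinstall the same way you built it the first time.
sed -i 's/online_movable/online/' daxctl/lib/libdaxctl.c
```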
For those who want to use the manual method, there are two more solutions. First, you can simply write a script that manually onlines each gigabyte of memory for your AEP regions. Your script would essentially iterate over each of the `memoryXXXX` directories in `/sys/bus/node/devices/nodeX/` and `echo online > state` for each of them; a sketch follows below. This would most likely be the simplest solution if you try the manual method above, but end up with a 0-size NUMA node (and don't want to use `daxctl`).
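A minimal sketch of such a script, assuming it's run as root and the node number is passed as an argument (the name `online-node.sh` is just for illustration):

```
#!/bin/bash
# Usage: ./online-node.sh 3
# Onlines every memory block belonging to the given NUMA node.
NODE=$1

for STATE in /sys/bus/node/devices/node${NODE}/memory*/state; do
    # Only touch blocks that aren't already online.
    if [ "$(cat "$STATE")" != "online" ]; then
        echo online > "$STATE"
    fi
done
```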
The solution that would be easier in the long term, though, is to enable the kernel configuration option `CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE`, which automatically onlines newly-hotplugged memory. The memory would then be immediately onlined into `ZONE_NORMAL` upon being bound to the `kmem` driver.
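To check whether your running kernel already has this enabled (assuming your distribution installs the kernel config under `/boot`):

```
# Look for CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y in the running kernel's config.
grep CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE /boot/config-$(uname -r)

# On kernels that expose it, the same default can be flipped at runtime;
# memory hotplugged afterwards (e.g. by a kmem rebind) is onlined automatically.
echo online | sudo tee /sys/devices/system/memory/auto_online_blocks
```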