Comment 10 for bug 1906280

Dmitrii Shcherbakov (dmitriis) wrote : Re: Charm stuck waiting for ovsdb 'no key "ovn-remote" in Open_vSwitch record'

Expanding on #8: in my testing in a different environment (Bionic host, Focal container), I found that ovs-vswitchd fails when it creates a pthread and glibc tries to mmap memory for the new thread's stack:

13077 20:22:25.392054 mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = -1 EAGAIN (Resource temporarily unavailable)
13077 20:22:25.392096 write(2, "ovs-vswitchd: ", 14) = 14
13077 20:22:25.392140 write(2, "pthread_create failed", 21) = 21
13077 20:22:25.392184 write(2, " (Resource temporarily unavailable)", 35) = 35
13077 20:22:25.392223 write(2, "\n", 1) = 1

The reason for that deserves a detailed description:

--------------------------
1. Stack size:

https://git.launchpad.net/ubuntu/+source/openvswitch/tree/lib/ovs-thread.c?h=applied/ubuntu/focal-updates#n438 (pthread_create in the OVS code)
https://git.launchpad.net/ubuntu/+source/glibc/tree/nptl/allocatestack.c?h=ubuntu/focal-updates&id=e639063e5d4ba5c296c990924eb4f290bc1d06ae#n562 (the glibc code doing the mmap, see the comment about PROT_NONE starting with line 559)

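The stack limit (RLIMIT_STACK, as reported by prlimit) is 8388608 bytes, and the mmap region also includes one 4096-byte guard page (8388608 + 4096 = 8392704), so the size passed to mmap is as expected. The use of PROT_NONE is expected as well (see the glibc comment linked above). So the failure is not caused by the stack size itself.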

STACK max stack size 8388608 unlimited bytes
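
To make the arithmetic explicit, here is a minimal C sketch. The helper name stack_mapping_size is hypothetical, and it assumes (as the glibc code linked above suggests) that the guard area is added on top of the requested stack size:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: length of the mapping
 * requested for one thread stack, assuming the guard area is added
 * on top of the stack size rather than carved out of it.
 */
static size_t stack_mapping_size(size_t stack_size, size_t guard_size)
{
    /* Refuse inputs whose sum would overflow size_t. */
    assert(stack_size <= SIZE_MAX - guard_size);
    return stack_size + guard_size;
}
```

With the values above, stack_mapping_size(8388608, 4096) gives 8392704, which matches the length in the failing mmap call.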

--------------------------
2. ovs-vswitchd applies memory locking to all memory allocations by default when started via ovs-ctl:

https://git.launchpad.net/ubuntu/+source/openvswitch/tree/vswitchd/ovs-vswitchd.c?h=applied/ubuntu/focal-updates#n93
    if (want_mlockall) {
#ifdef HAVE_MLOCKALL
        if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
            VLOG_ERR("mlockall failed: %s", ovs_strerror(errno));

https://git.launchpad.net/ubuntu/+source/openvswitch/tree/utilities/ovs-ctl.in?h=applied/ubuntu/focal-updates#n321
    MLOCKALL=yes

https://git.launchpad.net/ubuntu/+source/openvswitch/tree/utilities/ovs-ctl.in?h=applied/ubuntu/focal-updates#n210
        if test X"$MLOCKALL" != Xno; then
            set "$@" --mlockall
        fi

--------------------------
3. EAGAIN returned by mmap and memory locking

mmap returns EAGAIN when it cannot lock the memory for the new mapping. Memory cannot be locked if doing so would take the process beyond RLIMIT_MEMLOCK, unless the process has CAP_IPC_LOCK in the initial user namespace (for example, because it runs as uid 0 there).

https://elixir.bootlin.com/linux/v4.15.18/source/mm/mmap.c#L1385 (do_mmap)
 if (mlock_future_check(mm, vm_flags, len))
  return -EAGAIN;
https://elixir.bootlin.com/linux/v4.15.18/source/mm/mmap.c#L1300 (mlock_future_check)
  if (locked > lock_limit && !capable(CAP_IPC_LOCK))
   return -EAGAIN;
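
As a rough illustration, the check can be modelled in userspace C. This is a simplified sketch, not the kernel code: the function and parameter names are made up for illustration, a 4 KiB page size is assumed, and the mapping is assumed to carry VM_LOCKED (which is the case after mlockall(MCL_FUTURE)).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumption: 4 KiB pages (PAGE_SHIFT = 12), as on amd64. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)

/*
 * Userspace model of the quoted check; all names are hypothetical.
 *   locked_vm_pages: pages already accounted as locked for the process
 *   len:             length in bytes of the new mapping (page-aligned)
 *   memlock_limit:   RLIMIT_MEMLOCK soft limit in bytes
 *   has_ipc_lock:    stands in for capable(CAP_IPC_LOCK)
 * Returns 0 if the mapping would be allowed, -EAGAIN otherwise.
 */
static int model_mlock_future_check(unsigned long locked_vm_pages,
                                    unsigned long len,
                                    unsigned long memlock_limit,
                                    bool has_ipc_lock)
{
    /* The model expects a length that is already a multiple of the page size. */
    assert((len & (MODEL_PAGE_SIZE - 1)) == 0);

    unsigned long locked = (len >> MODEL_PAGE_SHIFT) + locked_vm_pages;
    unsigned long lock_limit = memlock_limit >> MODEL_PAGE_SHIFT;

    if (locked > lock_limit && !has_ipc_lock)
        return -EAGAIN;
    return 0;
}
```

Under this model with a 16777216-byte limit, one 8392704-byte mapping (2049 pages) fits within the 4096-page limit, but a second one would bring the total to 4098 pages and is refused with -EAGAIN unless has_ipc_lock is true.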

The mlock man page documents that using mlockall(MCL_FUTURE) may cause later mmap calls to fail once RLIMIT_MEMLOCK is hit. The root user (uid 0) in the initial user namespace is not affected, because it has CAP_IPC_LOCK and the limit is ignored for it:

https://man7.org/linux/man-pages/man2/mlock.2.html
"In Linux 2.6.8 and earlier, a process must be privileged (CAP_IPC_LOCK) in order to lock memory and the RLIMIT_MEMLOCK soft resource limit defines a limit on how much memory the process may lock.
"Since kernel 2.6.9, if a privileged process calls mlockall(MCL_FUTURE) and later drops privileges (loses the CAP_IPC_LOCK capability by, for example, setting its effective UID to a nonzero value), then subsequent memory allocations (e.g., mmap(2), brk(2)) will fail if the RLIMIT_MEMLOCK resource limit is encountered."

The mlockall man page documents the case where stack allocations may fail with MCL_FUTURE:

https://man7.org/linux/man-pages/man2/mlockall.2.html
"MCL_FUTURE
Lock all pages which will become mapped into the address space of the process in the future.  These could be, for instance, new pages required by a growing heap and stack as well as new memory-mapped files or shared memory regions.

If MCL_FUTURE has been specified, then a later system call (e.g., mmap(2), sbrk(2), malloc(3)), may fail if it would cause the number of locked bytes to exceed the permitted maximum (see below).  In the same circumstances, stack growth may likewise fail: the kernel will deny stack expansion and deliver a SIGSEGV signal to the process."

--------------------------
4. RLIMIT_MEMLOCK defaults

The memlock limit is set to 16777216 bytes (16 MiB) on the host side by systemd:

sudo prlimit --memlock --pid 1
RESOURCE DESCRIPTION SOFT HARD UNITS
MEMLOCK max locked-in-memory address space 16777216 16777216 bytes

https://git.launchpad.net/ubuntu/+source/systemd/tree/src/core/main.c?h=applied/237-3ubuntu10.41#n1298
        r = setrlimit_closest(RLIMIT_MEMLOCK, &RLIMIT_MAKE_CONST(1024ULL*1024ULL*16ULL));

This gets inherited by LXD containers created on the host as well.

On the Bionic host, I have systemd 237-3ubuntu10.41 installed and running:

dpkg -l | grep systemd
# ...
ii libsystemd0:amd64 237-3ubuntu10.41 amd64
ii systemd 237-3ubuntu10.41 amd64 system and service manager

However, a newer systemd version, 237-3ubuntu10.43, is available from the archive and includes a fix for https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1830746

https://github.com/systemd/systemd/commit/91cfdd8d29 (upstream commit)
https://git.launchpad.net/ubuntu/+source/systemd/commit/?h=applied/ubuntu/bionic-updates&id=a91c89eebc9e70631653dfd2c4148801e029181e (downstream patch for bionic)

This raises the memlock limit (RLIMIT_MEMLOCK) from 16 MiB to 64 MiB.
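
For a rough sense of what the change means for thread stacks, the sketch below estimates how many default-size stack mappings fit under a given memlock limit. It is only an upper bound under stated assumptions: the helper name is hypothetical, every stack mapping is assumed to count in full against the limit, and all other locked memory of the process is ignored.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: upper bound on the number
 * of stack mappings of mapping_len bytes that fit under memlock_limit
 * bytes, ignoring everything else the process has locked.
 */
static size_t max_locked_stacks(size_t memlock_limit, size_t mapping_len)
{
    /* Avoid dividing by zero on a nonsensical mapping length. */
    assert(mapping_len > 0);
    return memlock_limit / mapping_len;
}
```

With the 8392704-byte mapping seen in the strace output, this gives 1 stack under the 16 MiB limit and 7 under the 64 MiB limit. In practice the real headroom is lower, since the rest of the process's memory is locked too.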

--------------------------