Comment 12 for bug 1906280

Dmitrii Shcherbakov (dmitriis) wrote : Re: Charm stuck waiting for ovsdb 'no key "ovn-remote" in Open_vSwitch record'

5. In #10 I referred to the ability of a process with CAP_IPC_LOCK to bypass RLIMIT_MEMLOCK:

https://elixir.bootlin.com/linux/v4.15.18/source/mm/mmap.c#L1300 (mlock_future_check)
  if (locked > lock_limit && !capable(CAP_IPC_LOCK))
    return -EAGAIN;
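
To make the condition concrete, here is a minimal userspace sketch of that check in Python. It is only a model, not kernel code: the function signature, the 4 KiB page size and the page arithmetic around the quoted `if` are assumptions based on a reading of the linked function, and the boolean argument stands in for the result of capable(CAP_IPC_LOCK).

```python
import errno

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def mlock_future_check(locked_vm_pages, request_bytes, rlimit_memlock_bytes,
                       has_cap_ipc_lock_in_init_ns):
    """Model of the limit test: return 0 if allowed, -EAGAIN otherwise.

    locked_vm_pages: pages the process has already locked (hypothetical input)
    request_bytes: size of the new mapping that would be locked
    rlimit_memlock_bytes: the soft RLIMIT_MEMLOCK value in bytes
    has_cap_ipc_lock_in_init_ns: stand-in for capable(CAP_IPC_LOCK)
    """
    locked = (request_bytes >> PAGE_SHIFT) + locked_vm_pages
    lock_limit = rlimit_memlock_bytes >> PAGE_SHIFT
    # Same shape as the kernel condition quoted above.
    if locked > lock_limit and not has_cap_ipc_lock_in_init_ns:
        return -errno.EAGAIN
    return 0


if __name__ == "__main__":
    limit = 64 * 1024  # example limit of 64 KiB, i.e. 16 pages
    print(mlock_future_check(0, 17 * 4096, limit, False))  # over the limit, no capability
    print(mlock_future_check(0, 17 * 4096, limit, True))   # over the limit, capability held
```

The only way past the limit in this model is the last argument, which is why it matters where the kernel looks for the capability.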

This raises the question of whether the capability needs to be effective (see man 7 capabilities) in the user namespace of the unprivileged container or in the initial user namespace.

Based on what I see, CAP_IPC_LOCK is not dropped for unprivileged containers (also based on a comment from Stephane here https://discuss.linuxcontainers.org/t/how-to-add-cap-ipc-lock-capabilities-to-container/484/2):

$ ps 17228
  PID TTY STAT TIME COMMAND
17228 ? Ss 0:00 /sbin/init

$ grep Cap /proc/17228/status
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000

$ capsh --decode=0000003fffffffff
0x0000003fffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
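
The same thing can be checked without capsh by testing a single bit of the mask. A small Python sketch, assuming CAP_IPC_LOCK is capability number 14 (as defined in linux/capability.h); the helper name is made up for illustration:

```python
CAP_IPC_LOCK = 14  # capability number from linux/capability.h


def has_capability(mask_hex, cap_number):
    """Return True if bit cap_number is set in a hex Cap* mask
    as printed in /proc/<pid>/status."""
    return bool((int(mask_hex, 16) >> cap_number) & 1)


if __name__ == "__main__":
    cap_eff = "0000003fffffffff"  # CapEff value from the output above
    print(has_capability(cap_eff, CAP_IPC_LOCK))
```

Note that this only shows the bit is set in the process's capability sets; it says nothing about which user namespace those sets apply to.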

The "capable" function in the kernel checks for the presence of a capability in the **initial** user namespace (the function comment seems to refer to that as having a "superior capability"); in v4.15.18 it is a thin wrapper that calls ns_capable with &init_user_ns:

https://elixir.bootlin.com/linux/v4.15.18/source/kernel/capability.c#L429 (capable)
https://elixir.bootlin.com/linux/v4.15.18/source/kernel/user.c#L26
struct user_namespace init_user_ns = {

This is in contrast to the ns_capable function, for example, which checks the capability against a user namespace passed in by the caller:
https://elixir.bootlin.com/linux/v4.15.18/source/kernel/capability.c#L395 (ns_capable)
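
The difference between the two can be illustrated with a toy model in Python. This is a deliberately simplified sketch, not the kernel algorithm: the classes and names are hypothetical, and it leaves out the rule that gives a namespace's owner all capabilities in it, as well as LSM hooks. It only models the idea that a task's capabilities count in its own user namespace and in namespaces below it, never in a parent.

```python
class UserNamespace:
    """Toy user namespace with a link to its parent (None for the initial one)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


class Task:
    """Toy task: lives in one user namespace and holds effective capabilities there."""

    def __init__(self, user_ns, effective_caps):
        self.user_ns = user_ns
        self.effective_caps = set(effective_caps)


def ns_capable(task, target_ns, cap):
    """Walk from target_ns up through its parents; the task's effective set
    only counts if the walk reaches the task's own namespace."""
    ns = target_ns
    while ns is not None:
        if ns is task.user_ns:
            return cap in task.effective_caps
        ns = ns.parent
    return False


INIT_USER_NS = UserNamespace("init_user_ns")


def capable(task, cap):
    """Model of capable(): always checks against the initial user namespace."""
    return ns_capable(task, INIT_USER_NS, cap)


if __name__ == "__main__":
    container_ns = UserNamespace("container", parent=INIT_USER_NS)
    container_root = Task(container_ns, {"CAP_IPC_LOCK"})
    print(ns_capable(container_root, container_ns, "CAP_IPC_LOCK"))
    print(capable(container_root, "CAP_IPC_LOCK"))
```

In this model, root inside the container passes an ns_capable check against its own namespace but fails capable, even though the capability bit is set in its effective set.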

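Therefore, we will not be able to use CAP_IPC_LOCK for users in unprivileged LXD containers to bypass RLIMIT_MEMLOCK: the capability is effective only in the container's user namespace, while mlock_future_check tests it against the initial user namespace.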