adding seccomp rule for socket() fails on i386 since kernel 4.3

Bug #1526358 reported by Martin Pitt
Affects              Status        Importance  Assigned to     Milestone
libseccomp (Ubuntu)  Fix Released  Undecided   Unassigned
linux (Ubuntu)       Invalid       Medium      Andy Whitcroft
systemd (Ubuntu)     Invalid       Medium      Unassigned

Bug Description

Four days ago, on Dec 10, http://autopkgtest.ubuntu.com/packages/s/systemd/xenial/i386/ started failing:

======================================================================
FAIL: test_boot (__main__.NspawnTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/adt-run.IG1dKn/build.Yzd/systemd-228/debian/tests/boot-and-services", line 204, in test_boot
    self.assertIn(b'fake container started', out)
AssertionError: b'fake container started' not found in b'Spawning container c1 on /tmp/tmpl04y_tf8/c1.\nPress ^] three times within 1s to kill container.\nFailed to create directory /tmp/tmpl04y_tf8/c1/sys/fs/selinux: Read-only file system\nFailed to create directory /tmp/tmpl04y_tf8/c1/sys/fs/selinux: Read-only file system\nFailed to add audit seccomp rule: Bad address\n'

This is reproducible in xenial-release, i.e. it has already slipped through -proposed.

This can be reproduced easily on a xenial i386 VM:

  sudo apt-get install busybox-static
  mkdir -p /tmp/c/sbin /tmp/c/etc /tmp/c/bin/
  cp /bin/busybox /tmp/c/bin/
  ln -s ../bin/busybox /tmp/c/sbin/init
  ln -s busybox /tmp/c/bin/sh
  cp /etc/os-release /tmp/c/etc
  sudo systemd-nspawn -b -D /tmp/c

This should normally boot a busybox container; you'll get a few error messages as there's no SysV init stuff there, but it should start and pressing enter should get you into a shell. But on i386 it fails with

$ sudo systemd-nspawn -b -D /tmp/c
Spawning container c on /tmp/c.
Press ^] three times within 1s to kill container.
Failed to create directory /tmp/c/sys/fs/selinux: Read-only file system
Failed to create directory /tmp/c/sys/fs/selinux: Read-only file system
Failed to add audit seccomp rule: Bad address

which is what the test case fails on too.

Martin Pitt (pitti) wrote :
tags: added: i386 regression-release xenial
Changed in systemd (Ubuntu):
importance: Undecided → High
status: New → Triaged
Martin Pitt (pitti) wrote :

As I suspected, rebuilding the 228-2ubuntu1 systemd source in current xenial does not fix this. The ubuntu1 → ubuntu2 delta was relatively small and does not touch nspawn/seccomp at all. So I figure this is a regression in some -dev package of libc, seccomp, or linux-libc-dev.

Martin Pitt (pitti) wrote :

Version comparison between the two builds:

 - libseccomp-dev: Both versions built against 2.2.3-2ubuntu1
 - libc6-dev: 2.21-0ubuntu4 → 2.21-0ubuntu5
 - linux-libc-dev: 4.2.0-19.23 → 4.3.0-2.11
 - binutils: 2.25.51.20151113-2ubuntu1 → 2.25.90.20151209-1ubuntu1
 - gcc-5: 5.2.1-24ubuntu3 → 5.3.1-3ubuntu1

This reproduces in a schroot after bind-mounting cgroupfs:

sudo mount -o bind /sys/fs/cgroup/ /var/lib/schroot/mount/schroot-xenial-i386-systemd/sys/fs/cgroup/

I bisected the above toolchain packages, and when building systemd against linux-libc-dev 4.2.0-16.19 it works again.

Martin Pitt (pitti) wrote :

I tried in the forward direction: linux-libc-dev 4.3.0-4.13 still fails, and that's the latest xenial one (4.3.0-5.14 is not built yet).

I also tried 4.4.0-0.5 in the unstable PPA (https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/unstable/+build/8438036) and it still fails.

Changed in systemd (Ubuntu):
importance: High → Medium
Martin Pitt (pitti) wrote :

For the record: if someone bisects this, I strongly advise building systemd in a pre-created schroot with

   CFLAGS="-g -O0" DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage -Pnoudeb -us -uc -b -j4

which will only take some 3 minutes instead of 20.

Martin Pitt (pitti)
Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: bot-stop-nagging
Martin Pitt (pitti) wrote :

I further bisected it down to adding this line to /usr/include/i386-linux-gnu/asm/unistd_32.h:

  #define __NR_socket 359

If I drop just that line and rebuild systemd, seccomp/nspawn work again.

While systemd does define some syscalls for some more obscure platforms in https://github.com/systemd/systemd/blob/master/src/basic/missing.h if the kernel headers don't already define them, these don't seem to collide with either __NR_socket or the value 359. The only place where I see this referenced is in /usr/include/asm-generic/unistd.h:

#define __NR_socket 198
__SYSCALL(__NR_socket, sys_socket)

but as that redefines __NR_socket I figure that's unrelated. Commenting it out doesn't change the behaviour at all.

I confirm this on Debian sid which has linux-libc-dev 4.3.3-1, exact same situation.

At this point I'm afraid I don't understand what's going on and what these new syscall definitions do.

Martin Pitt (pitti) wrote :

I tried to #undef __NR_socket in the systemd sources, to see where this value is actually being used. Turns out it is in https://github.com/systemd/systemd/blob/master/src/nspawn/nspawn.c#L1577 in setup_seccomp():

        r = seccomp_rule_add(
                        seccomp,
                        SCMP_ACT_ERRNO(EAFNOSUPPORT),
                        SCMP_SYS(socket),
                        2,
                        SCMP_A0(SCMP_CMP_EQ, AF_NETLINK),
                        SCMP_A2(SCMP_CMP_EQ, NETLINK_AUDIT));
        if (r < 0) {
                log_error_errno(r, "Failed to add audit seccomp rule: %m");

where SCMP_SYS is a macro from libseccomp-dev (/usr/include/seccomp.h):

/**
 * Convert a syscall name into the associated syscall number
 * @param x the syscall name
 */
#define SCMP_SYS(x) (__NR_##x)

So this links the new syscall definition to seccomp. Apparently seccomp_rule_add() (in the same seccomp.h file) behaves differently if the syscall is defined. I just wonder how this actually built on i386 with the 4.2.0 kernel headers which did not have __NR_socket defined?

With current 4.3 kernel headers, the value of SCMP_SYS(socket) == 359, as defined above. With the previous 4.2 kernel headers, the value is 4294967195 == 0xFFFFFF9B instead, apparently some auto-generated value. So this explains how it built before.
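The two values above can be related with a little arithmetic. As a rough sketch, and assuming the pre-4.3 fallback is a negative placeholder of -101 (which is what 0xFFFFFF9B is when read as a signed 32-bit integer), the token-pasting lookup behaves roughly like this. `DEMO_SYS`, `DEMO_NR_socket` and `DEMO_PSEUDO_socket` are made-up names for illustration, not identifiers from seccomp.h or the kernel headers:

```c
#include <stdint.h>

/* Hypothetical placeholder for a syscall the headers do not wire up.
 * The value -101 is an assumption inferred from 0xFFFFFF9B above. */
#define DEMO_PSEUDO_socket (-101)

/* If the "kernel headers" did not provide a real number, fall back to
 * the placeholder. Defining DEMO_NR_socket as 359 before this point
 * models the 4.3 headers. */
#ifndef DEMO_NR_socket
#define DEMO_NR_socket DEMO_PSEUDO_socket
#endif

/* Same token-pasting idea as SCMP_SYS(x): the pasted identifier is
 * itself a macro, so it expands to whichever number is in effect. */
#define DEMO_SYS(x) (DEMO_NR_##x)

/* Reinterpret a syscall number as an unsigned 32-bit value, which is
 * how a negative placeholder looks when printed as unsigned. */
static uint32_t demo_as_unsigned(int nr)
{
    return (uint32_t)nr;
}
```

With no real number defined, `DEMO_SYS(socket)` expands to -101, and `demo_as_unsigned(DEMO_SYS(socket))` yields 4294967195, i.e. 0xFFFFFF9B. The code compiles either way, which fits the observation that the build succeeded with both sets of headers.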

So it looks like this might be between libseccomp and the kernel now?

Martin Pitt (pitti) wrote :

I now isolated this seccomp failure into a tiny .c file which reproduces this. On amd64 it works:

$ gcc -o /tmp/o ~/seccomp-socket-filter.c -lseccomp && /tmp/o
SCMP_SYS(socket) == 41 == 29
Success

and on i386 it reproduces the error:

$ gcc -o /tmp/o ~/seccomp-socket-filter.c -lseccomp && /tmp/o
SCMP_SYS(socket) == 359 == 167
seccomp_rule_add failed: Bad address
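The first output line in both runs appears to be the syscall number in decimal followed by the same number in hexadecimal (41 is 0x29, 359 is 0x167). The attached program is not reproduced here, so the following is only a guess at that part of it; `native_socket_nr` and `format_nr` are hypothetical helper names, and the format string is inferred from the output:

```c
#include <stdio.h>
#include <sys/syscall.h>

/* Number the installed headers give for socket(), or -1 if they do not
 * define one at all. */
static long native_socket_nr(void)
{
#ifdef SYS_socket
    return SYS_socket;
#else
    return -1;
#endif
}

/* Render a syscall number as "<decimal> == <hex>". The format is an
 * assumption based on the output shown above. */
static int format_nr(char *buf, size_t len, long nr)
{
    return snprintf(buf, len, "%ld == %lx", nr, (unsigned long)nr);
}
```

Calling `format_nr(buf, sizeof buf, native_socket_nr())` and printing `buf` shows which number a given set of headers assigns to socket() on the build machine.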

So what systemd is trying to do is to first initialize seccomp with possible alternative architectures (running 32 bit container on 64 bit host, and vice versa if you have a 64 bit kernel) and then disallow opening socket()s to the netlink audit subsystem, as audit is broken for containers. The gist of it is

    seccomp = seccomp_init(SCMP_ACT_ALLOW);
    seccomp_arch_add(seccomp, SCMP_ARCH_X86_64);
    seccomp_rule_add(
            seccomp,
            SCMP_ACT_ERRNO(EAFNOSUPPORT),
            SCMP_SYS(socket),
            2,
            SCMP_A0(SCMP_CMP_EQ, AF_NETLINK),
            SCMP_A2(SCMP_CMP_EQ, NETLINK_AUDIT));

This worked on both arches until __NR_socket got defined on i386; before that, it used the autogenerated value.

summary: - xenial/i386 regression: nspawn fails with "Failed to add audit seccomp
- rule: Bad address"
+ adding seccomp rule for socket() fails on i386 since kernel 4.3
Martin Pitt (pitti) wrote :

This isn't specific to netlink. I removed the two argument comparisons (SCMP_A0/SCMP_A2) from the seccomp filter and simplified it to just block socket() in general. I also simplified adding the arches so that only the non-native arch is added, not the native one. Note that adding the socket() filter *does* work on both arches if the non-native architecture is not added; it only fails when x86_64 is added to the filter on i386.

Martin Pitt (pitti) wrote :

Forgot to attach the simplified file.

Andy Whitcroft (apw) wrote :

So in the commit below we switched how the socket family of calls are exposed at the syscall level (which was a 4.3-rc1 change):

  commit 9dea5dc921b5f4045a18c63eb92e84dc274d17eb
  Author: Andy Lutomirski <email address hidden>
  Date: Tue Jul 14 15:24:24 2015 -0700

    x86/entry/syscalls: Wire up 32-bit direct socket calls

One of the stated goals of this was to expose these calls for seccomp mediation and to bring 32-bit in line with 64-bit. So it is certain that we never did seccomp mediation on these before.

Martin Pitt (pitti) wrote :

Notified systemd upstream in https://github.com/systemd/systemd/issues/2177 .

Robie Basak (racb)
Changed in libseccomp (Ubuntu):
status: New → Triaged
Andy Whitcroft (apw) wrote :

When running the example above, the EFAULT is generated in userspace. Looking at libseccomp, it seems to carry a literal copy of the system call table, mapping call names to local numbers. For 32-bit the new system calls are not filled in, so they fail. Essentially libseccomp and the kernel headers are out of sync: systemd thinks it can use real mitigation on socket(), but libseccomp does not think 32-bit supports it.
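To make the out-of-sync table idea concrete, here is a toy model. It is not libseccomp's code: the struct, the table contents and the `demo_*` functions are all hypothetical, and returning -EFAULT is chosen only to match the "Bad address" message seen above. The assumed mechanism is that a rule given as a native syscall number is translated to another architecture by going through the syscall name:

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* One row of a per-architecture syscall table. */
struct demo_syscall {
    const char *name;
    int nr;
};

/* Heavily trimmed, hypothetical tables; each ends with a NULL name. */
static const struct demo_syscall demo_x86_old[] = {
    { "socketcall", 102 }, { NULL, 0 }
};
static const struct demo_syscall demo_x86_patched[] = {
    { "socketcall", 102 }, { "socket", 359 }, { NULL, 0 }
};
static const struct demo_syscall demo_x86_64[] = {
    { "socket", 41 }, { NULL, 0 }
};

/* Look up the name for a number; NULL if the table has no such entry. */
static const char *demo_name_of(const struct demo_syscall *t, int nr)
{
    for (; t->name != NULL; t++)
        if (t->nr == nr)
            return t->name;
    return NULL;
}

/* Look up the number for a name; -1 if the table has no such entry. */
static int demo_nr_of(const struct demo_syscall *t, const char *name)
{
    for (; t->name != NULL; t++)
        if (strcmp(t->name, name) == 0)
            return t->nr;
    return -1;
}

/* Translate a native syscall number to the target arch via its name. */
static int demo_translate(const struct demo_syscall *native,
                          const struct demo_syscall *target, int nr)
{
    const char *name = demo_name_of(native, nr);
    int out;

    if (name == NULL)
        return -EFAULT; /* the native table does not know this number */
    out = demo_nr_of(target, name);
    return out < 0 ? -EFAULT : out;
}
```

In this model, `demo_translate(demo_x86_old, demo_x86_64, 359)` fails with -EFAULT because the old 32-bit table has no entry for 359, while the same call with `demo_x86_patched` returns 41. That would also be consistent with the failure only showing up once a second architecture is added to the filter.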

Martin Pitt (pitti)
Changed in linux (Ubuntu):
status: Confirmed → Invalid
Changed in libseccomp (Ubuntu):
status: Triaged → In Progress
Changed in systemd (Ubuntu):
status: Triaged → Invalid
Andy Whitcroft (apw)
Changed in libseccomp (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu):
importance: Undecided → Medium
assignee: nobody → Andy Whitcroft (apw)
Changed in libseccomp (Ubuntu):
assignee: nobody → Andy Whitcroft (apw)
Martin Pitt (pitti) wrote :
Changed in libseccomp (Ubuntu):
assignee: Andy Whitcroft (apw) → nobody
importance: High → Undecided
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libseccomp - 2.2.3-2ubuntu3

---------------
libseccomp (2.2.3-2ubuntu3) xenial; urgency=low

  * debian/patches/add-x86-32bit-socket-calls.patch: add the newly
    connected direct socket calls. (LP: #1526358)

 -- Andy Whitcroft <email address hidden> Wed, 16 Dec 2015 14:30:17 +0000

Changed in libseccomp (Ubuntu):
status: Fix Committed → Fix Released