glibc 2.34 upgrade will break some essential services

Bug #1942276 reported by Sergio Durigan Junior
This bug affects 2 people
Affects             Status        Importance  Assigned to  Milestone
docker.io (Ubuntu)  Fix Released  Medium      Unassigned
glibc (Ubuntu)      Fix Released  High        Unassigned

Bug Description

Try this:

$ lxc launch ubuntu-daily:impish test-docker
$ lxc shell test-docker
# cat <<EOF >/etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
deb http://archive.ubuntu.com/ubuntu/ $(lsb_release -cs)-proposed restricted main multiverse universe
EOF
# apt update
# apt install libc-bin -y
...
(debconf query asking which services should be restarted. Just select Ok)
...
Restarting services...
 systemctl restart accounts-daemon.service console-getty.service cron.service packagekit.service polkit.service rsyslog.service snapd.service ssh.service systemd-journald.service systemd-networkd.service systemd-resolved.service systemd-udevd.service udisks2.service
Job for systemd-networkd.service failed.
See "systemctl status systemd-networkd.service" and "journalctl -xeu systemd-networkd.service" for details.
Job for systemd-resolved.service failed because the control process exited with error code.
See "systemctl status systemd-resolved.service" and "journalctl -xeu systemd-resolved.service" for details.
Service restarts being deferred:
 /etc/needrestart/restart.d/dbus.service
 systemctl restart networkd-dispatcher.service
 systemctl restart systemd-logind.service
 systemctl restart unattended-upgrades.service
# ping ubuntu.com
ping: ubuntu.com: Temporary failure in name resolution
# systemctl status systemd-networkd
× systemd-networkd.service - Network Service
     Loaded: loaded (/lib/systemd/system/systemd-networkd.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2021-09-01 20:41:03 UTC; 36s ago
TriggeredBy: × systemd-networkd.socket
       Docs: man:systemd-networkd.service(8)
    Process: 2411 ExecStart=/lib/systemd/systemd-networkd (code=exited, status=217/USER)
   Main PID: 2411 (code=exited, status=217/USER)

Sep 01 20:41:03 test-docker systemd[1]: systemd-networkd.service: Scheduled restart job, restart counter is at 5.
Sep 01 20:41:03 test-docker systemd[1]: Stopped Network Service.
Sep 01 20:41:03 test-docker systemd[1]: systemd-networkd.service: Start request repeated too quickly.
Sep 01 20:41:03 test-docker systemd[1]: systemd-networkd.service: Failed with result 'exit-code'.
Sep 01 20:41:03 test-docker systemd[1]: Failed to start Network Service.

The same can be reproduced inside a VM. If the user reboots the system, it becomes usable again.

[ Original Description ]

This bug is blocking docker.io on update-excuses.

I noticed that docker.io version 20.10.7-0ubuntu2 (currently in impish-proposed) is failing to start when installed inside an Impish LXD container. You can reproduce the bug by doing:

$ lxc launch ubuntu-daily:impish test-docker -c security.nesting=true
$ lxc shell test-docker
# cat <<EOF >/etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
deb http://archive.ubuntu.com/ubuntu/ $(lsb_release -cs)-proposed restricted main multiverse universe
EOF
# apt update
# apt install docker.io -y
...
Setting up docker.io (20.10.7-0ubuntu2) ...
Adding group `docker' (GID 120) ...
Done.
Created symlink /etc/systemd/system/multi-user.target.wants/docker.service → /lib/systemd/system/docker.service.
Created symlink /etc/systemd/system/sockets.target.wants/docker.socket → /lib/systemd/system/docker.socket.
A dependency job for docker.service failed. See 'journalctl -xe' for details.
invoke-rc.d: initscript docker, action "start" failed.
○ docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
     Active: inactive (dead)
TriggeredBy: × docker.socket
       Docs: https://docs.docker.com

Sep 01 01:52:47 test-docker systemd[1]: Dependency failed for Docker Application Container Engine.
Sep 01 01:52:47 test-docker systemd[1]: docker.service: Job docker.service/start failed with result 'dependency'.
...

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

Investigating a bit more, here's what we see when we check the status of docker.socket:

# systemctl status docker.socket
× docker.socket - Docker Socket for the API
     Loaded: loaded (/lib/systemd/system/docker.socket; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2021-09-01 02:05:47 UTC; 14s ago
   Triggers: ● docker.service
     Listen: /run/docker.sock (Stream)

Sep 01 02:05:47 test-docker systemd[1]: Starting Docker Socket for the API.
Sep 01 02:05:47 test-docker systemd[2488]: docker.socket: Failed to resolve group docker: No such process
Sep 01 02:05:47 test-docker systemd[1]: docker.socket: Control process exited, code=exited, status=216/GROUP
Sep 01 02:05:47 test-docker systemd[1]: docker.socket: Failed with result 'exit-code'.
Sep 01 02:05:47 test-docker systemd[1]: Failed to listen on Docker Socket for the API.

What caught my attention is the following line:

Sep 01 02:05:47 test-docker systemd[2488]: docker.socket: Failed to resolve group docker: No such process

It doesn't make sense to me. As can be seen in the bug description, the "docker" group was properly created *before* systemd attempted to start the service/socket.

If we inspect the socket, we see that its group is indeed wrong (it should be "docker", but it's "root"):

# ls -la /var/run/docker.sock
srw-rw---- 1 root root 0 Sep 1 02:05 /var/run/docker.sock

What's strange is that systemd should be responsible for changing the ownership of the socket when the service is started, but it can't, because it fails to "resolve" the "docker" group. strace wasn't very helpful in determining what's going on here.

It's also interesting to note that I can't reproduce the problem with docker.io 20.10.7-0ubuntu1 (from impish-release). The installation finishes just fine.

I noticed that the last upload was about shipping libnetwork in the golang-github-docker-docker-dev package. At first glance, I don't see how this could have affected the docker installation here. Maybe libnetwork is messing with the socket creation somehow?

Revision history for this message
Tianon Gravi (tianon) wrote :

I can reproduce, but I can even reproduce lots of failures by only upgrading "libc6" and "libc-bin" (which come in with "docker.io"), without Docker even installed or being installed. Lots of other services then try to restart and fail to do so, and "apt update" even starts failing to resolve DNS.

Revision history for this message
Tianon Gravi (tianon) wrote :

After some discussion with mwhudson, I tried the following:

- add proposed
- install the libc6/libc-bin updates (which breaks a lot of stuff)
- restart the container
- stuff is working again
- install docker.io (from proposed)
- profit
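
The steps above can be sketched as a script run from the LXD host. This is only a sketch, not something anyone in this thread ran as-is: the function name run_workaround is made up for illustration, the container name and the sources line come from the bug description, and it assumes the lxc client is on the PATH. apt may still show the debconf service-restart prompt, and networking may need a few seconds to come back after the restart.

```shell
#!/bin/sh
# Hypothetical helper: upgrade libc first, restart the container, and only
# then install docker.io (the order described in the comment above).
run_workaround() {
    c="$1"   # container name, e.g. test-docker

    # 1. Add the -proposed pocket (same line as in the bug description).
    #    The single quotes make $(lsb_release -cs) expand inside the container.
    lxc exec "$c" -- sh -c 'echo "deb http://archive.ubuntu.com/ubuntu/ $(lsb_release -cs)-proposed restricted main multiverse universe" > /etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list' || return 1

    # 2. Install the libc6/libc-bin updates (this is the step that breaks things).
    lxc exec "$c" -- apt update || return 1
    lxc exec "$c" -- apt install -y libc6 libc-bin || return 1

    # 3. Restart the container so PID 1 starts against the new glibc.
    lxc restart "$c" || return 1

    # 4. Install docker.io from -proposed.
    lxc exec "$c" -- apt install -y docker.io
}
```

Usage would be something like `run_workaround test-docker` on the host.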

Revision history for this message
Lucas Kanashiro (lucaskanashiro) wrote :

And what exactly should we do here to unblock this version from impish-proposed? Should we change debian/tests/docker-in-lxd to enable -proposed in the lxd container?

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

Indeed, I can confirm that if we enable -proposed, upgrade everything and reboot the container, then the docker.io installation succeeds.

I'm not sure what the best course of action is here, but I'm experimenting with a few things and will file a PR if/when I have something workable.

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

FWIW, I was able to reproduce this problem even inside a VM, which means that this is not related specifically to LXD containers.

Another interesting point here is the fact that a lot of important systemd services are unable to restart after the libc upgrade. We end up with a system without internet connectivity, for example.

I'm adding a glibc task to this bug and a block-proposed tag, as well as retitling it.

Changed in glibc (Ubuntu):
importance: Undecided → High
tags: added: block-proposed
summary: - docker 20.10.7-0ubuntu2 fails to start when installed inside Impish LXD
- container
+ glibc 2.34 upgrade will break some essential services
description: updated
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

What appears to be going on here is that systemd is not restarted as part of the glibc upgrade, so it is still running glibc 2.33. When it starts a service that does anything even slightly funky with users and groups (things that use DynamicUser= like systemd-resolved, but also things like docker, which just uses Group= on a socket), it forks itself and calls Name Service Switch (NSS) APIs, which dlopen NSS modules like /lib/x86_64-linux-gnu/libnss_files.so.2. But these modules now come from the glibc 2.34 package and are not compatible with the libc already loaded into the forked process, so the NSS calls all fail.

I don't know why this didn't bite us for other glibc upgrades -- nss modules are basically never cross version compatible afaik. Maybe systemd has changed and used to have an execve between the fork and any access to nss apis?
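
One way to check the "PID 1 is still running the old libc" theory is to look at the process's memory map. A minimal sketch, assuming the package upgrade replaces the library file on disk so that the kernel marks the old mapping as "(deleted)" in /proc/<pid>/maps; the function name stale_libc_mappings is made up for illustration, and reading /proc/1/maps needs root.

```shell
#!/bin/sh
# Hypothetical helper: print libc files that a process still has mapped
# although they have since been removed or replaced on disk.
# Argument: a maps file (defaults to PID 1's).
stale_libc_mappings() {
    maps_file="${1:-/proc/1/maps}"
    # Each maps line ends with the pathname; the kernel appends " (deleted)"
    # when the backing file has been unlinked. Field 6 is the pathname.
    grep -E '/libc[-.][^/]* \(deleted\)$' "$maps_file" | awk '{ print $6 }' | sort -u
}
```

Running `stale_libc_mappings` as root right after the libc6 upgrade (and before any reboot) should print the old libc path if this theory holds, and nothing after a reboot.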

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

So my previous comment isn't quite right.

For one thing, I *don't* think this affects DynamicUser=, only User= and Group=. And https://sourceware.org/legacy-ml/libc-help/2016-12/msg00006.html says that nss modules are supposed to be ABI compatible between releases but it appears that for whatever reason, nss_files from 2.34 is not compatible with glibc 2.33. Now to look into why that might be.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Wait a minute...

root@upgrade-testing:~# readelf --wide -s /usr/lib/x86_64-linux-gnu/libnss_files.so.2

Symbol table '.dynsym' contains 6 entries:
   Num: Value Size Type Bind Vis Ndx Name
     0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
     1: 0000000000000000 0 NOTYPE WEAK DEFAULT UND _ITM_deregisterTMCloneTable
     2: 0000000000000000 0 NOTYPE WEAK DEFAULT UND __gmon_start__
     3: 0000000000000000 0 NOTYPE WEAK DEFAULT UND _ITM_registerTMCloneTable
     4: 0000000000000000 0 FUNC WEAK DEFAULT UND __cxa_finalize@GLIBC_2.2.5 (3)
     5: 0000000000000000 0 OBJECT GLOBAL DEFAULT ABS GLIBC_PRIVATE

This doesn't look even close to correct!
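
To make the same check scriptable, the readelf output can be filtered for defined functions whose names start with _nss_files_. This is a sketch: the function name count_nss_files_symbols is made up for illustration, and the _nss_files_ prefix is an assumption about how an older libnss_files names its entry points.

```shell
#!/bin/sh
# Hypothetical helper: count defined _nss_files_* functions in the output of
# `readelf --wide -s`, read from stdin.
# Columns: Num Value Size Type Bind Vis Ndx Name
count_nss_files_symbols() {
    awk '$4 == "FUNC" && $7 != "UND" && $8 ~ /^_nss_files_/ { n++ }
         END { print n + 0 }'
}
```

Usage: `readelf --wide -s /usr/lib/x86_64-linux-gnu/libnss_files.so.2 | count_nss_files_symbols`. For the symbol table shown above it prints 0, since the only FUNC entry is undefined.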

Revision history for this message
Dan Bungert (dbungert) wrote :

https://sourceware.org/pipermail/libc-alpha/2021-June/127273.html

Stuff from nss_files was moved into glibc proper.
For libnss-db, we addressed this by no longer explicitly linking against -lnss_files.

LP: #1939918

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Oh right! Yes, that explains everything. I filed an upstream bug, https://sourceware.org/bugzilla/show_bug.cgi?id=28300, although I'm not really sure what I expect them to do about it. The obvious thing to do would be to re-exec systemd on upgrade, but as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=753725 explains, that's not completely safe. I guess we could revert that patch series. I can't really see how to fix this without another glibc upload, though :/
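
For an already-affected system, a manual recovery without a reboot might look like the sketch below. It is untested and is not the packaging fix: it assumes that re-executing PID 1 with `systemctl daemon-reexec` is enough for systemd to pick up the new glibc, and the Debian bug linked above explains why a re-exec is not completely safe. The function name recover_after_libc_upgrade is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: re-exec systemd, then retry the units that failed.
# Arguments: the unit names to restart.
recover_after_libc_upgrade() {
    # Re-execute the systemd manager so PID 1 loads the upgraded libc.
    systemctl daemon-reexec || return 1

    for unit in "$@"; do
        # Clear the failed state (and the "repeated too quickly" start limit)
        # before asking for another start.
        systemctl reset-failed "$unit"
        systemctl restart "$unit"
    done
}
```

Usage, as root: `recover_after_libc_upgrade systemd-networkd.service systemd-resolved.service`.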

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

OK, with thanks to Julian and Dimitri, I think this patch to glibc, https://paste.ubuntu.com/p/Bf9tDggMvv/, will avoid the issue and avoid the worst part of the referenced Debian bug (the kernel panic). It means another glibc upload, though :(

tags: added: rls-ii-incoming
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in docker.io (Ubuntu):
status: New → Confirmed
Changed in glibc (Ubuntu):
status: New → Confirmed
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

I've uploaded a glibc fix now and it appears to fix the bug so I'll go ahead and remove the block-proposed tag.

Changed in glibc (Ubuntu):
status: Confirmed → Fix Committed
tags: removed: block-proposed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package glibc - 2.34-0ubuntu2

---------------
glibc (2.34-0ubuntu2) impish; urgency=medium

  * d/patches/ubuntu/Fix-close_range-closefrom-tests.patch: Patch from
    upstream to fix test failures in autopkgtest environment (which has a
    pair of fds open that the test suite did not cope with).
  * d/debhelper.in/libc.postinst: go back to restarting systemd on libc6
    upgrade, but carefully. LP: #1942276

 -- Michael Hudson-Doyle <email address hidden> Fri, 03 Sep 2021 09:26:51 +1200

Changed in glibc (Ubuntu):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package docker.io - 20.10.7-0ubuntu3

---------------
docker.io (20.10.7-0ubuntu3) impish; urgency=medium

  * d/t/docker-in-lxd:
    Perform a full upgrade and restart of the container before attempting
    to install docker.io. (LP: #1942276)

 -- Sergio Durigan Junior <email address hidden> Wed, 01 Sep 2021 18:58:31 -0400

Changed in docker.io (Ubuntu):
status: Confirmed → Fix Released