lxc fails to start with cgroup error

Bug #1668123 reported by Brad Marshall
This bug affects 3 people
Affects: systemd (Ubuntu)
Status: Won't Fix
Importance: Critical
Assigned to: Dimitri John Ledkov

Bug Description

After rebooting a KVM instance hosting LXCs, we get the following error:

  $ sudo lxc-ls --fancy
  lxc: cgmanager.c: lxc_cgmanager_escape: 331 call to cgmanager_move_pid_abs_sync(name=dsystemd) failed: invalid request

and the LXCs won't start up. In the error logs it showed:

  lxc-start 1487906073.534 ERROR lxc_cgfs - cgfs.c:lxc_cgroupfs_create:873 - Could not find writable mount point for cgroup hierarchy 11 while trying to create cgroup.
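
To see which controller a hierarchy ID such as the 11 in that error refers to, something like the following could help. This is only a sketch: `hierarchy_controllers` is a made-up helper name, and it assumes the usual `hierarchy-ID:controller-list:cgroup-path` line layout of /proc/self/cgroup.

```shell
# Hypothetical helper: print the controller list for one cgroup hierarchy ID.
# Reads /proc/<pid>/cgroup-style lines ("ID:controllers:path") on stdin.
hierarchy_controllers() {
    awk -F: -v id="$1" '$1 == id { print $2 }'
}

# Example usage on the affected host (not run here):
#   hierarchy_controllers 11 < /proc/self/cgroup
```

If the ID belongs to a named hierarchy, the second field has the form `name=...` rather than a controller name.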

The only way I could get the LXCs started was to stop and then start cgmanager; a simple restart wasn't sufficient.

Please let me know if you need any further information.

$ lsb_release -a
Description: Ubuntu 14.04.5 LTS
Release: 14.04

$ dpkg-query -W systemd
systemd 204-5ubuntu20.24

$ dpkg-query -W cgmanager
cgmanager 0.24-0ubuntu7.5

$ dpkg-query -W lxc
lxc 1.0.9-0ubuntu2

$ uname -a
Linux infra 3.13.0-110-generic #157-Ubuntu SMP Mon Feb 20 11:54:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

Could you please attach a complete sosreport from the affected system?

You appear to be running an incompatible set of packages and features:
* deputy systemd should not be used by cgmanager
* lxc should not be using deputy systemd
* deputy systemd on trusty may not be used with stock kernel
* deputy systemd on trusty requires v4.4 kernel
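
The kernel requirement in the list above can be checked mechanically. A rough sketch, not an official tool: `kernel_at_least` is a hypothetical helper that compares the major.minor prefix of a `uname -r` style string against a minimum, here the 4.4 mentioned above.

```shell
# Hypothetical helper: succeed if a kernel release string is at least
# MAJOR.MINOR. Only the first two numeric components are compared.
kernel_at_least() {
    release=$1
    want_major=$2
    want_minor=$3
    major=${release%%.*}      # text before the first dot
    rest=${release#*.}        # text after the first dot
    minor=${rest%%.*}         # text before the next dot
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

# Example usage on a host (not run here):
#   if dpkg-query -W -f='${Status}' systemd 2>/dev/null | grep -q 'install ok installed' &&
#      ! kernel_at_least "$(uname -r)" 4 4; then
#       echo "systemd package installed on a pre-4.4 kernel"
#   fi
```

With the release strings that appear in this report, the check fails for 3.13.0-110-generic and succeeds for 4.4.0-66-generic.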

Changed in systemd (Ubuntu):
status: New → Confirmed
importance: Undecided → Critical
assignee: nobody → Dimitri John Ledkov (xnox)
Revision history for this message
Brad Marshall (brad-marshall) wrote :

I've generated a sosreport, but it's 7.7G. How would you like me to get this to you?

The interesting part is that all this was deployed via juju, so I don't know how we got into this state. It also appears to be the only node in this state, so it's a bit confusing as to how it got to be this way.

Is there anything I can check configuration-wise to see why it's using deputy systemd? It's certainly not something we've explicitly picked as far as I know.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

private-fileshare.c.c? or e.g. people.c.c? Or just give me ssh access into the machine?

Revision history for this message
Brad Marshall (brad-marshall) wrote :

It's uploading slowly to https://private-fileshare.canonical.com/~bradm/lp1668123/, once you see the .md5 file in place and the sosreport is 7.7G, it'll be done.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

Sidenote) the 18GB of /var/lib/juju/db (with backups, of backups, of backups) was not helpful, I'll need to talk to sosreport people about that. This is what made the report so huge.

1) It appears that deputy systemd was installed on the machine and subsequently upgraded:
2017-02-12 01:30:24 upgrade systemd:amd64 204-5ubuntu20.22 204-5ubuntu20.24

However, there are no logs available as to what/who/why 20.22 deputy systemd was installed.
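
For anyone wanting to repeat that search, one option is to filter the dpkg logs for package actions. The sketch below is an assumption about how to do it, not what was actually run: `pkg_history` is a hypothetical helper, and it relies on the dpkg.log line layout seen in the entry above (date, time, action, package:arch, versions).

```shell
# Hypothetical helper: print install/upgrade/remove/purge lines for one
# package from dpkg.log-style input on stdin.
pkg_history() {
    awk -v pkg="$1" '
        ($3 == "install" || $3 == "upgrade" || $3 == "remove" || $3 == "purge") {
            split($4, name, ":")          # strip the ":arch" suffix
            if (name[1] == pkg) print
        }'
}

# Example usage, including rotated and compressed logs (not run here):
#   zcat -f /var/log/dpkg.log* | pkg_history systemd
```

If the older logs have already been rotated away, nothing earlier than the upgrade will show up.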

2) Have you tried to use snapd on trusty on that host? Has anything else tried to do that? (e.g. juju manual provider or some such?!)

3) To recover the system, you should $ apt remove systemd; and reboot. However that is the workaround

4) Is this nested lxc? or errors inside the instances?
E.g. from logs I see failures to start lxc instances, but I don't see logs for failing to start instances for some reason.

5) Why was lxc downgraded/upgraded/downgraded multiple times?

6) Are the error messages from this machine? Whilst I do see that systemd is installed, and dsystemd cgroup is mounted, I am failing to find the logs for any lxc failures related to starting them.

Is there /var/log/lxc or some such that you could share privately? For some reason it was not part of the sosreport.

cgmanager should not be interacting with dsystemd.
systemd should not be present on this system (as hwe kernel is not in use, nor is snapd).
lxc should work irrespective of dsystemd.

I will set up trusty, with GA kernel, lxc1, deploy any charm (e.g. ubuntu), and install deputy systemd to try to reproduce this test case.

I wonder if upstart systemd job should be neutered, unless snapd is present, and we are booted with hwe kernel.

Changed in systemd (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Brad Marshall (brad-marshall) wrote :

> Sidenote) the 18GB of /var/lib/juju/db (with backups, of backups, of backups)
> was not helpful, I'll need to talk to sosreport people about that. This is
> what made the report so huge.

I did notice that, but I figured getting you all of the data was better than
fiddling around trying to not include that part, and maybe missing bits.
It'd be nice to have better control over it so we don't have to throw the juju
state db around if we don't need to.

> 1) It appears that deputy systemd was installed on the machine and
> subsequently upgraded:
> 2017-02-12 01:30:24 upgrade systemd:amd64 204-5ubuntu20.22 204-5ubuntu20.24

> However, there are no logs available as to what/who/why 20.22 deputy systemd
> was installed.

Interesting. I don't really know why it would have been installed.

> 2) Have you tried to use snapd on trusty on that host? Has anything else
> tried to do that? (e.g. juju manual provider or some such?!)

No, I don't believe anyone has, I don't see any evidence of that.

> 3) To recover the system, you should $ apt remove systemd; and reboot.
> However that is the workaround

Ok, I'll organise removing the package and rebooting it.

> 4) Is this nested lxc? or errors inside the instances?
> E.g. from logs I see failures to start lxc instances, but I don't see logs
> for failing to start instances for some reason.

This is LXCs on a KVM. The errors are in /var/log/lxc; it's odd
that the sosreport didn't include it.

> 5) Why was lxc downgraded/upgraded/downgraded multiple times?

We were trying to work out if LP#1656280 was related somehow; the
errors were occurring before we did that.

> 6) Are the error messages from this machine? Whilst I do see that systemd is
> installed, and dsystemd cgroup is mounted, I am failing to find the logs for
> any lxc failures related to starting them.

The errors are in /var/log/lxc - see the next reply.

> Is there /var/log/lxc or some such that you could share privately? For
> some reason it was not part of the sosreport.

I've uploaded it to https://private-fileshare.canonical.com/~bradm/lp1668123/lp1668123-var-log-lxc.tar.gz

> cgmanager should not be interacting with dsystemd.
> systemd should not be present on this system (as hwe kernel is not in use, nor is snapd).
> lxc should work irrespective of dsystemd.

It's odd that stopping and starting cgmanager would let LXC work then.

> I will set up trusty, with GA kernel, lxc1, deploy any charm (e.g. ubuntu),
> and install deputy systemd to try to reproduce this test case.

> I wonder if upstart systemd job should be neutered, unless snapd is present,
> and we are booted with hwe kernel.

It does sound like a good idea if we're going to have failures like what we
saw.

Revision history for this message
JuanJo Ciarlante (jjo) wrote :

FYI because of other maintenance I had to do on the affected nodes,
after upgrading to linux-generic-lts-xenial 4.4.0-66-generic this
issue didn't show anymore.

Revision history for this message
Brad Marshall (brad-marshall) wrote :

I've had this occur on another system, and this time I could debug it a little more freely. This is an interesting output:

# dpkg --purge --simulate systemd
dpkg: dependency problems prevent removal of systemd:
 snapd depends on systemd (>= 204-5ubuntu20.20); however:
  Package systemd is to be removed.

dpkg: error processing package systemd (--purge):
 dependency problems - not removing
Errors were encountered while processing:
 systemd

Now, why snapd was installed I'm not sure, but I guess that's what installed systemd, since the snapd package depends on systemd.
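
The same check can be scripted for other hosts. This is a sketch under assumptions: `removal_blockers` is a hypothetical helper, and it only understands the English "X depends on Y" lines that dpkg printed above, so it should be run with LANG=C.

```shell
# Hypothetical helper: list the packages that block a removal, given the
# output of a simulated dpkg purge/remove on stdin.
removal_blockers() {
    awk '$2 == "depends" && $3 == "on" { print $1 }' | sort -u
}

# Example usage (dpkg reports these problems on stderr; not run here):
#   LANG=C dpkg --purge --simulate systemd 2>&1 | removal_blockers
```

Fed the dpkg output shown above, it prints only snapd.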

Changed in systemd (Ubuntu):
status: Incomplete → New
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in systemd (Ubuntu):
status: New → Confirmed
Revision history for this message
Dan Streetman (ddstreet) wrote :

Please reopen if this is still an issue.

Changed in systemd (Ubuntu):
status: Confirmed → Won't Fix