when starting many LXD containers, they start failing to boot with "Too many open files"

Bug #1602192 reported by Christian Muirhead
This bug affects 8 people
Affects         Status         Importance   Assigned to   Milestone
lxd (Ubuntu)    Fix Released   Undecided    Unassigned
  Xenial        Fix Released   Medium       Unassigned

Bug Description

== SRU
=== Rationale
LXD containers running systemd use a very large number of inotify watches. This means that a system will typically run out of global watches with as few as 15 Ubuntu 16.04 containers.

An easy fix for the issue is to bump the per-user inotify instance limit to 1024, making it possible to run around 100 containers before hitting the limit again.

To do so, LXD is now shipping a sysctl.d file which bumps that particular limit on systems that have LXD installed.
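
For illustration only, such a sysctl.d drop-in looks roughly like the following; the path, filename and comments here are ours, not necessarily what the package ships (the actual change is recorded in the package changelogs at the end of this report):

  # /etc/sysctl.d/10-lxd-inotify.conf -- hypothetical example
  # The kernel default of 128 inotify instances per user is exhausted
  # quickly by systemd-based containers; bump it to 1024.
  fs.inotify.max_user_instances = 1024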

=== Testcase
1) Upgrade LXD
2) Spawn about 50 Ubuntu 16.04 containers ("lxc launch ubuntu:16.04")
3) Check that they all get an IP address ("lxc list"), that's a pretty good sign that they booted properly
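
A rough shell sketch of the steps above (container names are illustrative):

  # spawn 50 Ubuntu 16.04 containers
  for i in $(seq 1 50); do lxc launch ubuntu:16.04 "sru-test-$i"; done
  # give them time to boot, then check that each one shows an IPv4 address
  sleep 120
  lxc list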

=== Regression potential
Not expecting anything here. Juju has shipped a similar configuration for a while now and so have the LXD feature releases.

We pretty much just forgot to include this particular change in our LTS packaging branch.

== Original bug report
Reported by Uros Jovanovic here: https://bugs.launchpad.net/juju-core/+bug/1593828/comments/18

"...
However, if you bootstrap LXD and do:

juju bootstrap localxd lxd --upload-tools
for i in {1..30}; do juju deploy ubuntu ubuntu$i; sleep 90; done

Somewhere between the 10th and 20th deploy it fails, with the machine stuck in pending state (nothing useful in the logs), and none of the new deploys after that first pending one succeeds. Might be a different bug, but it's easy to verify by running that for loop.

So, this particular error was not in my logs, but the controller still ends up unable to provision at least 30 machines ..."

I can reproduce this. Looking at the failed machine I can see that jujud isn't running, which is why juju considers the machine not up, and in fact nothing of juju seems to be installed. There's nothing about juju in /var/log.

Comparing cloud-init-output.log between a stuck-pending machine and one which has started up fine, they both start with some key-generation messages, but the successful machine then has the line:

Cloud-init v. 0.7.7 running 'init' at Tue, 12 Jul 2016 08:32:00 +0000. Up 4.0 seconds.

...and then a whole lot of juju-installation gubbins, while the failed machine log just stops.

Revision history for this message
Christian Muirhead (2-xtian) wrote :
description: updated
Revision history for this message
Christian Muirhead (2-xtian) wrote :

Here's a cloud-init log from a successfully started machine for comparison.

Changed in juju-core:
importance: Undecided → Critical
milestone: none → 2.0-beta12
Revision history for this message
John A Meinel (jameinel) wrote :

We need machine-0.log, with the caveat that if you're adding 30 machines you're probably saturating the log file, so you'll want a fairly early snapshot (before it gets rotated). I'm expecting there is a problem with the provisioner getting wedged, and then possibly it gets to a point where it ends up in an error state, gets restarted, and then ends up back in an error state.

Revision history for this message
Christian Muirhead (2-xtian) wrote :

Adding the controller machine-0.log

Revision history for this message
Christian Muirhead (2-xtian) wrote :

And the cloud-init.log from a failed machine.

Revision history for this message
Christian Muirhead (2-xtian) wrote :

I just noticed the duplicate key error in the controller machine-0.log (from https://bugs.launchpad.net/juju-core/+bug/1593828) - I hadn't rebuilt with the fixed version of mgo. I'm doing that now and rerunning this.

That said, I don't think it's the cause of this failure - it happened well after the system was bootstrapped.

Revision history for this message
Christian Muirhead (2-xtian) wrote : Re: [Bug 1602192] Re: deploy 30 nodes on lxd, machines never leave pending

OK - so in fact it might be the mgo bug in my case; I need to rebuild and try again.

Revision history for this message
Christian Muirhead (2-xtian) wrote : Re: deploy 30 nodes on lxd, machines never leave pending

Rebuilt with the mgo fix applied and reran; I still see the same behaviour. Good and bad cloud-init and cloud-init-output logs, plus the machine-0.log from the controller, are attached.

Changed in juju-core:
status: New → Triaged
tags: added: lxd
Revision history for this message
Christian Muirhead (2-xtian) wrote :

Not sure whether this is important, but looking at the cloud-init logs the good and bad have different log record formats. The row prefix for the good log is:
Jul 12 11:52:44 ubuntu [CLOUDINIT] util.py[DEBUG]:

And this for the bad log:
2016-07-12 11:54:29,932 - util.py[DEBUG]:

Revision history for this message
Christian Muirhead (2-xtian) wrote :

Does that mean that the cloud-inits are being run with different configs somehow?

Revision history for this message
Martin Packman (gz) wrote :

It's probably worth comparing the /var/lib/cloud tree from a working node to a failed one, particularly the cloud-config files under /var/lib/cloud/instance.

Revision history for this message
Nicholas Skaggs (nskaggs) wrote :

I found similar odd behavior at scale with LXD and juju -- I hadn't yet got around to filing a nice bug as it's unclear who is to blame.

If your machine has many LXD bootstraps, strange things will happen. LXD will simply not respond. You can start 20 containers and then attempt to do a bootstrap; you'll see lots of weirdness. I absolutely feel there is an LXD bug in here.

Changed in juju-core:
assignee: nobody → Christian Muirhead (2-xtian)
Revision history for this message
Christian Muirhead (2-xtian) wrote :

Thanks Martin, I'm looking into that now.

Changed in juju-core:
status: Triaged → In Progress
Revision history for this message
Christian Muirhead (2-xtian) wrote :

So most of the differences between the /var/lib/cloud trees are either just the expected differences between machine names and passwords or seem like symptoms rather than the cause of the problem (eg. data/status.json indicates that only the init-local stage ran).

I've made contact with Scott Moser to find out why some of the stages aren't being run - he says init won't run if the network isn't up, and he suggested asking Andy McDermott.

I'm also trying what Nicholas suggests - starting a lot of containers (without Juju in the loop) and seeing if we see the same behaviour.

Revision history for this message
Scott Moser (smoser) wrote :

I've now seen this reproduce with the attached go-lxc-run.sh.

I'm doing some debugging.

Revision history for this message
Christian Muirhead (2-xtian) wrote :

Scratch that - Scott's already done that - he can see the same behaviour without Juju involved.

Revision history for this message
Scott Moser (smoser) wrote :

Something that doesn't look right in the failed case is that '-.mount' fails (per 'systemctl --state=failed').

cloud-init Wants local-fs.target, and cloud-init did not run.
but
cloud-init-local also Wants local-fs.target and it *did* run.
The difference there might be because cloud-init-local has 'DefaultDependencies=no'.

$ systemctl --state=failed
  UNIT LOAD ACTIVE SUB DESCRIPTION
● -.mount loaded failed failed /
● iscsid.service loaded failed failed iSCSI initiator daemon (iscs
● open-iscsi.service loaded failed failed Login to default iSCSI targe
● systemd-remount-fs.service loaded failed failed Remount Root and Kernel File
● systemd-sysctl.service loaded failed failed Apply Kernel Variables
● systemd-journald-audit.socket loaded failed failed Journal Audit Socket
● systemd-udevd-kernel.socket loaded failed failed udev Kernel Socket

LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.

7 loaded units listed. Pass --all to see loaded but inactive units, too.

$ systemctl status --no-pager --full -- -.mount
● -.mount - /
   Loaded: loaded (/etc/fstab; bad; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2016-07-13 15:49:11 UTC; 21min ago
    Where: /
     What: /dev/disk/by-label/cloudimg-rootfs
     Docs: man:fstab(5)
           man:systemd-fstab-generator(8)
  Process: 85 ExecMount=/bin/mount /dev/disk/by-label/cloudimg-rootfs / -t ext4 (code=exited, status=32)

$ systemd-analyze critical-chain cloud-init.service
The time after the unit is active or started is printed after the "@" character.
The time the unit takes to start is printed after the "+" character.

└─networking.service @2.542s +2.771s
  └─network-pre.target @2.540s
    └─cloud-init-local.service @153ms +2.386s
      └─system.slice @119ms
        └─-.slice @111ms
$ systemctl status cloud-init.service
● cloud-init.service - Initial cloud-init job (metadata service crawler)
   Loaded: loaded (/lib/systemd/system/cloud-init.service; enabled; vendor prese
   Active: inactive (dead)

Revision history for this message
Scott Moser (smoser) wrote :

One interesting difference in status of the -.mount job is the 'What'.
in a good container, we have:
  What: lxd-pool/containers/x1
In a failed container we have:
  What: /dev/disk/by-label/cloudimg-rootfs

### Good container ###
$ systemctl status --no-pager --full -- -.mount
● -.mount - /
   Loaded: loaded (/etc/fstab; bad; vendor preset: enabled)
   Active: active (mounted) since Wed 2016-07-13 15:53:35 UTC; 44min ago
    Where: /
     What: lxd-pool/containers/x1
     Docs: man:fstab(5)
           man:systemd-fstab-generator(8)
    Tasks: 0
   Memory: 0B
      CPU: 0

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

### Bad container ###
$ systemctl status --no-pager --full -- -.mount
● -.mount - /
   Loaded: loaded (/etc/fstab; bad; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2016-07-13 15:49:11 UTC; 48min ago
    Where: /
     What: /dev/disk/by-label/cloudimg-rootfs
     Docs: man:fstab(5)
           man:systemd-fstab-generator(8)
  Process: 85 ExecMount=/bin/mount /dev/disk/by-label/cloudimg-rootfs / -t ext4 (code=exited, status=32)

Revision history for this message
Scott Moser (smoser) wrote :

Steve asked for the contents of fstab. In both cases it has:
  LABEL=cloudimg-rootfs / ext4 defaults 0 0

07/13/16 15:37:56 <slangasek> smoser: ok. what are the contents there?
07/13/16 15:38:26 <smoser> # cat /etc/fstab
07/13/16 15:38:26 <smoser> LABEL=cloudimg-rootfs / ext4 defaults 0 0
07/13/16 15:38:42 <slangasek> alright
07/13/16 15:38:52 <slangasek> so has udev/systemd failed to resolve that label to a device?
07/13/16 15:39:03 <slangasek> (is /dev/disk/by-label/cloudimg-rootfs present?)
07/13/16 15:39:22 <smoser> never in a container.
07/13/16 15:39:24 <smoser> no udev
07/13/16 15:40:06 <smoser> so that is strictly wrong. but in most cases (probably > 90% of the time) something addresses that.

Revision history for this message
Martin Pitt (pitti) wrote :

Some observations/notes:

 * I can reproduce the bug using Scott's script on my laptop (x-013 fails). No ZFS, just one huge btrfs partition for everything, so it seems unrelated to the host file system.

 * The test script removes "working" instances; after that, rebooting the failed container in a loop always works. Also, given that fstab, the generated .mount etc. are all identical, I don't believe that it's an actual difference in the container root fs. It is more likely related to some race condition, where things get slower as the number of containers running in parallel grows.

 * The root fs is already supposed to be mounted for containers -- after all, it is just a bind mount. So -.mount should immediately jump from inactive to active at container boot; in the case where it actually tries to "mount", it's doomed to fail as /dev/disk/ does not really work in a container. So the bug is that in this case systemd thinks the root fs is not mounted yet.

 * systemd-remount-fs.service fails in all cases ("mount: can't find LABEL=cloudimg-rootfs", plus "mount: permission denied"). This is a wart, but unrelated to this bug.

For the failure case there is no journal, as the boot fails too early for that. Being able to set systemd.log_target= and systemd.log_level= kernel/pid1 arguments for the container would be very useful, not sure if that actually works with lxd.

Next steps: I'll try to reproduce this with plain LXC (more easily observable/accessible/debuggable) and ask for how to pass additional arguments to pid 1 with lxd.

Revision history for this message
Martin Pitt (pitti) wrote :

When using LXC directly instead of LXD (using "sudo ./go-lxc-run.sh adt-xenial"), it *also* fails in the 13th container. There -.mount doesn't fail, but does not even exist as there is no /etc/fstab in that autopkgtest container template. Instead, I get these errors in the journal:

  x-013 systemd[1]: inotify_init1() failed: Too many open files

i. e. it is running out of file descriptors.

To be closer to what smoser's script does, I created a cloud-based template with

  sudo lxc-create -n xtemplate -t ubuntu-cloud -- -r xenial

and ran with that "xtemplate" container. This already fails in the 9th container, with similar "too many open files" errors. Interestingly in LXC there is no /run/systemd/generator/-.mount at all.

[36784.839804] systemd-journald[999]: Failed to rotate /var/log/journal/5aa2ce6f17f156f9099864b6546dc1b0/user-1000.journal: No space left on device

Conversely, I ran the lxd reproducer on a container image without /etc/fstab (images:, not ubuntu:). Curiously, I'm now running into something entirely different -- the 14th container repeatedly fails to get created due to some ENOSPC, although my partition still has 108 GB of free space. I'm not sure if this might also just be an artifact of some NOFILES rlimit.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Marking as invalid for juju-core as it is reproducible outside of juju

Changed in juju-core:
status: In Progress → Invalid
milestone: 2.0-beta12 → none
Revision history for this message
Martin Pitt (pitti) wrote :

This is how this can be debugged with lxd: Change the reproducer script to not delete the "working" containers, but leave them running. Then booting the failed x-013 (or whichever it is) will fail reliably. "lxc stop -f x-013" it, then "lxc config edit x-013" and add the following keys:

config:
  environment.SYSTEMD_LOG_LEVEL: debug

devices:
  kmsg:
    major: "1"
    minor: "11"
    path: /dev/kmsg
    type: unix-char

This will enable verbose logging and provide a /dev/kmsg in the container so that the container can actually log stuff to dmesg (it fails too early for the journal to start). Then "dmesg -w" provides a nice log tail. There we see:

[12235.886344] systemd-tmpfiles[54]: [/usr/lib/tmpfiles.d/var.conf:14] Duplicate line for path "/var/log", ignoring.
[12235.891623] systemd-remount-fs[55]: /bin/mount for / exited with exit status 1.
[12235.909095] systemd-udevd[51]: inotify_init failed: Too many open files
[12235.909101] systemd-udevd[51]: error initializing inotify
[12235.909151] systemd-udevd[51]: failed to allocate manager object: Cannot allocate memory

(This repeats a few times.) Stracing lxd's children also confirms that there are a lot of EMFILE errors.

However, I stopped x-013 and rebooted x-012, which works again. The EMFILE *also* happens there, so this is unrelated and just a red herring. The important difference is that udev works there, so systemd sees that the root device is already mounted/does not exist. Without udev it just tries to mount it as it has no further information about it.

As a workaround, remove or comment out the root fs line in /etc/fstab -- it's not needed at all in a container. Then the container boots, which makes it more convenient to debug udev in it.
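
A minimal sketch of that workaround, assuming the stock cloud-image fstab quoted earlier (LABEL=cloudimg-rootfs / ext4 defaults 0 0):

  # comment out the root fs entry inside the container (e.g. via "lxc exec")
  sed -i '/^LABEL=cloudimg-rootfs/ s/^/#/' /etc/fstab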

Revision history for this message
Martin Pitt (pitti) wrote :

So while pid1 survives the EMFILES, udevd does not. Working container:

1058 writev(2, [{"rules contain 49152 bytes tokens (4096 * 12 bytes), 13886 bytes strings", 71}, {"\n", 1}], 2) = 72
[...]
1058 inotify_init1(O_CLOEXEC) = 8

Failing container:

writev(2, [{"rules contain 49152 bytes tokens (4096 * 12 bytes), 13886 bytes strings", 71}, {"\n", 1}], 2rules contain 49152 bytes tokens (4096 * 12 bytes), 13886 bytes strings
) = 72
[...]
inotify_init1(O_CLOEXEC) = -1 EMFILE (Too many open files)
writev(2, [{"inotify_init failed: Too many open files", 40}, {"\n", 1}], 2inotify_init failed: Too many open files
) = 41
writev(2, [{"error initializing inotify", 26}, {"\n", 1}], 2error initializing inotify
) = 27
[...]
writev(2, [{"failed to allocate manager object: Cannot allocate memory", 57}, {"\n", 1}],

src/udev/udevd.c, manager_new():

        manager->fd_inotify = udev_watch_init(manager->udev);
        if (manager->fd_inotify < 0)
                return log_error_errno(ENOMEM, "error initializing inotify");

So the "cannot allocate memory" is the wrong error code, but it does fail because of running out of file descriptors.

summary: - deploy 30 nodes on lxd, machines never leave pending
+ when starting many LXD containers, they start failing to boot with "Too
+ many open files"
Changed in lxd (Ubuntu):
status: New → Confirmed
Revision history for this message
Martin Pitt (pitti) wrote :

Apparently the relevant limit is /proc/sys/fs/inotify/max_user_instances. This is "128" by default. When increasing it with

   sudo sysctl fs.inotify.max_user_instances=256

then the failed container reboots fine (and udev starts).

Revision history for this message
Martin Pitt (pitti) wrote :

I don't know that much about /proc/sys/fs/inotify/max_user_instances, but apparently this isn't properly namespaced, so it is a global limit for all containers on the host? As a workaround, lxd could ship a /usr/lib/sysctl.d/ snippet to bump it, but I'm not sure if that has other downsides (particularly as we install lxd by default).

Revision history for this message
Seth Forshee (sforshee) wrote :

It does look like max_user_instances is global and not namespaced. So unless there's a resource leak somewhere that needs to be fixed, bumping that limit may be the only option.

Revision history for this message
Nicholas Skaggs (nskaggs) wrote :

It sounds like the "fix" for this lies outside of LXD. Can we add the affected packages and involve the right upstreams for this? Are we going to pursue this purely upstream with systemd / kernel or are we going to attempt a distro-patch of some sort?

Revision history for this message
Stéphane Graber (stgraber) wrote :

So there is a patchset that's been sent for review a couple of times upstream which would tie a bunch of the inotify limits to a user namespace. This would almost certainly fix this issue.

Until then, we can work around it by bumping the default values for those knobs.
This bump could be done in the Ubuntu kernel directly, or through a sysctl file, either shipped by default with our other sysctl settings or as a hook in the lxd or juju packages (lxd would probably make more sense).

Revision history for this message
james beedy (jamesbeedy) wrote :

@stgraber, what's the status on this?

Changed in juju-core:
assignee: Christian Muirhead (2-xtian) → nobody
Revision history for this message
Richard Harding (rharding) wrote :

Per email thread, Juju needs to ship with expanded details:

> Stephane, I share your concerns around selecting the right knobs and
> testing accordingly but my main concern now is that I feel many people will
> hit this limitation when they deploy bigger charms in containers. Is there
> any way to expedite the work-around testing?

Well, Juju itself could also be bumping those limits, that way only
hosts that do actually run a bunch of big charms will have their limits
bumped rather than everyone who has LXD installed.

You could just ship a /etc/sysctl.d/10-juju.conf file with:
fs.inotify.max_queued_events = 131072
fs.inotify.max_user_instances = 1024
fs.inotify.max_user_watches = 4194304

affects: juju-core → juju
Changed in juju:
status: Invalid → Triaged
assignee: nobody → Richard Harding (rharding)
milestone: none → 2.0.0
Changed in juju:
milestone: 2.0.0 → 2.1.0
Revision history for this message
Michael Foord (mfoord) wrote :

We can include this in cloud-init for machines that will host containers. However, for the lxd provider this would need to be set on the machine running juju (the host machine). Should this be done at juju install time?

Revision history for this message
John A Meinel (jameinel) wrote :

Michael and I played around with some different settings, and here are my notes.

1) Package kde-runtime seems to install
  /etc/sysctl.d/30-baloo-inotify-limits.conf which sets max_user_watches to 512*1024

  'slabtop' says that my baseline kernel memory is 380-420MB with no containers
  running.

2) With
  fs.inotify.max_queued_events = 16384
  fs.inotify.max_user_instances = 128
  fs.inotify.max_user_watches = 524288

go-lxc-run.sh fails to start container number 13, with "x-013 failed to boot"

  Kernel memory has grown to 1076MB up from 400MB with
  141816K dentry
  564320K btrfs_inode
  As the largest items.

3) However if I set
  fs.inotify.max_user_watches = 8192

It fails again at exactly 13 containers.
So while in a real-world scenario, max_user_watches may come into play, a
"standard" desktop value of 512K is plenty (at least to have machines
provision). (I do believe I've seen max_user_watches come into play while
using Juju in the past, it just isn't the specific problem in the
go-lxc-run.sh script which takes Juju out of the picture.)

4) If I then play around with 'max_user_instances' with:
  fs.inotify.max_queued_events = 16384
  fs.inotify.max_user_instances = 256
  fs.inotify.max_user_watches = 524288

I then fail on the 19th container, with error: Failed to change ownership of:
<random filename here>

Kernel memory is up to: 1363MB
Top entries are:
 178872K dentry
 723104K btrfs_inode

And at this point, my machine is behaving poorly. Things like "lxd delete --force" actually fail to clean up instances, because of "error: unable to open database file". And Term crashed at least one time.

I even tried at one point to set: fs.inotify.max_user_instances=2048
But it still failed at 19 (that was when Term crashed).

But I'm pretty confident that it means we're exhausting some other limit,
rather than inotify max_user_instances or max_user_watches, since changing
either of them doesn't actually increase the number of containers.

I'm going to run some more tests with Juju in the loop.

Revision history for this message
John A Meinel (jameinel) wrote :

5) Another data point, with
$ sysctl fs.inotify
fs.inotify.max_queued_events = 131072
fs.inotify.max_user_instances = 1024
fs.inotify.max_user_watches = 524288

(so max_queued_events 8x greater, and max_user_instances well above the previously established useful level), I still only get 19 containers, and it still fails with a "Failed to change ownership of".

So if we want to get to 50 containers, it seems we need more than just setting the inotify limits.
I still plan on doing similar tests with Juju in the mix, as it is more likely to actually need user_watches than just launching a bare Xenial container.

Revision history for this message
John A Meinel (jameinel) wrote :

With Juju in the loop, I run into whatever the limit is a bit faster. I was successful at doing:

 juju bootstrap test-lxd lxd
 juju deploy ubuntu
 juju add-unit -n 5 ubuntu # wait for status to say everything is running
 juju add-unit -n 5 ubuntu # wait for status to be happy

but then after doing one more
 juju add-unit -n 5 ubuntu

Firefox crashed shortly thereafter, and so did the Juju controller; 'juju status' stopped responding. 'lxc list' shows that containers 0-10, 12 and 13 are running, but 11, 14 and 15 are stopped.

lsof | wc -l
reports 84854 file handles open, but that's still lower than
cat /proc/sys/fs/file-max
1635148

Kernel memory is currently at 1409M

I get some random errors like "unable to acquire lock, no space left on device" even though /dev/sda1 has 10GB free.

This is all under a Trusty HWE kernel (4.4.0-38), though the images we are running are Xenial ones.

It appears that Mongo was killed when things started to go south, and systemd decided that it would stop trying to restart 'juju-db' with the error message "Start request repeated too quickly".
It didn't try again for at least 10 minutes before I went in and started it manually.

Revision history for this message
Michael Foord (mfoord) wrote :

Note from jam after some experimentation:

So having played with it for a bit, I'm more comfortable with an /etc/sysctl.d/10-juju.conf that sets max_user_watches=512k and max_user_instances=256, but if we want to get to 50 instances we need to dig harder.

I can just barely get to 10 instances of 'ubuntu' from juju, and only 19 raw containers with any of the inotify settings, and processes start dying at that point.

Revision history for this message
Michael Foord (mfoord) wrote :

Some settings to tweak from tych0 on IRC:
https://github.com/lxc/lxd/blob/master/doc/production-setup.md

Revision history for this message
Richard Harding (rharding) wrote :

Chatting with tych0 we were linked to:
https://github.com/lxc/lxd/blob/master/doc/production-setup.md

That document notes a bunch of tweaks for running the system at scale. We need to evaluate which items make sense to enable out of the box, and which are solid for a production lxd system but not a great idea for all users to just have.

Revision history for this message
John A Meinel (jameinel) wrote :

Note that there should be support for /etc/security/limits.d/10-juju.conf

I'm testing it now, but it may be that we can drop something in there as well. I'll test it a bit, but if we have some tasteful defaults, maybe we can make it work.

I think we can change from their default so instead of "* nofile 1M" we do something like "@lxd nofile 1M" so only users that are also in the LXD group will get the expanded limits.

Again, these feel a bit more like LXD issues than Juju ones, but maybe we can get something done in the immediate term.
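
A sketch of what such a drop-in might look like, following the "@lxd nofile" idea above (path and values are illustrative, not a tested recommendation):

  # /etc/security/limits.d/10-juju.conf -- hypothetical
  # Raise the open-file limit only for members of the lxd group,
  # instead of the blanket "* nofile" bump from production-setup.md.
  @lxd soft nofile 1048576
  @lxd hard nofile 1048576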

Revision history for this message
Michael Foord (mfoord) wrote :

Created bug #1631038 for the addition of an /etc/sysctl.d/10-juju.conf file upping max_user_watches and max_user_instances which will increase the number of containers we can run out of the box, but not by enough.

Revision history for this message
John A Meinel (jameinel) wrote :

I did try setting all of the items that are mentioned in production-setup.md. To start with, a few of them are not reasonable. max_user_instances defaults to 128, and we were able to see a difference at 256, but not at 1024. Setting it to 1M seems silly.

I'll also note that my Kernel memory consumption went up significantly with those settings. When I hit 12 containers I was already up over 2GB of kernel memory (whereas before I peaked around 1.4GB of kernel memory when I hit 19 containers).

It seems to be a case of a huge number more "kmalloc-64" entries. I'm not sure where those are coming from, but there are enough "OBJS" that it overflows the standard column widths in the slabtop output.

With all of those set, I did get more containers after rebooting my machine. (Just logging out and back in again, I actually went down to 18 containers max).

At 22 containers I hit 3.8GB of Kernel memory. I'm letting it continue to run to see where it gets to.

I did also make sure to change the LXD backend pool to ZFS instead of being just the normal disk. (using zfs-dkms for Trusty kernel.)

Given that I was using a btrfs filesystem, and now LXD is using ZFS, that might also be a factor in how many containers I could run. Certainly in the initial reports "btrfs_inode_*" was near the top of the KMem output, and now it's all kmalloc and dentry. Maybe that's a side-effect of dkms?

I did end up hitting 30 containers at 4.6GB of Kernel memory before the go-lxc-run.sh wanted to start clearing up old containers. So I'll patch that out and see how far I get.

Revision history for this message
John A Meinel (jameinel) wrote :

Interestingly, my baseline kernel memory with no containers (and not much other software) was about the same (~400MB). I'm not entirely sure why it grew faster with the new settings but didn't affect the baseline.

Revision history for this message
John A Meinel (jameinel) wrote :

After rebooting with the new values, I did manage to launch a lot more containers, failing only at the 232nd one.

I wonder if the issue is not the User number of open files, but the Root number of open files, which requires a reboot to get updated.

I'll try to play around more to really nail down what things actually need to be changed, but 200+ containers is a huge difference from 19. (Also need to run this test with Juju in the mix.)

Revision history for this message
John A Meinel (jameinel) wrote :

I added a "cat /proc/meminfo | grep Slab" to go-lxc-run.sh and found this:

$ sysctl fs.inotify
fs.inotify.max_queued_events = 65536
fs.inotify.max_user_instances = 1024
fs.inotify.max_user_watches = 524288
$ ulimit -a
...
open files (-n) 1048576
...
$ go-lxc-run.sh
[0] x-001:[13]......7 Slab: 432176 kB
...
[1365] x-071:[15].......8 Slab: 9639484 kB
[1387] x-072:[15].......8 Slab: 9603648 kB
[1409] x-073:[14].......8 Slab: 9621936 kB
[1431] x-074:[14].......8 Slab: 9673000 kB
[1453] x-075:[13]........9 Slab: 9680444 kB
[1475] x-076:[14]........9 Slab: 8223312 kB
[1501] x-077:[16]......7 Slab: 5095880 kB
...
[1812] x-093:[12]......7 Slab: 6163512 kB
[1831] x-094:[12]......7 Slab: 6271272 kB
[1850] x-095:[13]..................false Slab: 6371956 kB
x-095 failed to boot. keeping x-095.

So kernel memory seemed to peak at around 10GB, and then somehow dropped down to 5GB to allow 20 more containers to be created. (on a 16GB machine, that's a fair bit allotted to just the kernel).

However, 95 seems to be the limit for fs.inotify.max_user_instances=1024. But that still means setting it to 1M is silly. I'll do another run with max_user_instances=2048 and see what happens. Since I know I can get to 230 containers with very high max values, 2048 sounds like it should be sufficient.

I'll also play around next with putting Juju into the loop and see where I get. I really wonder about max open files for root vs. user: LXD is running as root, but if the containers are all running at user level, shouldn't that be the constraint?

Revision history for this message
John A Meinel (jameinel) wrote :

with
 fs.inotify.max_user_instances = 2048

it didn't fail until the 187th container.

It did fail with a new message that I haven't seen:
[3831] x-187:error: Error calling 'lxd forkstart x-187 /var/lib/lxd/containers /var/log/lxd/x-187/lxc.conf': err='exit status 1'
  lxc 20161007100331.798 ERROR lxc_conf - conf.c:run_buffer:347 - Script exited with status 1
  lxc 20161007100331.798 ERROR lxc_start - start.c:lxc_init:465 - failed to run pre-start hooks for container 'x-187'.
  lxc 20161007100331.798 ERROR lxc_start - start.c:__lxc_start:1313 - failed to initialize the container

Try `lxc info --show-log local:x-187` for more info

I wasn't able to actually do '--show-log', because the cleanup code had already torn down that instance (go-lxc-run.sh only leaves a container around if it launches but fails to come up; if the launch itself fails, it is still in the CLEAN list).

Michael Foord (mfoord)
Changed in juju:
status: Triaged → In Progress
Revision history for this message
Stéphane Graber (stgraber) wrote :

FYI a new version of the inotify patchset was sent today for review:
https://lists.linuxfoundation.org/pipermail/containers/2016-October/037480.html

This approach is the real fix for the inotify part of this problem.

For open files, we've had a few reports of systemd occasionally misbehaving and dropping our bumped ulimits on the floor; this may be what you've been running into...

With ulimits bumped and the inotify issue resolved in the kernel, the next likely limit we'd hit is pts_max, but assuming normal use of 2-3 devices per container, it'd take quite a few of them to reach the default limit of 1024.

Anyway, those kinds of kernel limits are something we're quite aware of. I'm actually flying back from LinuxCon, where I gave a talk covering a bunch of these problems, and we'll be pushing at Linux Plumbers in a couple of weeks to try to get proper solutions into the upstream kernel.

Revision history for this message
Michael Foord (mfoord) wrote :

As a note, increasing fs.file-max had no effect for me.

Using the following settings I was able to deploy about 23 containers with juju, which is higher than I could previously:

    fs.inotify.max_user_watches = 524288
    fs.inotify.max_user_instances = 256

These are the proposed settings to come with a juju install in bug #1631038.

Revision history for this message
Michael Foord (mfoord) wrote :

I have filed bug #1631914 against lxd for better surfacing of the "too many files open" error.

Revision history for this message
Michael Foord (mfoord) wrote :

$ juju bootstrap foo lxd
Creating Juju controller "foo" on lxd/localhost
Looking for packaged Juju agent version 2.0.0 for amd64
No packaged binary found, preparing local Juju agent binary
To configure your system to better support LXD containers, please see: https://github.com/lxc/lxd/blob/master/doc/production-setup.md
Launching controller instance(s) on lxd/localhost...
 - juju-6e02cd-0
Fetching Juju GUI 2.2.0

Revision history for this message
James Tunnicliffe (dooferlad) wrote :

Would it be reasonable to apply the settings that https://github.com/lxc/lxd/blob/master/doc/production-setup.md suggests when we add the first LXD to a machine?

What about interrogating the kernel about available resources and refusing to add a container when it won't work? At the moment I can try adding 20 containers to a MAAS node, get no useful message, and they do start to fail.

Since Juju is running as root, we could just do a "sysctl key=value" to set those live, and apparently changes to /etc/security/limits.conf apply to new processes without a reboot. Preserving the sysctl over reboots isn't difficult either.
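
A sketch of that approach, assuming root and reusing one of the values discussed above (the drop-in filename is illustrative):

  # apply immediately to the running kernel
  sysctl -w fs.inotify.max_user_instances=1024
  # persist across reboots
  echo 'fs.inotify.max_user_instances = 1024' > /etc/sysctl.d/10-juju.conf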

Revision history for this message
Stéphane Graber (stgraber) wrote :

Not really. The problem is that bumping those sysctls to the values in that document will work on most normal systems but will fail dramatically on very low-memory systems like some cloud instances and ARM systems.

As for having LXD know when things will fail, it simply has no idea. All the errors happen in processes inside the container, which LXD has no visibility into.

You can start hundreds (possibly thousands) of Alpine containers before any such problem occurs, but spawning 20 containers that use systemd will trip it. Similarly, you can probably start about 10x as many Ubuntu 14.04 containers before you run into the same problem as with Ubuntu 16.04 containers.

In most cases, what's receiving the allocation error back from the kernel is systemd in the container. And rather than moving on and just doing some polling when inotify doesn't work, it just plain hangs there without providing any kind of useful feedback to the user (since logging isn't even started at that point).

As I said before, there is kernel work being done now to fix the inotify part of this problem in a clean, sane way which will work for everyone. Until then, it's reasonable for Juju to bump those limits as they know exactly what kind of instance they're running on.

For the ulimits (limits.conf), we'll have to look into what's going on here because the LXD systemd unit does have those bumped, but they somehow seem to be ignored by systemd or reset at some point in time, causing some of this breakage. I suspect it'll boil down to being a systemd bug, but we need to take a close look at this.

Revision history for this message
Richard Harding (rharding) wrote :

Tested with Juju 2.1 beta3 and verified I can get over 20 containers; while it's slow, it does not get stuck in pending. The changes Michael has made to the profiles on the server appear to be working.

Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Nobuto Murata (nobuto) wrote :

Can we see the commit related to the bug? Also, is there a plan to backport it to the 2.0 series, as I'm not sure when we'll get 2.1 GA?
https://github.com/juju/juju/wiki/Juju-Release-Schedule

Revision history for this message
Billy Olsen (billy-olsen) wrote :

This bug is fixed for the local LXD provider scenario for 2.0 on Xenial with this commit - http://bazaar.launchpad.net/~juju-qa/ubuntu/xenial/juju/2.0.0/revision/214 - in which the juju-2.0.conf file has the settings identified and dropped into /usr/lib/sysctl.d. Unfortunately, this only addresses the case of the local LXD provider and not any other LXD usage.

For example, if you do a high-density deployment using MAAS + LXD, this code only applies to the juju-2.0 packages which are installed on the machine where the juju client is run. This means that none of the other possible uses of LXD containers within Juju receive the same benefit from this tweak.

So to me it feels like there's a fix committed and even released; however, it only partially solves the problem of tuning the system for a higher density of LXD containers.

Revision history for this message
Christian Muirhead (2-xtian) wrote :

billy-olsen: You're right - because we don't install juju from the package on the deployed machines, the config to enable more lxd containers doesn't get applied there. We should apply it in that case.

Changed in juju:
status: Fix Committed → Confirmed
Changed in juju:
status: Confirmed → Triaged
Changed in juju:
milestone: 2.1.0 → 2.1-rc1
assignee: Richard Harding (rharding) → nobody
Revision history for this message
Anastasia (anastasia-macmood) wrote :

From internal discussion: for Juju, in clouds there is no guarantee that the images support particular kernel parameters. For us, it is best not to try to guess, and instead to better surface the error in the status.

Changed in juju:
importance: Critical → High
milestone: 2.1-rc1 → 2.2.0
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.2-beta1 → 2.2-beta2
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.2-beta2 → 2.2-beta3
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lxd - 2.12-0ubuntu2

---------------
lxd (2.12-0ubuntu2) zesty; urgency=medium

  * Increase default inotify limits to 1024 instances per uid.
    (LP: #1602192)

 -- Stéphane Graber <email address hidden> Mon, 03 Apr 2017 18:51:34 -0400

Changed in lxd (Ubuntu):
status: Confirmed → Fix Released
Changed in juju:
milestone: 2.2-beta3 → 2.2-beta4
Changed in juju:
milestone: 2.2-beta4 → 2.2-rc1
Tim Penhey (thumper)
no longer affects: juju
description: updated
Changed in lxd (Ubuntu Xenial):
status: New → Triaged
status: Triaged → In Progress
importance: Undecided → Medium
Revision history for this message
Chris Halse Rogers (raof) wrote : Please test proposed package

Hello Christian, or anyone else affected,

Accepted lxd into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/lxd/2.0.10-0ubuntu1~16.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in lxd (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-xenial
Revision history for this message
Stéphane Graber (stgraber) wrote :

I confirmed that the sysctl.d file is now present and functional.
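
For reference, a quick check of this kind can be done roughly as follows, assuming the limit being bumped is the per-user inotify instance limit discussed throughout this bug:

  # after upgrading lxd from xenial-proposed, the bumped limit should show up:
  sysctl fs.inotify.max_user_instances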

tags: added: verification-done verification-done-xenial
removed: verification-needed verification-needed-xenial
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lxd - 2.0.10-0ubuntu1~16.04.2

---------------
lxd (2.0.10-0ubuntu1~16.04.2) xenial; urgency=medium

  * Fix regression in image update logic (LP: #1712455):
    - 0005-Fix-regression-in-image-auto-update-logic.patch
    - 0006-lxd-images-Carry-old-cached-value-on-refresh.patch
    - 0007-Attempt-to-restore-the-auto_update-property.patch

  * Ship a sysctl.d file that bumps inotify watches count. (LP: #1602192)
  * Update debian/watch to look only at LTS releases.

 -- Stéphane Graber <email address hidden> Tue, 22 Aug 2017 20:39:36 -0400

Changed in lxd (Ubuntu Xenial):
status: Fix Committed → Fix Released
Revision history for this message
Stéphane Graber (stgraber) wrote : Update Released

The verification of the Stable Release Update for lxd has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
