Error executing lxc-clone: lxc_container: utils.c: mkdir_p 220 Not a directory - Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template

Bug #1410876 reported by Larry Michel
This bug affects 5 people
Affects Status Importance Assigned to Milestone
juju-core
Invalid
High
Unassigned
lxc
Fix Released
Undecided
Unassigned
lxc (Ubuntu)
Fix Released
High
Unassigned
Trusty
Fix Released
High
Unassigned

Bug Description

This is for:

++ export OPENSTACK_RELEASE=juno
++ OPENSTACK_RELEASE=juno
++ export COMPUTE=nova-lxc
++ COMPUTE=nova-lxc
++ export UBUNTU_RELEASE=trusty
++ UBUNTU_RELEASE=trusty

Tool version is 1.20.14:
Launching instance
WARNING picked arbitrary tools &{1.20.14-precise-amd64 https://streams.canonical.com/juju/tools/releases/juju-1.20.14-precise-amd64.tgz 932640702b5d9f08a312118a4afe330444308f92b7de350de9e547719f084bd9 8130412}
 - /MAAS/api/1.0/nodes/node-5f9c14e6-ae98-11e3-b194-00163efc5068/

and agent-version on node says 1.20.13.

Juju-status.yaml:

'5':
    agent-state: started
    agent-version: 1.20.13
    containers:
      5/lxc/0:
        agent-state-info: 'error executing "lxc-clone": lxc_container: utils.c: mkdir_p:
          220 Not a directory - failed to create directory ''/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'';
          lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy
          snapshot %s - failed to allocate a pty; Insufficent privileges to control
          juju-trusty-lxc-template'
        instance-id: pending
        series: trusty
      5/lxc/1:
        agent-state-info: cannot clone a running container
        instance-id: pending
        series: trusty
    dns-name: apsaras.oil
    hardware: arch=amd64 cpu-cores=4 mem=32768M tags=hw-ok,oil-slave-1,hardware-dell-poweredge-R210
    instance-id: /MAAS/api/1.0/nodes/node-5f9c14e6-ae98-11e3-b194-00163efc5068/
    series: trusty

and from log:

machine-0: 2015-01-14 15:32:12 DEBUG juju.state.apiserver apiserver.go:150 <- [7C] unit-swift-storage-1 {"RequestId":42,"Type":"Uniter","Request":"CharmURL","Params":{"Entities":[{"Tag":"service-swift-storage"}]}}
machine-0: 2015-01-14 15:32:12 DEBUG juju.state.apiserver apiserver.go:157 -> [7C] unit-swift-storage-1 383.812us {"RequestId":42,"Response":{"Results":[{"Error":null,"Result":"local:trusty/swift-storage-90","Ok":false}]}} Uniter[""].CharmURL
machine-0: 2015-01-14 15:32:12 DEBUG juju.state.apiserver apiserver.go:150 <- [7C] unit-swift-storage-1 {"RequestId":43,"Type":"Uniter","Request":"CharmArchiveSha256","Params":{"URLs":[{"URL":"local:trusty/swift-storage-90"}]}}
machine-0: 2015-01-14 15:32:12 DEBUG juju.state.apiserver apiserver.go:157 -> [7C] unit-swift-storage-1 777.029us {"RequestId":43,"Response":{"Results":[{"Error":null,"Result":"b8aac93898c3cc661c0804613e5cce79a402fbe596400e8e80e8a3a1bb16c577"}]}} Uniter[""].CharmArchiveSha256
machine-0: 2015-01-14 15:32:12 DEBUG juju.state.apiserver apiserver.go:150 <- [61] machine-2 {"RequestId":57,"Type":"Provisioner","Request":"SetStatus","Params":{"Entities":[{"Tag":"machine-2-lxc-0","Status":"error","Info":"error executing \"lxc-clone\": lxc_container: utils.c: mkdir_p: 220 Not a directory - failed to create directory '/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'; lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template","Data":null}]}}
machine-0: 2015-01-14 15:32:12 DEBUG juju.state.apiserver apiserver.go:150 <- [7C] unit-swift-storage-1 {"RequestId":44,"Type":"Uniter","Request":"SetCharmURL","Params":{"Entities":[{"Tag":"unit-swift-storage-1","CharmURL":"local:trusty/swift-storage-90"}]}}

Larry Michel (lmic)
description: updated
tags: added: oil
Curtis Hovey (sinzui)
tags: added: lxc trusty
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.23
Revision history for this message
Larry Michel (lmic) wrote :

Logs from one failure

Revision history for this message
Larry Michel (lmic) wrote :

Openstack bundle file

Revision history for this message
Horacio Durán (hduran-8) wrote :

I took a look and asked the OIL team to try to reproduce this on newer versions and with different configurations (lxc-use-clone: false) to try to determine where exactly the problem lies.

The relevant line on the logs is:

juju_debug_log.txt:machine-5: 2015-01-15 11:41:05 ERROR juju.container.lxc lxc.go:208 lxc container cloning failed: error executing "lxc-clone": lxc_container: utils.c: mkdir_p: 220 Not a directory - failed to create directory '/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'; lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template

and

2015-01-15 11:41:05 ERROR juju.provisioner provisioner_task.go:418 cannot start instance for machine "5/lxc/0": error executing "lxc-clone": lxc_container: utils.c: mkdir_p: 220 Not a directory - failed to create directory '/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'; lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template

After checking the code I found that the clone operation's error output is relayed directly from golxc.Clone, so I cannot determine exactly what is going on there.

Just in case the error was in a path, I checked http://bazaar.launchpad.net/~juju/golxc/trunk/revision/8, which adds containerDir to the cloning, but reached no useful conclusion.

Revision history for this message
Tim Penhey (thumper) wrote :

The problem is a little hard to diagnose with the current output.

1) are you able to rerun the tests with the following config?
    logging-config=juju.container.lxc=TRACE
  - you can either do this at bootstrap time like this:
     juju bootstrap --logging-config='juju.container.lxc=TRACE;juju=DEBUG'
 - or on a running environment like this:
    juju set-env 'logging-config=juju.container.lxc=TRACE;juju=DEBUG'

2) Has this ever passed before? If so, what other config settings have changed?

3) Can we also get the values from /etc/lxc ?

Changed in juju-core:
status: Triaged → Incomplete
Revision history for this message
Curtis Hovey (sinzui) wrote :

Does this issue really affect 1.21-beta5? This issue doesn't show any evidence of regression. I don't think this should block the release of 1.21-rc1.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.23 → none
Revision history for this message
Larry Michel (lmic) wrote :

I am looking into the frequency of this bug to see when it was first and last recreated. I am also attempting to recreate it with the suggested debug settings. Attaching the deployment file that was used for this particular failure.

Curtis Hovey (sinzui)
no longer affects: juju-core/1.21
Revision history for this message
Larry Michel (lmic) wrote :

We are hitting this for 1.21 in our production environment. Because of other blocking issues, we have not been able to recreate it in the test environment yet. Please see all debug data for 48 recreates for the month of January, with the last 4 being on 1.21. The debug flag is turned on; what's missing is /etc/lxc, and we're looking into how we can collect this info in the production environment.

Has this ever worked? Since we're seeing this across different configurations, the answer would pertain to what's common, and neutron-contrail is the common thread.

Larry Michel (lmic)
Changed in juju-core:
status: Incomplete → New
Revision history for this message
Curtis Hovey (sinzui) wrote :

Yes. Juju is very good at creating lxc containers; it always works for the release. Every revision is tested in a minimum of 14 combinations with lxc, then retested in the weekly tests. If /etc/lxc were missing there would be systemic failures in CI (and this has happened). As for this being universally broken...only OIL is reporting this case.

As for /etc/lxc being missing, I believe the lxc package wasn't installed. Maybe there is a case where it is not installed, or apt is still installing packages.

no longer affects: juju-core/1.22
Changed in juju-core:
status: New → Triaged
milestone: none → 1.23
Revision history for this message
Larry Michel (lmic) wrote :

Curtis, sorry for the confusion. /etc/lxc is not missing from the system. I meant that its content is missing from the logs since we are not currently collecting its content in the event of a failure.

Revision history for this message
Larry Michel (lmic) wrote :

Artifacts from one of the builds.

Revision history for this message
Greg Lutostanski (lutostag) wrote :

I will run a deployment with same yaml bundle and see if we can repro it offline so we can poke it interactively and see what is happening.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We're still seeing this with 1.22.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.23 → 1.24-alpha1
Casey Marshall (cmars)
tags: added: stakeholder-critical
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.24-alpha1 → 1.25.0
Revision history for this message
John A Meinel (jameinel) wrote :

These files in artifacts.zip make it a bit hard to debug on our end:
juju_debug_log.txt
cat: /var/log/juju/all-machines.log: Permission denied

juju_debug_machine-0.txt
cat: /var/log/juju/machine-0.log: Permission denied

I'm pretty sure those are owned as root, probably with 700 permissions. (Depending on debug level secrets can end up in those files, which is why they aren't world readable.)

I'm also having a really hard time figuring out some of the formatting of that error message, but I'm trying.
It appears to be trying to download the root image from a local machine:
+ url2=https://10.245.0.177:17070/environment/97640150-37d0-4356-80b1-095a39c78437/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz;

what is 10.245.0.177? Given the port 17070 it is almost definitely the Juju API server.

+ cd /var/cache/lxc/cloud-trusty;
+ ''['' 0 -eq 1 '']'';

It looks like maybe it is failing to cd $cache, even though earlier it does issue a "mkdir -p $cache".
Alternatively that check is actually just seeing if the file already exists, and since it doesn't it goes ahead and issues the wget request. wget seems unhappy:
+ wget https://10.245.0.177:17070/environment/97640150-37d0-4356-80b1-095a39c78437/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz;
 --2015-02-03 01:16:43--
  https://10.245.0.177:17070/environment/97640150-37d0-4356-80b1-095a39c78437/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz;
   Connecting to 10.245.0.177:17070... connected.;
   HTTP request sent, awaiting response... 500 Internal Server Error;
   2015-02-03 01:16:43 ERROR 500: Internal Server Error.;

If we assume 10.245.0.177 is a juju state server hosting root images, it is getting a 500 ISE.
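The quoted fragments are consistent with a cache-then-fetch pattern. A sketch of that pattern follows; the function name `fetch_cached` and the exact control flow are inferred here, not taken from juju's actual cloud-image template script:

```shell
# Reconstruction of the cache-then-fetch pattern implied by the trace;
# fetch_cached and its logic are inferred, not juju's actual script.
fetch_cached() {
    url=$1
    cache=$2
    mkdir -p "$cache" || return 1
    f="$cache/$(basename "$url")"
    # Download only when the image is not already cached; an HTTP 500
    # from the API server makes wget fail at this step, which is what
    # the trace shows.
    [ -f "$f" ] || wget -q -O "$f" "$url" || return 1
    echo "$f"
}
```

On a cache hit the wget is skipped entirely, which is consistent with only the first deployment on a machine touching the API server.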

Revision history for this message
John A Meinel (jameinel) wrote :

So Ian noted that the Juju API Server is acting as a caching proxy here. The client requests the file from Juju, and Juju is supposed to download it from cloud-images, cache the file, and then return it.

Without the log files (all-machines.log and/or machine-0.log) it is hard to know what the internal failure is.

Can you possibly try changing the line:
+ juju ssh -e maas 0 'cat /var/log/juju/all-machines.log'

to
+ juju ssh -e maas 0 'sudo cat /var/log/juju/all-machines.log'

?

Revision history for this message
Alexis Bruemmer (alexis-bruemmer) wrote :

Based on John's last comment, logs will be needed before further progress can be made on this bug.

Changed in juju-core:
status: Triaged → Incomplete
Curtis Hovey (sinzui)
no longer affects: juju-core/1.24
Revision history for this message
Larry Michel (lmic) wrote :

John, the logs you are requesting are already attached to this bug. I looked into the tarball logfiles-1410876-01312015-2.tar.gz and the error you mentioned is only in the builds for oil-triage-20150131-020000.000000-20150201-020000.000000 and the first build of oil-triage-20150129-020000.000000-20150130-020000.000000.

All the other directories have complete all-machines.log output.

Changed in juju-core:
status: Incomplete → New
Revision history for this message
Larry Michel (lmic) wrote :

It so happens that the archive manager displays these 2 directories first, so I understand why it would look like they're all bad. You should see a total of 43 complete all-machines.log files in that tarball. Hopefully, these help.

lmic@lmic-Latitude-E6520:~/Downloads/gz$ find . -name "juju_debug_log.txt" -exec grep -m 1 "Permission denied" {} \;
cat: /var/log/juju/all-machines.log: Permission denied
cat: /var/log/juju/all-machines.log: Permission denied
cat: /var/log/juju/all-machines.log: Permission denied
cat: /var/log/juju/all-machines.log: Permission denied
cat: /var/log/juju/all-machines.log: Permission denied
lmic@lmic-Latitude-E6520:~/Downloads/gz$ find . -name "juju_debug_log.txt" -exec grep -m 1 "Permission denied" {} \; | wc -l
5
lmic@lmic-Latitude-E6520:~/Downloads/gz$ find . -name "juju_debug_log.txt" -exec ls -l {} \; |wc -l
48

Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
Revision history for this message
Ante Karamatić (ivoks) wrote :

'failed to allocate a pty' might indicate exhausted ptys. Is /dev/pts mounted? How many are in use (sudo lsof /dev/pts/*) when it fails?
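For reference, pty pressure can also be checked read-only via /proc (a supplementary sketch assuming a standard Linux /proc layout; not from the bug itself):

```shell
# Quick pty checks along the lines Ante suggests; the /proc paths are
# standard on Linux, the counts are illustrative, not thresholds.
grep devpts /proc/mounts || echo "/dev/pts not mounted"
cat /proc/sys/kernel/pty/nr    # ptys currently allocated by the kernel
cat /proc/sys/kernel/pty/max   # upper limit; hitting it makes pty allocation fail
```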

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I think the 'failed to allocate a pty' bit is a red herring - it looks like the original failure is this:

error executing "lxc-clone": lxc_container: utils.c: mkdir_p: 220 Not a directory - failed to create directory '/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps';

This error comes from lxc's use of libc's mkdir(), which is setting errno to ENOTDIR, meaning one of the components in the directory it's trying to create is not actually a directory.

So, which part of the '/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps' path might exist as a non directory?
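The ENOTDIR reading is easy to demonstrate with a throwaway path (illustrative only, not the real lockfile layout):

```shell
# Minimal reproduction of the ENOTDIR case described above: a path
# component that exists as a regular file defeats mkdir -p.
tmp=$(mktemp -d)
touch "$tmp/juju-trusty-lxc-template"            # component exists as a file
mkdir -p "$tmp/juju-trusty-lxc-template/snaps" 2>"$tmp/err" \
    || cat "$tmp/err"                            # mkdir: ...: Not a directory
```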

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Well, I don't know. The lxc code looks like it tries another path if that mkdir fails...

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

could you please show the result of:

full=/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps
while [ "$full" != "/" ]; do
 echo "XXX: $full"
 ls -ld "$full"
 full=`dirname "$full"`
done

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

The problem with running scripts like that is that this is all coming from an automated environment which cleans up after failure. We haven't been able to add a script to do that yet.

Did see another failure of this that has a slightly different error message:
        agent-state-info: 'container failed to start and failed to destroy: manual
          cleanup of containers needed: error executing "lxc-wait": lxc_container:
          utils.c: mkdir_p: 220 Not a directory - failed to create directory ''/run/lock/lxc//var/lib/lxc/juju-machine-4-lxc-1/snaps'';
          lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy
          snapshot %s - failed to allocate a pty; Insufficent privileges to control
          juju-machine-4-lxc-1'

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

It looks like the lxc commands that juju uses aren't logged at DEBUG log level. It would be really helpful for debugging this if they were.

For example, from the error here, it's not clear what 'lxc-clone' command is being called:

2015-01-15 11:41:05 ERROR juju.provisioner provisioner_task.go:418 cannot start instance for machine "5/lxc/0": error executing "lxc-clone": lxc_container: utils.c: mkdir_p: 220 Not a directory - failed to create directory '/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'; lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template

Where is the "%s" coming from? Either lxc or juju is not converting a format specifier and it's not clear which. Logging what lxc commands juju calls would clear that up.
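Whichever layer is at fault, the literal %s is the classic signature of a printf-style format string emitted without its arguments being substituted. A minimal shell illustration ('snap0' is a made-up snapshot name, purely for the demo):

```shell
# An unsubstituted format string (what the logs show) versus proper
# substitution of the snapshot name into the %s placeholder.
msg='Could not destroy snapshot %s - failed to allocate a pty'
echo "$msg"              # prints the %s verbatim -- what the logs show
printf "$msg\n" snap0    # prints the name substituted in its place
```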

Revision history for this message
Ian Booth (wallyworld) wrote :

The calls are logged at trace level. You could enable that:

juju set-env logging-config="<root>=INFO;juju=DEBUG;golxc=TRACE;juju.container.lxc=TRACE"

Or having logging-config set in the yaml

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

To do logging config in the yaml, we'd add:

logging-config: "<root>=INFO;juju=DEBUG;golxc=TRACE;juju.container.lxc=TRACE"

to our environments' config in our environment.yaml right?

Revision history for this message
John A Meinel (jameinel) wrote :

Sorry about missing the logs. I had looked at the latest comment (artifacts-1410876.zip) which didn't have any log messages in it.
I'll try to dig through the bigger tarball.

logging-config looks correct to me.

Revision history for this message
John A Meinel (jameinel) wrote :

So I tried looking at 20150129.../32588 which has this in the status.yaml:
    containers:
      5/lxc/0:
        agent-state-info: 'error executing "lxc-wait": lxc_container: utils.c: mkdir_p:
          220 Not a directory - failed to create directory ''/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'';
          lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy
          snapshot %s - failed to allocate a pty; Insufficent privileges to control
          juju-trusty-lxc-template'
        instance-id: pending
        series: trusty

I dug into the logs for machine-0 and machine-5:

For machine-5 we see it notice that it wants to create a 5/lxc/0 instance, and that it grabs the lock to create the template, but then just that something failed.

machine-5: 2015-01-29 10:44:14 INFO juju.provisioner provisioner_task.go:119 Starting up provisioner task machine-5
machine-5: 2015-01-29 10:44:14 INFO juju.provisioner provisioner_task.go:331 found machine "5/lxc/0" pending provisioning
machine-5: 2015-01-29 10:44:15 INFO juju.provisioner.lxc lxc-broker.go:65 starting lxc container for machineId: 5/lxc/0
machine-5: 2015-01-29 10:44:15 INFO juju.container.lxc clonetemplate.go:77 wait for flock on juju-trusty-lxc-template
machine-5: 2015-01-29 10:44:15 INFO juju.container lock.go:50 acquire lock "juju-trusty-lxc-template", ensure clone exists
machine-5: 2015-01-29 10:44:15 INFO juju.container.lxc clonetemplate.go:117 template does not exist, creating
machine-5: 2015-01-29 10:44:32 ERROR juju.container.lxc clonetemplate.go:167 container failed to start: error executing "lxc-wait": lxc_container: utils.c: mkdir_p: 220 Not a directory - failed to create directory '/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'; lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template
machine-5: 2015-01-29 10:44:32 INFO juju.container lock.go:66 release lock "juju-trusty-lxc-template"
machine-5: 2015-01-29 10:44:32 ERROR juju.provisioner.lxc lxc-broker.go:97 failed to start container: error executing "lxc-wait": lxc_container: utils.c: mkdir_p: 220 Not a directory - failed to create directory '/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'; lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template
machine-5: 2015-01-29 10:44:32 ERROR juju.provisioner provisioner_task.go:418 cannot start instance for machine "5/lxc/0": error executing "lxc-wait": lxc_container: utils.c: mkdir_p: 220 Not a directory - failed to create directory '/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'; lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template

Note that "Could not destroy snapshot %s" doesn't occur in our codebase (that I could see). I'm pretty sure it is something in the LXC codebase. (I was hoping to fix the bare %s bug).

If you look on machine-0 around the time this is h...


Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I've attached the debug log and status from a failure we hit yesterday, this time with LXC trace logging enabled. It doesn't show the failed command in the debug log though, so it's not capturing what I wanted to see.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

It looks like the lxc-clone command being used isn't logged even in the successful case. Here are example logs that show the template being created and then lxc-0 up and running, but no logged clone command.

http://paste.ubuntu.com/11844287/

Why is the clone command not being logged?

Revision history for this message
Ryan Beisner (1chb1n) wrote :

We are experiencing this in our automated OpenStack bare metal testing (UOSCI). It appears to be a race, as it is an intermittent issue. It is frequent enough to cause noticeable test failures, though.

The observable symptom is that juju-deployer times out, as some container units are forever in an "allocating" & "Waiting for agent initialization to finish" state.

The impact is that we cannot reliably test OpenStack deployments on baremetal.

Juju version is 1.24.2-0ubuntu1~14.04.1~juju1 from ppa:juju/stable.

# from the machine log:
2015-07-21 19:38:35 DEBUG juju.worker.logger logger.go:45 reconfiguring logging from "<root>=DEBUG" to "<root>=WARNING;unit=DEBUG"
2015-07-21 19:41:05 ERROR juju.provisioner.lxc lxc-broker.go:163 failed to start container: lxc container cloning failed: error executing "lxc-clone": lxc_container: utils.c: mkdir_p: 267 Not a directory - failed to create directory '/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'; lxc_container: lxccontainer.c: do_snapshot_destroy: 3210 Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template
2015-07-21 19:41:05 ERROR juju.provisioner provisioner_task.go:597 cannot start instance for machine "4/lxc/0": lxc container cloning failed: error executing "lxc-clone": lxc_container: utils.c: mkdir_p: 267 Not a directory - failed to create directory '/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'; lxc_container: lxccontainer.c: do_snapshot_destroy: 3210 Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template
2015-07-21 19:41:06 ERROR juju.provisioner.lxc lxc-broker.go:163 failed to start container: lxc container cloning failed: cannot clone a running container
2015-07-21 19:41:06 ERROR juju.provisioner provisioner_task.go:597 cannot start instance for machine "4/lxc/1": lxc container cloning failed: cannot clone a running container

# More detail:
http://paste.ubuntu.com/11920907/

Revision history for this message
Ryan Beisner (1chb1n) wrote :

# machines where juju deploy to lxc failed:
ubuntu@imaginative-error:/var/log/juju$ apt-cache policy lxc
lxc:
  Installed: 1.1.2-0ubuntu3~cloud0
  Candidate: 1.1.2-0ubuntu3~cloud0
  Version table:
 *** 1.1.2-0ubuntu3~cloud0 0
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu/ trusty-updates/kilo/main amd64 Packages
        100 /var/lib/dpkg/status
     1.0.7-0ubuntu0.1 0
        500 http://archive.ubuntu.com//ubuntu/ trusty-updates/main amd64 Packages
     1.0.3-0ubuntu3 0
        500 http://archive.ubuntu.com//ubuntu/ trusty/main amd64 Packages
ubuntu@imaginative-error:/var/log/juju$

# machines where juju deploy to lxc succeeded:
ubuntu@grizzled-family:/var/log/juju$ apt-cache policy lxc
lxc:
  Installed: 1.0.7-0ubuntu0.1
  Candidate: 1.1.2-0ubuntu3~cloud0
  Version table:
     1.1.2-0ubuntu3~cloud0 0
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu/ trusty-updates/kilo/main amd64 Packages
 *** 1.0.7-0ubuntu0.1 0
        500 http://archive.ubuntu.com//ubuntu/ trusty-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     1.0.3-0ubuntu3 0
        500 http://archive.ubuntu.com//ubuntu/ trusty/main amd64 Packages

See the paste in my previous comment for the juju stat view.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Observations re: the deployment referenced @ http://paste.ubuntu.com/11920907/:

Machines 1,2,3 - all have lxc 1.0.7, and happy containers. The base charm is ceph.

Machines 4,5,6 - all have lxc 1.1.2, and no love. The base charm is nova-compute.

# juju run --machine 0,1,2,3,4,5,6 "apt-cache policy lxc"
http://paste.ubuntu.com/11921073/

While this does illustrate behavioral inconsistency between ceph and nova-compute in that they have apparently added the cloud archive ppa at different moments throughout hook execution, I don't think that this is inherently an OpenStack or charm issue.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Hi,

I think I see what's going on.

lxc mutexes container actions using a lockfile, something like /run/lock/lxc/$lxcpath/$lxcname. The last path element is a file.

When you create a snapshot, the lockfile /run/lock/lxc/$lxcpath/$lxcname/snaps/snap0 (for instance) is used.

If the container's own lockfile happens to exist at the moment that you attempt to create the snapshot lockfile, then you would see this 'mkdir: not a directory' error, because it tries to create that path as a directory when it already exists as a file.

There are several ways this could be fixed in lxc, but for a workaround the best I can suggest is to not act on a snapshot while concurrently acting on the parent container (from another shell/task).

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

The nicest fix is probably to use /run/lock/lxc/$lxcpath/$lxcname/lock as $lxcname's lockfile

Revision history for this message
Larry Michel (lmic) wrote :

We've been hitting this on juno and kilo, 18 out of 214 builds.

LXC versions on systems I checked: lxc:amd64 (1.0.7-0ubuntu0.1)

Ryan Beisner (1chb1n)
tags: added: openstack-provider uosci
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Turns out this was actually fixed in the latest security release. IIUC.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Is that fix going to be available in trusty?

Revision history for this message
Ryan Beisner (1chb1n) wrote :

FYI: Following the security release, I've not been able to reproduce this manually, and our automated testing which was failing, is now passing.

Revision history for this message
Ryan Beisner (1chb1n) wrote :
Curtis Hovey (sinzui)
Changed in juju-core:
status: Triaged → Invalid
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.0 → none
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We're still seeing this with 1.0.7-0ubuntu0.2 and juju-core 1.24.4. I've attached the juju machine log, which contains the log of the LXC activity on that machine, including the failing commands, starting with this one:

2015-08-13 04:09:12 TRACE golxc.run.lxc-wait golxc.go:448 run: lxc-wait [--name juju-machine-1-lxc-1 --state RUNNING|STOPPED]
2015-08-13 04:09:12 TRACE golxc.run.lxc-wait golxc.go:458 run failed output: lxc_container: utils.c: mkdir_p: 220 Not a directory - failed to create directory '/run/lxc/lock//var/lib/lxc/juju-machine-1-lxc-1/snaps'
failed to create lock
lxc_container: lxccontainer.c: do_snapshot_destroy: 3267 Could not find snapshot %s - failed to allocate a pty
Insufficent privileges to control juju-machine-1-lxc-1

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Hi,

thanks for pointing that out. The trusty package appears to be subtly different - its src/lxc/lxclock.c creates "/run/lxc/lock/$lxcpath/$lxcname". Upstream it creates "/run/lxc/lock/$lxcpath/.$lxcname", keeping "/run/lxc/lock/$lxcpath/$lxcname" available to be a directory.
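The difference Serge describes can be demonstrated with throwaway paths (a sketch mirroring his comment, under a temp root rather than the real /run/lxc/lock prefix; not the actual lxc code):

```shell
# Demonstration of the two lock layouts: trusty's collides, upstream's
# does not.
root=$(mktemp -d)
lockdir="$root/var/lib/lxc"
name=juju-machine-1-lxc-1
mkdir -p "$lockdir"

# Trusty scheme: the container lockfile *is* $lockdir/$name, so the
# snapshot lock directory $lockdir/$name/snaps can never be created.
touch "$lockdir/$name"
mkdir -p "$lockdir/$name/snaps" 2>/dev/null || echo "trusty scheme: Not a directory"

# Upstream scheme: the lockfile is $lockdir/.$name, which leaves
# $lockdir/$name free to become the snapshot lock directory.
rm "$lockdir/$name"
touch "$lockdir/.$name"
mkdir -p "$lockdir/$name/snaps" && echo "upstream scheme: ok"
```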

Changed in lxc:
status: New → Fix Released
Changed in lxc (Ubuntu):
status: New → Fix Released
importance: Undecided → High
Changed in lxc (Ubuntu Trusty):
importance: Undecided → High
status: New → Triaged
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :
Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Larry, or anyone else affected,

Accepted lxc into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/lxc/1.0.7-0ubuntu0.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in lxc (Ubuntu Trusty):
status: Triaged → Fix Committed
tags: added: verification-needed
Revision history for this message
Brian Murray (brian-murray) wrote :

I've accepted this into -proposed but the bug would benefit from a more detailed test case and regression statement.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@jason-hobbs or @lmic,

can you provide a concise recipe for reproducing this in the Description?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1410876] Re: Error executing lxc-clone: lxc_container: utils.c: mkdir_p 220 Not a directory - Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template

This bug is hit very rarely - we deploy thousands of containers a day and
see it a few times a week. I don't have a way to reproduce it reliably
other than waiting to hit it in OIL.


Changed in lxc (Ubuntu Trusty):
status: Fix Committed → Fix Released