At termination, LXC rootfs is not always unmounted before rmtree() is called

Bug #1046313 reported by David Kang
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Pádraig Brady
Essex
Fix Released
Undecided
Pádraig Brady
nova (Ubuntu)
Fix Released
Undecided
Unassigned
Precise
Fix Released
Undecided
Unassigned

Bug Description

nova version used:
commit 20c6bb6c9000fa0d193f688b668f5f3eeda8fb05
Merge: aedaf10 0876cf5
Author: Jenkins <email address hidden>
Date: Wed Aug 29 14:33:01 2012 +0000

    Merge "Do not run pylint by default"

Symptom:

The rootfs of LXC instance is not unmounted before rmtree() is called in the nova/virt/libvirt/driver.py file.
I've seen this problem in Essex and in Folsom.
It does not happen always, though.
I suspect there is timing issues between unmount() and rmtree().
This bug eventually leads to "no free nbd device".

Example:
 After terminating instance i-00000005, I still see that its rootfs is mounted to /dev/nbd15:
$ mount
/dev/nbd15 on /usr/local/upstream-Aug-29/instances/instance-00000005/rootfs type ext2 (rw)

 Since it is not unmounted before rmtree() is called, nova-compute complains.

Here is the log of nova-compute:
2012-09-04 09:11:46 INFO nova.virt.libvirt.driver [-] [instance: 8e0b9d15-2c4b-40e7-a932-90c8d39d9657] Instance destroyed successfully.
2012-09-04 09:11:46 DEBUG nova.utils [req-52c4813e-2ae8-4307-af31-158d896fe374 admin admin] Attempting to grab semaphore "iptables" for method "_apply"... from (pid=10672) inner /usr/local/nova/nova/utils.py:708
2012-09-04 09:11:46 DEBUG nova.utils [req-52c4813e-2ae8-4307-af31-158d896fe374 admin admin] Got semaphore "iptables" for method "_apply"... from (pid=10672) inner /usr/local/nova/nova/utils.py:712
2012-09-04 09:11:46 DEBUG nova.utils [req-52c4813e-2ae8-4307-af31-158d896fe374 admin admin] Attempting to grab file lock "iptables" for method "_apply"... from (pid=10672) inner /usr/local/nova/nova/utils.py:716
2012-09-04 09:11:46 DEBUG nova.utils [req-52c4813e-2ae8-4307-af31-158d896fe374 admin admin] Got file lock "iptables" for method "_apply"... from (pid=10672) inner /usr/local/nova/nova/utils.py:724
2012-09-04 09:11:46 DEBUG nova.utils [req-52c4813e-2ae8-4307-af31-158d896fe374 admin admin] Running cmd (subprocess): sudo nova-rootwrap /etc/nova/rootwrap.conf iptables-save -c -t filter from (pid=10672) execute /usr/local/nova/nova/utils.py:176
2012-09-04 09:11:46 DEBUG nova.utils [req-52c4813e-2ae8-4307-af31-158d896fe374 admin admin] Result was 0 from (pid=10672) execute /usr/local/nova/nova/utils.py:191
2012-09-04 09:11:46 DEBUG nova.utils [req-52c4813e-2ae8-4307-af31-158d896fe374 admin admin] Running cmd (subprocess): sudo nova-rootwrap /etc/nova/rootwrap.conf iptables-restore -c from (pid=10672) execute /usr/local/nova/nova/utils.py:176
2012-09-04 09:11:46 DEBUG nova.utils [req-52c4813e-2ae8-4307-af31-158d896fe374 admin admin] Result was 0 from (pid=10672) execute /usr/local/nova/nova/utils.py:191
2012-09-04 09:11:46 DEBUG nova.utils [req-52c4813e-2ae8-4307-af31-158d896fe374 admin admin] Running cmd (subprocess): sudo nova-rootwrap /etc/nova/rootwrap.conf iptables-save -c -t nat from (pid=10672) execute /usr/local/nova/nova/utils.py:176
2012-09-04 09:11:46 DEBUG nova.utils [req-52c4813e-2ae8-4307-af31-158d896fe374 admin admin] Result was 0 from (pid=10672) execute /usr/local/nova/nova/utils.py:191
2012-09-04 09:11:46 DEBUG nova.utils [req-52c4813e-2ae8-4307-af31-158d896fe374 admin admin] Running cmd (subprocess): sudo nova-rootwrap /etc/nova/rootwrap.conf iptables-restore -c from (pid=10672) execute /usr/local/nova/nova/utils.py:176
2012-09-04 09:11:47 DEBUG nova.utils [req-52c4813e-2ae8-4307-af31-158d896fe374 admin admin] Result was 0 from (pid=10672) execute /usr/local/nova/nova/utils.py:191
2012-09-04 09:11:47 DEBUG nova.network.linux_net [req-52c4813e-2ae8-4307-af31-158d896fe374 admin admin] IPTablesManager.apply completed with success from (pid=10672) _apply /usr/local/nova/nova/network/linux_net.py:369
2012-09-04 09:11:47 INFO nova.virt.libvirt.driver [req-52c4813e-2ae8-4307-af31-158d896fe374 admin admin] [instance: 8e0b9d15-2c4b-40e7-a932-90c8d39d9657] Deleting instance files /usr/local/nova/instances/instance-00000005
2012-09-04 09:11:47 ERROR nova.virt.libvirt.driver [req-52c4813e-2ae8-4307-af31-158d896fe374 admin admin] Failed to cleanup directory /usr/local/nova/instances/instance-00000005: [Errno 13] Permission denied: '/usr/local/nova/instances/instance-00000005/rootfs/lost+found'

 I can manually unmount it and release /dev/nbd15 to finish the
termination process.
Without doing that, nbd15 is permanently occupied by the terminated instance.

Chuck Short (zulcss)
tags: added: lxc
Revision history for this message
Pádraig Brady (p-draigbrady) wrote :

It's possible that umount returns before actually completing the umount
(communication with the qemu-nbd processes etc.). Though that would
be a bug (as we don't specify the -z option). Now there is a similar bug in
FUSE that we had to work around in the libguestfs implementation, so it's possible.

What kernel version/distro are you using?

Also I'm a bit confused that when you ran the mount command,
the mount point showed up. If it was a race, then mount point
would probably be unmounted before you ran that command.
There are definitely no mentions of the following in the log?
  "Failed to unmount container filesystem"

If not and this only happens intermittently, then
logically either umount silently fails somethimes with nbd,
or sometimes the _device_for_path() function in
/usr/local/nova/nova/virt/disk/api.py doesn't find the corresponding device.
Both surprising TBH.

Changed in nova:
assignee: nobody → Pádraig Brady (p-draigbrady)
importance: Undecided → High
status: New → Triaged
Revision history for this message
David Kang (dkang) wrote :

 kernel: we applied a patch for lxc tool support to 2.6.38. We use RHEL 6.2.

$ uname -a
Linux bespin101 2.6.38.2-lxc-0.7.4 #3 SMP Fri Dec 2 15:58:39 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

 About race condition, sorry for the confusion.
I believe it is not unmounted at all.
But, I don't see error message of "Failed to unmount ...".
So, I think it fails to unmount but does not report error.

 As I mentioned before,
I can unmount the rootfs and disconnect nbd device manually later.

Revision history for this message
Pádraig Brady (p-draigbrady) wrote :

Note nbd isn't supported on RHEL, so I presume you manually added that module to your kernel build too.
It might be that you're hitting old qemu bugs. Though looking a bit more at your logs I don't actually
see the umount command being run. That suggests an issue with the nova lxc umount logic :(

If there really is no mention of "Running cmd.*umount" in your logs for the problematic runs,
then could you change the _device_for_path() function in /usr/local/nova/nova/virt/disk/api.py
to add this before the return:

    if not device:
        LOG.debug("Didn't find %s in %s" % (path, open("/proc/mounts").read()))

If that outputs something it should help determine why umount is not being called.

thanks,
Pádraig.

Revision history for this message
David Kang (dkang) wrote :

Yes, I've enabled nbd manually.

I've added your code and got it!
You are absolutely right. It didn't find the path of the rootfs.

2012-09-07 15:42:23 INFO nova.virt.libvirt.driver [req-fe924337-1fdd-4a69-9563-dc5d093bfb68 admin admin] [instance: d5a4ab12-32e9-4ad5-b7ce-7b8c9e9d851f] Deleting instance files /usr/local/nova/instances/instance-0000000f
2012-09-07 15:42:23 DEBUG nova.virt.disk.api [req-fe924337-1fdd-4a69-9563-dc5d093bfb68 admin admin] Didn't find /usr/local/nova/instances/instance-0000000f/rootfs in rootfs / rootfs rw 0 0

 It is due to the symbolic link used for nova directory.

$ ls -l /usr/local/nova
lrwxrwxrwx. 1 root root 15 Sep 5 14:39 /usr/local/nova -> upstream-Sep-5/

The absolute path name of the rootfs is:
      /usr/local/upstream-Sep-5/instances/instance-0000000f/rootfs
However, since /usr/local/nova is a symbolic link of the /usr/local/upstream-Sep-5,
the code tries to unmount "/usr/local/nova/instances/instance-0000000f/rootfs".

 Then, is it a bug or should I not use symbolic link at all?

Revision history for this message
Pádraig Brady (p-draigbrady) wrote :

Excellent.

I'm not sure why some runs worked for you.
Anyway we should handle symbolic links.
Fix coming up.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/12623

Changed in nova:
status: Triaged → In Progress
Changed in nova:
milestone: none → folsom-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/essex)

Fix proposed to branch: stable/essex
Review: https://review.openstack.org/12624

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/12623
Committed: http://github.com/openstack/nova/commit/b35b34a65a6ddcfee524304cc548b4dfe1fc6cd5
Submitter: Jenkins
Branch: master

commit b35b34a65a6ddcfee524304cc548b4dfe1fc6cd5
Author: Pádraig Brady <email address hidden>
Date: Sat Sep 8 02:43:01 2012 +0100

    fix unmounting of LXC containers in the presence of symlinks

    When looking up the mount dir in /proc/mounts we need
    to canonicalize the path so that symlinks are resolved

    Fixes bug: 1046313
    Change-Id: Ib2bb8253e9cdfeda3c6ba13e56a0f907518ff558

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/essex)

Reviewed: https://review.openstack.org/12624
Committed: http://github.com/openstack/nova/commit/4eebe76e9671afcdf8f468a91c9eb27ddba30715
Submitter: Jenkins
Branch: stable/essex

commit 4eebe76e9671afcdf8f468a91c9eb27ddba30715
Author: Pádraig Brady <email address hidden>
Date: Sat Sep 8 02:43:01 2012 +0100

    fix unmounting of LXC containers in the presence of symlinks

    When looking up the mount dir in /proc/mounts we need
    to canonicalize the path so that symlinks are resolved

    Fixes bug: 1046313
    Change-Id: Ib2bb8253e9cdfeda3c6ba13e56a0f907518ff558
    Cherry-picked: b35b34a65a6ddcfee524304cc548b4dfe1fc6cd5

Thierry Carrez (ttx)
Changed in nova:
milestone: folsom-rc1 → 2012.2
Changed in nova (Ubuntu):
status: New → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello David, or anyone else affected,

Accepted nova into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/nova/2012.1.3+stable-20130423-e52e6912-0ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in nova (Ubuntu Precise):
status: New → Fix Committed
tags: added: verification-needed
Revision history for this message
Yolanda Robla (yolanda.robla) wrote : Verification report.

Please find the attached test log from the Ubuntu Server Team's CI infrastructure. As part of the verification process for this bug, Nova has been deployed and configured across multiple nodes using precise-proposed as an installation source. After successful bring-up and configuration of the cluster, a number of exercises and smoke tests have be invoked to ensure the updated package did not introduce any regressions. A number of test iterations were carried out to catch any possible transient errors.

Please Note the list of installed packages at the top and bottom of the report.

For records of upstream test coverage of this update, please see the Jenkins links in the comments of the relevant upstream code-review(s):

Trunk review: https://review.openstack.org/12623
Stable review: https://review.openstack.org/12624

As per the provisional Micro Release Exception granted to this package by the Technical Board, we hope this contributes toward verification of this update.

Revision history for this message
Yolanda Robla (yolanda.robla) wrote :

Test coverage log.

Revision history for this message
Yolanda Robla (yolanda.robla) wrote :

Please find the attached test log from the Ubuntu Server Team's CI infrastructure. As part of the verification process for this bug, Nova has been deployed and configured across multiple nodes using precise-proposed as an installation source. After successful bring-up and configuration of the cluster, a number of exercises and smoke tests have be invoked to ensure the updated package did not introduce any regressions. A number of test iterations were carried out to catch any possible transient errors.

Please Note the list of installed packages at the top and bottom of the report.

For records of upstream test coverage of this update, please see the Jenkins links in the comments of the relevant upstream code-review(s):

Trunk review: https://review.openstack.org/12623
Stable review: https://review.openstack.org/12624

As per the provisional Micro Release Exception granted to this package by the Technical Board, we hope this contributes toward verification of this update.

Revision history for this message
Yolanda Robla (yolanda.robla) wrote :

Test coverage log.

tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2012.1.3+stable-20130423-e52e6912-0ubuntu1

---------------
nova (2012.1.3+stable-20130423-e52e6912-0ubuntu1) precise-proposed; urgency=low

  * Resynchronize with stable/essex (e52e6912) (LP: #1089488):
    - [48e81f1] VNC proxy can be made to connect to wrong VM LP: 1125378
    - [3bf5a58] snat rule too broad for some network configurations LP: 1048765
    - [efaacda] DOS by allocating all fixed ips LP: 1125468
    - [b683ced] Add nosehtmloutput as a test dependency.
    - [45274c8] Nova unit tests not running, but still passing for stable/essex
      LP: 1132835
    - [e02b459] vnc unit-test fixes
    - [87361d3] Jenkins jobs fail because of incompatibility between sqlalchemy-
      migrate and the newest sqlalchemy-0.8.0b1 (LP: #1073569)
    - [e98928c] VNC proxy can be made to connect to wrong VM LP: 1125378
    - [c0a10db] DoS through XML entity expansion (CVE-2013-1664) LP: 1100282
    - [243d516] No authentication on block device used for os-volume_boot
      LP: 1069904
    - [80fefe5] use_single_default_gateway does not function correctly
      (LP: #1075859)
    - [bd10241] Essex 2012.1.3 : Error deleting instance with 2 Nova Volumes
      attached (LP: #1079745)
    - [86a5937] do_refresh_security_group_rules in nova.virt.firewall is very
      slow (LP: #1062314)
    - [ae9c5f4] deallocate_fixed_ip attempts to update an already deleted
      fixed_ip (LP: #1017633)
    - [20f98c5] failed to allocate fixed ip because old deleted one exists
      (LP: #996482)
    - [75f6922] snapshot stays in saving state if the vm base image is deleted
      (LP: #921774)
    - [1076699] lock files may be removed in error dues to permissions issues
      (LP: #1051924)
    - [40c5e94] ensure_default_security_group() does not call sgh (LP: #1050982)
    - [4eebe76] At termination, LXC rootfs is not always unmounted before
      rmtree() is called (LP: #1046313)
    - [47dabb3] Heavily loaded nova-compute instances don't sent reports
      frequently enough (LP: #1045152)
    - [b375b4f] When attach volume lost attach when node restart (LP: #1004791)
    - [4ac2dcc] nova usage-list returns wrong usage (LP: #1043999)
    - [014fcbc] Bridge port's hairpin mode not set after resuming a machine
      (LP: #1040537)
    - [2f35f8e] Nova flavor ephemeral space size reported incorrectly
      (LP: #1026210)
  * Dropped, superseeded by new snapshot:
    - debian/patches/CVE-2013-0335.patch: [48e81f1]
    - debian/patches/CVE-2013-1838.patch: [efaacda]
    - debian/patches/CVE-2013-1664.patch: [c0a10db]
    - debian/patches/CVE-2013-0208.patch: [243d516]
 -- Yolanda <email address hidden> Mon, 22 Apr 2013 12:37:08 +0200

Changed in nova (Ubuntu Precise):
status: Fix Committed → Fix Released
Revision history for this message
Scott Kitterman (kitterman) wrote : Update Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regresssions.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.