Live migration fails with qemu-img >= 2.10: "Failed to get shared "write" lock\nIs another process using the image?"

Bug #1718295 reported by Matt Riedemann
30
This bug affects 5 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Sean Dague
devstack
Fix Released
High
Matt Riedemann
qemu (Ubuntu)
Won't Fix
Undecided
Unassigned

Bug Description

Looks like this is pretty new:

http://logs.openstack.org/01/503601/7/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/b19b77c/logs/screen-n-api.txt.gz?level=TRACE#_Sep_19_17_47_11_508623

Sep 19 17:47:11.508623 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: ERROR nova.api.openstack.extensions [None req-e31fde7b-317f-4db9-b225-10b6e11b2dff tempest-LiveMigrationTest-1678596498 tempest-LiveMigrationTest-1678596498] Unexpected exception in API method: MigrationError_Remote: Migration error: Disk info file is invalid: qemu-img failed to execute on /opt/stack/data/nova/instances/806812af-bf9e-4dc1-8e0c-11603ccd9f62/disk : Unexpected error while running command.
Sep 19 17:47:11.508805 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=30 -- env LC_ALL=C LANG=C qemu-img info /opt/stack/data/nova/instances/806812af-bf9e-4dc1-8e0c-11603ccd9f62/disk
Sep 19 17:47:11.508946 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: Exit code: 1
Sep 19 17:47:11.509079 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: Stdout: u''
Sep 19 17:47:11.509233 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: Stderr: u'qemu-img: Could not open \'/opt/stack/data/nova/instances/806812af-bf9e-4dc1-8e0c-11603ccd9f62/disk\': Failed to get shared "write" lock\nIs another process using the image?\n'
Sep 19 17:47:11.509487 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: Traceback (most recent call last):
Sep 19 17:47:11.509649 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 160, in _process_incoming
Sep 19 17:47:11.509789 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: res = self.dispatcher.dispatch(message)
Sep 19 17:47:11.510231 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 222, in dispatch
Sep 19 17:47:11.510418 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: return self._do_dispatch(endpoint, method, ctxt, args)
Sep 19 17:47:11.510555 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 192, in _do_dispatch
Sep 19 17:47:11.510687 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: result = func(ctxt, **new_args)
Sep 19 17:47:11.510829 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: File "/opt/stack/new/nova/nova/exception_wrapper.py", line 76, in wrapped
Sep 19 17:47:11.510959 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: function_name, call_dict, binary)
Sep 19 17:47:11.511194 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Sep 19 17:47:11.511713 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: self.force_reraise()
Sep 19 17:47:11.511852 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Sep 19 17:47:11.512037 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: six.reraise(self.type_, self.value, self.tb)
Sep 19 17:47:11.512687 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: File "/opt/stack/new/nova/nova/exception_wrapper.py", line 67, in wrapped
Sep 19 17:47:11.516811 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: return f(self, context, *args, **kw)
Sep 19 17:47:11.516966 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: File "/opt/stack/new/nova/nova/compute/utils.py", line 876, in decorated_function
Sep 19 17:47:11.517110 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: return function(self, context, *args, **kwargs)
Sep 19 17:47:11.525392 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: File "/opt/stack/new/nova/nova/compute/manager.py", line 217, in decorated_function
Sep 19 17:47:11.525526 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: kwargs['instance'], e, sys.exc_info())
Sep 19 17:47:11.525654 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Sep 19 17:47:11.525783 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: self.force_reraise()
Sep 19 17:47:11.525909 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Sep 19 17:47:11.526057 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: six.reraise(self.type_, self.value, self.tb)
Sep 19 17:47:11.526186 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: File "/opt/stack/new/nova/nova/compute/manager.py", line 205, in decorated_function
Sep 19 17:47:11.526314 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: return function(self, context, *args, **kwargs)
Sep 19 17:47:11.529795 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: File "/opt/stack/new/nova/nova/compute/manager.py", line 5378, in check_can_live_migrate_source
Sep 19 17:47:11.529952 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: block_device_info)
Sep 19 17:47:11.530083 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5960, in check_can_live_migrate_source
Sep 19 17:47:11.530446 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: block_device_info)
Sep 19 17:47:11.530587 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 6060, in _assert_dest_node_has_enough_disk
Sep 19 17:47:11.530720 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: disk_infos = self._get_instance_disk_info(instance, block_device_info)
Sep 19 17:47:11.531945 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 7254, in _get_instance_disk_info
Sep 19 17:47:11.532099 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: block_device_info)
Sep 19 17:47:11.532234 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 7222, in _get_instance_disk_info_from_config
Sep 19 17:47:11.532366 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: backing_file = libvirt_utils.get_disk_backing_file(path)
Sep 19 17:47:11.532496 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: File "/opt/stack/new/nova/nova/virt/libvirt/utils.py", line 197, in get_disk_backing_file
Sep 19 17:47:11.532627 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: backing_file = images.qemu_img_info(path, format).backing_file
Sep 19 17:47:11.532755 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: File "/opt/stack/new/nova/nova/virt/images.py", line 72, in qemu_img_info
Sep 19 17:47:11.532896 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: raise exception.InvalidDiskInfo(reason=msg)
Sep 19 17:47:11.533373 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: InvalidDiskInfo: Disk info file is invalid: qemu-img failed to execute on /opt/stack/data/nova/instances/806812af-bf9e-4dc1-8e0c-11603ccd9f62/disk : Unexpected error while running command.
Sep 19 17:47:11.534451 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: Command: /usr/bin/python2 -m oslo_concurrency.prlimit --as=1073741824 --cpu=30 -- env LC_ALL=C LANG=C qemu-img info /opt/stack/data/nova/instances/806812af-bf9e-4dc1-8e0c-11603ccd9f62/disk
Sep 19 17:47:11.534670 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: Exit code: 1
Sep 19 17:47:11.534902 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: Stdout: u''
Sep 19 17:47:11.541303 ubuntu-xenial-2-node-rax-ord-10997038 <email address hidden>[28339]: Stderr: u'qemu-img: Could not open \'/opt/stack/data/nova/instances/806812af-bf9e-4dc1-8e0c-11603ccd9f62/disk\': Failed to get shared "write" lock\nIs another process using the image?\n'

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Unexpected%20exception%20in%20API%20method%3A%20MigrationError_Remote%3A%20Migration%20error%3A%20Disk%20info%20file%20is%20invalid%3A%20qemu-img%20failed%20to%20execute%20on%5C%22%20AND%20tags%3A%5C%22screen-n-api.txt%5C%22&from=7d

252 hits starting on 9/19, check and gate, all failures.

Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :

I don't see any related nova changes in the last 48 hours that look like they could cause this. I wonder if there is a new qemu package or something.

qemu-system 1:2.10+dfsg-0ubuntu1~cloud0

Revision history for this message
Matt Riedemann (mriedem) wrote :

qemu-system 1:2.10+dfsg-0ubuntu1~cloud0 was released on 9/5 so we've already been using it for a couple of weeks.

Revision history for this message
Matt Riedemann (mriedem) wrote :

It's probably this change to devstack to use the pike cloud archive:

https://review.openstack.org/#/c/501224/

Changed in nova:
status: Confirmed → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to devstack (master)

Fix proposed to branch: master
Review: https://review.openstack.org/505446

Changed in devstack:
assignee: nobody → Matt Riedemann (mriedem)
status: New → In Progress
Matt Riedemann (mriedem)
Changed in devstack:
importance: Undecided → High
Revision history for this message
James Page (james-page) wrote : Re: Unexpected exception in API method: MigrationError_Remote: Migration error: Disk info file is invalid: qemu-img failed to execute - Failed to get shared "write" lock\nIs another process using the image?
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to devstack (master)

Reviewed: https://review.openstack.org/505446
Committed: https://git.openstack.org/cgit/openstack-dev/devstack/commit/?id=ee22ca8373abd3b5a4c44a9c5c4da39c511195c8
Submitter: Jenkins
Branch: master

commit ee22ca8373abd3b5a4c44a9c5c4da39c511195c8
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 20 00:29:36 2017 +0000

    Revert "Update to using pike cloud-archive"

    This reverts commit a7e9a5d447b3eeacfb52d7ddc94445058a8d6fd1.

    The jobs that run live migration tests are failing at about
    a rate of 50% since this merged. There are no recent changes
    to nova in the last 24 hours that are related to live
    migration, and this is failing on the master branch only,
    so I suspect the failures are due to new qemu packages
    getting pulled in from this change.

    Change-Id: Ic8481539c6a0cc7af08a736a625b672979435908
    Closes-Bug: #1718295

Changed in devstack:
status: In Progress → Fix Released
Revision history for this message
James Page (james-page) wrote : Re: Unexpected exception in API method: MigrationError_Remote: Migration error: Disk info file is invalid: qemu-img failed to execute - Failed to get shared "write" lock\nIs another process using the image?

This is a behavioural change in qemu for 2.10 - looks like 'qemu-img info' needs to be passed a '--force-share' option to allow it to read a disk file for a running instance.

Revision history for this message
Sean Dague (sdague) wrote :

This appears like it might be an issue with the ppa in pike

affects: qemu → ubuntu
affects: ubuntu → qemu (Ubuntu)
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Since it is a wanted behavioral change in upstream qemu setting that task to "Won't Fix" unless we come up with a reason to convince qemu to do so.
Once might argue that info should imply force-share or such, but unless we do so make clear that no qemu change is expected.
James already mentioned bug 1718133.

@James and Matt - should we dup one of those onto the other?

Changed in qemu (Ubuntu):
status: New → Won't Fix
Revision history for this message
James Page (james-page) wrote :

Distro patch to unblock Ubuntu Pike UCA and Artful (needs conditional check for general consumption).

Changed in nova:
status: Invalid → New
Revision history for this message
James Page (james-page) wrote :

Setting Nova bug task back to 'New' as I think we do need to accommodate this behavioural change in qemu.

Matt Riedemann (mriedem)
Changed in nova:
status: New → Triaged
summary: - Unexpected exception in API method: MigrationError_Remote: Migration
- error: Disk info file is invalid: qemu-img failed to execute - Failed to
- get shared "write" lock\nIs another process using the image?
+ Live migration fails with qemu-img >= 2.10
summary: - Live migration fails with qemu-img >= 2.10
+ Live migration fails with qemu-img >= 2.10: "Failed to get shared
+ "write" lock\nIs another process using the image?"
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/505673

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to devstack (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/505674

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/505748

Changed in nova:
assignee: Matt Riedemann (mriedem) → James Page (james-page)
Changed in nova:
assignee: James Page (james-page) → Sean Dague (sdague)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by James Page (<email address hidden>) on branch: master
Review: https://review.openstack.org/505748
Reason: Abandoning in preference of alternative fix.

Changed in nova:
assignee: Sean Dague (sdague) → Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Sean Dague (sdague)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/505673
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=807579755c4a116309eca5b2bcdbab9d1f393bab
Submitter: Zuul
Branch: master

commit 807579755c4a116309eca5b2bcdbab9d1f393bab
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 20 10:44:11 2017 -0400

    Support qemu >= 2.10

    Qemu 2.10 added the requirement of a --force-share flag to qemu-img
    info when reading information about a disk that is in use by a
    guest. We do this a lot in Nova for operations like gathering
    information before live migration.

    Up until this point all qemu/libvirt version matching has been solely
    inside the libvirt driver, however all the image manip code was moved
    out to nova.virt.images. We need the version of QEMU available there.

    This does it by initializing that version on driver init host. The net
    effect is also that broken libvirt connections are figured out
    earlier, as there is an active probe for this value.

    Co-Authored-By: Sean Dague <email address hidden>

    Change-Id: Iae2962bb86100f03fd3ad9aac3767da876291e74
    Closes-Bug: #1718295

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to devstack (master)

Reviewed: https://review.openstack.org/505674
Committed: https://git.openstack.org/cgit/openstack-dev/devstack/commit/?id=917ad0998be8c48bfcc0e3031bc1b75cd9ed1927
Submitter: Zuul
Branch: master

commit 917ad0998be8c48bfcc0e3031bc1b75cd9ed1927
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 20 14:46:48 2017 +0000

    Update to using pike cloud-archive

    This reverts commit ee22ca8373abd3b5a4c44a9c5c4da39c511195c8

    Depends-On: Iae2962bb86100f03fd3ad9aac3767da876291e74

    Change-Id: I4d5fa052bdc5eef1795f6507589e2eaf4e093e23
    Related-Bug: #1718295

Revision history for this message
Kevin Lefevre (archifleks) wrote :

Is this going to be backported to Pike, I ran into the same issue with openstack-ansible in Pike

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/509774

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/pike)

Change abandoned by Kevin Lefevre (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/509774
Reason: sorry

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/509774
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5e9508b77f86f1fadb173a071e2519aad24534f9
Submitter: Jenkins
Branch: stable/pike

commit 5e9508b77f86f1fadb173a071e2519aad24534f9
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 20 10:44:11 2017 -0400

    Support qemu >= 2.10

    Qemu 2.10 added the requirement of a --force-share flag to qemu-img
    info when reading information about a disk that is in use by a
    guest. We do this a lot in Nova for operations like gathering
    information before live migration.

    Up until this point all qemu/libvirt version matching has been solely
    inside the libvirt driver, however all the image manip code was moved
    out to nova.virt.images. We need the version of QEMU available there.

    This does it by initializing that version on driver init host. The net
    effect is also that broken libvirt connections are figured out
    earlier, as there is an active probe for this value.

    Co-Authored-By: Sean Dague <email address hidden>

    Change-Id: Iae2962bb86100f03fd3ad9aac3767da876291e74
    Closes-Bug: #1718295
    (cherry picked from commit 807579755c4a116309eca5b2bcdbab9d1f393bab)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.0.0b1

This issue was fixed in the openstack/nova 17.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.0.2

This issue was fixed in the openstack/nova 16.0.2 release.

Revision history for this message
Filippo DiNoto (fdinoto) wrote :

I think I am experiencing this issue on 15.1.13

If it's not going to be backported to O can anyone suggest a workaround?

Revision history for this message
Matt Riedemann (mriedem) wrote :

@Filippo: your options are (1) use an older version of QEMU or (2) backport the change to your Ocata-based deployment and run with it out of tree.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to devstack (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/536798

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on devstack (stable/pike)

Change abandoned by Lee Yarwood (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/536798

Revision history for this message
Gökhan (skylightcoder) wrote :

hi , because of power outage, Most of our compute nodes unexpectedly shut down and now I can't start our instances. I am getting "Failed to get "write" lock Is another process using the image?" error. Full error log is http://paste.openstack.org/show/754107/. My environment is OpenStack Pike and Instances are on a nfs shared storage. Nova version is 16.1.6.dev2. This fix can not solve this problem completely. What would you suggest for solve this problem ?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Maybe your files are still in locked state on the NFS system.
That you chould check with qemu-img operations, no need to start the guest.
If they are you might need to remove that lock bit (not sure how atm).

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.opendev.org/693851

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/ocata)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/ocata
Review: https://review.opendev.org/c/openstack/nova/+/693851
Reason: stable/ocata of openstack/nova transitioned to End of Life ( https://review.opendev.org/c/openstack/releases/+/795664 ), to be able to delete the branch we need to abandon all open patches on the branch.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.