[SRU] Creation of image (or live snapshot) from the existing VM fails if libvirt-image-backend is configured to qcow2 starting from Ussuri

Bug #1896617 reported by Vladimir Grevtsev
Affects                        Status        Importance  Assigned to    Milestone
OpenStack Compute (nova)       Opinion       Wishlist    Unassigned
OpenStack Nova Compute Charm   Invalid       Undecided   Unassigned
Ubuntu Cloud Archive           Fix Released  Critical    Corey Bryant
  Ussuri                       Fix Released  Critical    Corey Bryant
  Victoria                     Fix Released  Critical    Corey Bryant
nova (Ubuntu)                  Fix Released  Critical    Corey Bryant
  Focal                        Fix Released  Critical    Corey Bryant
  Groovy                       Fix Released  Critical    Corey Bryant

Bug Description

[Impact]

tl;dr

1) Creating an image from an existing VM fails if the qcow2 image backend is used, but everything is fine with the rbd image backend in nova-compute.
2) openstack server image create --name <name of the new image> <instance name or uuid> fails with a seemingly unrelated error:

$ openstack server image create --wait 842fa12c-19ee-44cb-bb31-36d27ec9d8fc
HTTP 404 Not Found: No image found with ID f4693860-cd8d-4088-91b9-56b2f173ffc7

== Details ==

Two Tempest tests ([1] and [2]) from the 2018.02 Refstack test list [0] are failing with the following exception:

49701867-bedc-4d7d-aa71-7383d877d90c
Traceback (most recent call last):
  File "/home/ubuntu/snap/fcbtest/14/.rally/verification/verifier-2d9cbf4d-fcbb-491d-848d-5137a9bde99e/repo/tempest/api/compute/base.py", line 369, in create_image_from_server
    waiters.wait_for_image_status(client, image_id, wait_until)
  File "/home/ubuntu/snap/fcbtest/14/.rally/verification/verifier-2d9cbf4d-fcbb-491d-848d-5137a9bde99e/repo/tempest/common/waiters.py", line 161, in wait_for_image_status
    image = show_image(image_id)
  File "/home/ubuntu/snap/fcbtest/14/.rally/verification/verifier-2d9cbf4d-fcbb-491d-848d-5137a9bde99e/repo/tempest/lib/services/compute/images_client.py", line 74, in show_image
    resp, body = self.get("images/%s" % image_id)
  File "/home/ubuntu/snap/fcbtest/14/.rally/verification/verifier-2d9cbf4d-fcbb-491d-848d-5137a9bde99e/repo/tempest/lib/common/rest_client.py", line 298, in get
    return self.request('GET', url, extra_headers, headers)
  File "/home/ubuntu/snap/fcbtest/14/.rally/verification/verifier-2d9cbf4d-fcbb-491d-848d-5137a9bde99e/repo/tempest/lib/services/compute/base_compute_client.py", line 48, in request
    method, url, extra_headers, headers, body, chunked)
  File "/home/ubuntu/snap/fcbtest/14/.rally/verification/verifier-2d9cbf4d-fcbb-491d-848d-5137a9bde99e/repo/tempest/lib/common/rest_client.py", line 687, in request
    self._error_checker(resp, resp_body)
  File "/home/ubuntu/snap/fcbtest/14/.rally/verification/verifier-2d9cbf4d-fcbb-491d-848d-5137a9bde99e/repo/tempest/lib/common/rest_client.py", line 793, in _error_checker
    raise exceptions.NotFound(resp_body, resp=resp)
tempest.lib.exceptions.NotFound: Object not found
Details: {'code': 404, 'message': 'Image not found.'}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/snap/fcbtest/14/.rally/verification/verifier-2d9cbf4d-fcbb-491d-848d-5137a9bde99e/repo/tempest/api/compute/images/test_images_oneserver.py", line 69, in test_create_delete_image
    wait_until='ACTIVE')
  File "/home/ubuntu/snap/fcbtest/14/.rally/verification/verifier-2d9cbf4d-fcbb-491d-848d-5137a9bde99e/repo/tempest/api/compute/base.py", line 384, in create_image_from_server
    image_id=image_id)
tempest.exceptions.SnapshotNotFoundException: Server snapshot image d82e95b0-9c62-492d-a08c-5bb118d3bf56 not found.

So far I was able to identify the following:

1) https://github.com/openstack/tempest/blob/master/tempest/api/compute/images/test_images_oneserver.py#L69 invokes a "create image from server"
2) It fails with the following error message in the nova-compute logs: https://pastebin.canonical.com/p/h6ZXdqjRRm/

The same occurs if "openstack server image create --wait" is executed; however, according to https://docs.openstack.org/nova/ussuri/admin/migrate-instance-with-snapshot.html the VM has to be shut down before the image creation:

"Shut down the source VM before you take the snapshot to ensure that all data is flushed to disk. If necessary, list the instances to view the instance name. Use the openstack server stop command to shut down the instance:"

This step is definitely being skipped by the test (i.e. it is trying to perform the snapshot on top of a live VM).

FWIW, I'm using libvirt-image-backend: qcow2 in my nova-compute application params, and I was able to confirm that if this parameter is changed to "libvirt-image-backend: rbd", the tests pass successfully.
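
For reference, a minimal sketch of inspecting and flipping this charm option (assuming the application is named nova-compute-kvm, as later in this thread):

  $ juju config nova-compute-kvm libvirt-image-backend
  qcow2
  $ juju config nova-compute-kvm libvirt-image-backend=rbd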

Also, I was able to find a similar issue: https://bugs.launchpad.net/nova/+bug/1885418 but it doesn't have any useful information other than confirmation of the fact that OpenStack Ussuri + the libvirt backend has a problem with live snapshotting.

[0] https://refstack.openstack.org/api/v1/guidelines/2018.02/tests?target=platform&type=required&alias=true&flag=false
[1] tempest.api.compute.images.test_images_oneserver.ImagesOneServerTestJSON.test_create_delete_image[id-3731d080-d4c5-4872-b41a-64d0d0021314]
[2] tempest.api.compute.images.test_images_oneserver.ImagesOneServerTestJSON.test_create_image_specify_multibyte_character_image_name[id-3b7c6fe4-dfe7-477c-9243-b06359db51e6]

[Test Case]
1) Deploy/configure OpenStack (using Juju here).
2) If upgrading to the fixed package, libvirt-guests will require a restart: sudo systemctl restart libvirt-guests
3) Create an OpenStack instance.
4) openstack server image create --wait <instance-uuid> (see the sketch below)
5) Succeeds if fixed; fails with a permissions error if not fixed.
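
An end-to-end sketch of the test case (the image, flavor and network names here are assumptions; the failure output is taken from the reports in this bug):

  $ openstack server create --image cirros --flavor m1.tiny --network internal --wait demo
  $ openstack server image create --wait demo
  HTTP 404 Not Found: No image found with ID <image-uuid>

On an unfixed host, nova-compute.log also contains:

  libvirt.libvirtError: unable to verify existence of block copy target: Permission denied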

[Regression Potential]
This reverts the nova group membership to what it was prior to the focal version of the packages. If there is a regression in this fix, it would most likely surface as a permissions issue.
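
A quick post-upgrade check (sketch; the fix removes the libvirt-qemu user from the nova group):

  $ getent group nova   # after the fix, libvirt-qemu should no longer appear in the member list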

summary: - Creation of image from the existing VM fails if libvirt-image-backend is
- configured to qcow2
+ Creation of image (or live snapshot) from the existing VM fails if
+ libvirt-image-backend is configured to qcow2
description: updated
description: updated
Revision history for this message
Nikolay Vinogradov (nikolay.vinogradov) wrote : Re: Creation of image (or live snapshot) from the existing VM fails if libvirt-image-backend is configured to qcow2

I'm seeing this on Bionic/Ussuri as well. An image from a VM on qcow2 ephemeral storage can't be created while the VM is running, but it succeeds if the VM is stopped. Meanwhile, an image from a VM on Ceph can be created with no issues.

Also tested with 'raw' image backend: https://pastebin.canonical.com/p/yzTbx4kRXv/

Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

Can confirm that this issue is NOT reproducible on Bionic/Stein, will test on Bionic/Train now.

$ juju config nova-compute-kvm openstack-origin
cloud:bionic-stein
$ juju config nova-compute-kvm libvirt-image-backend
qcow2

$ os server list -f yaml
- Flavor: m1.tiny
  ID: 3612ca0d-0b6b-4104-8663-54d8d891b100
  Image: cirros
  Name: demo
  Networks: internal=10.0.0.115, 172.27.86.146
  Status: ACTIVE

$ os server image create --wait demo -f yaml

checksum: null
container_format: null
created_at: '2020-09-22T15:04:59Z'
disk_format: null
file: /v2/images/ceca0c58-1fbf-47f6-a94c-f81b265e0c20/file
id: ceca0c58-1fbf-47f6-a94c-f81b265e0c20
min_disk: 1
min_ram: 0
name: demo
owner: 9a163a46437c44109e6ec5b10a7a9ed1
properties:
  base_image_ref: 531d4af1-b6bf-40a2-b4d2-162f8a1e7d1d
  boot_roles: Admin,member,reader
  image_type: snapshot
  instance_uuid: 3612ca0d-0b6b-4104-8663-54d8d891b100
  locations: []
  os_hash_algo: null
  os_hash_value: null
  os_hidden: false
  owner_project_name: admin
  owner_user_name: admin
  user_id: 6b88aca2425c4520a55868f57b6201ca
protected: false
schema: /v2/schemas/image
size: null
status: queued
tags: []
updated_at: '2020-09-22T15:04:59Z'
virtual_size: null
visibility: private

ubuntu@node06:~$ dpkg -l | grep nova
ii nova-api-metadata 2:19.3.0-0ubuntu1~cloud0 all OpenStack Compute - metadata API frontend
ii nova-common 2:19.3.0-0ubuntu1~cloud0 all OpenStack Compute - common files
ii nova-compute 2:19.3.0-0ubuntu1~cloud0 all OpenStack Compute - compute node base
ii nova-compute-kvm 2:19.3.0-0ubuntu1~cloud0 all OpenStack Compute - compute node (KVM)
ii nova-compute-libvirt 2:19.3.0-0ubuntu1~cloud0 all OpenStack Compute - compute node libvirt support
ii python3-nova 2:19.3.0-0ubuntu1~cloud0 all OpenStack Compute Python 3 libraries
ii python3-novaclient 2:13.0.0-0ubuntu1~cloud0 all client library for OpenStack Compute API - 3.x
ubuntu@node06:~$ dpkg -l | grep libvirt
ii libvirt-clients 5.0.0-1ubuntu2.6~cloud0 amd64 Programs for the libvirt library
ii libvirt-daemon 5.0.0-1ubuntu2.6~cloud0 amd64 Virtualization daemon
ii libvirt-daemon-driver-storage-rbd 5.0.0-1ubuntu2.6~cloud0 amd64 Virtualization daemon RBD storage driver
ii libvirt-daemon-system 5.0.0-1ubuntu2.6~cloud0 amd64 Libvirt daemon configuration files
ii libvirt0:amd64 5.0.0-1ubuntu2.6~cloud0 amd64 library for interfacing with different virtualization systems
ii nova-compute-libvirt 2:19.3.0-0ubuntu1~cloud0 all OpenStack Compute - compute node libvirt support
ii python3-libvirt ...


Revision history for this message
Nikolay Vinogradov (nikolay.vinogradov) wrote :

Possibly relevant: https://github.com/openstack/nova/commit/fafbc182f9179c16b89c45d02544d4582e0a1194#diff-f4019782d93a196a0d026479e6aa61b1R1814

In Bionic/Ussuri we have libvirt 6.0.0 and QEMU 4.2.0, which could potentially change the behaviour of live snapshots on local ephemeral disks.

Revision history for this message
Nikolay Vinogradov (nikolay.vinogradov) wrote :

Bionic/Ussuri versions:
ubuntu@cloud6:~$ dpkg -l | grep nova
ii nova-api-metadata 2:21.0.0-0ubuntu0.20.04.1~cloud0 all OpenStack Compute - metadata API frontend
ii nova-common 2:21.0.0-0ubuntu0.20.04.1~cloud0 all OpenStack Compute - common files
ii nova-compute 2:21.0.0-0ubuntu0.20.04.1~cloud0 all OpenStack Compute - compute node base
ii nova-compute-kvm 2:21.0.0-0ubuntu0.20.04.1~cloud0 all OpenStack Compute - compute node (KVM)
ii nova-compute-libvirt 2:21.0.0-0ubuntu0.20.04.1~cloud0 all OpenStack Compute - compute node libvirt support
ii python3-nova 2:21.0.0-0ubuntu0.20.04.1~cloud0 all OpenStack Compute Python 3 libraries
ii python3-novaclient 2:17.0.0-0ubuntu1~cloud0 all client library for OpenStack Compute API - 3.x
ubuntu@cloud6:~$ dpkg -l | grep libvirt
ii libvirt-clients 6.0.0-0ubuntu8.2~cloud0 amd64 Programs for the libvirt library
ii libvirt-daemon 6.0.0-0ubuntu8.2~cloud0 amd64 Virtualization daemon
ii libvirt-daemon-driver-qemu 6.0.0-0ubuntu8.2~cloud0 amd64 Virtualization daemon QEMU connection driver
ii libvirt-daemon-driver-storage-rbd 6.0.0-0ubuntu8.2~cloud0 amd64 Virtualization daemon RBD storage driver
ii libvirt-daemon-system 6.0.0-0ubuntu8.2~cloud0 amd64 Libvirt daemon configuration files
ii libvirt-daemon-system-systemd 6.0.0-0ubuntu8.2~cloud0 amd64 Libvirt daemon configuration files (systemd)
ii libvirt0:amd64 6.0.0-0ubuntu8.2~cloud0 amd64 library for interfacing with different virtualization systems
ii nova-compute-libvirt 2:21.0.0-0ubuntu0.20.04.1~cloud0 all OpenStack Compute - compute node libvirt support
ii python3-libvirt 6.1.0-1~cloud0 amd64 libvirt Python 3 bindings
ubuntu@cloud6:~$ dpkg -l | grep qemu
ii ipxe-qemu 1.0.0+git-20180124.fbe8c52d-0ubuntu2.2 all PXE boot firmware - ROM images for qemu
ii ipxe-qemu-256k-compat-efi-roms 1.0.0+git-20150424.a25a16d-0ubuntu2 all PXE boot firmware - Compat EFI ROM images for qemu
ii libvirt-daemon-driver-qemu 6.0.0-0ubuntu8.2~cloud0 amd64 Virtualization daemon QEMU connection driver
ii qemu-block-extra:amd64 1:4.2-3ubuntu6.3~cloud0 amd64 ext...


Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

Alright, so I can confirm that this issue is NOT reproducible on Bionic/Train, so Ussuri is the only affected version:

$ juju config nova-compute-kvm libvirt-image-backend
qcow2
$ juju config nova-compute-kvm openstack-origin
cloud:bionic-train

$ os server list -f yaml
- Flavor: m1.tiny
  ID: 7a740f9c-8fea-40ca-9526-3e072be4f12c
  Image: cirros
  Name: demo
  Networks: internal=10.0.0.206, 172.27.86.129
  Status: ACTIVE
ubuntu@OrangeBox84:~$ ping 172.27.86.129
PING 172.27.86.129 (172.27.86.129) 56(84) bytes of data.
64 bytes from 172.27.86.129: icmp_seq=1 ttl=62 time=2.19 ms
^C
--- 172.27.86.129 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.199/2.199/2.199/0.000 ms
ubuntu@OrangeBox84:~$ ping 172.27.86.129^C
ubuntu@OrangeBox84:~$ os server image create --wait demo -f yaml

checksum: null
container_format: null
created_at: '2020-09-22T15:47:47Z'
disk_format: null
file: /v2/images/3bf49a39-4576-4ab5-8ec6-f09227b39b05/file
id: 3bf49a39-4576-4ab5-8ec6-f09227b39b05
min_disk: 1
min_ram: 0
name: demo
owner: e4ecab73ad214da6abb6cd51711266a1
properties:
  base_image_ref: 2697b005-d727-45d5-9914-f0477933347d
  boot_roles: reader,Admin,member
  image_type: snapshot
  instance_uuid: 7a740f9c-8fea-40ca-9526-3e072be4f12c
  locations: []
  os_hash_algo: null
  os_hash_value: null
  os_hidden: false
  owner_project_name: admin
  owner_user_name: admin
  user_id: ba28625a95174640a0e2c16af07a2dff
protected: false
schema: /v2/schemas/image
size: null
status: queued
tags: []
updated_at: '2020-09-22T15:47:47Z'
virtual_size: null
visibility: private

# Package versions which got installed from Train UCA

ubuntu@node06:~$ sudo -i
root@node06:~# dpkg -l | grep nova
ii nova-api-metadata 2:20.3.0-0ubuntu1~cloud0 all OpenStack Compute - metadata API frontend
ii nova-common 2:20.3.0-0ubuntu1~cloud0 all OpenStack Compute - common files
ii nova-compute 2:20.3.0-0ubuntu1~cloud0 all OpenStack Compute - compute node base
ii nova-compute-kvm 2:20.3.0-0ubuntu1~cloud0 all OpenStack Compute - compute node (KVM)
ii nova-compute-libvirt 2:20.3.0-0ubuntu1~cloud0 all OpenStack Compute - compute node libvirt support
ii python3-nova 2:20.3.0-0ubuntu1~cloud0 all OpenStack Compute Python 3 libraries
ii python3-novaclient 2:15.1.0-0ubuntu2~cloud0 all client library for OpenStack Compute API - 3.x
root@node06:~# dpkg -l | grep libvirt
ii libvirt-clients 5.4.0-0ubuntu5.4~cloud0 amd64 Programs for the libvirt library
ii libvirt-daemon 5.4.0-0ubuntu5.4~cloud0 amd64 Virtualization daemon
ii libvirt-daemon-driver-storage-rbd 5.4.0...


summary: Creation of image (or live snapshot) from the existing VM fails if
- libvirt-image-backend is configured to qcow2
+ libvirt-image-backend is configured to qcow2 starting from Ussuri
Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote : Re: Creation of image (or live snapshot) from the existing VM fails if libvirt-image-backend is configured to qcow2 starting from Ussuri

Subscribing field-critical as this is a noticeable functionality degradation - the Refstack tests don't pass due to this issue (see the issue description).

Revision history for this message
Lee Yarwood (lyarwood) wrote :

I8e8035dcf508f5215bba9b7575c5c6abfe41da31 isn't related to the failure but might be a way of resolving the change in behaviour within virDomainBlockRebase if there has been one.

Revision history for this message
Lee Yarwood (lyarwood) wrote :

Can anyone point me at a complete n-cpu log showing the actual failure and not the test failure to find the image?

Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

Hi Lee,

$ openstack server image create --wait 645f031e-7426-4ad0-8263-5cc35e8be8a8
HTTP 404 Not Found: No image found with ID af189018-2f51-4617-88b8-d2d968d6667b

Sure - https://paste.ubuntu.com/p/qDBnD6Tn4T/ here's what happened exactly after typing the command above.

Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

One more observation: if the above command is invoked without the --wait flag, it tells the user it succeeded (and the image is left in the "queued" state); however, nova-compute.log contains the same error AND the image is never actually created.

nova-compute.log part after the command invocation: https://paste.ubuntu.com/p/6WvsWfXYQX/

$ openstack server image create 645f031e-7426-4ad0-8263-5cc35e8be8a8

| checksum | None |
| container_format | None |
| created_at | 2020-09-23T14:06:07Z |
| disk_format | None |
| file | /v2/images/ef6eb727-1e7a-49f2-9bbe-b3a703c585da/file |
| id | ef6eb727-1e7a-49f2-9bbe-b3a703c585da |
| min_disk | 40 |
| min_ram | 0 ...


Revision history for this message
Lee Yarwood (lyarwood) wrote :

2020-09-23 13:55:27.563 25090 ERROR oslo_messaging.rpc.server libvirt.libvirtError: unable to verify existence of block copy target: Permission denied

Right, so this is the same as the trace in bug 1885418.

The --wait behaviour is correct: the API is async, so without --wait the command returns successfully once the request is accepted. --wait just polls until the snapshot has been taken.
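
(A sketch of observing this by hand; on an unfixed host the queued image is eventually removed when the snapshot fails, which is why --wait ends with a 404:)

  $ openstack server image create demo             # returns as soon as the request is accepted
  $ openstack image show <image-uuid> -c status    # manual polling: "queued" at first, then a 404 once the failed snapshot is cleaned up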

I'm just building a focal env now with the required versions of libvirt and QEMU to hit this.

Revision history for this message
Lee Yarwood (lyarwood) wrote :

I can't reproduce this on Focal with QEMU 1:4.2-3ubuntu6.6 and libvirt 6.0.0-0ubuntu8.3, or on Fedora 32 with QEMU 4.2.1-1 and libvirt 6.1.0-4. Looking at the libvirt code [1] I assume this could be an AppArmor denial?

Moving the bug to incomplete for OpenStack Nova for the time being until we have a reproducer against an upstream devstack based env.

[1] https://github.com/libvirt/libvirt/blob/f253dc90f528d884562075ca7c2ae7143537d5c5/src/util/virfile.c#L2083-L2142

Changed in nova:
status: New → Incomplete
Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

> I'm just building a focal env

FWIW, the env I'm running the reproducer on is Bionic, but I think that shouldn't matter too much as long as the OpenStack and libvirt versions are the same.

Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

> I assume this could be an app-armour denial?

I can see the following in dmesg on the compute node:

Sep 23 13:03:06 node09 kernel: [12572.344288] audit: type=1400 audit(1600866186.496:34): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-645f031e-7426-4ad0-8263-5cc35e8be8a8" pid=8563 comm="apparmor_parser"
Sep 23 13:51:58 node09 kernel: [15504.209400] audit: type=1400 audit(1600869118.381:35): apparmor="DENIED" operation="ptrace" profile="libvirtd" pid=7242 comm="libvirtd" requested_mask="trace" denied_mask="trace" peer="/usr/bin/nova-compute"
Sep 23 13:51:59 node09 kernel: [15505.337932] audit: type=1400 audit(1600869119.509:36): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/bin/nova-compute" pid=25073 comm="apparmor_parser"
Sep 23 13:52:03 node09 kernel: [15509.470269] audit: type=1400 audit(1600869123.641:37): apparmor="DENIED" operation="ptrace" profile="libvirtd" pid=7242 comm="libvirtd" requested_mask="trace" denied_mask="trace" peer="/usr/bin/nova-compute"

However, nothing new appears when I invoke the "server image create" command.

Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

I don't think this is an AppArmor issue, since I tried with AppArmor fully disabled:

root@node09:~# echo '' > /var/log/nova/nova-compute.log
root@node09:~# aa-enabled
No - disabled at boot.
root@node09:~# aa-status
apparmor module is loaded.
apparmor filesystem is not mounted.

<did the openstack server image create in the second tab>

root@node09:~# grep -RiP 'denied' /var/log/nova
/var/log/nova/nova-compute.log:2020-09-23 15:47:33.126 2144 ERROR oslo_messaging.rpc.server [req-9516af76-7462-42f3-9e17-61d0fcb17b96 453be1c526c1444ca84ce6be692efd4b 756410002f0749709c7b3406b1949c5a - 0cfbc1f92c474a71afeeb6326d036dc2 0cfbc1f92c474a71afeeb6326d036dc2] Exception during message handling: libvirt.libvirtError: unable to verify existence of block copy target: Permission denied
/var/log/nova/nova-compute.log:2020-09-23 15:47:33.126 2144 ERROR oslo_messaging.rpc.server libvirt.libvirtError: unable to verify existence of block copy target: Permission denied

Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

Also, each snapshot attempt results in the following in the libvirtd logs:

Sep 23 15:47:32 node09 libvirtd[1757]: invalid argument: disk vda does not have an active block job
Sep 23 15:47:32 node09 libvirtd[1757]: invalid argument: disk vda does not have an active block job
Sep 23 15:47:32 node09 libvirtd[1757]: unable to verify existence of block copy target: Permission denied

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Access is denied within a tmp dir created during the snapshot attempt:

$ sudo ls -al /var/lib/nova/instances/snapshots/tmpkajuir8o
total 204
drwx-----x 2 nova nova 4096 Sep 23 19:12 .
drwxr-x--- 3 nova nova 4096 Sep 23 19:12 ..
-rw-r--r-- 1 nova nova 197248 Sep 23 19:12 0ece1fb912104f2c849ea4bd6036712c.delta

If I chmod /var/lib/nova/instances/snapshots/tmpkajuir8o to 777 the snapshot is successful.

In that case the user/group of the delta file changes from nova:nova to libvirt-qemu:kvm. So it appears that libvirt-qemu needs access to the tmp directory.

The tmp directory is created at run-time and I'm not yet sure how the permissions are determined. The --x for other seems odd.
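
(A quick diagnostic sketch, assuming the qemu process runs as libvirt-qemu, as the delta file ownership above suggests; the tmp dir name varies per run:)

  $ sudo -u libvirt-qemu test -x /var/lib/nova/instances/snapshots/tmpkajuir8o && echo allowed || echo denied
  denied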

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This is not related (it's been there since 2015) but the o+x can be explained by nova/virt/libvirt/driver.py:

 2384     with utils.tempdir(dir=snapshot_directory) as tmpdir:
 2385         try:
 2386             out_path = os.path.join(tmpdir, snapshot_name)
 2387             if live_snapshot:
 2388                 # NOTE(xqueralt): libvirt needs o+x in the tempdir
 2389                 os.chmod(tmpdir, 0o701)
 2390                 self._live_snapshot(context, instance, guest,
 2391                                     disk_path, out_path, source_format,
 2392                                     image_format, instance.image_meta)
 2393             else:
 2394                 root_disk.snapshot_extract(out_path, image_format)
 2395             LOG.info("Snapshot extracted, beginning image upload",
 2396                      instance=instance)

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I was thinking this may be fixed by the following but so far no luck: https://bugs.launchpad.net/ubuntu/+source/nova/+bug/1885269

Changed in charm-nova-compute:
assignee: nobody → Corey Bryant (corey.bryant)
Revision history for this message
Corey Bryant (corey.bryant) wrote :

I'm fairly certain this is a package bug so I'm going to triage against the package for now.

Changed in nova (Ubuntu):
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Corey Bryant (corey.bryant)
Changed in charm-nova-compute:
assignee: Corey Bryant (corey.bryant) → nobody
status: New → Invalid
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Some directory comparisons (after enabling ussuri-proposed with the fix for 1885269). I'm seeing no differences, except that the snapshot is successful for bionic-train and still fails for focal-ussuri:

bionic-train:

ubuntu@juju-d93333-zaza-4dbb8b0e6cc9-21:~$ ls -al /var/lib/nova/instances/snapshots
total 12
drwxr-xr-x 3 nova nova 4096 Sep 23 20:13 .
drwxr-xr-x 6 nova nova 4096 Sep 23 19:55 ..
drwx-----x 2 nova nova 4096 Sep 23 20:13 tmpbd7qzli0

ubuntu@juju-d93333-zaza-4dbb8b0e6cc9-21:~$ sudo ls -al /var/lib/nova/instances/snapshots/tmpbd7qzli0/
total 204
drwx-----x 2 nova nova 4096 Sep 23 20:13 .
drwxr-xr-x 3 nova nova 4096 Sep 23 20:13 ..
-rw-r--r-- 1 nova nova 196928 Sep 23 20:13 d1af0f3a804e4109830ef78155b7a4ab.delta

ubuntu@juju-d93333-zaza-4dbb8b0e6cc9-21:~$ sudo ls -al /var/lib/nova/instances/snapshots/tmpbd7qzli0/
total 1731224
drwx-----x 2 nova nova 4096 Sep 23 20:14 .
drwxr-xr-x 3 nova nova 4096 Sep 23 20:13 ..
-rw-r--r-- 1 nova nova 1149894656 Sep 23 20:14 d1af0f3a804e4109830ef78155b7a4ab
-rw-r--r-- 1 nova kvm 622985216 Sep 23 20:14 d1af0f3a804e4109830ef78155b7a4ab.delta

ubuntu@juju-d93333-zaza-4dbb8b0e6cc9-21:~$ ls -al /var/lib/nova
total 40
drwxr-xr-x 10 nova nova 4096 Sep 23 19:07 .
drwxr-xr-x 52 root root 4096 Sep 23 19:10 ..
drwxr-xr-x 2 nova root 4096 Sep 23 19:14 .ssh
drwxr-xr-x 6 nova nova 4096 Sep 23 19:00 CA
drwxr-xr-x 2 nova nova 4096 Jun 17 13:47 buckets
drwxr-xr-x 2 nova nova 4096 Jun 17 13:47 images
drwxr-xr-x 6 nova nova 4096 Sep 23 20:14 instances
drwxr-xr-x 2 nova nova 4096 Jun 17 13:47 keys
drwxr-xr-x 2 nova nova 4096 Jun 17 13:47 networks
drwxr-xr-x 2 nova nova 4096 Jun 17 13:47 tmp

focal-ussuri:

ubuntu@node06:~$ sudo ls -al /var/lib/nova/instances/snapshots/
total 12
drwxr-xr-x 3 nova nova 4096 Sep 23 20:20 .
drwxr-xr-x 6 nova nova 4096 Sep 23 20:02 ..
drwx-----x 2 nova nova 4096 Sep 23 20:20 tmpt_x2bd57

ubuntu@node06:~$ sudo ls -al /var/lib/nova/instances/snapshots/tmpt_x2bd57
total 204
drwx-----x 2 nova nova 4096 Sep 23 20:20 .
drwxr-xr-x 3 nova nova 4096 Sep 23 20:20 ..
-rw-r--r-- 1 nova nova 197248 Sep 23 20:20 f7d09ac696a04cb5b31925850e4dcfef.delta

ubuntu@node06:~$ ls -al /var/lib/nova
total 40
drwxr-xr-x 10 nova nova 4096 Sep 23 09:39 .
drwxr-xr-x 55 root root 4096 Sep 23 09:46 ..
drwxr-xr-x 2 nova root 4096 Sep 23 09:57 .ssh
drwxr-xr-x 6 nova nova 4096 Sep 23 09:37 CA
drwxr-xr-x 2 nova nova 4096 May 16 00:08 buckets
drwxr-xr-x 2 nova nova 4096 May 16 00:08 images
drwxr-xr-x 6 nova nova 4096 Sep 23 20:02 instances
drwxr-xr-x 2 nova nova 4096 May 16 00:08 keys
drwxr-xr-x 2 nova nova 4096 May 16 00:08 networks
drwxr-xr-x 2 nova nova 4096 May 16 00:08 tmp

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I'm fairly certain that 1885269 fixes this. @vlad it's working for node-06 and instance 049f76c6-3f6d-4299-b332-bf4c264b8741 on your deployment. I upgraded all of your nova-compute-kvm units to ussuri-proposed and it didn't work at first. Either it was something else I changed, or a restart of libvirtd was also needed, and it is working now. I'm deploying a new ussuri-proposed to try there. Your nova-compute-kvm units have varying degrees of changes from me.

Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

@Corey, you mention that it works on the instance with id 049f..., but that instance was in SHUTOFF state (which is why it worked). If you start the instance:

$ os server list

| 049f76c6-3f6d-4299-b332-bf4c264b8741 | ubuntu-tests2 | SHUTOFF | internal=10.0.0.30 | ubuntu-bionic-cloudimg | m1.medium |
| 645f031e-7426-4ad0-8263-5cc35e8be8a8 | ubuntu-test | ACTIVE | internal=10.0.0.161, 172.27.86.120 | ubuntu-bionic-cloudimg | m1.medium |

$ os server start 049f76c6-3f6d-4299-b332-bf4c264b8741
$ os server image create --wait 049f76c6-3f6d-4299-b332-bf4c264b8741
HTTP 404 Not Found: No image found with ID 9b5dd242-9d46-41de-955e-5ff97ef50d28

Same as in original issue.

Lee Yarwood (lyarwood)
Changed in nova:
status: Incomplete → Invalid
Revision history for this message
Corey Bryant (corey.bryant) wrote :

@Vlad, confirmed on my own deployment that it is not fixed in ussuri-proposed

Revision history for this message
Maysam Fazeli (maysamfz) wrote :

@Vlad, I had reported this bug previously on https://bugs.launchpad.net/nova/+bug/1885418.

My research with different scenarios showed that the problem is probably related to the latest versions of the libvirtd libraries and modules. I tested previous versions of libvirtd and they worked seamlessly. So this may help you in resolving the issue.

Thank you

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I'm still really confused by this, but here are some thoughts on the nova os.chmod() call mentioned in an earlier comment that would fix this.

If I chmod the tmp dir that gets created by nova (e.g. /var/lib/nova/instances/snapshots/tmpkajuir8o) to 755 just before the snapshot (after the nova chmod), the snapshot is successful.

As mentioned in https://bugs.launchpad.net/ubuntu/+source/nova/+bug/1896617/comments/18, the upstream nova code sets permissions for the tmp dir with:

os.chmod(tmpdir, 0o701)

That code has been that way since 2015, so it's not new in ussuri, see git blame:

824c3706a3e nova/virt/libvirt/driver.py (Nicolas Simonds 2015-07-23 12:47:24 -0500 2388) # NOTE(xqueralt): libvirt needs o+x in the tempdir
824c3706a3e nova/virt/libvirt/driver.py (Nicolas Simonds 2015-07-23 12:47:24 -0500 2389) os.chmod(tmpdir, 0o701)

However, this seems like a heavy-handed chmod if the goal, as the comment above it mentions, is simply to give libvirt o+x on the tempdir. I say this because it overrides any default permissions previously set by the operating system.

It seems that this should really be a lighter touch, such as the following (equivalent to chmod o+x tmpdir):

  st = os.stat(tmpdir)
  os.chmod(tmpdir, st.st_mode | stat.S_IXOTH)

That would fix this bug for us, but it still doesn't explain what changed in Ubuntu to cause this to fail. We did make some permissions changes in the nova package in focal, but comparing file/directory permissions above in comment #21 (with ussuri-proposed), I'm seeing no differences.

Changed in nova:
status: Invalid → New
Revision history for this message
Corey Bryant (corey.bryant) wrote :

I moved this back to New for upstream nova.

@Lee or anyone else from upstream nova, do you have an opinion on changing the chmod in nova/virt/libvirt/driver.py from:

os.chmod(tmpdir, 0o701)

to:

st = os.stat(tmpdir)
os.chmod(tmpdir, st.st_mode | stat.S_IXOTH)

Revision history for this message
Corey Bryant (corey.bryant) wrote :

It turns out the tempfile.mkdtemp() call in nova/utils.py creates the directory with restrictive permissions, in our case 0o700.
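
(A minimal standalone sketch of both observations, runnable outside nova:)

  import os
  import stat
  import tempfile

  tmpdir = tempfile.mkdtemp()                        # mkdtemp() creates the dir with mode 0o700
  print(oct(stat.S_IMODE(os.stat(tmpdir).st_mode)))  # 0o700

  # what nova does today: a blanket chmod that discards whatever the OS set up
  os.chmod(tmpdir, 0o701)

  # the proposed lighter touch: only add o+x, keeping the existing bits
  st = os.stat(tmpdir)
  os.chmod(tmpdir, st.st_mode | stat.S_IXOTH)
  print(oct(stat.S_IMODE(os.stat(tmpdir).st_mode)))  # also 0o701 here, but derived from the existing mode

  os.rmdir(tmpdir)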

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This is caused by the libvirt-qemu user being added to the nova group as part of the nova-compute-libvirt package post-install script.

Following up on comment #17 above, the user/group of the delta file changes from nova:nova to libvirt-qemu:kvm, whereas in comment #21 above, the user/group of the delta file changes to nova:kvm.

Dropping libvirt-qemu from nova in /etc/group fixes this as a work-around. I'm building packages with a fix now and will get this fixed for ussuri and victoria.
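
(A sketch of applying the work-around by hand; later comments confirm that restarting libvirt-guests, not just libvirtd, is what makes it take effect:)

  $ sudo gpasswd -d libvirt-qemu nova        # drop libvirt-qemu from the nova group
  $ sudo systemctl restart libvirt-guests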

Marking the upstream bug as invalid.

Changed in nova:
status: New → Invalid
Revision history for this message
Corey Bryant (corey.bryant) wrote :

As background, adding the libvirt-qemu user to the nova group was an attempt to make the /var/lib/nova/* directories more restricted, but that proved to be difficult with ownership changes between nova and libvirt/qemu.

summary: - Creation of image (or live snapshot) from the existing VM fails if
+ [SRU] Creation of image (or live snapshot) from the existing VM fails if
libvirt-image-backend is configured to qcow2 starting from Ussuri
Changed in nova (Ubuntu Focal):
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Corey Bryant (corey.bryant)
description: updated
Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

Can confirm that

> Dropping libvirt-qemu from nova in /etc/group fixes this as a work-around.

AND restarting the compute node helped. Without the reboot the fix didn't get applied (probably I needed to restart something other than the libvirtd and nova-compute services).

Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

Can confirm that 'service libvirt-guests restart' did the trick (I had been restarting only libvirtd itself).

Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

Workaround that finally worked for me:

juju run --application nova-compute-kvm 'sudo usermod -G kvm,libvirt-qemu libvirt-qemu; sudo service libvirt-guests restart'

Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

However, the above has broken new instance creation:

$ os server show demo-http -f yaml
OS-DCF:diskConfig: MANUAL
OS-EXT-AZ:availability_zone: ''
OS-EXT-SRV-ATTR:host: null
OS-EXT-SRV-ATTR:hypervisor_hostname: null
OS-EXT-SRV-ATTR:instance_name: instance-000002d8
OS-EXT-STS:power_state: NOSTATE
OS-EXT-STS:task_state: null
OS-EXT-STS:vm_state: error
OS-SRV-USG:launched_at: null
OS-SRV-USG:terminated_at: null
accessIPv4: ''
accessIPv6: ''
addresses: ''
config_drive: ''
created: '2020-09-24T22:14:20Z'
fault:
  code: 500
  created: '2020-09-24T22:14:48Z'
  details: "Traceback (most recent call last):\n File \"/usr/lib/python3/dist-packages/nova/conductor/manager.py\"\
    , line 652, in build_instances\n filter_properties, instances[0].uuid)\n File\
    \ \"/usr/lib/python3/dist-packages/nova/scheduler/utils.py\", line 919, in populate_retry\n\
    \ raise exception.MaxRetriesExceeded(reason=msg)\nnova.exception.MaxRetriesExceeded:\
    \ Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance\
    \ 711291b9-19fc-4e84-bc3e-423eda042630. Last exception: Cannot access storage\
    \ file '/var/lib/nova/instances/711291b9-19fc-4e84-bc3e-423eda042630/disk' (as\
    \ uid:64055, gid:117): Permission denied\n"
  message: 'Exceeded maximum number of retries. Exceeded max scheduling attempts 3
    for instance 711291b9-19fc-4e84-bc3e-423eda042630. Last exception: Cannot access
    storage file ''/var/lib/nova/instances/711291b9-19fc-4e84-bc3e-423eda042630/disk''
    (as uid:64055, gid:117'
flavor: m1.medium (3)
hostId: ''
id: 711291b9-19fc-4e84-bc3e-423eda042630
image: bionic-kvm (63727d33-4312-4c22-843e-2f5dfe4cb24c)
key_name: ubuntu-keypair
name: demo-http
project_id: 491dec5fd31d45108bd5fb8bb1486ffe
properties: ''
status: ERROR
updated: '2020-09-24T22:14:48Z'
user_id: 8170a7c8b627431eb37444dc504f84cb
volumes_attached: ''

description: updated
Revision history for this message
Corey Bryant (corey.bryant) wrote :

@vlad, I am able to create instances after applying the fixed package. Did you restart nova and libvirt daemons?

Revision history for this message
Corey Bryant (corey.bryant) wrote :

@vlad I have patched packages uploaded now, the focal version is awaiting SRU team review.

Revision history for this message
Timo Aaltonen (tjaalton) wrote : Please test proposed package

Hello Vladimir, or anyone else affected,

Accepted nova into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/nova/2:21.0.0-0ubuntu0.20.04.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in nova (Ubuntu Focal):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-focal
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Vladimir, or anyone else affected,

Accepted nova into ussuri-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:ussuri-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-ussuri-needed to verification-ussuri-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-ussuri-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-ussuri-needed
Changed in cloud-archive:
status: Triaged → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2:22.0.0~b3~git2020091410.76b2fbd90e-0ubuntu2

---------------
nova (2:22.0.0~b3~git2020091410.76b2fbd90e-0ubuntu2) groovy; urgency=medium

  * d/nova-compute-libvirt.postinst: Drop libvirt-qemu user from nova group.
    This is no longer needed with recent /var/lib/nova permission changes and
    causes live snapshots to fail (LP: #1896617).

 -- Corey Bryant <email address hidden> Thu, 24 Sep 2020 15:56:15 -0400

Changed in nova (Ubuntu Groovy):
status: Triaged → Fix Released
Revision history for this message
Michael Skalka (mskalka) wrote :

Dropping this down to field-high as a successful workaround has been found and tested.

Revision history for this message
James Page (james-page) wrote :

One comment on the fix in proposed - it does not actually update an existing deployment to remove the libvirt-qemu user from the nova group. Was this intentional? I appreciate that all qemu processes need to be restarted to actually resolve this issue as well.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

@James, That's not intentional, I need to fix that. Thanks for reviewing.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I've uploaded nova 2:22.0.0~b3~git2020091410.76b2fbd90e-0ubuntu3 and nova 2:21.0.0-0ubuntu0.20.04.4 to the groovy and focal unapproved queues, respectively. This version will remove the libvirt-qemu user from the nova group on upgrade if it is part of the nova group.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

nova 2:21.1.0-0ubuntu1 has been uploaded to the focal unapproved queue and supersedes 2:21.0.0-0ubuntu0.20.04.4.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I've updated the tags to verification-failed as upgrades were not fixed in the first upload, only new installs were fixed. Upgrades are fixed in the uploads that are currently in the groovy and focal unapproved queues.

tags: added: verification-failed verification-failed-focal verification-ussuri-failed
removed: verification-needed verification-needed-focal verification-ussuri-needed
Revision history for this message
Timo Aaltonen (tjaalton) wrote :

Hello Vladimir, or anyone else affected,

Accepted nova into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/nova/2:21.1.0-0ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

tags: added: verification-needed verification-needed-focal
removed: verification-failed verification-failed-focal
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Vladimir, or anyone else affected,

Accepted nova into ussuri-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:ussuri-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-ussuri-needed to verification-ussuri-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-ussuri-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-ussuri-needed
removed: verification-ussuri-failed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package nova - 2:22.0.0~b3~git2020091410.76b2fbd90e-0ubuntu3~cloud0
---------------

 nova (2:22.0.0~b3~git2020091410.76b2fbd90e-0ubuntu3~cloud0) focal-victoria; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 nova (2:22.0.0~b3~git2020091410.76b2fbd90e-0ubuntu3) groovy; urgency=medium
 .
   * d/nova-compute-libvirt.postinst: Ensure libvirt-qemu user is removed
     from nova group on package upgrade (LP: #1896617).

Changed in cloud-archive:
status: Fix Committed → Fix Released
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Testing was successful on focal-proposed. Please see attached document.

tags: added: verification-done verification-done-focal
removed: verification-needed verification-needed-focal
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Test was successful on ussuri-proposed. Please see attached document.

tags: added: verification-ussuri-done
removed: verification-ussuri-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2:21.1.0-0ubuntu1

---------------
nova (2:21.1.0-0ubuntu1) focal; urgency=medium

  * New stable point release for OpenStack Ussuri (LP: #1896476).

nova (2:21.0.0-0ubuntu0.20.04.4) focal; urgency=medium

  * d/nova-compute-libvirt.postinst: Ensure libvirt-qemu user is removed
    from nova group on package upgrade (LP: #1896617).

nova (2:21.0.0-0ubuntu0.20.04.3) focal; urgency=medium

  * d/nova-compute-libvirt.postinst: Drop libvirt-qemu user from nova group.
    This is no longer needed with recent /var/lib/nova permission changes and
    causes live snapshots to fail (LP: #1896617).

 -- Chris MacNaughton <email address hidden> Mon, 21 Sep 2020 12:53:36 +0000

Changed in nova (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for nova has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for nova has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package nova - 2:21.1.0-0ubuntu1~cloud0
---------------

 nova (2:21.1.0-0ubuntu1~cloud0) bionic-ussuri; urgency=medium
 .
   * New upstream release for the Ubuntu Cloud Archive.
 .
 nova (2:21.1.0-0ubuntu1) focal; urgency=medium
 .
   * New stable point release for OpenStack Ussuri (LP: #1896476).
 .
 nova (2:21.0.0-0ubuntu0.20.04.4) focal; urgency=medium
 .
   * d/nova-compute-libvirt.postinst: Ensure libvirt-qemu user is removed
     from nova group on package upgrade (LP: #1896617).

Revision history for this message
Maysam Fazeli (maysamfz) wrote :

Guys, I followed the procedures and updated the packages, and the libvirt-qemu user has been removed from the nova group, but I'm still getting the same error. The image type is raw, not qcow2, though. Here are the logs:

2020-10-31 17:41:15.579 1735 INFO nova.compute.manager [req-4a879436-1412-45ad-b461-12aaceec4a72 53151ac9de83404882af6c50c66c0278 adf11129db9f4dc494a848389f1d82e0 - 29dc2dc4b31344cfbbb7e896e44026d6 29dc2dc4b31344cfbbb7e896e44026d6] [instance: e05d55df-85e2-44e7-8ea7-f7f060fbc3ba] Successfully reverted task state from image_pending_upload on failure for instance.
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server [req-4a879436-1412-45ad-b461-12aaceec4a72 53151ac9de83404882af6c50c66c0278 adf11129db9f4dc494a848389f1d82e0 - 29dc2dc4b31344cfbbb7e896e44026d6 29dc2dc4b31344cfbbb7e896e44026d6] Exception during message handling: libvirt.libvirtError: unable to verify existence of block copy target: Permission denied
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 2432, in snapshot
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server metadata['location'] = root_disk.direct_snapshot(
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/virt/libvirt/imagebackend.py", line 452, in direct_snapshot
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server raise NotImplementedError(_('direct_snapshot() is not implemented'))
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server NotImplementedError: direct_snapshot() is not implemented
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server During handling of the above exception, another exception occurred:
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 276, in dispatch
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 196, in _do_dispatch
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/exception_wrapper.py", line 77, in wrapped
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server _emit_exception_notification(
2020-10-31 17:41:15.583 1735 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packa...

Revision history for this message
Maysam Fazeli (maysamfz) wrote :

The packages installed are as follows:

root@:/var/lib/nova/instances/e05d55df-85e2-44e7-8ea7-f7f060fbc3ba# apt policy nova-common
nova-common:
  Installed: 2:21.1.0-0ubuntu1
  Candidate: 2:21.1.0-0ubuntu1
  Version table:
 *** 2:21.1.0-0ubuntu1 500
        500 http://us.archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
        500 http://us.archive.ubuntu.com/ubuntu focal-updates/main i386 Packages
        100 /var/lib/dpkg/status
     2:21.0.0~b3~git2020041013.57ff308d6d-0ubuntu2 500
        500 http://us.archive.ubuntu.com/ubuntu focal/main amd64 Packages
        500 http://us.archive.ubuntu.com/ubuntu focal/main i386 Packages

I rebooted after upgrading and followed all procedures as stated in the posts above, but I'm still getting the error.

Revision history for this message
Maysam Fazeli (maysamfz) wrote :

This is not a new install, just a system update; the disk file permissions are as below:

root@:/var/lib/nova/instances/e05d55df-85e2-44e7-8ea7-f7f060fbc3ba# ls -lh
total 100G
-rw------- 1 root root 0 Oct 31 18:24 console.log
-rw-r--r-- 1 libvirt-qemu kvm 100G Oct 31 19:24 disk
-rw-r--r-- 1 nova nova 77 Sep 9 17:14 disk.info

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Maysam, did you restart libvirt daemons?

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Ah, you rebooted so daemons were restarted.

Revision history for this message
Maysam Fazeli (maysamfz) wrote :

Hi Corey, yes, I rebooted a few times but no luck yet. So what could the issue be? If you need more information or output (permissions, special configs, etc.) let me know and I can provide it. This started happening after the Focal/Ussuri update, which changed some user permissions on directories.

Revision history for this message
Maysam Fazeli (maysamfz) wrote :

Is there a possibility that the libvirt Python libraries in v6+ have also made permission changes that interact with nova and cause this to fail?

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hi Maysam, to debug you could try temporarily changing permissions like I did in comment #17. It is likely related to the permission changes in the nova package. One thought is that there could be a permission/ownership in /var/lib/nova/instances/ from prior to the fix that is preventing access after the fix is applied. Was the instance created prior to upgrading the nova package? Does the same issue occur if you create an instance after upgrading to the new nova package? I hope to get to this soon to debug/fix, but if you can gather any more details in the meantime, that would be helpful (see the sketch below).
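
(A sketch of the sort of detail that would help, using paths from this thread; <instance-uuid> is a placeholder:)

  $ ls -al /var/lib/nova /var/lib/nova/instances /var/lib/nova/instances/<instance-uuid>
  $ getent group nova kvm libvirt-qemu
  $ sudo ls -al /var/lib/nova/instances/snapshots   # while a snapshot attempt is running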

Revision history for this message
Corey Bryant (corey.bryant) wrote (last edit ):

I'm opening this bug back up for upstream nova awareness.

To summarize, I'll recap comment #27 above and add some more details about what the issue is.

In nova/virt/libvirt/driver.py there is a chmod on a tempdir that is made with the assumption that libvirt's access is evaluated by the "other users" mode bits.

# NOTE(xqueralt): libvirt needs o+x in the tempdir
os.chmod(tmpdir, 0o701)

In the case of Ubuntu, we need to ensure the nova package remains functional on hardened systems. A big part of the hardening results in zeroing out the "other users" mode bits in /var/lib/nova. As a result, we added the libvirt-qemu user to the nova group, as it needs access to files/dirs in /var/lib/nova (most files/dirs in /var/lib/nova are owned by nova:nova). The result of adding libvirt-qemu to the nova group is that its access to files/dirs is often evaluated by its membership in the nova group. Therefore the 0o701 permissions of the tempdir deny access to libvirt-qemu.

For example:
$ sudo ls -al /var/lib/nova/instances/snapshots/tmpkajuir8o
total 204
drwx-----x 2 nova nova 4096 Sep 23 19:12 . # <--- libvirt-qemu denied access as it is in nova group
drwxr-x--- 3 nova nova 4096 Sep 23 19:12 ..
-rw-r--r-- 1 nova nova 197248 Sep 23 19:12 0ece1fb912104f2c849ea4bd6036712c.delta
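
(A standalone reproduction of the access-class behaviour, assuming a host where libvirt-qemu is still in the nova group:)

  $ sudo install -d -m 0701 -o nova -g nova /tmp/permdemo
  $ sudo -u libvirt-qemu test -x /tmp/permdemo && echo allowed || echo denied
  denied    # libvirt-qemu matches the group class (nova), so the "other" --x bit is never consulted
  $ sudo rmdir /tmp/permdemo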

To fix this in ubuntu, I'm looking to carry the following patch:

  - # NOTE(xqueralt): libvirt needs o+x in the tempdir
  - os.chmod(tmpdir, 0o701)
  + # NOTE(coreycb): libvirt needs g+x in the tempdir
  + st = os.stat(tmpdir)
  + os.chmod(tmpdir, st.st_mode | stat.S_IXGRP)

I don't know what the right answer is upstream. I don't know that a chmod 0o711 makes sense either. If 0o710 made sense for all users/distros we could move to that, but that's hard to assess. For now I'll patch in Ubuntu. I'm planning to do this work in LP: #1967956 to consolidate with similar work.

Changed in nova:
status: Invalid → New
Revision history for this message
Uggla (rene-ribaud) wrote :

Sounds like a valid bug.
I think setting 0o711 should not cause any issues and should prevent the problem you are facing on hardened Ubuntu systems. By the way, it seems more logical than 0o701.

Setting it to 0o710 might be better still, but like you, I'm unsure about distro side effects, and I lack the experience with nova to be sure. Advice from the nova team would help here.

Changed in nova:
status: New → Triaged
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Putting the bug to Opinion/Wishlist as this sounds like half a Nova problem (since we set the chmod) and half a distro-specific configuration issue.

I'm not against any modification, but we need to correctly address this gap, ideally as a blueprint.

Changed in nova:
status: Triaged → Opinion
importance: Undecided → Wishlist
Revision history for this message
sean mooney (sean-k-mooney) wrote :

So the general assumption is that nova is in the libvirt-qemu group, not the other way around.

So really nova should probably chown the tempdir to nova:libvirt-qemu with 770 (a sketch follows below).

The libvirt group is distro-specific too, so that would have to be configurable.
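
(A minimal sketch of that suggestion; the group name is distro-specific and therefore an assumption here, and changing a directory's group generally requires privileges:)

  import grp
  import os
  import tempfile

  tmpdir = tempfile.mkdtemp()                  # stand-in for nova's snapshot tempdir
  qemu_group = "libvirt-qemu"                  # distro-specific; would need to be configurable
  gid = grp.getgrnam(qemu_group).gr_gid
  os.chown(tmpdir, -1, gid)                    # keep the owner, hand the group to qemu
  os.chmod(tmpdir, 0o770)                      # owner+group rwx, nothing for "other"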
