libvirt.libvirtError: unable to connect to server at

Bug #1912310 reported by Lee Yarwood
This bug affects 4 people

Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

Description
===========

nova-live-migration fails due to the following error on the source host during a migration test:

https://zuul.opendev.org/t/openstack/build/76c37d93ba274cd6b90fbe0659255959/log/compute1/logs/screen-n-cpu.txt

Jan 18 12:03:25.145915 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: ERROR nova.virt.libvirt.driver [-] [instance: 9b9d2344-6c8e-4dba-a8f9-60f86e52c854] Migration operation has aborted
Jan 18 12:03:25.194466 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: INFO nova.compute.manager [-] [instance: 9b9d2344-6c8e-4dba-a8f9-60f86e52c854] Swapping old allocation on dict_keys(['022fd821-f027-4019-ac4b-cb1e1d2d5423']) held by migration 7b332f03-b32e-4cc8-8e2d-23ed7bee03bd for instance
Jan 18 12:03:25.239903 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: ERROR nova.virt.libvirt.driver [-] [instance: 9b9d2344-6c8e-4dba-a8f9-60f86e52c854] Live Migration failure: unable to connect to server at 'ubuntu-focal-limestone-regionone-0022589715:49152': Connection refused: libvirt.libvirtError: unable to connect to server at 'ubuntu-focal-limestone-regionone-0022589715:49152': Connection refused
Jan 18 12:03:25.240119 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: DEBUG nova.virt.libvirt.driver [-] [instance: 9b9d2344-6c8e-4dba-a8f9-60f86e52c854] Migration operation thread notification {{(pid=52916) thread_finished /opt/stack/nova/nova/virt/libvirt/driver.py:9448}}
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: Traceback (most recent call last):
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: File "/usr/local/lib/python3.8/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: timer()
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: File "/usr/local/lib/python3.8/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: cb(*args, **kw)
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: File "/usr/local/lib/python3.8/dist-packages/eventlet/event.py", line 175, in _do_send
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: waiter.switch(result)
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: File "/usr/local/lib/python3.8/dist-packages/eventlet/greenthread.py", line 221, in main
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: result = function(*args, **kwargs)
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: File "/opt/stack/nova/nova/utils.py", line 660, in context_wrapper
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: return func(*args, **kwargs)
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 9103, in _live_migration_operation
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: LOG.error("Live Migration failure: %s", e, instance=instance)
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: File "/usr/local/lib/python3.8/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: self.force_reraise()
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: File "/usr/local/lib/python3.8/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: six.reraise(self.type_, self.value, self.tb)
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: File "/usr/local/lib/python3.8/dist-packages/six.py", line 703, in reraise
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: raise value
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 9091, in _live_migration_operation
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: guest.migrate(self._live_migration_uri(dest),
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 649, in migrate
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: self._domain.migrateToURI3(
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: File "/usr/local/lib/python3.8/dist-packages/eventlet/tpool.py", line 190, in doit
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: result = proxy_call(self._autowrap, f, *args, **kwargs)
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: File "/usr/local/lib/python3.8/dist-packages/eventlet/tpool.py", line 148, in proxy_call
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: rv = execute(f, *args, **kwargs)
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: File "/usr/local/lib/python3.8/dist-packages/eventlet/tpool.py", line 129, in execute
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: six.reraise(c, e, tb)
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: File "/usr/local/lib/python3.8/dist-packages/six.py", line 703, in reraise
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: raise value
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: File "/usr/local/lib/python3.8/dist-packages/eventlet/tpool.py", line 83, in tworker
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: rv = meth(*args, **kwargs)
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: File "/usr/local/lib/python3.8/dist-packages/libvirt.py", line 2099, in migrateToURI3
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: raise libvirtError('virDomainMigrateToURI3() failed')
Jan 18 12:03:25.251131 ubuntu-focal-limestone-regionone-0022589731 nova-compute[52916]: libvirt.libvirtError: unable to connect to server at 'ubuntu-focal-limestone-regionone-0022589715:49152': Connection refused
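The traceback boils down to the following pattern: the libvirt error raised on the migrateToURI3 call is logged as "Live Migration failure: %s" and then re-raised so the rollback path (the allocation swap seen above) can run. A minimal stand-alone sketch of that flow, not nova's actual code (`FakeLibvirtError`, `migrate_to_uri` and `live_migration_operation` are illustrative stand-ins):

```python
class FakeLibvirtError(Exception):
    """Stands in for libvirt.libvirtError in this sketch."""


def migrate_to_uri(dest_uri):
    # Simulate the dest libvirtd refusing the connection, as in the logs.
    raise FakeLibvirtError(
        "unable to connect to server at '%s': Connection refused" % dest_uri)


def live_migration_operation(dest_uri):
    try:
        migrate_to_uri(dest_uri)
    except FakeLibvirtError as e:
        # nova logs the failure here and re-raises (via
        # excutils.save_and_reraise_exception in the real driver) so the
        # caller can roll back the migration.
        print("Live Migration failure: %s" % e)
        raise


try:
    live_migration_operation("compute2:49152")
except FakeLibvirtError as e:
    error_message = str(e)
```

The point of the pattern is that the source-side caller only ever sees the generic connection error, not the underlying reason the destination went away.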

Steps to reproduce
==================

$ cat queries/libvirt.yaml
query: >-
  message:"libvirt.libvirtError: unable to connect to server at" AND
  tags:screen-n-cpu.txt

$ elastic-recheck-query queries/libvirt.yaml
total hits: 12
build_branch
  83% master
  16% stable/ussuri
build_change
  16% 711625
  16% 743231
  16% 754702
  16% 764401
  16% 770547
build_hostids
  16% 13cc9efd4550bb9c731b89fae9c27cc77058360313dea9eebe140cac 1ccf2ea621af4ba5d658324f8716441c624749f7847700ac6aa69490
  16% 322bfced1bb62c11480e9c8cf97e066ab7cb1067b23b38e8004c6f69 23320365a917d59498b30bb8308684708801795984a6ca42b1cd06b7
  16% 422fc847b99ec49a4a48de150fd4c453e34d4b85847603a3435e1511 e1ef7b86aaeb7f61d567417ced5d641b20cff65c4c36e91a6173705f
  16% 77681beb0ddf7a44597e1e8fa6f648861b5edf5445510df6083cdb72 c4902db85fd9c45a1c8b61c8e612d117a8a8f2393128fd1cae285326
  16% 82ed50455b38e6ed7725b8020a977c59f2aaa9db51f018ed12e648b2 3a893be3caac821988438433df3cb21c389ffed93878670819a924f4 24505d497fe40296a37e97a5a1aa1caf33a2c685b883fb60eabf95d5
build_name
  16% devstack-plugin-ceph-multinode-tempest-py3
  16% neutron-ovn-tempest-full-multinode-ovs-master
  16% neutron-tempest-dvr-ha-multinode-full
  16% nova-live-migration
  16% nova-next
build_node
  83% ubuntu-focal
  16% ubuntu-bionic
build_queue
  100% check
build_status
  100% FAILURE
build_zuul_url
  100% N/A
filename
  100% compute1/logs/screen-n-cpu.txt
log_url
  16% https://7b4153b100b6f570d4fc-1ca4271300c9fb8c2811a73921fc79fe.ssl.cf2.rackcdn.com/764401/14/check/neutron-tempest-dvr-ha-multinode-full/d253afd/compute1/logs/screen-n-cpu.txt
  16% https://b0f3338313a78e19953a-f09976335bb1c2233a92ccf49f573637.ssl.cf5.rackcdn.com/754702/5/check/nova-live-migration/76c37d9/compute1/logs/screen-n-cpu.txt
  16% https://cfb2c6e30c49a7b551c1-e4b92f5e1fb788c5da82bf1447ffa549.ssl.cf2.rackcdn.com/711625/27/check/devstack-plugin-ceph-multinode-tempest-py3/677c92c/compute1/logs/screen-n-cpu.txt
  16% https://e7591f9adbfbe7dc24c4-cd07d3426af01024dbeaac2f8639d2e0.ssl.cf2.rackcdn.com/743231/9/check/nova-next/cbe27d5/compute1/logs/screen-n-cpu.txt
  16% https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_09a/770547/1/check/tempest-multinode-full-py3/09a5f18/compute1/logs/screen-n-cpu.txt
loglevel
  100% ERROR
module
  100% nova.virt.libvirt.driver
node_provider
  33% ovh-bhs1
  16% airship-kna1
  16% limestone-regionone
  16% rax-ord
  16% vexxhost-ca-ymq-1
port
  16% 39628
  16% 41358
  16% 46144
  16% 52362
  16% 58212
project
  50% openstack/neutron
  33% openstack/nova
  16% openstack/devstack-plugin-ceph
syslog_pid
  16% 19491
  16% 46189
  16% 46724
  16% 52916
  16% 53510
syslog_program
  16% ubuntu-bionic-ovh-bhs1-0022571568 nova-compute
  16% ubuntu-focal-airship-kna1-0022565736 nova-compute
  16% ubuntu-focal-limestone-regionone-0022589731 nova-compute
  16% ubuntu-focal-ovh-bhs1-0022599932 nova-compute
  16% ubuntu-focal-rax-ord-0022553677 nova-compute
tags
  50% screen-n-cpu.txt screen oslofmt
  50% screen-n-cpu.txt screen oslofmt _grokparsefailure
voting
  50% 0
  50% 1
zuul_attempts
  83% 1
  16% 2
zuul_executor
  33% ze06.openstack.org
  33% ze11.openstack.org
  16% ze04.openstack.org
  16% ze09.openstack.org

Expected result
===============
No connection issues with libvirtd.

Actual result
=============
Connection issues with libvirtd cause CI runs to fail.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   As above, master and stable/ussuri appear impacted thus far.

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   libvirt

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   N/A

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   N/A

Logs & Configs
==============

Tags: gate-failure
tags: added: gate-failure
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Lee on IRC mentioned he hit something similar today again; copy/pasting his comment from this review[1]:

Weird one, n-cpu (well the libvirt module) on the source (compute1) reports that libvirtd on the dest (controller) has refused the connection but the migration had already failed on the dest.

This smells like a python-libvirt bug tbh, surely the migrateToURI3 call on the source should be able to detect the failure on the dest and return a useful error to the caller?

-----------------------------------------------------------------------
https://zuul.opendev.org/t/openstack/build/f3b829801901417c9310ad5cc5a0e886/log/compute1/logs/screen-n-cpu.txt#11964

Jun 10 06:41:59.477301 ubuntu-focal-rax-dfw-0025050045 nova-compute[67549]: DEBUG nova.virt.libvirt.driver [-] [instance: 0a929010-7361-4372-a785-f9eb052aca46] Migration operation thread notification {{(pid=67549) thread_finished /opt/stack/nova/nova/virt/libvirt/driver.py:9921}}
Jun 10 06:41:59.486684 ubuntu-focal-rax-dfw-0025050045 nova-compute[67549]: Traceback (most recent call last):
Jun 10 06:41:59.486684 ubuntu-focal-rax-dfw-0025050045 nova-compute[67549]: File "/usr/local/lib/python3.8/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
Jun 10 06:41:59.486684 ubuntu-focal-rax-dfw-0025050045 nova-compute[67549]: timer()
Jun 10 06:41:59.486684 ubuntu-focal-rax-dfw-0025050045 nova-compute[67549]: File "/usr/local/lib/python3.8/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
Jun 10 06:41:59.486684 ubuntu-focal-rax-dfw-0025050045 nova-compute[67549]: cb(*args, **kw)
Jun 10 06:41:59.486684 ubuntu-focal-rax-dfw-0025050045 nova-compute[67549]: File "/usr/local/lib/python3.8/dist-packages/eventlet/event.py", line 175, in _do_send
Jun 10 06:41:59.486684 ubuntu-focal-rax-dfw-0025050045 nova-compute[67549]: waiter.switch(result)
Jun 10 06:41:59.486684 ubuntu-focal-rax-dfw-0025050045 nova-compute[67549]: File "/usr/local/lib/python3.8/dist-packages/eventlet/greenthread.py", line 221, in main
Jun 10 06:41:59.486684 ubuntu-focal-rax-dfw-0025050045 nova-compute[67549]: result = function(*args, **kwargs)
Jun 10 06:41:59.486684 ubuntu-focal-rax-dfw-0025050045 nova-compute[67549]: File "/opt/stack/nova/nova/utils.py", line 654, in context_wrapper
Jun 10 06:41:59.486684 ubuntu-focal-rax-dfw-0025050045 nova-compute[67549]: return func(*args, **kwargs)
Jun 10 06:41:59.486684 ubuntu-focal-rax-dfw-0025050045 nova-compute[67549]: File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 9576, in _live_migration_operation
Jun 10 06:41:59.486684 ubuntu-focal-rax-dfw-0025050045 nova-compute[67549]: LOG.error("Live Migration failure: %s", e, instance=instance)
Jun 10 06:41:59.486684 ubuntu-focal-rax-dfw-0025050045 nova-compute[67549]: File "/usr/local/lib/python3.8/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
Jun 10 06:41:59.486684 ubuntu-focal-rax-dfw-0025050045 nova-compute[67549]: self.force_reraise()
Jun 10 06:41:59.486684 ubuntu-focal-rax-dfw-0025050045 nova-compute[67549]: File "/usr/local/lib/python3.8/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
Jun 10 06:41:59.486684 ubuntu-focal-rax-dfw-0025050045 ...


Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Another hit today.

https://zuul.opendev.org/t/openstack/build/674a5f21a91d4af0861d6ab3c2cc6834/log/controller/logs/screen-n-cpu.txt#12057

Jun 11 10:31:38.867508 ubuntu-focal-rax-ord-0025071874 nova-compute[108685]: ERROR os_brick.initiator.linuxscsi [None req-1333f0cc-148b-4f4c-a845-460900ec49d7 tempest-LiveMigrationTest-666538249 tempest-LiveMigrationTest-666538249-project-admin] multipathd is not running: exit code None: oslo_concurrency.processutils.ProcessExecutionError: [Errno 2] No such file or directory

and

https://zuul.opendev.org/t/openstack/build/674a5f21a91d4af0861d6ab3c2cc6834/log/compute1/logs/screen-n-cpu.txt#11399

Jun 11 10:31:38.471494 ubuntu-focal-rax-ord-0025071879 nova-compute[68341]: ERROR nova.virt.libvirt.driver [-] [instance: 34236c1a-1cbf-448b-b36a-9daf1cf760e8] Migration operation has aborted

Jun 11 10:31:38.472905 ubuntu-focal-rax-ord-0025071879 nova-compute[68341]: ERROR nova.virt.libvirt.driver [-] [instance: 34236c1a-1cbf-448b-b36a-9daf1cf760e8] Live Migration failure: unable to connect to server at 'ubuntu-focal-rax-ord-0025071874:49152': Connection refused: libvirt.libvirtError: unable to connect to server at 'ubuntu-focal-rax-ord-0025071874:49152': Connection refused

Jun 11 10:31:38.473132 ubuntu-focal-rax-ord-0025071879 nova-compute[68341]: DEBUG nova.virt.libvirt.driver [-] [instance: 34236c1a-1cbf-448b-b36a-9daf1cf760e8] Migration operation thread notification {{(pid=68341) thread_finished /opt/stack/nova/nova/virt/libvirt/driver.py:9921}}

Jun 11 10:31:38.481216 ubuntu-focal-rax-ord-0025071879 nova-compute[68341]: Traceback (most recent call last):

Jun 11 10:31:38.481216 ubuntu-focal-rax-ord-0025071879 nova-compute[68341]: File "/usr/local/lib/python3.8/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers

Jun 11 10:31:38.481216 ubuntu-focal-rax-ord-0025071879 nova-compute[68341]: timer()

Jun 11 10:31:38.481216 ubuntu-focal-rax-ord-0025071879 nova-compute[68341]: File "/usr/local/lib/python3.8/dist-packages/eventlet/hubs/timer.py", line 59, in __call__

Jun 11 10:31:38.481216 ubuntu-focal-rax-ord-0025071879 nova-compute[68341]: cb(*args, **kw)

Jun 11 10:31:38.481216 ubuntu-focal-rax-ord-0025071879 nova-compute[68341]: File "/usr/local/lib/python3.8/dist-packages/eventlet/event.py", line 175, in _do_send

Jun 11 10:31:38.481216 ubuntu-focal-rax-ord-0025071879 nova-compute[68341]: waiter.switch(result)

Jun 11 10:31:38.481216 ubuntu-focal-rax-ord-0025071879 nova-compute[68341]: File "/usr/local/lib/python3.8/dist-packages/eventlet/greenthread.py", line 221, in main

Jun 11 10:31:38.481216 ubuntu-focal-rax-ord-0025071879 nova-compute[68341]: result = function(*args, **kwargs)

Jun 11 10:31:38.481216 ubuntu-focal-rax-ord-0025071879 nova-compute[68341]: File "/opt/stack/nova/nova/utils.py", line 654, in context_wrapper

Jun 11 10:31:38.481216 ubuntu-focal-rax-ord-0025071879 nova-compute[68341]: return func(*args, **kwargs)

Jun 11 10:31:38.481216 ubuntu-focal-rax-ord-0025071879 nova-compute[68341]: File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 9576, in _live_migration_operation

Jun 11 10:31:38.481216 ubuntu-focal-rax-ord-0025071879 nova-compute[683...


Revision history for this message
Lee Yarwood (lyarwood) wrote :

Another hit of this, but I think this time it's pointing at bug #1929446: it appears the thread on the source n-cpu stops sending keepalives to the dest libvirtd, leading to it killing the domain and the eventual lost connection on the source.

https://zuul.opendev.org/t/openstack/build/de2dc6a875fe4d058332f2631ea76685/log/controller/logs/screen-n-cpu.txt#10492

Jun 14 18:28:44.730359 ubuntu-focal-rax-iad-0025108170 nova-compute[108412]: File "/usr/local/lib/python3.8/dist-packages/libvirt.py", line 2099, in migrateToURI3
Jun 14 18:28:44.730359 ubuntu-focal-rax-iad-0025108170 nova-compute[108412]: raise libvirtError('virDomainMigrateToURI3() failed')
Jun 14 18:28:44.730359 ubuntu-focal-rax-iad-0025108170 nova-compute[108412]: libvirt.libvirtError: operation failed: Lost connection to destination host

https://zuul.opendev.org/t/openstack/build/de2dc6a875fe4d058332f2631ea76685/log/compute1/logs/libvirt/libvirtd_log.txt

82355 2021-06-14 18:27:52.063+0000: 57794: error : virKeepAliveTimerInternal:137 : internal error: connection closed due to keepalive timeout
82356 2021-06-14 18:27:52.064+0000: 57794: debug : virCloseCallbacksRun:316 : conn=0x7f4170000b60
82357 2021-06-14 18:27:52.064+0000: 57794: debug : virCloseCallbacksGetOne:262 : conn=0x7f4170000b60, thisconn=0x7f4170000b60, uuid=27f054b7-f061-4978-84fb-9b61766fe80e, cb=0x7f418ce93c50
82358 2021-06-14 18:27:52.064+0000: 57794: debug : qemuProcessAutoDestroy:7606 : vm=instance-0000000e, conn=0x7f4170000b60
82359 2021-06-14 18:27:52.064+0000: 57794: debug : qemuProcessAutoDestroy:7612 : vm=instance-0000000e has long-term job active, cancelling
82360 2021-06-14 18:27:52.064+0000: 57794: debug : qemuProcessAutoDestroy:7617 : Killing domain

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/796813

Revision history for this message
Lee Yarwood (lyarwood) wrote :

I assumed this had something to do with bug #1929446 but it's still reproducing now:

https://46859657da8176be8119-30b0a72bc63e960bd7a500e425d4bd8d.ssl.cf5.rackcdn.com/779275/3/check/nova-live-migration/1122e93/controller/logs/libvirt/libvirtd_log.txt

 86445 2021-06-24 11:16:50.947+0000: 73097: info : qemuMonitorIOWrite:450 : QEMU_MONITOR_IO_WRITE: mon=0x7f6a600d7180 buf={"execute":"migrate-incoming","arguments":{"uri":"tcp:[::]:49152"},"id":"libvirt-373"}^M
 86446 len=88 ret=88 errno=0
 86447 2021-06-24 11:16:50.948+0000: 73097: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 1624533410, "microseconds": 948823}, "event": "MIGRATION", "data": {"status": "setup"}}]
 86448 2021-06-24 11:16:50.948+0000: 73097: info : qemuMonitorJSONIOProcessLine:234 : QEMU_MONITOR_RECV_EVENT: mon=0x7f6a600d7180 event={"timestamp": {"seconds": 1624533410, "microseconds": 948823}, "event": "MIGRATION", "data": {"status": "setup"}}
 86449 2021-06-24 11:16:50.948+0000: 73097: debug : qemuMonitorJSONIOProcessEvent:181 : mon=0x7f6a600d7180 obj=0x55a2c5a87b20
 86450 2021-06-24 11:16:50.948+0000: 73097: debug : qemuMonitorEmitEvent:1198 : mon=0x7f6a600d7180 event=MIGRATION
 86451 2021-06-24 11:16:50.948+0000: 73097: debug : qemuProcessHandleEvent:549 : vm=0x7f6a600d4e70
 86452 2021-06-24 11:16:50.948+0000: 73097: debug : virObjectEventNew:631 : obj=0x55a2c5a8ac10
 86453 2021-06-24 11:16:50.949+0000: 73097: debug : qemuMonitorJSONIOProcessEvent:205 : handle MIGRATION handler=0x7f6a841388f0 data=0x55a2c58fec00
 86454 2021-06-24 11:16:50.949+0000: 73097: debug : qemuMonitorEmitMigrationStatus:1484 : mon=0x7f6a600d7180, status=setup
 86455 2021-06-24 11:16:50.949+0000: 73097: debug : qemuProcessHandleMigrationStatus:1642 : Migration of domain 0x7f6a600d4e70 instance-0000000e changed state to setup
 86456 2021-06-24 11:16:50.949+0000: 73097: debug : virObjectEventDispose:124 : obj=0x55a2c5a8ac10
 86457 2021-06-24 11:16:50.949+0000: 73097: debug : virNetlinkEventCallback:862 : dispatching to max 0 clients, called from event watch 7
 86458 2021-06-24 11:16:50.949+0000: 73097: debug : virNetlinkEventCallback:876 : event not handled.
 86459 2021-06-24 11:16:50.949+0000: 73097: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"return": {}, "id": "libvirt-373"}]
 86460 2021-06-24 11:16:50.949+0000: 73097: info : qemuMonitorJSONIOProcessLine:239 : QEMU_MONITOR_RECV_REPLY: mon=0x7f6a600d7180 reply={"return": {}, "id": "libvirt-373"}
[..]
87097 2021-06-24 11:17:20.249+0000: 73097: error : virKeepAliveTimerInternal:137 : internal error: connection closed due to keepalive timeout

I'm going to see if I can play around with the keepalive libvirtd configurables in devstack to address this.
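For reference, the server-side knobs involved are the keepalive settings in libvirtd.conf. A sketch with the upstream defaults (values here are illustrative, not what the gate jobs ended up using):

```ini
# /etc/libvirt/libvirtd.conf
keepalive_interval = 5   # seconds between keepalive probes sent to clients
keepalive_count = 5      # unanswered probes before the connection is closed
# keepalive_interval = -1 disables keepalive probing entirely.
```

With the defaults, a client that stops responding is dropped after roughly keepalive_interval * keepalive_count seconds, which matches the ~30s gap between the last monitor event and the keepalive error in the log above.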

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Of the last 10 nova-live-migration job failures, 9 were due to
  "Live Migration failure: unable to connect to server at".
In 4 of those 9 failures,
 "internal error: connection closed due to keepalive timeout"
is also visible. In the other 5 cases, where the keepalive error is not visible, I see other internal errors in the libvirtd log around the time of failure, like:

2021-07-30 08:54:22.634+0000: 57630: error : virProcessRunInFork:1159 : internal error: child reported (status=125): unable to open /dev/sdb: No such device or address
2021-07-30 08:54:22.648+0000: 57625: error : virNetClientSendInternal:2088 : internal error: client socket is closed
2021-07-30 08:54:22.656+0000: 57625: error : virNetClientSendInternal:2088 : internal error: client socket is closed
2021-07-30 08:54:22.656+0000: 57625: error : virNetClientSendInternal:2088 : internal error: client socket is closed
2021-07-30 08:56:25.528+0000: 57632: error : virProcessRunInFork:1159 : internal error: child reported (status=125): unable to open /dev/sda: No such device or address

List of builds and logs: https://paste.opendev.org/show/807827/
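This kind of triage can be sketched as a small script (not an official tool; the regex simply keys on the libvirt function name before the line number in each error line):

```python
import re
from collections import Counter

# Match "error : <function>:<line> :" in libvirtd log lines.
SIGNATURE = re.compile(r'error : (\w+):\d+')


def classify(lines):
    """Count libvirtd error lines by the emitting function."""
    counts = Counter()
    for line in lines:
        m = SIGNATURE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


# Sample lines copied from the libvirtd log excerpts in this bug.
sample = [
    "2021-07-30 08:54:22.634+0000: 57630: error : virProcessRunInFork:1159 : "
    "internal error: child reported (status=125): unable to open /dev/sdb",
    "2021-07-30 08:54:22.648+0000: 57625: error : virNetClientSendInternal:2088 : "
    "internal error: client socket is closed",
    "2021-07-30 08:54:22.656+0000: 57625: error : virNetClientSendInternal:2088 : "
    "internal error: client socket is closed",
    "2021-06-24 11:17:20.249+0000: 73097: error : virKeepAliveTimerInternal:137 : "
    "internal error: connection closed due to keepalive timeout",
]

print(classify(sample))
```

Grouping by emitting function makes it easy to see whether a given failure clusters with the keepalive timeout or with one of the other internal errors.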

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/803585

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/803585
Committed: https://opendev.org/openstack/nova/commit/4e18f07670714af605c24448636e994f022c6589
Submitter: "Zuul (22348)"
Branch: master

commit 4e18f07670714af605c24448636e994f022c6589
Author: Lee Yarwood <email address hidden>
Date: Thu Aug 5 11:55:27 2021 +0100

    zuul: Mark live migration jobs as non-voting due to bug #1912310

    Bug #1912310 details an underlying QEMU live migration failure seen
    during both volume backend and block migration based live migration
    tests in the nova-live-migration job. The failure rate is getting high
    enough now that we should mark these jobs as non-voting until it is
    resolved.

    Change-Id: Ifc211fadb3ca9c7c0a74d7dcd225bdb3f3008d26
    Related-Bug: #1912310

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Lee Yarwood <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/796813

Revision history for this message
Lee Yarwood (lyarwood) wrote (last edit ):

The issue here is that a previous test blocks libvirtd's single event loop thread (changed in libvirt 6.2.0 to per-domain event threads, FWIW) during the deletion of the instance, after that test had already passed:

LiveAutoBlockMigrationV225Test:test_live_migration_with_trunk

10135 Oct 01 11:14:33.313929 ubuntu-focal-iweb-mtl01-0026751352 nova-compute[67743]: INFO nova.compute.manager [None req-9a2b5677-0383-480a-bc26-9a70831bd975 tempest-LiveMigrationTest-22574426 tempest-LiveMigrationTest-22574426-project] [instance: 45adbb55-491d-418b-ba68-7db43d1c235b] Took 240.14 seconds to destroy the instance on the hypervisor.

This leads to the connection between the source and dest timing out during our failing test due to missed heartbeats.

I'm going to suggest that we don't run the trunk tests for the time being, move the job back to voting and resolve the above deletion issue with the os-vif and Neutron folks in a fresh bug.
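Skipping the trunk tests could look something like the following job tweak (a sketch only: the `tempest_exclude_regex` variable name is assumed from the devstack-tempest base job, and the parent/job layout here may not match what was actually proposed):

```yaml
# Hypothetical zuul job override to stop running the trunk live-migration
# test while the deletion issue is investigated.
- job:
    name: nova-live-migration
    vars:
      tempest_exclude_regex: 'test_live_migration_with_trunk'
```

The exclude regex keeps the rest of the live-migration coverage voting while the os-vif/Neutron deletion issue is tracked in its own bug.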

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/812266

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Lee Yarwood <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/812266

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/812392
Committed: https://opendev.org/openstack/nova/commit/cf4e516f553ccbc4fad6014153d9b51095297607
Submitter: "Zuul (22348)"
Branch: master

commit cf4e516f553ccbc4fad6014153d9b51095297607
Author: Lee Yarwood <email address hidden>
Date: Mon Oct 4 18:11:39 2021 +0100

    zuul: Move live migration jobs back to voting

    With the resolution of bug #1945983 within devstack we can now move our
    live migration jobs back to voting.

    Related-Bug: #1945983
    Closes-Bug: #1912310
    Depends-On: https://review.opendev.org/c/openstack/devstack/+/812391
    Depends-On: https://review.opendev.org/c/openstack/devstack/+/812925
    Change-Id: I25177554802579952510c73985287fd76681012c

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 25.0.0.0rc1

This issue was fixed in the openstack/nova 25.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/846292
