leaked threads run after test case teardown causing intermittent functional test failures

Bug #1946339 reported by Lee Yarwood
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Medium
Assigned to: Balazs Gibizer

Bug Description

https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4d3/812392/2/check/nova-tox-functional-py38/4d32a96/testr_results.html

2021-10-07 00:06:06,980 ERROR [nova.compute.manager] Instance failed to spawn
Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 6651, in _unshelve_instance
    update(
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/utils.py", line 1531, in update_pci_request_spec_with_allocated_interface_name
    raise ex(
nova.exception.UnexpectedResourceProviderNameForPCIRequest: Resource provider 125ffd8a-3163-425a-b397-f915c602b3eb used to allocate resources for the pci request 7059503b-a648-40fd-a561-5ca769304bee does not have a properly formatted name. Expected name format is <hostname>:<agentname>:<interfacename>, but got invalid-device-rp-name
2021-10-07 00:06:06,994 INFO [placement.requestlog] 127.0.0.1 "DELETE /placement/allocations/718d8d5b-8d34-4c11-9080-1952cc824086" status: 204 len: 0 microversion: 1.0
2021-10-07 00:06:06,995 INFO [nova.scheduler.client.report] Deleted allocations for instance 718d8d5b-8d34-4c11-9080-1952cc824086
2021-10-07 00:06:07,029 INFO [nova.compute.manager] Successfully reverted task state from spawning on failure for instance.
2021-10-07 00:06:07,052 ERROR [oslo_messaging.rpc.server] Exception during message handling
Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
    res = self.dispatcher.dispatch(message)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/exception_wrapper.py", line 71, in wrapped
    _emit_versioned_exception_notification(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/home/zuul/src/opendev.org/openstack/nova/nova/exception_wrapper.py", line 63, in wrapped
    return f(self, context, *args, **kw)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 183, in decorated_function
    LOG.warning("Failed to revert task state for instance. "
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 154, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/utils.py", line 1433, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 211, in decorated_function
    compute_utils.add_instance_fault_from_exc(context,
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 200, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 6595, in unshelve_instance
    do_unshelve_instance()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
    return f(*args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 6592, in do_unshelve_instance
    self._unshelve_instance(
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 6697, in _unshelve_instance
    self._nil_out_instance_obj_host_and_node(instance)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 6651, in _unshelve_instance
    update(
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/utils.py", line 1531, in update_pci_request_spec_with_allocated_interface_name
    raise ex(
nova.exception.UnexpectedResourceProviderNameForPCIRequest: Resource provider 125ffd8a-3163-425a-b397-f915c602b3eb used to allocate resources for the pci request 7059503b-a648-40fd-a561-5ca769304bee does not have a properly formatted name. Expected name format is <hostname>:<agentname>:<interfacename>, but got invalid-device-rp-name
2021-10-07 00:06:07,183 INFO [oslo_db.sqlalchemy.provision] DROP ALL OBJECTS, BACKEND sqlite://
2021-10-07 00:06:07,201 INFO [oslo_db.sqlalchemy.provision] DISPOSE ENGINE Engine(sqlite://)
}}}

stderr: {{{
Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/_drivers/impl_fake.py", line 207, in _send
    reply, failure = reply_q.get(timeout=timeout)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/eventlet/queue.py", line 322, in get
    return waiter.wait()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/eventlet/queue.py", line 141, in wait
    return get_hub().switch()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 313, in switch
    return self.greenlet.switch()
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 200, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 2263, in _do_build_and_run_instance
    self.compute_task_api.build_instances(context, [instance],
  File "/home/zuul/src/opendev.org/openstack/nova/nova/conductor/api.py", line 119, in build_instances
    self.conductor_compute_rpcapi.build_instances(context,
  File "/home/zuul/src/opendev.org/openstack/nova/nova/conductor/rpcapi.py", line 390, in build_instances
    cctxt.cast(context, 'build_instances', **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/fixtures/_fixtures/monkeypatch.py", line 89, in avoid_get
    return captured_method(*args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/rpc/client.py", line 189, in call
    result = self.transport._send(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/transport.py", line 123, in _send
    return self._driver.send(target, ctxt, message,
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/_drivers/impl_fake.py", line 222, in send
    return self._send(target, ctxt, message, wait_for_reply, timeout,
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/_drivers/impl_fake.py", line 213, in _send
    raise oslo_messaging.MessagingTimeout(
oslo_messaging.exceptions.MessagingTimeout: No reply on topic conductor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1799, in _execute_context
    self.dialect.do_execute(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 717, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: no such table: instance_faults

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/utils.py", line 1433, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 211, in decorated_function
    compute_utils.add_instance_fault_from_exc(context,
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/utils.py", line 153, in add_instance_fault_from_exc
    fault_obj.create()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_versionedobjects/base.py", line 226, in wrapper
    return fn(self, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/objects/instance_fault.py", line 76, in create
    db_fault = db.instance_fault_create(self._context, values)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/db/main/api.py", line 190, in wrapper
    return f(context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/db/main/api.py", line 3881, in instance_fault_create
    fault_ref.save(context.session)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_db/sqlalchemy/models.py", line 48, in save
    session.flush()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 3339, in flush
    self._flush(objects)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 3479, in _flush
    transaction.rollback(_capture_exception=True)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 3439, in _flush
    flush_context.execute()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 456, in execute
    rec.execute(self)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 630, in execute
    util.preloaded.orm_persistence.save_obj(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 242, in save_obj
    _emit_insert_statements(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 1219, in _emit_insert_statements
    result = connection._execute_20(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1611, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1478, in _execute_clauseelement
    ret = self._execute_context(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1842, in _execute_context
    self._handle_dbapi_exception(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2021, in _handle_dbapi_exception
    util.raise_(newraise, with_traceback=exc_info[2], from_=e)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1799, in _execute_context
    self.dialect.do_execute(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 717, in do_execute
    cursor.execute(statement, parameters)
oslo_db.exception.DBNonExistentTable: (sqlite3.OperationalError) no such table: instance_faults
[SQL: INSERT INTO instance_faults (created_at, updated_at, deleted_at, deleted, instance_uuid, code, message, details, host) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)]
[parameters: ('2021-10-07 00:06:06.747150', None, None, 0, '5903406a-6fff-4179-99bf-8bab9c33f0cb', 500, 'MessagingTimeout', 'Traceback (most recent call last):\n File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging ... (2344 characters truncated) ... l_fake.py", line 213, in _send\n raise oslo_messaging.MessagingTimeout(\noslo_messaging.exceptions.MessagingTimeout: No reply on topic conductor\n', 'host2')]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1799, in _execute_context
    self.dialect.do_execute(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 717, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: no such table: instance_actions

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 476, in fire_timers
    timer()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/eventlet/hubs/timer.py", line 59, in __call__
    cb(*args, **kw)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/utils.py", line 680, in context_wrapper
    func(*args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
    return f(*args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 2115, in _locked_do_build_and_run_instance
    result = self._do_build_and_run_instance(*args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/exception_wrapper.py", line 71, in wrapped
    _emit_versioned_exception_notification(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/home/zuul/src/opendev.org/openstack/nova/nova/exception_wrapper.py", line 63, in wrapped
    return f(self, context, *args, **kw)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 183, in decorated_function
    LOG.warning("Failed to revert task state for instance. "
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 154, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/utils.py", line 1433, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/utils.py", line 1394, in __exit__
    objects.InstanceActionEvent.event_finish_with_failure(
  File "/home/zuul/src/opendev.org/openstack/nova/nova/objects/base.py", line 355, in wrapper
    return fn.__get__(None, obj)(*args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_versionedobjects/base.py", line 184, in wrapper
    result = fn(cls, context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/objects/instance_action.py", line 227, in event_finish_with_failure
    db_event = db.action_event_finish(context, values)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_db/api.py", line 154, in wrapper
    ectxt.value = e.inner_exc
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_db/api.py", line 142, in wrapper
    return f(*args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/db/main/api.py", line 190, in wrapper
    return f(context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/db/main/api.py", line 4078, in action_event_finish
    action = _action_get_by_request_id(context, values['instance_uuid'],
  File "/home/zuul/src/opendev.org/openstack/nova/nova/db/main/api.py", line 4014, in _action_get_by_request_id
    result = model_query(context, models.InstanceAction).\
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 2762, in first
    return self.limit(1)._iter().first()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 2846, in _iter
    result = self.session.execute(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 1689, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1611, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1478, in _execute_clauseelement
    ret = self._execute_context(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1842, in _execute_context
    self._handle_dbapi_exception(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2021, in _handle_dbapi_exception
    util.raise_(newraise, with_traceback=exc_info[2], from_=e)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1799, in _execute_context
    self.dialect.do_execute(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 717, in do_execute
    cursor.execute(statement, parameters)
oslo_db.exception.DBNonExistentTable: (sqlite3.OperationalError) no such table: instance_actions
[SQL: SELECT instance_actions.created_at AS instance_actions_created_at, instance_actions.updated_at AS instance_actions_updated_at, instance_actions.deleted_at AS instance_actions_deleted_at, instance_actions.deleted AS instance_actions_deleted, instance_actions.id AS instance_actions_id, instance_actions.action AS instance_actions_action, instance_actions.instance_uuid AS instance_actions_instance_uuid, instance_actions.request_id AS instance_actions_request_id, instance_actions.user_id AS instance_actions_user_id, instance_actions.project_id AS instance_actions_project_id, instance_actions.start_time AS instance_actions_start_time, instance_actions.finish_time AS instance_actions_finish_time, instance_actions.message AS instance_actions_message
FROM instance_actions
WHERE instance_actions.deleted = ? AND instance_actions.instance_uuid = ? AND instance_actions.request_id = ? ORDER BY instance_actions.created_at DESC, instance_actions.id DESC
 LIMIT ? OFFSET ?]
[parameters: (0, '5903406a-6fff-4179-99bf-8bab9c33f0cb', 'req-14ab570f-ce9a-4a2e-a753-db18915fbe14', 1, 0)]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
}}}

Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/nova/tests/functional/test_servers_resource_request.py", line 2491, in test_unshelve_offloaded_server_with_qos_port_pci_update_fails
    self.assertEqual(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/testtools/testcase.py", line 393, in assertEqual
    self.assertThat(observed, matcher, message)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/testtools/testcase.py", line 480, in assertThat
    raise mismatch_error
testtools.matchers._impl.MismatchError: 'UnexpectedResourceProviderNameForPCIRequest' != 'DBNonExistentTable'

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

It seems this happens pretty frequently on the gate [1]. I ran the whole functional suite locally for days with random ordering but was not able to reproduce it, so it is really hard to figure out what happens.

The first exception, nova.exception.UnexpectedResourceProviderNameForPCIRequest, is expected; it is part of the test case. The logs then show that we reverted the allocation in placement as the VM creation failed due to the first exception.

Then I see the following in the stack trace:

LOG.warning("Failed to revert task state for instance.

So we end up at [2]. Then the compute thread seems to end, and the parallel conductor thread times out in self.conductor_compute_rpcapi.build_instances:

File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/_drivers/impl_fake.py", line 213, in _send
    raise oslo_messaging.MessagingTimeout(
oslo_messaging.exceptions.MessagingTimeout: No reply on topic conductor

But it seems that at that point the DB had already been deleted by the test env, leading to errors like:

sqlite3.OperationalError: no such table: instance_faults

I don't see how this can be timing related, but other than some kind of race condition between the compute and the conductor I have no other idea.

[1] https://paste.opendev.org/show/809926/
[2] https://github.com/openstack/nova/blob/fdfdba265833d237e22676f9a223ab8ca0fe1e03/nova/compute/manager.py#L183

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

On the timing:
The successful runs on the gate take 3-7 seconds, the failed runs take 4-6 seconds, so there is no big timing difference.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote (last edit ):

There is a reschedule in the stack trace [1]:

File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 2263, in _do_build_and_run_instance

That does not happen in a successful test case run, so I have to figure out why and how we hit the reschedule.

[1] https://github.com/openstack/nova/blob/fdfdba265833d237e22676f9a223ab8ca0fe1e03/nova/compute/manager.py#L2263-L2268

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Looking at the logs, we booted on host1, then shelved, and then shelve offloaded.

2021-10-08 01:20:33.107782 | ubuntu-focal | 2021-10-08 01:20:30,613 INFO [nova.scheduler.host_manager] Host filter only checking host host1 and node host1
2021-10-08 01:20:33.107801 | ubuntu-focal | 2021-10-08 01:20:30,628 INFO [placement.requestlog] 127.0.0.1 "GET /placement/allocations/06aab374-ae34-4223-b31d-ba1558efa2ac" status: 200 len: 19 microversion: 1.28
2021-10-08 01:20:33.107819 | ubuntu-focal | 2021-10-08 01:20:30,676 INFO [placement.requestlog] 127.0.0.1 "PUT /placement/allocations/06aab374-ae34-4223-b31d-ba1558efa2ac" status: 204 len: 0 microversion: 1.36
2021-10-08 01:20:33.107837 | ubuntu-focal | 2021-10-08 01:20:30,873 INFO [placement.requestlog] 127.0.0.1 "GET /placement/resource_providers/e11c9de0-4624-49ef-835c-a56da97a5f68" status: 200 len: 890 microversion: 1.14
2021-10-08 01:20:33.107855 | ubuntu-focal | 2021-10-08 01:20:30,881 INFO [placement.requestlog] 127.0.0.1 "GET /placement/allocations/06aab374-ae34-4223-b31d-ba1558efa2ac" status: 200 len: 644 microversion: 1.28
2021-10-08 01:20:33.107886 | ubuntu-focal | 2021-10-08 01:20:30,885 INFO [nova.compute.claims] Claim successful on node host1
2021-10-08 01:20:33.107905 | ubuntu-focal | 2021-10-08 01:20:31,138 INFO [nova.api.openstack.requestlog] 127.0.0.1 "GET /v2.1/servers/06aab374-ae34-4223-b31d-ba1558efa2ac" status: 200 len: 1942 microversion: 2.90 time: 0.146510
2021-10-08 01:20:33.107924 | ubuntu-focal | 2021-10-08 01:20:31,145 INFO [nova.compute.manager] Took 0.00 seconds to spawn the instance on the hypervisor.
2021-10-08 01:20:33.107942 | ubuntu-focal | 2021-10-08 01:20:31,187 INFO [nova.compute.manager] Took 0.33 seconds to build instance.
2021-10-08 01:20:33.107960 | ubuntu-focal | 2021-10-08 01:20:31,231 INFO [nova.scheduler.host_manager] Received an update from an unknown host 'host1'. Re-created its InstanceList.
2021-10-08 01:20:33.107979 | ubuntu-focal | 2021-10-08 01:20:31,677 INFO [nova.api.openstack.requestlog] 127.0.0.1 "GET /v2.1/servers/06aab374-ae34-4223-b31d-ba1558efa2ac" status: 200 len: 2436 microversion: 2.90 time: 0.033292
2021-10-08 01:20:33.107997 | ubuntu-focal | 2021-10-08 01:20:31,688 INFO [placement.requestlog] 127.0.0.1 "GET /placement/allocations/06aab374-ae34-4223-b31d-ba1558efa2ac" status: 200 len: 548 microversion: 1.0
2021-10-08 01:20:33.108015 | ubuntu-focal | 2021-10-08 01:20:31,789 INFO [nova.api.openstack.requestlog] 127.0.0.1 "POST /v2.1/servers/06aab374-ae34-4223-b31d-ba1558efa2ac/action" status: 202 len: 0 microversion: 2.90 time: 0.097403
2021-10-08 01:20:33.108033 | ubuntu-focal | 2021-10-08 01:20:31,825 INFO [nova.compute.manager] Shelving
2021-10-08 01:20:33.108060 | ubuntu-focal | 2021-10-08 01:20:31,828 INFO [nova.api.openstack.requestlog] 127.0.0.1 "GET /v2.1/servers/06aab374-ae34-4223-b31d-ba1558efa2ac" status: 200 len: 2442 microversion: 2.90 time: 0.034799
2021-10-08 01:20:33.108080 | ubuntu-focal | 2021-10-08 01:20:31,885 INFO [nova.compute.manager] Shelve offloading
2021-10-08 01:20:33.108098 | ubuntu-focal | 2021-10-08 01:20:31,911 INFO [placement.requestlog] 127.0.0.1 "DELETE /placement/allocations/06aab374-ae34-4223-b31d-ba1558efa...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/813674

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote (last edit ): Re: test_unshelve_offloaded_server_with_qos_port_pci_update_fails

I managed to create a stable reproduction locally. \o/

1) Duplicate the failing test:

diff --git a/nova/tests/functional/test_servers_resource_request.py b/nova/tests/functional/test_servers_resource_request.py
index e2746b3669..d678067c18 100644
--- a/nova/tests/functional/test_servers_resource_request.py
+++ b/nova/tests/functional/test_servers_resource_request.py
@@ -2572,6 +2572,18 @@ class ServerMoveWithPortResourceRequestTest(
         self._delete_server_and_check_allocations(
             server, qos_normal_port, qos_sriov_port)

+    def test_unshelve_offloaded_server_with_qos_port_pci_update_fails0(self):
+        self.test_unshelve_offloaded_server_with_qos_port_pci_update_fails()
+
+    def test_unshelve_offloaded_server_with_qos_port_pci_update_fails1(self):
+        self.test_unshelve_offloaded_server_with_qos_port_pci_update_fails()
+
+    def test_unshelve_offloaded_server_with_qos_port_pci_update_fails2(self):
+        self.test_unshelve_offloaded_server_with_qos_port_pci_update_fails()
+
+    def test_unshelve_offloaded_server_with_qos_port_pci_update_fails3(self):
+        self.test_unshelve_offloaded_server_with_qos_port_pci_update_fails()
+

2) Run: tox -e functional-py38 "test_description_errors|test_unshelve_offloaded_server_with_qos_port_pci_update_fails" -- --serial

This makes one of the test_unshelve_offloaded_server_with_qos_port_pci_update_fails test cases fail with the reported error.

What I know so far:
* nova.tests.functional.test_servers.ServersTestV219.test_description_errors() starts a new instance but does not wait for it to become ACTIVE. The test case passes and finishes.

* But the build_and_run_instance RPC call still runs in the compute service in a greenlet, building up the instance. The service.kill at the end of the test case does not kill the running / waiting greenlets. I proved this by dumping a GMR at the end of the test run, after service.kill was called by the Fixture cleanup: the build_and_run_instance greenlet was still visible there (a rough way to list such leaked greenlets is sketched after this list).

* Still, other fixture cleanups drop the database out from under the compute service, which leads to a ComputeNodeNotFound error. You can see this by simply adding a self.fail() at the end of the test_description_errors test case and running it; there will be ComputeNodeNotFound in the logs.

* After test_description_errors passes, the test executor runs the next test cases.

* After ~60 seconds an RPC timeout happens in the original greenlet from test_description_errors() and it *somehow* interferes with the currently running test, making it fail.

What I don't know yet is the mechanism of the interference.
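
A rough way to confirm such leaks (not the GMR dump used above, just a hypothetical helper that walks the garbage collector for live greenlets) is sketched below:

import gc
import traceback

import greenlet


def dump_live_greenlets():
    # List greenlets that are still alive, e.g. at the end of a test run.
    for obj in gc.get_objects():
        if isinstance(obj, greenlet.greenlet) and not obj.dead:
            print('--- live greenlet: %r ---' % obj)
            if obj.gr_frame is not None:
                # Show where the leaked greenlet is currently blocked,
                # e.g. waiting on an RPC reply queue.
                traceback.print_stack(obj.gr_frame)

Calling dump_live_greenlets() from a test cleanup would show any leaked greenlets (the hub and the main greenlet will also show up).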

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I now know which global facilitates the effect between the two test cases.

When a versioned notification is emitted, the code in [1] calls nova.rpc.get_versioned_notifier(), which uses the global NOTIFIER [2] from the rpc module. This global is properly reset between tests, but if the first test case only reaches [1] 60 seconds after that test case has finished, another test case will already have re-initialized NOTIFIER with a valid FakeVersionedNotifier that can be used to deliver notifications. In that case the first test case delivers a notification to the later test case, making it fail.

[1] https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/notifications/objects/base.py#L211
[2] https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/rpc.py#L239
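
A minimal, self-contained sketch of that interaction (illustrative names only, not nova's actual code):

import eventlet


class FakeNotifier:
    def __init__(self, owner):
        self.owner = owner
        self.received = []

    def notify(self, payload):
        self.received.append(payload)


class fake_rpc:
    NOTIFIER = None  # stands in for the module-level global nova.rpc.NOTIFIER


def run_test_a():
    fake_rpc.NOTIFIER = FakeNotifier(owner='test_a')

    def late_worker():
        eventlet.sleep(0.1)  # stands in for the 60 second RPC timeout
        # Uses whatever notifier happens to be the current global.
        fake_rpc.NOTIFIER.notify('error notification from test_a')

    eventlet.spawn(late_worker)
    # test_a finishes here without waiting for late_worker.


def run_test_b():
    # The fixture of the next test re-initializes the global.
    fake_rpc.NOTIFIER = FakeNotifier(owner='test_b')
    eventlet.sleep(0.2)  # test_b is still running when late_worker fires
    # test_b now sees a notification it never triggered:
    print(fake_rpc.NOTIFIER.received)  # ['error notification from test_a']


run_test_a()
run_test_b()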

Changed in nova:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/814036

Changed in nova:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Balazs Gibizer <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/813674
Reason: fix is up in https://review.opendev.org/c/openstack/nova/+/814036

Changed in nova:
assignee: nobody → Balazs Gibizer (balazs-gibizer)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/814036
Committed: https://opendev.org/openstack/nova/commit/61fc81a6761d34afdfc4a6d1c4c953802fd8a179
Submitter: "Zuul (22348)"
Branch: master

commit 61fc81a6761d34afdfc4a6d1c4c953802fd8a179
Author: Balazs Gibizer <email address hidden>
Date: Thu Oct 14 18:09:18 2021 +0200

    Prevent leaked eventlets to send notifications

    In our functional tests we run nova services as eventlets. Those
    services can also spawn their own eventlets for RPC or other parallel
    processing. The test case executor only sees and tracks the main
    eventlet where the code of the test case is running. When that
    finishes, the test executor considers the test case to be finished
    regardless of the other spawned eventlets. This can leave leaked
    eventlets running in parallel with later test cases.

    One way that this can cause trouble is via the global variables in the
    nova.rpc module. Those globals are re-initialized for each test case, so
    they do not directly leak information between test cases. However, if
    a late eventlet calls nova.rpc.get_versioned_notifier() it will get a
    perfectly usable FakeVersionedNotifier object regardless of which test
    case the notifier belongs to or which test case the eventlet belongs
    to. This way the late eventlet can send a notification to the currently
    running test case and therefore make it fail.

    The current case we saw is the following:

    1) The test case
      nova.tests.functional.test_servers.ServersTestV219.test_description_errors
      creates a server but does not wait for it to reach a terminal state (ACTIVE
      / ERROR). This test case finishes quickly but leaks running eventlets
      in the background waiting for some RPC call to return.
    2) As the test case finished the cleanup code deletes the test case
       specific setup, including the DB.
    3) The test executor moves forward and starts running another test case
    4) 60 seconds later the leaked eventlet times out waiting for the RPC
       call to return and tries doing things, but fails as the DB is already
       gone. Then it tries to report this as an error notification. It calls
       nova.rpc.get_versioned_notifier() and gets a fresh notifier that is
       connected to the currently running test case. Then emits the error
       notification there.
    5) The currently running test case also waits for an error notification
       to be triggered by the currently running test code. But it gets the
       notification from the late eventlet first. As the content of the
       notification does not match the expectations, the currently
       running test case fails. The late eventlet also prints a lot of
       errors about the DB being gone, making troubleshooting pretty hard.

    This patch proposes a way to fix this by marking each eventlet at spawn
    time with the id of the test case that directly or indirectly
    started it.

    Then when the NotificationFixture gets a notification it compares the
    test case id stored in the calling eventlet with the id of the test case
    in...

Read more...
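
A rough sketch of that approach, with purely illustrative names (the real implementation is in the review linked above):

import eventlet
from eventlet import corolocal

# Greenthread-local storage holding the id of the owning test case.
_test_id_store = corolocal.local()


def spawn_tagged(func, *args, **kwargs):
    # Spawn a greenthread that inherits the caller's test case id.
    caller_test_id = getattr(_test_id_store, 'test_id', None)

    def _inner():
        _test_id_store.test_id = caller_test_id
        return func(*args, **kwargs)

    return eventlet.spawn(_inner)


class NotificationCollector:
    # Fixture-like collector that ignores notifications coming from
    # greenthreads that belong to a different (already finished) test.

    def __init__(self, test_id):
        self.test_id = test_id
        self.notifications = []
        _test_id_store.test_id = test_id  # mark the test's main greenthread

    def notify(self, payload):
        sender = getattr(_test_id_store, 'test_id', None)
        if sender != self.test_id:
            return  # a leaked greenthread from a previous test; drop it
        self.notifications.append(payload)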

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/816487

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/nova/+/816487
Committed: https://opendev.org/openstack/nova/commit/f24e0c1da2c099609388555091d987b8807b5f0d
Submitter: "Zuul (22348)"
Branch: stable/xena

commit f24e0c1da2c099609388555091d987b8807b5f0d
Author: Balazs Gibizer <email address hidden>
Date: Thu Oct 14 18:09:18 2021 +0200

    Prevent leaked eventlets to send notifications

    (commit message identical to the master commit above)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 24.1.0

This issue was fixed in the openstack/nova 24.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 25.0.0.0rc1

This issue was fixed in the openstack/nova 25.0.0.0rc1 release candidate.

Revision history for this message
melanie witt (melwitt) wrote : Re: test_unshelve_offloaded_server_with_qos_port_pci_update_fails

I'm not totally sure but I think I might have seen this again today in the nova-tox-functional-py38 job[1]:

2022-03-24 05:08:04,163 INFO [nova.api.openstack.requestlog] 127.0.0.1 "GET /v2.1/servers/0606b03e-f05f-407c-9fd8-33a8d673e2f4" status: 200 len: 2147 microversion: 2.90 time: 0.028787
2022-03-24 05:08:04,193 ERROR [root] Original exception being dropped: ['Traceback (most recent call last):\n', ' File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/_drivers/impl_fake.py", line 207, in _send\n reply, failure = reply_q.get(timeout=timeout)\n', ' File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/eventlet/queue.py", line 322, in get\n return waiter.wait()\n', ' File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/eventlet/queue.py", line 141, in wait\n return get_hub().switch()\n', ' File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 313, in switch\n return self.greenlet.switch()\n', '_queue.Empty\n', '\nDuring handling of the above exception, another exception occurred:\n\n', 'Traceback (most recent call last):\n', ' File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 201, in decorated_function\n return function(self, context, *args, **kwargs)\n', ' File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 8969, in _post_live_migration\n self._update_scheduler_instance_info(ctxt, instance)\n', ' File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 2109, in _update_scheduler_instance_info\n self.query_client.update_instance_info(context, self.host,\n', ' File "/home/zuul/src/opendev.org/openstack/nova/nova/scheduler/client/query.py", line 69, in update_instance_info\n self.scheduler_rpcapi.update_instance_info(context, host_name,\n', ' File "/home/zuul/src/opendev.org/openstack/nova/nova/scheduler/rpcapi.py", line 174, in update_instance_info\n return cctxt.cast(ctxt, \'update_instance_info\', host_name=host_name,\n', ' File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/fixtures/_fixtures/monkeypatch.py", line 89, in avoid_get\n return captured_method(*args, **kwargs)\n', ' File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/rpc/client.py", line 189, in call\n result = self.transport._send(\n', ' File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/transport.py", line 123, in _send\n return self._driver.send(target, ctxt, message,\n', ' File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/_drivers/impl_fake.py", line 222, in send\n return self._send(target, ctxt, message, wait_for_reply, timeout,\n', ' File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/_drivers/impl_fake.py", line 213, in _send\n raise oslo_messa...

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote (last edit ):

@Melanie: nice catch. It is similar: it seems a long(er)-running thread, the _live_migration_monitor, keeps running after the actual test that started it has already finished. Then when the monitor wakes up and tries to record an instance fault, it fails because the test cleanup has already removed the DB under it. At least that much I was able to decode from the stack trace so far. I will keep digging...

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I managed to reproduce the issue once (in ~20 runs) locally with some extra print statements but no breakthrough so far.

[1] run log https://paste.ubuntu.com/p/rf3mpgXhtc/
[2] test case list used in the run https://paste.opendev.org/show/bITh68wTY3XNq3jdLy3h/

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

One more reproduction: https://paste.ubuntu.com/p/svZTwCWT25/. It shows that the test_case_id propagation does not work in this case, so the original fix is not effective here: https://review.opendev.org/c/openstack/nova/+/814036/1/nova/tests/fixtures/nova.py#1527

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Another hit in https://review.opendev.org/c/openstack/nova/+/823229 with logs: https://zuul.opendev.org/t/openstack/build/956bd217bbf04d97bfed9479ee7bbc75/log/job-output.txt#3497

The common thing in the new hits is that the libvirt monitoring thread is leaked. Unfortunately I have no time at the moment to dig further, so I made a TODO to look at this later.

Revision history for this message
melanie witt (melwitt) wrote (last edit ):

I looked at this for a while, trying to find some way we could reduce the number of leaked threads, since killing them (a naked greenlet) is not always possible, as evidenced by gibi's abandoned patch [1] and some similar things I have tried locally since.

I noticed that when the nova-compute service is stopped by calling stop(), the part that cleans up live migrations in the thread pool uses wait=False [2]:

    def cleanup_host(self):
        self.driver.register_event_listener(None)
        self.instance_events.cancel_all_events()
        self.driver.cleanup_host(host=self.host)
        self._cleanup_live_migrations_in_pool()

    def _cleanup_live_migrations_in_pool(self):
        # Shutdown the pool so we don't get new requests.
        self._live_migration_executor.shutdown(wait=False)

which "If True then shutdown will not return until all running futures have finished executing and the resources used by the executor have been reclaimed." [3]. So wait=False will move on even while running futures are still executing.

I wondered if a patch like the following might reduce the chances of a leaked greenlet from a previous test running during a new test:

diff --git a/nova/test.py b/nova/test.py
index a6449c01f0..69b9d06cdc 100644
--- a/nova/test.py
+++ b/nova/test.py
@@ -37,6 +37,7 @@ import pprint
 import sys

 import fixtures
+import futurist
 import mock
 from oslo_cache import core as cache
 from oslo_concurrency import lockutils
@@ -290,6 +291,15 @@ class TestCase(base.BaseTestCase):
         # instead of only once initialized for test worker
         wsgi_app.init_global_data.reset()

+        orig_shutdown = futurist.GreenThreadPoolExecutor.shutdown
+
+        def wrap_shutdown(*a, **kw):
+            kw['wait'] = True
+            return orig_shutdown(*a, **kw)
+
+        self.useFixture(fixtures.MonkeyPatch(
+            'futurist.GreenThreadPoolExecutor.shutdown', wrap_shutdown))
+

but I haven't been able to reproduce the failure to tell if it might help.

[1] https://review.opendev.org/c/openstack/nova/+/815017
[2] https://github.com/openstack/nova/blob/b8cc5704558d3c08fda9db2f1bb7fecb2bcd985d/nova/compute/manager.py#L1627
[3] https://docs.openstack.org/futurist/latest/reference/index.html#futurist.GreenThreadPoolExecutor.shutdown

melanie witt (melwitt)
summary: - test_unshelve_offloaded_server_with_qos_port_pci_update_fails
+ test_unshelve_offloaded_server_with_qos_port_pci_update_fails (leaked eventlet threads run after test teardown)
Revision history for this message
melanie witt (melwitt) wrote : Re: test_unshelve_offloaded_server_with_qos_port_pci_update_fails (leaked eventlet threads run after test teardown)

Reopening this because (1) we're still seeing it in the gate (albeit less often since the notifications-related workaround) and (2) when the bug is closed it won't show up in the gate bug list https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure, which makes it easy to find for recheck comments.

Changed in nova:
status: Fix Released → Confirmed
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

FWIW, I'm in favor of changing the bug report title to make it broader. There is evidence that this problem isn't related to a single test but spans our functional tests (and isn't dependent on a single Python version).

summary: - test_unshelve_offloaded_server_with_qos_port_pci_update_fails (leaked eventlet threads run after test teardown)
+ leaked threads run after test case teardown causing intermittent functional test failures
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

OK, let me wrap up a few findings here.
First, the number of occurrences is way lower than what I said in c#25 (fortunately):
something around 19 occurrences over the last 20 days (which is still pretty high, though).

Secondly, I think I identified the real exception leading to the failed test condition:

2023-01-24 18:29:03.537384 | ubuntu-jammy | During handling of the above exception, another exception occurred:
2023-01-24 18:29:03.537389 | ubuntu-jammy |
2023-01-24 18:29:03.537394 | ubuntu-jammy | Traceback (most recent call last):
2023-01-24 18:29:03.537400 | ubuntu-jammy | File "/home/zuul/src/opendev.org/openstack/nova/nova/virt/libvirt/driver.py", line 10619, in _live_migration
2023-01-24 18:29:03.537405 | ubuntu-jammy | self.live_migration_abort(instance)
2023-01-24 18:29:03.537411 | ubuntu-jammy | File "/home/zuul/src/opendev.org/openstack/nova/nova/virt/libvirt/driver.py", line 10087, in live_migration_abort
2023-01-24 18:29:03.537416 | ubuntu-jammy | guest = self._host.get_guest(instance)
2023-01-24 18:29:03.537422 | ubuntu-jammy | File "/home/zuul/src/opendev.org/openstack/nova/nova/virt/libvirt/host.py", line 673, in get_guest
2023-01-24 18:29:03.537427 | ubuntu-jammy | return libvirt_guest.Guest(self._get_domain(instance))
2023-01-24 18:29:03.537432 | ubuntu-jammy | File "/home/zuul/src/opendev.org/openstack/nova/nova/virt/libvirt/guest.py", line 112, in __init__
2023-01-24 18:29:03.537438 | ubuntu-jammy | libvirt = importutils.import_module('libvirt')
2023-01-24 18:29:03.537482 | ubuntu-jammy | File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py310/lib/python3.10/site-packages/oslo_utils/importutils.py", line 73, in import_module
2023-01-24 18:29:03.537494 | ubuntu-jammy | __import__(import_str)
2023-01-24 18:29:03.537500 | ubuntu-jammy | File
2023-01-24 18:29:03.537505 | ubuntu-jammy | "/home/zuul/src/opendev.org/openstack/nova/nova/tests/fixtures/nova.py", line 1829, in find_spec
2023-01-24 18:29:03.537511 | ubuntu-jammy | raise ImportError(fullname)
2023-01-24 18:29:03.537516 | ubuntu-jammy | ImportError: libvirt
2023-01-24 18:29:03.537521 | ubuntu-jammy |
2023-01-24 18:29:03.537527 | ubuntu-jammy | During handling of the above exception, another exception occurred:
2023-01-24 18:29:03.537532 | ubuntu-jammy |
2023-01-24 18:29:03.537537 | ubuntu-jammy | Traceback (most recent call last):
2023-01-24 18:29:03.537543 | ubuntu-jammy | File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 8854, in _do_live_migration
2023-01-24 18:29:03.537548 | ubuntu-jammy | self.driver.live_migration(context, instance, dest,
2023-01-24 18:29:03.537553 | ubuntu-jammy | File "/home/zuul/src/opendev.org/openstack/nova/nova/virt/libvirt/driver.py", line 10076, in live_migration
2023-01-24 18:29:03.537559 | ubuntu-jammy | self._live_migration(context, instance, dest,
2023-01-24 18:29:03.537564 | ubuntu-jammy | File "/home/zuul/src/opendev.org/openstack/nova/nova/virt/libvirt/driver.py", line 10620, in _live_migration
2023-01-24 18:29:03.537569 | ubuntu-jammy | except libvirt.libvirtError:
2023-01-24 18:29:03.537575 | ubuntu-jammy | AttributeError: 'NoneType' o...

Read more...
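
For context on the ImportError above: nova's functional test fixtures poison the libvirt import so tests never load the real library. A minimal sketch of such a poison (illustrative only; the real finder lives in nova/tests/fixtures/nova.py):

import importlib.abc
import sys


class PoisonedImportFinder(importlib.abc.MetaPathFinder):
    # Fail any attempt to import the named module during a test.

    def __init__(self, blocked_name):
        self.blocked_name = blocked_name

    def find_spec(self, fullname, path, target=None):
        if fullname == self.blocked_name:
            raise ImportError(fullname)
        return None  # let the normal finders handle everything else


sys.meta_path.insert(0, PoisonedImportFinder('libvirt'))

So when the leaked migration thread from an earlier libvirt functional test later executes importutils.import_module('libvirt') under a test that poisons the import, it gets the ImportError above, and the module-level libvirt name it relies on is still None, which is what produces the AttributeError on 'NoneType'.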

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/872975

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/872975
Committed: https://opendev.org/openstack/nova/commit/1778a9c589cf24e17b44f556680b17af9577df11
Submitter: "Zuul (22348)"
Branch: master

commit 1778a9c589cf24e17b44f556680b17af9577df11
Author: Sylvain Bauza <email address hidden>
Date: Tue Feb 7 16:30:59 2023 +0100

    Add logging to find test cases leaking libvirt threads

    We see functional test failures due to a leaked libvirt event handling
    thread waking up after its original test finished and importing
    libvirt. If that happens while the libvirt package import is poisoned,
    the currently executing test will fail. This patch logs the name of the
    test case that leaked the libvirt event handling thread.

    We will revert this before RC1.

    Change-Id: I3146e9afb411056d004fc118ccfa31126a3c6b15
    Related-Bug: #1946339

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

After a week, here is a paste of the added logging:

https://paste.opendev.org/show/bfvZX0XeKsELzY54EGb8/

It looks like each of the failures we got was caused by a different functional test.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

New paste:

Search call is "logsearch log --job-group nova-functional --result FAILURE --branch master '!!!---!!!' --days 7"
https://paste.opendev.org/show/bzFdND2j4YbJRK1JLfcY/

It looks like most of the failing tests inherit from the nova.tests.functional.libvirt base, except for this failing test:
https://github.com/openstack/nova/blob/00299dcaad9a42c7f81be62a3966c59b299dff1d/nova/tests/functional/test_cross_az_attach.py#L69

More to come, investigating.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/883317

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/883317
Committed: https://opendev.org/openstack/nova/commit/c095cfe04e2c71efcfbfdd95948af080a98065e6
Submitter: "Zuul (22348)"
Branch: master

commit c095cfe04e2c71efcfbfdd95948af080a98065e6
Author: melanie witt <email address hidden>
Date: Thu May 11 19:23:52 2023 +0000

    tests: Use GreenThreadPoolExecutor.shutdown(wait=True)

    We are still having some issues in the gate where greenlets from
    previous tests continue to run while the next test starts, causing
    false negative failures in unit or functional test jobs.

    This adds a new fixture that will ensure
    GreenThreadPoolExecutor.shutdown() is called with wait=True, to wait
    for greenlets in the pool to finish running before moving on.

    In local testing, doing this does not appear to adversely affect test
    run times, which was my primary concern.

    As a baseline, I ran a subset of functional tests in a loop
    until failure without the patch and after 11 hours, I got a failure
    reproducing the bug. With the patch, running the same subset of
    functional tests in a loop has been running for 24 hours and has not
    failed yet.

    Based on this, I think it may be worth trying this out to see if it
    will help stability of our unit and functional test jobs. And if it
    ends up impacting test run times or causes other issues, we can
    revert it.

    Partial-Bug: #1946339

    Change-Id: Ia916310522b007061660172fa4d63d0fde9a55ac
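
A fixture in the spirit of that change could look roughly like this (sketch only, assuming callers pass wait as a keyword argument the way nova's compute manager does; the merged implementation is in the review linked above):

import fixtures
import futurist


class GreenPoolWaitFixture(fixtures.Fixture):
    # Force GreenThreadPoolExecutor.shutdown() to wait for running greenlets.

    def _setUp(self):
        original_shutdown = futurist.GreenThreadPoolExecutor.shutdown

        def shutdown_and_wait(executor, *args, **kwargs):
            kwargs['wait'] = True
            return original_shutdown(executor, *args, **kwargs)

        self.useFixture(fixtures.MonkeyPatch(
            'futurist.GreenThreadPoolExecutor.shutdown', shutdown_and_wait))

A test base class would enable it with self.useFixture(GreenPoolWaitFixture()) in setUp().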

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/908806

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/2023.1)

Change abandoned by "Balazs Gibizer <email address hidden>" on branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/908806

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/nova/+/908806
Committed: https://opendev.org/openstack/nova/commit/22cccb9bd9b5e1eb27f7e94855e734ed3863ab8a
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 22cccb9bd9b5e1eb27f7e94855e734ed3863ab8a
Author: melanie witt <email address hidden>
Date: Thu May 11 19:23:52 2023 +0000

    tests: Use GreenThreadPoolExecutor.shutdown(wait=True)

    (commit message identical to the master commit above)

    Partial-Bug: #1946339

    Change-Id: Ia916310522b007061660172fa4d63d0fde9a55ac
    (cherry picked from commit c095cfe04e2c71efcfbfdd95948af080a98065e6)
