OpenStack Compute (nova)

Strange apparent atomicity failure in nova

Bug #943312 reported by David Kranz on 2012-02-29

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	OpenStack Compute (nova)	Opinion	Undecided	Unassigned

Bug Description

This problem ocurred in a stable-diablo, ubuntu 11.10, kvm 2-compute-node cluster with multi_host that has been running for two months with multiple users. A user reported that she could not ssh into 2 of her vms. I saw that those vms were running on the same compute node and on that compute node, nova-compute was running but had stopped making any new log entries. Also 'virsh list' hung. She also tried to reboot and then delete the vms. The nova-compute log had errors like:

Error: trying to destroy already destroyed instance: 200

The nova-network log had the following. The really scary thing is that this same kind of error, for instance 200, appeared at about the same time in the nova-network log of the *other* compute node. Somehow the other compute node was trying to do network operations for a vm that was owned by a different compute node. Unless I misunderstand how multi_host works this should not be possible. The log files are large so I will attach a file with the time-window snippets. I have the full log files if any one wants them. The ids of the two vms were 155 and 200.

Restarting libvirt did not help and I had to reboot the compute node e to restore sanity to the system. At that point nova was confused in that it had
marked the 2 vms as gone but they were still running even after the reboot and I did not have auto-restart set for vms. I had to kill them with virsh. As an aside, in this two month run the
issue of libvirt hanging happened at least one other time for no reason. In that case restarting libvirt fixed the problems.

2012-02-27 17:35:27,309 DEBUG nova.network.manager [43e0bcc2-2048-4875-b289-0270d1973c0a tester testproject] floating IP deallocation for instance |200| from (pid=1201) deallocate_for_instance /usr/lib/python2.7/dist-packages/nova/network/manager.py:251
2012-02-27 17:35:27,314 DEBUG nova.network.manager [43e0bcc2-2048-4875-b289-0270d1973c0a tester testproject] network deallocation for instance |200| from (pid=1201) deallocate_for_instance /usr/lib/python2.7/dist-packages/nova/network/manager.py:465
2012-02-27 17:35:27,445 ERROR nova.rpc [4a59acea-7ec9-4eff-b346-7afcae98cfbc None None] Exception during message handling
(nova.rpc): TRACE: Traceback (most recent call last):
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 620, in _process_data
(nova.rpc): TRACE: rval = node_func(context=ctxt, **node_args)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 275, in deallocate_for_instance
(nova.rpc): TRACE: super(FloatingIP, self).deallocate_for_instance(context, **kwargs)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 468, in deallocate_for_instance
(nova.rpc): TRACE: self.deallocate_fixed_ip(context, fixed_ip['address'], **kwargs)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 641, in deallocate_fixed_ip
(nova.rpc): TRACE: instance_id)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 404, in _do_trigger_security_group_members_refresh_for_instance
(nova.rpc): TRACE: instance_ref = self.db.instance_get(admin_context, instance_id)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/db/api.py", line 500, in instance_get
(nova.rpc): TRACE: return IMPL.instance_get(context, instance_id)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 119, in wrapper
(nova.rpc): TRACE: return f(*args, **kwargs)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 1171, in instance_get
(nova.rpc): TRACE: raise exception.InstanceNotFound(instance_id=instance_id)
(nova.rpc): TRACE: InstanceNotFound: Instance 200 could not be found.
(nova.rpc): TRACE:

Revision history for this message

David Kranz (david-kranz) wrote on 2012-02-29:

log-snippets.txt Edit (59.8 KiB, text/plain)

Revision history for this message

Mark McLoughlin (markmc) wrote on 2012-06-07:

Thanks for the very detailed report David

As detailed as it is, I'm going to have to mark it as Incomplete because (a) we've no idea how to reproduce this and (b) an awful lot has changed since Diablo which might have fixed this

If you can come up with a reliable reproducer on Diablo or see the issue again with Essex or Folsom, please do re-open

Changed in nova:
status:	New → Incomplete

Revision history for this message

David Kranz (david-kranz) wrote on 2012-06-07:

I obviously can't reproduce this. The system was long-running and I don't even know exactly what the user did. It's probably some race somewhere which can't occur anymore if we are lucky.

Revision history for this message

Thierry Carrez (ttx) wrote on 2012-07-12:

Can't be reproduced. Please set back to "New" if you see it again