Strange apparent atomicity failure in nova
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Opinion | Undecided | Unassigned |
Bug Description
This problem occurred in a stable-diablo, Ubuntu 11.10, KVM 2-compute-node cluster with multi_host that has been running for two months with multiple users. A user reported that she could not ssh into 2 of her VMs. I saw that those VMs were running on the same compute node, and on that compute node nova-compute was running but had stopped making any new log entries. Also, 'virsh list' hung. She also tried to reboot and then delete the VMs. The nova-compute log had errors like:
Error: trying to destroy already destroyed instance: 200
The nova-network log had the trace shown at the end of this description. The really scary thing is that this same kind of error, for instance 200, appeared at about the same time in the nova-network log of the *other* compute node. Somehow the other compute node was trying to do network operations for a VM that was owned by a different compute node. Unless I misunderstand how multi_host works, this should not be possible. The log files are large, so I will attach a file with the time-window snippets. I have the full log files if anyone wants them. The ids of the two VMs were 155 and 200.
Restarting libvirt did not help, and I had to reboot the compute node to restore sanity to the system. At that point nova was confused: it had marked the 2 VMs as gone, but they were still running even after the reboot, and I did not have auto-restart set for VMs. I had to kill them with virsh. As an aside, in this two-month run libvirt hung at least one other time for no apparent reason. In that case restarting libvirt fixed the problem.
2012-02-27 17:35:27,309 DEBUG nova.network.
2012-02-27 17:35:27,314 DEBUG nova.network.
2012-02-27 17:35:27,445 ERROR nova.rpc [4a59acea-
(nova.rpc): TRACE: Traceback (most recent call last):
(nova.rpc): TRACE: File "/usr/lib/
(nova.rpc): TRACE: rval = node_func(
(nova.rpc): TRACE: File "/usr/lib/
(nova.rpc): TRACE: super(FloatingIP, self).deallocat
(nova.rpc): TRACE: File "/usr/lib/
(nova.rpc): TRACE: self.deallocate
(nova.rpc): TRACE: File "/usr/lib/
(nova.rpc): TRACE: instance_id)
(nova.rpc): TRACE: File "/usr/lib/
(nova.rpc): TRACE: instance_ref = self.db.
(nova.rpc): TRACE: File "/usr/lib/
(nova.rpc): TRACE: return IMPL.instance_
(nova.rpc): TRACE: File "/usr/lib/
(nova.rpc): TRACE: return f(*args, **kwargs)
(nova.rpc): TRACE: File "/usr/lib/
(nova.rpc): TRACE: raise exception.
(nova.rpc): TRACE: InstanceNotFound: Instance 200 could not be found.
(nova.rpc): TRACE:
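For anyone who wants to check their own logs for the same cross-node symptom, here is a minimal Python sketch. It assumes only what is visible in the snippet above: timestamped lines of the form `2012-02-27 17:35:27,445 ...` followed by untimestamped TRACE lines, and the message `Instance <id> could not be found`. The function names and the 60-second window are hypothetical and not part of nova.

```python
import re
from datetime import datetime, timedelta

# Timestamp prefix as it appears in the log snippet, e.g. "2012-02-27 17:35:27,445 ".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) ")


def error_times(lines, instance_id):
    """Return the timestamp in effect for each 'Instance <id> could not be found' line.

    TRACE lines carry no timestamp of their own, so each hit is attributed to
    the most recent timestamped line seen before it.
    """
    needle = "Instance %d could not be found" % instance_id
    last_ts = None
    hits = []
    for line in lines:
        m = TS_RE.match(line)
        if m:
            last_ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            last_ts += timedelta(milliseconds=int(m.group(2)))
        if needle in line and last_ts is not None:
            hits.append(last_ts)
    return hits


def overlapping(hits_a, hits_b, window_seconds=60):
    """Pair up hits from two nodes that fall within window_seconds of each other."""
    window = timedelta(seconds=window_seconds)
    return [(a, b) for a in hits_a for b in hits_b if abs(a - b) <= window]
```

Feeding it the lines of the nova-network log from each compute node (for example via `open(path).readlines()`) and passing both results to `overlapping` lists the moments when both nodes reported the same missing instance.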
Thanks for the very detailed report, David.
As detailed as it is, I'm going to have to mark it as Incomplete because (a) we have no idea how to reproduce this and (b) an awful lot has changed since Diablo, which might have fixed this.
If you can come up with a reliable reproducer on Diablo, or see the issue again with Essex or Folsom, please do re-open.