virsh start domain sometimes fail

Bug #961217 reported by C de-Avillez
This bug affects 3 people
Affects            Status         Importance   Assigned to   Milestone
libvirt (Ubuntu)   Fix Released   High         Unassigned
Precise            Confirmed      Medium       Unassigned

Bug Description

I have at least four domains that I control via Jenkins, in a matrix job. When I start the Jenkins job, all four domains are started, each in its own process (i.e., since this is a matrix job, Jenkins expands and starts all instances at the same time).

Most of the time, at least one of the domains fails to start. The failure is not tied to any one domain; any of them can fail.

When it fails, I see this:

+ virsh -d 5 -l visrh_start.log start clean-oneiric-generic-i386
error: failed to get domain 'clean-oneiric-generic-i386'
error: server closed connection:
start: domain(optdata): clean-oneiric-generic-i386
start: found option <domain>: clean-oneiric-generic-i386
start: <domain> trying as domain NAME

+ '[' 1 '!=' 0 ']'
+ echo 'virsh start clean-oneiric-generic-i386 failed'
virsh start clean-oneiric-generic-i386 failed

Chatting with Serge, he suggests bug 903212 as a possible hit, and asked me to run libvirt with debug=1 (and cross my fingers against a possible heisenbug). I am opening this bug as a placeholder for when I am able to run libvirt under debug.

ProblemType: Bug
DistroRelease: Ubuntu 11.10
Package: libvirt-bin 0.9.2-4ubuntu15.2
ProcVersionSignature: Ubuntu 3.0.0-16.28-server 3.0.17
Uname: Linux 3.0.0-16-server x86_64
ApportVersion: 1.23-0ubuntu4
Architecture: amd64
Date: Wed Mar 21 14:05:42 2012
SourcePackage: libvirt
UpgradeStatus: No upgrade log present (probably fresh install)
mtime.conffile..etc.apparmor.d.abstractions.libvirt.qemu: 2012-02-17T09:08:00.610249
mtime.conffile..etc.apparmor.d.usr.sbin.libvirtd: 2012-02-17T08:42:01.392491
mtime.conffile..etc.libvirt.libvirtd.conf: 2012-02-17T20:12:41.677168
mtime.conffile..etc.libvirt.qemu.conf: 2011-12-20T13:13:27.771582

WORKAROUND:

Retry the virsh start command; try *not* to run concurrent 'virsh list' commands.
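
The retry part of the workaround could be scripted roughly as below. This is only a sketch: the `retry` helper, the attempt count, and the delay are all made up for illustration and are not part of virsh or of the Jenkins job.

```shell
#!/bin/sh
# retry MAX DELAY CMD [ARGS...]   (hypothetical helper, not part of virsh)
# Runs CMD up to MAX times, sleeping DELAY seconds after each failed attempt.
# Returns 0 on the first success, or CMD's last exit status once attempts run out.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        else
            status=$?
        fi
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return "$status"
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (assumed values: 3 attempts, 2 seconds apart):
#   retry 3 2 virsh start clean-oneiric-generic-i386
```

This only papers over the race; it does nothing about the concurrent 'virsh list' calls that appear to trigger it.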

Revision history for this message
C de-Avillez (hggdh2) wrote :
Serge Hallyn (serge-hallyn) wrote :

In bug 903212 it's suggested that a specific, not-cleanly-cherrypickable patchset might fix it. However, the rationale behind the patchset doesn't match the patchset. asctime_r etc. *are* thread-safe. The only thing I can find in the source which isn't is one use of localtime (not localtime_r). It may be worth fixing that and seeing if it helps.

Serge Hallyn (serge-hallyn) wrote :

Except that one use is in tools/virsh.c, which isn't threaded.

summary: - virsh start domain sometime fail
+ virsh start domain sometime fail in oneiric
Changed in libvirt (Ubuntu):
importance: Undecided → High
C de-Avillez (hggdh2) wrote : Re: virsh start domain sometime fail in oneiric

XML for one of the failing domains:

<domain type='kvm'>
  <name>clean-oneiric-generic-i386</name>
  <uuid>55d032e1-b2e2-b557-3f68-99988111b5da</uuid>
  <memory>524288</memory>
  <currentMemory>524288</currentMemory>
  <vcpu>1</vcpu>
  <os>
    <type arch='x86_64' machine='pc-0.14'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='writeback'/>
      <source file='/var/lib/ubuntu-iso-testing/kernel-sru/clean-oneiric-generic-i386/disk0.qcow2'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <disk type='block' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
      <address type='drive' controller='0' bus='1' unit='0'/>
    </disk>
    <controller type='ide' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <interface type='network'>
      <mac address='52:54:00:fe:ca:62'/>
      <source network='default'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <input type='mouse' bus='ps2'/>
    <graphics type='vnc' port='-1' autoport='yes'/>
    <video>
      <model type='vmvga' vram='9216' heads='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </memballoon>
  </devices>
</domain>

Serge Hallyn (serge-hallyn) wrote :

Reproduced after upgrade to precise.

Changed in libvirt (Ubuntu):
status: New → Confirmed
Serge Hallyn (serge-hallyn) wrote :

This time, I got

error: failed to get domain 'o4'
error: End of file while reading data: Input/output error

Serge Hallyn (serge-hallyn) wrote :

Reproduced with upstream libvirt (git clone from today).

C de-Avillez (hggdh2)
summary: - virsh start domain sometime fail in oneiric
+ virsh start domain sometimes fail in oneiric
C de-Avillez (hggdh2)
description: updated
Serge Hallyn (serge-hallyn) wrote : Re: virsh start domain sometimes fail in oneiric

If I hack qemuProcessStop() to not kill a vm, kvm keeps running, but the
monitor sock fd (13) is closed (presumably because libvirt has disconnected).

This seems to confirm that a race triggered by another virsh list/start
action is causing a qemu task which is starting up to get killed.

Serge Hallyn (serge-hallyn) wrote :

Finally.

I can reproduce this in Fedora if I create a slow qemu libvirt hook. Presumably, then, we are hitting this because of the virt-aa-helper hook.

http://people.canonical.com/~serge/breaklibvirt.sh should show all that is needed.
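
For readers without access to that script, a slow qemu hook of the kind described could be set up roughly like this. It is a sketch under assumptions, not the contents of breaklibvirt.sh: the `write_slow_hook` helper and the 5-second delay are invented for illustration. libvirt runs /etc/libvirt/hooks/qemu (when it exists and is executable) with the guest name and operation as arguments.

```shell
#!/bin/sh
# write_slow_hook PATH SECONDS   (hypothetical helper for illustration)
# Writes an executable hook script to PATH that sleeps SECONDS and exits 0,
# which delays every hook call and so widens the window for the startup race.
write_slow_hook() {
    hook_path=$1
    hook_delay=$2
    printf '%s\n' \
        '#!/bin/sh' \
        '# libvirt calls this as: qemu GUEST_NAME OPERATION SUB_OPERATION EXTRA' \
        '# Artificial delay (assumed value) to slow down domain startup.' \
        "sleep $hook_delay" \
        'exit 0' > "$hook_path"
    chmod +x "$hook_path"
}

# Example (as root; the 5-second delay is an assumption):
#   write_slow_hook /etc/libvirt/hooks/qemu 5
```

libvirtd may need a restart before it picks up a newly created hook script.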

Dave Walker (davewalker)
summary: - virsh start domain sometimes fail in oneiric
+ virsh start domain sometimes fail
Serge Hallyn (serge-hallyn) wrote :

I cannot reproduce this on quantal, either with my breaklibvirt.sh script or by hand (doing

for i in `seq 1 4`; do virsh start cdboot$i >> /tmp/out$i 2>&1 & done; virsh list > /tmp/a 2>&1 & virsh list > /tmp/b 2>&1 & virsh list; virsh list

for example)

Changed in libvirt (Ubuntu):
status: Confirmed → Fix Released
Chris J Arges (arges) wrote :

Hi, I can reproduce this bug using 'breaklibvirt.sh' on precise, while on quantal it runs through the iterations just fine.
I'm targeting this bug to Precise.

Changed in libvirt (Ubuntu Precise):
importance: Undecided → Medium
Chris J Arges (arges) wrote :

Also I am using:
libvirt0:
  Installed: 0.9.8-2ubuntu17.4
  Candidate: 0.9.8-2ubuntu17.4

Changed in libvirt (Ubuntu Precise):
status: New → Confirmed
Chris J Arges (arges) wrote :

So just for completeness, I backported the quantal package into precise:
https://launchpad.net/~christopherarges/+archive/ppa-test/+build/3914168

With libvirt-bin and libvirt0 at 0.9.13-0ubuntu12, it does not fail after 100 iterations.

Serge Hallyn (serge-hallyn) wrote :

@christopherarges,

Could you test with the current libvirt-bin in precise-proposed? I'm wondering whether the fix for bug 1055658 fixes this bug as well.

Chris J Arges (arges) wrote :

@serge,
Looks like 0.9.8-2ubuntu17.5 does the trick. 100 iterations with your script and no errors.

Serge Hallyn (serge-hallyn) wrote :

Awesome, thanks. I'll go ahead and actually mark this a duplicate then, as I'm pretty sure it was exactly the same thing.
