running guests freeze when a guest is powered down

Bug #673705 reported by Mathias Gug
Affects            Status         Importance   Assigned to   Milestone
libvirt (Fedora)   Fix Released   Medium
libvirt (Ubuntu)   Fix Released   Low          Unassigned
linux (Ubuntu)     Invalid        Undecided    Unassigned

Bug Description

Binary package hint: qemu-kvm

I'm running multiple guests via libvirt+kvm on my laptop. When I power off one of them, all remaining running guests freeze for 30 seconds.

On the host, top shows the ksmd process as using the most cpu.

I've attached a sar output file. One guest was powered down 02:56:07 PM. The remaining running guest froze until 02:56:37 PM.

ProblemType: Bug
DistroRelease: Ubuntu 10.10
Package: qemu-kvm 0.12.5+noroms-0ubuntu7
ProcVersionSignature: Ubuntu 2.6.35-22.35-generic 2.6.35.4
Uname: Linux 2.6.35-22-generic x86_64
Architecture: amd64
Date: Wed Nov 10 14:58:57 2010
EcryptfsInUse: Yes
InstallationMedia: Ubuntu 10.04 LTS "Lucid Lynx" - Release amd64 (20100429)
KvmCmdLine:
 UID PID PPID C SZ RSS PSR STIME TTY TIME CMD
 117 28028 1 9 155582 296184 2 14:54 ? 00:00:26 /usr/bin/kvm -S -M pc-0.12 -enable-kvm -m 390 -smp 1,sockets=1,cores=1,threads=1 -name t-test2 -uuid d013faff-add9-48d4-8aa3-70a24f8717b2 -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/t-test2.monitor,server,nowait -mon chardev=monitor,mode=readline -rtc base=utc -boot cd -drive file=/home/mathiaz/reference/vms/t-test2/disk.qcow2,if=none,id=drive-virtio-disk0,boot=on,format=qcow2 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -device virtio-net-pci,vlan=0,id=net0,mac=52:54:00:ae:d2:51,bus=pci.0,addr=0x3 -net tap,fd=40,vlan=0,name=hostnet0 -usb -vnc 127.0.0.1:1 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
MachineType: LENOVO 3249CTO
ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.35-22-generic root=UUID=97b2f151-9aee-416a-8156-0585e0766d3d ro quiet splash
ProcEnviron:
 PATH=(custom, user)
 LANG=en_CA.utf8
 SHELL=/bin/bash
SourcePackage: qemu-kvm
dmi.bios.date: 01/26/2010
dmi.bios.vendor: LENOVO
dmi.bios.version: 6QET35WW (1.05 )
dmi.board.name: 3249CTO
dmi.board.vendor: LENOVO
dmi.board.version: Not Available
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Not Available
dmi.modalias: dmi:bvnLENOVO:bvr6QET35WW(1.05):bd01/26/2010:svnLENOVO:pn3249CTO:pvrThinkPadX201:rvnLENOVO:rn3249CTO:rvrNotAvailable:cvnLENOVO:ct10:cvrNotAvailable:
dmi.product.name: 3249CTO
dmi.product.version: ThinkPad X201
dmi.sys.vendor: LENOVO

Mathias Gug (mathiaz) wrote :

The kvm command line included in the attached information belongs to the remaining running guest, which freezes for 30 seconds after another guest is shut down. It's started from libvirt, and both guests use a qemu snapshot file as the backend for their disk device.

summary: - running guests freeze when one of the guest is powered down
+ running guests freeze when a guest is powered down
Serge Hallyn (serge-hallyn) wrote :

I assume you've not made any customizations to /etc/default/qemu-kvm
or /etc/init/qemu-kvm.conf? I'll try to reproduce this tonight.

Mathias Gug (mathiaz) wrote :

It seems that only the network is actually frozen when another guest is shut down. Sharing a screen session between an ssh connection and a console session (via virt-viewer) shows that only the ssh connection is frozen. The running guest keeps running correctly, as seen through the console connection.

Serge Hallyn (serge-hallyn) wrote :

This sounds most like comments #2 and #16 in bug 584048.

It can't be bug #579892, because that patch is in maverick.

Do you have wicd or network-manager running?

Can you give the output of:

   iptables -L
   cat /etc/network/interfaces
   cat /proc/net/arp (both before and during a network freeze)

Also, could you follow the recipe at https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/616064/comments/33 ? That should determine whether this is a kernel or libvirt/qemu bug.

thanks,
-serge

Changed in qemu-kvm (Ubuntu):
status: New → Incomplete
importance: Undecided → Medium
assignee: nobody → Serge Hallyn (serge-hallyn)
importance: Medium → Low
Mathias Gug (mathiaz) wrote :

Network manager is running.

$ sudo iptables -nL
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:53
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:53
ACCEPT udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:67
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:67
ACCEPT udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:53
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:53
ACCEPT udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:67
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:67

Chain FORWARD (policy ACCEPT)
target prot opt source destination
ACCEPT all -- 0.0.0.0/0 192.168.233.0/24 state RELATED,ESTABLISHED
ACCEPT all -- 192.168.233.0/24 0.0.0.0/0
ACCEPT all -- 0.0.0.0/0 0.0.0.0/0
REJECT all -- 0.0.0.0/0 0.0.0.0/0 reject-with icmp-port-unreachable
REJECT all -- 0.0.0.0/0 0.0.0.0/0 reject-with icmp-port-unreachable
ACCEPT all -- 0.0.0.0/0 192.168.122.0/24 state RELATED,ESTABLISHED
ACCEPT all -- 192.168.122.0/24 0.0.0.0/0
ACCEPT all -- 0.0.0.0/0 0.0.0.0/0
REJECT all -- 0.0.0.0/0 0.0.0.0/0 reject-with icmp-port-unreachable
REJECT all -- 0.0.0.0/0 0.0.0.0/0 reject-with icmp-port-unreachable

Chain OUTPUT (policy ACCEPT)
target prot opt source destination

$ sudo iptables -nL -t nat
Chain PREROUTING (policy ACCEPT)
target prot opt source destination

Chain OUTPUT (policy ACCEPT)
target prot opt source destination

Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
MASQUERADE tcp -- 192.168.233.0/24 !192.168.233.0/24 masq ports: 1024-65535
MASQUERADE udp -- 192.168.233.0/24 !192.168.233.0/24 masq ports: 1024-65535
MASQUERADE all -- 192.168.233.0/24 !192.168.233.0/24
MASQUERADE tcp -- 192.168.122.0/24 !192.168.122.0/24 masq ports: 1024-65535
MASQUERADE udp -- 192.168.122.0/24 !192.168.122.0/24 masq ports: 1024-65535
MASQUERADE all -- 192.168.122.0/24 !192.168.122.0/24

$ cat /etc/network/interfaces
auto lo
iface lo inet loopback

Two guests: 179 was shut down, 110 froze.

Before:
$ cat /proc/net/arp
IP address HW type Flags HW address Mask Device
192.168.242.1 0x1 0x2 00:12:17:1a:50:47 * eth0
192.168.122.110 0x1 0x2 52:54:00:a2:4e:07 * virbr0
192.168.122.179 0x1 0x2 52:54:00:72:58:e3 * virbr0

During the freeze:
$ cat /proc/net/arp
IP address HW type Flags HW address Mask Device
192.168.242.1 0x1 0x2 00:12:17:1a:50:47 * eth0
192.168.122.110 0x1 0x2 52:54:00:a2:4e:07 ...

Mathias Gug (mathiaz) wrote :

Running https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/616064/comments/33 recipe for 15 minutes didn't lead to a lock up.

Serge Hallyn (serge-hallyn) wrote :

Sorry, Mathiaz, that recipe tests for connections which just
pause randomly. We need to test when a guest shuts down,
or, in other words, when the veth device is removed from the
bridge. So:

1. Fire up a guest with libvirt. Monitor its network
continuously, e.g. fire up a screen session over ssh running
 while [ 1 ]; do echo -n .; sleep 5s; done
and keep that open so you can see any pauses.

2. Get a usable ns_exec:

    git clone git://git.sr71.net/~hallyn/cr_tests.git
    cd cr_tests
    git checkout ns_exec
    make ns_exec
    cp ns_exec /bin/

3. Create a veth tunnel

    sudo ip link add type veth

4. Open two root terminals to configure a network namespace for our test

terminal 1:
 ip link add type veth
terminal 2:
 /bin/ns_exec -cmn /bin/bash
 echo $$ # call this $pid henceforth
terminal 1:
 ifconfig veth0 0.0.0.0 up
 brctl addif virbr0 veth0
 ip link set veth1 netns $pid # use pid from above
terminal 2:
 ifconfig veth1 up
 dhclient veth1

5. Now we want to emulate shutting down a libvirt guest. Let's try
several ways:

 A. From the host root shell, just remove veth0 from the bridge:

  brctl delif virbr0 veth0

 B. Shut down the veth interfaces. Try veth0 and veth1 on separate
 runs (ifconfig veth0 down).

 C. Just exit the child shell.

 D. Shut down the child shell, and then remove the veth interfaces
 altogether, by doing:

  ip link del veth0

After each test, please remove the veth devices so that the commands
in step 4 (which reference veth0/veth1) stay correct:

 ip link del veth0
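
To make the pauses in step 1 easier to spot while running tests A-D, the dots can be replaced by timestamps that are checked afterwards. This is only a rough sketch: the function name report_gaps, the log path and the 5 second threshold are made up for illustration, and the guest address 192.168.122.110 is simply the one from the arp output earlier in this report. It assumes the Linux iputils ping (for -W) and runs on the host, so that a gap in successful replies corresponds to a network stall.

```shell
# report_gaps THRESHOLD
# Read one epoch timestamp (seconds) per line on stdin and print every
# interval between consecutive lines that is longer than THRESHOLD.
report_gaps() {
    awk -v limit="$1" '
        NR > 1 && $1 - prev > limit {
            printf "gap of %d s before sample %d\n", $1 - prev, NR
        }
        { prev = $1 }
    '
}

# Hypothetical usage on the host (log path and address are assumptions):
#   while true; do
#       ping -c 1 -W 1 192.168.122.110 >/dev/null && date +%s
#       sleep 1
#   done > /tmp/ticks.log
# and, once the test is over:
#   report_gaps 5 < /tmp/ticks.log
```

Only successful pings write a timestamp, so a 30 second freeze should show up as one large gap rather than as a run of missing dots.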

Jeremy Foshee (jeremyfoshee) wrote :

Hi Mathias,

This bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? Can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux 673705

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Derek Simkowiak (ubuntu-cool-st) wrote :

This bug is probably a duplicate of this one:

https://bugs.launchpad.net/ubuntu/maverick/+source/qemu-kvm/+bug/584048

The problem is with Linux bridging. When adding or removing a MAC address (like for KVM, VirtualBox, or even LXC) then if the bridge changes its MAC, this symptom happens. See comment #60 at the URL above, and also see this:

https://www.redhat.com/archives/libvir-list/2010-July/msg00450.html

The workaround is to use a MAC address that starts with "fe" (or any really high number) for your guests. Because the bridge takes the lowest MAC among its attached ports, this causes the kernel to default to the hardware MAC for the bridge.

Serge Hallyn (serge-hallyn) wrote : Re: [Bug 673705] Re: running guests freeze when a guest is powered down

Quoting Derek Simkowiak (<email address hidden>):
> This bug is probably a duplicate issue as this:
>
> https://bugs.launchpad.net/ubuntu/maverick/+source/qemu-kvm/+bug/584048
>
> The problem is with Linux bridging. When adding or removing a MAC
> address (like for KVM, VirtualBox, or even LXC) then if the bridge
> changes its MAC, this symptom happens. See comment #60 at the URL
> above, and also see this:
>
> https://www.redhat.com/archives/libvir-list/2010-July/msg00450.html
>
> The workaround is to use a MAC address that starts with "fe" (or any
> high really high number) for your guests. This causes the kernel to
> default to the hardware MAC for the bridge.

Derek, thanks so much for this. You're almost 100% correct. I had even
suspected that bug (in comment #7), but I failed to make the simple
connection explaining this behavior.

The proposed solution does *not* suffice, and in fact I just reproduced
the bug in natty!

The current solution counts on the bridge being associated with a physical
NIC. But our default configuration uses a NAT bridge which starts with
no physical devices attached, so it starts with a zeroed-out MAC address.
Then every time you start a VM with a lower MAC address than the first
VM's, you can see this pause. And if you shut down the VM with the
lowest MAC address, you'll again see the pause.
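
As a small illustration of the rule described above, here is a sketch that picks the lowest MAC out of a set of bridge ports. The helper name lowest_mac is invented for this example, and it assumes each bridge port carries its guest's MAC, as the explanation above does. The two addresses are the guests from the arp output earlier in this report (192.168.122.110 and 192.168.122.179).

```shell
# lowest_mac MAC...
# Print the numerically lowest of the given MAC addresses. After
# lowercasing, a byte-wise sort of fixed-width hex strings agrees with
# numeric order, so a plain C-locale sort is enough.
lowest_mac() {
    printf '%s\n' "$@" | tr 'A-F' 'a-f' | LC_ALL=C sort | head -n 1
}

# Bridge MAC with both guests attached, then with guest 179 gone.
before=$(lowest_mac 52:54:00:a2:4e:07 52:54:00:72:58:e3)
after=$(lowest_mac 52:54:00:a2:4e:07)
if [ "$before" != "$after" ]; then
    echo "bridge MAC would change: $before -> $after"
fi
```

Guest 179 has the lower MAC of the two, which would be consistent with the report that shutting it down made guest 110 pause.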

Changed in qemu-kvm (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
status: Incomplete → Invalid
Changed in qemu-kvm (Ubuntu):
importance: Low → High
Serge Hallyn (serge-hallyn) wrote :

Hi Mathias,

I've contacted upstream about this issue.

For now, the simplest workaround would be to create a tap or veth NIC with a low macaddr and always keep it attached to your virbr0.
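
A possible way to set that up is sketched below. Treat it as an untested outline rather than a recipe: the function name pin_bridge_mac, the device name virbr0-dummy and the address 02:00:00:00:00:01 are all made up for illustration (any locally administered unicast MAC lower than the guests' 52:54:00 prefix should do), and it assumes an iproute2 recent enough to have "ip tuntap". By default it only prints the commands; set RUN=1 and run as root to execute them.

```shell
# pin_bridge_mac BRIDGE NIC MAC
# Print (or, with RUN=1, execute) the commands that create a tap device
# with a low MAC and attach it to BRIDGE so the bridge MAC stays fixed.
pin_bridge_mac() {
    bridge=$1 nic=$2 mac=$3
    run=echo                            # dry run unless RUN=1
    [ "${RUN:-0}" = 1 ] && run=
    $run ip tuntap add dev "$nic" mode tap
    $run ip link set dev "$nic" address "$mac"
    $run brctl addif "$bridge" "$nic"
}

# Dry run with hypothetical names:
pin_bridge_mac virbr0 virbr0-dummy 02:00:00:00:00:01
```

The device does not survive a reboot or a restart of the libvirt network, so the commands would have to be repeated whenever virbr0 is recreated.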

affects: qemu-kvm (Ubuntu) → libvirt (Ubuntu)
Changed in libvirt (Ubuntu):
assignee: Serge Hallyn (serge-hallyn) → nobody
Dave Walker (davewalker)
tags: added: server-nrs
Serge Hallyn (serge-hallyn) wrote :

Upstream fixed this in libvirt 0.9.0 with commit 5754dbd56d4738112a86776c09e810e32f7c3224.

Changed in libvirt (Ubuntu):
status: Confirmed → In Progress
assignee: nobody → Serge Hallyn (serge-hallyn)
Changed in libvirt (Ubuntu):
importance: High → Low
status: In Progress → Triaged
assignee: Serge Hallyn (serge-hallyn) → nobody
tags: removed: server-nrs
Serge Hallyn (serge-hallyn) wrote :

Marking this Fix Released, as the upstream commit fixing it is in oneiric.

Changed in libvirt (Ubuntu):
status: Triaged → Fix Released
Changed in libvirt (Fedora):
importance: Unknown → Medium
status: Unknown → Fix Released