virtualization/kvm_check_vm test corrupts pre-downloaded image

Bug #1259879 reported by Yung Shen
This bug affects 1 person
Affects: Checkbox
Status: Fix Released
Importance: High
Assigned to: Daniel Manrique

Bug Description

Testers are able to set a pre-downloaded image in /etc/checkbox.d/virtualization.cfg.

When they do, they will have issues with the kvm test when running a second round of tests, because the image is corrupted by the first round.

Using the precise cloud image as an example:
    md5sum before testing:
    84591f667407228b17f3523ca0f32861 precise-server-cloudimg-i386-disk1.img
    after testing:
    1b26d9fd0772defd72ee76ea60b4d165 precise-server-cloudimg-i386-disk1.img

However, this does not affect people who run the script directly, since the script then downloads the image itself each time.
Still, the test certainly needs a way to verify the image before testing.
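
One way to verify the image before testing, sketched here in Python, is to compare its MD5 checksum against a known-good value such as the one recorded before the first run. The function names and the idea of storing an expected checksum are assumptions for illustration; they are not part of the checkbox script.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large disk images are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_is_pristine(path, expected_md5):
    """True if the image still matches the checksum recorded before testing."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For the example above, `image_is_pristine("precise-server-cloudimg-i386-disk1.img", "84591f667407228b17f3523ca0f32861")` would be expected to return False after the first test run, since the checksum has changed.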

Tags: scripts

Daniel Manrique (roadmr) wrote :

Steps to reproduce:

bzr branch lp:checkbox
wget http://cloud-images.ubuntu.com/raring/current/raring-server-cloudimg-i386-disk1.img
cp raring-server-cloudimg-i386-disk1.img raring-server-cloudimg-i386-disk1.bak
# edit /etc/checkbox.d/virtualization.cfg and put the following:
# image: /path/to/raring-server-cloudimg-i386-disk1.img
# timeout: 30
sudo checkbox/checkbox-old/scripts/virtualization kvm --debug
# Boots successfully the first time
sudo checkbox/checkbox-old/scripts/virtualization kvm --debug

Expected result: "Booted successfully" for the second time
Actual result: KVM instance failed to boot

This is due to a problem mounting filesystems, as seen in virt_debug:

[ 2.754950] init: mountall main process (284) killed by FPE signal
General error mounting filesystems.
A maintenance shell will now be started.
CONTROL-D will terminate this shell and reboot the system.

Also, as Yung reported, md5sum raring-server-* will show that the image's checksum differs from the original one. A changed checksum doesn't necessarily mean the image was corrupted, though; maybe it was not unmounted properly and is now damaged.

Daniel Manrique (roadmr)
Changed in checkbox:
status: New → Confirmed
importance: Undecided → High
Daniel Manrique (roadmr) wrote :

OK, this was a fun one to debug.

The test was designed to download a fresh cloud image every time, which is why we hadn't noticed the behavior where the VM is "corrupted" after the first boot. The problem becomes evident when using a local cloud image, as it's a writable disk image, so changes are persistent.

What happens is that our test waits a fixed amount of time, then simply terminates qemu forcefully. This is the equivalent of shutting down the VM ungracefully, which for some reason (outside the scope of our purposes) causes it to spit those errors out on the next boot.

This was easily solved by amending our cloud-init file to have a power_state stanza. The VM now boots, performs the initialization process, and is then halted by power_state (not powered off; we just want to leave it in a state where yanking the plug won't damage the disk). When we then do exactly that, the disk is left in a state that will boot successfully the next time.
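
As a rough illustration of such a stanza, the following Python sketch writes a cloud-config user-data file that asks cloud-init to halt the VM once initialization is done. The exact stanza used in the fix is not shown in this report; the message text, the timeout value and the write_user_data helper are assumptions.

```python
# Hypothetical cloud-config contents; the values actually used in the fix may differ.
USER_DATA = """\
#cloud-config
power_state:
  mode: halt
  message: Halting after cloud-init has finished
  timeout: 120
"""


def write_user_data(path, contents=USER_DATA):
    """Write the cloud-config text to the file handed to the VM as user-data."""
    with open(path, "w") as handle:
        handle.write(contents)
    return path
```

With mode set to halt, the guest stops after initialization but qemu keeps running, so the test can still terminate it afterwards without leaving the filesystem in a dirty state.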

Once that was fixed, the VM still "failed to boot". The reason is that the script looks for the "END SSH HOST KEYS" string to determine whether the VM booted correctly. However, this string is only produced on the *first* boot, when creating those keys; subsequent boots will NOT have this string. This was fixed by using the "final_message" cloud-init option, so we have a consistent message when boot is completed successfully. I'm still checking for END SSH HOST KEYS to identify the first boot.
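
The detection logic described here can be sketched as follows. This is not the script's actual code: the final_message text and the classify_boot function are hypothetical, and only the "END SSH HOST KEYS" string comes from the description above.

```python
FIRST_BOOT_MARKER = "END SSH HOST KEYS"
# Hypothetical text; it must match whatever final_message is set to in the cloud-init file.
FINAL_MESSAGE = "CERT-TEST BOOT COMPLETE"


def classify_boot(console_output, final_message=FINAL_MESSAGE):
    """Classify a VM boot from its captured console output."""
    if final_message not in console_output:
        # cloud-init never printed its final message, so the boot did not complete.
        return "failed"
    if FIRST_BOOT_MARKER in console_output:
        # SSH host keys are only generated (and printed) on the first boot.
        return "first boot"
    return "subsequent boot"
```

Keying success on the final message alone gives the same result on every boot, while the host-key marker is only used to tell a first boot apart from a later one.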

With the proposed changes, the VM can be reused many times for testing.

Another option, which I didn't implement, is to simply copy the .img file to a temporary file and use the copy to perform the testing. That way we always have the "pristine" image as a starting point. I only considered this as a proof of concept and didn't really work on it, as it felt more complicated.
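
For reference, the copy-to-temporary-file alternative could look roughly like the sketch below. It was not implemented in the fix; the scratch_copy name and the use of a context manager are assumptions made for illustration.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def scratch_copy(image_path):
    """Yield the path of a throwaway copy of the image and delete it afterwards."""
    suffix = os.path.splitext(image_path)[1]
    fd, temp_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # Only the path is needed; shutil reopens the file itself.
    try:
        shutil.copyfile(image_path, temp_path)
        yield temp_path
    finally:
        # Remove the scratch copy whether or not the test succeeded.
        os.remove(temp_path)
```

The test would boot the VM from the yielded path inside a `with scratch_copy(image) as disk:` block, so the configured image is never written to. The trade-off is the time and disk space needed to copy the image on every run.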

Changed in checkbox:
assignee: nobody → Daniel Manrique (roadmr)
status: Confirmed → In Progress
milestone: none → 2014-01-17
Daniel Manrique (roadmr)
Changed in checkbox:
status: In Progress → Fix Committed
Daniel Manrique (roadmr)
Changed in checkbox:
milestone: 2014-01-17 → 2013-dec-20
status: Fix Committed → Fix Released