libvirt fails to start correctly because LVM is not ready

Bug #893450 reported by Horacio
This bug affects 1 person
Affects: libvirt (Ubuntu) | Status: Expired | Importance: High | Assigned to: Unassigned
Affects: udev (Ubuntu) | Status: Expired | Importance: Undecided | Assigned to: Unassigned

Bug Description

Sometimes one of the KVM guests fails to start. I have 3 guests; two started OK and one failed (the same one failed both times). After rebooting the host twice, the KVM guest started OK, but this is a server and I find this behavior too risky. The host server is running "Ubuntu 11.04 Server 64bits" and libvirt 0.8.8-1ubuntu6.5.

Both times it failed, I found this in syslog:

error : virSecurityDACSetOwnership:125 : unable to set user and group to '105:115' on '/dev/vg_default/lv_robot-pv0': No such file or directory
kernel: [ 200.354543] type=1400 audit(1321932141.068:11): apparmor="DENIED" operation="open" parent=2539 profile="/usr/lib/libvirt/virt-aa-helper" name="/dev/dm-7" pid=2638 comm="virt-aa-helper" requested_mask="r" denied_mask="r" fsuid=0 ouid=105
kernel: [ 200.692255] type=1400 audit(1321932141.408:12): apparmor="STATUS" operation="profile_load" name="libvirt-7262201f-566b-d0b1-16ec-0b404ccd5336" pid=2639 comm="apparmor_parser"
libvirtd: 00:22:21.424: 2539: error : virSecurityDACRestoreSecurityFileLabel:143 : cannot resolve symlink /dev/vg_default/lv_robot-pv0: No such file or directory
libvirtd: 00:22:21.707: 2539: error : qemuAutostartDomain:275 : Failed to autostart VM 'robot': unable to set user and group to '105:115' on '/dev/vg_default/lv_robot-pv0': No such file or directory

The LVM device is on a software RAID. Maybe this is taking too long to come up?

FYI, running "aa-status" (after the reboots) gives me:
apparmor module is loaded.
9 profiles are loaded.
9 profiles are in enforce mode.
   /sbin/dhclient
   /usr/lib/NetworkManager/nm-dhcp-client.action
   /usr/lib/connman/scripts/dhclient-script
   /usr/lib/libvirt/virt-aa-helper
   /usr/sbin/libvirtd
   /usr/sbin/tcpdump
   libvirt-7262201f-566b-d0b1-16ec-0b404ccd5336
   libvirt-c032ea0a-8c62-7730-fb4d-e1bf60c15a31
   libvirt-f06ad419-f312-f002-444f-3e51f40d2291
0 profiles are in complain mode.
4 processes have profiles defined.
4 processes are in enforce mode :
   /usr/sbin/libvirtd (2459)
   libvirt-7262201f-566b-d0b1-16ec-0b404ccd5336 (2521)
   libvirt-c032ea0a-8c62-7730-fb4d-e1bf60c15a31 (2551)
   libvirt-f06ad419-f312-f002-444f-3e51f40d2291 (2582)
0 processes are in complain mode.
0 processes are unconfined but have a profile defined.

Revision history for this message
Horacio (hgdeoro) wrote :
Changed in libvirt (Ubuntu):
importance: Undecided → High
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks for submitting this bug. I'm marking it high priority as it fits 'severe impact on a small number of users'.

The problem appears to be that the logical volume is slow in coming up, such that libvirt starts before udev creates its device node. Working around this in your particular case should be pretty easy, but a general fix will be harder.

To work around it in your case, please try adding the following to your /etc/init/libvirt-bin.conf file, at the end of the 'pre-start script' section:

   while [ ! -e /dev/vg_default/lv_robot-pv0 ]; do
      sleep 1s
   done

So, for instance, the pre-start script section would probably become:

pre-start script
 [ -r /etc/default/libvirt-bin ] && . /etc/default/libvirt-bin
 [ ! "x$start_libvirtd" = "xyes" ] && { stop; exit 0; }
 mkdir -p /var/run/libvirt
 # Clean up a pidfile that might be left around
 rm -f /var/run/libvirtd.pid
 date >> /var/log/libvirt-upstart
 echo $$ >> /var/log/libvirt-upstart
 cat /proc/self/cmdline >> /var/log/libvirt-upstart
 cat /proc/self/cgroup >> /var/log/libvirt-upstart
 while [ ! -e /dev/vg_default/lv_robot-pv0 ]; do
  sleep 1s
 done
end script

Please let us know if that does not work around your problem for now.

Changed in libvirt (Ubuntu):
status: New → Incomplete
Revision history for this message
Horacio (hgdeoro) wrote :

Thank you very much for the response! I was able to do some more tests, and something strange happened.
I made the changes to /etc/init/libvirt-bin.conf and added some sort of "timeout":

# LVM_DEVS_TO_CHECK holds the list of device paths to wait for
ls -lh /dev/vg_default | logger -t libvirt-bin-start-script
for a_device in $LVM_DEVS_TO_CHECK ; do
    echo "Device $a_device" | logger -t libvirt-bin-start-script
    iter=0
    while [ ! -e $a_device -a $iter -lt 5 ]; do
        echo " + Not found: $a_device" | logger -t libvirt-bin-start-script
        ls -lh /dev/vg_default | logger -t libvirt-bin-start-script
        sleep 1s
        iter=$(( $iter + 1 ))
    done
done

One of the KVM guests didn't start. Searching the syslog, I found:

Nov 23 01:18:36 web4 kernel: [ 33.760027] br1: port 1(eth1) entering forwarding state
Nov 23 01:18:38 web4 kernel: [ 36.120089] br0: port 1(eth0) entering forwarding state
Nov 23 01:18:49 web4 ntpdate[1847]: no server suitable for synchronization found
Nov 23 01:21:20 web4 init: udevtrigger post-stop process (404) terminated with status 1
Nov 23 01:21:20 web4 libvirt-bin-start-script: .................. \
Nov 23 01:21:20 web4 libvirt-bin-start-script: .................. | output of "logger", with the contents of /dev/vg_default
Nov 23 01:21:20 web4 libvirt-bin-start-script: .................. /
Nov 23 01:21:21 web4 init: udev-fallback-graphics main process (2531) terminated with status 1
Nov 23 01:21:21 web4 init: plymouth-splash main process (2563) terminated with status 1
Nov 23 01:21:21 web4 udevd[366]: worker [393] unexpectedly returned with status 0x0100
Nov 23 01:21:21 web4 udevd[366]: worker [393] failed while handling '/devices/virtual/block/dm-2'
Nov 23 01:21:21 web4 udevd[366]: worker [388] unexpectedly returned with status 0x0100
Nov 23 01:21:21 web4 udevd[366]: worker [388] failed while handling '/devices/virtual/block/md1'
Nov 23 01:21:21 web4 libvirt-bin-start-script: ......................
Nov 23 01:21:24 web4 libvirt-bin-start-script: last message repeated 2 times
Nov 23 01:21:24 web4 udevd[366]: worker [376] unexpectedly returned with status 0x0100
Nov 23 01:21:24 web4 udevd[366]: worker [376] failed while handling '/devices/virtual/block/dm-8'
Nov 23 01:21:25 web4 libvirt-bin-start-script: ......................
Nov 23 01:21:26 web4 libvirt-bin-start-script: ......................
Nov 23 01:21:26 web4 libvirt-bin-start-script: ......................

The output of "logger" show me that 4 logical volumes were missing, and after waiting for 20 seconds, the scripts continued, started libvirt and one of the guest didn't started up.
Those devices were missing on /dev/mapper/ an /dev/vg_default. But doing a "lvs" worked OK and the missing devices where shown. So, the devices where found and recognized by LVM, but the links are missing.

So, I changed the script to wait 120 seconds for each device and rebooted. This time I had more time to see what was going on. Checking the syslog I got the same "udevtrigger" and "udevd" errors. 4 devices were missing again, so it took 8 minutes and then stopped checking.

During these 8 minutes the device links never appeared. "lvs" listed the devices OK, but no links were in /dev. I know this happened because of all the 'ls -lh /dev/vg_default' output...


Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Actually based on your syslog excerpts, this sounds to me like you might have bad storage. In particular:

Nov 23 01:21:24 web4 udevd[366]: worker [376] unexpectedly returned with status 0x0100
Nov 23 01:21:24 web4 udevd[366]: worker [376] failed while handling '/devices/virtual/block/dm-8'

(and same for dm-1)

Can you describe your storage setup? What are you using for the LVM volumes? Multipath?

Also, to get us a bit more information, could you do

apport-collect 893450

?

Changed in libvirt (Ubuntu):
status: Incomplete → New
status: New → Incomplete
Revision history for this message
Horacio (hgdeoro) wrote :

apport-collect didn't work (the server has no direct internet connection).

-------------------------
lspci
-------------------------

00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
05:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1064ET PCI-Express Fusion-MPT SAS (rev 02)

-------------------------
some of the dmesg output
-------------------------

[ 1.152651] Fusion MPT base driver 3.04.18
[ 1.152707] Copyright (c) 1999-2008 LSI Corporation
[ 1.157884] async_tx: api initialized (async)
[ 1.170812] Fusion MPT SAS Host driver 3.04.18
[ 1.170911] mptsas 0000:05:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 1.171196] mptbase: ioc0: Initiating bringup
[ 1.330050] raid6: int64x1 1469 MB/s
[ 1.390018] Refined TSC clocksource calibration: 1866.732 MHz.
[ 1.390078] Switching to clocksource tsc
[ 1.500025] raid6: int64x2 1967 MB/s
[ 1.670042] raid6: int64x4 1418 MB/s
[ 1.840049] raid6: int64x8 1323 MB/s
[ 2.010011] raid6: sse2x1 3152 MB/s
[ 2.180022] raid6: sse2x2 3804 MB/s
[ 2.350016] raid6: sse2x4 5600 MB/s
[ 2.350070] raid6: using algorithm sse2x4 (5600 MB/s)
[ 2.351541] xor: automatically using best checksumming function: generic_sse
[ 2.400007] generic_sse: 6836.000 MB/sec
[ 2.400062] xor: using function: generic_sse (6836.000 MB/sec)
[ 2.405934] md: raid6 personality registered for level 6
[ 2.405994] md: raid5 personality registered for level 5
[ 2.406051] md: raid4 personality registered for level 4
[ 2.413276] md: raid10 personality registered for level 10
[ 3.670031] ioc0: LSISAS1064E B1: Capabilities={Initiator}
[ 3.670248] mptsas 0000:05:00.0: setting latency timer to 64
[ 9.833435] scsi2 : ioc0: LSISAS1064E B1, FwRev=01100000h, Ports=1, MaxQ=511, IRQ=16
[ 9.857778] mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 0, phy 0, sas_addr 0x455d3742e2bbac8f
[ 9.866008] scsi 2:0:0:0: Direct-Access ATA WDC WD20EARS-00M AB51 PQ: 0 ANSI: 5
[ 9.867787] sd 2:0:0:0: Attached scsi generic sg1 type 0
[ 9.868136] sd 2:0:0:0: [sda] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
[ 9.869384] mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 1, phy 1, sas_addr 0x455d393cddc2ad95
[ 9.877777] scsi 2:0:1:0: Direct-Access ATA WDC WD20EARS-00M AB51 PQ: 0 ANSI: 5
[ 9.879545] sd 2:0:1:0: Attached scsi generic sg2 type 0
[ 9.879901] sd 2:0:1:0: [sdb] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
[ 9.886711] sd 2:0:0:0: [sda] Write Protect is off
[ 9.886771] sd 2:0:0:0: [sda] Mode Sense: 73 00 00 08
[ 9.894839] sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 9.897900] sd 2:0:1:0: [sdb] Write Protect is off
[ 9.897959] sd 2:0:1:0: [sdb] Mode Sense: 73 00 00 08
[ 9.905889] sd 2:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 9.976991] sda: sda1 sda2 sda3 sda4
[ 9.988416] sdb: sdb1 sdb2 sdb3 sdb4
[ 10.003907] sd 2:0:0:0: [sda] Attached SCSI disk
[ 10.015535] sd 2:0:1:0: [sdb] Attached SCSI disk
[ 10.511656] md: bind<sda4>
[ 10.52...


summary: - KVM guest fails to autostart sometimes with
- virSecurityDACRestoreSecurityFileLabel error
+ libvirt fails to start correctly because LVM is not ready
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Hi,

to find out more, we need the udev debug output. Could you please edit /usr/share/initramfs-tools/scripts/init-top/udev so that the line starting udevd reads:

/sbin/udevd --daemon --debug --resolve-names=never > /dev/.initramfs/udev.out 2>&1

then run

  sudo update-initramfs -k all -u

After a reboot, you'll find the debug info in /dev/.initramfs/udev.out. Please attach that here.

Please also give us the output of the commands:

  sudo pvscan
  sudo vgscan
  sudo lvscan

Changed in udev (Ubuntu):
status: New → Incomplete
MR_JOC (joc-jpg)
Changed in udev (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Please don't change the state from incomplete until the requested information has been provided.

Changed in udev (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Horacio (hgdeoro) wrote :

This is the output of the lvm scan commands. I'll ask permission to reboot the server tonight to post the debug output of udev.

web4:~# sudo pvscan
  PV /dev/md1 VG vg_default lvm2 [1.81 TiB / 1.33 TiB free]
  Total: 1 [1.81 TiB] / in use: 1 [1.81 TiB] / in no VG: 0 [0 ]

web4:~# sudo vgscan
  Reading all physical volumes. This may take a while...
  Found volume group "vg_default" using metadata type lvm2

web4:~# sudo lvscan
  ACTIVE '/dev/vg_default/lv_desa_web_2-root' [5.00 GiB] inherit
  ACTIVE '/dev/vg_default/lv_desa_web_2-swap' [1.00 GiB] inherit
  ACTIVE '/dev/vg_default/lv_desa_web_2-pv0' [150.00 GiB] inherit
  ACTIVE '/dev/vg_default/lv_rsync' [200.00 GiB] inherit
  ACTIVE '/dev/vg_default/lv_cms2-pv0' [10.00 GiB] inherit
  ACTIVE '/dev/vg_default/lv_cms2-root' [4.00 GiB] inherit
  ACTIVE '/dev/vg_default/lv_jaime-root' [5.00 GiB] inherit
  ACTIVE '/dev/vg_default/lv_jaime-pv0' [10.00 GiB] inherit
  ACTIVE '/dev/vg_default/lv_monitor-swap' [1.00 GiB] inherit
  ACTIVE '/dev/vg_default/lv_monitor-root' [5.00 GiB] inherit
  ACTIVE '/dev/vg_default/lv_monitor-pv0' [50.00 GiB] inherit
  ACTIVE '/dev/vg_default/lv_monitor-var' [47.00 GiB] inherit

Revision history for this message
Horacio (hgdeoro) wrote :

I've attached /dev/.initramfs/udev.out. I don't know if this is relevant, but the write performance on the disks is horrible; I think this could be caused by mis-alignment of the partitions ("dd if=/dev/zero of=zeros bs=64b oflag=dsync" gives me around 400 kB/s).

Changed in libvirt (Ubuntu):
status: Incomplete → New
Changed in udev (Ubuntu):
status: Incomplete → New
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Sure enough, lv_robot-pv0 (the backing LV for the VM) doesn't show up in either the lvscan output or the udev debug output.

Certainly an LVM failure here.

Revision history for this message
Peter Petrakis (peter-petrakis) wrote :

In re-examining the evidence and chatting with Serge a few things come to mind.

1) There's a disparity between the conclusion formed in comment #3 and the latest
udev logs. It could easily be that the udev logging simply ran long enough
for that last LV to be activated. It does appear that if we wait long enough
(approx. 10 mins) all LVs will come online.

2) The crux: is udev being starved of events, or is it being stalled by some other action?

3) Comment #9 compelled me to dig around concerning the backing stores of the
software RAID 1 set. The WD20EARS drives appear to use 4K sectors, which, if true,
would lend credence to *slowness* being the factor (udev is starved) that explains
why it takes almost 10 minutes to discover all the LVs.

Follow up:
1) verify sector size of backing stores by installing sg3-utils and posting the result of
 # sg_readcap /dev/sda
and
 # cat /sys/block/sda/queue/physical_block_size

2) the basic dd test is fine; please re-run it with a sweep of block sizes (512, 4k, 8k, 1024k)
with "count=100" so it doesn't run forever (see the sketch after this list)

3) an additional I/O benchmark, like iozone or iometer, would be informative;
whichever is your preference.
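
For item 2, a minimal sketch of such a sweep (the output file "zeros" is just an example; run it on the filesystem under test):

   for bs in 512 4096 8192 1048576; do
       echo "=== bs=$bs bytes ==="
       dd if=/dev/zero of=zeros bs=$bs count=100 oflag=dsync
   done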

Conclusion:
If this is an alignment issue then we've got a lot of work to do. You might be able
to resize/grow the MD array to change the stripe size to be a multiple of the sector
size. Though you would need to carry the exercise all the way through, including LVM
metadata offsets. It would be optimal if you had a duplicate set of drives to reproduce
this issue on a separate system. Thanks.

Changed in libvirt (Ubuntu):
status: New → Incomplete
Changed in udev (Ubuntu):
status: New → Incomplete
Revision history for this message
Horacio (hgdeoro) wrote :

Ok, I've added Seagate disks, manually created partitions (aligned to 1 MiB), added the new disks to the RAID, and removed the WD disks. The RAID is now over the two Seagate disks, and the performance is still crap.

# hdparm -i /dev/sda
 Model=ST2000DL003-9VT166, FwRev=CC32, SerialNo=6YD0RYED
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=unknown, BuffSize=unknown, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=3907029168

I've tried to make the partitions aligned to 1 MiB; maybe I made the partitions wrong. These are the partitions on /dev/sda:

# sfdisk -s -d /dev/sda
1953514584
# partition table of /dev/sda
unit: sectors

/dev/sda1 : start= 2048, size= 19531776, Id=fd
/dev/sda2 : start= 19533824, size= 9762816, Id=83
/dev/sda3 : start= 29296640, size=3877730304, Id=fd
/dev/sda4 : start= 0, size= 0, Id= 0

Please tell me if the partitions aren't aligned. I think I did this right: the start sectors are multiples of 2048 (1 MiB).
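
A quick sanity check over the sfdisk dump above (a sketch; it only verifies that each start sector is a multiple of 2048):

# sfdisk -d /dev/sda | awk -F'[ =,]+' '/start=/ { print $1, ($4 % 2048 == 0 ? "1 MiB aligned" : "NOT aligned") }'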

If the partitions are right, the problem isn't the disks, nor RAID, LVM, or the filesystem, since I've tested with dd directly over /dev/sda2 (plain partition: no RAID, no LVM, no filesystem):

# dd if=/dev/zero of=/dev/sda2 count=1000 oflag=dsync bs=16384
16384000 bytes (16 MB) copied, 12.5848 s, 1.3 MB/s

# dd if=/dev/zero of=/dev/sda2 count=1000 oflag=dsync bs=16384
16384000 bytes (16 MB) copied, 14.4387 s, 1.1 MB/s

With raid, lvm and ext4, the performance is:

# dd if=/dev/zero of=zeros count=1000 oflag=dsync bs=512
512000 bytes (512 kB) copied, 19.4532 s, 26.3 kB/s

# dd if=/dev/zero of=zeros count=1000 oflag=dsync bs=1024
1024000 bytes (1.0 MB) copied, 22.373 s, 45.8 kB/s

# dd if=/dev/zero of=zeros count=1000 oflag=dsync bs=2048
2048000 bytes (2.0 MB) copied, 28.7906 s, 71.1 kB/s

# dd if=/dev/zero of=zeros count=1000 oflag=dsync bs=4096
4096000 bytes (4.1 MB) copied, 68.0617 s, 60.2 kB/s

# dd if=/dev/zero of=zeros count=1000 oflag=dsync bs=8192
8192000 bytes (8.2 MB) copied, 61.7608 s, 133 kB/s

# dd if=/dev/zero of=zeros count=1000 oflag=dsync bs=16384
16384000 bytes (16 MB) copied, 67.1279 s, 244 kB/s

Revision history for this message
Peter Petrakis (peter-petrakis) wrote :

Worked with the user on irc and ruled out screaming interrupts
and the LSI card itself (he removed it). User is upgrading to 3.x
kernel and will apprise us of his progress here.

Revision history for this message
Horacio (hgdeoro) wrote :

I've run more tests with the guidance of Peter Petrakis: I physically removed the LSI card and tried a newer kernel (3.0.0-15), without luck... the performance is the same (reads OK, but slow writes).

Revision history for this message
Horacio (hgdeoro) wrote :

Some comments about performance:
- reading works well with big and small block sizes
- writing with big block size works ok
- the problem arises when writing with a small block size

----------

Reading works ok with small and large block size:

reading blocks of 1KiB from sda2

# /usr/bin/time dd if=/dev/sda2 of=/dev/null bs=1024 count=$(( 1024*32 ))
32768+0 records in
32768+0 records out
33554432 bytes (34 MB) copied, 0.48956 s, 68.5 MB/s
0.00user 0.07system 0:00.49elapsed 14%CPU (0avgtext+0avgdata 3552maxresident)k
66016inputs+0outputs (0major+268minor)pagefaults 0swaps

# /usr/bin/time dd if=/dev/sda2 of=/dev/null bs=1024 count=$(( 1024*1024 ))
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB) copied, 8.74106 s, 123 MB/s
0.08user 2.23system 0:08.74elapsed 26%CPU (0avgtext+0avgdata 3536maxresident)k
2097632inputs+0outputs (0major+268minor)pagefaults 0swaps

reading blocks of 1MiB from sdb2 (to ensure that nothing is cached)

# /usr/bin/time dd if=/dev/sdb2 of=/dev/null bs=$((1024*1024)) count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 8.76866 s, 122 MB/s
0.00user 1.82system 0:08.77elapsed 20%CPU (0avgtext+0avgdata 7616maxresident)k
2097408inputs+0outputs (0major+522minor)pagefaults 0swaps

Now, writing with blocks of 16MiB works ok (64.5 MB/s!!!)

# /usr/bin/time dd if=/dev/zero of=/mnt/rsync/zeros bs=$((1024*1024*16)) count=128 oflag=dsync
128+0 records in
128+0 records out
2147483648 bytes (2.1 GB) copied, 33.2825 s, 64.5 MB/s
0.00user 6.69system 0:33.37elapsed 20%CPU (0avgtext+0avgdata 69072maxresident)k
1464inputs+4195328outputs (0major+4363minor)pagefaults 0swaps

With 1MiB the performance is bad (12.1 MB/s)

# /usr/bin/time dd if=/dev/zero of=/mnt/rsync/zeros bs=$((1024*1024)) count=256 oflag=dsync
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 22.1709 s, 12.1 MB/s
0.00user 0.97system 0:22.17elapsed 4%CPU (0avgtext+0avgdata 7616maxresident)k
40inputs+524288outputs (1major+521minor)pagefaults 0swaps

With 1KiB I got.. 33.3 kB/s!!! (I pressed Ctrl+C after some time)

# /usr/bin/time dd if=/dev/zero of=/mnt/rsync/zeros bs=1024 count=$(( 1024*32 )) oflag=dsync
(pressed Ctrl+C)
^C9069+0 records in
9069+0 records out
9286656 bytes (9.3 MB) copied, 279.276 s, 33.3 kB/s
Command terminated by signal 2
0.00user 1.30system 4:39.90elapsed 0%CPU (0avgtext+0avgdata 3504maxresident)k
0inputs+72552outputs (0major+265minor)pagefaults 0swaps

Revision history for this message
Horacio (hgdeoro) wrote :

And some iozone tests, with 2 child processes, O_DIRECT and SYNC mode (full log on pastebin: http://pastebin.com/amTbjfdr).

All with:
        O_DIRECT feature enabled
        SYNC Mode.
        Output is in Kbytes/sec
        Min process = 2
        Max process = 2
        Throughput test with 2 processes
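
The exact iozone invocation isn't shown here; something along these lines (an assumption reconstructed from the header above, with placeholder -r/-s values) would produce this kind of throughput output:

        iozone -I -o -t 2 -r 1k -s 64m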

----- Record size 1k

        Children see throughput for 2 initial writers = 11.50 KB/sec
        Avg throughput per process = 5.75 KB/sec

        Children see throughput for 2 rewriters = 12.45 KB/sec
        Avg throughput per process = 6.22 KB/sec

----- Record size 8k

        Children see throughput for 2 initial writers = 100.18 KB/sec
        Avg throughput per process = 50.09 KB/sec

        Children see throughput for 2 rewriters = 75.27 KB/sec
        Avg throughput per process = 37.64 KB/sec

----- Record size 16k

        Children see throughput for 2 initial writers = 277.43 KB/sec
        Avg throughput per process = 138.71 KB/sec

        Children see throughput for 2 rewriters = 224.98 KB/sec
        Avg throughput per process = 112.49 KB/sec

----- Record size 32k

        Children see throughput for 2 initial writers = 398.72 KB/sec
        Avg throughput per process = 199.36 KB/sec

        Children see throughput for 2 rewriters = 204.58 KB/sec
        Avg throughput per process = 102.29 KB/sec

----- Record size 64k

        Children see throughput for 2 initial writers = 406.26 KB/sec
        Avg throughput per process = 203.13 KB/sec

        Children see throughput for 2 rewriters = 447.78 KB/sec
        Avg throughput per process = 223.89 KB/sec

----- Record size 128k

        Children see throughput for 2 initial writers = 1711.68 KB/sec
        Avg throughput per process = 855.84 KB/sec

        Children see throughput for 2 rewriters = 1563.11 KB/sec
        Avg throughput per process = 781.56 KB/sec

----- Record size 256k

        Children see throughput for 2 initial writers = 3400.94 KB/sec
        Avg throughput per process = 1700.47 KB/sec

        Children see throughput for 2 rewriters = 2532.04 KB/sec
        Avg throughput per process = 1266.02 KB/sec

Revision history for this message
Horacio (hgdeoro) wrote :

The same tests, now on an ext4 filesystem over a plain partition (no RAID):

----- Record size 1k

        Children see throughput for 2 initial writers = 34.09 KB/sec
        Avg throughput per process = 17.04 KB/sec

        Children see throughput for 2 rewriters = 25.68 KB/sec
        Avg throughput per process = 12.84 KB/sec

----- Record size 8k

        Children see throughput for 2 initial writers = 195.30 KB/sec
        Avg throughput per process = 97.65 KB/sec

        Children see throughput for 2 rewriters = 199.05 KB/sec
        Avg throughput per process = 99.52 KB/sec

----- Record size 128k

        Children see throughput for 2 initial writers = 3248.37 KB/sec
        Avg throughput per process = 1624.19 KB/sec

        Children see throughput for 2 rewriters = 3197.46 KB/sec
        Avg throughput per process = 1598.73 KB/sec

Revision history for this message
Peter Petrakis (peter-petrakis) wrote :

Are these ext4 test results from the bare metal level or from inside a guest? As for the
guests themselves, are they using the raw volume for the image, or do they exist as an
image file on the filesystem? If they're image files, what fs type do they live on?
Thanks.

Revision history for this message
Horacio (hgdeoro) wrote :

All the tests were run at the bare metal level.
The results in comment 16 were run on ext4 over LVM over RAID 1.
The results in comment 17 were run on ext4 over a plain partition.

I don't know how bad those numbers are... Maybe it's normal that working with 1k blocks kills the disk performance?

Revision history for this message
Peter Petrakis (peter-petrakis) wrote :

The disk sector size and how it's managed really determine how well it will
perform under various access sizes. In your case remember, the disk sector
size is 4k, so we would hope that Linux is "doing the right thing" in coalescing
these 1k accesses into chunks which get the most out of your hard disks.

That's not our primary problem here, though; the primary problem is LVM
having trouble bringing all LVs online. These performance issues are
in addition to that. The LVM problem appears to have been addressed
by adding a root delay, so that's that.

We have yet to pin down where any of them have come from. That this
problem appeared after an update is suspicious, yet IIRC you've shown
that rolling kernels back and forth has no net impact on the symptoms.

If an update were really the root of all of this, you should be able to
install natty to a separate set of disks, re-run these tests, and see dramatically
different numbers for the 1K test runs.

We are now left with the actual configuration of your entire stack, the
loads generated by your VMs, and how they might be affecting your
IO subsystem.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for libvirt (Ubuntu) because there has been no activity for 60 days.]

Changed in libvirt (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for udev (Ubuntu) because there has been no activity for 60 days.]

Changed in udev (Ubuntu):
status: Incomplete → Expired