Boot with software RAID often causes mdadm to not complete (possible race)

Bug #140854 reported by netslayer
Affects: Ubuntu
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Since Feisty, when I first set up this RAID configuration, mdadm has not always assembled the arrays correctly at boot even though everything seems to be set up correctly. I just did a fresh format and install of Gutsy and it is still doing it. Early in boot the status bar starts to load, gets about 1/5th of the way, and locks up (disk activity light solid, high I/O). I can go to a console, and I know at this point to press Ctrl+Alt+Del to reboot, or else the boot will eventually fail out. Eventually my system will boot.

Once I was able to get a terminal: ps aux | grep mdadm
root 4508 0.0 0.0 10360 608 ? S< 00:49 0:00 /lib/udev/watershed /sbin/mdadm --assemble --scan --no-degraded
root 4509 0.0 0.0 10364 580 ? S< 00:49 0:00 /lib/udev/watershed /sbin/mdadm --assemble --scan --no-degraded
root 8436 0.0 0.0 12392 528 ? Ss 00:54 0:00 /sbin/mdadm --monitor --pid-file /var/run/mdadm/monitor.pid --daemonise --scan --syslog
root 8892 34.1 26.6 562048 550600 ? D< 00:56 0:34 /sbin/mdadm --assemble --scan --no-degraded
chris 8951 0.0 0.0 5124 836 pts/2 S+ 00:57 0:00 grep mdadm

Notice that the last mdadm process kicked off during boot is using 27% of my memory and creeping up fast; within a minute all my RAM (2 GB) is gone, then it hits swap until it consumes that too and the system becomes unusable. This happens about half the time I start my computer, and the drives are always detected during POST. I can even killall the mdadm processes and they reappear with the same behavior. I suspect some kind of race condition between udev/mount/mdadm leaves my drives in a state that will not work. I even tried adding a udevsettle timeout in the init scripts, but it didn't help.
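
For reference, a quick way to watch the runaway assemble process eat memory is just standard ps/watch usage; this is only an illustrative snippet, nothing mdadm-specific:

# Refresh the memory usage of all mdadm processes every second (Ctrl+C to stop).
watch -n 1 'ps -C mdadm -o pid,%mem,rss,args'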

Whether I ran mdadm --assemble --scan manually or just checked dmesg, the kernel log would be flooded with this:
md: array md0 already has disks!
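
(For the record, the flood is easy to confirm with ordinary dmesg/grep usage; a rough sketch:)

# Count how many times the message has been logged so far, then watch the tail
# of the kernel log to see whether it is still growing.
dmesg | grep -c 'already has disks'
dmesg | tail -n 20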

Until recently the --no-degraded option was not on by default, and I would always seem to lose a drive during boot if I restarted during the mount attempt with Ctrl+Alt+Del. I would then have to re-add it.

Setup:
Latest Gutsy Ubuntu
AMD Opteron 170 (X2)

/dev/md0:
4x300GB drives

/dev/md1:
5x500GB drives

md0 : active raid5 sde[0] hdd[3] hdb[2] sdf[1]
      879171840 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md1 : active raid5 sdh1[0] sdd1[4] sdc1[3] sdb1[2] sda1[1]
      1953535744 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]

chris@delorean:~$ cat /etc/mdadm/mdadm.conf
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE partitions

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays
#ARRAY /dev/md0 level=raid0 num-devices=2 UUID=db874dd1:759d986d:cc05af5c:cfa1abed
ARRAY /dev/md1 level=raid5 num-devices=5 UUID=9c2534ac:1de3420b:e368bf24:bd0fce41
ARRAY /dev/md2 level=raid0 num-devices=2 UUID=0cc8706d:e9eedd66:33a70373:7f0eea01
ARRAY /dev/md3 level=raid0 num-devices=2 UUID=af45c1d8:338c6b67:e4ad6b92:34a89b78

# This file was auto-generated on Wed, 06 Jun 2007 19:58:28 -0700
# by mkconf $Id: mkconf 261 2006-11-09 13:32:35Z madduck $

Revision history for this message
netslayer (netslayer007) wrote :

I just noticed that the mdadm.conf I posted doesn't match my drive configuration, and I don't think it is being used since all the UUIDs are on the partition tables. In Feisty I had this file set up correctly and it made no difference. (This is what Gutsy set up for me.)

My drive IDs, as mapped to the cat /proc/mdstat output above:
/dev/md0
/dev/sde UUID : 2fc65d39:8f7cbdb5:7072d7cc:0e4fe29d (local to host delorean)
/dev/hdd UUID : 2fc65d39:8f7cbdb5:7072d7cc:0e4fe29d (local to host delorean)
/dev/hdb UUID : 2fc65d39:8f7cbdb5:7072d7cc:0e4fe29d (local to host delorean)
/dev/sdf UUID : 2fc65d39:8f7cbdb5:7072d7cc:0e4fe29d (local to host delorean)

/dev/md1
/dev/sdh1 UUID : 2fc65d39:8f7cbdb5:7072d7cc:0e4fe29d (local to host delorean)
/dev/sdd1 UUID : 2fc65d39:8f7cbdb5:7072d7cc:0e4fe29d (local to host delorean)
/dev/sdc1 UUID : 2fc65d39:8f7cbdb5:7072d7cc:0e4fe29d (local to host delorean)
/dev/sdb1 UUID : 2fc65d39:8f7cbdb5:7072d7cc:0e4fe29d (local to host delorean)
/dev/sda1 UUID : 2fc65d39:8f7cbdb5:7072d7cc:0e4fe29d (local to host delorean)

Revision history for this message
netslayer (netslayer007) wrote :

** sighs (one more correction)
The second UUID set in my previous comment is wrong (poor copy and paste); the correct values are:
chris@delorean:~$ sudo mdadm -E /dev/sdh1 | grep UUID
           UUID : 9c2534ac:1de3420b:e368bf24:bd0fce41
chris@delorean:~$ sudo mdadm -E /dev/sdd1 | grep UUID
           UUID : 9c2534ac:1de3420b:e368bf24:bd0fce41
chris@delorean:~$ sudo mdadm -E /dev/sdc1 | grep UUID
           UUID : 9c2534ac:1de3420b:e368bf24:bd0fce41
chris@delorean:~$ sudo mdadm -E /dev/sdb1 | grep UUID
           UUID : 9c2534ac:1de3420b:e368bf24:bd0fce41
chris@delorean:~$ sudo mdadm -E /dev/sda1 | grep UUID
           UUID : 9c2534ac:1de3420b:e368bf24:bd0fce41

Revision history for this message
luthos (luthos11) wrote :

I have exactly the same problem.

My setup:

  Xubuntu Gutsy with latest updates
  Athlon 64 1.6 GHz, 1 GB RAM
  /dev/md0 is a RAID 5 with 3x250 GB disks (2x SATA and 1x PATA)

I'm not too experienced with Linux; the only idea I came up with was disabling the splash screen with the status bar so I could see what's going on. The boot loads some hardware drivers, I think, and then it's the same as netslayer described: it prints "md: array md0 already has disks!" over and over again.
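
For anyone wanting to do the same, the splash screen can be disabled by removing "quiet splash" from the kernel line of your boot entry in /boot/grub/menu.lst (GRUB legacy, as shipped with Gutsy). The kernel version and root= device below are only a hypothetical example; use whatever your own entry shows:

# Open the GRUB legacy menu and delete "quiet splash" from the kernel line of
# the entry you boot, e.g. change
#   kernel /boot/vmlinuz-2.6.22-14-generic root=/dev/sda1 ro quiet splash
# to
#   kernel /boot/vmlinuz-2.6.22-14-generic root=/dev/sda1 ro
sudoedit /boot/grub/menu.lst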

I always have to try several times until the system boots, but once it finally does, everything works fine, even the RAID.

Revision history for this message
netslayer (netslayer007) wrote :

I should experiment with my 4-disk RAID 5 array disconnected. It's the one that has half SATA and half PATA drives as well. We have pretty much the same setup, and I just formatted my drive and installed Gutsy, so it's probably not just us.

Revision history for this message
netslayer (netslayer007) wrote :

I just tried the solution in the duplicate bug https://bugs.launchpad.net/ubuntu/+source/mdadm/+bug/139802

It looks like /usr/share/mdadm/mkconf >/etc/mdadm/mdadm.conf is detecting an array I don't have from somewhere. You can see that the output in my mdadm.conf is totally messed up, which now explains why it is trying to add a drive to an array that doesn't exist. It could be an older one I had.. ??

ARRAY /dev/md0 level=raid5 num-devices=4 UUID=2fc65d39:8f7cbdb5:7072d7cc:0e4fe29d
ARRAY /dev/md0 level=raid0 num-devices=2 UUID=db874dd1:759d986d:cc05af5c:cfa1abed
ARRAY /dev/md1 level=raid5 num-devices=5 UUID=9c2534ac:1de3420b:e368bf24:bd0fce41

mdadm -E /dev/sdf
UUID : db874dd1:759d986d:cc05af5c:cfa1abed

mdadm -E /dev/sdh
UUID : db874dd1:759d986d:cc05af5c:cfa1abed

Interesting. So what happened is that I bought 3 new drives when I created this array, and the remaining two that I formatted and put in it are the ones that still have the old UUIDs. Since I built the RAID array on partitions of these drives instead of on the physical devices, two of my drives carry two UUIDs each (the old one on the whole disk, the new one on the partition). Then I guess udev finds them at different times, and that causes the boot to fail when it tries to bring up the old array and add those members to an existing one.. a total race condition.

So how do I get rid of the old UUIDs safely?
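
A rough way to see which md superblock UUID is recorded on every disk and partition at once is a plain shell loop over mdadm -E; the device globs below match my box and are only an example:

# Print the array UUID stored in each candidate device's md superblock.
# Devices without a superblock are silently skipped; anything reporting the
# old UUID is a leftover from a previous array.
for dev in /dev/sd[a-h] /dev/sd[a-h]1 /dev/hd[bd]; do
    uuid=$(sudo mdadm -E "$dev" 2>/dev/null | grep -w UUID)
    [ -n "$uuid" ] && echo "$dev: $uuid"
done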

Revision history for this message
netslayer (netslayer007) wrote :

I can confirm this fixed it, but I think it is still a bug in the mdadm tool or in the udev handling, which copes with this poorly:

1. Use mdadm -E /dev/sdX1 and mdadm -E /dev/sdX to examine the UUIDs recorded in the RAID superblocks. One of them will be a UUID you no longer use; check the other drives and determine which UUID is no longer valid.
2. Unmount the RAID array (umount /dev/md1) and stop it (mdadm --stop /dev/md1).
3. Run mdadm --zero-superblock /dev/sdf.
4. Run mdadm --zero-superblock /dev/sdh as well, as I had two bad ones.
5. Remount the array, or reboot.

Mine works perfectly now :-)
Please make sure you are zeroing out the right drives before you do this (a condensed command sketch follows below).
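
To condense the steps above into commands: the device and array names are the ones from my setup and are only an example, so double-check with mdadm -E which devices really carry the stale UUID before zeroing anything.

# 1. Confirm which devices carry the stale (no longer used) array UUID.
sudo mdadm -E /dev/sdf | grep UUID
sudo mdadm -E /dev/sdh | grep UUID

# 2. Unmount and stop the affected array.
sudo umount /dev/md1
sudo mdadm --stop /dev/md1

# 3. Wipe the obsolete superblocks (destructive -- only on the stale devices!).
sudo mdadm --zero-superblock /dev/sdf
sudo mdadm --zero-superblock /dev/sdh

# 4. Reassemble the arrays (or simply reboot).
sudo mdadm --assemble --scan --no-degraded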

Revision history for this message
luthos (luthos11) wrote :

Hm... I just had to add the array definition to mdadm.conf:

# definitions of existing MD arrays
ARRAY /dev/md0 level=raid5 num-devices=2 UUID=c48b6ad2:935cba2b:e9e4094c:aa5ad989

No zeroing of superblocks was necessary. I did a few reboots and everything seems to work perfectly.
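
For anyone else doing this, the ARRAY lines do not have to be typed by hand; they can be generated from the arrays that are currently assembled (standard mdadm usage, though the update-initramfs step is my assumption about keeping the early-boot copy of the config in sync):

# Show ARRAY definitions for the arrays that are running right now...
sudo mdadm --detail --scan

# ...append them to the config file (review it afterwards for duplicates or
# stale entries), then rebuild the initramfs so early boot sees the same file.
sudo sh -c 'mdadm --detail --scan >> /etc/mdadm/mdadm.conf'
sudo update-initramfs -u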
