Data corruption with ext3 in striped logical volume

Bug #100126 reported by Andreas Schiffer
Affects: e2fsprogs (Ubuntu)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

I'm using Kubuntu 7.04 Beta1 on a FSJ AMILO Xi 1554 notebook (Core 2 Duo T7200, 2 GB RAM, 17", ATI Mobility Radeon X1900).
My notebook has two 160 GB hard disks.
I created an LVM volume group that includes both hard disks.
Then I created logical volumes for my root and home partitions; to improve performance, I created them as striped (RAID-0) volumes using the "-i 2" option of the lvcreate command:
root@linux ~# lvcreate -n lvstriped -L 1000M -i 2 volg1
Finally I formatted these logical volumes with mkfs.ext3 using the default parameters.
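For illustration, the complete sequence would have been roughly the following; the partition names /dev/sda2 and /dev/sdb2 are only placeholders for the LVM partitions on the two disks, not necessarily the devices actually used:
root@linux ~# pvcreate /dev/sda2 /dev/sdb2
root@linux ~# vgcreate volg1 /dev/sda2 /dev/sdb2
root@linux ~# lvcreate -n lvstriped -L 1000M -i 2 volg1
root@linux ~# mkfs.ext3 /dev/volg1/lvstriped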

During the four weeks that I used this setup, I noticed that (nearly) every unclean unmount of a partition during a shutdown or crash resulted in a forced complete fsck (not just a journal replay, but a time-consuming check covering the whole filesystem) on the next boot. These full fscks often reported a lot of errors that were then corrected (duplicated inodes, wrong counters in inodes, etc.); sometimes the fsck even dropped me to busybox or rebooted the system to continue with a fresh fsck.
From time to time I found some files in lost+found, and some KDE configuration files went missing. The worst thing I saw was a script that was corrupted during a crash: before the crash the script was fine, and afterwards it contained only binary garbage. What made me feel very uneasy is that the script was definitely only being read, never written, at the time of the crash. I hope that none of my personal data files (like vacation photos) have been corrupted without me noticing it.
By the way, I made sure that the partitions were mounted as ext3 with the journal enabled; I tried both data=ordered and data=journal, and neither helped.
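For reference, one way to set such a data mode is an /etc/fstab entry like the following (the device and mount point here are only placeholders):
/dev/volg1/lvstriped  /home  ext3  defaults,data=journal  0  2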

Finally I stored all my files to a backup medium, removed the volume group from the hard disks, created plain ext3 partitions (no LVM, and no RAID-0 striping), and restored my files to these partitions. Now everything works fine. When the system crashes, I see a journal replay on the next boot sequence, but no more lost files and no forced complete fsck.

Revision history for this message
Theodore Ts'o (tytso) wrote :

If the problem doesn't show up when you stop using LVM, it's highly unlikely to be an e2fsprogs bug!

Revision history for this message
Hew (hew) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. You reported this bug a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue for you. Can you try with the latest Ubuntu release (Hardy Heron)? Thanks in advance.

Changed in e2fsprogs:
status: New → Incomplete
Revision history for this message
Andreas Schiffer (andreas-schiffer) wrote :

For the past 14 months nobody cared about this bug, and obviously nobody tried to reproduce this bug on his system.

So why should I put any effort into checking whether the bug is still there? It would be a major effort for me to reproduce it.
And, as far as I can see, the only result would be that for the next 14 months this bug report would stay "Incomplete", with nobody caring about it.

Revision history for this message
Hew (hew) wrote :

You have been the only reporter of this bug, and as you have mentioned, a year has gone by. Perhaps the bug was fixed in the final 7.04 release? Maybe it has been solved in the latest 8.04 LTS? You filed this bug against e2fsprogs, and looking at the changelog since 7.04, a lot has happened. This is why we ask that you, as the only reporter, please test whether your issue is still present in the latest version. If we cannot confirm that this issue still exists, the bug will be closed. If you find that it is still present in Ubuntu Hardy, then we can mark this report as Confirmed.

Revision history for this message
Theodore Ts'o (tytso) wrote :

I very much doubt this was a problem with e2fsprogs, but rather with the RAID setup. If e2fsck was doing a full check after the journal replay, it's probably because the kernel had detected a filesystem consistency problem and had set the "filesystem has an error" flag. The symptoms described sound very much like (a) hardware problems, or (b) a serious RAID configuration error (i.e., where part of the RAID is also getting used as a swap device, or something crazy like that). Given that all of your problems went away when you stopped using LVM, that makes it HIGHLY unlikely (as in a 0.0000001% chance) that it is an e2fsprogs problem, since e2fsprogs really doesn't care what the underlying disk device is. It (and the kernel filesystem code) only asks that the disk device be reliable, and that what gets stored at a particular block address stays stored there.
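As a sketch of how to check for that flag (assuming the filesystem lives on /dev/volg1/lvstriped; substitute the actual device path), the superblock state can be read with:
dumpe2fs -h /dev/volg1/lvstriped | grep -i 'Filesystem state'
A state containing "with errors" would explain why e2fsck insisted on a full pass after the replay.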

Part of the reason why no one probably looked at it is because it was filed as an e2fsprogs bug, and when I, as an unpaid volunteer, looked at it, it was obviously NOT an e2fsprogs problem.

I can say that I am using LVM for all of my ext3 filesystems, as do many Ubuntu users, so if there were a major systemic problem, it would have been reported by now. This makes it highly likely that either (a) there is something specifically unusual about your system or your configuration, which is why no one else is seeing it, or (b) there is a hardware problem, or (c) this was some kind of user error.

Revision history for this message
Andreas Schiffer (andreas-schiffer) wrote :

Okay, let me first apologize for my last post sounding too accusatory.
Of course nearly all Linux developers are unpaid volunteers, and I can't expect anybody to spend their leisure time reproducing my problems.
I just had the impression that Hew had carelessly asked me to put a lot of effort into reproducing the defect without any concrete reason to believe the problem had been fixed in the meantime.
Sorry for the misunderstanding.

Is a swap partition on the RAID a bad thing to do?
Following the instructions on some web page, I used the whole space on my hard disks (except for a small boot partition) for the RAID-LVM, and created a swap partition in the LVM with commands like the following:
lvcreate -n swaplv -L 500M volg1
mkswap /dev/volg1/swaplv
swapon /dev/volg1/swaplv

Revision history for this message
Theodore Ts'o (tytso) wrote :

There's nothing wrong with a swap partition on the RAID. What I was suggesting was that maybe you were using swap on a plain disk partition that perhaps overlapped with part of the physical volumes backing the RAID. I'm not sure, but it clearly sounds like you have corruption showing up on your LVM device. Why that is the case is hard to say, but a partition table misconfiguration is certainly one of the possibilities.
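A rough way to check for that kind of overlap (the device names below are only examples) is to compare the partition table with what LVM and swap actually use:
fdisk -l /dev/sda /dev/sdb
pvs -o pv_name,pv_size,vg_name
cat /proc/swaps
If /proc/swaps lists a raw partition that also shows up in the pvs output, the swap and the volume group would be writing over each other.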

Ubuntu is a free distribution, which means best-effort support for bug reports. If you pay for commercial support (which Canonical may or may not provide, but companies like Red Hat, Novell, IBM, and HP certainly do), you'll get someone who will respond right away and say, "You know, this really doesn't look like an e2fsprogs bug", and help you determine whether it's a hardware problem, an LVM device driver problem, or a user configuration problem. Having someone who can do this well and with a fast response time takes time and money; training someone who can diplomatically suggest that maybe the Problem Exists Between Keyboard And Chair (PEBKAC) without truly pissing off the customer is difficult, and you pay the really good people who can do OS root-cause problem determination a very good salary.

So here, when a bug report gets filed as an e2fsprogs bug, I as the upstream volunteer will do a quick scan of the queue, look for common themes or things that are clearly bugs, and try to deal with them. Ubuntu's bug-scanning folks don't scan very frequently, and fair enough; this is a freebie service for them, and they are primarily focused on making the distribution better, as opposed to making sure someone isn't walking away from an expensive 7x24 support contract.

Regards,

Revision history for this message
Hew (hew) wrote :

We are closing this bug report because it lacks the information we need to investigate the problem, as described in the previous comments. Please reopen it if you can give us the missing information, and don't hesitate to submit bug reports in the future. To reopen the bug report you can click on the current status, under the Status column, and change the Status back to "New". Thanks again!

Changed in e2fsprogs:
status: Incomplete → Invalid