lucid system randomly locks up, does not recover

Bug #688068 reported by Jeremy Anderson
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: none listed

Bug Description

This system had months upon months of uptime in LTS 8.04.01. After upgrading to LTS 10.04.01, the longest the system has gone before freezing is 10 hours (that was with the 2.6.34-020634-generic mainline kernel for lucid).

X is not running on this system. It runs the following software for my home network:
kvm w/6-10 VMs running
apache webserver w/php & mysql
nagios v3 (same version compiled under 8.04.01, running with those config files, as user nagios)
samba
bind9
iptables for use as firewall
dhcpd
NFS server
dovecot (serving up old mail archives)
postfix (in a forwarding-only mode)
OpenSSH server

Nothing is getting logged anywhere on these freezes -- I've enabled remote syslogging to another Linux box on my network, and when the 10.04.01 system stops responding, it simply stops -- it doesn't log anything interesting at all -- it's like someone just switched the power off.

RAM checked out fine during an 8-hour memtest86+ run (3 passes were completed)
CPU temp was fine after an hour of using burnK7 from cpuburn
None of the 6 SATA drives reports any SMART errors
I found a couple of PSU calculators online, and all of them indicated the 500-watt FSP power supply in the system is more than sufficient for all attached hardware.
No USB devices are plugged in.

The screen goes blank immediately when the freezes occur. System activity seems to have little effect -- sometimes it freezes when the system is very idle, sometimes when the load average is around 2 (dual-core CPU in this machine, so that shouldn't be alarming). There are no signs of memory starvation at any point -- I've yet to see the swap active, even under very heavy load.

The system also experienced these hangs before KVM was installed. I was using VirtualBox, and I migrated to KVM because I thought VirtualBox was causing the crashes.

I have been experimenting with different timers.
My LONGEST uptime was with the mainline kernel, booted with the following options:

quiet ro splash ipv6.disable=1 clock_source=hpet

I am now experimenting with clock_source=jiffies on the 2.6.32-26-server kernel.
I have tried booting with the noacpi option, but that does not appear to help.
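
When experimenting with boot options like these, it can help to read back what the running kernel actually received and which clocksource it selected. The following is a minimal sketch, not something used in this report: `parse_cmdline` and `read_text` are hypothetical helper names, and it assumes the standard sysfs clocksource files exist on the machine. Note that the kernel documents this parameter as `clocksource=` (no underscore), so it may be worth checking whether `clock_source=` changed anything.

```python
from pathlib import Path


def parse_cmdline(text):
    """Split a kernel command line into {option: value}; bare flags map to None."""
    options = {}
    for token in text.split():
        # Split on the first "=" only, so root=UUID=... keeps its full value.
        key, sep, value = token.partition("=")
        options[key] = value if sep else None
    return options


def read_text(path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    print(parse_cmdline(read_text("/proc/cmdline") or ""))
    # Standard sysfs location for clocksource information on Linux.
    base = "/sys/devices/system/clocksource/clocksource0/"
    print("current:", read_text(base + "current_clocksource"))
    print("available:", read_text(base + "available_clocksource"))
```

If the "current" value does not change between boots, the option was probably not recognized.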

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-image-2.6.32-26-server 2.6.32-26.48
Regression: Yes
Reproducible: No
ProcVersionSignature: Ubuntu 2.6.32-26.48-server 2.6.32.24+drm33.11
Uname: Linux 2.6.32-26-server x86_64
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.21.
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/by-path', '/dev/snd/controlC0', '/dev/snd/hwC0D2', '/dev/snd/pcmC0D0c', '/dev/snd/pcmC0D0p', '/dev/snd/pcmC0D1c', '/dev/snd/pcmC0D1p', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info: Error: [Errno 2] No such file or directory
Card0.Amixer.values: Error: [Errno 2] No such file or directory
Date: Thu Dec 9 08:19:29 2010
Frequency: Once a day.
HibernationDevice: RESUME=UUID=c9c1e3a3-5f60-4e88-9192-038b30582b16
IwConfig: Error: [Errno 2] No such file or directory
MachineType: BIOSTAR Group A740G M2+
ProcCmdLine: root=UUID=5626f5d7-0210-432c-9200-ec6a1d599df3 ro quiet splash clock_source=jiffies
ProcEnviron:
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
RelatedPackageVersions: linux-firmware 1.34.1
RfKill: Error: [Errno 2] No such file or directory
SourcePackage: linux
dmi.bios.date: 05/02/2008
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 080014
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: A740G M2+
dmi.board.vendor: BIOSTAR Group
dmi.board.version: 6.0
dmi.chassis.asset.tag: None
dmi.chassis.type: 3
dmi.chassis.vendor: BIOSTAR Group
dmi.chassis.version: 6.0
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr080014:bd05/02/2008:svnBIOSTARGroup:pnA740GM2+:pvr6.0:rvnBIOSTARGroup:rnA740GM2+:rvr6.0:cvnBIOSTARGroup:ct3:cvr6.0:
dmi.product.name: A740G M2+
dmi.product.version: 6.0
dmi.sys.vendor: BIOSTAR Group

Jeremy Anderson (jeremy-angelar) wrote :

As suggested by tgardner in #ubuntu-kernel, I am now booting the machine with the following line:
/boot/vmlinuz-2.6.32-26-server root=UUID=5626f5d7-0210-432c-9200-ec6a1d599df3 ro pci=nomsi

Jeremy Anderson (jeremy-angelar) wrote :

I strongly suspect that https://bugs.launchpad.net/ubuntu/+source/linux/+bug/586901 is related to my issue, but that is just a layman's opinion.

Tim Gardner (timg-tpi) wrote :

Let's not jump to conclusions. The motherboards are completely different.

Jeremy Anderson (jeremy-angelar) wrote :

Now at 23 hours of uptime with the settings from post #2 -- this is twice as long as it has lasted before. Mr. Gardner, I believe you may have solved the issue. Once I've gone a week without an unexpected halt, I'll consider this issue resolved. Thank you so much for your help!

Jeremy Anderson (jeremy-angelar) wrote :

Unfortunately, I have spoken too soon. During heavy disk activity (kvm-img convert -O qcow2 /path/to/old_vmware.vmdk /path/to/kvm_img.qcow2 over nfsv4), the system crashed. I have rebooted it, with the following command line:

$ cat /proc/cmdline
root=UUID=5626f5d7-0210-432c-9200-ec6a1d599df3 ro crashkernel=384M-2G:64M,2G-:128M

$ uname -a
Linux valhalla 2.6.37-8-server #21~lucid1-Ubuntu SMP Mon Dec 6 17:43:33 UTC 2010 x86_64 GNU/Linux

Perhaps this latest kernel will resolve whatever issue I was experiencing.

Tim Gardner (timg-tpi) wrote :

You might be stumbling over an issue that upstream says is not supported, e.g., copying to an NFS export that is shared on the same machine from which the copy originates.

Jeremy Anderson (jeremy-angelar) wrote :

Mr. Gardner was kind enough to explain to me the issue here:

When a KVM guest mounts an NFS filesystem from the host system (i.e. the NFS server is the same physical hardware on which the guest is running), there is no throttling of the amount of memory used. A file copy (or a big conversion, like I was doing with kvm-img convert) can completely exhaust memory -- and it does so in kernel space. In this case, the hard hang I experienced was likely due to memory exhaustion (that is supposition on my part). A preferable alternative is to use rsync over ssh, since that should provide some throttling to the connection. In addition, ssh runs in userspace, rather than in kernel space, so memory exhaustion can be dealt with far more gracefully.

Alternate solutions would be to move the NFS server off to another machine, or to move the KVM guest to another piece of hardware. Mr. Gardner wasn't sure if Samba would also be affected by this, so I have devirtualized the single KVM guest which needed read-write NFS access.
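
As an illustration of the rsync-over-ssh alternative, the sketch below builds such a command. It is only an example under stated assumptions: `rsync_over_ssh` is a hypothetical helper, the destination path and the bandwidth figure are made up, and the explicit `--bwlimit` cap is an addition that was not part of the advice given in this thread.

```python
import shlex


def rsync_over_ssh(src, dest_host, dest_path, bwlimit_kbps=20000):
    """Build an rsync command that copies src to dest_host over ssh with a bandwidth cap."""
    return [
        "rsync",
        "--archive",                  # preserve permissions, times, etc.
        "--partial",                  # keep partially transferred files for resuming
        f"--bwlimit={bwlimit_kbps}",  # assumed cap; tune or drop as needed
        "--rsh=ssh",                  # transport over ssh (userspace) instead of NFS
        src,
        f"{dest_host}:{dest_path}",
    ]


if __name__ == "__main__":
    # Host name taken from this report; the destination directory is hypothetical.
    cmd = rsync_over_ssh("/path/to/kvm_img.qcow2", "valhalla", "/path/to/images/")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```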

The natty narwhal kernel has been stable so far, but I will be returning to the 2.6.32-26-server kernel in short order, to continue testing it.

Jeremy Anderson (jeremy-angelar) wrote :

Yesterday, after about 17 hours of uptime, the system crashed during heavy NFS load. This was running kernel vmlinuz-2.6.37-8-server, with command line:

Command line: root=UUID=5626f5d7-0210-432c-9200-ec6a1d599df3 ro crashkernel=384M-2G:64M,2G-:128M

The heavy NFS load was this:

Client machine hlidskjalfe had mounted directory /more from server 'valhalla'.
On client machine hlidskjalfe, an unprivileged user ran the fdupes command across the entire /more filesystem, directing output to /more/fdupes.out.

The machines are physically separate -- client machine hlidskjalfe is a Core 2 Duo box connected to server valhalla via a gigabit Ethernet switch, to the Intel e1000 NIC in the server.

The server is also set to remote syslog to client machine hlidskjalfe. It continued to log a few messages AFTER it had stopped logging them locally on server valhalla (I noticed these in /var/log/syslog). Accordingly, I have attached logs from valhalla, and logs from hlidskjalfe -- after removing hlidskjalfe's own log messages. If you like, I can upload the unexpurgated logs from hlidskjalfe.
The tarfile extracts to bug688068-logs for simplicity. I can also upload the entire /var/log directory from both machines if that is helpful. This 19MB tarfile extracts to 210MB, and contains data from more than just the most recent crash. Note that the daemon.log file has sensor data, such as CPU temp and whatnot. Authlog shows the various nagios checks which are repeatedly run against the system.

I am now back on 2.6.32-26-server, running cmdline:
root=UUID=5626f5d7-0210-432c-9200-ec6a1d599df3 ro pci=nomsi crashkernel=384M-2G:64M,2G-:128M

For the time being, I will avoid heavy NFS loads such as the one I did last night -- I can always run jobs like this from an ssh session on the local machine.

Tim Gardner (timg-tpi) wrote :

Unless you're going to kexec a kernel for debugging, the crashkernel= parameter does no good. In fact, in your case it consumes a fair chunk of memory. Try rerunning your load after editing /etc/default/grub to remove crashkernel, running update-grub, and rebooting. It'll give your NFS load a bit more memory headroom.
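
To put a number on that "fair chunk", the range-style value used earlier in this thread (`crashkernel=384M-2G:64M,2G-:128M`) can be evaluated against the machine's RAM. This is a rough sketch rather than the kernel's own parser: the helper names are hypothetical, it ignores any `@offset` suffix, and it treats the "Mem: total" figure from top as an approximation of system RAM.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def to_bytes(text):
    """Convert a size such as '384M' or '2G' to bytes; empty text means 'no bound'."""
    if not text:
        return None
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)


def crashkernel_reservation(spec, ram_bytes):
    """Return the bytes reserved by a range-style crashkernel= value for a given RAM size."""
    for entry in spec.split(","):
        ram_range, size = entry.split(":")
        start, end = ram_range.split("-")
        low, high = to_bytes(start), to_bytes(end)
        # A range applies when low <= RAM < high; an empty end means open-ended.
        if ram_bytes >= low and (high is None or ram_bytes < high):
            return to_bytes(size)
    return 0  # no range matched, so nothing is reserved


if __name__ == "__main__":
    ram = 8160340 * 1024  # "Mem: 8160340k total" as reported by top on this machine
    reserved = crashkernel_reservation("384M-2G:64M,2G-:128M", ram)
    print(reserved // (1 << 20), "MiB reserved")
```

On a machine of this size the open-ended `2G-` range applies, so the reservation is 128 MiB.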

Jeremy Anderson (jeremy-angelar) wrote :

I have removed the crashkernel references, and have rebooted to 2.6.32-26-server, cmdline:

root=UUID=5626f5d7-0210-432c-9200-ec6a1d599df3 ro pci=nomsi

Jeremy Anderson (jeremy-angelar) wrote :

I also used the downtime to update BIOS to the latest stable version provided by Biostar:

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
 Vendor: American Megatrends Inc.
 Version: 080014
 Release Date: 11/18/2008
 Address: 0xF0000
 Runtime Size: 64 kB
 ROM Size: 1024 kB
 Characteristics:
  ISA is supported
  PCI is supported
  PNP is supported
  APM is supported
  BIOS is upgradeable
  BIOS shadowing is allowed
  ESCD support is available
  Boot from CD is supported
  Selectable boot is supported
  BIOS ROM is socketed
  EDD is supported
  5.25"/1.2 MB floppy services are supported (int 13h)
  3.5"/720 KB floppy services are supported (int 13h)
  3.5"/2.88 MB floppy services are supported (int 13h)
  Print screen service is supported (int 5h)
  8042 keyboard services are supported (int 9h)
  Serial services are supported (int 14h)
  Printer services are supported (int 17h)
  CGA/mono video services are supported (int 10h)
  ACPI is supported
  USB legacy is supported
  LS-120 boot is supported
  ATAPI Zip drive boot is supported
  BIOS boot specification is supported
  Targeted content distribution is supported
 BIOS Revision: 8.14

There is also a newer version of the BIOS available, but it is marked as "beta", so I have held off upgrading to that for now.

Jeremy Anderson (jeremy-angelar) wrote :

Last night, after about 13 hours of uptime, the system became unresponsive. All KVM guests were disabled, and the system was pretty quiet. Unfortunately, I was sacked out, so I didn't see anything happen, and nothing unusual was logged. The system _was_ however transcoding an mkv video to mpeg2video, about 1.1 GB in size, but the process wasn't running as root, it was running as user pytivo. After rebooting and disabling all KVM guests, I once again transcoded two mkv videos as user pytivo, and this did not take the system down.

I've now booted with the natty narwhal kernel again: vmlinuz-2.6.37-8-server. While vmlinuz-2.6.37-9-server HAS loaded from the ppa, grub will not utilize it, claiming it is a xen kernel. I have logged a separate bug report on that (688977).

I am currently booted with kernel option "nomodeset" to clean up logging a bit. I'm running vmstat to monitor free memory as well, so with the next system hang, I'll at least have a picture of the final vmstat situation.

Jeremy Anderson (jeremy-angelar) wrote :

On 12/16/10 at 03:05 hours, the system logged its last syslog message. It had been up since 12/14/10 14:58:47. Only a single KVM guest was running on the machine at that time. The 36 hour uptime would appear to be a new record. For the first day of that, no KVM guests were running.

The remote loghost had gone to sleep, so I didn't get memory consumption or a vmstat or top picture.

I will fire up all the KVM guests today on it, as that typically accelerates the system crashing. Current cmdline (and the one I was running with when it ceased responding):

Linux valhalla 2.6.37-8-server #21~lucid1-Ubuntu SMP Mon Dec 6 17:43:33 UTC 2010 x86_64 GNU/Linux
root=UUID=5626f5d7-0210-432c-9200-ec6a1d599df3 ro pci=nomsi nomodeset

Jeremy Anderson (jeremy-angelar) wrote :

New server crash this morning. Two KVM guests running, but I was just running a small perl script which reads in an 8109 byte xml file, changes a couple of values via XML::Twig, and prints it to stdout. After rebooting, the same script ran just fine.

I DID get a vmstat picture of things, and the last output of top:

crash at:

lost connectivity at Fri Dec 17 08:40:11 CST 2010

final top screen:

top - 08:58:35 up 1 day, 47 min, 5 users, load average: 0.51, 0.48, 0.42
Tasks: 191 total, 1 running, 190 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.5%us, 1.8%sy, 0.0%ni, 94.4%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 8160340k total, 3287448k used, 4872892k free, 177300k buffers
Swap: 1630460k total, 0k used, 1630460k free, 2237028k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
17939 root      20   0  528m 267m 3380 S   25  3.4 131:50.81 kvm
   54 root      25   5     0    0    0 S   19  0.0  80:09.12 ksmd
 1709 root      20   0  199m 5508 3420 S    2  0.1   9:37.83 libvirtd
17357 jeremy    20   0 19272 1432 1028 R    1  0.0   2:39.89 top
17904 jeremy    20   0  100m 1828  876 S    1  0.0   3:28.46 sshd
 1775 bind      20   0  204m  29m 2224 S    0  0.4   0:36.81 named
 2548 nagios    20   0 28472 1468  880 S    0  0.0   2:57.31 nagios
17186 jeremy    20   0  100m 1828  872 S    0  0.0   0:06.69 sshd
17708 jeremy    20   0  100m 1828  872 S    0  0.0   0:00.80 sshd
17910 jeremy    20   0  9612  428  336 S    0  0.0   1:31.52 nc
    1 root      20   0 23896 2036 1244 S    0  0.0   0:09.89 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.01 kthreadd
    3 root      20   0     0    0    0 S    0  0.0   0:03.06 ksoftirqd/0
    4 root      20   0     0    0    0 S    0  0.0   0:00.00 kworker/0:0
    6 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/0
    7 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/1
    8 root      20   0     0    0    0 S    0  0.0   0:00.01 kworker/1:0

last vmstat output (sampling at 1 second):

 0 0 0 4873808 177300 2237028 0 0 0 0 4283 8235 4 2 95 0
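
The vmstat sample above was pasted without its header row, so a small sketch for labelling the fields may help when reading it. The column order here is an assumption: it is the usual 16-column layout of vmstat from that era (no "st" column), which matches the number of values in the sample. `label_vmstat` is a hypothetical helper.

```python
# Assumed column order for a 16-field vmstat line (procs, memory, swap, io, system, cpu).
VMSTAT_COLUMNS = ["r", "b", "swpd", "free", "buff", "cache",
                  "si", "so", "bi", "bo", "in", "cs",
                  "us", "sy", "id", "wa"]


def label_vmstat(line):
    """Map one line of vmstat numbers onto the assumed column names."""
    values = [int(v) for v in line.split()]
    if len(values) != len(VMSTAT_COLUMNS):
        raise ValueError(f"expected {len(VMSTAT_COLUMNS)} fields, got {len(values)}")
    return dict(zip(VMSTAT_COLUMNS, values))


if __name__ == "__main__":
    sample = label_vmstat(" 0 0 0 4873808 177300 2237028 0 0 0 0 4283 8235 4 2 95 0")
    total_kb = 8160340  # "Mem: 8160340k total" from the final top screen
    not_in_use = sample["free"] + sample["buff"] + sample["cache"]
    print(f"swap used: {sample['swpd']} kB")
    print(f"free + buffers + cache: {not_in_use} kB of {total_kb} kB")
    print(f"cpu idle: {sample['id']}%")
```

Read this way, the last sample shows no swap in use, a mostly idle CPU, and most of the RAM free or holding buffers and cache, which does not look like memory exhaustion.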

final command was running a perl script which parsed XML, reading in an 8109 byte file

jeremy@valhalla:~/sandbox/esx$ ./jda2.pl
Write failed: Broken pipe

Note that this machine has 8 GB of RAM in a 4x 2 GB configuration. I will remove 2 sticks of RAM later today and see if that eliminates the crash. A friend has warned that he has seen random freezes in machines with fully populated RAM banks. I will be contacting the vendor, Biostar, to see if there are known issues with this RAM and this motherboard, or with fully populated RAM banks.

Jeremy Anderson (jeremy-angelar) wrote :

I'm continuing to experience hangs in various memory configurations.
I have also deactivated all routing functionality for this system, and it is running in a single NIC configuration. If it remains stable for a few days, I will add memory back in and see if it remains stable.

Jeremy Anderson (jeremy-angelar) wrote :

The system has been up for 47 hours straight now -- 11 hours longer than any other configuration. It is very much looking like it was the onboard NIC that was killing the system.

Jeremy Anderson (jeremy-angelar) wrote :

After 3 days of error-free uptime, I have reverted to the mainline 10.04 kernel, and reinstalled all 8 GB of memory. The suspected bad NIC is no longer being used. I have also eliminated fancy kernel boot options (except for nomodeset, since without it EDID errors were filling my logs). I will update this ticket with any crashes, or, if a week has gone by without issue, I will request this bug be closed as "hardware issue".

jeremy@valhalla:~$ uname -a;cat /proc/cmdline
Linux valhalla 2.6.32-27-server #49-Ubuntu SMP Thu Dec 2 02:05:21 UTC 2010 x86_64 GNU/Linux
root=UUID=5626f5d7-0210-432c-9200-ec6a1d599df3 ro splash quiet nomodeset

Jeremy Anderson (jeremy-angelar) wrote :

System has been up 7 days w/o crashing or halting. I am running on all 8 GB of RAM, and have been pushing it plenty hard.
I am confident saying that all the system hangs were due to dodgy hardware. Please close this bug out -- there is no problem with the kernel, only with the onboard NIC in my system.

Tim Gardner (timg-tpi) wrote :

hardware issue

Changed in linux (Ubuntu):
status: New → Invalid