<3>BUG: soft lockup detected on CPU#1!

Bug #63165 reported by TJ
Affects: linux-source-2.6.15 (Ubuntu)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

I've just installed Server 6.06 on a dual-CPU Asus A7M266-D.

I also installed the LAMP packages, webmin and sshd.

I left the server ticking over for a soak test with no customisations or web applications installed.

After about 18 hours it crashed with the following report in kern.log. I'm not sure from the report where to begin looking to debug this.
Since rebooting, uptime is 38 1/2 hours, again just ticking over.

kernel: [43061450.240000] SMP
kernel: [43061450.240000] Modules linked in: nls_utf8 ntfs ipv6 dm_mod af_packet md_mod lp snd_cmipci gameport snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_opl3_lib snd_timer snd_hwdep snd_mpu401_uart snd_rawmidi snd_seq_device snd jedec_probe cfi_probe i2c_amd756 gen_probe i2c_core soundcore parport_pc mtdcore chipreg map_funcs natsemi hw_random pcspkr psmouse parport floppy serio_raw amd_k7_agp shpchp pci_hotplug agpgart evdev reiserfs ide_generic ehci_hcd ohci_hcd usbcore ide_cd cdrom ide_disk generic amd74xx thermal processor fan fbcon tileblit font bitblit softcursor capability commoncap
kernel: [43061450.240000] CPU: 1
kernel: [43061450.240000] EIP: 0060:[pg0+2512941/1069167616] Not tainted VLI
kernel: [43061450.240000] EFLAGS: 00010286 (2.6.15-26-server)
kernel: [43061450.240000] EIP is at 0xc06c082d
kernel: [43061450.240000] eax: 00000000 ebx: c11a537c ecx: df893040 edx: c1a10060
kernel: [43061450.240000] esi: 00aa8f00 edi: 00228c01 ebp: c014d140 esp: c1b4bf70
kernel: [43061450.240000] ds: 007b es: 007b ss: 0068
kernel: [43061450.240000] Process watchdog/1 (pid: 7, threadinfo=c1b4a000 task=dffdb560)
kernel: [43061450.240000] Stack: c1b4bf7c 008d4800 c1b21f34 00000000 00200200 00aa8f00 c012f360 dffdb560
kernel: [43061450.240000] c1a10ee0 c1b4a000 c1b21f34 c012f9b4 00000065 c1b4a000 c014d18f 000003e8
kernel: [43061450.240000] 00000001 c1b4bfb8 00000063 fffffffc c013b9d8 00000001 000000ff 00000000
kernel: [43061450.240000] Call Trace:
kernel: [43061450.240000] [process_timeout+0/16] process_timeout+0x0/0x10
kernel: [43061450.240000] [msleep_interruptible+84/112] msleep_interruptible+0x54/0x70
kernel: [43061450.240000] [watchdog+79/128] watchdog+0x4f/0x80
kernel: [43061450.240000] [kthread+200/208] kthread+0xc8/0xd0
kernel: [43061450.240000] [kthread+0/208] kthread+0x0/0xd0
kernel: [43061450.240000] [kernel_thread_helper+5/16] kernel_thread_helper+0x5/0x10
kernel: [43061450.240000] Code: d5 48 fe 45 fd ed 7f f0 05 47 f4 ca 96 37 c9 76 2a 73 dc 97 0b 86 14 9e fb 96 a7 29 5f 2b 5f 59 8f a9 f6 af f5 4b 20 f6 0f 78 ee <db> fe fc 14 bc 49 f7 bd ac d6 3a ec c4 35 19 38 8f a3 ec 2e c1
kernel: [43061450.250000] <3>BUG: soft lockup detected on CPU#1!
kernel: [43061468.310000]
kernel: [43061468.310000] Pid: 0, comm: swapper
kernel: [43061468.310000] EIP: 0060:[default_idle+44/96] CPU: 1
kernel: [43061468.310000] EIP is at default_idle+0x2c/0x60
kernel: [43061468.310000] EFLAGS: 00000246 Not tainted (2.6.15-26-server)
kernel: [43061468.310000] EAX: 00000000 EBX: 01612f60 ECX: c0101030 EDX: df902000
kernel: [43061468.310000] ESI: 00000001 EDI: 00000000 EBP: 00000000 DS: 007b ES: 007b
kernel: [43061468.310000] CR0: 8005003b CR2: b7fa1000 CR3: 3558f360 CR4: 000006b0
kernel: [43061468.310000] [cpu_idle+111/192] cpu_idle+0x6f/0xc0
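For anyone checking their own kern.log for the same report, here is a minimal sketch that pulls out the soft-lockup lines and the CPU they name. The regular expression is inferred only from the line format shown in the log above, and the function name `find_soft_lockups` is made up for illustration; other kernel versions may word the message differently.

```python
import re

# Matches lines shaped like the one in the report above:
#   kernel: [43061450.250000] <3>BUG: soft lockup detected on CPU#1!
# The pattern is an assumption based on that single example.
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s*(?:<\d>)?"
    r"BUG: soft lockup detected on CPU#(?P<cpu>\d+)!"
)


def find_soft_lockups(lines):
    """Return a list of (timestamp, cpu) tuples, one per soft-lockup line."""
    hits = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            hits.append((float(match.group("ts")), int(match.group("cpu"))))
    return hits


# Three lines copied from the log above; only the middle one is a lockup report.
sample = [
    "kernel: [43061450.240000] CPU: 1",
    "kernel: [43061450.250000] <3>BUG: soft lockup detected on CPU#1!",
    "kernel: [43061468.310000] Pid: 0, comm: swapper",
]

print(find_soft_lockups(sample))  # [(43061450.25, 1)]
```

To scan a real log, pass an open file instead of `sample`, for example `find_soft_lockups(open("/var/log/kern.log"))`; that path is the usual Ubuntu location and is assumed here, not taken from the report.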

Tags: kernel-oops
Matti Lindell (mlind) wrote :

Could you provide the output of
$ uname -r

Thanks.

TJ (tj) wrote :

The result is the same as reported in the EFLAGS line above:

2.6.15-26-server

The same error has happened twice more since the initial report. There was no log capture on these occasions as they were followed by kernel panics.

TJ (tj) wrote :

It is possible this error is caused by a misconfiguration of the DDR RAM modules on the Asus A7M266-D motherboard (mobo). I'll post a follow-up to this comment once I'm sure this is the solution.

The mobo has 3 of a possible 4 PC2100 (unbuffered) modules installed.
Memtest revealed a series of 'test-5' failures, but moving the modules around didn't isolate a single faulty module.

Two DDR modules had been removed from this mobo and inserted in another identical mobo, and a spare module was installed. The installed modules are 512MB, 512MB and 256MB, giving a total of 1280MB.

During boot the BIOS summary display usually disappears before it can be read, but during one boot sequence I pressed 'Pause' and noticed that the SDRAM slot report showed 2,3,4 rather than 1,2,3 as I would expect.

Examining the mobo manual and then the mobo itself revealed that slot-1 was empty. Slot-1 is nearest to the CPUs. The mobo isn't marked, so when the 2 modules were removed they were taken from slots 1 & 2.

I swapped the modules around, rebooted, and reran Memtest several times and it didn't detect any failures.

The server was rebooted and didn't crash in 19 hours. At that point it was restarted to test an install of Ubuntu 6.10 Server Edgy Eft Beta.

Note:
Asus A7M266-D manual states that the mobo will support either:

four (4) Registered DDR DIMMs (64MB - 3.5GB)
two (2) unbuffered DDR DIMMs (64MB - 2GB)

So running this mobo with more than 2 unbuffered modules could cause data corruption due to the load on the system bus. Something to be aware of if you're wanting to pack it with RAM.

However, I've been running these mobos with 4 unbuffered Crucial-branded 512MB DDR DIMMs for several years without problems.
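The module-count limits quoted from the manual above can be written as a quick sanity check. This is only a sketch: the limits are the two figures quoted in this comment, the function name `check_dimms` is hypothetical, and the per-configuration capacity ranges are left out because the quote doesn't make clear whether they apply per module or in total.

```python
# Module-count limits as quoted from the Asus A7M266-D manual in this comment.
MAX_REGISTERED_DIMMS = 4
MAX_UNBUFFERED_DIMMS = 2


def check_dimms(sizes_mb, registered):
    """Return (total_mb, within_limit) for a list of module sizes in MB.

    within_limit is True when the number of modules does not exceed the
    manual's limit for the given module type.
    """
    limit = MAX_REGISTERED_DIMMS if registered else MAX_UNBUFFERED_DIMMS
    return sum(sizes_mb), len(sizes_mb) <= limit


# The configuration described in this comment: three unbuffered modules.
total_mb, within_limit = check_dimms([512, 512, 256], registered=False)
print(total_mb, within_limit)  # 1280 False
```

A result of `False` only means the configuration is outside what the manual documents; as noted above, four unbuffered modules have run here for years without trouble.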

TJ (tj) wrote :

Problem likely caused by incorrectly configured RAM modules

Changed in linux-source-2.6.15:
status: Needs Info → Unconfirmed
Gareth Fitzworthington (mapping-gp-deactivatedaccount) wrote :

This bug has had no activity for a considerable period. This is a check to see if there is still interest in investigating this bug report.
Appears to have been a hardware configuration problem.

Changed in linux-source-2.6.15:
status: New → Incomplete
Gareth Fitzworthington (mapping-gp-deactivatedaccount) wrote :

We are closing this bug report because it lacks the information we need to investigate the problem, as described in the previous comments. Please reopen it if you can give us the missing information, and don't hesitate to submit bug reports in the future. To reopen the bug report you can click on the current status, under the Status column, and change the Status back to "New". Thanks again!

Changed in linux-source-2.6.15:
status: Incomplete → Invalid