linux-image 2.6.15-28.55 regression from 2.6.15-28.53, crashes under network load

Bug #116815 reported by Mattias Wadenstein
Affects: linux-source-2.6.15 (Ubuntu)
Status: Invalid
Importance: High
Assigned to: 7ARZAN1985
Milestone: (none listed)

Bug Description

Binary package hint: linux-source-2.6.15

After doing the security upgrade to 2.6.15-28.55, our server started to crash within minutes of boot. Backing down to 28.53 put us back in stable operation again.

The lockups look like this:
[42950993.170000] BUG: soft lockup detected on CPU#0!
[42950993.170000]
[42950993.170000] Pid: 25106, comm: downloader
[42950993.170000] EIP: 0060:[<c0189a0c>] CPU: 0
[42950993.170000] EIP is at posix_locks_deadlock+0x5c/0xc0
[42950993.170000] EFLAGS: 00000202 Tainted: P (2.6.15-28-server)
[42950993.170000] EAX: dfb67e40 EBX: cf67934c ECX: ffffffff EDX: da95f460
[42950993.170000] ESI: da95fb30 EDI: da95f1d8 EBP: cf67917c DS: 007b ES: 007b
[42950993.170000] CR0: 8005003b CR2: b5d25000 CR3: 1fa9f340 CR4: 000006f0
[42950993.170000] [<c0189c12>] __posix_lock_file+0x82/0x5f0
[42950993.170000] [<c0171684>] nameidata_to_filp+0x44/0x50
[42950993.170000] [<c018b4b0>] fcntl_setlk+0x2d0/0x370
[42950993.170000] [<c013c130>] autoremove_wake_function+0x0/0x60
[42950993.170000] [<c0186b38>] sys_fcntl64+0xb8/0xe0
[42950993.170000] [<c0103313>] sysenter_past_esp+0x54/0x75
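For context, the frames in this trace (sys_fcntl64, fcntl_setlk, __posix_lock_file, posix_locks_deadlock) belong to the kernel's POSIX record locking code, which user space reaches through fcntl() lock requests. The sketch below shows what such a request can look like from a process. It is only an illustration: the report does not show the downloader's source, so the helper names and the temporary file are hypothetical, and the link between blocking requests and the deadlock check is my understanding of the locking code rather than something the report states.

```python
import fcntl
import os
import tempfile


def lock_whole_file(fd):
    """Take an exclusive POSIX record lock on the whole file, waiting if needed."""
    # Without LOCK_NB, lockf() issues a blocking fcntl() lock request (F_SETLKW).
    fcntl.lockf(fd, fcntl.LOCK_EX)


def unlock_whole_file(fd):
    """Release the lock taken by lock_whole_file()."""
    fcntl.lockf(fd, fcntl.LOCK_UN)


def demo():
    """Lock a scratch file, write to it, and unlock it again."""
    # Hypothetical stand-in for a file being staged by a downloader stream.
    fd, path = tempfile.mkstemp()
    try:
        lock_whole_file(fd)
        os.write(fd, b"staged data")
        unlock_whole_file(fd)
        return True
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print("locked and unlocked:", demo())
```

With many processes holding and waiting for such locks at once, the kernel has more lock state to examine for each new blocking request, which would be consistent with the CPU time being spent in posix_locks_deadlock here.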

And there are serious page allocation failures, like:

ingrid-h.hpc2n.umu.se login: [42949719.420000] downloader: page allocation failure. order:1, mode:0x20
[42949719.420000] [<c0154217>] __alloc_pages+0x217/0x320
[42949719.420000] [<c014d674>] handle_IRQ_event+0x64/0x70
[42949719.420000] [<c0157cb9>] kmem_getpages+0x49/0xe0
[42949719.420000] [<c0158a67>] alloc_slabmgmt+0x57/0x60
[42949719.420000] [<c0158c48>] cache_grow+0xa8/0x1b0
[42949719.420000] [<c0158f54>] cache_alloc_refill+0x204/0x240
[42949719.420000] [<c015928e>] __kmalloc+0x7e/0x80
[42949719.420000] [<c028c5df>] __alloc_skb+0x5f/0x180
[42949719.420000] [<c02c9ed5>] tcp_collapse+0x125/0x350
[42949719.420000] [<c02ca233>] tcp_prune_queue+0x83/0x210
[42949719.420000] [<c02c9671>] tcp_data_queue+0x561/0xca0

[.... full dump on http://www.acc.umu.se/~maswan/ubuntu/page-alloc-28.55]
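The header line of the failure can be decoded with a little arithmetic. An order-n request asks for 2**n physically contiguous pages, so order:1 is two adjacent pages. The sketch below assumes 4 KiB pages (the usual i386 configuration, not stated in the report); the function names are made up for illustration. As far as I can tell from 2.6-era headers, mode:0x20 is the flag value used for GFP_ATOMIC, an allocation that may not sleep, which would fit the network receive path in the trace, but treat that reading as an assumption.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual i386 configuration

HEADER = re.compile(
    r"page allocation failure\. order:(\d+), mode:(0x[0-9a-fA-F]+)"
)


def parse_failure(line):
    """Return (order, mode) from a 'page allocation failure' line, or None."""
    m = HEADER.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2), 16)


def allocation_bytes(order, page_size=PAGE_SIZE):
    """An order-n request asks for 2**n physically contiguous pages."""
    return (1 << order) * page_size


line = ("[42949719.420000] downloader: page allocation failure. "
        "order:1, mode:0x20")
order, mode = parse_failure(line)
print(order, hex(mode), allocation_bytes(order))
```

Under the 4 KiB assumption, the failing request is for 8192 contiguous bytes. Requests above order 0 can fail even when free memory exists, because they need adjacent free pages.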

Running 28.53, we would occasionally get an OoM on a process, but not total crashes like on 28.55.

/Mattias Wadenstein

Mattias Wadenstein (maswan) wrote :

Forgot to say, it is linux-image-2.6.15-28-server_2.6.15-28.53_i386.deb, so server flavour on i386.

Colin Watson (cjwatson) wrote :

Regression in security update -> at least high

Changed in linux-source-2.6.15:
assignee: nobody → ubuntu-kernel-team
importance: Undecided → High
status: Unconfirmed → Confirmed
Ben Collins (ben-collins) wrote :

Can you tell me what "downloader" is? I would also appreciate a full dmesg; please attach it to this bug report (via the web interface) rather than as an off-site link.

Thanks

Mattias Wadenstein (maswan) wrote :

The "downloader" is part of our application software; it stages files from remote sources onto an NFS-mounted filesystem. Typically there are 10-20 parallel streams in each downloader, and 10-20 downloaders running during high load. Aggregate throughput is 20-80 MByte/s, most of it with large TCP windows.

Attached is a dmesg from a working .53. On Monday we'll try to get a .55 up and running and reliably crashing on a test system with artificial load, and send some more debug output from there.

Roger Miller (zill) wrote :

Is 2.6.15-28.55 safe for a desktop install? If not, then I suggest it be removed from the repositories until the issue is resolved. A warning should also be posted on ubuntuforums.org to help those who have already installed this upgrade.

Mattias Wadenstein (maswan) wrote :

Roger Miller: There seems to be little cause for fear in normal installations. Unless you are OoM:ing in heavy network usage, I don't think you have anything to fear. Had this been a cause for concern for regular users, this would have been handled much more "high-profile", and with greater speed too.

Roger Miller (zill) wrote :

Thanks for the assurance Mattias. Just thought the question should be raised as many normal desktop installations may well have some servers, such as NFS, enabled. I have never had OoM errors previously on my two Dapper PCs - if these do occur with the new kernel image then I shall report back.

Mattias Wadenstein (maswan) wrote :

Ok, further investigation here shows that this is not actually a regression, just that the load on our server after the upgrade was higher than earlier.

So, not a regression, but a general kernel bug during our extreme load pattern here.

ake sandgren (ake-sandgren) wrote :

Update

We just got hit by the soft lockup again. Now running 2.6.15-29.60. Backtrace slightly different though.

[43422009.670000] BUG: soft lockup detected on CPU#0!
[43422009.670000]
[43422009.670000] Pid: 6687, comm: downloader
[43422009.670000] EIP: 0060:[<c01899ac>] CPU: 0
[43422009.670000] EIP is at posix_locks_deadlock+0x5c/0xc0
[43422009.670000] EFLAGS: 00000202 Tainted: P (2.6.15-29-server)
[43422009.670000] EAX: dfeade40 EBX: df8b99c4 ECX: ffffffff EDX: e9398ad8
[43422009.670000] ESI: e7cdd4b8 EDI: d323fca0 EBP: e7cdd570 DS: 007b ES: 007b
[43422009.670000] CR0: 8005003b CR2: 080e5e28 CR3: 1fada980 CR4: 000006f0
[43422009.670000] [<c0189bb2>] __posix_lock_file+0x82/0x5f0
[43422009.670000] [<f8a78b00>] linvfs_open+0x0/0xb0 [xfs]
[43422009.670000] [<c0171376>] __dentry_open+0xe6/0x220
[43422009.670000] [<c01715f4>] nameidata_to_filp+0x44/0x50
[43422009.670000] [<c018b980>] fcntl_setlk64+0x2d0/0x370
[43422009.670000] [<c01716eb>] get_unused_fd+0x6b/0xd0
[43422009.670000] [<c01717cc>] fd_install+0x2c/0x70
[43422009.670000] [<c0186a8f>] sys_fcntl64+0x6f/0xe0
[43422009.670000] [<c0103313>] sysenter_past_esp+0x54/0x75

Same server as above.
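To make "slightly different" concrete, the two soft lockup backtraces in this report can be compared by function name. The sketch below is only an illustration: the frame lines are copied from the two traces with the timestamps dropped, and the regular expression and variable names are my own. Frames in traces from kernels of this era may include stale stack entries, so a difference in the list does not necessarily mean a different call path.

```python
import re

# Matches frames such as "[<c0189c12>] __posix_lock_file+0x82/0x5f0".
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def symbols(trace_text):
    """Return the function names of all stack frames, in order."""
    return [m.group(1) for m in FRAME.finditer(trace_text)]


TRACE_28_55 = """\
[<c0189c12>] __posix_lock_file+0x82/0x5f0
[<c0171684>] nameidata_to_filp+0x44/0x50
[<c018b4b0>] fcntl_setlk+0x2d0/0x370
[<c013c130>] autoremove_wake_function+0x0/0x60
[<c0186b38>] sys_fcntl64+0xb8/0xe0
[<c0103313>] sysenter_past_esp+0x54/0x75
"""

TRACE_29_60 = """\
[<c0189bb2>] __posix_lock_file+0x82/0x5f0
[<f8a78b00>] linvfs_open+0x0/0xb0 [xfs]
[<c0171376>] __dentry_open+0xe6/0x220
[<c01715f4>] nameidata_to_filp+0x44/0x50
[<c018b980>] fcntl_setlk64+0x2d0/0x370
[<c01716eb>] get_unused_fd+0x6b/0xd0
[<c01717cc>] fd_install+0x2c/0x70
[<c0186a8f>] sys_fcntl64+0x6f/0xe0
[<c0103313>] sysenter_past_esp+0x54/0x75
"""

old, new = symbols(TRACE_28_55), symbols(TRACE_29_60)
common = [s for s in old if s in new]
only_old = [s for s in old if s not in new]
only_new = [s for s in new if s not in old]

print("common:  ", common)
print("28.55 only:", only_old)
print("29.60 only:", only_new)
```

Both traces stop at the same place (EIP in posix_locks_deadlock+0x5c/0xc0, called from __posix_lock_file); the differences are in the frames listed between that point and sys_fcntl64.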

Launchpad Janitor (janitor) wrote : Kernel team bugs

Per a decision made by the Ubuntu Kernel Team, bugs will no longer be assigned to the ubuntu-kernel-team in Launchpad as part of the bug triage process. The ubuntu-kernel-team is being unassigned from this bug report. Refer to https://wiki.ubuntu.com/KernelTeamBugPolicies for more information. Thanks.

leifol (leiflois-75) wrote :

Sorry, I cannot write or read English.

Changed in linux-source-2.6.15 (Ubuntu):
assignee: nobody → tarzan-city
Changed in linux-source-2.6.15 (Ubuntu):
status: Confirmed → Invalid