tcp connections hang in forwarding machine

Bug #791512 reported by LaMont Jones
This bug affects 1 person
Affects               Status   Importance  Assigned to  Milestone
linux (Ubuntu)        Expired  Undecided   Unassigned
linux-meta (Ubuntu)   Invalid  Undecided   Unassigned

Bug Description

After upgrading a NATting firewall to 2.6.32-32-generic-pae, TCP connections stall (through to timeout) after passing data in each direction:

I see the three-way handshake, data going to the (HTTP) server, an ACK with data coming back, which we ACK, and then nothing. tcpdump shows this with great regularity on both the LAN side and the ppp0 interface (over PPPoE). When next I get permission to break this firewall, I'll check the PPPoE interface to see what passes on it, and also test a non-PAE kernel.
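The stall described above can be checked from a LAN client behind the firewall with a small scripted probe. The following is a minimal sketch, not the method used in this report: the function name `probe_http`, the HTTP/1.0 request, and the 10-second timeout are all assumptions chosen for illustration.

```python
import socket


def probe_http(host, port=80, path="/", timeout=10.0):
    """Fetch one page over plain HTTP and report whether the transfer stalled.

    Hypothetical helper: returns {"stalled": bool, "bytes": int}. "stalled"
    is True when connecting or reading times out before the server closes
    the connection.
    """
    received = 0
    request = (
        "GET %s HTTP/1.0\r\nHost: %s\r\nConnection: close\r\n\r\n" % (path, host)
    ).encode("ascii")
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            while True:
                # Raises socket.timeout if no data arrives within `timeout`.
                chunk = sock.recv(4096)
                if not chunk:
                    # Server closed the connection: transfer completed.
                    return {"stalled": False, "bytes": received}
                received += len(chunk)
    except socket.timeout:
        return {"stalled": True, "bytes": received}


if __name__ == "__main__":
    import sys

    print(probe_http(sys.argv[1]))
```

Running it against the same site under each test kernel would give a repeatable pass/fail signal. A result with `stalled` true and a nonzero byte count matches the pattern seen in tcpdump (some data arrives, then nothing).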

Steve Conklin (sconklin) wrote :

There are a handful of commits which obviously touch code that may relate to this. They are all pretty close to each other in the history.

I'm building test kernels from points before and after this collection of patches. If you can test these two, it should speed up bisection and identification of the problem.

As soon as they are built, I'll post them and provide a link here.

Thanks!

Steve Conklin (sconklin) wrote :

Test kernels here:

http://people.canonical.com/~sconklin/lp791512/

If possible please report your results with each of these.

Thanks a lot Lamont!

Steve

LaMont Jones (lamont) wrote :

- linux-image-2.6.32-32-generic-pae_2.6.32-32.63~01spc823e22f_i386.deb
appears to fail.

- linux-image-2.6.32-32-generic-pae_2.6.32-32.63~01spce92585c_i386.deb
appears to be working

This testing was done with exactly one website. I'll leave the apparently working kernel up for a while for further testing.

Herton R. Krzesinski (herton) wrote :

LaMont: a test kernel with the patches Tim asked to have reverted is at http://people.canonical.com/~herton/lp791512/ for you to test.

Herton R. Krzesinski (herton) wrote :

LaMont confirms that the kernel above, with the following two patches reverted, has worked for him so far:
af_unix: Only allow recv on connected seqpacket sockets.
dccp: handle invalid feature options length

Next step will be trying to revert only the af_unix one.

Herton R. Krzesinski (herton) wrote :

LaMont: once you have finished testing the kernel with the two reverts and are sure about it, please try the next kernel, which has only the change "af_unix: Only allow recv on connected seqpacket sockets." reverted. It is available at http://people.canonical.com/~herton/lp791512/r3/

When you finish testing this, and if you don't have any issues, it would be good to do a sanity check just in case: install kernel 2.6.32-33.68 from -proposed and check that it really still has the bug.

LaMont Jones (lamont) wrote :

r3 seems to work fine; stock 33.68 from -proposed fails (though it did actually load the website that had repeatedly failed earlier).

I have r3 running live to see if that continues to work, but I believe it's solid.

Tim Gardner (timg-tpi) wrote :

<tgardner> lamont, are you still happy with the kernel herton built for you yesterday regarding bug #791512 ?
<lamont> tgardner: I have heard no complaints, and experienced no issues with it
<tgardner> lamont, great. I'll add that to the bug.

Herton R. Krzesinski (herton) wrote :

This is a test case that shows what the af_unix patch (reverted in the test kernel) fixes.

Booted on a natty 8.42 kernel here, the test case didn't throw an error: recv waits, and no oops happened.
On 10.44, with the problem fixed, the test case exits with an error.
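The attached test case is not reproduced here. As a rough sketch of the behaviour it exercises, assuming the check added by the af_unix patch returns ENOTCONN for a socket that was never connected, the following calls recv on an unconnected AF_UNIX SOCK_SEQPACKET socket. The function name is hypothetical, and MSG_DONTWAIT is an addition so that the call returns instead of waiting on a kernel without the check.

```python
import errno
import socket


def probe_unconnected_seqpacket_recv():
    """Call recv() on an AF_UNIX SOCK_SEQPACKET socket that was never connected.

    Hypothetical helper: returns the symbolic errno name if the call fails,
    or a short description if it unexpectedly succeeds.
    """
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_SEQPACKET) as sock:
            # MSG_DONTWAIT (an assumption of this sketch) keeps the call from
            # blocking on kernels that would otherwise let recv wait.
            data = sock.recv(1, socket.MSG_DONTWAIT)
            return "no error (%d bytes)" % len(data)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, "errno %s" % exc.errno)


if __name__ == "__main__":
    print(probe_unconnected_seqpacket_recv())
```

On a kernel with the patch this is expected to report ENOTCONN. On a kernel without it, the non-blocking call would be expected to report EAGAIN, where a blocking recv would simply wait.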

Steve Conklin (sconklin) wrote :

hey Lamont,

Next time you have access to that machine, how about doing:

lsof -i

to see whether that gives any indication of which process may be using sockets

There's likely to be a lot of noise, especially if it is a NAT box - but it might tell us something
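`lsof -i` lists Internet sockets. Because the reverted patch concerns AF_UNIX seqpacket sockets, a complementary check (an assumption here, not something requested in this report) is to list SOCK_SEQPACKET entries in /proc/net/unix and look up which processes hold them. The sketch below assumes the usual Linux column layout of that file (Num, RefCount, Protocol, Flags, Type, St, Inode, Path); the helper names are hypothetical.

```python
import os

SOCK_SEQPACKET = 5  # numeric value of SOCK_SEQPACKET on Linux
STATE_NAMES = {1: "unconnected", 2: "connecting", 3: "connected", 4: "disconnecting"}


def parse_proc_net_unix(text):
    """Return (inode, state, path) for each SOCK_SEQPACKET row of /proc/net/unix."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 7)
        if len(fields) < 7:
            continue
        if int(fields[4], 16) != SOCK_SEQPACKET:
            continue
        state = STATE_NAMES.get(int(fields[5], 16), fields[5])
        path = fields[7] if len(fields) > 7 else ""
        rows.append((int(fields[6]), state, path))
    return rows


def pids_holding(inode):
    """Return PIDs with an open descriptor on the socket with this inode."""
    target = "socket:[%d]" % inode
    pids = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = "/proc/%s/fd" % pid
        try:
            for fd in os.listdir(fd_dir):
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    pids.append(int(pid))
                    break
        except OSError:
            # Process exited, or we lack permission to inspect it.
            continue
    return pids


if __name__ == "__main__":
    with open("/proc/net/unix") as handle:
        for inode, state, path in parse_proc_net_unix(handle.read()):
            print(inode, state, path or "(unnamed)", pids_holding(inode))
```

Run as root so that the descriptors of other users' processes are visible. Any process shown holding an unconnected seqpacket socket would be a candidate for closer inspection.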

Tim Gardner (timg-tpi) wrote :

Herton - how about instrumenting a kernel with a 'WARN_ON(sk->sk_state != TCP_ESTABLISHED)' in the right place. Wouldn't that tell us the name of the offending thread ?

Herton R. Krzesinski (herton) wrote :

Yes, WARN_ON should be enough to trace what app is at fault.

@LaMont: can you install the kernel at http://people.canonical.com/~herton/lp791512/r4/ and test? It should continue to work but emit warnings in the kernel log; please attach the log here when that happens.

LaMont Jones (lamont) wrote :

Installed, will advise.

LaMont Jones (lamont) wrote :

Interestingly, the r4 kernel fails to pass some (but not all) forwarded traffic, and does not log anything. I have reverted to the r3 kernel, which continues to work.

Herton R. Krzesinski (herton) wrote :

This is strange. So the code shuffling made another bug appear somehow (timing?).

@LaMont: can you test a new kernel? It's at http://people.canonical.com/~herton/lp791512/r5/

Let us know if you get failures or warnings with it.

Changed in linux-meta (Ubuntu):
status: New → Invalid

Herton R. Krzesinski (herton) wrote :

LaMont, maybe there is memory corruption in this case. Can you also test the kernel at http://people.canonical.com/~herton/lp791512/r6/ and boot it with the 'slub_debug' kernel boot parameter added?

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 791512

and then change the status of the bug back to 'New'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired