Ethernet dies during large file transfers on vexpress

Bug #673820 reported by Matt Waddel
This bug affects 4 people
Affects                         Status     Importance  Assigned to  Milestone
Linaro Linux                    Won't Fix  High        Paweł Moll
linux-linaro-vexpress (Ubuntu)  Invalid    Undecided   Unassigned

Bug Description

While transferring large files, the Ethernet connection silently dies. I was able to reproduce this problem by transferring a kernel using scp -r. This particular kernel had some files >200M. Many small transfers succeeded, but when scp tried to transfer a large file, the Ethernet stopped working. I could get the network back up by typing "ifconfig eth0 down" and then "ifconfig eth0 up".

Tags: ve-a9x4
Loïc Minier (lool)
summary: - Ethernet dies during large file transfers
+ Ethernet dies during large file transfers on vexpress
Matt Waddel (mwaddel) wrote :

After looking a bit more at this problem, I believe the fundamental problem is that the USB drive
doesn't take data from the Ethernet stream quickly enough. I found that if I create a drive in
RAM and run the same "scp -r" command, it doesn't fail. However, the Ethernet still shouldn't
fail like this when this error condition occurs.
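
The RAM-versus-USB comparison described above could be set up roughly as follows. This is only a sketch: the helper names `make_test_file` and `mount_ramdisk`, the mount point, and the sizes are assumptions for illustration and do not come from the report.

```shell
#!/bin/sh
# Hypothetical helpers for comparing a RAM-backed destination with a USB one.

# Create a file of SIZE_MB mebibytes filled with zeros.
# usage: make_test_file PATH SIZE_MB
make_test_file() {
    dd if=/dev/zero of="$1" bs=1048576 count="$2" 2>/dev/null
}

# Mount a tmpfs ("drive in RAM") on the board; needs root.
# usage: mount_ramdisk MOUNTPOINT SIZE_MB
mount_ramdisk() {
    mkdir -p "$1" && mount -t tmpfs -o "size=${2}m" tmpfs "$1"
}
```

For example, run `mount_ramdisk /mnt/ramtest 600` on the board and `make_test_file big.bin 300` on the host, then copy `big.bin` with scp once into `/mnt/ramtest` and once into a directory on the USB drive. If the report's observation holds, only the second copy should stall.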

Matt Waddel (mwaddel) wrote :

More info:

# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:02:f7:00:3c:ec
          inet addr:192.168.1.109 Bcast:192.168.1.255 Mask:255.255.255.0
          inet6 addr: fe80::202:f7ff:fe00:3cec/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:222808 errors:0 dropped:1095 overruns:0 frame:0
          TX packets:49092 errors:0 dropped:0 overruns:0 carrier:0
          collisions:45274 txqueuelen:1000
          RX bytes:336727204 (336.7 MB) TX bytes:3495507 (3.4 MB)
          Interrupt:47

This is with only my host, a hub, and the vexpress connected. The
number of collisions and dropped packets seems way too high.
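
To compare counters like these between setups, the relevant numbers can be pulled out of the ifconfig output mechanically. The sketch below assumes the older net-tools output format shown in this report; the function names `iface_counters` and `percent` are made up for illustration.

```shell
#!/bin/sh
# Print "rx_packets rx_dropped tx_packets collisions" from net-tools style
# ifconfig output read on stdin (the format shown in this report).
iface_counters() {
    awk '
        /RX packets:/ { split($2, a, ":"); rx = a[2]; split($4, b, ":"); rxd = b[2] }
        /TX packets:/ { split($2, a, ":"); tx = a[2] }
        /collisions:/ { split($1, a, ":"); col = a[2] }
        END { print rx, rxd, tx, col }
    '
}

# Integer percentage of PART in WHOLE, rounded down; prints 0 if WHOLE is 0.
# usage: percent PART WHOLE
percent() {
    if [ "$2" -eq 0 ]; then
        echo 0
    else
        echo $(( $1 * 100 / $2 ))
    fi
}
```

Running `ifconfig eth0 | iface_counters` on the output above gives `222808 1095 49092 45274`, and `percent 45274 49092` gives 92, i.e. the collision count is about 92% of the TX packet count, while dropped packets are well under 1% of RX packets.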

Ken Werner (kwerner) wrote :

Hm, this is interesting. I've directly connected the vexpress to my thinkpad and didn't see any collisions.
My testcase was just to transfer a ~500 MB tarball via scp:
Thinkpad:
<snip>
  $ scp gcc-linaro-20101115.tar.bz2 192.168.100.2:
  gcc-linaro-20101115.tar.bz2 23% 126MB 231.5KB/s - stalled -
  Timeout, server not responding.
  lost connection
  $ ping -c 5 192.168.100.2
  PING 192.168.100.2 (192.168.100.2) 56(84) bytes of data.

  --- 192.168.100.2 ping statistics ---
  5 packets transmitted, 0 received, 100% packet loss, time 4032ms
</snip>

vexpress after the network died (via serial console):
<snip>
  # ifconfig eth0
  eth0 Link encap:Ethernet HWaddr 00:02:f7:00:3c:xx
            inet addr:192.168.100.2 Bcast:192.168.100.255 Mask:255.255.255.0
            inet6 addr: fe80::202:f7ff:fe00:3c9d/64 Scope:Link
            UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
            RX packets:146824 errors:0 dropped:27 overruns:0 frame:0
            TX packets:21315 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:209272095 (209.2 MB) TX bytes:1661565 (1.6 MB)
            Interrupt:47
</snip>

I noticed that the RX/TX packet counts still increase on the vexpress when attempting to connect to the board.

John Rigby (jcrigby) wrote :

Matt: Is this still broken in Natty (2.6.37 kernel)?

Matt Waddel (mwaddel) wrote :

Yes, this still fails in 2.6.37-rc4.

John Rigby (jcrigby)
Changed in linux-linaro-vexpress (Ubuntu):
status: New → Confirmed
Changed in linux-linaro:
status: New → Confirmed
John Rigby (jcrigby) wrote :

This needs to be assigned to KWG or the vexpress landing team.

Loïc Minier (lool) wrote :

We don't have a vexpress landing team; could we escalate this to ARM?

Matt, any good contact to forward this to?

Matt Waddel (mwaddel) wrote :

Probably the best place to start would be with:

<email address hidden>

Loïc Minier (lool) wrote :

Hey Paweł,

would you know who to raise this bug to within ARM? We currently don't have people dedicated to vexpress-specific bugs in Linaro, and we're looking for some help from BSP folks on your side :-)

Thanks!

Paweł Moll (pawel-moll) wrote :

Well, I feel it will end on my desk anyway ;-)

I can't commit to any dates right now, but I've put this on my list and will keep you informed. Of course I'll have to reproduce it here...

Loïc Minier (lool) wrote :

Ok; I think dmart and Will Deacon were discussing this bug on IRC earlier and planned to get together tomorrow to see whether they could do something about smc911x.c.

12:02 < wildea01> I had a brief dig in the smsc9118 driver and found a bunch of
          problems
[...]
12:03 < wildea01> (off the top of my head): (1) There's a must-be-one bit in
          one of the control registers that is 0 out of reset and we don't
          write it
12:03 < wildea01> (2) There are read-after-read, read-after-write etc minimum
          delays that the driver doesn't honour
12:03 < wildea01> (3) the fifo fast forward function is called with number of
          bytes instead of number of words (or the other way around, can't
          remember)
12:04 < wildea01> (4) the locking is done too low-level (i.e. around the
          register accessors) which gives scope for deadlock if the caller
          holds other locks
12:05 < wildea01> they're the main things I can remember
[...]
12:05 < wildea01> I remember having to add some locks to the NAPI poll handler
          to make it play nice with the IRQ handler
[...]
12:06 < wildea01> on top of this, the hardware could be broken too
12:06 < dmart> mattman had a pretty reliable testcase for this ... I can repost
          it to linaro-dev if anyone is interested
12:06 < wildea01> I started writing a new driver so I could at least validate
          the hardware but I didn't get very far because I had other stuff to do
[...]
12:11 < lool> wildea01, dmart: I'm not sure who will work on this bug; I don't
          see anybody working on bugs which affect vexpress these days because
          we don't have an ARM LT; if it affects Samsung, we could try to
          mention it there
12:11 < wildea01> I have a hunch that the problem is related to receiving bad
          data from the network / flow control
12:11 < lool> wildea01, dmart, davidgiluk: If any of you has the bandwidth to
          work on it, that would be good :-)
12:11 < davidgiluk> lool: I don't have physical access to any of the boards
12:12 < wildea01> dmart: I can look at this tomorrow with you if you like but
          after that I doubt I'll have any time

Changed in linux-linaro:
assignee: nobody → Paweł Moll (pawel-moll)
John Rigby (jcrigby) wrote :

Can we close this by saying it is a hw issue?

Paweł Moll (pawel-moll) wrote :

As said above, it looks more like the SMP problems in the smsc driver.

Now, frankly speaking, I totally forgot about this issue, and it looks like I won't have a chance to look into it in the near future. But, as we have an ARM Landing Team now (allegedly ;-), maybe this bug should be assigned to them?

Mounir Bsaibes (mounir-bsaibes) wrote :

Setting the priority high.

Changed in linux-linaro:
importance: Undecided → High
Mounir Bsaibes (mounir-bsaibes) wrote :

Needs a device driver fix, which will not be handled by the linaro-linux project.

Changed in linux-linaro:
status: Confirmed → Won't Fix
John Rigby (jcrigby)
Changed in linux-linaro-vexpress (Ubuntu):
status: Confirmed → Invalid
Thomas Abraham (thomas-ab) wrote :

The smdkv310 board uses an smsc9215 Ethernet controller. I had a similar problem with large file transfers on the smdkv310 board (using the smsc911x driver). But when the SMSC911X_USE_32BIT flag is added to the driver's platform data flags, large file transfers do work. I will debug the issue with 16BIT register/fifo writes on the smdkv310 board.

Dave Martin (dave-martin-arm) wrote :

We should raise this with the relevant upstream developer(s).

The general feeling seems to be that this is almost certainly a driver issue.

Matt Waddel (mwaddel) wrote :

Just to be sure, I checked the platform data flags for the
vexpress and the SMSC911X_USE_32BIT flag is set.

static struct smsc911x_platform_config v2m_eth_config = {
	.flags		= SMSC911X_USE_32BIT,
	.irq_polarity	= SMSC911X_IRQ_POLARITY_ACTIVE_HIGH,
	.irq_type	= SMSC911X_IRQ_TYPE_PUSH_PULL,
	.phy_interface	= PHY_INTERFACE_MODE_MII,
};

Ulrich Weigand (uweigand) wrote :

I'm running into this issue with failing network connections on vexpress while trying to run the GDB testsuite in remote mode ...

Has there been any progress on this bug in the meantime?

Dave Martin (dave-martin-arm) wrote : Re: [Bug 673820] Re: Ethernet dies during large file transfers on vexpress

On Tue, Jun 28, 2011 at 04:05:57PM -0000, Ulrich Weigand wrote:
> I'm running into this issue with failing network connections on vexpress
> while trying to run the GDB testsuite in remote mode ...
>
> Has there been any progress on this bug in the meantime?

I don't believe so.

Taking the network down and bringing it up again gets things working
again, and it's possible to automate this with a simple script.
That won't work for all uses though.
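
A script along those lines might look like the sketch below. It is not the script used by anyone in this thread: the peer address, the check interval, and all function names are assumptions, and it presumes that a host which normally answers pings is reachable from the board.

```shell
#!/bin/sh
# Hypothetical watchdog: bounce the interface when a known peer stops answering.
IFACE="${IFACE:-eth0}"
PEER="${PEER:-192.168.1.1}"   # assumed address of a host that should always answer
INTERVAL="${INTERVAL:-30}"    # seconds between checks (arbitrary choice)

# Succeeds if the peer answers at least one of three pings.
check_peer() {
    ping -c 3 -W 2 "$PEER" >/dev/null 2>&1
}

# Same recovery as typed by hand in the bug description.
bounce_iface() {
    ifconfig "$IFACE" down
    sleep 1
    ifconfig "$IFACE" up
}

# One check: do nothing if the peer answers, otherwise restart the interface.
watchdog_once() {
    if check_peer; then
        return 0
    fi
    echo "no reply from $PEER, restarting $IFACE"
    bounce_iface
}

# Run checks forever; call this as root on the board.
watchdog_loop() {
    while :; do
        watchdog_once
        sleep "$INTERVAL"
    done
}
```

Calling `watchdog_loop` from an init script or a serial console session would keep the board reachable, but a transfer that was in progress when the interface stalled may still fail.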

AFAIK, the underlying cause remains a problem with the smsc911x network
driver but unfortunately I think that nobody has identified or fixed it.

Paweł Moll (pawel-moll) wrote :

Could anyone give this patch a go? It's just a hack (if I'm right, this will have to go into Boot Monitor, not the kernel), but it may improve Ethernet stability, potentially at some cost in performance. Anyway - let me know.

Matt Waddel (mwaddel) wrote :

Hi Pawel,

I ran my network tests 10x without any failures, so this patch helps a lot. (The tests would usually fail the first time.)

One question: what is this patch actually doing? Is it preventing the transfer lock-up, or recovering from a locked-up transfer? The reason I ask is that the transfer rate goes to 0 several times during the tests, but recovers each time. Either way, this is definitely an improvement over the current situation.

Paweł Moll (pawel-moll) wrote :

The "lock up" was caused by SMSC RX FIFO underruns, caused in turn by a marginal timing issue between the SMSC and the USB host controller. The hack enforces an additional wait cycle between accesses. Unfortunately it affects all peripherals, hence the performance penalty.

What you observe is, I guess, the situation where buffers fill with data from Ethernet and are flushed to the USB mass storage device, and the host controller driver is "taking over" the bus, reducing SMSC traffic. Maybe you could use Streamline to see what is going on there :-)
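
A rough view of these rate dips can also be had without a profiler by sampling the kernel's per-interface byte counter in sysfs. The sketch below is only an illustration; the function names are made up, and counter wrap-around is not handled.

```shell
#!/bin/sh
# Hypothetical throughput logger based on the kernel's per-interface counters.

# Average rate in KiB/s between two byte-counter samples.
# usage: rate_kib PREV_BYTES CUR_BYTES SECONDS
rate_kib() {
    echo $(( ($2 - $1) / 1024 / $3 ))
}

# Print the receive rate of IFACE once every INTERVAL seconds, until interrupted.
# Counter wrap-around is not handled and would show up as a negative rate.
# usage: log_rx_rate IFACE INTERVAL
log_rx_rate() {
    counter="/sys/class/net/$1/statistics/rx_bytes"
    prev=$(cat "$counter")
    while sleep "$2"; do
        cur=$(cat "$counter")
        echo "$(date +%T) rx $(rate_kib "$prev" "$cur" "$2") KiB/s"
        prev=$cur
    done
}
```

Running `log_rx_rate eth0 1` on the board during a transfer prints one line per second. Stretches of `0 KiB/s` that recover on their own would match the behaviour described above, whereas a rate that stays at 0 would point to the original lock-up.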

Dave Martin (dave-martin-arm) wrote :

On Tue, Aug 23, 2011 at 10:29 AM, Paweł Moll <email address hidden> wrote:
> The "lock up" was caused by SMSC RX FIFO underruns caused by marginal
> timing issue between SMSC and USB host controller. The hack is enforcing
> additional wait cycle between accesses. Unfortunately it affects all
> peripherals, thus the performance penalty.

Comments! and/or proper symbolic names for those magic numbers would
be good for a final patch, of course :)

Do you believe the patch is a fix for this particular issue, or will
it just reduce the probability (and hence frequency) of problems?

Note that in either case the smsc911x driver is still believed to have
problems of its own, like incorrect FIFO skipping on bad packets, and
some SMP safety issues.

Cheers
---Dave

Paweł Moll (pawel-moll) wrote : RE: [Bug 673820] Re: Ethernet dies during large file transfers on vexpress

> Comments! and/or proper symbolic names for those magic numbers would
> be good for a final patch, of course :)

I thought I made this clear - there will be no final patch :-)

The entity responsible for SMC configuration is Boot Monitor. This
patch is just a hack to confirm (or deny) my theory.

> Do you believe the patch is a fix for this particular issue, or will
> it just reduce the probability (and hence frequency) of problems?

This hack is a work-around rather than a fix as it is probably too
expensive in terms of I/O performance. We may be able to fix it properly
on IO FPGA level. No promises, terms and conditions apply ;-)

> Note that in either case the smsc911x driver is still believed to have
> problems of its own, like incorrect FIFO skipping on bad packets, and
> some SMP safety issues.

No argument here. Even more - the driver can (and should) detect the
problematic situation and perform soft reset to recover. So there are
plenty of opportunities for improvements. This doesn't change the fact
that (widely defined) hardware behaves incorrectly in this case.

Paweł


Paweł Moll (pawel-moll) wrote :

Ok, we have managed to fix the problem at the IO FPGA level. The updated bitfile will be released on the next VE CD (end of this year). In the meantime, interested parties may contact ARM Support to get the fix.

Ulrich Weigand (uweigand) wrote :

I've now installed the updated FPGA bitfiles on my VE, and this does indeed appear to fix the problems I've been seeing. Thanks!
