[hns3-0115] add 8 BD limit for tx flow

Bug #1859756 reported by Fred Kimmy
This bug affects 2 people
Affects           Status        Importance  Assigned to  Milestone
kunpeng920        Fix Released  Undecided   Ike Panhc
Ubuntu-18.04      Fix Released  Undecided   Ike Panhc
Ubuntu-18.04-hwe  Fix Released  Undecided   Unassigned
Ubuntu-20.04      Fix Released  Undecided   Unassigned
Upstream-kernel   Fix Released  Undecided   Unassigned
linux (Ubuntu)    Fix Released  Undecided   Unassigned
Bionic            Fix Released  Medium      Ike Panhc

Bug Description

[Impact]
We have received reports that iSCSI and Spark tests fail on hns3.

[Fix]
Cherry-pick/backport patches from upstream.
net: hns3: add 8 BD limit for tx flow
net: hns3: avoid mult + div op in critical data path
net: hns3: remove some ops in struct hns3_nic_ops
net: hns3: fix for not calculating tx bd num correctly
net: hns3: unify maybe_stop_tx for TSO and non-TSO case
net: hns3: add check for max TX BD num for tso and non-tso case
net: hns3: fix for TX queue not restarted problem
net: hns3: fix a use after free problem in hns3_nic_maybe_stop_tx()

[Test]
No known way to reproduce it in our lab. Regression test only.

[Regression Potential]
Patchset only affects hns3 driver. Minimal risk for other drivers and platform.

[Bug Description]
 A single transmit packet can span up to 8 descriptors (BDs).
 A TSO transmit packet can span up to 63 descriptors, and each
 segment within the TSO packet can span up to 8 descriptors.

 If a packet needs more than 8 BDs and the total size of any 7
 consecutive frags is less than the MSS, the hardware does not
 support it, and the driver has to linearize the SKB.
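
One reading of the constraint above can be sketched in Python. A packet that fits in 8 BDs is always fine. Beyond that, a TSO packet is acceptable only if it stays within 63 BDs and every run of 7 consecutive buffers carries at least one MSS of data; otherwise the skb has to be linearized. This is an illustrative simplification, not the driver's code: the function name, its arguments and the constants' names are hypothetical, and header lengths are ignored.

```python
MAX_BD_PER_SEGMENT = 8      # per packet, and per TSO segment (from the description above)
MAX_BD_PER_TSO_PACKET = 63  # per TSO packet (from the description above)


def needs_linearize(bd_sizes, mss):
    """Return True if the packet would have to be linearized.

    bd_sizes -- byte length of each descriptor the packet would occupy
                (hypothetical input for this sketch)
    mss      -- TSO segment size in bytes, or 0 for a non-TSO packet
    """
    if len(bd_sizes) <= MAX_BD_PER_SEGMENT:
        return False  # fits in 8 descriptors, nothing to check
    if mss == 0 or len(bd_sizes) > MAX_BD_PER_TSO_PACKET:
        return True   # non-TSO packet over 8 BDs, or TSO packet over 63 BDs

    # Slide a window of 7 consecutive buffers across the packet and
    # require each window to hold at least one MSS of data.
    window = MAX_BD_PER_SEGMENT - 1
    total = sum(bd_sizes[:window])
    if total < mss:
        return True
    for i in range(window, len(bd_sizes)):
        total += bd_sizes[i] - bd_sizes[i - window]
        if total < mss:
            return True
    return False
```

For example, `needs_linearize([100] * 9, 1448)` returns True because any 7 of those buffers hold only 700 bytes, while `needs_linearize([300] * 9, 1448)` returns False.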

[Actual Results]
 iscsi and bigdata spark test OK

[Expected Results]
 iscsi and bigdata spark test OK

[Reproducibility]
 Inevitably

[Additional information]
 Hardware: D06
 Firmware: NA
 Kernel: NA
 DTS2018091810050

[Resolution]
 Software uses skb_copy to merge the frags:

51e8439f3496 net: hns3: add 8 BD limit for tx flow
5f543a54eec0 net: hns3: fix for not calculating tx bd num correctly

Ike Panhc (ikepanhc)
tags: added: ikeradar
description: updated
Ike Panhc (ikepanhc) wrote :

Patch 5f543a54eec0 ("net: hns3: fix for not calculating tx bd num correctly") fixes 3fe13ed95dd3 ("net: hns3: avoid mult + div op in critical data path"), which has been in mainline since 5.1.

Ike Panhc (ikepanhc) wrote :

So it is not suitable for the 4.15 kernel.

Changed in kunpeng920:
status: New → Fix Committed
Ike Panhc (ikepanhc)
tags: removed: ikeradar
Changed in kunpeng920:
status: Fix Committed → Fix Released
Fred Kimmy (kongzizaixian) wrote :

net: hns3: add 8 BD limit for tx flow
net: hns3: fix a use after free problem in hns3_nic_maybe_stop_tx()
net: hns3: avoid mult + div op in critical data path
net: hns3: fix for not calculating tx bd num correctly

this patchset have cause some error for net card as following:
IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
Apr 30 10:57:22 arm-u18-48c kernel: [ 15.050113] hns3 0000:bd:00.0 eth0: link up
Apr 30 10:57:22 arm-u18-48c kernel: [ 15.050130] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Apr 30 11:00:07 arm-u18-48c kernel: [ 181.144833] Netfilter messages via NETLINK v0.30.
Apr 30 11:00:07 arm-u18-48c kernel: [ 181.151372] ip_set: protocol 6
Apr 30 11:14:59 arm-u18-48c kernel: [ 1073.485529] hrtimer: interrupt took 660 ns
Apr 30 11:17:09 arm-u18-48c kernel: [ 1202.814563] hns3 0000:bd:00.0: PPU_PF_ABNORMAL_INT_ST over_8bd_no_fe found [error status=0x1]
Apr 30 11:17:09 arm-u18-48c kernel: [ 1202.826307] hns3 0000:bd:00.0: PF Reset requested
Apr 30 11:17:09 arm-u18-48c kernel: [ 1202.878715] hns3 0000:bd:00.0: PF failed(=-5) to send mailbox message to VF
Apr 30 11:17:09 arm-u18-48c kernel: [ 1202.909837] hns3 0000:bd:00.0: inform reset to vf(1) failed -5!
Apr 30 11:17:09 arm-u18-48c kernel: [ 1202.918236] hns3 0000:bd:00.0: PF failed(=-5) to send mailbox message to VF
Apr 30 11:17:09 arm-u18-48c kernel: [ 1202.936307] hns3 0000:bd:00.0: inform reset to vf(2) failed -5!
Apr 30 11:17:09 arm-u18-48c kernel: [ 1202.954199] hns3 0000:bd:00.0: PF failed(=-5) to send mailbox message to VF
Apr 30 11:17:09 arm-u18-48c kernel: [ 1202.964959] hns3 0000:bd:00.0: inform reset to vf(3) failed -5!
Apr 30 11:17:09 arm-u18-48c kernel: [ 1202.978401] hns3 0000:bd:00.0: PF failed(=-5) to send mailbox message to VF
Apr 30 11:17:09 arm-u18-48c kernel: [ 1202.994278] hns3 0000:bd:00.0: inform reset to vf(4) failed -5!
Apr 30 11:17:09 arm-u18-48c kernel: [ 1203.006549] hns3 0000:bd:00.0: PF failed(=-5) to send mailbox message to VF
Apr 30 11:17:09 arm-u18-48c kernel: [ 1203.016382] hns3 0000:bd:00.0: inform reset to vf(5) failed -5!
Apr 30 11:17:09 arm-u18-48c kernel: [ 1203.026513] hns3 0000:bd:00.0: PF failed(=-5) to send mailbox message to VF
Apr 30 11:17:09 arm-u18-48c kernel: [ 1203.036399] hns3 0000:bd:00.0: inform reset to vf(6) failed -5!
Apr 30 11:17:09 arm-u18-48c kernel: [ 1203.050229] hns3 0000:bd:00.0: PF failed(=-5) to send mailbox message to VF
Apr 30 11:17:09 arm-u18-48c kernel: [ 1203.059686] hns3 0000:bd:00.0: inform reset to vf(7) failed -5!
Apr 30 11:17:10 arm-u18-48c kernel: [ 1204.236266] hns3 0000:bd:00.0 eth0: link down
Apr 30 11:17:10 arm-u18-48c kernel: [ 1204.364172] hns3 0000:bd:00.0: prepare wait ok
Apr 30 11:17:10 arm-u18-48c kernel: [ 1204.600847] hns3 0000:bd:00.0: The firmware version is 0109210a
Apr 30 11:17:10 arm-u18-48c kernel: [ 1204.613248] hns3 0000:bd:00.0: Reset done, hclge driver initialization finished.
Apr 30 11:17:11 arm-u18-48c kernel: [ 1205.522648] hns3 0000:bd:00.0: SSU_PORT_BASED_ERR_INT roc_pkt_without_key_port found [error status=0x1]
Apr 30 11:17:11 arm-u18-48c kernel: [ 1205.522658] hns3 0000:bd:00.0: PPU_PF_ABNORMAL_INT_ST over_8bd_no_fe found [error status=0x1]
Apr 30 11:17:11 arm-u1...
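
To check another machine's kernel log for the same signature, a small filter like the following may help. It is only a sketch: the regular expression is modelled on the hns3 lines in this excerpt and may not match the message layout of other kernel versions, and `find_hw_errors` and `SAMPLE_LINES` are names made up for this example.

```python
import re

# Matches lines of the form
#   hns3 <pci-address>: <SOURCE> <error_name> found [error status=0x..]
# as seen in the excerpt above.
HW_ERR = re.compile(
    r"hns3 (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): "
    r"(?P<source>\w+) (?P<error>\w+) found "
    r"\[error status=(?P<status>0x[0-9a-f]+)\]"
)


def find_hw_errors(lines):
    """Return one dict per hns3 hardware error line found in `lines`."""
    hits = []
    for line in lines:
        match = HW_ERR.search(line)
        if match:
            hits.append(match.groupdict())
    return hits


# Two lines copied from the log above: one link event, one hardware error.
SAMPLE_LINES = [
    "Apr 30 10:57:22 arm-u18-48c kernel: [ 15.050113] hns3 0000:bd:00.0 eth0: link up",
    "Apr 30 11:17:09 arm-u18-48c kernel: [ 1202.814563] hns3 0000:bd:00.0: "
    "PPU_PF_ABNORMAL_INT_ST over_8bd_no_fe found [error status=0x1]",
]

for hit in find_hw_errors(SAMPLE_LINES):
    print(hit["dev"], hit["source"], hit["error"], hit["status"])
```

Only the second sample line is reported; the link-up line does not match the pattern.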


Changed in kunpeng920:
status: Fix Released → New
dann frazier (dannf) wrote :

@Fred: In comment #3 you state "this patchset have cause some error". If this patch set has introduced a bug, please report that in a new bug. However, since you moved the Ubuntu-18.04 task back to "New" at that time, I wonder if your intent was to demonstrate that those patches are *required to fix* a bug in 4.15.

  1) Can you clarify the above?

  2) Which kernel version created the log in Comment #3?

Fred Kimmy (kongzizaixian) wrote :

=>@Fred: In comment #3 you state "this patchset have cause some error". If this patch set has
=>introduced a bug, please report that in a new bug. However, since you moved the Ubuntu-18.04
=>task back to "New" at that time, I wonder if your intent was to demonstrate that those patches
=>are *required to fix* a bug in 4.15.

=> 1) Can you clarify the above?

=>2) Which kernel version created the log in Comment #3?

Without the above patchset, Ubuntu 18.04.1 reproduces this error log. Can you backport it into an Ubuntu 18.04.1 update?

Ike Panhc (ikepanhc) wrote :

Hi Xinwei,

I get a lot of conflicts when cherry-picking d1a37dedcfcf ("net: hns3: fix a use after free problem in hns3_nic_maybe_stop_tx()") to the bionic 4.15 Ubuntu kernel. Is there any way to fix this?

Also, the bug description mentions iSCSI and Spark tests. Could you describe how to reproduce the failure?

Andrew Cloke (andrew-cloke) wrote :

Marking as incomplete while waiting for a detailed reproducer and assistance with the merge conflict.

Changed in kunpeng920:
status: New → Incomplete
Andrew Cloke (andrew-cloke) wrote :

Working with Huawei to identify a patchset that will cleanly apply to 4.15.

Changed in kunpeng920:
assignee: nobody → Ike Panhc (ikepanhc)
Andrew Cloke (andrew-cloke) wrote :

Summary of email conversation between Ike Panhc and <email address hidden>:

On 2020/6/1 12:17, Ike Panhc wrote:
...
> Since our target is 51e8439f3496 ("net: hns3: add 8 BD limit for tx flow"),
> and we have its fix d1a37dedcfcf ("net: hns3: fix a use after free problem in hns3_nic_maybe_stop_tx()")
>
> Are patches 3fe13ed95dd3 ("net: hns3: avoid mult + div op in critical data path") and
> 5f543a54eec0 ("net: hns3: fix for not calculating tx bd num correctly") needed too?
>
> If they are not, the backport will be much simpler, with less risk of regression.
>

Hi Ike:

These two are not needed. Thanks.

<End of summary>

Based on this, the next step is to investigate backporting the two patches that directly address the subject of this bug report.

Changed in kunpeng920:
status: Incomplete → Triaged
Ike Panhc (ikepanhc) wrote :

I misremembered which patch introduces conflicts when cherry-picking to 4.15. We still need to work on d1a37dedcfcf ("net: hns3: fix a use after free problem in hns3_nic_maybe_stop_tx()").

Ike Panhc (ikepanhc) wrote :

Finished backporting; the git branch is here [1]. I have also built debs [2].

[1] https://kernel.ubuntu.com/git/ikepanhc/public.git/log/?h=lp1859756
[2] https://kernel.ubuntu.com/~ikepanhc/lp1859756/

Ike Panhc (ikepanhc) wrote :

I ran iperf3 with the kernel debs from #11 on eno3/4 for 1 hr each, and both interfaces reached their bandwidth limit.

https://kernel.ubuntu.com/~ikepanhc/lp1859756/submission_2020-06-09T08.18.12.667092.html

Ike Panhc (ikepanhc) wrote :

Another iperf3 test on eno3/4 of d06ES passed.

https://kernel.ubuntu.com/~ikepanhc/lp1859756/submission_2020-06-11T07.45.27.697462.html#1-13-log

The next step for me is to run the iperf3 test on eno1 of d061, which is connected at 10Gb/s.

Changed in kunpeng920:
status: Triaged → In Progress
Ike Panhc (ikepanhc) wrote :

Long term run on eno1 of d061 looks good to me.

ubuntu@scobee:~$ iperf -c 10.228.68.67 -t 18000
------------------------------------------------------------
Client connecting to 10.228.68.67, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.228.68.118 port 42408 connected with 10.228.68.67 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-18000.0 sec 17.8 TBytes 8.70 Gbits/sec
ubuntu@scobee:~$ iperf -c 10.228.68.67 -t 18000 -P2
------------------------------------------------------------
Client connecting to 10.228.68.67, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 4] local 10.228.68.118 port 42424 connected with 10.228.68.67 port 5001
[ 3] local 10.228.68.118 port 42422 connected with 10.228.68.67 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-18000.0 sec 9.56 TBytes 4.67 Gbits/sec
[ 4] 0.0-18000.0 sec 9.42 TBytes 4.60 Gbits/sec
[SUM] 0.0-18000.0 sec 19.0 TBytes 9.27 Gbits/sec
ubuntu@scobee:~$ ifconfig | grep -B2 118
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 10.228.68.118 netmask 255.255.255.0 broadcast 10.228.68.255
ubuntu@scobee:~$ uname -a
Linux scobee 4.15.0-106-generic #107-Ubuntu SMP Thu Jun 4 11:28:55 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
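
As a sanity check on the figures above, the reported bandwidth can be recomputed from the transfer size and duration. The sketch below assumes iperf counts a TByte as 2^40 bytes and a Gbit as 10^9 bits; that unit convention is an assumption made here, not something stated in the output. `gbits_per_sec` is a helper written for this example.

```python
def gbits_per_sec(tbytes, seconds):
    """Convert a transfer of `tbytes` TBytes over `seconds` into Gbits/sec.

    Assumes 1 TByte = 2**40 bytes and 1 Gbit = 10**9 bits.
    """
    total_bits = tbytes * 2**40 * 8
    return total_bits / seconds / 1e9


# Single-stream run: 17.8 TBytes in 18000 s
print(round(gbits_per_sec(17.8, 18000), 2))
# Two-stream run: 9.56 and 9.42 TBytes in 18000 s
print(round(gbits_per_sec(9.56, 18000), 2))
print(round(gbits_per_sec(9.42, 18000), 2))
```

Under that assumption the results agree with the 8.70, 4.67 and 4.60 Gbits/sec that iperf printed, so the per-stream numbers are internally consistent.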

Ike Panhc (ikepanhc) wrote :

I also ran 96 threads on 10Gb/s hns3 for 5 hr and saw no kernel error messages.

@Xinwei,

Could you or your colleague run regression test on kernel debs in #11 and let me know if the backport is ok?

kernel debs are at https://kernel.ubuntu.com/~ikepanhc/lp1859756/

Andrew Cloke (andrew-cloke) wrote :

Marking as incomplete while waiting for the regression test runs requested in the last comment.

Changed in kunpeng920:
status: In Progress → Incomplete
Fred Kimmy (kongzizaixian) wrote :

=>Could you or your colleague run regression test on kernel debs in #11 and let me know if the backport is ok?

The test is OK in our CI environment.

Ike Panhc (ikepanhc) wrote :

Thanks. I will run a final regression test and then propose the patches for the SRU process.

Changed in kunpeng920:
status: Incomplete → In Progress
Ike Panhc (ikepanhc)
tags: added: ikeradar
Ike Panhc (ikepanhc)
Changed in linux (Ubuntu Bionic):
status: New → In Progress
Changed in linux (Ubuntu):
status: New → Fix Released
Ike Panhc (ikepanhc)
description: updated
Ike Panhc (ikepanhc) wrote :
Stefan Bader (smb)
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
Ike Panhc (ikepanhc)
Changed in linux (Ubuntu Bionic):
assignee: nobody → Ike Panhc (ikepanhc)
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in kunpeng920:
status: In Progress → Fix Committed
Ike Panhc (ikepanhc) wrote :

These patches are now targeting 18.04.5-sru-1.

Ike Panhc (ikepanhc)
tags: removed: ikeradar
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Ike Panhc (ikepanhc) wrote :

Thanks. Ubuntu-5.4.0-43.47 works well for me.

tags: added: verification-done-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-115.116

---------------
linux (4.15.0-115.116) bionic; urgency=medium

  * bionic/linux: 4.15.0-115.116 -proposed tracker (LP: #1893055)

  * [Potential Regression] dscr_inherit_exec_test from powerpc in
    ubuntu_kernel_selftests failed on B/E/F (LP: #1888332)
    - powerpc/64s: Don't init FSCR_DSCR in __init_FSCR()

linux (4.15.0-114.115) bionic; urgency=medium

  * bionic/linux: 4.15.0-114.115 -proposed tracker (LP: #1891052)

  * ipsec: policy priority management is broken (LP: #1890796)
    - xfrm: policy: match with both mark and mask on user interfaces

linux (4.15.0-113.114) bionic; urgency=medium

  * bionic/linux: 4.15.0-113.114 -proposed tracker (LP: #1890705)

  * Packaging resync (LP: #1786013)
    - update dkms package versions

  * Reapply "usb: handle warm-reset port requests on hub resume" (LP: #1859873)
    - usb: handle warm-reset port requests on hub resume

  * Bionic update: upstream stable patchset 2020-07-29 (LP: #1889474)
    - gpio: arizona: handle pm_runtime_get_sync failure case
    - gpio: arizona: put pm_runtime in case of failure
    - pinctrl: amd: fix npins for uart0 in kerncz_groups
    - mac80211: allow rx of mesh eapol frames with default rx key
    - scsi: scsi_transport_spi: Fix function pointer check
    - xtensa: fix __sync_fetch_and_{and,or}_4 declarations
    - xtensa: update *pos in cpuinfo_op.next
    - drivers/net/wan/lapbether: Fixed the value of hard_header_len
    - net: sky2: initialize return of gm_phy_read
    - drm/nouveau/i2c/g94-: increase NV_PMGR_DP_AUXCTL_TRANSACTREQ timeout
    - irqdomain/treewide: Keep firmware node unconditionally allocated
    - SUNRPC reverting d03727b248d0 ("NFSv4 fix CLOSE not waiting for direct IO
      compeletion")
    - spi: spi-fsl-dspi: Exit the ISR with IRQ_NONE when it's not ours
    - IB/umem: fix reference count leak in ib_umem_odp_get()
    - uprobes: Change handle_swbp() to send SIGTRAP with si_code=SI_KERNEL, to fix
      GDB regression
    - ALSA: info: Drop WARN_ON() from buffer NULL sanity check
    - ASoC: rt5670: Correct RT5670_LDO_SEL_MASK
    - btrfs: fix double free on ulist after backref resolution failure
    - btrfs: fix mount failure caused by race with umount
    - btrfs: fix page leaks after failure to lock page for delalloc
    - bnxt_en: Fix race when modifying pause settings.
    - hippi: Fix a size used in a 'pci_free_consistent()' in an error handling
      path
    - ax88172a: fix ax88172a_unbind() failures
    - net: dp83640: fix SIOCSHWTSTAMP to update the struct with actual
      configuration
    - drm: sun4i: hdmi: Fix inverted HPD result
    - net: smc91x: Fix possible memory leak in smc_drv_probe()
    - bonding: check error value of register_netdevice() immediately
    - mlxsw: destroy workqueue when trap_register in mlxsw_emad_init
    - ipvs: fix the connection sync failed in some cases
    - i2c: rcar: always clear ICSAR to avoid side effects
    - bonding: check return value of register_netdevice() in bond_newlink()
    - serial: exar: Fix GPIO configuration for Sealevel cards based on XR17V35X
    - scripts/decode_stacktrace: strip basepath from all paths
    - HID: i...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in kunpeng920:
status: Fix Committed → Fix Released