Ubuntu 17.04: machine crashes with Oops in dccp_v4_ctl_send_reset while running stress-ng.

Bug #1654073 reported by bugproxy
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
linux (Ubuntu)   Fix Released   High         Tim Gardner
Zesty            Fix Released   High         Tim Gardner

Bug Description

== Comment: #0 - PAVITHRA R. PRAKASH - 2016-12-28 03:39:50 ==
---Problem Description---

Ubuntu 17.04: machine crashes with Oops while running stress-ng.

---Steps followed-----

1. Install 17.04 on NV machine.
2. apt-get install stress-ng
3. stress-ng -a 0
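
Because the oops names the stress-ng-dccp task, a narrower reproduction may be possible by running only the DCCP stressor rather than every stressor. The following is a minimal sketch, assuming the installed stress-ng provides the `--dccp` and `--timeout` options; the helper name `build_dccp_stress_cmd`, the worker count, and the timeout are illustrative and do not come from the report.

```python
def build_dccp_stress_cmd(workers=4, timeout="60s"):
    """Return an argv list that runs only the DCCP stressor.

    Assumption: this stress-ng build supports --dccp N (N workers)
    and --timeout T (stop after T).
    """
    return ["stress-ng", "--dccp", str(workers), "--timeout", timeout]


if __name__ == "__main__":
    # Only print the command; on an affected kernel, running it may
    # crash the machine, so execution is left to the operator.
    print(" ".join(build_dccp_stress_cmd()))
```

The command is only printed here. Passing the returned list to `subprocess.run` would execute it.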

Logs
====
dccp af_alg joydev input_leds mac_hid at24 nvmem_core ofpart cmdlinepart powernv_flash mtd opal_prd powernv_rng ipmi_powernv ipmi_msghandler ibmpowernv uio_pdrv_genirq uio vmx_crypto ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear ses enclosure scsi_transport_sas hid_generic ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt bnx2x fb_sys_fops drm aacraid tg3 usbhid uas hid usb_storage mdio ahci libcrc32c libahci crc32c_vpmsum
[165938083361,3] OPAL: Trying a CPU re-init with flags: 0x1
[166456504189,3] OPAL: CPU 0x29 not in OPAL !
[167510150475,3] OPAL: Trying a CPU re-init with flags: 0x2
[168022446397,3] OPAL: CPU 0x29 not in OPAL !
[ 237.047216] CPU: 33 PID: 34694 Comm: stress-ng-dccp Not tainted 4.9.0-11-generic #12-Ubuntu
[ 237.047315] task: c000003312a69400 task.stack: c000003779698000
[ 237.047402] NIP: d00000002e7b0a7c LR: d00000002e7b21cc CTR: c000000000a0dd00
[ 237.047509] REGS: c000003fff68f670 TRAP: 0300 Not tainted (4.9.0-11-generic)
[ 237.047613] MSR: 900000010280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>
[ 237.049028] CR: 24002282 XER: 20000000
[ 237.049103] CFAR: c000000000008a60 DAR: 00000000000002f4 DSISR: 40000000 SOFTE: 1
GPR00: d00000002e7b21cc c000003fff68f8f0 d00000002e7bb670 0000000000000001
GPR04: c000002b5dbdc400 c000003c74c0a460 0000000000000474 c000003c74c0a474
GPR08: c000003c74c0a000 0000000000000000 c00000348cd25200 0000000000000000
GPR12: 0000000000002200 c000000007b72900 c000003fff68c000 0000000000000000
GPR16: 0000000000000000 0000000000000040 0000000000000001 0000000000002713
GPR20: 000000000000cc84 000000000100007f 000000000100007f c0000000013b2f00
GPR24: 0000000000000001 0000000000000001 0000000000000000 0000000000000004
GPR28: c0000000013b2f00 c000003c74c0a474 0000000000000000 c000002b5dbdc400
NIP [d00000002e7b0a7c] dccp_v4_ctl_send_reset+0xa4/0x2f0 [dccp_ipv4]
[ 237.051403] LR [d00000002e7b21cc] dccp_v4_rcv+0x5d4/0x850 [dccp_ipv4]
[ 237.051486] Call Trace:
[ 237.051529] [c000003fff68f8f0] [000000002713cc84] 0x2713cc84 (unreliable)
[ 237.051649] [c000003fff68f970] [d00000002e7b21cc] dccp_v4_rcv+0x5d4/0x850 [dccp_ipv4]
[ 237.051779] [c000003fff68fa50] [c000000000a01e40] ip_local_deliver_finish+0x170/0x350
[ 237.051932] [c000003fff68faa0] [c000000000a0276c] ip_local_deliver+0x5c/0x130
[ 237.052038] [c000003fff68fb10] [c000000000a02278] ip_rcv_finish+0x258/0x510
[ 237.052151] [c000003fff68fba0] [c000000000a02b44] ip_rcv+0x304/0x420
[ 237.052263] [c000003fff68fc30] [c0000000009a28bc] __netif_receive_skb_core+0x97c/0xda0
[ 237.052388] [c000003fff68fd10] [c0000000009a7ab4] process_backlog+0xd4/0x1e0
[ 237.052489] [c000003fff68fd80] [c0000000009a6f0c] net_rx_action+0x35c/0x480
[ 237.052603] [c000003fff68fe90] [c000000000b22a6c] __do_softirq+0x18c/0x3fc
[ 237.052726] [c000003fff68ff90] [c000000000029fb0] call_do_softirq+0x14/0x24
[ 237.052848] [c00000377969b920] [c00000000001765c] do_softirq_own_stack+0x5c/0xa0
[ 237.052992] [c00000377969b960] [c0000000000cfd48] do_softirq.part.3+0x68/0x90
[ 237.053112] [c00000377969b990] [c0000000000cfe44] __local_bh_enable_ip+0xd4/0x100
[ 237.053240] [c00000377969b9b0] [c000000000a06724] ip_finish_output2+0x244/0x460
[ 237.053372] [c00000377969ba50] [c000000000a0977c] ip_output+0xcc/0x180
[ 237.053485] [c00000377969bae0] [c000000000a08c78] ip_local_out+0x68/0x90
[ 237.053607] [c00000377969bb20] [d000000021966978] dccp_transmit_skb+0x320/0x550 [dccp]
[ 237.053739] [c00000377969bb90] [d00000002196732c] dccp_connect+0xf4/0x1f0 [dccp]
[ 237.053890] [c00000377969bc10] [d00000002e7b0320] dccp_v4_connect+0x308/0x400 [dccp_ipv4]
[ 237.054213] [c00000377969bc90] [c000000000a51678] __inet_stream_connect+0x158/0x400
[ 237.065276] [c00000377969bd20] [c000000000a51978] inet_stream_connect+0x58/0x90
[ 237.074757] [c00000377969bd60] [c00000000097eeac] SyS_connect+0x10c/0x130
[ 237.092889] [c00000377969be30] [c00000000000bd84] system_call+0x38/0xe0
[ 237.107020] Instruction dump:
[ 237.107085] 7ca82a14 f9210020 f9210028 ebdc0b98 f9210030 f9210038 f9210040 f9210048
[ 237.117663] 419e01c4 e86a00ae 2fa30000 419e01b8 <893e02f4> e95e0060 7c9f2378 897e0149
[ 237.133181] ---[ end trace ef25e246c86e0bcc ]---
[ 237.133270]
[ 237.146244] Sending IPI to other CPUs
[ 237.147330] IPI complete

== Comment: #8 - Kevin W. Rudd - 2017-01-03 15:16:06 ==
The panic happened because the control socks had been cleared:

  dccp = {
    v4_ctl_sk = 0x0,
    v6_ctl_sk = 0x0
  },

dccp_v4_ctl_send_reset() ended up calling dccp_v4_route_skb() with a NULL ctl_sk. Close race of some sort?
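
The NULL dereference can be cross-checked against the register dump in the oops. The following is a minimal sketch, assuming the bracketed word in the instruction dump (`<893e02f4>`) is the faulting instruction and that it is a D-form load (primary opcode 34 is `lbz` in the Power ISA). The helper `decode_dform` is illustrative and is not a kernel tool.

```python
def decode_dform(word):
    """Split a 32-bit PowerPC D-form instruction into its fields."""
    opcode = (word >> 26) & 0x3F   # primary opcode (bits 0-5)
    rt = (word >> 21) & 0x1F       # target register
    ra = (word >> 16) & 0x1F       # base register
    d = word & 0xFFFF              # 16-bit displacement
    if d & 0x8000:                 # sign-extend the displacement
        d -= 0x10000
    return opcode, rt, ra, d


# Values copied from the oops in the bug description.
FAULTING_WORD = 0x893E02F4   # the <...> entry in the instruction dump
GPR30 = 0x0                  # third value on the GPR28 line
DAR = 0x2F4                  # faulting data address

opcode, rt, ra, d = decode_dform(FAULTING_WORD)
# For a D-form load with RA != 0, the effective address is (RA) + D.
effective_address = GPR30 + d
print(f"opcode={opcode} rt=r{rt} ra=r{ra} d={d:#x} ea={effective_address:#x}")
```

The decoded base register is r30, which is zero in the dump, and base plus displacement equals the reported DAR of 0x2f4. That is consistent with a field being read through a NULL pointer, as described above.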

bugproxy (bugproxy) wrote : console log

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-150129 severity-high targetmilestone-inin---
bugproxy (bugproxy) wrote : dmesg

Default Comment by Bridge

bugproxy (bugproxy) wrote : sosreport from ltc-haba2

Default Comment by Bridge

bugproxy (bugproxy) wrote : rxskb

Default Comment by Bridge

bugproxy (bugproxy) wrote : net

Default Comment by Bridge

bugproxy (bugproxy) wrote : dev

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged
Tim Gardner (timg-tpi) wrote :

In dmesg I see these suspicious errors:

[ 222.904306] Injecting memory failure for page 0x3a44ee at 0x3fff85660000
[ 222.904621] Memory failure: 0x3a44ee: recovery action for dirty LRU page: Recovered

Are they normal? The network crash could simply be a second-order effect.

Colin Ian King (colin-king) wrote :

I believe the MADV_HWPOISON flag on the madvise stressor is triggering the "Injecting memory failure for page" errors. These are just test messages, see madvise(2):

"This feature is intended for testing of memory error-handling code; it is available only if the kernel was configured with CONFIG_MEMORY_FAILURE".
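
When triaging a dmesg like this one, it can help to set the injected memory-failure messages aside so they do not distract from the crash. The following is a minimal sketch that treats a "Memory failure" line as injected only when its page frame number matches an earlier "Injecting memory failure" line. The function name `split_dmesg` and the regular expressions are assumptions based on the two lines quoted above, not on a documented log format.

```python
import re

# Patterns derived from the two dmesg lines quoted in this bug.
INJECT_RE = re.compile(r"Injecting memory failure for page (0x[0-9a-f]+)")
FAILURE_RE = re.compile(r"Memory failure: (0x[0-9a-f]+):")


def split_dmesg(lines):
    """Return (injected, other): lines tied to an injected page vs the rest."""
    injected_pfns = set()
    for line in lines:
        match = INJECT_RE.search(line)
        if match:
            injected_pfns.add(match.group(1))

    injected, other = [], []
    for line in lines:
        match = INJECT_RE.search(line) or FAILURE_RE.search(line)
        if match and match.group(1) in injected_pfns:
            injected.append(line)
        else:
            other.append(line)
    return injected, other


sample = [
    "[ 222.904306] Injecting memory failure for page 0x3a44ee at 0x3fff85660000",
    "[ 222.904621] Memory failure: 0x3a44ee: recovery action for dirty LRU page: Recovered",
    "[ 237.051486] Call Trace:",
]
injected, other = split_dmesg(sample)
print(len(injected), "injected,", len(other), "other")
```

A "Memory failure" line whose page was never announced by an injection message stays in the `other` list, since it could indicate a real hardware error.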

Tim Gardner (timg-tpi) wrote :

Also, Colin King pointed out, "the dccp crash on that bug occurs because I added a dccp stressor in before christmas; it's a little used part of the networking stack, so I guess I hit something that's new". It is possible this really is an upstream network stack bug as Kevin Rudd pointed out.

bugproxy (bugproxy)
tags: added: targetmilestone-inin1704
removed: targetmilestone-inin---
Kevin W. Rudd (kevinr) wrote :

Deleted accidental attachment dups.

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-02-08 15:48 EDT-------
Hello Colin and Tim,

Any further progress? Please let us know if you need anything further from IBM. Thanks!

bugproxy (bugproxy) wrote : commit 449809a66c1d0b1563dee84493e14bf3104d2d7e

------- Comment on attachment From <email address hidden> 2017-03-20 11:13 EDT-------

Hello Canonical.

Applying dccp commit 449809a66c1d0b1563dee84493e14bf3104d2d7e on top of the 4.10.0-13 kernel source appears to have resolved this issue.

Please consider including this commit: tcp/dccp: block BH for SYN processing

Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Zesty):
assignee: Canonical Kernel Team (canonical-kernel-team) → Tim Gardner (timg-tpi)
status: Triaged → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.10.0-15.17

---------------
linux (4.10.0-15.17) zesty; urgency=low

  [ Tim Gardner ]

  * Release Tracking Bug
    - LP: #1675868

  * In ZZ-BML (POWER9):ubuntu17.04 installation Fails (LP: #1675771)
    - powerpc/64s: fix handling of non-synchronous machine checks
    - powerpc/64s: allow machine check handler to set severity and initiator
    - powerpc/64s: POWER9 machine check handler

  * [Feature] R3 mwait support for Knights Mill (LP: #1637550)
    - x86/cpufeature: Enable RING3MWAIT for Knights Landing
    - x86/cpufeature: Enable RING3MWAIT for Knights Mill
    - x86/msr: Add MSR_MISC_FEATURE_ENABLES and RING3MWAIT bit
    - x86/elf: Add HWCAP2 to expose ring 3 MONITOR/MWAIT
    - x86/cpufeature: Add RING3MWAIT to CPU features

  * [Feature] GLK:New device IDs (LP: #1645951)
    - mfd: intel-lpss: Add Intel Gemini Lake PCI IDs
    - pwm: lpss: Add Intel Gemini Lake PCI ID
    - i2c: i801: Add support for Intel Gemini Lake
    - spi: pxa2xx: Add support for Intel Gemini Lake
    - [Config] CONFIG_PINCTRL_GEMINILAKE=m
    - pinctrl: intel: Add Intel Gemini Lake pin controller support

  * Zesty update to v4.10.5 stable release (LP: #1675032)
    - net/mlx5e: Register/unregister vport representors on interface attach/detach
    - net/mlx5e: Do not reduce LRO WQE size when not using build_skb
    - net/mlx5e: Fix broken CQE compression initialization
    - net/mlx5e: Update MPWQE stride size when modifying CQE compress state
    - net/mlx5e: Fix wrong CQE decompression
    - vxlan: correctly validate VXLAN ID against VXLAN_N_VID
    - vti6: return GRE_KEY for vti6
    - vxlan: don't allow overwrite of config src addr
    - ipv4: add missing initialization for flowi4_uid
    - ipv4: mask tos for input route
    - sctp: set sin_port for addr param when checking duplicate address
    - net sched actions: decrement module reference count after table flush.
    - l2tp: avoid use-after-free caused by l2tp_ip_backlog_recv
    - vxlan: lock RCU on TX path
    - geneve: lock RCU on TX path
    - mlxsw: spectrum_router: Avoid potential packets loss
    - net: bridge: allow IPv6 when multicast flood is disabled
    - net: don't call strlen() on the user buffer in packet_bind_spkt()
    - net: net_enable_timestamp() can be called from irq contexts
    - ipv6: orphan skbs in reassembly unit
    - dccp: Unlock sock before calling sk_free()
    - amd-xgbe: Stop the PHY before releasing interrupts
    - amd-xgbe: Be sure to set MDIO modes on device (re)start
    - amd-xgbe: Don't overwrite SFP PHY mod_absent settings
    - bonding: use ETH_MAX_MTU as max mtu
    - strparser: destroy workqueue on module exit
    - tcp: fix various issues for sockets morphing to listen state
    - net: fix socket refcounting in skb_complete_wifi_ack()
    - net: fix socket refcounting in skb_complete_tx_timestamp()
    - net/sched: act_skbmod: remove unneeded rcu_read_unlock in tcf_skbmod_dump
    - dccp: fix use-after-free in dccp_feat_activate_values
    - team: use ETH_MAX_MTU as max mtu
    - vrf: Fix use-after-free in vrf_xmit
    - net/tunnel: set inner protocol in network gro hooks
    - uapi: fix linux/packet_diag.h use...

Changed in linux (Ubuntu Zesty):
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote : Sosreport

------- Comment (attachment only) From <email address hidden> 2017-04-11 02:01 EDT-------

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-04-19 11:08 EDT-------
FYI Canonical:

It appears that the issue was not resolved by the most recent patch, as this panic was recently seen with the 4.10.0-19-generic kernel. Unfortunately, no new vmcore is available because of the issue currently being worked on in LP Bug 1680349.
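
The kernel named here is newer than the 4.10.0-15.17 package that Launchpad reported as carrying the fix, which is why the recurrence is notable. The following is a minimal sketch of that comparison, assuming the usual Ubuntu layout in which the number after the first hyphen is the ABI number. The helper `kernel_abi_tuple` is hypothetical.

```python
import re

# Matches the leading "<major>.<minor>.<patch>-<abi>" part of strings such
# as "4.10.0-19-generic" (uname -r) or "4.10.0-15.17" (package version).
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)")


def kernel_abi_tuple(version):
    """Return (major, minor, patch, abi) as integers for comparison."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognised kernel version: {version!r}")
    return tuple(int(part) for part in match.groups())


fixed_in = kernel_abi_tuple("4.10.0-15.17")        # package from the changelog
seen_on = kernel_abi_tuple("4.10.0-19-generic")    # kernel in this comment
print("crash kernel is at or after the fixed package:", seen_on >= fixed_in)
```

Comparing integer tuples avoids the pitfalls of comparing version strings lexically, where "4.10.0-9" would sort after "4.10.0-15".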

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-10-13 13:03 EDT-------
*** Bug 157351 has been marked as a duplicate of this bug. ***
