AMD phoenix/phoenix2 platforms facing amdgpu(PHX) hangs during stress loading

Bug #2051636 reported by You-Sheng Yang
This bug affects 1 person
Affects                  Status        Importance  Assigned to     Milestone
HWE Next                 New           Undecided   Unassigned
linux-firmware (Ubuntu)  Fix Released  Undecided   You-Sheng Yang
  Jammy                  Fix Released  High        You-Sheng Yang
  Mantic                 Fix Released  High        You-Sheng Yang
  Noble                  Fix Released  Undecided   You-Sheng Yang

Bug Description

[SRU Justification]

[Impact]

Under a stress tool such as 3DMark or GravityMark, amdgpu (PHX) hangs within a few minutes, sometimes sooner.

[Fix]

Upstream firmware fixes for Phoenix (GC 11.0.1)/Phoenix 2 (GC 11.0.4), and other prerequisites:
* amdgpu/gc_11_0_1_* up to commit 56c0e7e ("amdgpu: update GC 11.0.1 firmware")
* amdgpu/psp_13_0_4_ta.bin up to commit ed7ddfb ("amdgpu: update PSP 13.0.4 firmware")
* amdgpu/vcn_4_0_2.bin up to commit 34ccb75 ("amdgpu: update VCN 4.0.2 firmware")
* amdgpu/gc_11_0_4_* up to commit 680d98c ("amdgpu: update GC 11.0.4 firmware")
* amdgpu/psp_13_0_11_ta.bin up to commit 72227fe ("amdgpu: update PSP 13.0.11 firmware")
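
To see which of the firmware files named above are present on a machine, a minimal sketch along these lines could be used. The directory /lib/firmware/amdgpu and the possibility of compressed variants (for example a .zst suffix) are assumptions about the local layout, and the helper name find_phx_firmware is hypothetical. It only lists files; it cannot tell which upstream commit they correspond to.

```python
from pathlib import Path

# File name patterns taken from the fix list above.
PHX_PATTERNS = [
    "gc_11_0_1_*",
    "psp_13_0_4_ta.bin",
    "vcn_4_0_2.bin",
    "gc_11_0_4_*",
    "psp_13_0_11_ta.bin",
]


def find_phx_firmware(fw_dir="/lib/firmware/amdgpu"):
    """Map each pattern to the sorted file names found in fw_dir."""
    base = Path(fw_dir)
    found = {}
    for pattern in PHX_PATTERNS:
        # A trailing "*" also matches compressed variants such as
        # "name.bin.zst" (assumed, depends on how the package ships files).
        glob_pattern = pattern if pattern.endswith("*") else pattern + "*"
        found[pattern] = sorted(p.name for p in base.glob(glob_pattern))
    return found


if __name__ == "__main__":
    for pattern, names in find_phx_firmware().items():
        print(pattern, "->", ", ".join(names) or "missing")
```

An empty list for a pattern means no matching file was found in the assumed directory, which is worth checking before and after installing the package from -proposed.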

[Test Case]

Run a stress tool such as 3DMark or GravityMark and confirm that amdgpu no longer hangs.

[Where problems could occur]

This is a binary firmware update recommended by the chip vendor. No known issues so far.

[Other Info]

Phoenix is supported in linux-oem-6.5/jammy, so linux-firmware/jammy is also nominated for the fix.

========== original bug report ==========

Under a stress tool such as 3DMark or GravityMark, amdgpu (PHX) hangs within a few minutes, sometimes sooner. Mantic with a v6.7 kernel also hits the hang, so the firmware needs to be updated to fix this issue.

PHX series
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=680d98c62b13bd441949280c77ca31efb021b68a
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=72227fe463af85648523300543287a68e6c6de5f
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=56c0e7e688427270729fce6e85ecd98f1fe2a6e1
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=ed7ddfb5d136c3b9b1eeb48f7568550c0e5d99da
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=34ccb7502e075607682f0f0984a83022bfa0da85

[ 415.782623] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=27035, emitted seq=27037
[ 415.782833] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 1361 thread gnome-shel:cs0 pid 1421
[ 415.783004] amdgpu 0000:0d:00.0: amdgpu: GPU reset begin!
[ 415.944129] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 415.944317] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 416.074161] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 416.074327] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 416.204184] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 416.204356] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 416.334204] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 416.334377] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 416.464226] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 416.464398] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 416.594247] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 416.594418] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 416.724265] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 416.724432] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 416.854275] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 416.854437] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 416.984284] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 416.984456] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 416.996743] amdgpu 0000:0d:00.0: amdgpu: MODE2 reset
[ 417.026498] amdgpu 0000:0d:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 417.026909] [drm] PCIE GART of 512M enabled (table at 0x000000801FD00000).
[ 417.027149] amdgpu 0000:0d:00.0: amdgpu: SMU is resuming...
[ 417.029520] amdgpu 0000:0d:00.0: amdgpu: SMU is resumed successfully!
[ 417.032154] [drm] DMUB hardware initialized: version=0x08003000
[ 417.190837] [drm] kiq ring mec 3 pipe 1 q 0
[ 417.192870] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 417.193037] amdgpu 0000:0d:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[ 417.193447] amdgpu 0000:0d:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 417.193449] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 417.193451] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 417.193452] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 417.193453] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 417.193454] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 417.193455] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 417.193456] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 417.193458] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 417.193459] amdgpu 0000:0d:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 417.193460] amdgpu 0000:0d:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ 417.193461] amdgpu 0000:0d:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[ 417.193462] amdgpu 0000:0d:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[ 417.195893] amdgpu 0000:0d:00.0: amdgpu: recover vram bo from shadow start
[ 417.195894] amdgpu 0000:0d:00.0: amdgpu: recover vram bo from shadow done
[ 417.195904] amdgpu 0000:0d:00.0: amdgpu: GPU reset(2) succeeded!
[ 417.197048] [drm] Skip scheduling IBs!
[ 417.197057] [drm] Skip scheduling IBs!
[ 417.197063] [drm] Skip scheduling IBs!
[ 443.578688] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
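
The hang signature in the log above can be checked for mechanically after a stress run. The following is a minimal sketch and not part of the original report: the regular expressions are derived only from the lines quoted here, other kernel versions may word these messages differently, and the function name summarize_amdgpu_hang is hypothetical.

```python
import re

# Patterns derived from the log excerpt above; other kernels may differ.
TIMEOUT_RE = re.compile(
    r"amdgpu_job_timedout.*ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
RESET_BEGIN_RE = re.compile(r"amdgpu: GPU reset begin!")
RESET_DONE_RE = re.compile(r"amdgpu: GPU reset\(\d+\) succeeded!")


def summarize_amdgpu_hang(log_text):
    """Collect ring timeouts and count GPU resets in kernel log text."""
    summary = {"timeouts": [], "resets_begun": 0, "resets_succeeded": 0}
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            # Record which ring timed out and the two fence sequence numbers.
            summary["timeouts"].append(
                (match["ring"], int(match["signaled"]), int(match["emitted"]))
            )
        elif RESET_BEGIN_RE.search(line):
            summary["resets_begun"] += 1
        elif RESET_DONE_RE.search(line):
            summary["resets_succeeded"] += 1
    return summary
```

Passing it the saved output of dmesg from a stress run gives a quick pass/fail signal: an empty "timeouts" list and zero resets suggest the run did not hit this particular hang.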

You-Sheng Yang (vicamo)
summary: - AMD phenix/phenix2 platforms facing amdgpu(PHX) hangs during stress
+ AMD phoenix/phoenix2 platforms facing amdgpu(PHX) hangs during stress
loading
You-Sheng Yang (vicamo)
Changed in linux-firmware (Ubuntu Jammy):
status: New → In Progress
Changed in linux-firmware (Ubuntu Mantic):
status: New → In Progress
Changed in linux-firmware (Ubuntu Noble):
status: New → Incomplete
status: Incomplete → Triaged
Changed in linux-firmware (Ubuntu Jammy):
importance: Undecided → High
Changed in linux-firmware (Ubuntu Mantic):
importance: Undecided → High
Changed in linux-firmware (Ubuntu Jammy):
assignee: nobody → You-Sheng Yang (vicamo)
Changed in linux-firmware (Ubuntu Mantic):
assignee: nobody → You-Sheng Yang (vicamo)
Changed in linux-firmware (Ubuntu Noble):
assignee: nobody → You-Sheng Yang (vicamo)
You-Sheng Yang (vicamo) wrote :

All the fixes are in the upstream repository, so there should be no work to do for Noble once it migrates to upstream HEAD.

You-Sheng Yang (vicamo)
description: updated
tags: added: amd oem-priority originate-from-2051539
Juerg Haefliger (juergh)
tags: added: kern-9038
Timo Aaltonen (tjaalton)
Changed in linux-firmware (Ubuntu Noble):
status: Triaged → Fix Released
Timo Aaltonen (tjaalton) wrote : Please test proposed package

Hello You-Sheng, or anyone else affected,

Accepted linux-firmware into mantic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/linux-firmware/20230919.git3672ccab-0ubuntu2.8 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-mantic to verification-done-mantic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-mantic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in linux-firmware (Ubuntu Mantic):
status: In Progress → Fix Committed
Changed in linux-firmware (Ubuntu Jammy):
status: In Progress → Fix Committed
Timo Aaltonen (tjaalton) wrote :

Hello You-Sheng, or anyone else affected,

Accepted linux-firmware into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/linux-firmware/20220329.git681281e4-0ubuntu3.28 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Mario Limonciello (superm1) wrote :

An internal team at AMD has tested this across a number of different OEM PHX systems using the OEM 6.5-1014 kernel. It is testing well.

tags: added: verification-done-jammy
Timo Aaltonen (tjaalton) wrote :

Hello You-Sheng, or anyone else affected,

Accepted linux-firmware into mantic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/linux-firmware/20230919.git3672ccab-0ubuntu2.9 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-mantic to verification-done-mantic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-mantic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Timo Aaltonen (tjaalton) wrote :

Hello You-Sheng, or anyone else affected,

Accepted linux-firmware into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/linux-firmware/20220329.git681281e4-0ubuntu3.29 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Timo Aaltonen (tjaalton)
tags: added: verification-needed-mantic
You-Sheng Yang (vicamo) wrote :

verified:
* linux-firmware/jammy-proposed version 20220329.git681281e4-0ubuntu3.29
* linux-firmware/mantic-proposed version 20230919.git3672ccab-0ubuntu2.9

tags: added: verification-done-mantic
removed: verification-needed-mantic
Timo Aaltonen (tjaalton) wrote : Update Released

The verification of the Stable Release Update for linux-firmware has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-firmware - 20220329.git681281e4-0ubuntu3.29

---------------
linux-firmware (20220329.git681281e4-0ubuntu3.29) jammy; urgency=medium

  * Update firmware for MT7921 in order to fix Framework 13 AMD 7040 (LP: #2049220)
    - linux-firmware: update firmware for mediatek bluetooth chip (MT7922)
    - linux-firmware: update firmware for mediatek bluetooth chip (MT7922)
    - linux-firmware: update firmware for MT7922 WiFi device
    - linux-firmware: update firmware for MT7922 WiFi device
    - linux-firmware: update firmware for mediatek bluetooth chip (MT7922)
    - linux-firmware: update firmware for mediatek bluetooth chip (MT7922)
    - linux-firmware: update firmware for MT7922 WiFi device
    - linux-firmware: update firmware for MT7922 WiFi device
    - linux-firmware: update firmware for mediatek bluetooth chip (MT7922)
    - linux-firmware: update firmware for MT7922 WiFi device
    - linux-firmware: update firmware for mediatek bluetooth chip (MT7922)

linux-firmware (20220329.git681281e4-0ubuntu3.28) jammy; urgency=medium

  * Missing firmware for AMD GPU GC 11.0.3 (LP: #2034103)
    - amdgpu: update VCN 4.0.0 firmware for amd.5.5 release
    - amdgpu: update VCN 4.0.0 firmware
  * DP connection swap to break eDP behavior on AMD 7735U (LP: #2049758)
    - SAUCE: Update DCN312 DMCUB firmware

linux-firmware (20220329.git681281e4-0ubuntu3.27) jammy; urgency=medium

  * AMD phoenix/phoenix2 platforms facing amdgpu(PHX) hangs during stress loading (LP: #2051636)
    - amdgpu: update PSP 13.0.4 firmware for amd.5.5 release
    - amdgpu: update PSP 13.0.11 firmware for amd.5.5 release
    - amdgpu: update PSP 13.0.4 firmware from 5.7 branch
    - amdgpu: update GC 11.0.1 firmware from 5.7 branch
    - amdgpu: update GC 11.0.4 firmware from 5.7 branch
    - amdgpu: update PSP 13.0.11 firmware from 5.7 branch
    - amdgpu: update GC 11.0.1 firmware
    - amdgpu: update PSP 13.0.4 firmware
    - amdgpu: update VCN 4.0.2 firmware
    - amdgpu: update GC 11.0.4 firmware
    - amdgpu: update PSP 13.0.11 firmware
  * Update firmware for MT7921 in order to fix Framework 13 AMD 7040 (LP: #2049220)
    - linux-firmware: update firmware for mediatek bluetooth chip (MT7921)
    - linux-firmware: update firmware for MT7921 WiFi device
    - linux-firmware: update firmware for mediatek bluetooth chip (MT7921)
    - linux-firmware: update firmware for MT7921 WiFi device
    - linux-firmware: update firmware for MT7921 WiFi device
    - linux-firmware: update firmware for mediatek bluetooth chip (MT7921)
    - linux-firmware: update firmware for MT7921 WiFi device
    - linux-firmware: update firmware for mediatek bluetooth chip (MT7921)
    - linux-firmware: update firmware for mediatek bluetooth chip (MT7921)
    - linux-firmware: update firmware for MT7921 WiFi device
    - linux-firmware: update firmware for mediatek bluetooth chip (MT7921)
    - linux-firmware: update firmware for MT7921 WiFi device
    - linux-firmware: update firmware for MT7921 WiFi device
    - linux-firmware: update firmware for mediatek bluetooth chip (MT7921)
    - linux-firmware: update firmware for mediatek bluetooth chip (MT7921)
    -...


Changed in linux-firmware (Ubuntu Jammy):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-firmware - 20230919.git3672ccab-0ubuntu2.9

---------------
linux-firmware (20230919.git3672ccab-0ubuntu2.9) mantic; urgency=medium

  * Update firmware for MT7921 in order to fix Framework 13 AMD 7040 (LP: #2049220)
    - linux-firmware: update firmware for MT7922 WiFi device
    - linux-firmware: update firmware for mediatek bluetooth chip (MT7922)

linux-firmware (20230919.git3672ccab-0ubuntu2.8) mantic; urgency=medium

  * DP connection swap to break eDP behavior on AMD 7735U (LP: #2049758)
    - SAUCE: Update DCN312 DMCUB firmware

linux-firmware (20230919.git3672ccab-0ubuntu2.7) mantic; urgency=medium

  * Miscellaneous Ubuntu changes
    - [Packaging] scripts: Fix shellcheck warnings
    - [Workflow] Add initial gitea workflow file
    - SAUCE: [Workflow] Disable markdownlint pre-commit hook
    - SAUCE: [Workflow] check_whence.py: Update list of known files
    - [Packaging] scripts/generate-changelog: Fix array initialization
    - [Packaging] control: Add XSBC-Original-Maintainer field
    - [Packaging] scripts/install-firmware: Fix installation of license files
  * AMD phoenix/phoenix2 platforms facing amdgpu(PHX) hangs during stress loading (LP: #2051636)
    - amdgpu: update PSP 13.0.4 firmware from 5.7 branch
    - amdgpu: update GC 11.0.1 firmware from 5.7 branch
    - amdgpu: update GC 11.0.4 firmware from 5.7 branch
    - amdgpu: update PSP 13.0.11 firmware from 5.7 branch
    - amdgpu: update GC 11.0.1 firmware
    - amdgpu: update PSP 13.0.4 firmware
    - amdgpu: update VCN 4.0.2 firmware
    - amdgpu: update GC 11.0.4 firmware
    - amdgpu: update PSP 13.0.11 firmware
  * Update firmware for MT7921 in order to fix Framework 13 AMD 7040 (LP: #2049220)
    - linux-firmware: update firmware for MT7921 WiFi device
    - linux-firmware: update firmware for mediatek bluetooth chip (MT7921)
  * WCN6856 Wi-FI Unavailable and no function during suspend stress (LP: #2048977)
    - ath11k: WCN6855 hw2.0: update to WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.37

linux-firmware (20230919.git3672ccab-0ubuntu2.6) mantic; urgency=medium

  * occasional wifi firmware loading failures: wiwlwifi: BE200: Failed to start RT ucode: -110 (LP: #2048853)
    - iwlwifi: add new FWs from core83-55 release
    - iwlwifi: fix for the new FWs from core83-55 release
    - iwlwifi: update gl FW for core80-165 release
  * WCN6856 Wi-FI Unavailable and no function during suspend stress (LP: #2048977)
    - ath11k: WCN6855 hw2.0: update board-2.bin
    - ath11k: WCN6855 hw2.0: update to WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.36

 -- Juerg Haefliger <email address hidden> Wed, 21 Feb 2024 10:41:18 +0100

Changed in linux-firmware (Ubuntu Mantic):
status: Fix Committed → Fix Released