Comment 7 for bug 1848326

Dmitrii Shcherbakov (dmitriis) wrote :

Based on a discussion with ~albertomilone, powering down the NVIDIA GPU while keeping the modules loaded is the long-term approach, as opposed to blacklisting the modules.

The power management feature is described here (requires Turing or newer GPUs):
http://us.download.nvidia.com/XFree86/Linux-x86_64/440.44/README/dynamicpowermanagement.html
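For reference, the feature described in that README is enabled through a kernel module option. A sketch of the modprobe fragment, per the README (the file path is an example; 0x02 means fine-grained power control, 0x00 disables it):

```
# /etc/modprobe.d/nvidia-pm.conf (example path)
# Enable fine-grained dynamic power management (Turing+ only, per the README)
options nvidia "NVreg_DynamicPowerManagement=0x02"
```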

My GPU is pre-Turing (a Pascal GTX 1060M); however, powering off is not where the problem is.

Running `prime-select intel` creates /lib/udev/rules.d/80-pm-nvidia.rules, which contains the following line to unbind the NVIDIA GPU device from its driver:

https://github.com/tseliot/nvidia-prime/blob/cf757cc9585dfc032930379fc81effb3a3d59606/prime-select#L164-L165
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030000", ATTR{remove}="1"
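The ATTR{remove}="1" assignment effectively writes 1 to the device's sysfs "remove" attribute, which detaches the device from its driver and deletes it from the PCI tree. The same operation can be reproduced by hand, which is useful for isolating the hang from udev. A sketch (0000:01:00.0 is the dGPU's address from my logs below; substitute your own):

```shell
#!/bin/sh
# Reproduce what ATTR{remove}="1" does: write 1 to the PCI device's sysfs
# "remove" attribute. Find your dGPU's address with: lspci -Dn -d 10de:
dev="/sys/bus/pci/devices/${1:-0000:01:00.0}"
if [ -w "$dev/remove" ]; then
    echo 1 > "$dev/remove"   # detaches the driver and removes the device
else
    echo "no writable remove attribute at $dev (not root, or no such device)"
fi
```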

If I comment it out, I can boot just fine with my iGPU after running `prime-select intel`. The resulting 80-pm-nvidia.rules file looks like this: https://paste.ubuntu.com/p/HX6t9y8BPg/

Just commenting out the power management lines while leaving the unbinding in-place results in the same issue (80-pm-nvidia.rules: https://paste.ubuntu.com/p/mTdXbZZk8H/).

The unbinding operation hangs, which results in something like the following even before X11 or gdm3 is started:

[ 15.683190] nvidia-uvm: Loaded the UVM driver, major device number 511.
[ 15.824882] NVRM: Attempting to remove minor device 0 with non-zero usage count!
[ 15.824903] ------------[ cut here ]------------
[ 15.825082] WARNING: CPU: 0 PID: 759 at /var/lib/dkms/nvidia/440.59/build/nvidia/nv-pci.c:577 nv_pci_remove+0x338/0x360 [nvidia]
# ...
[ 15.825330] ---[ end trace 353e142c2126a8a0 ]---
# ...
[ 242.649248] INFO: task nvidia-persiste:1876 blocked for more than 120 seconds.
[ 242.649931] Tainted: P W O 5.4.0-12-generic #15-Ubuntu
[ 242.650618] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 242.651319] nvidia-persiste D 0 1876 1 0x00000004

Eventually it fails with a timeout:
systemd[1]: nvidia-persistenced.service: start operation timed out. Terminating.
systemd[1]: nvidia-persistenced.service: Failed with result 'timeout'.
systemd[1]: Failed to start NVIDIA Persistence Daemon.

Masking nvidia-persistenced via `sudo systemctl mask nvidia-persistenced` and rebooting shows that systemd-udevd and rmmod hang as well:

Feb 9 17:18:43 blade systemd-udevd[717]: 0000:01:00.0: Worker [756] processing SEQNUM=4430 is taking a long time
Feb 9 17:18:43 blade systemd-udevd[717]: 0000:01:00.1: Worker [746] processing SEQNUM=4440 is taking a long time
Feb 9 17:20:43 blade systemd-udevd[717]: 0000:01:00.1: Worker [746] processing SEQNUM=4440 killed
Feb 9 17:20:43 blade systemd-udevd[717]: 0000:01:00.0: Worker [756] processing SEQNUM=4430 killed
Feb 9 17:21:31 blade kernel: [ 242.818665] INFO: task systemd-udevd:746 blocked for more than 120 seconds.
Feb 9 17:21:31 blade kernel: [ 242.819381] Tainted: P W O 5.4.0-12-generic #15-Ubuntu
Feb 9 17:21:31 blade kernel: [ 242.820075] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 9 17:21:31 blade kernel: [ 242.820797] systemd-udevd D 0 746 717 0x00000324
# ...
Feb 9 17:21:31 blade kernel: [ 242.823033] rmmod D 0 1939 1937 0x00004000
Feb 9 17:21:31 blade kernel: [ 242.823034] Call Trace:
# ...
Feb 9 17:21:31 blade kernel: [ 242.823783] nvkms_close_gpu+0x50/0x80 [nvidia_modeset]
Feb 9 17:21:31 blade kernel: [ 242.823793] _nv002598kms+0x14d/0x170 [nvidia_modeset]
# ...
Feb 9 17:21:31 blade kernel: [ 242.823893] ? nv_linux_drm_exit+0x9/0x768 [nvidia_drm]
Feb 9 17:21:31 blade kernel: [ 242.823897] ? __x64_sys_delete_module+0x147/0x290
# ...
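The blocked tasks above can also be inspected directly while they are stuck in D (uninterruptible sleep) state, without waiting for the 120-second hung-task report. A sketch, assuming PID 746 (the hung systemd-udevd worker from the log above; substitute the PID from your own dmesg):

```shell
#!/bin/sh
# Dump the kernel stack of a task stuck in D state to see where the
# unbind path is blocked. Reading /proc/<pid>/stack requires root.
pid="${1:-746}"
if [ -r "/proc/$pid/stack" ]; then
    cat "/proc/$pid/stack"
else
    echo "cannot read /proc/$pid/stack (needs root, or the task is gone)"
fi
```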