NVIDIA driver unbinding more unforgiving (595/Resolute) compared to (590/Questing) #1119

@eadwu

Description

NVIDIA Open GPU Kernel Modules Version

595.58.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 26.04 LTS

Kernel Release

7.0.0-14-generic

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

3080 Ti

Describe the bug

Unable to unbind the NVIDIA driver from an unused GPU.
The GPU is confirmed unused: both of the following report no open file handles.

  • lsof /dev/nvidia*
  • fuser -av /dev/nvidia*

To Reproduce

# $gpu and $aud must already hold the PCI addresses of the GPU and its companion audio device.
gpu_vd="$(cat /sys/bus/pci/devices/$gpu/vendor) $(cat /sys/bus/pci/devices/$gpu/device)"
aud_vd="$(cat /sys/bus/pci/devices/$aud/vendor) $(cat /sys/bus/pci/devices/$aud/device)"
# Unbind both devices from their current drivers.
echo "$gpu" | sudo tee "/sys/bus/pci/devices/$gpu/driver/unbind"
echo "$aud" | sudo tee "/sys/bus/pci/devices/$aud/driver/unbind"
# Register both vendor/device IDs with vfio-pci.
echo "$gpu_vd" | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
echo "$aud_vd" | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
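
For anyone trying to reproduce this, here is a minimal sketch of two read-only helpers around the same sysfs paths. The function names and the `SYSFS_ROOT` variable are hypothetical (they are not part of the original report); `SYSFS_ROOT` exists only so the helpers can be pointed at a fake tree for a dry run.

```shell
#!/bin/sh
# Sketch only: SYSFS_ROOT and both function names are hypothetical helpers.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print "<vendor> <device>" for a PCI address, in the form sysfs reports it.
pci_vendor_device() {
    printf '%s %s\n' \
        "$(cat "$SYSFS_ROOT/bus/pci/devices/$1/vendor")" \
        "$(cat "$SYSFS_ROOT/bus/pci/devices/$1/device")"
}

# Print the name of the driver currently bound to a PCI address, or "none".
pci_current_driver() {
    if [ -L "$SYSFS_ROOT/bus/pci/devices/$1/driver" ]; then
        basename "$(readlink "$SYSFS_ROOT/bus/pci/devices/$1/driver")"
    else
        echo none
    fi
}
```

Running `pci_current_driver "$gpu"` before and after the unbind step would show whether the device ever left the nvidia driver, without writing anything to sysfs.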

This happens after opening some apps. Note that lsof and fuser both return no open file handles, whether from nvidia-persistenced, nvtop, or btop. The process then hangs at unbinding and never exits.

nvidia_drm is not loaded and none of the NVIDIA GPUs are driving any displays (KDE instead of GNOME on Ubuntu).
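
To see which writers are stuck while the hang is in progress, one option is to list tasks in uninterruptible sleep. The sketch below is an assumption about how one might collect that, not something taken from the report; `blocked_tasks` is a hypothetical helper name. In the trace in this report, the `new_id` writer is waiting in `mutex_lock`, which sleeps uninterruptibly, so at least that process should show up in state D. Whether the unbind writer (in `os_delay` under `nv_pci_remove_helper`) also shows as D is not something the trace establishes.

```shell
#!/bin/sh
# Hypothetical helper: keep only tasks whose state starts with D
# (uninterruptible sleep). Expects `ps -eo pid,stat,wchan:32,cmd`
# style output on stdin and skips the header line.
blocked_tasks() {
    awk 'NR > 1 && $2 ~ /^D/'
}

# Usage on a live system (left commented out so the block has no side effects):
#   ps -eo pid,stat,wchan:32,cmd | blocked_tasks
#   sudo cat /proc/<pid>/stack    # kernel stack of one stuck task
```

The `wchan` column gives a quick hint of where each task is sleeping; `/proc/<pid>/stack` gives the full kernel stack for a single PID.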

Bug Incidence

Sometimes

nvidia-bug-report.log.gz

 kernel:  ? irqentry_exit+0x97/0x5a0
 kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
 kernel:  ? arch_exit_to_user_mode_prepare.isra.0+0xd/0x100
 kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
 kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
 kernel:  ? handle_mm_fault+0x1c0/0x2e0
 kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
 kernel:  ? count_memcg_events+0x103/0x250
 kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
 kernel:  ? __handle_mm_fault+0x493/0x720
 kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
 kernel:  ? do_syscall_64+0x150/0x5a0
 kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
 kernel:  ? arch_exit_to_user_mode_prepare.isra.0+0xd/0xe0
 kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
 kernel:  ? __audit_syscall_exit+0x36/0x120
 kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
 kernel:  ? ksys_read+0xc6/0xf0
 kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
 kernel:  ? lruvec_stat_mod_folio+0x8d/0x100
 kernel:  ? vfs_read+0x364/0x3a0
 kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
 kernel:  ? rw_verify_area+0x57/0x180
 kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
 kernel:  ? do_syscall_64+0x150/0x5a0
 kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
 kernel:  ? arch_exit_to_user_mode_prepare.isra.0+0xd/0xe0
 kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
 kernel:  ? __audit_syscall_exit+0x36/0x120
 kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
 kernel:  ? ksys_write+0x71/0xf0
 kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
 kernel:  ? vfs_write+0x25b/0x490
 kernel:  do_syscall_64+0x115/0x5a0
 kernel:  x64_sys_call+0x22f/0x2390
 kernel:  __x64_sys_write+0x19/0x30
 kernel:  ksys_write+0x71/0xf0
 kernel:  vfs_write+0x25b/0x490
 kernel:  kernfs_fop_write_iter+0x161/0x210
 kernel:  sysfs_kf_write+0x74/0x90
 kernel:  drv_attr_store+0x24/0x50
 kernel:  new_id_store+0xf4/0x1f0
 kernel:  pci_add_dynid+0xe6/0x110
 kernel:  driver_attach+0x1e/0x30
 kernel:  bus_for_each_dev+0x8a/0xe0
 kernel:  __driver_attach+0xe4/0x250
 kernel:  mutex_lock+0x3b/0x50
 kernel:  __mutex_lock_slowpath+0x13/0x20
 kernel:  ? __pfx___driver_attach+0x10/0x10
 kernel:  ? simple_strntoull+0x8c/0xa0
 kernel:  ? select_task_rq+0x91/0x100
 kernel:  __mutex_lock.constprop.0+0x550/0xaf0
 kernel:  schedule_preempt_disabled+0x15/0x30
 kernel:  schedule+0x27/0x90
 kernel:  __schedule+0x2b2/0x630
 kernel:  <TASK>
kernel:  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
kernel:  ? exc_page_fault+0x94/0x1e0
kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
kernel:  ? do_syscall_64+0x115/0x5a0
kernel:  ? x64_sys_call+0x22f/0x2390
kernel:  ? __x64_sys_write+0x19/0x30
kernel:  ? ksys_write+0x71/0xf0
kernel:  ? vfs_write+0x25b/0x490
kernel:  ? kernfs_fop_write_iter+0x161/0x210
kernel:  ? sysfs_kf_write+0x74/0x90
kernel:  ? drv_attr_store+0x24/0x50
kernel:  ? unbind_store+0xaf/0xc0
kernel:  ? device_driver_detach+0x14/0x20
kernel:  ? bus_find_device+0xb0/0xf0
kernel:  ? device_release_driver_internal+0x1fb/0x260
kernel:  ? device_remove+0x43/0x80
kernel:  ? pci_device_remove+0x4b/0xc0
kernel:  ? nv_pci_remove+0x52/0x80 [nvidia]
kernel:  ? nv_pci_remove_helper+0x3e9/0x500 [nvidia]
kernel:  ? os_delay+0xfb/0x250 [nvidia]
kernel:  ? __pfx_process_timeout+0x10/0x10
kernel:  ? schedule_timeout+0x88/0x110
kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
kernel:  ? schedule+0x27/0x90
kernel:  ? timer_delete_sync+0x5c/0xb0
kernel:  __schedule+0x175/0x630
kernel:  ? raw_spin_rq_lock_nested+0x21/0xa0

More Info

No response

Labels: bug
