RTD3 fails on GA103M (RTX 3080 Ti Mobile): driver holds unconditional pm_runtime baseline ref, kernel 6.14 + 6.17, open + proprietary 580/590
Summary
On a ThinkPad X1 Extreme Gen 5 (12th-gen Alder Lake) with GA103M
(10de:2420, RTX 3080 Ti Mobile), the dGPU never enters runtime D3.
After all userspace clients are gone and power/control is set to
auto, runtime_status stays active and runtime_usage stays at
1 with zero userspace openers per fuser. The driver
simultaneously self-reports Runtime D3 status: Enabled (fine-grained)
and Video Memory: Active.
The runtime_usage=1 is a driver-internal pm_runtime reference. No
userspace action can clear it.
The bug reproduces identically across a full 2×2 driver matrix and two
kernel versions, so it appears to be a driver/firmware issue rather
than kernel-side.
Battery cost on this hardware is roughly 5–10 W of dGPU idle power.
Hardware
| Component | Details |
|---|---|
| Laptop | Lenovo ThinkPad X1 Extreme Gen 5 (21DECTO1WW) |
| BIOS | N3JET37W 1.21, dated 2023-11-07 |
| CPU | Intel i9-12900H (Alder Lake-P) |
| iGPU | 8086:46a6 Iris Xe Graphics @ 0000:00:02.0 |
| dGPU | 10de:2420 GA103M / RTX 3080 Ti Mobile @ 0000:01:00.0 (rev a1) |
| PCIe Root Port | 8086:460d @ 0000:00:01.0 |
| HDA function | 10de:2288 @ 0000:01:00.1 |
| Distro | Zorin OS 18.1 (Ubuntu 24.04 noble base) |
| Init | systemd 255 |
| Mode | envycontrol hybrid (PRIME render-offload via prime-run) |
| Session | GNOME on X11 (6.14) / Wayland (6.17); bug present in both |
Repro
```shell
# 1. Resolve dGPU DRM node (numbers shift across boots; resolve by PCI address)
for c in /dev/dri/card* /dev/dri/renderD*; do
    n=$(basename "$c")
    # Last PCI address in the resolved sysfs path is the device itself
    pci=$(readlink -f "/sys/class/drm/$n/device" 2>/dev/null | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9]' | tail -1)
    printf "%-25s %s\n" "$c" "$pci"
done
# dGPU is the cardN/renderDN whose path is 0000:01:00.0

# 2. Trigger RTD3
echo auto | sudo tee /sys/bus/pci/devices/0000:01:00.0/power/control
sleep 3

# 3. Observe state (replace cardN/renderDN with the nodes found in step 1)
cat /sys/bus/pci/devices/0000:01:00.0/power/{control,runtime_status,runtime_usage}
cat /proc/driver/nvidia/gpus/0000:01:00.0/power
sudo fuser -v /dev/nvidia* /dev/dri/cardN /dev/dri/renderDN
```
Expected
```
control=auto
runtime_status=suspended
runtime_usage=0
Video Memory: Off
```
Actual (every test)
```
control=auto
runtime_status=active
runtime_usage=1          <-- driver-internal ref
Video Memory: Active
Runtime D3 status: Enabled (fine-grained)
fuser: (no openers)
```
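The expected/actual comparison above can be scripted so that each cell of the test matrix is judged the same way. This is a minimal sketch, not part of the original capture: `check_rtd3` is a hypothetical helper name, and it only looks at the three sysfs attributes already used in the repro.

```shell
# Sketch: classify the runtime-PM state of one PCI device.
# check_rtd3 is a hypothetical helper; the attribute names come from the repro.
# Returns 0 when the device is runtime-suspended with no references held,
# 1 when it is not, 2 when an attribute cannot be read.
check_rtd3() {
    local dev_dir="${1:-/sys/bus/pci/devices/0000:01:00.0}"
    local control status usage
    control=$(cat "$dev_dir/power/control") || return 2
    status=$(cat "$dev_dir/power/runtime_status") || return 2
    usage=$(cat "$dev_dir/power/runtime_usage") || return 2
    if [ "$control" = "auto" ] && [ "$status" = "suspended" ] && [ "$usage" = "0" ]; then
        echo "PASS: control=$control status=$status usage=$usage"
        return 0
    fi
    echo "FAIL: control=$control status=$status usage=$usage"
    return 1
}
```

Calling `check_rtd3` with no argument checks the dGPU at 0000:01:00.0; passing another device directory checks that device instead.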
Test matrix
All combinations were tested with all the standard hybrid prerequisites
already in place: Mutter mutter-device-ignore udev tag on the dGPU,
Xwayland EGL/GLX defaults pointed at Mesa, nvidia-persistenced.service
masked, no Electron apps running, ollama stopped. fuser confirms
zero userspace openers in every failed test.
| Driver | Variant | Version | Kernel | DPM | Result |
|---|---|---|---|---|---|
| nvidia-driver-580 | open | 580.126.09 | 6.17.0-22-generic | 0x03 | FAIL: usage=1, VRAM Active, 0 openers |
| nvidia-driver-580 | proprietary | 580.126.09 | 6.17.0-22-generic | 0x03 | FAIL: usage=1, VRAM Active, 0 openers |
| nvidia-driver-580 | proprietary | 580.126.09 | 6.17.0-22-generic | 0x02 | FAIL: usage=1, VRAM Active, 0 openers |
| nvidia-driver-590 | proprietary | 590.48.01 | 6.17.0-22-generic | 0x03 | FAIL: usage=1, VRAM Active, 0 openers |
| nvidia-driver-590 | open | 590.48.01 | 6.17.0-22-generic | 0x03 | FAIL: usage=1, VRAM Active, 0 openers |
| nvidia-driver-590 | open | 590.48.01 | 6.14.0-37-generic | 0x03 | FAIL: usage=1, VRAM Active, 0 openers |
DPM = NVreg_DynamicPowerManagement. Note that on 0x02 the driver
still reports Runtime D3 status: Enabled (fine-grained) — the status
string appears unaffected by the parameter on this GPU.
The 6.14 kernel was Ubuntu's HWE backport of upstream 6.14.11, package
linux-image-6.14.0-37-generic from noble-updates, with
linux-modules-extra-6.14.0-37-generic installed (i915 lives there on
noble HWE).
Module options in effect
```
# /etc/modprobe.d/nvidia.conf (envycontrol-generated, lightly edited)
options nvidia NVreg_DynamicPowerManagement=0x03
options nvidia NVreg_UsePageAttributeTable=1
options nvidia NVreg_InitializeSystemMemoryAllocations=0
options nvidia_drm modeset=1

# /etc/modprobe.d/nvidia-graphics-drivers-kms.conf (distro)
options nvidia_drm modeset=1
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/var
```
/proc/driver/nvidia/params confirms all options are loaded
(DynamicPowerManagement: 3, PreserveVideoMemoryAllocations: 1,
InitializeSystemMemoryAllocations: 0, etc.).
/sys/bus/pci/devices/0000:01:00.0/power/autosuspend_delay_ms returns
EIO with the open module on both kernels — consistent with the open
module's documented immediate-suspend behavior, not a bug indicator.
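The check against /proc/driver/nvidia/params can be made repeatable across driver swaps with a small lookup helper. This is a sketch only: `nv_param` is a hypothetical name, and it assumes the `Name: value` line format quoted above (for example `DynamicPowerManagement: 3`).

```shell
# Sketch: print the value of one driver parameter from the params file.
# nv_param is a hypothetical helper; it assumes "Name: value" lines.
# Exits non-zero if the parameter is not listed.
nv_param() {
    local name="$1"
    local file="${2:-/proc/driver/nvidia/params}"
    awk -F': *' -v key="$name" '
        $1 == key { print $2; found = 1 }
        END { exit !found }
    ' "$file"
}
```

For example, `nv_param DynamicPowerManagement` should print the value in effect for the currently loaded module, which makes it easy to confirm that a modprobe.d change actually took effect after a reboot.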
Things ruled out
- gnome-shell / Mutter opening the dGPU: mutter-device-ignore udev tag
  verified present on the dGPU's DRM node in TAGS and CURRENT_TAGS;
  gnome-shell does not appear in fuser.
- Xwayland: __EGL_VENDOR_LIBRARY_FILENAMES and
  __GLX_VENDOR_LIBRARY_NAME point at Mesa; Xwayland does not open
  the dGPU.
- ollama / CUDA-using userspace: stopping the service does not
  drop runtime_usage below 1.
- Electron/Chromium apps: they open the render node while running but
  release it on quit, so they are not the persistent ref.
- Module unload chain: stopping ollama and running
  modprobe -r nvidia_uvm nvidia_drm nvidia_modeset leaves only the
  bare nvidia module bound to the PCI device, and runtime_usage stays
  at 1. Conclusion: the bare nvidia module bound to the device holds
  the reference.
- GDM/Wayland-vs-X11: bug present on 6.17 Wayland and 6.14 X11.
- DPM mode: 0x02 and 0x03 both fail.
- Ubuntu kernel package: bug also present on 6.14, so it is not a
  6.17-specific regression in the Ubuntu tree.
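As an independent cross-check of the "zero openers" result from fuser, the open file descriptors under /proc can be scanned directly. This is a rough sketch under stated assumptions: `list_openers` is a hypothetical helper, it must run as root to see other users' processes, and it only sees open file descriptors, not in-kernel references.

```shell
# Sketch: list processes holding an fd whose target matches a glob pattern.
# list_openers is a hypothetical helper; proc_root is a parameter only so
# the function can be exercised against a fixture directory.
# Prints one "PID TARGET" line per matching file descriptor.
list_openers() {
    local pattern="$1"
    local proc_root="${2:-/proc}"
    local fd target pid
    for fd in "$proc_root"/[0-9]*/fd/*; do
        # Unreadable or vanished entries are skipped
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            $pattern)
                pid=${fd#"$proc_root"/}
                pid=${pid%%/*}
                echo "$pid $target"
                ;;
        esac
    done
}
```

Running `list_openers '/dev/nvidia*'` and `list_openers '/dev/dri/*'` as root should print nothing if fuser is right. Empty output alongside runtime_usage=1 would be consistent with the reference being held inside the kernel module rather than by a process.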
Boot-time observation (separate, possibly related)
On every boot of every driver variant, the dGPU comes up with
power/control=on despite the envycontrol-generated udev bind rule
writing auto. Manually writing auto after boot sticks. Hypothesis:
the bind rule fires before power/control is fully registered, and
the rule's TEST=="power/control" guard skips the write. This can be
worked around in userspace and is not the subject of this issue; I
mention it in case it correlates with the baseline-ref behavior.
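If the timing hypothesis above holds, one userspace workaround is to retry the write until the attribute exists instead of testing for it once. The following is a sketch of that idea, not what envycontrol does: `set_auto_when_ready` is a hypothetical helper, and the retry count and 0.1 s interval are arbitrary choices.

```shell
# Sketch: write "auto" to power/control once the attribute is writable.
# set_auto_when_ready is a hypothetical helper; tries and the 0.1 s
# interval are arbitrary. Returns 0 on success, 1 if the attribute
# never became writable.
set_auto_when_ready() {
    local dev_dir="$1"
    local tries="${2:-50}"
    local ctl="$dev_dir/power/control"
    while [ "$tries" -gt 0 ]; do
        if [ -w "$ctl" ]; then
            echo auto > "$ctl" && return 0
        fi
        tries=$((tries - 1))
        sleep 0.1
    done
    return 1
}
```

It could be called as root from a late-boot hook with `set_auto_when_ready /sys/bus/pci/devices/0000:01:00.0`. This only addresses power/control coming up as on; it does not change the runtime_usage=1 behavior.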
Captures
Public gist with three files:
https://gist.github.com/johntmorehead/85276a8decc5f20cfd5f8e240b852ea1
- 6.17_baseline.txt: modinfo, params, modprobe.d, udev rules,
  current power state, full failure pattern on 6.17.
- 6.17_journalctl_kernel.txt: kernel log filtered for
  nvidia/pcie/d3/gsp on 6.17 boot.
- 6.14.0-37-generic_capture.txt: full identical capture on 6.14
  HWE kernel.
What I'd find useful
- Confirmation that this is a known issue on GA103M / Ampere mobile
  with GSP-RM, or pointers to a tracking issue.
- Any debug knob (NVreg_*, RmMsg, etc.) that would surface what
  pm_runtime reference the driver is holding.
- Guidance on whether nouveau/NVK is the right path for users who
  prioritize idle power over CUDA/NVENC on this hardware.
Happy to gather more data — dmesg extracts, RmMsg traces, additional
NVreg_* permutations, ftrace of the pm_runtime put path, etc.
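For the ftrace option, a possible starting point is the kernel's runtime-PM tracepoints in the rpm event group. This is a sketch under assumptions: `enable_rpm_events` is a hypothetical helper, the event names (rpm_usage, rpm_idle, rpm_suspend, rpm_resume, rpm_return_int) are assumed to be present on these kernels, and the tracefs mount point may be /sys/kernel/debug/tracing on some systems.

```shell
# Sketch: enable the runtime-PM tracepoints that exist under a tracefs root.
# enable_rpm_events is a hypothetical helper; events missing on a given
# kernel are skipped. Prints how many events were enabled and returns
# non-zero if none were found.
enable_rpm_events() {
    local root="${1:-/sys/kernel/tracing}"
    local ev
    local enabled=0
    for ev in rpm_usage rpm_idle rpm_suspend rpm_resume rpm_return_int; do
        if [ -e "$root/events/rpm/$ev/enable" ]; then
            echo 1 > "$root/events/rpm/$ev/enable" && enabled=$((enabled + 1))
        fi
    done
    echo "$enabled"
    [ "$enabled" -gt 0 ]
}
```

Run as root, ideally before the nvidia module is loaded in case the reference is taken at probe time, then filter the trace buffer for the device with `grep 0000:01:00.0 /sys/kernel/tracing/trace`.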