Describe the bug
We have observed an issue while upgrading the GPU Operator from version 25.3.4 to 25.10.0 in a multi-node GPU cluster. During the upgrade, if a node has active GPU workloads, the GPU driver pod can get stuck in the Init:CrashLoopBackOff state.
According to NVIDIA’s documentation on the GPU Operator upgrade workflow:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-upgrades.html the expected behavior is that the upgrade controller first terminates all GPU workloads and then proceeds with the driver upgrade.
However, during our testing, we observed different behavior.
Before the upgrade started, the upgrade-done label was already present on the node from the older version. When the upgrade began, the upgrade controller did not trigger. Instead, the driver DaemonSet rollout started immediately, causing the driver pod to restart with the new version.
This behavior was unexpected because the driver DaemonSet upgrade strategy is set to OnDelete, and deletion of driver pods should be controlled by the upgrade controller. In this case, the upgrade controller did not start as expected.
During the restart, the driver pod’s init container (k8s-driver-manager) attempted to delete the GPU workloads but failed because the autoUpgrade policy was enabled, as indicated in the logs.
In Ubuntu-based clusters, when a node had active GPU workloads, the upgrade controller did not trigger as expected. Instead, the driver DaemonSet rollout started immediately, leading to a driver pod restart without proper workload eviction. This produced the deadlock scenario described below and left the driver pod stuck.
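To confirm whether the upgrade controller is in play before a rollout begins, the node's upgrade-state label and the driver DaemonSet's update strategy can be inspected. This is a sketch; the `gpu-operator` namespace and the `app=nvidia-driver-daemonset` label selector are assumptions about the deployment:

```shell
# Show the per-node upgrade state managed by the upgrade controller
# ("upgrade-done" is expected after a previously completed upgrade)
kubectl get nodes -L nvidia.com/gpu-driver-upgrade-state

# Confirm the driver DaemonSet uses the OnDelete strategy, i.e. its
# pods should only be replaced when the upgrade controller (or an
# operator/admin) deletes them (namespace and label are assumptions)
kubectl get daemonset -n gpu-operator \
  -l app=nvidia-driver-daemonset \
  -o jsonpath='{.items[*].spec.updateStrategy.type}'
```

If the strategy reports `OnDelete` yet the driver pod restarted without the upgrade controller cycling the node through its upgrade states, that matches the behavior reported here.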
k8s-driver-manager logs
W0225 08:26:00.486943 1515429 client_config.go:667] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time=2026-02-25T08:26:00Z level=info msg=Starting driver uninstallation process
time=2026-02-25T08:26:00Z level=info msg=Fetching current component labels
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.operator-validator" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.operator-validator"=true
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.container-toolkit" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.container-toolkit"=true
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.device-plugin" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.device-plugin"=true
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.gpu-feature-discovery" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.gpu-feature-discovery"=true
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.dcgm-exporter" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.dcgm-exporter"=true
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.dcgm" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.dcgm"=true
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.mig-manager" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.mig-manager"=true
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.nvsm" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.nvsm"=true
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.sandbox-validator" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.sandbox-validator"=
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.sandbox-device-plugin" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.sandbox-device-plugin"=
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.vgpu-device-manager" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.vgpu-device-manager"=
time=2026-02-25T08:26:00Z level=info msg=Current value of AUTO_UPGRADE_POLICY_ENABLED=true
time=2026-02-25T08:26:00Z level=info msg=Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
time=2026-02-25T08:26:01Z level=info msg=Waiting for nvidia-operator-validator to shutdown
time=2026-02-25T08:26:06Z level=info msg=Waiting for nvidia-container-toolkit-daemonset to shutdown
time=2026-02-25T08:26:06Z level=info msg=Waiting for nvidia-device-plugin-daemonset to shutdown
time=2026-02-25T08:26:06Z level=info msg=Waiting for gpu-feature-discovery to shutdown
time=2026-02-25T08:26:06Z level=info msg=Waiting for nvidia-dcgm-exporter to shutdown
time=2026-02-25T08:26:06Z level=info msg=Waiting for nvidia-dcgm to shutdown
time=2026-02-25T08:26:06Z level=info msg=Waiting for mig-manager to shutdown
time=2026-02-25T08:26:06Z level=info msg=Auto eviction of GPU pods on node <node_name> is disabled by the upgrade policy
time=2026-02-25T08:26:06Z level=info msg=Cleaning up NVIDIA driver
time=2026-02-25T08:26:06Z level=info msg=Unloading NVIDIA driver kernel modules
time=2026-02-25T08:26:06Z level=warning msg=Failed to unload kernel module nvidia_uvm: resource temporarily unavailable
time=2026-02-25T08:26:06Z level=warning msg=Failed to unload kernel module nvidia: resource temporarily unavailable
time=2026-02-25T08:26:06Z level=info msg=Could not unload NVIDIA driver kernel modules, driver is in use
time=2026-02-25T08:26:06Z level=info msg=Module Size Ref Count Used by
time=2026-02-25T08:26:06Z level=info msg=Auto drain of the node is disabled by the upgrade policy
time=2026-02-25T08:26:06Z level=error msg=Failed to uninstall nvidia driver components
time=2026-02-25T08:26:06Z level=info msg=Performing cleanup on failure
time=2026-02-25T08:26:06Z level=info msg=Auto eviction of GPU pods on node <node_name> is disabled by the upgrade policy
time=2026-02-25T08:26:06Z level=info msg=Auto drain of the node is disabled by the upgrade policy
time=2026-02-25T08:26:06Z level=info msg=Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
nvidia_uvm 2179072 48 -
nvidia 11554816 206 nvidia_uvm,
ecc 45056 1 nvidia,
time=2026-02-25T08:26:06Z level=fatal msg=failed to uninstall nvidia driver components: failed to unload driver: resource temporarily unavailable
This resulted in a deadlock situation:
- The driver pod restart requires GPU workloads to be terminated in order to unload the driver module.
- The GPU workloads were not terminated because the upgrade controller workflow was not triggered.
As a result, neither the driver pod nor the workloads could proceed, and the driver pod remained stuck in the Init:CrashLoopBackOff state.
To Reproduce
- Install GPU Operator v25.3.4 with driver.usePrecompiled enabled.
- Run any GPU workload.
- Upgrade to v25.10.0 while workloads are still running, with driver.usePrecompiled disabled.
or
Manually restart the driver DaemonSet pod running on a node that has active GPU workloads.
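The upgrade path above can be sketched with Helm. The release name, namespace, and `nvidia` chart repository alias are assumptions about the installation:

```shell
# Install v25.3.4 with precompiled drivers enabled
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --version v25.3.4 \
  --set driver.usePrecompiled=true

# Start a GPU workload, then upgrade while it is still running,
# switching precompiled drivers off at the same time
helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --version v25.10.0 \
  --set driver.usePrecompiled=false
```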
Expected behavior
As per documentation https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-upgrades.html, the expected behavior during the upgrade process is:
- Disable all clients to the GPU driver.
- Unload the current GPU driver kernel modules.
- Start the updated GPU driver pod.
- Install the updated GPU driver and load the updated kernel modules.
- Enable the clients of the GPU driver.
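In a correctly functioning upgrade, the controller drives each node through these steps via the per-node upgrade-state label, which can be watched while the upgrade runs (a sketch, assuming the label name from the documented workflow and the `gpu-operator` namespace):

```shell
# Watch nodes transition through the upgrade controller's states
# (e.g. upgrade-required -> cordon-required -> ... -> upgrade-done)
kubectl get nodes -L nvidia.com/gpu-driver-upgrade-state -w

# In parallel, watch the driver pods being deleted and recreated
# (label selector is an assumption)
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -w
```

In the failure reported here, the node never left its stale upgrade-done state, yet the driver pod was still restarted.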
Environment (please provide the following information):
- GPU Operator Version: v25.3.4 upgrade to v25.10.0
- OS: Ubuntu 24.04
- Kernel Version: 6.8.0-90-generic
- Container Runtime Version: 2.1.5
- Kubernetes Distro and Version: k8s v1.34.2
GPU Operator pod logs
gpu-operator.log
Cluster-policy configurations
driver:
  enabled: true
  manager:
    env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
      - name: ENABLE_AUTO_DRAIN
        value: "false"
      - name: DRAIN_USE_FORCE
        value: "false"
      - name: DRAIN_POD_SELECTOR_LABEL
        value: ""
      - name: DRAIN_TIMEOUT_SECONDS
        value: 0s
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "true"
    image: k8s-driver-manager
  upgradePolicy:
    autoUpgrade: true
    drain:
      deleteEmptyDir: false
      enable: false
      force: false
      timeoutSeconds: 300
    maxParallelUpgrades: 1
    maxUnavailable: 25%
    podDeletion:
      deleteEmptyDir: true
      force: true
      timeoutSeconds: 900
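Per the k8s-driver-manager logs above, eviction was skipped because AUTO_UPGRADE_POLICY_ENABLED=true, which defers workload eviction to the (non-triggering) upgrade controller. As a possible mitigation (an assumption on our side, not a verified fix), disabling the auto-upgrade policy would hand eviction back to the k8s-driver-manager init container via ENABLE_GPU_POD_EVICTION:

```yaml
# Sketch of a possible mitigation (not verified): disable the upgrade
# controller's auto-upgrade policy so the k8s-driver-manager init
# container evicts GPU pods itself before unloading the driver.
driver:
  manager:
    env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
  upgradePolicy:
    autoUpgrade: false
```

This trades the controller-orchestrated rolling upgrade for per-pod eviction, so it may not be desirable in clusters that rely on maxParallelUpgrades / maxUnavailable.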