Describe the bug
We have observed an issue while upgrading the GPU Operator from version 25.3.4 to 25.10.0 in a multi-node GPU cluster. During the upgrade, if a node has active GPU workloads, the GPU driver pod can get stuck in the Init:CrashLoopBackOff state.
According to NVIDIA’s documentation on the GPU Operator upgrade workflow:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-upgrades.html the expected behavior is that the upgrade controller first terminates all GPU workloads and then proceeds with the driver upgrade.
However, during our testing, we observed different behavior.
Before the upgrade started, the upgrade-done label was already present on the node from the older version. When the upgrade began, the upgrade controller did not trigger. Instead, the driver DaemonSet rollout started immediately, causing the driver pod to restart with the new version.
This behavior was unexpected because the driver DaemonSet upgrade strategy is set to OnDelete, and deletion of driver pods should be controlled by the upgrade controller. In this case, the upgrade controller did not start as expected.
During the restart, the driver pod’s init container (k8s-driver-manager) attempted to delete the GPU workloads but failed because the autoUpgrade policy was enabled, as indicated in the logs.
In Ubuntu-based clusters, when a node had active GPU workloads, the upgrade controller did not trigger as expected. Instead, the driver DaemonSet rollout started immediately, leading to a driver pod restart without proper workload eviction. This produced the deadlock scenario described below and left the driver pod stuck.
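To confirm whether the upgrade controller is in play before a rollout begins, the node's upgrade-state label and the driver DaemonSet's update strategy can be inspected. This is a sketch; the `gpu-operator` namespace and the `app=nvidia-driver-daemonset` label selector are assumptions about the deployment:

```shell
# Show the per-node upgrade state managed by the upgrade controller
# ("upgrade-done" is expected after a previously completed upgrade)
kubectl get nodes -L nvidia.com/gpu-driver-upgrade-state

# Confirm the driver DaemonSet uses the OnDelete strategy, i.e. its
# pods should only be replaced when the upgrade controller (or an
# operator/admin) deletes them (namespace and label are assumptions)
kubectl get daemonset -n gpu-operator \
  -l app=nvidia-driver-daemonset \
  -o jsonpath='{.items[*].spec.updateStrategy.type}'
```

If the strategy reports `OnDelete` yet the driver pod restarted without the upgrade controller cycling the node through its upgrade states, that matches the behavior reported here.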
k8s-driver-manager logs
W0225 08:26:00.486943 1515429 client_config.go:667] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time=2026-02-25T08:26:00Z level=info msg=Starting driver uninstallation process
time=2026-02-25T08:26:00Z level=info msg=Fetching current component labels
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.operator-validator" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.operator-validator"=true
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.container-toolkit" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.container-toolkit"=true
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.device-plugin" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.device-plugin"=true
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.gpu-feature-discovery" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.gpu-feature-discovery"=true
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.dcgm-exporter" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.dcgm-exporter"=true
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.dcgm" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.dcgm"=true
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.mig-manager" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.mig-manager"=true
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.nvsm" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.nvsm"=true
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.sandbox-validator" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.sandbox-validator"=
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.sandbox-device-plugin" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.sandbox-device-plugin"=
time=2026-02-25T08:26:00Z level=info msg=Getting current value of the "nvidia.com/gpu.deploy.vgpu-device-manager" node label
time=2026-02-25T08:26:00Z level=info msg=Current value of "nvidia.com/gpu.deploy.vgpu-device-manager"=
time=2026-02-25T08:26:00Z level=info msg=Current value of AUTO_UPGRADE_POLICY_ENABLED=true
time=2026-02-25T08:26:00Z level=info msg=Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
time=2026-02-25T08:26:01Z level=info msg=Waiting for nvidia-operator-validator to shutdown
time=2026-02-25T08:26:06Z level=info msg=Waiting for nvidia-container-toolkit-daemonset to shutdown
time=2026-02-25T08:26:06Z level=info msg=Waiting for nvidia-device-plugin-daemonset to shutdown
time=2026-02-25T08:26:06Z level=info msg=Waiting for gpu-feature-discovery to shutdown
time=2026-02-25T08:26:06Z level=info msg=Waiting for nvidia-dcgm-exporter to shutdown
time=2026-02-25T08:26:06Z level=info msg=Waiting for nvidia-dcgm to shutdown
time=2026-02-25T08:26:06Z level=info msg=Waiting for mig-manager to shutdown
time=2026-02-25T08:26:06Z level=info msg=Auto eviction of GPU pods on node <node_name> is disabled by the upgrade policy
time=2026-02-25T08:26:06Z level=info msg=Cleaning up NVIDIA driver
time=2026-02-25T08:26:06Z level=info msg=Unloading NVIDIA driver kernel modules
time=2026-02-25T08:26:06Z level=warning msg=Failed to unload kernel module nvidia_uvm: resource temporarily unavailable
time=2026-02-25T08:26:06Z level=warning msg=Failed to unload kernel module nvidia: resource temporarily unavailable
time=2026-02-25T08:26:06Z level=info msg=Could not unload NVIDIA driver kernel modules, driver is in use
time=2026-02-25T08:26:06Z level=info msg=Module Size Ref Count Used by
time=2026-02-25T08:26:06Z level=info msg=Auto drain of the node is disabled by the upgrade policy
time=2026-02-25T08:26:06Z level=error msg=Failed to uninstall nvidia driver components
time=2026-02-25T08:26:06Z level=info msg=Performing cleanup on failure
time=2026-02-25T08:26:06Z level=info msg=Auto eviction of GPU pods on node <node_name> is disabled by the upgrade policy
time=2026-02-25T08:26:06Z level=info msg=Auto drain of the node is disabled by the upgrade policy
time=2026-02-25T08:26:06Z level=info msg=Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
nvidia_uvm 2179072 48 -
nvidia 11554816 206 nvidia_uvm,
ecc 45056 1 nvidia,
time=2026-02-25T08:26:06Z level=fatal msg=failed to uninstall nvidia driver components: failed to unload driver: resource temporarily unavailable
This resulted in a deadlock situation:
- The driver pod restart requires GPU workloads to be terminated in order to unload the driver module.
- The GPU workloads were not terminated because the upgrade controller workflow was not triggered.
As a result, neither the driver pod nor the workloads could proceed, and the driver pod remained stuck in the Init:CrashLoopBackOff state.
To Reproduce
- Install GPU Operator v25.3.4 with driver.usePrecompiled enabled.
- Run any GPU workload.
- Upgrade to v25.10.0 while workloads are still running, with driver.usePrecompiled disabled.
or
Manually restart the driver DaemonSet pod running on a node that has active GPU workloads.
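The upgrade path above can be sketched with Helm. The release name, namespace, and `nvidia` chart repository alias are assumptions about the installation:

```shell
# Install v25.3.4 with precompiled drivers enabled
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --version v25.3.4 \
  --set driver.usePrecompiled=true

# Start a GPU workload, then upgrade while it is still running,
# switching precompiled drivers off at the same time
helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --version v25.10.0 \
  --set driver.usePrecompiled=false
```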
Expected behavior
As per documentation https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-upgrades.html, the expected behavior during the upgrade process is:
- Disable all clients to the GPU driver.
- Unload the current GPU driver kernel modules.
- Start the updated GPU driver pod.
- Install the updated GPU driver and load the updated kernel modules.
- Enable the clients of the GPU driver.
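In a correctly functioning upgrade, the controller drives each node through these steps via the per-node upgrade-state label, which can be watched while the upgrade runs (a sketch, assuming the label name from the documented workflow and the `gpu-operator` namespace):

```shell
# Watch nodes transition through the upgrade controller's states
# (e.g. upgrade-required -> cordon-required -> ... -> upgrade-done)
kubectl get nodes -L nvidia.com/gpu-driver-upgrade-state -w

# In parallel, watch the driver pods being deleted and recreated
# (label selector is an assumption)
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -w
```

In the failure reported here, the node never left its stale upgrade-done state, yet the driver pod was still restarted.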
Environment (please provide the following information):
- GPU Operator Version: v25.3.4 upgrade to v25.10.0
- OS: Ubuntu 24.04
- Kernel Version: 6.8.0-90-generic
- Container Runtime Version: 2.1.5
- Kubernetes Distro and Version: k8s v1.34.2
GPU Operator pod logs
gpu-operator.log
Cluster-policy configurations
driver:
  enabled: true
  manager:
    env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
      - name: ENABLE_AUTO_DRAIN
        value: "false"
      - name: DRAIN_USE_FORCE
        value: "false"
      - name: DRAIN_POD_SELECTOR_LABEL
        value: ""
      - name: DRAIN_TIMEOUT_SECONDS
        value: 0s
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "true"
    image: k8s-driver-manager
  upgradePolicy:
    autoUpgrade: true
    drain:
      deleteEmptyDir: false
      enable: false
      force: false
      timeoutSeconds: 300
    maxParallelUpgrades: 1
    maxUnavailable: 25%
    podDeletion:
      deleteEmptyDir: true
      force: true
      timeoutSeconds: 900
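Per the k8s-driver-manager logs above, eviction was skipped because AUTO_UPGRADE_POLICY_ENABLED=true, which defers workload eviction to the (non-triggering) upgrade controller. As a possible mitigation (an assumption on our side, not a verified fix), disabling the auto-upgrade policy would hand eviction back to the k8s-driver-manager init container via ENABLE_GPU_POD_EVICTION:

```yaml
# Sketch of a possible mitigation (not verified): disable the upgrade
# controller's auto-upgrade policy so the k8s-driver-manager init
# container evicts GPU pods itself before unloading the driver.
driver:
  manager:
    env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
  upgradePolicy:
    autoUpgrade: false
```

This trades the controller-orchestrated rolling upgrade for per-pod eviction, so it may not be desirable in clusters that rely on maxParallelUpgrades / maxUnavailable.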