
Collect events and upgrade state in must-gather.sh #2168

Open

rajathagasthya wants to merge 1 commit into NVIDIA:main from rajathagasthya:must-gather-upgrade-diagnostics

Conversation


rajathagasthya (Contributor) commented on Feb 25, 2026

Summary

  • Collect Kubernetes events in operator namespace
  • Collect per-GPU-node upgrade state (annotations, labels, cordon status, node events)
  • Collect controller revisions for driver and other operand DaemonSets

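A minimal sketch of the first item, namespace event collection, assuming plain kubectl and an illustrative namespace variable (the script's exact invocation may differ; the output file name matches the sample section below):

ns=gpu-operator   # operator namespace in this test cluster (illustrative)
kubectl get events -n "$ns" --sort-by=.lastTimestamp > events_operator_namespace.log
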
Sample logs

gpu_nodes.upgrade_state (captured while events are still within the default 1h Kubernetes event retention period)

=== ipp1-2890 ===
# Upgrade annotations:
"nvidia.com/gpu-driver-upgrade-enabled":"true"

# Upgrade state label:
upgrade-done
# Node conditions (Ready, SchedulingDisabled):
NetworkUnavailable=False MemoryPressure=False DiskPressure=False PIDPressure=False Ready=True
# Unschedulable:
# Driver pod controller-revision-hash:
595485d69c

# Events on node (upgrade-related):
LAST SEEN   TYPE     REASON               OBJECT           MESSAGE
5m32s       Normal   GPUDriverUpgrade     node/ipp1-2890   Successfully updated node state label to upgrade-required
5m32s       Normal   GPUDriverUpgrade     node/ipp1-2890   Successfully updated node state label to cordon-required
5m32s       Normal   GPUDriverUpgrade     node/ipp1-2890   Successfully updated node state label to wait-for-jobs-required
5m32s       Normal   GPUDriverUpgrade     node/ipp1-2890   Successfully updated node state label to pod-deletion-required
5m32s       Normal   GPUDriverUpgrade     node/ipp1-2890   Successfully updated node state label to pod-restart-required
5m25s       Normal   NodeNotSchedulable   node/ipp1-2890   Node ipp1-2890 status is now: NodeNotSchedulable
92s         Normal   GPUDriverUpgrade     node/ipp1-2890   Successfully updated node state label to validation-required
92s         Normal   GPUDriverUpgrade     node/ipp1-2890   Successfully updated node annotation to nvidia.com/gpu-driver-upgrade-validation-start-time=null
92s         Normal   GPUDriverUpgrade     node/ipp1-2890   Successfully updated node state label to uncordon-required
91s         Normal   GPUDriverUpgrade     node/ipp1-2890   Successfully updated node state label to upgrade-done
91s         Normal   NodeSchedulable      node/ipp1-2890   Node ipp1-2890 status is now: NodeSchedulable

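The per-node state in this section can be approximated with plain kubectl; a minimal sketch, assuming the nvidia.com/gpu-driver-upgrade-state node label used by the upgrade controller (the script's exact commands and formatting may differ):

node=ipp1-2890   # example node from the sample above

# Upgrade annotations (e.g. nvidia.com/gpu-driver-upgrade-enabled)
kubectl get node "$node" -o jsonpath='{.metadata.annotations}'

# Upgrade state label
kubectl get node "$node" -o jsonpath='{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}'

# Cordon status (empty when the node is schedulable)
kubectl get node "$node" -o jsonpath='{.spec.unschedulable}'

# controller-revision-hash label of the driver pod running on this node
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset \
  --field-selector spec.nodeName="$node" \
  -o jsonpath='{.items[0].metadata.labels.controller-revision-hash}'

# Events referencing the node (upgrade-related ones carry reason GPUDriverUpgrade)
kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name="$node"
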
controller_revisions.log

NAME                                                    CONTROLLER                                                  REVISION   AGE
nvidia-device-plugin-daemonset-6bc9cfcd44               daemonset.apps/nvidia-device-plugin-daemonset               1          19h
nvidia-mig-manager-77cb6b577f                           daemonset.apps/nvidia-mig-manager                           1          19h
gpu-feature-discovery-68948976d8                        daemonset.apps/gpu-feature-discovery                        1          19h
nvidia-driver-daemonset-5f764b96ff                      daemonset.apps/nvidia-driver-daemonset                      1          19h
nvidia-dcgm-exporter-7f59d89578                         daemonset.apps/nvidia-dcgm-exporter                         1          19h
nvidia-operator-validator-668bbc86db                    daemonset.apps/nvidia-operator-validator                    1          19h
gpu-operator-node-feature-discovery-worker-67745fbd86   daemonset.apps/gpu-operator-node-feature-discovery-worker   1          19h
nvidia-device-plugin-mps-control-daemon-54f7947d7       daemonset.apps/nvidia-device-plugin-mps-control-daemon      1          19h
nvidia-container-toolkit-daemonset-6b97d4758f           daemonset.apps/nvidia-container-toolkit-daemonset           1          19h
gpu-feature-discovery-6df74744bb                        daemonset.apps/gpu-feature-discovery                        2          18h
nvidia-device-plugin-daemonset-79fb565fcd               daemonset.apps/nvidia-device-plugin-daemonset               2          18h
nvidia-container-toolkit-daemonset-7976455759           daemonset.apps/nvidia-container-toolkit-daemonset           2          18h
nvidia-mig-manager-757dfd48f9                           daemonset.apps/nvidia-mig-manager                           2          19h
nvidia-dcgm-exporter-5f66c88f4d                         daemonset.apps/nvidia-dcgm-exporter                         2          18h
nvidia-driver-daemonset-595485d69c                      daemonset.apps/nvidia-driver-daemonset                      2          7m32s
nvidia-operator-validator-7ff994f8f6                    daemonset.apps/nvidia-operator-validator                    2          18h
nvidia-device-plugin-mps-control-daemon-95b7fb56f       daemonset.apps/nvidia-device-plugin-mps-control-daemon      2          18h
nvidia-dcgm-exporter-5fdd549c6f                         daemonset.apps/nvidia-dcgm-exporter                         3          18h
nvidia-mig-manager-5964c54db9                           daemonset.apps/nvidia-mig-manager                           3          19h
nvidia-container-toolkit-daemonset-fc49746c6            daemonset.apps/nvidia-container-toolkit-daemonset           3          18h
nvidia-operator-validator-85f7b55949                    daemonset.apps/nvidia-operator-validator                    3          18h
gpu-feature-discovery-58cb8bf9d7                        daemonset.apps/gpu-feature-discovery                        3          18h
nvidia-device-plugin-mps-control-daemon-55fb65cb6d      daemonset.apps/nvidia-device-plugin-mps-control-daemon      3          18h
nvidia-device-plugin-daemonset-86bc88dd57               daemonset.apps/nvidia-device-plugin-daemonset               3          18h
nvidia-device-plugin-daemonset-9f6cfccc5                daemonset.apps/nvidia-device-plugin-daemonset               4          18h
nvidia-operator-validator-6867486fdc                    daemonset.apps/nvidia-operator-validator                    4          18h
nvidia-container-toolkit-daemonset-7f9f47c454           daemonset.apps/nvidia-container-toolkit-daemonset           4          18h
nvidia-dcgm-exporter-665c988767                         daemonset.apps/nvidia-dcgm-exporter                         4          18h
nvidia-device-plugin-mps-control-daemon-74c9986b46      daemonset.apps/nvidia-device-plugin-mps-control-daemon      4          18h
nvidia-mig-manager-74bdf786f5                           daemonset.apps/nvidia-mig-manager                           4          18h
gpu-feature-discovery-58f5b5d997                        daemonset.apps/gpu-feature-discovery                        4          18h
nvidia-container-toolkit-daemonset-dd8f4b4c6            daemonset.apps/nvidia-container-toolkit-daemonset           5          18h
nvidia-mig-manager-7c969f679b                           daemonset.apps/nvidia-mig-manager                           5          18h
nvidia-operator-validator-869f8556fc                    daemonset.apps/nvidia-operator-validator                    5          18h
nvidia-device-plugin-mps-control-daemon-6c464f9b8d      daemonset.apps/nvidia-device-plugin-mps-control-daemon      5          18h
gpu-feature-discovery-6c86984bcf                        daemonset.apps/gpu-feature-discovery                        5          18h
nvidia-device-plugin-daemonset-65f579845                daemonset.apps/nvidia-device-plugin-daemonset               5          18h
nvidia-dcgm-exporter-64f476646f                         daemonset.apps/nvidia-dcgm-exporter                         5          18h
nvidia-dcgm-exporter-8c8f4678b                          daemonset.apps/nvidia-dcgm-exporter                         6          18h
nvidia-device-plugin-mps-control-daemon-8b764b9cf       daemonset.apps/nvidia-device-plugin-mps-control-daemon      6          18h
nvidia-device-plugin-daemonset-57d65bdc84               daemonset.apps/nvidia-device-plugin-daemonset               6          18h
nvidia-container-toolkit-daemonset-666b5b9bd5           daemonset.apps/nvidia-container-toolkit-daemonset           6          18h
gpu-feature-discovery-6fc887f684                        daemonset.apps/gpu-feature-discovery                        6          18h
nvidia-mig-manager-595796b4c8                           daemonset.apps/nvidia-mig-manager                           6          18h
nvidia-operator-validator-c987c545d                     daemonset.apps/nvidia-operator-validator                    6          18h
nvidia-mig-manager-7944597844                           daemonset.apps/nvidia-mig-manager                           7          18h
nvidia-mig-manager-67bf6f5bf7                           daemonset.apps/nvidia-mig-manager                           8          18h

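Both controller-revision artifacts can be reproduced with stock kubectl; a sketch using the same illustrative namespace variable as above (note the sample YAML below uses ---separated documents, so the script presumably emits objects individually rather than as a single List):

# All operand ControllerRevisions, ordered by revision number
kubectl get controllerrevisions -n "$ns" --sort-by=.revision > controller_revisions.log

# Full ControllerRevision objects for the driver DaemonSet only
kubectl get controllerrevisions -n "$ns" -l app=nvidia-driver-daemonset -o yaml \
  > controller_revisions_driver.yaml
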
controller_revisions_driver.yaml

apiVersion: apps/v1
data:
  spec:
    template:
      $patch: replace
      metadata:
        annotations:
          kubectl.kubernetes.io/default-container: nvidia-driver-ctr
        labels:
          app: nvidia-driver-daemonset
          app.kubernetes.io/component: nvidia-driver
          app.kubernetes.io/managed-by: gpu-operator
          helm.sh/chart: gpu-operator-v1.0.0-devel
          nvidia.com/precompiled: "false"
      spec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/component
                  operator: In
                  values:
                  - nvidia-driver
              topologyKey: kubernetes.io/hostname
        containers:
        - args:
          - init
          command:
          - nvidia-driver
          env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: spec.nodeName
          - name: NODE_IP
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: status.hostIP
          - name: KERNEL_MODULE_TYPE
            value: auto
          - name: DRIVER_CONFIG_DIGEST
            value: "2561234486"
          image: ghcr.io/nvidia/driver:f01cb133-580.126.20-ubuntu24.04
          imagePullPolicy: IfNotPresent
          lifecycle:
            preStop:
              exec:
                command:
                - /bin/sh
                - -c
                - rm -f /run/nvidia/validations/.driver-ctr-ready
          name: nvidia-driver-ctr
          resources: {}
          securityContext:
            privileged: true
            seLinuxOptions:
              level: s0
          startupProbe:
            exec:
              command:
              - sh
              - /usr/local/bin/startup-probe.sh
            failureThreshold: 120
            initialDelaySeconds: 60
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 60
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /run/nvidia
            mountPropagation: Bidirectional
            name: run-nvidia
          - mountPath: /run/nvidia-fabricmanager
            name: run-nvidia-fabricmanager
          - mountPath: /run/nvidia-topologyd
            name: run-nvidia-topologyd
          - mountPath: /var/log
            name: var-log
          - mountPath: /dev/log
            name: dev-log
          - mountPath: /host-etc/os-release
            name: host-os-release
            readOnly: true
          - mountPath: /run/mellanox/drivers/usr/src
            mountPropagation: HostToContainer
            name: mlnx-ofed-usr-src
          - mountPath: /run/mellanox/drivers
            mountPropagation: HostToContainer
            name: run-mellanox-drivers
          - mountPath: /sys/devices/system/memory/auto_online_blocks
            name: sysfs-memory-online
          - mountPath: /sys/module/firmware_class/parameters/path
            name: firmware-search-path
          - mountPath: /lib/firmware
            name: nv-firmware
          - mountPath: /usr/local/bin/startup-probe.sh
            name: driver-startup-probe-script
            subPath: startup-probe.sh
        dnsPolicy: ClusterFirst
        hostPID: true
        initContainers:
        - args:
          - uninstall_driver
          command:
          - driver-manager
          env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: spec.nodeName
          - name: NVIDIA_VISIBLE_DEVICES
            value: void
          - name: ENABLE_GPU_POD_EVICTION
            value: "true"
          - name: ENABLE_AUTO_DRAIN
            value: "false"
          - name: DRAIN_USE_FORCE
            value: "false"
          - name: DRAIN_POD_SELECTOR_LABEL
          - name: DRAIN_TIMEOUT_SECONDS
            value: 0s
          - name: DRAIN_DELETE_EMPTYDIR_DATA
            value: "false"
          - name: OPERATOR_NAMESPACE
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
          - name: DRIVER_CONFIG_DIGEST
            value: "2561234486"
          image: nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.9.1
          imagePullPolicy: IfNotPresent
          name: k8s-driver-manager
          resources: {}
          securityContext:
            privileged: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /run/nvidia
            mountPropagation: Bidirectional
            name: run-nvidia
          - mountPath: /host
            mountPropagation: HostToContainer
            name: host-root
            readOnly: true
          - mountPath: /sys
            name: host-sys
          - mountPath: /run/mellanox/drivers
            mountPropagation: HostToContainer
            name: run-mellanox-drivers
        nodeSelector:
          nvidia.com/gpu.deploy.driver: "true"
        priorityClassName: system-node-critical
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: nvidia-driver
        serviceAccountName: nvidia-driver
        terminationGracePeriodSeconds: 30
        tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
        volumes:
        - hostPath:
            path: /run/nvidia
            type: DirectoryOrCreate
          name: run-nvidia
        - hostPath:
            path: /var/log
            type: ""
          name: var-log
        - hostPath:
            path: /dev/log
            type: ""
          name: dev-log
        - hostPath:
            path: /etc/os-release
            type: ""
          name: host-os-release
        - hostPath:
            path: /run/nvidia-fabricmanager
            type: DirectoryOrCreate
          name: run-nvidia-fabricmanager
        - hostPath:
            path: /run/nvidia-topologyd
            type: DirectoryOrCreate
          name: run-nvidia-topologyd
        - hostPath:
            path: /run/mellanox/drivers/usr/src
            type: DirectoryOrCreate
          name: mlnx-ofed-usr-src
        - hostPath:
            path: /run/mellanox/drivers
            type: DirectoryOrCreate
          name: run-mellanox-drivers
        - hostPath:
            path: /run/nvidia/validations
            type: DirectoryOrCreate
          name: run-nvidia-validations
        - hostPath:
            path: /
            type: ""
          name: host-root
        - hostPath:
            path: /sys
            type: Directory
          name: host-sys
        - hostPath:
            path: /sys/module/firmware_class/parameters/path
            type: ""
          name: firmware-search-path
        - hostPath:
            path: /sys/devices/system/memory/auto_online_blocks
            type: ""
          name: sysfs-memory-online
        - hostPath:
            path: /run/nvidia/driver/lib/firmware
            type: DirectoryOrCreate
          name: nv-firmware
        - configMap:
            defaultMode: 493
            name: nvidia-driver-startup-probe
          name: driver-startup-probe-script
kind: ControllerRevision
metadata:
  annotations:
    deprecated.daemonset.template.generation: "2"
    nvidia.com/last-applied-hash: "644779074"
    openshift.io/scc: nvidia-driver
  creationTimestamp: "2026-02-25T17:01:29Z"
  labels:
    app: nvidia-driver-daemonset
    app.kubernetes.io/component: nvidia-driver
    app.kubernetes.io/managed-by: gpu-operator
    controller-revision-hash: 595485d69c
    helm.sh/chart: gpu-operator-v1.0.0-devel
    nvidia.com/precompiled: "false"
  name: nvidia-driver-daemonset-595485d69c
  namespace: gpu-operator
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: nvidia-driver-daemonset
    uid: 6bbb3137-1e80-41ca-89c8-95a0e94e5a8d
  resourceVersion: "223952"
  uid: 6593292a-c795-47e5-bda4-00e257630897
revision: 2
---
apiVersion: apps/v1
data:
  spec:
    template:
      $patch: replace
      metadata:
        annotations:
          kubectl.kubernetes.io/default-container: nvidia-driver-ctr
        labels:
          app: nvidia-driver-daemonset
          app.kubernetes.io/component: nvidia-driver
          app.kubernetes.io/managed-by: gpu-operator
          helm.sh/chart: gpu-operator-v1.0.0-devel
          nvidia.com/precompiled: "false"
      spec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/component
                  operator: In
                  values:
                  - nvidia-driver
              topologyKey: kubernetes.io/hostname
        containers:
        - args:
          - init
          command:
          - nvidia-driver
          env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: spec.nodeName
          - name: NODE_IP
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: status.hostIP
          - name: KERNEL_MODULE_TYPE
            value: auto
          - name: DRIVER_CONFIG_DIGEST
            value: "889101604"
          image: nvcr.io/nvidia/driver:580.126.16-ubuntu24.04
          imagePullPolicy: IfNotPresent
          lifecycle:
            preStop:
              exec:
                command:
                - /bin/sh
                - -c
                - rm -f /run/nvidia/validations/.driver-ctr-ready
          name: nvidia-driver-ctr
          resources: {}
          securityContext:
            privileged: true
            seLinuxOptions:
              level: s0
          startupProbe:
            exec:
              command:
              - sh
              - /usr/local/bin/startup-probe.sh
            failureThreshold: 120
            initialDelaySeconds: 60
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 60
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /run/nvidia
            mountPropagation: Bidirectional
            name: run-nvidia
          - mountPath: /run/nvidia-fabricmanager
            name: run-nvidia-fabricmanager
          - mountPath: /run/nvidia-topologyd
            name: run-nvidia-topologyd
          - mountPath: /var/log
            name: var-log
          - mountPath: /dev/log
            name: dev-log
          - mountPath: /host-etc/os-release
            name: host-os-release
            readOnly: true
          - mountPath: /run/mellanox/drivers/usr/src
            mountPropagation: HostToContainer
            name: mlnx-ofed-usr-src
          - mountPath: /run/mellanox/drivers
            mountPropagation: HostToContainer
            name: run-mellanox-drivers
          - mountPath: /sys/devices/system/memory/auto_online_blocks
            name: sysfs-memory-online
          - mountPath: /sys/module/firmware_class/parameters/path
            name: firmware-search-path
          - mountPath: /lib/firmware
            name: nv-firmware
          - mountPath: /usr/local/bin/startup-probe.sh
            name: driver-startup-probe-script
            subPath: startup-probe.sh
        dnsPolicy: ClusterFirst
        hostPID: true
        initContainers:
        - args:
          - uninstall_driver
          command:
          - driver-manager
          env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: spec.nodeName
          - name: NVIDIA_VISIBLE_DEVICES
            value: void
          - name: ENABLE_GPU_POD_EVICTION
            value: "true"
          - name: ENABLE_AUTO_DRAIN
            value: "false"
          - name: DRAIN_USE_FORCE
            value: "false"
          - name: DRAIN_POD_SELECTOR_LABEL
          - name: DRAIN_TIMEOUT_SECONDS
            value: 0s
          - name: DRAIN_DELETE_EMPTYDIR_DATA
            value: "false"
          - name: OPERATOR_NAMESPACE
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
          - name: DRIVER_CONFIG_DIGEST
            value: "889101604"
          image: nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.9.1
          imagePullPolicy: IfNotPresent
          name: k8s-driver-manager
          resources: {}
          securityContext:
            privileged: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /run/nvidia
            mountPropagation: Bidirectional
            name: run-nvidia
          - mountPath: /host
            mountPropagation: HostToContainer
            name: host-root
            readOnly: true
          - mountPath: /sys
            name: host-sys
          - mountPath: /run/mellanox/drivers
            mountPropagation: HostToContainer
            name: run-mellanox-drivers
        nodeSelector:
          nvidia.com/gpu.deploy.driver: "true"
        priorityClassName: system-node-critical
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: nvidia-driver
        serviceAccountName: nvidia-driver
        terminationGracePeriodSeconds: 30
        tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
        volumes:
        - hostPath:
            path: /run/nvidia
            type: DirectoryOrCreate
          name: run-nvidia
        - hostPath:
            path: /var/log
            type: ""
          name: var-log
        - hostPath:
            path: /dev/log
            type: ""
          name: dev-log
        - hostPath:
            path: /etc/os-release
            type: ""
          name: host-os-release
        - hostPath:
            path: /run/nvidia-fabricmanager
            type: DirectoryOrCreate
          name: run-nvidia-fabricmanager
        - hostPath:
            path: /run/nvidia-topologyd
            type: DirectoryOrCreate
          name: run-nvidia-topologyd
        - hostPath:
            path: /run/mellanox/drivers/usr/src
            type: DirectoryOrCreate
          name: mlnx-ofed-usr-src
        - hostPath:
            path: /run/mellanox/drivers
            type: DirectoryOrCreate
          name: run-mellanox-drivers
        - hostPath:
            path: /run/nvidia/validations
            type: DirectoryOrCreate
          name: run-nvidia-validations
        - hostPath:
            path: /
            type: ""
          name: host-root
        - hostPath:
            path: /sys
            type: Directory
          name: host-sys
        - hostPath:
            path: /sys/module/firmware_class/parameters/path
            type: ""
          name: firmware-search-path
        - hostPath:
            path: /sys/devices/system/memory/auto_online_blocks
            type: ""
          name: sysfs-memory-online
        - hostPath:
            path: /run/nvidia/driver/lib/firmware
            type: DirectoryOrCreate
          name: nv-firmware
        - configMap:
            defaultMode: 493
            name: nvidia-driver-startup-probe
          name: driver-startup-probe-script
kind: ControllerRevision
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
    nvidia.com/last-applied-hash: "2199474554"
    openshift.io/scc: nvidia-driver
  creationTimestamp: "2026-02-24T21:53:40Z"
  labels:
    app: nvidia-driver-daemonset
    app.kubernetes.io/component: nvidia-driver
    app.kubernetes.io/managed-by: gpu-operator
    controller-revision-hash: 5f764b96ff
    helm.sh/chart: gpu-operator-v1.0.0-devel
    nvidia.com/precompiled: "false"
  name: nvidia-driver-daemonset-5f764b96ff
  namespace: gpu-operator
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: nvidia-driver-daemonset
    uid: 6bbb3137-1e80-41ca-89c8-95a0e94e5a8d
  resourceVersion: "29143"
  uid: aef86a3c-0bf0-476a-bb77-7a484805617a
revision: 1
---

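The two revisions above differ mainly in the driver image (nvcr.io/nvidia/driver:580.126.16-ubuntu24.04 in revision 1 vs ghcr.io/nvidia/driver:f01cb133-580.126.20-ubuntu24.04 in revision 2) and the DRIVER_CONFIG_DIGEST values, which is exactly the before/after picture this artifact is meant to capture. A usage sketch for comparing two collected revisions (not part of the PR itself):

kubectl get controllerrevision -n gpu-operator nvidia-driver-daemonset-5f764b96ff -o yaml > rev1.yaml
kubectl get controllerrevision -n gpu-operator nvidia-driver-daemonset-595485d69c -o yaml > rev2.yaml
diff rev1.yaml rev2.yaml   # surfaces the image and config-digest changes
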
events_operator_namespace.log

LAST SEEN   TYPE      REASON                   OBJECT                                         MESSAGE
5m29s       Normal    Killing                  pod/nvidia-driver-daemonset-cnvjc              Stopping container nvidia-driver-ctr
4m58s       Normal    Scheduled                pod/nvidia-driver-daemonset-4vvpv              Successfully assigned gpu-operator/nvidia-driver-daemonset-4vvpv to ipp1-2890
4m58s       Normal    SuccessfulCreate         daemonset/nvidia-driver-daemonset              Created pod: nvidia-driver-daemonset-4vvpv
4m58s       Normal    Created                  pod/nvidia-driver-daemonset-4vvpv              Created container: k8s-driver-manager
4m58s       Normal    Pulled                   pod/nvidia-driver-daemonset-4vvpv              Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.9.1" already present on machine
4m57s       Normal    Killing                  pod/nvidia-mig-manager-6gptg                   Stopping container nvidia-mig-manager
4m57s       Normal    Killing                  pod/nvidia-device-plugin-daemonset-bfmqt       Stopping container nvidia-device-plugin
4m57s       Warning   FailedKillPod            pod/nvidia-device-plugin-daemonset-bfmqt       error killing pod: [failed to "KillContainer" for "nvidia-device-plugin" with KillContainerError: "rpc error: code = Unavailable desc = error reading from server: EOF", failed to "KillPodSandbox" for "d8a700cc-d3d1-4eb0-9809-d130acc6b3f7" with KillPodSandboxError: "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused\""]
4m57s       Normal    SuccessfulDelete         daemonset/nvidia-device-plugin-daemonset       Deleted pod: nvidia-device-plugin-daemonset-bfmqt
4m57s       Normal    Killing                  pod/gpu-feature-discovery-82tf4                Stopping container gpu-feature-discovery
4m57s       Normal    Killing                  pod/nvidia-dcgm-exporter-9hqf8                 Stopping container nvidia-dcgm-exporter
4m57s       Warning   FailedKillPod            pod/gpu-feature-discovery-82tf4                error killing pod: [failed to "KillContainer" for "gpu-feature-discovery" with KillContainerError: "rpc error: code = Unavailable desc = error reading from server: EOF", failed to "KillPodSandbox" for "26aeac6e-5b72-4927-bfb1-2a4baf20ebf3" with KillPodSandboxError: "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused\""]
4m57s       Normal    SuccessfulDelete         daemonset/nvidia-operator-validator            Deleted pod: nvidia-operator-validator-hxl5d
4m57s       Warning   FailedKillPod            pod/nvidia-operator-validator-hxl5d            error killing pod: [failed to "KillContainer" for "nvidia-operator-validator" with KillContainerError: "rpc error: code = Unavailable desc = error reading from server: EOF", failed to "KillPodSandbox" for "9b9e7b94-a3a3-4666-9535-f9ded7081240" with KillPodSandboxError: "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused\""]
4m57s       Warning   FailedKillPod            pod/nvidia-mig-manager-6gptg                   error killing pod: [failed to "KillContainer" for "nvidia-mig-manager" with KillContainerError: "rpc error: code = Unavailable desc = error reading from server: EOF", failed to "KillPodSandbox" for "6d54b2ab-3dbb-413a-b143-0b980dee8e2e" with KillPodSandboxError: "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused\""]
4m57s       Normal    SuccessfulDelete         daemonset/nvidia-container-toolkit-daemonset   Deleted pod: nvidia-container-toolkit-daemonset-kztx4
4m57s       Warning   FailedKillPod            pod/nvidia-container-toolkit-daemonset-kztx4   error killing pod: [failed to "KillContainer" for "nvidia-container-toolkit-ctr" with KillContainerError: "rpc error: code = Unavailable desc = error reading from server: EOF", failed to "KillPodSandbox" for "49b35cf7-efbb-41a1-a0c7-4dfc44944269" with KillPodSandboxError: "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused\""]
4m57s       Normal    SuccessfulDelete         daemonset/gpu-feature-discovery                Deleted pod: gpu-feature-discovery-82tf4
4m57s       Normal    Killing                  pod/nvidia-container-toolkit-daemonset-kztx4   Stopping container nvidia-container-toolkit-ctr
4m57s       Normal    Started                  pod/nvidia-driver-daemonset-4vvpv              Started container k8s-driver-manager
4m57s       Warning   FailedKillPod            pod/nvidia-dcgm-exporter-9hqf8                 error killing pod: [failed to "KillContainer" for "nvidia-dcgm-exporter" with KillContainerError: "rpc error: code = Unavailable desc = error reading from server: EOF", failed to "KillPodSandbox" for "e95ab621-e792-4b74-a9f5-4b024d88ff3c" with KillPodSandboxError: "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused\""]
4m57s       Normal    SuccessfulDelete         daemonset/nvidia-dcgm-exporter                 Deleted pod: nvidia-dcgm-exporter-9hqf8
4m57s       Normal    SuccessfulDelete         daemonset/nvidia-mig-manager                   Deleted pod: nvidia-mig-manager-6gptg
4m46s       Normal    Killing                  pod/nvidia-operator-validator-hxl5d            Stopping container nvidia-operator-validator
4m46s       Warning   FailedPreStopHook        pod/nvidia-operator-validator-hxl5d            PreStopHook failed
4m10s       Normal    SuccessfulCreate         daemonset/nvidia-container-toolkit-daemonset   Created pod: nvidia-container-toolkit-daemonset-88r4q
4m10s       Normal    SuccessfulCreate         daemonset/gpu-feature-discovery                Created pod: gpu-feature-discovery-bptk4
4m10s       Normal    SuccessfulCreate         daemonset/nvidia-operator-validator            Created pod: nvidia-operator-validator-l4754
4m10s       Normal    SuccessfulCreate         daemonset/nvidia-mig-manager                   Created pod: nvidia-mig-manager-nzmhb
4m10s       Normal    SuccessfulCreate         daemonset/nvidia-dcgm-exporter                 Created pod: nvidia-dcgm-exporter-qd6vj
4m10s       Normal    Scheduled                pod/gpu-feature-discovery-bptk4                Successfully assigned gpu-operator/gpu-feature-discovery-bptk4 to ipp1-2890
4m10s       Normal    Scheduled                pod/nvidia-dcgm-exporter-qd6vj                 Successfully assigned gpu-operator/nvidia-dcgm-exporter-qd6vj to ipp1-2890
4m10s       Normal    Scheduled                pod/nvidia-mig-manager-nzmhb                   Successfully assigned gpu-operator/nvidia-mig-manager-nzmhb to ipp1-2890
4m10s       Normal    Scheduled                pod/nvidia-container-toolkit-daemonset-88r4q   Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-88r4q to ipp1-2890
4m10s       Normal    Scheduled                pod/nvidia-device-plugin-daemonset-2nc7v       Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-2nc7v to ipp1-2890
4m10s       Normal    Scheduled                pod/nvidia-operator-validator-l4754            Successfully assigned gpu-operator/nvidia-operator-validator-l4754 to ipp1-2890
4m10s       Normal    SuccessfulCreate         daemonset/nvidia-device-plugin-daemonset       Created pod: nvidia-device-plugin-daemonset-2nc7v
4m9s        Normal    Pulled                   pod/nvidia-container-toolkit-daemonset-88r4q   Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
4m9s        Normal    Started                  pod/nvidia-container-toolkit-daemonset-88r4q   Started container driver-validation
4m9s        Warning   FailedCreatePodSandBox   pod/nvidia-dcgm-exporter-qd6vj                 Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "5729b1ca55917b4d3c11159a090a9c35b183c04160b2fda91d20bb9dfd83517c": no runtime for "nvidia" is configured
4m9s        Warning   FailedCreatePodSandBox   pod/nvidia-operator-validator-l4754            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "30878c59ae450e4fc8cb342d75bcedea57de251a305422d867d8cc8e6e67df1a": no runtime for "nvidia" is configured
4m9s        Warning   FailedCreatePodSandBox   pod/nvidia-mig-manager-nzmhb                   Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "56b7a33fa7af6aa2fdd1fc49b2abe7c30d1281d27847c687ada0b3ab782c3ce3": no runtime for "nvidia" is configured
4m9s        Warning   FailedCreatePodSandBox   pod/nvidia-device-plugin-daemonset-2nc7v       Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "4cb4fc985a472424da62f6e4bbb1a80745ec327687007548167ac378d728a656": no runtime for "nvidia" is configured
4m9s        Normal    Pulling                  pod/nvidia-driver-daemonset-4vvpv              Pulling image "ghcr.io/nvidia/driver:f01cb133-580.126.20-ubuntu24.04"
4m9s        Normal    Created                  pod/nvidia-container-toolkit-daemonset-88r4q   Created container: driver-validation
4m9s        Warning   FailedCreatePodSandBox   pod/gpu-feature-discovery-bptk4                Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "843c465bdeff33802b15700b901784eabfdb62d4eef56d0cc2236f6b1810edb9": no runtime for "nvidia" is configured
3m58s       Warning   FailedCreatePodSandBox   pod/nvidia-dcgm-exporter-qd6vj                 Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "96992bd3c5ed51d47d3dbcacb1199627d913b8e94dd9a26ebc74231d01a91579": no runtime for "nvidia" is configured
3m58s       Warning   FailedCreatePodSandBox   pod/nvidia-mig-manager-nzmhb                   Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "bb140ecadc3e968d82540e7ad6e4c083949f36cedcac57072279d21e9fe607fa": no runtime for "nvidia" is configured
3m57s       Warning   FailedCreatePodSandBox   pod/nvidia-operator-validator-l4754            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "c19cfcfde54787f7e1487601c091fd886c467de6fba51caa6d08c0fc5c33d389": no runtime for "nvidia" is configured
3m57s       Warning   FailedCreatePodSandBox   pod/nvidia-device-plugin-daemonset-2nc7v       Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "18eb5881e5357a9175a7bdec8619fee7566d71501b1944add480b01cfba4300e": no runtime for "nvidia" is configured
3m56s       Normal    Started                  pod/nvidia-driver-daemonset-4vvpv              Started container nvidia-driver-ctr
3m56s       Normal    Pulled                   pod/nvidia-driver-daemonset-4vvpv              Successfully pulled image "ghcr.io/nvidia/driver:f01cb133-580.126.20-ubuntu24.04" in 13.031s (13.031s including waiting). Image size: 650704528 bytes.
3m56s       Normal    Created                  pod/nvidia-driver-daemonset-4vvpv              Created container: nvidia-driver-ctr
3m54s       Warning   FailedCreatePodSandBox   pod/gpu-feature-discovery-bptk4                Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "b91a59cf9f5907c2f14355c786f8af294647efc335518f489edfeb63361df063": no runtime for "nvidia" is configured
3m46s       Warning   FailedCreatePodSandBox   pod/nvidia-mig-manager-nzmhb                   Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "271912cad005df1c7ec3b411f2fa39933950b6877feeae9c0fdba9751750fb7e": no runtime for "nvidia" is configured
3m45s       Warning   FailedCreatePodSandBox   pod/nvidia-operator-validator-l4754            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "18073b779521b77e3cbc7752dd5c2f63064efe24cec18807e917cd29f31fdf50": no runtime for "nvidia" is configured
3m44s       Warning   FailedCreatePodSandBox   pod/nvidia-dcgm-exporter-qd6vj                 Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "7def1d2c1efd1e16de5eeac923cc4834053b10299bb6eb2a08f5e9dd3e0ebd80": no runtime for "nvidia" is configured
3m43s       Warning   FailedCreatePodSandBox   pod/gpu-feature-discovery-bptk4                Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "fa407f3dc7baf4bf0d01bebb5b4594220f3f03ddde0887ffa21d17c20665f7e9": no runtime for "nvidia" is configured
3m43s       Warning   FailedCreatePodSandBox   pod/nvidia-device-plugin-daemonset-2nc7v       Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "dcd9f871b25e882d16fb4189b4ac0adca3f0240d56fe8d8e1af418b551d55fe7": no runtime for "nvidia" is configured
3m32s       Warning   FailedCreatePodSandBox   pod/nvidia-device-plugin-daemonset-2nc7v       Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "0625a5ba7cb7741a35bb84158e49d5ee4feb5c10d21badb1d16f3e8759e58236": no runtime for "nvidia" is configured
3m32s       Warning   FailedCreatePodSandBox   pod/nvidia-mig-manager-nzmhb                   Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "4b2ed2951b7cb990dd86c18a880b2abc62df41fbda741adddbbb36ba53b75250": no runtime for "nvidia" is configured
3m32s       Warning   FailedCreatePodSandBox   pod/nvidia-dcgm-exporter-qd6vj                 Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "482c2bf68a62615b8239063d6401934bf5b3676d8a67d98caf4b4c930488d8d2": no runtime for "nvidia" is configured
3m32s       Warning   FailedCreatePodSandBox   pod/gpu-feature-discovery-bptk4                Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "b2643509f1881ff948c237641d983c0b0530e674105e1fd3816bf927e61fc810": no runtime for "nvidia" is configured
3m30s       Warning   FailedCreatePodSandBox   pod/nvidia-operator-validator-l4754            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "a56d4da9a13253347d46306f9db2a0b59a57eb1b1b0667fcfee3e3554a72250f": no runtime for "nvidia" is configured
3m21s       Warning   FailedCreatePodSandBox   pod/nvidia-dcgm-exporter-qd6vj                 Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "aa470629532628fd1e5b7b74c85c921619ee77331d246bde53157a620cdc8cea": no runtime for "nvidia" is configured
3m20s       Warning   FailedCreatePodSandBox   pod/nvidia-mig-manager-nzmhb                   Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "2d510832137ce3b532b2f348e37f4e43b3e114c3f1a17a510273206bafd74825": no runtime for "nvidia" is configured
3m18s       Warning   FailedCreatePodSandBox   pod/gpu-feature-discovery-bptk4                Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "6a3e2b1ccbf9410c6fb9ede96c9b8162b73415f63b24c7940dececad6ffbd6d7": no runtime for "nvidia" is configured
3m17s       Warning   FailedCreatePodSandBox   pod/nvidia-device-plugin-daemonset-2nc7v       Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "3b68f4bac3b9492ad2ba74e5e1102166b658d01b8bcf3a436521ff8dca190f46": no runtime for "nvidia" is configured
3m17s       Warning   FailedCreatePodSandBox   pod/nvidia-operator-validator-l4754            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "a761d723c91c7301ecbde9d345d95178126b47e95897fdd05714286939fa40d5": no runtime for "nvidia" is configured
3m8s        Warning   FailedCreatePodSandBox   pod/nvidia-dcgm-exporter-qd6vj                 Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "68dc17c9b5938509ee1378e7382b9187c2c3ededcf67df7aab99c3bfca58e5ef": no runtime for "nvidia" is configured
3m6s        Warning   FailedCreatePodSandBox   pod/nvidia-operator-validator-l4754            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "2ec8e1ebfac4c41384342ee5428e095615ff3241a3b4992cb5e94b7a85363b3a": no runtime for "nvidia" is configured
3m6s        Warning   FailedCreatePodSandBox   pod/nvidia-mig-manager-nzmhb                   Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "f141bfdbefb6197b61c951ddcdd66952ea9234520b7baa085dec905b000ad32c": no runtime for "nvidia" is configured
3m4s        Warning   FailedCreatePodSandBox   pod/gpu-feature-discovery-bptk4                Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "f176eb9c95bf854b269237a62768479acc165999962d4420ec00bc6418c7d61b": no runtime for "nvidia" is configured
3m2s        Warning   FailedCreatePodSandBox   pod/nvidia-device-plugin-daemonset-2nc7v       Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "0664ada1bb4c5b3ae81e5a914f621d8242f90903db8cfd37df53982deb88bd1b": no runtime for "nvidia" is configured
2m55s       Warning   FailedCreatePodSandBox   pod/nvidia-dcgm-exporter-qd6vj                 Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "a5b1e58fbff7a57c494bf8d022252ba7d195fe4cb86134584bdf8045198d03ad": no runtime for "nvidia" is configured
2m52s       Warning   FailedCreatePodSandBox   pod/gpu-feature-discovery-bptk4                Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "2cdc7fa02cb3cceaca8bbe66ae2e1296380cce749848dca9733fe225ac87d5ec": no runtime for "nvidia" is configured
2m52s       Warning   FailedCreatePodSandBox   pod/nvidia-mig-manager-nzmhb                   Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "06335bc09cd5c1047f70fdf5f82ecf5ac09a1ddf0e17e889512ecb2f5283f4e5": no runtime for "nvidia" is configured
2m52s       Warning   FailedCreatePodSandBox   pod/nvidia-operator-validator-l4754            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "643fcb4a00d2aace8b59daaec3e2a618e685adc6bb0d19303ff3dfe7b2071af2": no runtime for "nvidia" is configured
2m49s       Warning   FailedCreatePodSandBox   pod/nvidia-device-plugin-daemonset-2nc7v       Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "9ed65668901d22444bc649c98d2cf45b1135c8b5e962fc0ccdeef2f79d9df994": no runtime for "nvidia" is configured
2m41s       Normal    Started                  pod/nvidia-container-toolkit-daemonset-88r4q   Started container nvidia-container-toolkit-ctr
2m41s       Normal    Created                  pod/nvidia-container-toolkit-daemonset-88r4q   Created container: nvidia-container-toolkit-ctr
2m41s       Normal    Pulled                   pod/nvidia-container-toolkit-daemonset-88r4q   Container image "nvcr.io/nvidia/k8s/container-toolkit:v1.19.0-rc.4" already present on machine
2m40s       Warning   FailedCreatePodSandBox   pod/nvidia-dcgm-exporter-qd6vj                 Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "d55d903bc7c1648f6d9a3c5a2fb830bce5d6c035a195a1e56d1950c68f39f502": no runtime for "nvidia" is configured
2m40s       Warning   FailedCreatePodSandBox   pod/nvidia-mig-manager-nzmhb                   Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "7e7b3839257e9e74648bab62351d222f9bc96e440ffa871476716a7cae923dd3": no runtime for "nvidia" is configured
2m39s       Warning   FailedCreatePodSandBox   pod/nvidia-operator-validator-l4754            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "7d5b7941ff549c33ed8c57a2f5d09f34ec0832a6f8c6b6c3fa2f441e7c9c58da": no runtime for "nvidia" is configured
2m38s       Warning   FailedCreatePodSandBox   pod/gpu-feature-discovery-bptk4                Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "9c5101a30b192d4636f32c544760b64388d644b25ad8425b14297081cf6d3cc1": no runtime for "nvidia" is configured
2m37s       Warning   FailedCreatePodSandBox   pod/nvidia-device-plugin-daemonset-2nc7v       Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "1f11decc9f748393be2afca14cf3acab7a23697578c6d464af6c80e409737227": no runtime for "nvidia" is configured
2m26s       Normal    Started                  pod/gpu-feature-discovery-bptk4                Started container toolkit-validation
2m26s       Normal    Created                  pod/gpu-feature-discovery-bptk4                Created container: toolkit-validation
2m26s       Normal    Pulled                   pod/gpu-feature-discovery-bptk4                Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m25s       Normal    Created                  pod/nvidia-dcgm-exporter-qd6vj                 Created container: toolkit-validation
2m25s       Normal    Pulled                   pod/nvidia-dcgm-exporter-qd6vj                 Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m25s       Normal    Started                  pod/nvidia-dcgm-exporter-qd6vj                 Started container toolkit-validation
2m25s       Normal    Pulled                   pod/nvidia-mig-manager-nzmhb                   Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m25s       Normal    Created                  pod/nvidia-mig-manager-nzmhb                   Created container: toolkit-validation
2m25s       Normal    Started                  pod/nvidia-mig-manager-nzmhb                   Started container toolkit-validation
2m24s       Normal    Pulled                   pod/nvidia-operator-validator-l4754            Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m24s       Normal    Started                  pod/nvidia-operator-validator-l4754            Started container driver-validation
2m24s       Normal    Created                  pod/nvidia-operator-validator-l4754            Created container: driver-validation
2m23s       Normal    Created                  pod/nvidia-operator-validator-l4754            Created container: toolkit-validation
2m23s       Normal    Pulled                   pod/nvidia-operator-validator-l4754            Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m22s       Normal    Pulled                   pod/nvidia-operator-validator-l4754            Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m22s       Normal    Created                  pod/nvidia-operator-validator-l4754            Created container: cuda-validation
2m22s       Normal    Started                  pod/nvidia-operator-validator-l4754            Started container toolkit-validation
2m21s       Normal    Created                  pod/nvidia-device-plugin-daemonset-2nc7v       Created container: toolkit-validation
2m21s       Normal    Created                  pod/gpu-feature-discovery-bptk4                Created container: gpu-feature-discovery
2m21s       Normal    Started                  pod/nvidia-cuda-validator-6p98l                Started container cuda-validation
2m21s       Normal    Started                  pod/nvidia-operator-validator-l4754            Started container cuda-validation
2m21s       Normal    Pulled                   pod/gpu-feature-discovery-bptk4                Container image "nvcr.io/nvidia/k8s-device-plugin:v0.18.2" already present on machine
2m21s       Normal    Pulled                   pod/nvidia-cuda-validator-6p98l                Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m21s       Normal    Pulled                   pod/nvidia-device-plugin-daemonset-2nc7v       Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m21s       Normal    Created                  pod/nvidia-cuda-validator-6p98l                Created container: cuda-validation
2m21s       Normal    Created                  pod/nvidia-device-plugin-daemonset-2nc7v       Created container: nvidia-device-plugin
2m21s       Normal    Pulled                   pod/nvidia-device-plugin-daemonset-2nc7v       Container image "nvcr.io/nvidia/k8s-device-plugin:v0.18.2" already present on machine
2m21s       Normal    Started                  pod/nvidia-device-plugin-daemonset-2nc7v       Started container toolkit-validation
2m20s       Normal    Pulled                   pod/nvidia-cuda-validator-6p98l                Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m20s       Normal    Created                  pod/nvidia-cuda-validator-6p98l                Created container: nvidia-cuda-validator
2m20s       Normal    Created                  pod/nvidia-mig-manager-nzmhb                   Created container: nvidia-mig-manager
2m20s       Normal    Pulled                   pod/nvidia-mig-manager-nzmhb                   Container image "ghcr.io/nvidia/k8s-mig-manager:315c447c" already present on machine
2m20s       Normal    Started                  pod/nvidia-device-plugin-daemonset-2nc7v       Started container nvidia-device-plugin
2m20s       Normal    Created                  pod/nvidia-dcgm-exporter-qd6vj                 Created container: nvidia-dcgm-exporter
2m20s       Normal    Pulled                   pod/nvidia-dcgm-exporter-qd6vj                 Container image "nvcr.io/nvidia/k8s/dcgm-exporter:4.5.1-4.8.0-distroless" already present on machine
2m20s       Normal    Started                  pod/gpu-feature-discovery-bptk4                Started container gpu-feature-discovery
2m19s       Normal    Started                  pod/nvidia-dcgm-exporter-qd6vj                 Started container nvidia-dcgm-exporter
2m19s       Normal    Started                  pod/nvidia-cuda-validator-6p98l                Started container nvidia-cuda-validator
2m19s       Normal    Started                  pod/nvidia-mig-manager-nzmhb                   Started container nvidia-mig-manager
2m16s       Normal    Pulled                   pod/nvidia-operator-validator-l4754            Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m15s       Normal    Started                  pod/nvidia-operator-validator-l4754            Started container plugin-validation
2m15s       Normal    Created                  pod/nvidia-operator-validator-l4754            Created container: plugin-validation
2m14s       Normal    Pulled                   pod/nvidia-operator-validator-l4754            Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m14s       Normal    Created                  pod/nvidia-operator-validator-l4754            Created container: nvidia-operator-validator
2m14s       Normal    Started                  pod/nvidia-operator-validator-l4754            Started container nvidia-operator-validator

rajathagasthya force-pushed the must-gather-upgrade-diagnostics branch from ee7641e to 163bf6b on February 25, 2026 at 17:59
rajathagasthya changed the title from "fix(must-gather): collect events, upgrade state, and controller revisions" to "Collect events and upgrade state in must-gather.sh" on February 25, 2026
rajathagasthya force-pushed the must-gather-upgrade-diagnostics branch 2 times, most recently from 9331ab2 to 34109c3 on February 25, 2026 at 18:10
rajathagasthya marked this pull request as ready for review on February 25, 2026 at 18:16
rajathagasthya force-pushed the must-gather-upgrade-diagnostics branch 2 times, most recently from 2ab7a63 to 83404d9 on February 25, 2026 at 23:08
rajathagasthya force-pushed the must-gather-upgrade-diagnostics branch from 83404d9 to 4b74cb0 on February 26, 2026 at 19:41
* Collect Kubernetes events in operator namespace.
* Collect per-GPU-node upgrade state (annotations, labels, cordon
  status, node events).
* Collect controller revisions for driver and other operand DaemonSets.

Signed-off-by: Rajath Agasthya <ragasthya@nvidia.com>
rajathagasthya force-pushed the must-gather-upgrade-diagnostics branch from 4b74cb0 to ac37a84 on February 26, 2026 at 22:10