
Collect events and upgrade state in must-gather.sh #2168

Open

rajathagasthya wants to merge 1 commit into NVIDIA:main from rajathagasthya:must-gather-upgrade-diagnostics

Conversation


rajathagasthya (Contributor) commented on Feb 25, 2026

Summary

  • Collect Kubernetes events in operator namespace
  • Collect per-GPU-node upgrade state (annotations, labels, cordon status, node events)
  • Collect controller revisions for driver and other operand DaemonSets

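A minimal sketch of the first item, namespace event collection, assuming plain kubectl and an illustrative namespace variable (the script's exact invocation may differ; the output file name matches the sample section below):

ns=gpu-operator   # operator namespace in this test cluster (illustrative)
kubectl get events -n "$ns" --sort-by=.lastTimestamp > events_operator_namespace.log
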
Sample logs

gpu_nodes.upgrade_state (captured while events are still within the default 1h Kubernetes event retention period)

=== ipp1-2890 ===
# Upgrade annotations:
"nvidia.com/gpu-driver-upgrade-enabled":"true"

# Upgrade state label:
upgrade-done
# Node conditions (Ready, SchedulingDisabled):
NetworkUnavailable=False MemoryPressure=False DiskPressure=False PIDPressure=False Ready=True
# Unschedulable:
# Driver pod controller-revision-hash:
595485d69c

# Events on node (upgrade-related):
LAST SEEN   TYPE     REASON               OBJECT           MESSAGE
5m32s       Normal   GPUDriverUpgrade     node/ipp1-2890   Successfully updated node state label to upgrade-required
5m32s       Normal   GPUDriverUpgrade     node/ipp1-2890   Successfully updated node state label to cordon-required
5m32s       Normal   GPUDriverUpgrade     node/ipp1-2890   Successfully updated node state label to wait-for-jobs-required
5m32s       Normal   GPUDriverUpgrade     node/ipp1-2890   Successfully updated node state label to pod-deletion-required
5m32s       Normal   GPUDriverUpgrade     node/ipp1-2890   Successfully updated node state label to pod-restart-required
5m25s       Normal   NodeNotSchedulable   node/ipp1-2890   Node ipp1-2890 status is now: NodeNotSchedulable
92s         Normal   GPUDriverUpgrade     node/ipp1-2890   Successfully updated node state label to validation-required
92s         Normal   GPUDriverUpgrade     node/ipp1-2890   Successfully updated node annotation to nvidia.com/gpu-driver-upgrade-validation-start-time=null
92s         Normal   GPUDriverUpgrade     node/ipp1-2890   Successfully updated node state label to uncordon-required
91s         Normal   GPUDriverUpgrade     node/ipp1-2890   Successfully updated node state label to upgrade-done
91s         Normal   NodeSchedulable      node/ipp1-2890   Node ipp1-2890 status is now: NodeSchedulable

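The per-node state in this section can be approximated with plain kubectl; a minimal sketch, assuming the nvidia.com/gpu-driver-upgrade-state node label used by the upgrade controller (the script's exact commands and formatting may differ):

node=ipp1-2890   # example node from the sample above

# Upgrade annotations (e.g. nvidia.com/gpu-driver-upgrade-enabled)
kubectl get node "$node" -o jsonpath='{.metadata.annotations}'

# Upgrade state label
kubectl get node "$node" -o jsonpath='{.metadata.labels.nvidia\.com/gpu-driver-upgrade-state}'

# Cordon status (empty when the node is schedulable)
kubectl get node "$node" -o jsonpath='{.spec.unschedulable}'

# controller-revision-hash label of the driver pod running on this node
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset \
  --field-selector spec.nodeName="$node" \
  -o jsonpath='{.items[0].metadata.labels.controller-revision-hash}'

# Events referencing the node (upgrade-related ones carry reason GPUDriverUpgrade)
kubectl get events -A --field-selector involvedObject.kind=Node,involvedObject.name="$node"
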
controller_revisions.log

NAME                                                    CONTROLLER                                                  REVISION   AGE
nvidia-device-plugin-daemonset-6bc9cfcd44               daemonset.apps/nvidia-device-plugin-daemonset               1          19h
nvidia-mig-manager-77cb6b577f                           daemonset.apps/nvidia-mig-manager                           1          19h
gpu-feature-discovery-68948976d8                        daemonset.apps/gpu-feature-discovery                        1          19h
nvidia-driver-daemonset-5f764b96ff                      daemonset.apps/nvidia-driver-daemonset                      1          19h
nvidia-dcgm-exporter-7f59d89578                         daemonset.apps/nvidia-dcgm-exporter                         1          19h
nvidia-operator-validator-668bbc86db                    daemonset.apps/nvidia-operator-validator                    1          19h
gpu-operator-node-feature-discovery-worker-67745fbd86   daemonset.apps/gpu-operator-node-feature-discovery-worker   1          19h
nvidia-device-plugin-mps-control-daemon-54f7947d7       daemonset.apps/nvidia-device-plugin-mps-control-daemon      1          19h
nvidia-container-toolkit-daemonset-6b97d4758f           daemonset.apps/nvidia-container-toolkit-daemonset           1          19h
gpu-feature-discovery-6df74744bb                        daemonset.apps/gpu-feature-discovery                        2          18h
nvidia-device-plugin-daemonset-79fb565fcd               daemonset.apps/nvidia-device-plugin-daemonset               2          18h
nvidia-container-toolkit-daemonset-7976455759           daemonset.apps/nvidia-container-toolkit-daemonset           2          18h
nvidia-mig-manager-757dfd48f9                           daemonset.apps/nvidia-mig-manager                           2          19h
nvidia-dcgm-exporter-5f66c88f4d                         daemonset.apps/nvidia-dcgm-exporter                         2          18h
nvidia-driver-daemonset-595485d69c                      daemonset.apps/nvidia-driver-daemonset                      2          7m32s
nvidia-operator-validator-7ff994f8f6                    daemonset.apps/nvidia-operator-validator                    2          18h
nvidia-device-plugin-mps-control-daemon-95b7fb56f       daemonset.apps/nvidia-device-plugin-mps-control-daemon      2          18h
nvidia-dcgm-exporter-5fdd549c6f                         daemonset.apps/nvidia-dcgm-exporter                         3          18h
nvidia-mig-manager-5964c54db9                           daemonset.apps/nvidia-mig-manager                           3          19h
nvidia-container-toolkit-daemonset-fc49746c6            daemonset.apps/nvidia-container-toolkit-daemonset           3          18h
nvidia-operator-validator-85f7b55949                    daemonset.apps/nvidia-operator-validator                    3          18h
gpu-feature-discovery-58cb8bf9d7                        daemonset.apps/gpu-feature-discovery                        3          18h
nvidia-device-plugin-mps-control-daemon-55fb65cb6d      daemonset.apps/nvidia-device-plugin-mps-control-daemon      3          18h
nvidia-device-plugin-daemonset-86bc88dd57               daemonset.apps/nvidia-device-plugin-daemonset               3          18h
nvidia-device-plugin-daemonset-9f6cfccc5                daemonset.apps/nvidia-device-plugin-daemonset               4          18h
nvidia-operator-validator-6867486fdc                    daemonset.apps/nvidia-operator-validator                    4          18h
nvidia-container-toolkit-daemonset-7f9f47c454           daemonset.apps/nvidia-container-toolkit-daemonset           4          18h
nvidia-dcgm-exporter-665c988767                         daemonset.apps/nvidia-dcgm-exporter                         4          18h
nvidia-device-plugin-mps-control-daemon-74c9986b46      daemonset.apps/nvidia-device-plugin-mps-control-daemon      4          18h
nvidia-mig-manager-74bdf786f5                           daemonset.apps/nvidia-mig-manager                           4          18h
gpu-feature-discovery-58f5b5d997                        daemonset.apps/gpu-feature-discovery                        4          18h
nvidia-container-toolkit-daemonset-dd8f4b4c6            daemonset.apps/nvidia-container-toolkit-daemonset           5          18h
nvidia-mig-manager-7c969f679b                           daemonset.apps/nvidia-mig-manager                           5          18h
nvidia-operator-validator-869f8556fc                    daemonset.apps/nvidia-operator-validator                    5          18h
nvidia-device-plugin-mps-control-daemon-6c464f9b8d      daemonset.apps/nvidia-device-plugin-mps-control-daemon      5          18h
gpu-feature-discovery-6c86984bcf                        daemonset.apps/gpu-feature-discovery                        5          18h
nvidia-device-plugin-daemonset-65f579845                daemonset.apps/nvidia-device-plugin-daemonset               5          18h
nvidia-dcgm-exporter-64f476646f                         daemonset.apps/nvidia-dcgm-exporter                         5          18h
nvidia-dcgm-exporter-8c8f4678b                          daemonset.apps/nvidia-dcgm-exporter                         6          18h
nvidia-device-plugin-mps-control-daemon-8b764b9cf       daemonset.apps/nvidia-device-plugin-mps-control-daemon      6          18h
nvidia-device-plugin-daemonset-57d65bdc84               daemonset.apps/nvidia-device-plugin-daemonset               6          18h
nvidia-container-toolkit-daemonset-666b5b9bd5           daemonset.apps/nvidia-container-toolkit-daemonset           6          18h
gpu-feature-discovery-6fc887f684                        daemonset.apps/gpu-feature-discovery                        6          18h
nvidia-mig-manager-595796b4c8                           daemonset.apps/nvidia-mig-manager                           6          18h
nvidia-operator-validator-c987c545d                     daemonset.apps/nvidia-operator-validator                    6          18h
nvidia-mig-manager-7944597844                           daemonset.apps/nvidia-mig-manager                           7          18h
nvidia-mig-manager-67bf6f5bf7                           daemonset.apps/nvidia-mig-manager                           8          18h

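Both controller-revision artifacts can be reproduced with stock kubectl; a sketch using the same illustrative namespace variable as above (note the sample YAML below uses ---separated documents, so the script presumably emits objects individually rather than as a single List):

# All operand ControllerRevisions, ordered by revision number
kubectl get controllerrevisions -n "$ns" --sort-by=.revision > controller_revisions.log

# Full ControllerRevision objects for the driver DaemonSet only
kubectl get controllerrevisions -n "$ns" -l app=nvidia-driver-daemonset -o yaml \
  > controller_revisions_driver.yaml
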
controller_revisions_driver.yaml

apiVersion: apps/v1
data:
  spec:
    template:
      $patch: replace
      metadata:
        annotations:
          kubectl.kubernetes.io/default-container: nvidia-driver-ctr
        labels:
          app: nvidia-driver-daemonset
          app.kubernetes.io/component: nvidia-driver
          app.kubernetes.io/managed-by: gpu-operator
          helm.sh/chart: gpu-operator-v1.0.0-devel
          nvidia.com/precompiled: "false"
      spec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/component
                  operator: In
                  values:
                  - nvidia-driver
              topologyKey: kubernetes.io/hostname
        containers:
        - args:
          - init
          command:
          - nvidia-driver
          env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: spec.nodeName
          - name: NODE_IP
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: status.hostIP
          - name: KERNEL_MODULE_TYPE
            value: auto
          - name: DRIVER_CONFIG_DIGEST
            value: "2561234486"
          image: ghcr.io/nvidia/driver:f01cb133-580.126.20-ubuntu24.04
          imagePullPolicy: IfNotPresent
          lifecycle:
            preStop:
              exec:
                command:
                - /bin/sh
                - -c
                - rm -f /run/nvidia/validations/.driver-ctr-ready
          name: nvidia-driver-ctr
          resources: {}
          securityContext:
            privileged: true
            seLinuxOptions:
              level: s0
          startupProbe:
            exec:
              command:
              - sh
              - /usr/local/bin/startup-probe.sh
            failureThreshold: 120
            initialDelaySeconds: 60
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 60
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /run/nvidia
            mountPropagation: Bidirectional
            name: run-nvidia
          - mountPath: /run/nvidia-fabricmanager
            name: run-nvidia-fabricmanager
          - mountPath: /run/nvidia-topologyd
            name: run-nvidia-topologyd
          - mountPath: /var/log
            name: var-log
          - mountPath: /dev/log
            name: dev-log
          - mountPath: /host-etc/os-release
            name: host-os-release
            readOnly: true
          - mountPath: /run/mellanox/drivers/usr/src
            mountPropagation: HostToContainer
            name: mlnx-ofed-usr-src
          - mountPath: /run/mellanox/drivers
            mountPropagation: HostToContainer
            name: run-mellanox-drivers
          - mountPath: /sys/devices/system/memory/auto_online_blocks
            name: sysfs-memory-online
          - mountPath: /sys/module/firmware_class/parameters/path
            name: firmware-search-path
          - mountPath: /lib/firmware
            name: nv-firmware
          - mountPath: /usr/local/bin/startup-probe.sh
            name: driver-startup-probe-script
            subPath: startup-probe.sh
        dnsPolicy: ClusterFirst
        hostPID: true
        initContainers:
        - args:
          - uninstall_driver
          command:
          - driver-manager
          env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: spec.nodeName
          - name: NVIDIA_VISIBLE_DEVICES
            value: void
          - name: ENABLE_GPU_POD_EVICTION
            value: "true"
          - name: ENABLE_AUTO_DRAIN
            value: "false"
          - name: DRAIN_USE_FORCE
            value: "false"
          - name: DRAIN_POD_SELECTOR_LABEL
          - name: DRAIN_TIMEOUT_SECONDS
            value: 0s
          - name: DRAIN_DELETE_EMPTYDIR_DATA
            value: "false"
          - name: OPERATOR_NAMESPACE
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
          - name: DRIVER_CONFIG_DIGEST
            value: "2561234486"
          image: nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.9.1
          imagePullPolicy: IfNotPresent
          name: k8s-driver-manager
          resources: {}
          securityContext:
            privileged: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /run/nvidia
            mountPropagation: Bidirectional
            name: run-nvidia
          - mountPath: /host
            mountPropagation: HostToContainer
            name: host-root
            readOnly: true
          - mountPath: /sys
            name: host-sys
          - mountPath: /run/mellanox/drivers
            mountPropagation: HostToContainer
            name: run-mellanox-drivers
        nodeSelector:
          nvidia.com/gpu.deploy.driver: "true"
        priorityClassName: system-node-critical
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: nvidia-driver
        serviceAccountName: nvidia-driver
        terminationGracePeriodSeconds: 30
        tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
        volumes:
        - hostPath:
            path: /run/nvidia
            type: DirectoryOrCreate
          name: run-nvidia
        - hostPath:
            path: /var/log
            type: ""
          name: var-log
        - hostPath:
            path: /dev/log
            type: ""
          name: dev-log
        - hostPath:
            path: /etc/os-release
            type: ""
          name: host-os-release
        - hostPath:
            path: /run/nvidia-fabricmanager
            type: DirectoryOrCreate
          name: run-nvidia-fabricmanager
        - hostPath:
            path: /run/nvidia-topologyd
            type: DirectoryOrCreate
          name: run-nvidia-topologyd
        - hostPath:
            path: /run/mellanox/drivers/usr/src
            type: DirectoryOrCreate
          name: mlnx-ofed-usr-src
        - hostPath:
            path: /run/mellanox/drivers
            type: DirectoryOrCreate
          name: run-mellanox-drivers
        - hostPath:
            path: /run/nvidia/validations
            type: DirectoryOrCreate
          name: run-nvidia-validations
        - hostPath:
            path: /
            type: ""
          name: host-root
        - hostPath:
            path: /sys
            type: Directory
          name: host-sys
        - hostPath:
            path: /sys/module/firmware_class/parameters/path
            type: ""
          name: firmware-search-path
        - hostPath:
            path: /sys/devices/system/memory/auto_online_blocks
            type: ""
          name: sysfs-memory-online
        - hostPath:
            path: /run/nvidia/driver/lib/firmware
            type: DirectoryOrCreate
          name: nv-firmware
        - configMap:
            defaultMode: 493
            name: nvidia-driver-startup-probe
          name: driver-startup-probe-script
kind: ControllerRevision
metadata:
  annotations:
    deprecated.daemonset.template.generation: "2"
    nvidia.com/last-applied-hash: "644779074"
    openshift.io/scc: nvidia-driver
  creationTimestamp: "2026-02-25T17:01:29Z"
  labels:
    app: nvidia-driver-daemonset
    app.kubernetes.io/component: nvidia-driver
    app.kubernetes.io/managed-by: gpu-operator
    controller-revision-hash: 595485d69c
    helm.sh/chart: gpu-operator-v1.0.0-devel
    nvidia.com/precompiled: "false"
  name: nvidia-driver-daemonset-595485d69c
  namespace: gpu-operator
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: nvidia-driver-daemonset
    uid: 6bbb3137-1e80-41ca-89c8-95a0e94e5a8d
  resourceVersion: "223952"
  uid: 6593292a-c795-47e5-bda4-00e257630897
revision: 2
---
apiVersion: apps/v1
data:
  spec:
    template:
      $patch: replace
      metadata:
        annotations:
          kubectl.kubernetes.io/default-container: nvidia-driver-ctr
        labels:
          app: nvidia-driver-daemonset
          app.kubernetes.io/component: nvidia-driver
          app.kubernetes.io/managed-by: gpu-operator
          helm.sh/chart: gpu-operator-v1.0.0-devel
          nvidia.com/precompiled: "false"
      spec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/component
                  operator: In
                  values:
                  - nvidia-driver
              topologyKey: kubernetes.io/hostname
        containers:
        - args:
          - init
          command:
          - nvidia-driver
          env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: spec.nodeName
          - name: NODE_IP
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: status.hostIP
          - name: KERNEL_MODULE_TYPE
            value: auto
          - name: DRIVER_CONFIG_DIGEST
            value: "889101604"
          image: nvcr.io/nvidia/driver:580.126.16-ubuntu24.04
          imagePullPolicy: IfNotPresent
          lifecycle:
            preStop:
              exec:
                command:
                - /bin/sh
                - -c
                - rm -f /run/nvidia/validations/.driver-ctr-ready
          name: nvidia-driver-ctr
          resources: {}
          securityContext:
            privileged: true
            seLinuxOptions:
              level: s0
          startupProbe:
            exec:
              command:
              - sh
              - /usr/local/bin/startup-probe.sh
            failureThreshold: 120
            initialDelaySeconds: 60
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 60
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /run/nvidia
            mountPropagation: Bidirectional
            name: run-nvidia
          - mountPath: /run/nvidia-fabricmanager
            name: run-nvidia-fabricmanager
          - mountPath: /run/nvidia-topologyd
            name: run-nvidia-topologyd
          - mountPath: /var/log
            name: var-log
          - mountPath: /dev/log
            name: dev-log
          - mountPath: /host-etc/os-release
            name: host-os-release
            readOnly: true
          - mountPath: /run/mellanox/drivers/usr/src
            mountPropagation: HostToContainer
            name: mlnx-ofed-usr-src
          - mountPath: /run/mellanox/drivers
            mountPropagation: HostToContainer
            name: run-mellanox-drivers
          - mountPath: /sys/devices/system/memory/auto_online_blocks
            name: sysfs-memory-online
          - mountPath: /sys/module/firmware_class/parameters/path
            name: firmware-search-path
          - mountPath: /lib/firmware
            name: nv-firmware
          - mountPath: /usr/local/bin/startup-probe.sh
            name: driver-startup-probe-script
            subPath: startup-probe.sh
        dnsPolicy: ClusterFirst
        hostPID: true
        initContainers:
        - args:
          - uninstall_driver
          command:
          - driver-manager
          env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: spec.nodeName
          - name: NVIDIA_VISIBLE_DEVICES
            value: void
          - name: ENABLE_GPU_POD_EVICTION
            value: "true"
          - name: ENABLE_AUTO_DRAIN
            value: "false"
          - name: DRAIN_USE_FORCE
            value: "false"
          - name: DRAIN_POD_SELECTOR_LABEL
          - name: DRAIN_TIMEOUT_SECONDS
            value: 0s
          - name: DRAIN_DELETE_EMPTYDIR_DATA
            value: "false"
          - name: OPERATOR_NAMESPACE
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
          - name: DRIVER_CONFIG_DIGEST
            value: "889101604"
          image: nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.9.1
          imagePullPolicy: IfNotPresent
          name: k8s-driver-manager
          resources: {}
          securityContext:
            privileged: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /run/nvidia
            mountPropagation: Bidirectional
            name: run-nvidia
          - mountPath: /host
            mountPropagation: HostToContainer
            name: host-root
            readOnly: true
          - mountPath: /sys
            name: host-sys
          - mountPath: /run/mellanox/drivers
            mountPropagation: HostToContainer
            name: run-mellanox-drivers
        nodeSelector:
          nvidia.com/gpu.deploy.driver: "true"
        priorityClassName: system-node-critical
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: nvidia-driver
        serviceAccountName: nvidia-driver
        terminationGracePeriodSeconds: 30
        tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
        volumes:
        - hostPath:
            path: /run/nvidia
            type: DirectoryOrCreate
          name: run-nvidia
        - hostPath:
            path: /var/log
            type: ""
          name: var-log
        - hostPath:
            path: /dev/log
            type: ""
          name: dev-log
        - hostPath:
            path: /etc/os-release
            type: ""
          name: host-os-release
        - hostPath:
            path: /run/nvidia-fabricmanager
            type: DirectoryOrCreate
          name: run-nvidia-fabricmanager
        - hostPath:
            path: /run/nvidia-topologyd
            type: DirectoryOrCreate
          name: run-nvidia-topologyd
        - hostPath:
            path: /run/mellanox/drivers/usr/src
            type: DirectoryOrCreate
          name: mlnx-ofed-usr-src
        - hostPath:
            path: /run/mellanox/drivers
            type: DirectoryOrCreate
          name: run-mellanox-drivers
        - hostPath:
            path: /run/nvidia/validations
            type: DirectoryOrCreate
          name: run-nvidia-validations
        - hostPath:
            path: /
            type: ""
          name: host-root
        - hostPath:
            path: /sys
            type: Directory
          name: host-sys
        - hostPath:
            path: /sys/module/firmware_class/parameters/path
            type: ""
          name: firmware-search-path
        - hostPath:
            path: /sys/devices/system/memory/auto_online_blocks
            type: ""
          name: sysfs-memory-online
        - hostPath:
            path: /run/nvidia/driver/lib/firmware
            type: DirectoryOrCreate
          name: nv-firmware
        - configMap:
            defaultMode: 493
            name: nvidia-driver-startup-probe
          name: driver-startup-probe-script
kind: ControllerRevision
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
    nvidia.com/last-applied-hash: "2199474554"
    openshift.io/scc: nvidia-driver
  creationTimestamp: "2026-02-24T21:53:40Z"
  labels:
    app: nvidia-driver-daemonset
    app.kubernetes.io/component: nvidia-driver
    app.kubernetes.io/managed-by: gpu-operator
    controller-revision-hash: 5f764b96ff
    helm.sh/chart: gpu-operator-v1.0.0-devel
    nvidia.com/precompiled: "false"
  name: nvidia-driver-daemonset-5f764b96ff
  namespace: gpu-operator
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: nvidia-driver-daemonset
    uid: 6bbb3137-1e80-41ca-89c8-95a0e94e5a8d
  resourceVersion: "29143"
  uid: aef86a3c-0bf0-476a-bb77-7a484805617a
revision: 1
---

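The two revisions above differ mainly in the driver image (nvcr.io/nvidia/driver:580.126.16-ubuntu24.04 in revision 1 vs ghcr.io/nvidia/driver:f01cb133-580.126.20-ubuntu24.04 in revision 2) and the DRIVER_CONFIG_DIGEST values, which is exactly the before/after picture this artifact is meant to capture. A usage sketch for comparing two collected revisions (not part of the PR itself):

kubectl get controllerrevision -n gpu-operator nvidia-driver-daemonset-5f764b96ff -o yaml > rev1.yaml
kubectl get controllerrevision -n gpu-operator nvidia-driver-daemonset-595485d69c -o yaml > rev2.yaml
diff rev1.yaml rev2.yaml   # surfaces the image and config-digest changes
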
events_operator_namespace.log

LAST SEEN   TYPE      REASON                   OBJECT                                         MESSAGE
5m29s       Normal    Killing                  pod/nvidia-driver-daemonset-cnvjc              Stopping container nvidia-driver-ctr
4m58s       Normal    Scheduled                pod/nvidia-driver-daemonset-4vvpv              Successfully assigned gpu-operator/nvidia-driver-daemonset-4vvpv to ipp1-2890
4m58s       Normal    SuccessfulCreate         daemonset/nvidia-driver-daemonset              Created pod: nvidia-driver-daemonset-4vvpv
4m58s       Normal    Created                  pod/nvidia-driver-daemonset-4vvpv              Created container: k8s-driver-manager
4m58s       Normal    Pulled                   pod/nvidia-driver-daemonset-4vvpv              Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.9.1" already present on machine
4m57s       Normal    Killing                  pod/nvidia-mig-manager-6gptg                   Stopping container nvidia-mig-manager
4m57s       Normal    Killing                  pod/nvidia-device-plugin-daemonset-bfmqt       Stopping container nvidia-device-plugin
4m57s       Warning   FailedKillPod            pod/nvidia-device-plugin-daemonset-bfmqt       error killing pod: [failed to "KillContainer" for "nvidia-device-plugin" with KillContainerError: "rpc error: code = Unavailable desc = error reading from server: EOF", failed to "KillPodSandbox" for "d8a700cc-d3d1-4eb0-9809-d130acc6b3f7" with KillPodSandboxError: "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused\""]
4m57s       Normal    SuccessfulDelete         daemonset/nvidia-device-plugin-daemonset       Deleted pod: nvidia-device-plugin-daemonset-bfmqt
4m57s       Normal    Killing                  pod/gpu-feature-discovery-82tf4                Stopping container gpu-feature-discovery
4m57s       Normal    Killing                  pod/nvidia-dcgm-exporter-9hqf8                 Stopping container nvidia-dcgm-exporter
4m57s       Warning   FailedKillPod            pod/gpu-feature-discovery-82tf4                error killing pod: [failed to "KillContainer" for "gpu-feature-discovery" with KillContainerError: "rpc error: code = Unavailable desc = error reading from server: EOF", failed to "KillPodSandbox" for "26aeac6e-5b72-4927-bfb1-2a4baf20ebf3" with KillPodSandboxError: "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused\""]
4m57s       Normal    SuccessfulDelete         daemonset/nvidia-operator-validator            Deleted pod: nvidia-operator-validator-hxl5d
4m57s       Warning   FailedKillPod            pod/nvidia-operator-validator-hxl5d            error killing pod: [failed to "KillContainer" for "nvidia-operator-validator" with KillContainerError: "rpc error: code = Unavailable desc = error reading from server: EOF", failed to "KillPodSandbox" for "9b9e7b94-a3a3-4666-9535-f9ded7081240" with KillPodSandboxError: "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused\""]
4m57s       Warning   FailedKillPod            pod/nvidia-mig-manager-6gptg                   error killing pod: [failed to "KillContainer" for "nvidia-mig-manager" with KillContainerError: "rpc error: code = Unavailable desc = error reading from server: EOF", failed to "KillPodSandbox" for "6d54b2ab-3dbb-413a-b143-0b980dee8e2e" with KillPodSandboxError: "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused\""]
4m57s       Normal    SuccessfulDelete         daemonset/nvidia-container-toolkit-daemonset   Deleted pod: nvidia-container-toolkit-daemonset-kztx4
4m57s       Warning   FailedKillPod            pod/nvidia-container-toolkit-daemonset-kztx4   error killing pod: [failed to "KillContainer" for "nvidia-container-toolkit-ctr" with KillContainerError: "rpc error: code = Unavailable desc = error reading from server: EOF", failed to "KillPodSandbox" for "49b35cf7-efbb-41a1-a0c7-4dfc44944269" with KillPodSandboxError: "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused\""]
4m57s       Normal    SuccessfulDelete         daemonset/gpu-feature-discovery                Deleted pod: gpu-feature-discovery-82tf4
4m57s       Normal    Killing                  pod/nvidia-container-toolkit-daemonset-kztx4   Stopping container nvidia-container-toolkit-ctr
4m57s       Normal    Started                  pod/nvidia-driver-daemonset-4vvpv              Started container k8s-driver-manager
4m57s       Warning   FailedKillPod            pod/nvidia-dcgm-exporter-9hqf8                 error killing pod: [failed to "KillContainer" for "nvidia-dcgm-exporter" with KillContainerError: "rpc error: code = Unavailable desc = error reading from server: EOF", failed to "KillPodSandbox" for "e95ab621-e792-4b74-a9f5-4b024d88ff3c" with KillPodSandboxError: "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused\""]
4m57s       Normal    SuccessfulDelete         daemonset/nvidia-dcgm-exporter                 Deleted pod: nvidia-dcgm-exporter-9hqf8
4m57s       Normal    SuccessfulDelete         daemonset/nvidia-mig-manager                   Deleted pod: nvidia-mig-manager-6gptg
4m46s       Normal    Killing                  pod/nvidia-operator-validator-hxl5d            Stopping container nvidia-operator-validator
4m46s       Warning   FailedPreStopHook        pod/nvidia-operator-validator-hxl5d            PreStopHook failed
4m10s       Normal    SuccessfulCreate         daemonset/nvidia-container-toolkit-daemonset   Created pod: nvidia-container-toolkit-daemonset-88r4q
4m10s       Normal    SuccessfulCreate         daemonset/gpu-feature-discovery                Created pod: gpu-feature-discovery-bptk4
4m10s       Normal    SuccessfulCreate         daemonset/nvidia-operator-validator            Created pod: nvidia-operator-validator-l4754
4m10s       Normal    SuccessfulCreate         daemonset/nvidia-mig-manager                   Created pod: nvidia-mig-manager-nzmhb
4m10s       Normal    SuccessfulCreate         daemonset/nvidia-dcgm-exporter                 Created pod: nvidia-dcgm-exporter-qd6vj
4m10s       Normal    Scheduled                pod/gpu-feature-discovery-bptk4                Successfully assigned gpu-operator/gpu-feature-discovery-bptk4 to ipp1-2890
4m10s       Normal    Scheduled                pod/nvidia-dcgm-exporter-qd6vj                 Successfully assigned gpu-operator/nvidia-dcgm-exporter-qd6vj to ipp1-2890
4m10s       Normal    Scheduled                pod/nvidia-mig-manager-nzmhb                   Successfully assigned gpu-operator/nvidia-mig-manager-nzmhb to ipp1-2890
4m10s       Normal    Scheduled                pod/nvidia-container-toolkit-daemonset-88r4q   Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-88r4q to ipp1-2890
4m10s       Normal    Scheduled                pod/nvidia-device-plugin-daemonset-2nc7v       Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-2nc7v to ipp1-2890
4m10s       Normal    Scheduled                pod/nvidia-operator-validator-l4754            Successfully assigned gpu-operator/nvidia-operator-validator-l4754 to ipp1-2890
4m10s       Normal    SuccessfulCreate         daemonset/nvidia-device-plugin-daemonset       Created pod: nvidia-device-plugin-daemonset-2nc7v
4m9s        Normal    Pulled                   pod/nvidia-container-toolkit-daemonset-88r4q   Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
4m9s        Normal    Started                  pod/nvidia-container-toolkit-daemonset-88r4q   Started container driver-validation
4m9s        Warning   FailedCreatePodSandBox   pod/nvidia-dcgm-exporter-qd6vj                 Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "5729b1ca55917b4d3c11159a090a9c35b183c04160b2fda91d20bb9dfd83517c": no runtime for "nvidia" is configured
4m9s        Warning   FailedCreatePodSandBox   pod/nvidia-operator-validator-l4754            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "30878c59ae450e4fc8cb342d75bcedea57de251a305422d867d8cc8e6e67df1a": no runtime for "nvidia" is configured
4m9s        Warning   FailedCreatePodSandBox   pod/nvidia-mig-manager-nzmhb                   Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "56b7a33fa7af6aa2fdd1fc49b2abe7c30d1281d27847c687ada0b3ab782c3ce3": no runtime for "nvidia" is configured
4m9s        Warning   FailedCreatePodSandBox   pod/nvidia-device-plugin-daemonset-2nc7v       Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "4cb4fc985a472424da62f6e4bbb1a80745ec327687007548167ac378d728a656": no runtime for "nvidia" is configured
4m9s        Normal    Pulling                  pod/nvidia-driver-daemonset-4vvpv              Pulling image "ghcr.io/nvidia/driver:f01cb133-580.126.20-ubuntu24.04"
4m9s        Normal    Created                  pod/nvidia-container-toolkit-daemonset-88r4q   Created container: driver-validation
4m9s        Warning   FailedCreatePodSandBox   pod/gpu-feature-discovery-bptk4                Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "843c465bdeff33802b15700b901784eabfdb62d4eef56d0cc2236f6b1810edb9": no runtime for "nvidia" is configured
3m58s       Warning   FailedCreatePodSandBox   pod/nvidia-dcgm-exporter-qd6vj                 Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "96992bd3c5ed51d47d3dbcacb1199627d913b8e94dd9a26ebc74231d01a91579": no runtime for "nvidia" is configured
3m58s       Warning   FailedCreatePodSandBox   pod/nvidia-mig-manager-nzmhb                   Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "bb140ecadc3e968d82540e7ad6e4c083949f36cedcac57072279d21e9fe607fa": no runtime for "nvidia" is configured
3m57s       Warning   FailedCreatePodSandBox   pod/nvidia-operator-validator-l4754            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "c19cfcfde54787f7e1487601c091fd886c467de6fba51caa6d08c0fc5c33d389": no runtime for "nvidia" is configured
3m57s       Warning   FailedCreatePodSandBox   pod/nvidia-device-plugin-daemonset-2nc7v       Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "18eb5881e5357a9175a7bdec8619fee7566d71501b1944add480b01cfba4300e": no runtime for "nvidia" is configured
3m56s       Normal    Started                  pod/nvidia-driver-daemonset-4vvpv              Started container nvidia-driver-ctr
3m56s       Normal    Pulled                   pod/nvidia-driver-daemonset-4vvpv              Successfully pulled image "ghcr.io/nvidia/driver:f01cb133-580.126.20-ubuntu24.04" in 13.031s (13.031s including waiting). Image size: 650704528 bytes.
3m56s       Normal    Created                  pod/nvidia-driver-daemonset-4vvpv              Created container: nvidia-driver-ctr
3m54s       Warning   FailedCreatePodSandBox   pod/gpu-feature-discovery-bptk4                Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "b91a59cf9f5907c2f14355c786f8af294647efc335518f489edfeb63361df063": no runtime for "nvidia" is configured
3m46s       Warning   FailedCreatePodSandBox   pod/nvidia-mig-manager-nzmhb                   Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "271912cad005df1c7ec3b411f2fa39933950b6877feeae9c0fdba9751750fb7e": no runtime for "nvidia" is configured
3m45s       Warning   FailedCreatePodSandBox   pod/nvidia-operator-validator-l4754            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "18073b779521b77e3cbc7752dd5c2f63064efe24cec18807e917cd29f31fdf50": no runtime for "nvidia" is configured
3m44s       Warning   FailedCreatePodSandBox   pod/nvidia-dcgm-exporter-qd6vj                 Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "7def1d2c1efd1e16de5eeac923cc4834053b10299bb6eb2a08f5e9dd3e0ebd80": no runtime for "nvidia" is configured
3m43s       Warning   FailedCreatePodSandBox   pod/gpu-feature-discovery-bptk4                Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "fa407f3dc7baf4bf0d01bebb5b4594220f3f03ddde0887ffa21d17c20665f7e9": no runtime for "nvidia" is configured
3m43s       Warning   FailedCreatePodSandBox   pod/nvidia-device-plugin-daemonset-2nc7v       Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "dcd9f871b25e882d16fb4189b4ac0adca3f0240d56fe8d8e1af418b551d55fe7": no runtime for "nvidia" is configured
3m32s       Warning   FailedCreatePodSandBox   pod/nvidia-device-plugin-daemonset-2nc7v       Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "0625a5ba7cb7741a35bb84158e49d5ee4feb5c10d21badb1d16f3e8759e58236": no runtime for "nvidia" is configured
3m32s       Warning   FailedCreatePodSandBox   pod/nvidia-mig-manager-nzmhb                   Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "4b2ed2951b7cb990dd86c18a880b2abc62df41fbda741adddbbb36ba53b75250": no runtime for "nvidia" is configured
3m32s       Warning   FailedCreatePodSandBox   pod/nvidia-dcgm-exporter-qd6vj                 Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "482c2bf68a62615b8239063d6401934bf5b3676d8a67d98caf4b4c930488d8d2": no runtime for "nvidia" is configured
3m32s       Warning   FailedCreatePodSandBox   pod/gpu-feature-discovery-bptk4                Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "b2643509f1881ff948c237641d983c0b0530e674105e1fd3816bf927e61fc810": no runtime for "nvidia" is configured
3m30s       Warning   FailedCreatePodSandBox   pod/nvidia-operator-validator-l4754            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "a56d4da9a13253347d46306f9db2a0b59a57eb1b1b0667fcfee3e3554a72250f": no runtime for "nvidia" is configured
3m21s       Warning   FailedCreatePodSandBox   pod/nvidia-dcgm-exporter-qd6vj                 Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "aa470629532628fd1e5b7b74c85c921619ee77331d246bde53157a620cdc8cea": no runtime for "nvidia" is configured
3m20s       Warning   FailedCreatePodSandBox   pod/nvidia-mig-manager-nzmhb                   Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "2d510832137ce3b532b2f348e37f4e43b3e114c3f1a17a510273206bafd74825": no runtime for "nvidia" is configured
3m18s       Warning   FailedCreatePodSandBox   pod/gpu-feature-discovery-bptk4                Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "6a3e2b1ccbf9410c6fb9ede96c9b8162b73415f63b24c7940dececad6ffbd6d7": no runtime for "nvidia" is configured
3m17s       Warning   FailedCreatePodSandBox   pod/nvidia-device-plugin-daemonset-2nc7v       Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "3b68f4bac3b9492ad2ba74e5e1102166b658d01b8bcf3a436521ff8dca190f46": no runtime for "nvidia" is configured
3m17s       Warning   FailedCreatePodSandBox   pod/nvidia-operator-validator-l4754            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "a761d723c91c7301ecbde9d345d95178126b47e95897fdd05714286939fa40d5": no runtime for "nvidia" is configured
3m8s        Warning   FailedCreatePodSandBox   pod/nvidia-dcgm-exporter-qd6vj                 Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "68dc17c9b5938509ee1378e7382b9187c2c3ededcf67df7aab99c3bfca58e5ef": no runtime for "nvidia" is configured
3m6s        Warning   FailedCreatePodSandBox   pod/nvidia-operator-validator-l4754            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "2ec8e1ebfac4c41384342ee5428e095615ff3241a3b4992cb5e94b7a85363b3a": no runtime for "nvidia" is configured
3m6s        Warning   FailedCreatePodSandBox   pod/nvidia-mig-manager-nzmhb                   Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "f141bfdbefb6197b61c951ddcdd66952ea9234520b7baa085dec905b000ad32c": no runtime for "nvidia" is configured
3m4s        Warning   FailedCreatePodSandBox   pod/gpu-feature-discovery-bptk4                Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "f176eb9c95bf854b269237a62768479acc165999962d4420ec00bc6418c7d61b": no runtime for "nvidia" is configured
3m2s        Warning   FailedCreatePodSandBox   pod/nvidia-device-plugin-daemonset-2nc7v       Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "0664ada1bb4c5b3ae81e5a914f621d8242f90903db8cfd37df53982deb88bd1b": no runtime for "nvidia" is configured
2m55s       Warning   FailedCreatePodSandBox   pod/nvidia-dcgm-exporter-qd6vj                 Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "a5b1e58fbff7a57c494bf8d022252ba7d195fe4cb86134584bdf8045198d03ad": no runtime for "nvidia" is configured
2m52s       Warning   FailedCreatePodSandBox   pod/gpu-feature-discovery-bptk4                Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "2cdc7fa02cb3cceaca8bbe66ae2e1296380cce749848dca9733fe225ac87d5ec": no runtime for "nvidia" is configured
2m52s       Warning   FailedCreatePodSandBox   pod/nvidia-mig-manager-nzmhb                   Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "06335bc09cd5c1047f70fdf5f82ecf5ac09a1ddf0e17e889512ecb2f5283f4e5": no runtime for "nvidia" is configured
2m52s       Warning   FailedCreatePodSandBox   pod/nvidia-operator-validator-l4754            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "643fcb4a00d2aace8b59daaec3e2a618e685adc6bb0d19303ff3dfe7b2071af2": no runtime for "nvidia" is configured
2m49s       Warning   FailedCreatePodSandBox   pod/nvidia-device-plugin-daemonset-2nc7v       Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "9ed65668901d22444bc649c98d2cf45b1135c8b5e962fc0ccdeef2f79d9df994": no runtime for "nvidia" is configured
2m41s       Normal    Started                  pod/nvidia-container-toolkit-daemonset-88r4q   Started container nvidia-container-toolkit-ctr
2m41s       Normal    Created                  pod/nvidia-container-toolkit-daemonset-88r4q   Created container: nvidia-container-toolkit-ctr
2m41s       Normal    Pulled                   pod/nvidia-container-toolkit-daemonset-88r4q   Container image "nvcr.io/nvidia/k8s/container-toolkit:v1.19.0-rc.4" already present on machine
2m40s       Warning   FailedCreatePodSandBox   pod/nvidia-dcgm-exporter-qd6vj                 Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "d55d903bc7c1648f6d9a3c5a2fb830bce5d6c035a195a1e56d1950c68f39f502": no runtime for "nvidia" is configured
2m40s       Warning   FailedCreatePodSandBox   pod/nvidia-mig-manager-nzmhb                   Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "7e7b3839257e9e74648bab62351d222f9bc96e440ffa871476716a7cae923dd3": no runtime for "nvidia" is configured
2m39s       Warning   FailedCreatePodSandBox   pod/nvidia-operator-validator-l4754            Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "7d5b7941ff549c33ed8c57a2f5d09f34ec0832a6f8c6b6c3fa2f441e7c9c58da": no runtime for "nvidia" is configured
2m38s       Warning   FailedCreatePodSandBox   pod/gpu-feature-discovery-bptk4                Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "9c5101a30b192d4636f32c544760b64388d644b25ad8425b14297081cf6d3cc1": no runtime for "nvidia" is configured
2m37s       Warning   FailedCreatePodSandBox   pod/nvidia-device-plugin-daemonset-2nc7v       Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "1f11decc9f748393be2afca14cf3acab7a23697578c6d464af6c80e409737227": no runtime for "nvidia" is configured
2m26s       Normal    Started                  pod/gpu-feature-discovery-bptk4                Started container toolkit-validation
2m26s       Normal    Created                  pod/gpu-feature-discovery-bptk4                Created container: toolkit-validation
2m26s       Normal    Pulled                   pod/gpu-feature-discovery-bptk4                Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m25s       Normal    Created                  pod/nvidia-dcgm-exporter-qd6vj                 Created container: toolkit-validation
2m25s       Normal    Pulled                   pod/nvidia-dcgm-exporter-qd6vj                 Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m25s       Normal    Started                  pod/nvidia-dcgm-exporter-qd6vj                 Started container toolkit-validation
2m25s       Normal    Pulled                   pod/nvidia-mig-manager-nzmhb                   Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m25s       Normal    Created                  pod/nvidia-mig-manager-nzmhb                   Created container: toolkit-validation
2m25s       Normal    Started                  pod/nvidia-mig-manager-nzmhb                   Started container toolkit-validation
2m24s       Normal    Pulled                   pod/nvidia-operator-validator-l4754            Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m24s       Normal    Started                  pod/nvidia-operator-validator-l4754            Started container driver-validation
2m24s       Normal    Created                  pod/nvidia-operator-validator-l4754            Created container: driver-validation
2m23s       Normal    Created                  pod/nvidia-operator-validator-l4754            Created container: toolkit-validation
2m23s       Normal    Pulled                   pod/nvidia-operator-validator-l4754            Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m22s       Normal    Pulled                   pod/nvidia-operator-validator-l4754            Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m22s       Normal    Created                  pod/nvidia-operator-validator-l4754            Created container: cuda-validation
2m22s       Normal    Started                  pod/nvidia-operator-validator-l4754            Started container toolkit-validation
2m21s       Normal    Created                  pod/nvidia-device-plugin-daemonset-2nc7v       Created container: toolkit-validation
2m21s       Normal    Created                  pod/gpu-feature-discovery-bptk4                Created container: gpu-feature-discovery
2m21s       Normal    Started                  pod/nvidia-cuda-validator-6p98l                Started container cuda-validation
2m21s       Normal    Started                  pod/nvidia-operator-validator-l4754            Started container cuda-validation
2m21s       Normal    Pulled                   pod/gpu-feature-discovery-bptk4                Container image "nvcr.io/nvidia/k8s-device-plugin:v0.18.2" already present on machine
2m21s       Normal    Pulled                   pod/nvidia-cuda-validator-6p98l                Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m21s       Normal    Pulled                   pod/nvidia-device-plugin-daemonset-2nc7v       Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m21s       Normal    Created                  pod/nvidia-cuda-validator-6p98l                Created container: cuda-validation
2m21s       Normal    Created                  pod/nvidia-device-plugin-daemonset-2nc7v       Created container: nvidia-device-plugin
2m21s       Normal    Pulled                   pod/nvidia-device-plugin-daemonset-2nc7v       Container image "nvcr.io/nvidia/k8s-device-plugin:v0.18.2" already present on machine
2m21s       Normal    Started                  pod/nvidia-device-plugin-daemonset-2nc7v       Started container toolkit-validation
2m20s       Normal    Pulled                   pod/nvidia-cuda-validator-6p98l                Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m20s       Normal    Created                  pod/nvidia-cuda-validator-6p98l                Created container: nvidia-cuda-validator
2m20s       Normal    Created                  pod/nvidia-mig-manager-nzmhb                   Created container: nvidia-mig-manager
2m20s       Normal    Pulled                   pod/nvidia-mig-manager-nzmhb                   Container image "ghcr.io/nvidia/k8s-mig-manager:315c447c" already present on machine
2m20s       Normal    Started                  pod/nvidia-device-plugin-daemonset-2nc7v       Started container nvidia-device-plugin
2m20s       Normal    Created                  pod/nvidia-dcgm-exporter-qd6vj                 Created container: nvidia-dcgm-exporter
2m20s       Normal    Pulled                   pod/nvidia-dcgm-exporter-qd6vj                 Container image "nvcr.io/nvidia/k8s/dcgm-exporter:4.5.1-4.8.0-distroless" already present on machine
2m20s       Normal    Started                  pod/gpu-feature-discovery-bptk4                Started container gpu-feature-discovery
2m19s       Normal    Started                  pod/nvidia-dcgm-exporter-qd6vj                 Started container nvidia-dcgm-exporter
2m19s       Normal    Started                  pod/nvidia-cuda-validator-6p98l                Started container nvidia-cuda-validator
2m19s       Normal    Started                  pod/nvidia-mig-manager-nzmhb                   Started container nvidia-mig-manager
2m16s       Normal    Pulled                   pod/nvidia-operator-validator-l4754            Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m15s       Normal    Started                  pod/nvidia-operator-validator-l4754            Started container plugin-validation
2m15s       Normal    Created                  pod/nvidia-operator-validator-l4754            Created container: plugin-validation
2m14s       Normal    Pulled                   pod/nvidia-operator-validator-l4754            Container image "docker.io/ragasthya852/gpu-operator:d2d8b1eca-1771971963256" already present on machine
2m14s       Normal    Created                  pod/nvidia-operator-validator-l4754            Created container: nvidia-operator-validator
2m14s       Normal    Started                  pod/nvidia-operator-validator-l4754            Started container nvidia-operator-validator

rajathagasthya force-pushed the must-gather-upgrade-diagnostics branch from ee7641e to 163bf6b on February 25, 2026 at 17:59
rajathagasthya changed the title from "fix(must-gather): collect events, upgrade state, and controller revisions" to "Collect events and upgrade state in must-gather.sh" on February 25, 2026
rajathagasthya force-pushed the must-gather-upgrade-diagnostics branch 2 times, most recently from 9331ab2 to 34109c3 on February 25, 2026 at 18:10
rajathagasthya marked this pull request as ready for review on February 25, 2026 at 18:16
rajathagasthya force-pushed the must-gather-upgrade-diagnostics branch 2 times, most recently from 2ab7a63 to 83404d9 on February 25, 2026 at 23:08
rajathagasthya force-pushed the must-gather-upgrade-diagnostics branch from 83404d9 to 4b74cb0 on February 26, 2026 at 19:41
* Collect Kubernetes events in operator namespace.
* Collect per-GPU-node upgrade state (annotations, labels, cordon
  status, node events).
* Collect controller revisions for driver and other operand DaemonSets.

Signed-off-by: Rajath Agasthya <ragasthya@nvidia.com>
rajathagasthya force-pushed the must-gather-upgrade-diagnostics branch from 4b74cb0 to ac37a84 on February 26, 2026 at 22:10