[dcgm][dcgm-exporter] add liveness and readiness probes#2175

Merged
tariq1890 merged 1 commit into main from dcgm-probes
Feb 28, 2026

@tariq1890
Contributor

@tariq1890 tariq1890 commented Feb 27, 2026

This commit adds liveness and readiness probes to the dcgm and dcgm-exporter operands. Adding probes to the DCGM pods ensures that these pods aren't marked as "Ready" until DCGM is actually ready to serve traffic. The DCGM-Exporter probes have been taken from the default probes configured in the Helm chart of the NVIDIA/dcgm-exporter project.

The liveness and readiness probe values have been taken from the standalone DCGM Exporter helm chart. Please see here
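
For readers unfamiliar with how an HTTP readiness probe gates the "Ready" state, here is a minimal, self-contained Go sketch. It is not the code from this PR. The `/health` path, the one-second timeout, and the helper names `probeOnce`, `newHealthServer`, and `demo` are assumptions made for illustration, and the test server is a hypothetical stand-in for the exporter. The only behavior taken as given is the Kubernetes rule that an `httpGet` probe succeeds on any status code from 200 to 399.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// probeOnce performs a single HTTP GET and reports success using the same
// rule as a Kubernetes httpGet probe: status codes 200 through 399 pass.
func probeOnce(url string, timeout time.Duration) bool {
	client := &http.Client{
		Timeout: timeout,
		// Do not follow redirects, so a 3xx response is evaluated as-is.
		CheckRedirect: func(*http.Request, []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 400
}

// newHealthServer is a hypothetical stand-in for the probed container.
// Its /health endpoint (an assumed path) returns 503 until ready is set.
func newHealthServer(ready *atomic.Bool) *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return httptest.NewServer(mux)
}

// demo probes the stand-in server before and after it becomes ready.
func demo() (before, after bool) {
	var ready atomic.Bool
	srv := newHealthServer(&ready)
	defer srv.Close()

	before = probeOnce(srv.URL+"/health", time.Second)
	ready.Store(true) // simulate the service finishing its startup
	after = probeOnce(srv.URL+"/health", time.Second)
	return before, after
}

func main() {
	before, after := demo()
	fmt.Println("probe passes before startup completes:", before)
	fmt.Println("probe passes after startup completes:", after)
}
```

In a real cluster the kubelet runs this kind of check on a schedule, and a pod is only added to Service endpoints once its readiness probe passes.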

Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
@guptaNswati
Contributor

LGTM. @tariq1890, can you please link the default dcgm-exporter settings in the description and also share some test results?

CI seems to have a flake.

@tariq1890
Contributor Author

@guptaNswati Sure, I can share test results

> CI seems to have a flake.

The CI / coverage failure is due to an outage at coveralls.io, which seems to have lasted a week at this point. Our CI runs go test -cover and uploads the coverage to coveralls.io to track and display our project's unit test coverage.

The failure is non-blocking.

@glowkey glowkey left a comment

LGTM

@tariq1890
Contributor Author

Sharing screenshots of dcgm and dcgm-exporter working with this PR. These screenshots are from an OpenShift cluster. The PR e2e test suite covers the vanilla k8s + Ubuntu scenario.

[Screenshots: 2026-02-27 at 2:06:25 PM, 2:08:57 PM, and 2:09:13 PM]

@guptaNswati
Contributor

guptaNswati commented Feb 27, 2026

Thank you @tariq1890. It's already approved. I don't think I have approval rights.

@tariq1890 tariq1890 merged commit 09219be into main Feb 28, 2026
14 of 15 checks passed
@tariq1890 tariq1890 deleted the dcgm-probes branch February 28, 2026 00:15