
Try to use SLURM_STEP_GPUS for device list if CUDA_VISIBLE_DEVICES is not set#3577

Open
gabeweisz wants to merge 1 commit into AI-Hypercomputer:main from ROCm:gw_check_slurm_gpus

Conversation

@gabeweisz
Collaborator

Description

This makes it easier to use MaxText with GPUs under Slurm, since Slurm sets SLURM_STEP_GPUS automatically for each job step.

Fixes #3433
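A minimal sketch of the described fallback, assuming a hypothetical helper name (`gpu_device_list` is not from the PR; the actual MaxText code may structure this differently):

```python
import os

def gpu_device_list():
    """Return a comma-separated GPU device list, preferring CUDA_VISIBLE_DEVICES.

    Falls back to SLURM_STEP_GPUS, which Slurm sets automatically for each
    job step; returns None if neither variable is set.
    """
    devices = os.environ.get("CUDA_VISIBLE_DEVICES")
    if devices is None:
        devices = os.environ.get("SLURM_STEP_GPUS")
    return devices
```

The key design point is precedence: an explicitly set CUDA_VISIBLE_DEVICES always wins, so the Slurm-provided value is only consulted when the user has not configured devices themselves.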

Tests

Manually tested the change in three configurations: with only SLURM_STEP_GPUS set by Slurm, with both CUDA_VISIBLE_DEVICES and SLURM_STEP_GPUS set, and with neither set.
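The three test configurations can be reproduced with a shell sketch like the following; the precedence expression is illustrative only, not the MaxText implementation:

```shell
# Scenario 1: only SLURM_STEP_GPUS is set (typical inside an srun step).
unset CUDA_VISIBLE_DEVICES
export SLURM_STEP_GPUS=0,1
echo "${CUDA_VISIBLE_DEVICES:-$SLURM_STEP_GPUS}"   # prints 0,1 (fallback used)

# Scenario 2: both are set; CUDA_VISIBLE_DEVICES takes precedence.
export CUDA_VISIBLE_DEVICES=2,3
echo "${CUDA_VISIBLE_DEVICES:-$SLURM_STEP_GPUS}"   # prints 2,3

# Scenario 3: neither is set; the device list is empty.
unset CUDA_VISIBLE_DEVICES SLURM_STEP_GPUS
echo "${CUDA_VISIBLE_DEVICES:-${SLURM_STEP_GPUS:-}}"   # prints an empty line
```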

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@gabeweisz
Collaborator Author

@shralex - you asked me to look at this issue; please review the PR.

@codecov

codecov bot commented Apr 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@gabeweisz gabeweisz marked this pull request as draft April 6, 2026 13:21
@gabeweisz gabeweisz force-pushed the gw_check_slurm_gpus branch 2 times, most recently from 6f005ab to 7a1f19c Compare April 6, 2026 14:09
@gabeweisz gabeweisz marked this pull request as ready for review April 6, 2026 14:14
@gabeweisz gabeweisz marked this pull request as draft April 6, 2026 14:17
@gabeweisz gabeweisz force-pushed the gw_check_slurm_gpus branch from 7a1f19c to f1c56c3 Compare April 6, 2026 14:20
@gabeweisz gabeweisz marked this pull request as ready for review April 6, 2026 16:39
@gabeweisz
Collaborator Author

Tests have all passed; now it is really ready for review.


Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Distributed] hardware=gpu does not correctly configure process-per-node mode with Slurm

1 participant