[develop][E2E Test] Add E2E test for Slurm 25.11 expedited requeue mode feature #7211

Closed
hehe7318 wants to merge 13 commits into aws:develop from hehe7318:wip/add-test-for-expedited-requeue-mode
hehe7318 commented Jan 27, 2026

Description of changes

Add test_expedited_requeue in test_slurm.py to validate that jobs submitted with --requeue=expedite are treated as highest priority after ICE recovery.

  • The test simulates a recoverable ICE via create_fleet_overrides.json (instead of the permanent overrides.py approach used by test_fast_capacity_failover).
  • Trigger ICE: write create_fleet_overrides.json with invalid InstanceTypes → create_fleet returns no instances → InsufficientInstanceCapacity → nodes go down.
  • Recover: change InstanceTypes back to real ones (t3.medium, c5.large) → next launch succeeds.
  • Verify that the expedited requeue job started before the normal job.
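
The trigger and recover steps above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names (`build_overrides`, `write_overrides`), the queue and compute resource names, and the nested `LaunchTemplateConfigs`/`Overrides` layout of the JSON are assumptions. Only the file name, the `ICE-` prefix idea, and the instance types come from the PR text.

```python
import json
from pathlib import Path

# Prefix that makes an instance type name invalid, so the launch finds no capacity.
ICE_PREFIX = "ICE-"


def build_overrides(queue, compute_resource, instance_types, simulate_ice):
    """Build the overrides dict for one compute resource.

    The nesting (queue -> compute resource -> LaunchTemplateConfigs) is an
    assumed layout for create_fleet_overrides.json, not taken from the PR.
    """
    types = [ICE_PREFIX + t if simulate_ice else t for t in instance_types]
    return {
        queue: {
            compute_resource: {
                "LaunchTemplateConfigs": [
                    {"Overrides": [{"InstanceType": t} for t in types]}
                ]
            }
        }
    }


def write_overrides(path, overrides):
    """Serialize the overrides dict to the given path (hypothetical helper)."""
    Path(path).write_text(json.dumps(overrides, indent=2))


# Hypothetical queue and compute resource names.
broken = build_overrides("queue1", "cr1", ["t3.medium", "c5.large"], simulate_ice=True)
recovered = build_overrides("queue1", "cr1", ["t3.medium", "c5.large"], simulate_ice=False)
```

Writing `broken` would correspond to the trigger step and writing `recovered` to the recovery step. In the real test the file lives on the head node, so the write would go through whatever remote command helper the test suite already uses.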

Tests

  • Running

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Extend test_fast_capacity_failover to validate the new --requeue=expedite
option introduced in Slurm 25.11.2. This feature allows batch jobs to
automatically requeue on node failure with highest priority.
hehe7318 requested review from a team as code owners January 27, 2026 14:56
hehe7318 added the 3.x label Jan 27, 2026
hehe7318 and others added 9 commits January 28, 2026 14:07
- Change job commands from simple 'sleep 30' to output hostname and
  timestamps, making it easier to verify job execution in output files
- Add --prefer option to job2 targeting the same compute resource as job1
- Increase job2 node request from 1 to 2 nodes to prevent it from
  immediately running on another CR before job1 requeues
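
As a rough sketch of the submissions this commit describes, the two jobs could be built as argument lists like the ones below. The helper name `sbatch_command`, the feature name `cr1`, the exact job script, and the assumption that job1 is the expedited job are all illustrative and not taken from the PR; `--requeue=expedite` is the option named in the description.

```python
def sbatch_command(script, nodes=1, expedite=False, prefer=None):
    """Build an sbatch argument list (hypothetical helper, not from the PR)."""
    cmd = ["sbatch", "-N", str(nodes)]
    if expedite:
        # Option named in the PR description (Slurm 25.11).
        cmd.append("--requeue=expedite")
    if prefer:
        # Soft preference for nodes that carry the given feature.
        cmd.append(f"--prefer={prefer}")
    cmd += ["--wrap", script]
    return cmd


# Assumed job body: print hostname and timestamps so output files show where
# and when the job ran, instead of a bare 'sleep 30'.
JOB_SCRIPT = 'echo "host=$(hostname) start=$(date +%s)"; sleep 30; echo "end=$(date +%s)"'

# Assumption: job1 is the expedited job; job2 asks for 2 nodes and prefers
# job1's compute resource ("cr1" is a placeholder feature name).
job1 = sbatch_command(JOB_SCRIPT, nodes=1, expedite=True)
job2 = sbatch_command(JOB_SCRIPT, nodes=2, prefer="cr1")
```

Requesting two nodes for job2 reflects the reasoning in the commit message: a one-node job could start right away on another compute resource before job1 is requeued.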
…er and use recoverable ICE simulation

Move _test_expedited_requeue_on_ice out of test_fast_capacity_failover into a standalone
test_expedited_requeue with its own cluster config (multi-instance-type CR using create_fleet).

Replace the unrecoverable overrides.py-based ICE simulation with create_fleet_overrides.json:
write invalid 'ICE-' prefixed InstanceTypes to trigger ICE, then change them back to real ones
(t3.medium, c5.xlarge) to recover. This allows verifying that after ICE recovery, the expedited
requeue job starts before a normal job submitted earlier.
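
The final ordering check could look something like the sketch below, which compares the `StartTime` fields of two `scontrol show job` outputs. The function names and the sample text are made up for illustration; how the actual test obtains start times is not shown in this PR page.

```python
import re
from datetime import datetime

# scontrol prints start times as StartTime=YYYY-MM-DDTHH:MM:SS.
_START_RE = re.compile(r"StartTime=(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")


def parse_start_time(scontrol_output):
    """Extract StartTime from `scontrol show job` text as a datetime."""
    match = _START_RE.search(scontrol_output)
    if match is None:
        # Covers jobs that have not started yet (StartTime=Unknown).
        raise ValueError("no concrete StartTime found in job output")
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")


def started_before(first_job_output, second_job_output):
    """Return True if the first job started strictly before the second."""
    return parse_start_time(first_job_output) < parse_start_time(second_job_output)


# Fabricated sample output, for illustration only.
expedited = "JobId=1 JobName=wrap\n   StartTime=2026-01-28T14:10:05 EndTime=Unknown"
normal = "JobId=2 JobName=wrap\n   StartTime=2026-01-28T14:12:30 EndTime=Unknown"
expedited_first = started_before(expedited, normal)
```

A strict comparison is used here so that two jobs starting in the same second do not count as a pass; whether that is the right threshold for the real test is a judgment call.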
hehe7318 added the skip-changelog-update (Disables the check that enforces changelog updates in PRs) label Feb 13, 2026
hehe7318 closed this Mar 4, 2026