[develop][E2E Test] Add E2E test for Slurm 25.11 expedited requeue mode feature by hehe7318 · Pull Request #7211 · aws/aws-parallelcluster

hehe7318 · 2026-01-27T14:56:24Z

Description of changes

Add test_expedited_requeue in test_slurm.py to validate that jobs submitted with --requeue=expedite are treated as highest priority after ICE recovery.

The test uses a recoverable ICE (InsufficientInstanceCapacity) simulation via create_fleet_overrides.json, instead of the permanent overrides.py approach used by test_fast_capacity_failover:

- Write create_fleet_overrides.json with invalid InstanceTypes → create_fleet returns no instances → InsufficientInstanceCapacity → nodes go down
- Recover by changing InstanceTypes back to real ones (t3.medium, c5.large) → next launch succeeds
- Verify that the expedited requeue job started before the normal job
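
The override toggle described above might be sketched as follows. This is a minimal illustration, not the code from the PR: the helper name, the queue and compute resource names, the launch template name, and the subnet ID are all hypothetical, and the queue → compute resource nesting is an assumption about how create_fleet_overrides.json is laid out. The inner keys follow the EC2 CreateFleet request shape, and the 'ICE-' prefix is the one mentioned in the commit messages.

```python
import json


def build_create_fleet_overrides(queue, compute_resource, launch_template_name,
                                 subnet_id, instance_types, simulate_ice=False):
    """Return an overrides document (hypothetical helper, assumed layout).

    With simulate_ice=True every instance type gets an 'ICE-' prefix, which
    makes it invalid so that the fleet request yields no instances.
    """
    types = [f"ICE-{t}" if simulate_ice else t for t in instance_types]
    return {
        queue: {
            compute_resource: {
                "LaunchTemplateConfigs": [
                    {
                        "LaunchTemplateSpecification": {
                            "LaunchTemplateName": launch_template_name,
                            "Version": "$Latest",
                        },
                        "Overrides": [
                            {"InstanceType": t, "SubnetId": subnet_id}
                            for t in types
                        ],
                    }
                ]
            }
        }
    }


# All identifiers below are placeholders chosen for illustration.
real_types = ["t3.medium", "c5.large"]
broken = build_create_fleet_overrides(
    "queue1", "cr1", "example-launch-template", "subnet-0123", real_types,
    simulate_ice=True,
)
recovered = build_create_fleet_overrides(
    "queue1", "cr1", "example-launch-template", "subnet-0123", real_types,
)
print(json.dumps(broken, indent=2))
```

Writing `broken` to the overrides file would correspond to the first step, and replacing it with `recovered` to the second; how the file reaches the head node is left out of this sketch.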

Tests

Running

Checklist

Make sure you are pointing to the right branch.
If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
Check all commits' messages are clear, describing what and why vs how.
Make sure to have added unit tests or integration tests to cover the new/modified code.
Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


…eue jobs are treated as highest priority.

- Change job commands from simple 'sleep 30' to output hostname and timestamps, making it easier to verify job execution in output files
- Add --prefer option to job2 targeting the same compute resource as job1
- Increase job2 node request from 1 to 2 nodes to prevent it from immediately running on another CR before job1 requeues
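
A rough sketch of how such job submissions could be assembled is shown below. It is not taken from the PR: the helper name, the feature name passed to --prefer, the output file pattern, and the wrapped script are assumptions made for illustration. Only the --requeue=expedite value, the --prefer option on job2, and the 1-node versus 2-node split come from the description above. Quoting is delegated to `shlex.join` so that the nested quotes in the --wrap script survive the shell.

```python
import shlex


def build_sbatch_command(wrap_script, *, nodes=1, prefer=None,
                         requeue_mode=None, output="job-%j.out"):
    """Build a shell-safe sbatch command line (hypothetical helper)."""
    args = ["sbatch", "-N", str(nodes), "--output", output]
    if requeue_mode:
        args.append(f"--requeue={requeue_mode}")
    if prefer:
        args.append(f"--prefer={prefer}")
    # The wrapped script contains quotes and $(...), so it must stay one argument.
    args.append(f"--wrap={wrap_script}")
    return shlex.join(args)


# Illustrative script: print hostname and timestamps instead of a bare sleep.
script = 'echo "host=$(hostname) start=$(date +%s)"; sleep 30; echo "end=$(date +%s)"'

# job1: the expedited requeue job; job2: a normal 2-node job preferring the
# same compute resource ("cr1" is a placeholder feature name).
job1_cmd = build_sbatch_command(script, nodes=1, requeue_mode="expedite")
job2_cmd = build_sbatch_command(script, nodes=2, prefer="cr1")
print(job1_cmd)
print(job2_cmd)
```

Building the argument list first and quoting it in one place is one way to sidestep the kind of --wrap parsing error that a later commit in this PR fixes.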


…er and use recoverable ICE simulation

Move _test_expedited_requeue_on_ice out of test_fast_capacity_failover into a standalone test_expedited_requeue with its own cluster config (multi-instance-type CR using create_fleet).

Replace the unrecoverable overrides.py-based ICE simulation with create_fleet_overrides.json: write invalid 'ICE-' prefixed InstanceTypes to trigger ICE, then change them back to real ones (t3.medium, c5.xlarge) to recover. This allows verifying that after ICE recovery, the expedited requeue job starts before a normal job submitted earlier.
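
The final ordering check could look roughly like the sketch below. It assumes the start times are read from `scontrol show job` output, which reports a `StartTime=` field; whether the PR's test reads them this way or from the job output files is not stated here. The helper names and the sample strings are invented for illustration.

```python
import re
from datetime import datetime

# Matches the StartTime field; SubmitTime and EndTime do not match this pattern.
START_RE = re.compile(r"\bStartTime=(\S+)")


def parse_start_time(scontrol_output):
    """Return the job's start time, or None if it has not started yet."""
    match = START_RE.search(scontrol_output)
    if not match or match.group(1) == "Unknown":
        return None
    return datetime.fromisoformat(match.group(1))


def started_before(first_output, second_output):
    """True if the first job started strictly before the second one."""
    first = parse_start_time(first_output)
    second = parse_start_time(second_output)
    if first is None or second is None:
        raise ValueError("both jobs must have a known start time")
    return first < second


# Invented sample output for illustration only.
expedited_job = (
    "JobId=1 JobName=wrap\n"
    "   SubmitTime=2026-01-27T15:00:00 EligibleTime=2026-01-27T15:00:00\n"
    "   StartTime=2026-01-27T15:10:05 EndTime=2026-01-27T15:10:35\n"
)
normal_job = (
    "JobId=2 JobName=wrap\n"
    "   SubmitTime=2026-01-27T15:01:00 EligibleTime=2026-01-27T15:01:00\n"
    "   StartTime=2026-01-27T15:10:40 EndTime=2026-01-27T15:11:10\n"
)
print(started_before(expedited_job, normal_job))
```

Raising on a missing start time, rather than returning False, keeps a job that never started from being mistaken for one that started late.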


Add integration test for Slurm 25.11 expedited requeue mode feature

5f30bd3

Extend test_fast_capacity_failover to validate the new --requeue=expedite option introduced in Slurm 25.11.2. This feature allows batch jobs to automatically requeue on node failure with highest priority.

hehe7318 requested review from a team as code owners January 27, 2026 14:56

hehe7318 added the 3.x label Jan 27, 2026

hehe7318 and others added 9 commits January 28, 2026 14:07

Refine _test_expedited_requeue_on_ice to validate that expedited requeue jobs are treated as highest priority.

c03b120

Merge branch 'develop' into wip/add-test-for-expedited-requeue-mode

44d1672

Fix quote escaping in expedited requeue test to avoid sbatch --wrap parsing error

8b4eada

Merge branch 'develop' into wip/add-test-for-expedited-requeue-mode

b98a10b

TOREVERT: Work around known Slurm 25.11 expedited requeue bug

cb1ce79

Use c5.large and t3.medium to align the vcpu amount

7370907

Fix create_fleet_overrides to include LaunchTemplateSpecification to avoid MissingParameter error

979d4b1

hehe7318 added the skip-changelog-update label (disables the check that enforces changelog updates in PRs) Feb 13, 2026

hehe7318 added 3 commits February 13, 2026 15:33

Add SubnetId to create_fleet_overrides and get subnet from vpc_stack instead of fleet-config.json

2e81f78

Increase the job finish timeout to 15mins

0331fae

Fix the test hang issue

7820be4

hehe7318 closed this Mar 4, 2026

hehe7318 wants to merge 13 commits into aws:develop from hehe7318:wip/add-test-for-expedited-requeue-mode
