[Integ] Add E2E test for Slurm 25.11 expedited requeue mode feature by himani2411 · Pull Request #7246 · aws/aws-parallelcluster

himani2411 · 2026-02-24T18:28:36Z

Description of changes

cherry-pick https://github.com/aws/aws-parallelcluster/pull/7211/changes
- Add E2E test for Slurm 25.11 expedited requeue (--requeue=expedite). The test simulates ICE on a compute node, submits a mix of expedited and normal jobs targeting that node, recovers from ICE, and verifies that the requeued expedited job runs first by comparing start time epochs from job output files.
- Helper functions _submit_jobs_and_simulate_ice and _recover_from_ice_and_wait_for_jobs are extracted to reduce duplication in the ICE simulation cycle.
- Revert the workaround that was added for the Slurm bug in expedited-requeue mode.
- Add the --exclusive flag to the submitted jobs so that both jobs are not assigned or started at the same time once the ICE failure is no longer simulated.
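The start-time comparison described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the `START_EPOCH=` marker and the helper names `parse_start_epoch` and `expedited_ran_first` are assumptions made for the example.

```python
import re

# Hypothetical marker: each job is assumed to print "START_EPOCH=<seconds>"
# (for example via `date +%s`) as the first thing it does.
START_MARKER = re.compile(r"^START_EPOCH=(\d+)$", re.MULTILINE)


def parse_start_epoch(output_text: str) -> int:
    """Return the start epoch recorded in the text of a job output file.

    If the text holds several markers (for example when output is appended
    across runs), the last one is used.
    """
    matches = START_MARKER.findall(output_text)
    if not matches:
        raise ValueError("no start epoch found in job output")
    return int(matches[-1])


def expedited_ran_first(normal_output: str, expedited_output: str) -> bool:
    """True if the expedited job started no later than the normal job."""
    return parse_start_epoch(expedited_output) <= parse_start_epoch(normal_output)


if __name__ == "__main__":
    normal = "HOST=node-1\nSTART_EPOCH=1700000200\n"
    expedited = "HOST=node-1\nSTART_EPOCH=1700000150\n"
    print(expedited_ran_first(normal, expedited))
```

Comparing epochs written by the jobs themselves keeps the check independent of scheduler accounting, at the cost of relying on the clock of the compute node the jobs run on.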

Cookbook Changes -> aws/aws-parallelcluster-cookbook#3117

Tests

Integ test was successful.

References

Link to impacted open issues.
Link to related PRs in other packages (i.e. cookbook, node).
Link to documentation useful to understand the changes.

Checklist

Make sure you are pointing to the right branch.
If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
Check all commits' messages are clear, describing what and why vs how.
Make sure to have added unit tests or integration tests to cover the new/modified code.
Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

- Change job commands from simple 'sleep 30' to output hostname and timestamps, making it easier to verify job execution in output files
- Add --prefer option to job2 targeting the same compute resource as job1
- Increase job2 node request from 1 to 2 nodes to prevent it from immediately running on another CR before job1 requeues
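A job submission along the lines listed above might be assembled as in the sketch below. It is a hedged example rather than the PR's implementation: the function name `build_sbatch_command`, the output file naming, and the `START_EPOCH=` marker are assumptions, and the value passed to `--prefer` is assumed to be a node feature that identifies the compute resource. Building the command with `shlex.join` also sidesteps the kind of quoting problem around `--wrap` that one of the commits mentions.

```python
import shlex
from typing import Optional


def build_sbatch_command(
    label: str,
    expedited: bool,
    prefer: Optional[str] = None,
    nodes: int = 1,
    exclusive: bool = False,
) -> str:
    """Return a shell-safe sbatch command line for one test job."""
    # The wrapped script records where and when the job started, then sleeps.
    script = "echo HOST=$(hostname); echo START_EPOCH=$(date +%s); sleep 30"
    args = ["sbatch", "--job-name", label, "--output", f"{label}.out", "-N", str(nodes)]
    if exclusive:
        args.append("--exclusive")
    if expedited:
        # Expedited requeue option exercised by this test.
        args.append("--requeue=expedite")
    if prefer:
        args += ["--prefer", prefer]
    args += ["--wrap", script]
    # shlex.join quotes the wrapped script so the shell passes it as one argument.
    return shlex.join(args)


if __name__ == "__main__":
    print(build_sbatch_command("job1", expedited=False))
    print(build_sbatch_command("job2", expedited=True, prefer="cr1", nodes=2))
```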

…er and use recoverable ICE simulation

Move _test_expedited_requeue_on_ice out of test_fast_capacity_failover into a standalone test_expedited_requeue with its own cluster config (multi-instance-type CR using create_fleet). Replace the unrecoverable overrides.py-based ICE simulation with create_fleet_overrides.json: write invalid 'ICE-' prefixed InstanceTypes to trigger ICE, then change them back to real ones (t3.medium, c5.xlarge) to recover. This allows verifying that after ICE recovery, the expedited requeue job starts before a normal job submitted earlier.
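The recoverable ICE simulation described in this commit message could look roughly like the sketch below. It is not taken from the PR: the function names, the queue and compute resource names, the subnet ID, and the launch template name are hypothetical, and the queue-then-compute-resource nesting of create_fleet_overrides.json is an assumption about the file layout. The inner keys follow the EC2 CreateFleet request shape.

```python
import copy
import json

ICE_PREFIX = "ICE-"


def build_overrides(queue, compute_resource, instance_types, subnet_id, launch_template_name):
    """Build an overrides document for one compute resource (assumed layout)."""
    return {
        queue: {
            compute_resource: {
                "LaunchTemplateConfigs": [
                    {
                        "LaunchTemplateSpecification": {
                            "LaunchTemplateName": launch_template_name,
                            "Version": "$Latest",
                        },
                        "Overrides": [
                            {"InstanceType": itype, "SubnetId": subnet_id}
                            for itype in instance_types
                        ],
                    }
                ]
            }
        }
    }


def _map_instance_types(overrides, transform):
    """Return a deep copy with `transform` applied to every InstanceType."""
    result = copy.deepcopy(overrides)
    for compute_resources in result.values():
        for fleet_params in compute_resources.values():
            for config in fleet_params["LaunchTemplateConfigs"]:
                for override in config["Overrides"]:
                    override["InstanceType"] = transform(override["InstanceType"])
    return result


def simulate_ice(overrides):
    """Make every instance type invalid so that instance launches fail."""
    return _map_instance_types(overrides, lambda t: ICE_PREFIX + t)


def recover_from_ice(overrides):
    """Restore the real instance types by stripping the invalid prefix."""
    return _map_instance_types(
        overrides, lambda t: t[len(ICE_PREFIX):] if t.startswith(ICE_PREFIX) else t
    )


if __name__ == "__main__":
    base = build_overrides(
        "queue1", "cr1", ["t3.medium", "c5.xlarge"], "subnet-0123456789abcdef0", "example-template"
    )
    print(json.dumps(simulate_ice(base), indent=2))
```

Because both helpers return a modified copy, the original document can be written back unchanged once the simulated ICE window is over.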

hgreebe · 2026-03-03T17:31:02Z

tests/integration-tests/tests/schedulers/test_slurm.py

+    logging.info("Start epochs: %s", dict(zip([j["label"] for j in jobs], start_epochs)))
+
+    assert_that(start_epochs[1]).is_less_than_or_equal_to(start_epochs[0])  # job1 (normal) after job2 (expedited)
+    # assert_that(start_epochs[2]).is_less_than_or_equal_to(start_epochs[0])  # job3 (expedited) before job1 (normal)


Why have these assertions been commented out?

There is a TODO to add a third job (job3) as a future improvement. For now, the test focuses on only two jobs.

hehe7318 · 2026-03-03T18:27:36Z

tests/integration-tests/tests/schedulers/test_slurm.py

+    #  so that its clear that Job 2 goes at the top of the queue
+    jobs = [
+        {"label": "job1", "expedited": False},
+        {"label": "job2", "expedited": True},


We want to verify the scenario the customer requested: the job with the expedited requeue flag should still have the highest priority after it is requeued.

That is not the same as the scenario in which the job with the expedited requeue flag simply runs before a normal job.

The use case the customer (CX) has is that the higher-priority job gets to the top of the queue and runs before any other job, which is what I tested in the most obvious way I could. With the original submission order of the expedited and normal jobs, it was not obvious whether the expedited feature was working or the original submission order was simply being maintained.

hehe7318 · 2026-03-03T18:28:25Z

tests/integration-tests/tests/schedulers/test_slurm.py

+    assert_that(start_epochs[1]).is_less_than_or_equal_to(start_epochs[0])  # job1 (normal) after job2 (expedited)
+    # assert_that(start_epochs[2]).is_less_than_or_equal_to(start_epochs[0])  # job3 (expedited) before job1 (normal)
+    # assert_that(start_epochs[1]).is_less_than_or_equal_to(start_epochs[2])  # job2 (expedited) before job3 (expedited)
+    logging.info("Verified: expedited jobs (job2) ran before normal job (job1)")


Same as above.

himani2411 requested review from a team as code owners February 24, 2026 18:28

himani2411 added the skip-changelog-update and 3.x labels Feb 24, 2026

himani2411 force-pushed the xuanqi--expedited-requeue-mode-integ branch 2 times, most recently from 4b1752a to b77ee9b Compare February 24, 2026 20:47

himani2411 changed the title ~~Xuanqi expedited requeue mode integ~~ [Integ] Add E2E test for Slurm 25.11 expedited requeue mode feature Feb 24, 2026

himani2411 force-pushed the xuanqi--expedited-requeue-mode-integ branch 11 times, most recently from 04a558d to 0fdd987 Compare March 3, 2026 00:25

hehe7318 and others added 12 commits March 3, 2026 10:53

Add integration test for Slurm 25.11 expedited requeue mode feature

7a32ded

Extend test_fast_capacity_failover to validate the new --requeue=expedite option introduced in Slurm 25.11.2. This feature allows batch jobs to automatically requeue on node failure with highest priority.

Refine _test_expedited_requeue_on_ice to validate that expedited requeue jobs are treated as highest priority.

fc01829

Fix quote escaping in expedited requeue test to avoid sbatch --wrap parsing error

dd94842

TOREVERT: Work around known Slurm 25.11 expedited requeue bug

094326d

Use c5.large and t3.medium to align the vcpu amount

a10e298

Fix create_fleet_overrides to include LaunchTemplateSpecification to avoid MissingParameter error

df3677b

Add SubnetId to create_fleet_overrides and get subnet from vpc_stack instead of fleet-config.json

b399217

Increase the job finish timeout to 15mins

0dec864

Fix the test hung issue

eb72659

Reverting the change that was made because of the Slurm bug fixed in 25.11.3

813aa65

Himani Anil Deshpande added 5 commits March 3, 2026 10:53

Adding exclusive flag since we want to identify that Expedited Job runs 1st on the node we are targeting

937aaa9

Adding exclusive flag since we want to identify that Expedited Job runs 1st on the node we are targeting
* adding a wait fix of 5 secs

aec47aa

Add another job with expedited requeue flag

99939fe

remove 3rd job

e2533af

code-linters

098cbea

himani2411 force-pushed the xuanqi--expedited-requeue-mode-integ branch from 0fdd987 to 098cbea Compare March 3, 2026 15:53

hgreebe reviewed Mar 3, 2026


hehe7318 reviewed Mar 3, 2026


himani2411 wants to merge 17 commits into aws:develop from himani2411:xuanqi--expedited-requeue-mode-integ

3 participants
