[Integ] Add E2E test for Slurm 25.11 expedited requeue mode feature #7246

Open
himani2411 wants to merge 17 commits into aws:develop from
himani2411:xuanqi--expedited-requeue-mode-integ

himani2411 (Contributor) commented Feb 24, 2026

Description of changes

  • cherry-pick https://github.com/aws/aws-parallelcluster/pull/7211/changes
    • Add E2E test for Slurm 25.11 expedited requeue (--requeue=expedite). The test simulates ICE on a compute node, submits a mix of expedited and normal jobs targeting that node, recovers from ICE, and verifies that the requeued expedited job runs first by comparing start time epochs from job output files.
    • Helper functions _submit_jobs_and_simulate_ice and _recover_from_ice_and_wait_for_jobs are extracted to reduce duplication in the ICE simulation cycle.
  • Revert the changes made for the Slurm bug in expedited-requeue mode.
  • Add the --exclusive flag to the submitted jobs so that the two jobs are not assigned or started at the same time once ICE is no longer being simulated.
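
The ordering check described above (comparing start time epochs taken from job output files) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `START_TIME=<epoch>` line format and the helper names `parse_start_epoch` and `expedited_ran_first` are assumptions made for the example.

```python
import re

# Assumed (hypothetical) output format: each job run writes a line such as
# "START_TIME=1772550000" to its output file.
START_RE = re.compile(r"^START_TIME=(\d+)$", re.MULTILINE)


def parse_start_epoch(output_text):
    """Return the most recent start epoch recorded in a job's output text."""
    matches = START_RE.findall(output_text)
    if not matches:
        raise ValueError("no START_TIME line found in job output")
    # If the output file is appended to across runs, the last entry is the latest run.
    return int(matches[-1])


def expedited_ran_first(outputs, expedited_label, normal_label):
    """True if the expedited job started no later than the normal job."""
    expedited_start = parse_start_epoch(outputs[expedited_label])
    normal_start = parse_start_epoch(outputs[normal_label])
    return expedited_start <= normal_start
```

With `outputs` mapping job labels to the text of their output files, `expedited_ran_first(outputs, "job2", "job1")` mirrors the `is_less_than_or_equal_to` assertion discussed in the review below.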

Cookbook Changes -> aws/aws-parallelcluster-cookbook#3117

Tests

  • Integ test was successful

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

himani2411 requested review from a team as code owners February 24, 2026 18:28
himani2411 added the 3.x and skip-changelog-update (Disables the check that enforces changelog updates in PRs) labels Feb 24, 2026
himani2411 force-pushed the xuanqi--expedited-requeue-mode-integ branch 2 times, most recently from 4b1752a to b77ee9b February 24, 2026 20:47
himani2411 changed the title from "Xuanqi expedited requeue mode integ" to "[Integ] Add E2E test for Slurm 25.11 expedited requeue mode feature" Feb 24, 2026
himani2411 force-pushed the xuanqi--expedited-requeue-mode-integ branch 11 times, most recently from 04a558d to 0fdd987 March 3, 2026 00:25
hehe7318 and others added 12 commits March 3, 2026 10:53
Extend test_fast_capacity_failover to validate the new --requeue=expedite
option introduced in Slurm 25.11.2. This feature allows batch jobs to
automatically requeue on node failure with highest priority.
- Change job commands from simple 'sleep 30' to output hostname and
  timestamps, making it easier to verify job execution in output files
- Add --prefer option to job2 targeting the same compute resource as job1
- Increase job2 node request from 1 to 2 nodes to prevent it from
  immediately running on another CR before job1 requeues
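
As a rough illustration of the job submissions this commit message describes, the sketch below assembles an sbatch command line. It is an assumption-laden example rather than the test's actual code: the helper name `build_sbatch_command`, the `HOSTNAME=`/`START_TIME=` output format, and the feature name passed to --prefer are hypothetical, while --requeue=expedite, --prefer, the node count, and --exclusive come from the PR text.

```python
import shlex


def build_sbatch_command(expedited, nodes=1, prefer=None, exclusive=False):
    """Build an sbatch command line for a job that prints its hostname and start epoch."""
    # Hypothetical job body: record where and when the job ran, then sleep.
    wrapped = 'echo "HOSTNAME=$(hostname)"; echo "START_TIME=$(date +%s)"; sleep 30'
    args = ["sbatch", "-N", str(nodes)]
    if expedited:
        # Option named in this PR for Slurm 25.11 expedited requeue.
        args.append("--requeue=expedite")
    if prefer:
        # Soft preference for nodes with this feature (for example a compute resource name).
        args.append(f"--prefer={prefer}")
    if exclusive:
        args.append("--exclusive")
    args.append(f"--wrap={wrapped}")
    # shlex.join quotes the wrapped command so it survives a shell invocation.
    return shlex.join(args)
```

For example, `build_sbatch_command(True, nodes=2, prefer="cr1")` would produce a two-node expedited submission that prefers nodes tagged with the placeholder feature `cr1`.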
…er and use recoverable ICE simulation

Move _test_expedited_requeue_on_ice out of test_fast_capacity_failover into a standalone
test_expedited_requeue with its own cluster config (multi-instance-type CR using create_fleet).

Replace the unrecoverable overrides.py-based ICE simulation with create_fleet_overrides.json:
write invalid 'ICE-' prefixed InstanceTypes to trigger ICE, then change them back to real ones
(t3.medium, c5.xlarge) to recover. This allows verifying that after ICE recovery, the expedited
requeue job starts before a normal job submitted earlier.
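
The recoverable ICE simulation described in this commit message (prefix the instance types with 'ICE-' to trigger the failure, then restore the real names) can be sketched as below. This is a minimal sketch under assumptions: the helper name `set_ice` is hypothetical, and the nested layout of create_fleet_overrides.json is not shown in this PR, so the code simply rewrites every `InstanceType` string it finds in whatever structure it is given.

```python
import copy

ICE_PREFIX = "ICE-"


def set_ice(overrides, simulate):
    """Return a copy of overrides with every InstanceType prefixed (simulate=True) or restored."""
    result = copy.deepcopy(overrides)

    def visit(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "InstanceType" and isinstance(value, str):
                    # Strip an existing prefix first so repeated calls are idempotent.
                    real = value[len(ICE_PREFIX):] if value.startswith(ICE_PREFIX) else value
                    node[key] = ICE_PREFIX + real if simulate else real
                else:
                    visit(value)
        elif isinstance(node, list):
            for item in node:
                visit(item)

    visit(result)
    return result
```

In a test, the result of `set_ice(overrides, True)` would be written to the overrides file to make launches fail, and `set_ice(overrides, False)` would be written afterwards to recover.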
himani2411 force-pushed the xuanqi--expedited-requeue-mode-integ branch from 0fdd987 to 098cbea March 3, 2026 15:53
logging.info("Start epochs: %s", dict(zip([j["label"] for j in jobs], start_epochs)))

assert_that(start_epochs[1]).is_less_than_or_equal_to(start_epochs[0]) # job1 (normal) after job2 (expedited)
# assert_that(start_epochs[2]).is_less_than_or_equal_to(start_epochs[0]) # job3 (expedited) before job1 (normal)

Contributor

Why have these assertions been commented out?

Contributor Author

There is a TODO for adding a third job (job3) as a later improvement; for now the test focuses on only two jobs.

# so that its clear that Job 2 goes at the top of the queue
jobs = [
{"label": "job1", "expedited": False},
{"label": "job2", "expedited": True},

Contributor

We want to verify the scenario the customer requested: a job with the expedited requeue flag should still have the highest priority after it is requeued.

That is not the same as the scenario where a job with the expedited requeue flag runs before a normal job.

Contributor Author

The customer's use case is that the higher-priority job goes to the top of the queue and runs before any other job, which is what I tested in the most obvious way I could. With the original submission order of the expedited and normal jobs, it was not obvious whether the expedited feature was working or the original submission order was simply being maintained.

assert_that(start_epochs[1]).is_less_than_or_equal_to(start_epochs[0]) # job1 (normal) after job2 (expedited)
# assert_that(start_epochs[2]).is_less_than_or_equal_to(start_epochs[0]) # job3 (expedited) before job1 (normal)
# assert_that(start_epochs[1]).is_less_than_or_equal_to(start_epochs[2]) # job2 (expedited) before job3 (expedited)
logging.info("Verified: expedited jobs (job2) ran before normal job (job1)")

Contributor

Same as above.

Labels

3.x skip-changelog-update Disables the check that enforces changelog updates in PRs

3 participants