[Integ] Add E2E test for Slurm 25.11 expedited requeue mode feature #7246
himani2411 wants to merge 17 commits into aws:develop
Extend test_fast_capacity_failover to validate the new --requeue=expedite option introduced in Slurm 25.11.2. This feature allows batch jobs to automatically requeue on node failure with highest priority.
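As a rough illustration of how a test might submit such a job, the sketch below builds an `sbatch` argument list. The `--requeue=expedite` spelling is taken from the commit message above; the helper name `build_sbatch_command` and the script path `job.sh` are hypothetical and not part of this PR.

```python
import shlex


def build_sbatch_command(script_path, expedited=False, nodes=1):
    """Return the sbatch argv for a batch job (hypothetical helper)."""
    cmd = ["sbatch", f"--nodes={nodes}"]
    if expedited:
        # Option spelling as given in the commit message (Slurm 25.11.2).
        cmd.append("--requeue=expedite")
    cmd.append(script_path)
    return cmd


if __name__ == "__main__":
    # Print a shell-safe rendering of the command for inspection.
    print(shlex.join(build_sbatch_command("job.sh", expedited=True)))
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting concerns.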
…eue jobs are treated as highest priority.
- Change job commands from simple 'sleep 30' to output hostname and timestamps, making it easier to verify job execution in output files
- Add --prefer option to job2 targeting the same compute resource as job1
- Increase job2 node request from 1 to 2 nodes to prevent it from immediately running on another CR before job1 requeues
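A job command that prints the hostname and timestamps could look roughly like the sketch below, together with a parser for the resulting output file. The `START`/`END` line format and the names `make_job_command` and `parse_start` are assumptions made for illustration; the actual test may format its output differently.

```python
import re


def make_job_command(label, sleep_seconds=30):
    """Build a shell command that logs host and epoch before and after sleeping."""
    return (
        f'echo "{label} START $(hostname) $(date +%s)"; '
        f"sleep {sleep_seconds}; "
        f'echo "{label} END $(hostname) $(date +%s)"'
    )


# Matches lines such as: "job1 START some-host 1700000000" (assumed format).
START_RE = re.compile(r"^(?P<label>\S+) START (?P<host>\S+) (?P<epoch>\d+)$", re.MULTILINE)


def parse_start(output):
    """Return (host, start_epoch) from the first START line in a job output file."""
    match = START_RE.search(output)
    if match is None:
        raise ValueError("no START line found in job output")
    return match["host"], int(match["epoch"])
```

Recording the epoch inside the job output gives the test a start time it can compare across jobs without depending on scheduler accounting.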
…er and use recoverable ICE simulation
Move _test_expedited_requeue_on_ice out of test_fast_capacity_failover into a standalone test_expedited_requeue with its own cluster config (multi-instance-type CR using create_fleet).
Replace the unrecoverable overrides.py-based ICE simulation with create_fleet_overrides.json: write invalid 'ICE-' prefixed InstanceTypes to trigger ICE, then change them back to real ones (t3.medium, c5.xlarge) to recover. This allows verifying that after ICE recovery, the expedited requeue job starts before a normal job submitted earlier.
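The recoverable ICE simulation described above can be sketched as a pair of pure functions that toggle the 'ICE-' prefix on instance types. The nested layout (queue, then compute resource, then `LaunchTemplateConfigs` and `Overrides`) is an assumption about how create_fleet_overrides.json is structured, and the queue and compute resource names are hypothetical.

```python
import copy
import json

ICE_PREFIX = "ICE-"


def _map_instance_types(overrides, fn):
    """Return a deep copy of overrides with fn applied to every InstanceType."""
    result = copy.deepcopy(overrides)
    for queue in result.values():
        for compute_resource in queue.values():
            for config in compute_resource.get("LaunchTemplateConfigs", []):
                for override in config.get("Overrides", []):
                    override["InstanceType"] = fn(override["InstanceType"])
    return result


def simulate_ice(overrides):
    """Prefix each instance type with 'ICE-' so fleet creation cannot succeed."""
    return _map_instance_types(
        overrides, lambda t: t if t.startswith(ICE_PREFIX) else ICE_PREFIX + t
    )


def recover_from_ice(overrides):
    """Strip the 'ICE-' prefix to restore the real instance types."""
    return _map_instance_types(
        overrides, lambda t: t[len(ICE_PREFIX):] if t.startswith(ICE_PREFIX) else t
    )


if __name__ == "__main__":
    # Hypothetical queue and compute resource names, real instance types from the PR.
    real = {
        "queue": {
            "ice-cr": {
                "LaunchTemplateConfigs": [
                    {"Overrides": [{"InstanceType": "t3.medium"}, {"InstanceType": "c5.xlarge"}]}
                ]
            }
        }
    }
    print(json.dumps(simulate_ice(real), indent=2))
```

Keeping both directions idempotent means the test can call either function repeatedly without producing names such as `ICE-ICE-t3.medium`.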
…avoid MissingParameter error
…instead of fleet-config.json
…ns 1st on the node we are targeting
…ns 1st on the node we are targeting
* adding a wait fix of 5 secs
logging.info("Start epochs: %s", dict(zip([j["label"] for j in jobs], start_epochs)))

assert_that(start_epochs[1]).is_less_than_or_equal_to(start_epochs[0])  # job1 (normal) after job2 (expedited)
# assert_that(start_epochs[2]).is_less_than_or_equal_to(start_epochs[0])  # job3 (expedited) before job1 (normal)
Why have these assertions been commented out?
There is a TODO to add a third job (job3) as a later improvement. For now the test focuses on only two jobs.
# so that its clear that Job 2 goes at the top of the queue
jobs = [
    {"label": "job1", "expedited": False},
    {"label": "job2", "expedited": True},
We want to verify the scenario the customer requested: the job with the expedited requeue flag should still have the highest priority after requeue.
We do not want to verify the scenario where the job with the expedited requeue flag simply runs before a normal job.
The use case the customer has is that the higher-priority job they care about gets to the top of the queue and runs before any other job, which is what I tested in the most obvious way I could. With the original submission of the expedited and normal jobs, it was not obvious whether the expedited feature was working or the original submission order was simply being maintained.
assert_that(start_epochs[1]).is_less_than_or_equal_to(start_epochs[0])  # job1 (normal) after job2 (expedited)
# assert_that(start_epochs[2]).is_less_than_or_equal_to(start_epochs[0])  # job3 (expedited) before job1 (normal)
# assert_that(start_epochs[1]).is_less_than_or_equal_to(start_epochs[2])  # job2 (expedited) before job3 (expedited)
logging.info("Verified: expedited jobs (job2) ran before normal job (job1)")
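The index-based assertions quoted in this thread could also be expressed in terms of the `expedited` flag, so that adding job3 later would not require new index arithmetic. This is a sketch using plain Python rather than the `assert_that` helper in the diff; the function name `expedited_ran_first` is hypothetical.

```python
def expedited_ran_first(jobs, start_epochs):
    """Return True if every expedited job started no later than every normal job."""
    expedited = [epoch for job, epoch in zip(jobs, start_epochs) if job["expedited"]]
    normal = [epoch for job, epoch in zip(jobs, start_epochs) if not job["expedited"]]
    if not expedited or not normal:
        raise ValueError("need at least one expedited and one normal job")
    # The latest expedited start must not be later than the earliest normal start.
    return max(expedited) <= min(normal)


if __name__ == "__main__":
    jobs = [
        {"label": "job1", "expedited": False},
        {"label": "job2", "expedited": True},
    ]
    # Example epochs chosen for illustration only.
    print(expedited_ran_first(jobs, [1700000100, 1700000040]))
```

The check only compares the two groups with each other. It says nothing about ordering among expedited jobs, which the commented-out job2/job3 assertion would cover.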
Description of changes
--exclusive flag to the job submitted so that the two jobs are not assigned/started at the same time when we are no longer simulating ICE failure
Cookbook Changes -> aws/aws-parallelcluster-cookbook#3117
Tests
References
Checklist
develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.