[feat] Aggregate polling for retrieving job pending reason #3642

Open
vkarak wants to merge 1 commit into reframe-hpc:develop from vkarak:bugfix/slurm-rpc-load

Contributor

@vkarak vkarak commented Mar 25, 2026

This PR improves how pending jobs are polled for their pending reason.

The pending reason is now retrieved with a single command covering all pending jobs, rather than through multiple squeue invocations. Two knobs are also exposed to users, both as configuration options and as environment variables:

  1. slurm_job_cancel_reasons: A list of pending reasons that ReFrame checks for; a job that is pending for any of these reasons is cancelled proactively.
  2. slurm_pending_job_reason_poll_freq: Controls how frequently pending jobs are polled for their pending reason (valid only for the Slurm backend).
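
For illustration, the aggregated query described above could look roughly like the sketch below. This is not the code from this PR: the helper names (`build_squeue_cmd`, `parse_reasons`, `jobs_to_cancel`) are hypothetical, and the `jobid|reason` output layout is an assumption chosen for easy parsing. Only the `squeue` options `-h`, `-j`, `-o` and the format fields `%i` (job ID) and `%r` (reason) are standard Slurm.

```python
# Hypothetical sketch of aggregated pending-reason polling.
# Helper names are illustrative and are not taken from the PR diff.

def build_squeue_cmd(job_ids):
    """Build a single squeue command line covering all given job IDs."""
    # -h: suppress the header, -j: comma-separated job list,
    # -o: output format; %i is the job ID and %r is the pending reason
    return ['squeue', '-h', '-j', ','.join(job_ids), '-o', '%i|%r']


def parse_reasons(output):
    """Map job ID -> pending reason from 'jobid|reason' output lines."""
    reasons = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue

        # Split on the first '|' only, so extra detail in the reason
        # text is kept as part of the reason
        jobid, _, reason = line.partition('|')
        reasons[jobid] = reason

    return reasons


def jobs_to_cancel(reasons, cancel_reasons):
    """Return the sorted job IDs whose reason is in `cancel_reasons`."""
    return sorted(jobid for jobid, reason in reasons.items()
                  if reason in cancel_reasons)
```

With a helper like this, a single `squeue` round trip serves every pending job in a polling cycle, and the set passed as `cancel_reasons` plays the role that `slurm_job_cancel_reasons` has in this PR.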

Closes #3640.


codecov bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.66%. Comparing base (1eee4f9) to head (45c5613).

| Files with missing lines | Patch % | Lines |
| reframe/core/schedulers/slurm.py | 49.09% | 28 Missing ⚠️ |
| reframe/core/schedulers/__init__.py | 33.33% | 2 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3642      +/-   ##
===========================================
- Coverage    91.70%   91.66%   -0.05%     
===========================================
  Files           62       62              
  Lines        13713    13723      +10     
===========================================
+ Hits         12576    12579       +3     
- Misses        1137     1144       +7     


@vkarak vkarak force-pushed the bugfix/slurm-rpc-load branch from 61c38ae to 45c5613 Compare March 25, 2026 16:52

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

Multiple squeue's in _cancel_if_blocked in reframe/core/schedulers/slurm.py are hitting slurm's RPC rate limit
