Skip to content

Conversation

@zhaoyuyoung
Copy link
Contributor

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@codecov
Copy link

codecov bot commented Jan 26, 2026

Codecov Report

❌ Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 34.88%. Comparing base (c634e51) to head (395fc2f).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
python/lsst/ctrl/bps/panda/utils.py 0.00% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #103      +/-   ##
==========================================
- Coverage   35.01%   34.88%   -0.14%     
==========================================
  Files          13       13              
  Lines        1308     1313       +5     
  Branches      214      214              
==========================================
  Hits          458      458              
- Misses        830      835       +5     
  Partials       20       20              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@zhaoyuyoung zhaoyuyoung requested review from MichelleGower and removed request for MichelleGower January 26, 2026 17:32
Copy link
Contributor

@MichelleGower MichelleGower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. One question. Merge approved.

code = e.errno
if abs(code) == 9:
print(f"BPS prepare caught exception: {e.strerror}")
sys.exit(137)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to do this more generically for all signals instead of just signal 9? For example something like:

if code < 0:
    print(f"BPS prepare caught exception: {e.strerror}")
	sys.exit(128 + abs(code))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yesterday I saw there is build task failed at LANCS because the pipetask graph command got wrong option. Then the exit code is 2. For all failed build task for OOM, I've only see -9. Not sure if we should generalize all the exit codes to be retried with boosted memory. Waiting for both reviewers's suggestions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it's good to do it more generally for all signals, instead of -9.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should all exit codes become 137 or as Michelle suggested 128+abs(exit_code)?
Rubin workflow is getting bigger and bigger, so the build task is usually start with very high initial memory. When the job exit with code 1 or 2 for payload issue other than OOM, the retry memory will be bumped too high.

code = e.errno
if abs(code) == 9:
print(f"BPS prepare caught exception: {e.strerror}")
sys.exit(137)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it's good to do it more generally for all signals, instead of -9.

task_rss = get_task_parameter(config, remote_build, "requestMemory")
memory_multiplier = get_task_parameter(config, remote_build, "memoryMultiplier")
task_rss_retry_step = task_rss * memory_multiplier if memory_multiplier else 0
task_rss_retry_offset = 0 if task_rss_retry_step else task_rss
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add task_rss_max, to limit the top memory. It will be supported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add this. I asked in the jira ticket whether this parameter will this break production jobs if it's not in iDDS yet?

@zhaoyuyoung zhaoyuyoung force-pushed the tickets/DM-53876 branch 2 times, most recently from c4b4ed7 to 7f74328 Compare February 11, 2026 04:03
@zhaoyuyoung zhaoyuyoung merged commit 2d0a82c into main Feb 11, 2026
18 of 20 checks passed
@zhaoyuyoung zhaoyuyoung deleted the tickets/DM-53876 branch February 11, 2026 15:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants