-
Notifications
You must be signed in to change notification settings - Fork 1
Support memory boost and prmon for build task #103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #103 +/- ##
==========================================
- Coverage 35.01% 34.88% -0.14%
==========================================
Files 13 13
Lines 1308 1313 +5
Branches 214 214
==========================================
Hits 458 458
- Misses 830 835 +5
Partials 20 20 ☔ View full report in Codecov by Sentry. |
29a9ffe to
f732a05
Compare
MichelleGower
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. One question. Merge approved.
| code = e.errno | ||
| if abs(code) == 9: | ||
| print(f"BPS prepare caught exception: {e.strerror}") | ||
| sys.exit(137) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to do this more generically for all signals instead of just signal 9? For example something like:
if code < 0:
print(f"BPS prepare caught exception: {e.strerror}")
sys.exit(128 + abs(code))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yesterday I saw there is build task failed at LANCS because the pipetask graph command got wrong option. Then the exit code is 2. For all failed build task for OOM, I've only see -9. Not sure if we should generalize all the exit codes to be retried with boosted memory. Waiting for both reviewers's suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it's good to do it more generally for all signals, instead of -9.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should all exit codes become 137 or as Michelle suggested 128+abs(exit_code)?
Rubin workflow is getting bigger and bigger, so the build task is usually start with very high initial memory. When the job exit with code 1 or 2 for payload issue other than OOM, the retry memory will be bumped too high.
| code = e.errno | ||
| if abs(code) == 9: | ||
| print(f"BPS prepare caught exception: {e.strerror}") | ||
| sys.exit(137) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it's good to do it more generally for all signals, instead of -9.
| task_rss = get_task_parameter(config, remote_build, "requestMemory") | ||
| memory_multiplier = get_task_parameter(config, remote_build, "memoryMultiplier") | ||
| task_rss_retry_step = task_rss * memory_multiplier if memory_multiplier else 0 | ||
| task_rss_retry_offset = 0 if task_rss_retry_step else task_rss |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add task_rss_max, to limit the top memory. It will be supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add this. I asked in the jira ticket whether this parameter will this break production jobs if it's not in iDDS yet?
c4b4ed7 to
7f74328
Compare
7f74328 to
395fc2f
Compare
Checklist
doc/changes