Merged
Cursor Bugbot has reviewed your changes and found 1 potential issue.
  if not pyproject_file.is_file():
-     logger.debug(f"pyproject.toml not found in installed package {module_name}")
+     logger.debug(f"pyproject.toml not found for installed module {module_name}")
Skill docs missing for workflow change
Low Severity
This PR changes evaluation workflow behavior in verifiers/scripts/eval.py by altering how get_env_eval_defaults resolves pyproject.toml for environments, but no corresponding skills/*/SKILL.md updates are included. That violates the Skills Updates rule for user-facing workflow changes in verifiers/scripts/*.py.
Triggered by project rule: BugBot Instructions
…e-envs-as-packages # Conflicts: # verifiers/envs/experimental/cli_agent_env.py
@cursoragent fix merge conflicts vs main
All changes have been pushed to the branch.
…solve CliAgentEnv interception handling conflicts by using _require_interception_server()

Co-authored-by: will brown <willccbb@users.noreply.github.com>
…978)

* Add overview panel and collapse non-running envs in vf-eval display

  When evaluating many environments from a TOML file, the progress display overflowed the terminal. Now a persistent overview panel at the top shows every env's status (reward, error rate, time) on one compact line, and only running envs get full detail panels below it.

* Fix compact row showing negative total when num_examples=-1

  Show "..." instead of the negative number, matching the existing behavior in the full progress panel.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Add arrow-key navigation between env detail panels in vf-eval display

  - Add key listener infrastructure to BaseDisplay (cbreak mode, stdin polling via os.read on fd 0, arrow key escape sequence parsing)
  - Show single selected env detail panel instead of all running envs
  - Left/right arrow keys cycle through envs; selected env highlighted in overview
  - Overview capped at half terminal height with priority ordering when truncated (failed > running > completed > pending), selected env always visible
  - Fix terminal size detection: use os.get_terminal_size(0) since stdout/stderr are redirected to pipes, making shutil.get_terminal_size() fall back to 24 lines
  - Screen mode (--tui) adaptively sizes log panel to fill terminal
  - Footer shows navigation hint when multiple envs present
  - Stop key listener before wait_for_exit to prevent stdin contention

* Use daemon thread for key listener instead of asyncio task

  The asyncio task couldn't run when the event loop was blocked by synchronous env startup work (package installation, etc.), so arrow key navigation only worked once all envs were running. A daemon thread runs independently of the event loop.

* Improve overview panel: env counter in title, better selection indicator

  - Show "(env n/x)" in detail panel title for multi-env evals
  - Add ▶ prefix to selected env in overview for clearer indication
  - Change running env icon from ▸ (triangle) to ● (filled circle) to pair with ○ (empty circle) for pending
  - Add spacing between selection arrow and status icon

* Scroll overview panel to keep selected env visible

  Replace priority-ordering with a sliding window that follows the selected env. Shows "... N above" / "... and N more" indicators when the list is truncated at top/bottom.

* Revert unrelated AGENTS.md sync changes

* Fix: only set cbreak mode when key listener is active

  Move cbreak terminal setup from start() into _start_key_listener() so that GEPADisplay (sync context manager, no key listener) doesn't leave the terminal in cbreak mode with no thread consuming keystrokes.

* Revert "Fix: only set cbreak mode when key listener is active"

  This reverts commit f8e3b8f.

* Fix key listener busy-loop on stdin EOF

  Break out of the loop when os.read returns empty bytes (EOF from SSH disconnect, terminal close, etc.) instead of spinning at 100% CPU.

* Adaptive log panel height in both screen and non-screen modes

  Fill remaining terminal space with the log panel regardless of display mode, instead of using a fixed 20-line default in non-screen.

* Adaptive log panel height fills terminal in both display modes

  Use terminal-adaptive log sizing in non-screen mode too (was fixed at 20 lines). Includes 2-line buffer in detail_fixed to account for Rich rendering overhead so the footer stays visible.

* Measure actual rendered height for adaptive log panel sizing

  Render content items to a temporary buffer to count real terminal lines (accounting for text wrapping) instead of counting Python objects. This ensures the log panel fills exactly the remaining space regardless of metrics count or terminal width.

* Document arrow-key navigation in multi-env evaluation display

* Add log panel scrolling and wrapped line indentation

  - Up/down arrow keys scroll through log history (3 lines per press)
  - Log lines wrap with 4-space indented continuation lines for clarity
  - Scroll offset shown in panel title; resets on env switch
  - Footer hint updated to show scroll controls

* Add colored log headers in eval display log panel

  Parse log lines into timestamp, source, level, and message parts with distinct styles. Level colors follow common conventions: DEBUG=dim blue, INFO=bold green, WARNING=bold yellow, ERROR=bold red, CRITICAL=bold red reverse. Timestamp is bold dim, source is dim cyan.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
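The sliding-window overview described in the "Scroll overview panel" commit can be sketched roughly as below. This is an illustration only, not the code from the PR: the function name `overview_window`, its parameters, and the exact row budget are assumptions. Only the "... N above" / "... and N more" indicators and the ▶ prefix come from the commit messages.

```python
def overview_window(names: list[str], selected: int, max_rows: int) -> list[str]:
    """Return at most max_rows display rows that always include names[selected].

    Hypothetical helper: indicator rows count against the max_rows budget.
    """
    if max_rows < 3:
        raise ValueError("max_rows must be at least 3 to fit both indicators")
    n = len(names)
    if not 0 <= selected < n:
        raise IndexError("selected is out of range")

    def row(i: int) -> str:
        # Mark the selected env with the arrow prefix, pad the others.
        return ("▶ " if i == selected else "  ") + names[i]

    if n <= max_rows:
        return [row(i) for i in range(n)]

    # Center the window on the selection, clamped to the list bounds.
    start = max(0, min(selected - max_rows // 2, n - max_rows))
    end = start + max_rows
    truncated_above = start > 0
    truncated_below = end < n
    # Each indicator takes one row away from the visible envs.
    if truncated_above:
        start += 1
    if truncated_below:
        end -= 1

    rows = []
    if truncated_above:
        rows.append(f"... {start} above")
    rows.extend(row(i) for i in range(start, end))
    if truncated_below:
        rows.append(f"... and {n - end} more")
    return rows
```

With ten envs, the sixth one selected, and a five-row budget, this yields one indicator row at each end and three env rows in between, with the selection in the middle.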
The colored log headers make entry boundaries clear without indentation. Keep wrapping-aware height estimation so the log panel fills the correct amount of space.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
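A wrapping-aware height estimate can be approximated as follows. This is a simplified sketch under stated assumptions: it wraps purely by character count, whereas the earlier commit in this PR renders to a temporary buffer, and a real terminal renderer may wrap at word boundaries or treat wide characters differently. The function name `estimate_rendered_lines` is hypothetical.

```python
import math


def estimate_rendered_lines(lines: list[str], width: int) -> int:
    """Estimate how many terminal rows the given log lines occupy at `width`.

    Assumption: hard wrapping every `width` characters; an empty line still
    occupies one row.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    return sum(max(1, math.ceil(len(line) / width)) for line in lines)
```

The estimate can then be subtracted from the terminal height to decide how many log lines fit in the remaining space.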
Two fixes:

1. Set state["cua_sandbox_id"] immediately after sandbox creation, before _wait_for_sandbox_ready/_wait_for_server. Previously if anything failed between creation and the state assignment, cleanup_session couldn't find the sandbox_id so the sandbox leaked until teardown at the end of the run.
2. Add _create_sandbox_with_retry that cleans up orphaned sandboxes from failed retry attempts. Previously if _create_sandbox succeeded but the retry loop retried, the first sandbox was orphaned in active_sandboxes but never referenced by any state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
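The two fixes can be pictured with the minimal sketch below. It is not the PR's implementation: `create_sandbox_with_retry` and the `create`, `wait_ready`, and `delete` callables are hypothetical stand-ins for the real sandbox client, and only the `state["cua_sandbox_id"]` key comes from the commit message.

```python
from collections.abc import Awaitable, Callable


async def create_sandbox_with_retry(
    create: Callable[[], Awaitable[str]],          # hypothetical: returns a sandbox id
    wait_ready: Callable[[str], Awaitable[None]],  # hypothetical: raises if not ready
    delete: Callable[[str], Awaitable[None]],      # hypothetical: removes a sandbox
    state: dict,
    max_attempts: int = 3,
) -> str:
    last_error: Exception | None = None
    for _ in range(max_attempts):
        sandbox_id = await create()
        # Fix 1: record the id right away so session cleanup can find the
        # sandbox even if a later readiness step fails.
        state["cua_sandbox_id"] = sandbox_id
        try:
            await wait_ready(sandbox_id)
            return sandbox_id
        except Exception as exc:
            last_error = exc
            # Fix 2: delete the sandbox from the failed attempt before
            # retrying, so it is not left orphaned.
            try:
                await delete(sandbox_id)
            except Exception:
                pass  # best effort; end-of-run teardown is assumed to catch leftovers
            state.pop("cua_sandbox_id", None)
    raise RuntimeError(f"sandbox not ready after {max_attempts} attempts") from last_error
```

In this sketch an exception raised by `create` itself propagates immediately, since no sandbox exists yet to clean up.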
…e-envs-as-packages # Conflicts: # verifiers/envs/experimental/cli_agent_env.py
Description
Type of Change
Testing
uv run pytest locally.
Checklist
Additional Notes
Note
Medium Risk
Touches eval CLI configuration resolution; incorrect module path detection could silently skip defaults or pick the wrong pyproject.toml, changing evaluation run parameters.

Overview
prime eval run now loads [tool.verifiers.eval] defaults (num_examples, rollouts_per_example) from an environment's pyproject.toml for both package and single-file environment modules by switching get_env_eval_defaults to resolve module paths via importlib.util.find_spec.

Adds pytest coverage for both module shapes, and updates the environments docs to include vf.TunnelError under vf.InfraError (plus a small import cleanup in HybridMathRubric).

Written by Cursor Bugbot for commit c59f487. This will update automatically on new commits.