
Move Bare Metal Runners over to long lived WebSocket#695

Open
epompeii wants to merge 12 commits into devel from
u/ep/runner-channel

Conversation

@epompeii
Member

This changeset moves from an HTTP long poll plus a per-job WebSocket connection to a single persistent WebSocket connection for each bare metal runner.

@github-actions

github-actions bot commented Mar 11, 2026

🤖 Claude Code Review

PR: #695
Base: devel
Head: u/ep/runner-channel
Commit: e4f90aaeefcc9efd2216d7fddf802fe5c483ecdb




PR Review: Runner WebSocket Channel State Machine

Summary

This PR introduces a major architectural refactoring of the runner-server communication protocol. The key changes are:

  1. Sans-IO state machine (state_machine.rs) - Pure decision logic separated from I/O for the runner WebSocket channel protocol
  2. Persistent WebSocket channel (channel.rs) - The server side replaces the per-job WebSocket with a persistent connection that loops Idle -> Executing -> Idle
  3. New protocol messages - Ready, Ack, NoJob, Job messages for job assignment over the persistent channel
  4. Retry/recovery mechanism - Pending result retry with bounded retries across reconnections
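The sans-IO shape described above can be sketched as a pure transition function that consumes inputs and emits effects without performing any I/O. All type and variant names here are illustrative, not the actual definitions in state_machine.rs:

```rust
// Hypothetical sketch of the sans-IO pattern: the state machine makes
// decisions; a separate driver performs the resulting effects.

#[derive(Debug, PartialEq)]
enum ChannelState {
    Disconnected,
    Idle,
    Executing,
}

#[derive(Debug)]
enum Input {
    Connected,
    JobAssigned,
    JobFinished,
    ConnectionFailed,
}

#[derive(Debug, PartialEq)]
enum Effect {
    SendReady,
    StartJob,
    SendResult,
    Reconnect,
}

struct StateMachine {
    state: ChannelState,
}

impl StateMachine {
    fn new() -> Self {
        Self { state: ChannelState::Disconnected }
    }

    // Pure transition function: no sockets, no timers, just decisions.
    fn step(&mut self, input: Input) -> Vec<Effect> {
        match (&self.state, input) {
            (ChannelState::Disconnected, Input::Connected) => {
                self.state = ChannelState::Idle;
                vec![Effect::SendReady]
            }
            (ChannelState::Idle, Input::JobAssigned) => {
                self.state = ChannelState::Executing;
                vec![Effect::StartJob]
            }
            (ChannelState::Executing, Input::JobFinished) => {
                self.state = ChannelState::Idle;
                vec![Effect::SendResult, Effect::SendReady]
            }
            (_, Input::ConnectionFailed) => {
                self.state = ChannelState::Disconnected;
                vec![Effect::Reconnect]
            }
            _ => vec![],
        }
    }
}

fn main() {
    let mut sm = StateMachine::new();
    assert_eq!(sm.step(Input::Connected), vec![Effect::SendReady]);
    assert_eq!(sm.step(Input::JobAssigned), vec![Effect::StartJob]);
    assert_eq!(
        sm.step(Input::JobFinished),
        vec![Effect::SendResult, Effect::SendReady]
    );
    println!("idle-executing-idle loop ok");
}
```

Because `step` is pure, the full Idle -> Executing -> Idle loop can be unit tested without opening a single socket, which is the testability win the review calls out.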

Positives

  • Excellent architectural separation: The sans-IO state machine pattern is textbook correct. All decision logic is pure and testable without any I/O, with the driver (run_driver) in mod.rs handling effects. This is a significant quality improvement.
  • Thorough test coverage: The state machine has comprehensive unit tests covering happy paths, edge cases (UUID mismatches, max retries), full protocol sequences, and effect ordering.
  • Robust reconnection semantics: The pending_result / in_flight lifecycle is well-designed with bounded retries (MAX_PENDING_RESULT_RETRIES = 3) to prevent unbounded message accumulation.
  • Race condition handling: Server-side uses conditional UPDATEs with status filters (execute_if_status, execute_if_either_status) to avoid TOCTOU races, which is correct for the concurrent environment.
  • Good security practices: Runner token validation, per-runner rate limiting, job ownership verification in lookup_runner_job, and OCI token TTL scoping.
  • Proper use of thiserror and strong types per CLAUDE.md guidelines.

Issues

Potential Bugs

  1. wait_for_job conflates timeout and connection errors (websocket.rs:284-287): The comment acknowledges this, but wait_for_job returns Err for both genuine timeouts and connection failures. The driver maps all Err to Input::ConnectionFailed, which means a simple poll timeout triggers a reconnect cycle instead of just sending NoJob. This could cause unnecessary reconnection churn under normal no-job conditions.
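One way to resolve the conflation is a three-way outcome type instead of a bare Result, so the driver can map a benign poll timeout to the NoJob path and reserve the reconnect cycle for real transport failures. The names below are hypothetical, not the actual API in websocket.rs:

```rust
// Hypothetical sketch: separate "no job yet" from "connection broke".

enum WaitOutcome<J> {
    Job(J),
    Timeout,        // no job arrived; stay connected and report NoJob
    Failed(String), // transport error; trigger the reconnect path
}

#[derive(Debug)]
enum Input {
    JobReceived(u32),
    NoJob,
    ConnectionFailed,
}

fn map_outcome(outcome: WaitOutcome<u32>) -> Input {
    match outcome {
        WaitOutcome::Job(job) => Input::JobReceived(job),
        WaitOutcome::Timeout => Input::NoJob,
        WaitOutcome::Failed(_) => Input::ConnectionFailed,
    }
}

fn main() {
    // A quiet poll window no longer tears down the connection.
    assert!(matches!(map_outcome(WaitOutcome::Timeout), Input::NoJob));
    assert!(matches!(
        map_outcome(WaitOutcome::Failed("reset".into())),
        Input::ConnectionFailed
    ));
    println!("timeout no longer maps to reconnect");
}
```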

  2. Heartbeat thread continues after ConnectionFailed during Executing (state_machine.rs:310-312): When ConnectionFailed occurs during Executing, the state machine transitions to Disconnected and reconnects, but the heartbeat thread in job.rs is still running and holding the WebSocket Arc<Mutex<JobChannel>>. The job continues executing while the state machine tries to reconnect. This seems intentional (the job should keep running), but after reconnect, the old ws reference is replaced at line 192 of mod.rs while the heartbeat thread still holds the old Arc. This could cause the heartbeat thread to send on a stale/closed connection.
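One possible fix for the stale reference is to swap the connection inside the shared handle rather than rebinding the handle itself, so every thread holding a clone of the Arc observes the replacement. This is a minimal sketch with hypothetical types, not the actual JobChannel from job.rs:

```rust
use std::sync::{Arc, Mutex};

// Hypothetical sketch: reconnect mutates *through* the existing
// Arc<Mutex<...>> instead of replacing the Arc, so a long-lived
// heartbeat thread always sees the current connection.

struct Conn {
    id: u32,
}

type SharedConn = Arc<Mutex<Conn>>;

fn reconnect(shared: &SharedConn, new_conn: Conn) {
    // All Arc clones, including the heartbeat thread's, observe this swap.
    *shared.lock().unwrap() = new_conn;
}

fn main() {
    let shared: SharedConn = Arc::new(Mutex::new(Conn { id: 1 }));
    let heartbeat_view = Arc::clone(&shared); // what the heartbeat thread holds
    reconnect(&shared, Conn { id: 2 });
    assert_eq!(heartbeat_view.lock().unwrap().id, 2); // no stale reference
    println!("heartbeat sees the new connection");
}
```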

  3. ReceiveTimeout during AwaitingTerminalAck retries on same connection (state_machine.rs:351-359): The first ACK timeout calls resolve_idle() which immediately resends the pending message. But the state machine doesn't emit Close + Connect first, so it resends on the same potentially-dead connection. The comment at line 348-350 says "if the connection is actually dead, the retry send will fail and produce ConnectionFailed", but there's a risk of the send succeeding while the server never processes it (half-open TCP).
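A defensive alternative, sketched below with hypothetical effect names, is to emit Close + Connect before the retry send, so the pending result always goes out on a fresh connection rather than a possibly half-open one:

```rust
// Hypothetical sketch: on ACK timeout, reconnect before resending the
// pending terminal message, bounded by a retry cap.

#[derive(Debug, PartialEq)]
enum Effect {
    Close,
    Connect,
    Resend,
}

fn on_ack_timeout(retries: u32, max_retries: u32) -> Vec<Effect> {
    if retries >= max_retries {
        // Give up: drop the pending result rather than retry forever.
        vec![Effect::Close]
    } else {
        // Fresh connection first, then the retry send.
        vec![Effect::Close, Effect::Connect, Effect::Resend]
    }
}

fn main() {
    assert_eq!(
        on_ack_timeout(0, 3),
        vec![Effect::Close, Effect::Connect, Effect::Resend]
    );
    assert_eq!(on_ack_timeout(3, 3), vec![Effect::Close]);
}
```

The trade-off is an extra reconnect on every ACK timeout, but it removes the half-open TCP window where a send "succeeds" without the server ever processing it.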

Code Quality

  1. std::mem::replace pattern in step() (state_machine.rs:177): Using mem::replace to take ownership of state is fine, but the temporary Disconnected sentinel could mask bugs if a handler forgets to set self.state. Consider using Option<ChannelState> with .take() and asserting self.state.is_some() at the end of step() to catch missing state assignments.
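The suggested `Option<ChannelState>` pattern might look like the following sketch (illustrative names only): a forgotten re-assignment now fails loudly instead of silently leaving a sentinel state behind.

```rust
// Hypothetical sketch of replacing the mem::replace sentinel with
// Option::take(), so a handler that forgets to set the state is caught.

#[derive(Debug)]
enum ChannelState {
    Idle,
    Executing,
}

struct StateMachine {
    state: Option<ChannelState>,
}

impl StateMachine {
    fn step(&mut self) {
        let state = self.state.take().expect("state already taken");
        // Every branch must put a state back...
        self.state = Some(match state {
            ChannelState::Idle => ChannelState::Executing,
            ChannelState::Executing => ChannelState::Idle,
        });
        // ...and this assertion catches any handler that forgot to.
        debug_assert!(self.state.is_some(), "handler failed to set state");
    }
}

fn main() {
    let mut sm = StateMachine { state: Some(ChannelState::Idle) };
    sm.step();
    assert!(matches!(sm.state, Some(ChannelState::Executing)));
    println!("state restored after step");
}
```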

  2. ServerMessage::Cancel handling inconsistency in wait_for_job (websocket.rs:102-105): Returns a Protocol error on Cancel during wait_for_job, but the variable name server_msg is shadowed by the match — the error message format string uses {server_msg:?} but server_msg has been moved into the match. This should still compile since it's the last use, but it's slightly confusing.

  3. Duplicated test_job_uuid / test_claimed_job helpers: The same UUID "550e8400-..." and similar JsonClaimedJob construction appears across state_machine.rs, websocket.rs, job.rs, and bencher_json/websocket.rs tests. Consider a shared test utilities module to reduce duplication.

Performance

  1. VecDeque effect processing clears on every input (mod.rs:155): effects.clear() is called before extending with new effects on every Input. If the state machine returns multiple effects and one produces an Input, the remaining effects are discarded. This is intentional (the state machine should handle all sequencing), but worth noting for correctness — it means effects must be ordered such that only the last one can produce an Input.

  2. Polling interval is hardcoded at 1 second (channel.rs:560): The server polls the database every 1 second during long-poll. For high-runner-count deployments, this could cause significant DB load. Consider a notification-based approach or at least making the interval configurable.
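Making the interval configurable could be as simple as the sketch below (hypothetical config type; the 1 s default matches the current hardcoded value):

```rust
use std::time::Duration;

// Hypothetical sketch: lift the hardcoded 1 s poll interval into a
// config struct so high-runner-count deployments can tune it.

struct ChannelConfig {
    poll_interval: Duration,
}

impl Default for ChannelConfig {
    fn default() -> Self {
        // Matches the current hardcoded behavior in channel.rs.
        Self { poll_interval: Duration::from_secs(1) }
    }
}

fn main() {
    let cfg = ChannelConfig::default();
    assert_eq!(cfg.poll_interval, Duration::from_secs(1));

    // A deployment that prefers lower latency over DB load.
    let tuned = ChannelConfig { poll_interval: Duration::from_millis(250) };
    assert!(tuned.poll_interval < cfg.poll_interval);
}
```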

Security

  1. Server-side wait_for_ready accepts terminal messages without rate limiting (channel.rs:924-966): A malicious runner could flood the server with Completed/Failed messages with different job UUIDs during the idle state, triggering DB lookups for each one. The lookup_runner_job function validates ownership, but the DB load could still be significant.
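A minimal fixed-window limiter, sketched below with hypothetical names, would cap how many terminal messages a runner can trigger DB lookups for per window:

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch: fixed-window rate limiter for terminal messages
// received while a runner sits in the idle state.

struct RateLimiter {
    window: Duration,
    max: u32,
    count: u32,
    window_start: Instant,
}

impl RateLimiter {
    fn new(max: u32, window: Duration) -> Self {
        Self { window, max, count: 0, window_start: Instant::now() }
    }

    // Returns true if this message may proceed to the DB lookup.
    fn allow(&mut self) -> bool {
        if self.window_start.elapsed() >= self.window {
            self.window_start = Instant::now();
            self.count = 0;
        }
        if self.count < self.max {
            self.count += 1;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut rl = RateLimiter::new(2, Duration::from_secs(60));
    assert!(rl.allow());
    assert!(rl.allow());
    assert!(!rl.allow()); // third message in the window is rejected
    println!("flood capped at 2 lookups per window");
}
```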

CLAUDE.md Compliance

  1. rand added to bencher_runner (Cargo.lock + Cargo.toml): Verify the version is declared in workspace Cargo.toml per the rule "All dependency versions go in the workspace Cargo.toml".

  2. #[expect(clippy::too_many_lines)] on runner_channel (channel.rs:744): The function is long but structurally simple (a loop with match arms). The suppress is properly using #[expect] not #[allow], which is correct per CLAUDE.md.

Minor Nits

  • channel.rs:313 has a TODO: Billing logic comment — should this be tracked in an issue?
  • The CloseReason variants (JobCompleted, JobFailed, etc.) are still defined but no longer used for close frames in the persistent channel (only used for heartbeat timeout). Consider documenting which are still in use.

Verdict

This is a well-architected refactoring with solid design decisions. The sans-IO pattern is the right call for testability, and the retry semantics are well thought through. The main concerns are around the wait_for_job timeout/error conflation (item 1) and the stale WebSocket reference during reconnect with an active heartbeat thread (item 2). Both should be addressed before merging.


Model: claude-opus-4-6

@github-actions

github-actions bot commented Mar 11, 2026

🐰 Bencher Report

Branch: u/ep/runner-channel
Testbed: ubuntu-22.04
| Benchmark | Latency (µs) | Result Δ% (Baseline µs) | Upper Boundary (µs) | Limit % |
| --- | --- | --- | --- | --- |
| Adapter::Json | 3.83 | +11.19% (3.44) | 4.62 | 82.84% |
| Adapter::Magic (JSON) | 3.83 | +11.67% (3.43) | 4.52 | 84.65% |
| Adapter::Magic (Rust) | 25.91 | +1.12% (25.62) | 31.03 | 83.49% |
| Adapter::Rust | 3.09 | +8.36% (2.85) | 3.35 | 92.31% |
| Adapter::RustBench | 3.09 | +8.36% (2.85) | 3.33 | 92.74% |
| head_version_insert/batch/10 | 108.37 | +8.32% (100.05) | 121.84 | 88.94% |
| head_version_insert/batch/100 | 249.43 | +5.09% (237.35) | 268.89 | 92.76% |
| head_version_insert/batch/255 | 471.85 | +2.30% (461.24) | 493.91 | 95.53% |
| head_version_insert/batch/50 | 168.72 | +5.09% (160.54) | 184.25 | 91.57% |
| threshold_query/join/10 | 160.13 | +10.93% (144.35) | 171.39 | 93.43% |
| threshold_query/join/20 | 176.24 | +11.10% (158.63) | 187.90 | 93.79% |
| threshold_query/join/5 | 149.23 | +9.55% (136.22) | 160.29 | 93.10% |
| threshold_query/join/50 | 220.12 | +9.95% (200.20) | 235.38 | 93.52% |
🐰 View full continuous benchmarking report in Bencher

@epompeii epompeii force-pushed the u/ep/runner-channel branch 3 times, most recently from a973c51 to 9dcac55 on March 13, 2026 at 04:55
@epompeii epompeii force-pushed the u/ep/runner-channel branch from 9dcac55 to e4f90aa on March 13, 2026 at 04:55
