
Move Bare Metal Runners over to long lived WebSocket#695

Open
epompeii wants to merge 12 commits into devel from
u/ep/runner-channel

Conversation

@epompeii
Member

This changeset moves from an HTTP long poll plus a per-job WebSocket connection to a single persistent WebSocket connection for each bare metal runner.

@github-actions

github-actions bot commented Mar 11, 2026

🤖 Claude Code Review

PR: #695
Base: devel
Head: u/ep/runner-channel
Commit: e4f90aaeefcc9efd2216d7fddf802fe5c483ecdb




PR Review: Runner WebSocket Channel State Machine

Summary

This PR introduces a major architectural refactoring of the runner-server communication protocol. The key changes are:

  1. Sans-IO state machine (state_machine.rs) - Pure decision logic separated from I/O for the runner WebSocket channel protocol
  2. Persistent WebSocket channel (channel.rs) - The server side replaces the per-job WebSocket with a persistent connection that loops Idle -> Executing -> Idle
  3. New protocol messages - Ready, Ack, NoJob, Job messages for job assignment over the persistent channel
  4. Retry/recovery mechanism - Pending result retry with bounded retries across reconnections
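The sans-IO shape described above can be sketched as a pure transition function that consumes inputs and emits effects without performing any I/O. All type and variant names here are illustrative, not the actual definitions in state_machine.rs:

```rust
// Hypothetical sketch of the sans-IO pattern: the state machine makes
// decisions; a separate driver performs the resulting effects.

#[derive(Debug, PartialEq)]
enum ChannelState {
    Disconnected,
    Idle,
    Executing,
}

#[derive(Debug)]
enum Input {
    Connected,
    JobAssigned,
    JobFinished,
    ConnectionFailed,
}

#[derive(Debug, PartialEq)]
enum Effect {
    SendReady,
    StartJob,
    SendResult,
    Reconnect,
}

struct StateMachine {
    state: ChannelState,
}

impl StateMachine {
    fn new() -> Self {
        Self { state: ChannelState::Disconnected }
    }

    // Pure transition function: no sockets, no timers, just decisions.
    fn step(&mut self, input: Input) -> Vec<Effect> {
        match (&self.state, input) {
            (ChannelState::Disconnected, Input::Connected) => {
                self.state = ChannelState::Idle;
                vec![Effect::SendReady]
            }
            (ChannelState::Idle, Input::JobAssigned) => {
                self.state = ChannelState::Executing;
                vec![Effect::StartJob]
            }
            (ChannelState::Executing, Input::JobFinished) => {
                self.state = ChannelState::Idle;
                vec![Effect::SendResult, Effect::SendReady]
            }
            (_, Input::ConnectionFailed) => {
                self.state = ChannelState::Disconnected;
                vec![Effect::Reconnect]
            }
            _ => vec![],
        }
    }
}

fn main() {
    let mut sm = StateMachine::new();
    assert_eq!(sm.step(Input::Connected), vec![Effect::SendReady]);
    assert_eq!(sm.step(Input::JobAssigned), vec![Effect::StartJob]);
    assert_eq!(
        sm.step(Input::JobFinished),
        vec![Effect::SendResult, Effect::SendReady]
    );
    println!("idle-executing-idle loop ok");
}
```

Because `step` is pure, the full Idle -> Executing -> Idle loop can be unit tested without opening a single socket, which is the testability win the review calls out.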

Positives

  • Excellent architectural separation: The sans-IO state machine pattern is textbook correct. All decision logic is pure and testable without any I/O, with the driver (run_driver) in mod.rs handling effects. This is a significant quality improvement.
  • Thorough test coverage: The state machine has comprehensive unit tests covering happy paths, edge cases (UUID mismatches, max retries), full protocol sequences, and effect ordering.
  • Robust reconnection semantics: The pending_result / in_flight lifecycle is well-designed with bounded retries (MAX_PENDING_RESULT_RETRIES = 3) to prevent unbounded message accumulation.
  • Race condition handling: Server-side uses conditional UPDATEs with status filters (execute_if_status, execute_if_either_status) to avoid TOCTOU races, which is correct for the concurrent environment.
  • Good security practices: Runner token validation, per-runner rate limiting, job ownership verification in lookup_runner_job, and OCI token TTL scoping.
  • Proper use of thiserror and strong types per CLAUDE.md guidelines.

Issues

Potential Bugs

  1. wait_for_job conflates timeout and connection errors (websocket.rs:284-287): The comment acknowledges this, but wait_for_job returns Err for both genuine timeouts and connection failures. The driver maps all Err to Input::ConnectionFailed, which means a simple poll timeout triggers a reconnect cycle instead of just sending NoJob. This could cause unnecessary reconnection churn under normal no-job conditions.
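One way to resolve the conflation is a three-way outcome type instead of a bare Result, so the driver can map a benign poll timeout to the NoJob path and reserve the reconnect cycle for real transport failures. The names below are hypothetical, not the actual API in websocket.rs:

```rust
// Hypothetical sketch: separate "no job yet" from "connection broke".

enum WaitOutcome<J> {
    Job(J),
    Timeout,        // no job arrived; stay connected and report NoJob
    Failed(String), // transport error; trigger the reconnect path
}

#[derive(Debug)]
enum Input {
    JobReceived(u32),
    NoJob,
    ConnectionFailed,
}

fn map_outcome(outcome: WaitOutcome<u32>) -> Input {
    match outcome {
        WaitOutcome::Job(job) => Input::JobReceived(job),
        WaitOutcome::Timeout => Input::NoJob,
        WaitOutcome::Failed(_) => Input::ConnectionFailed,
    }
}

fn main() {
    // A quiet poll window no longer tears down the connection.
    assert!(matches!(map_outcome(WaitOutcome::Timeout), Input::NoJob));
    assert!(matches!(
        map_outcome(WaitOutcome::Failed("reset".into())),
        Input::ConnectionFailed
    ));
    println!("timeout no longer maps to reconnect");
}
```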

  2. Heartbeat thread continues after ConnectionFailed during Executing (state_machine.rs:310-312): When ConnectionFailed occurs during Executing, the state machine transitions to Disconnected and reconnects, but the heartbeat thread in job.rs is still running and holding the WebSocket Arc<Mutex<JobChannel>>. The job continues executing while the state machine tries to reconnect. This seems intentional (the job should keep running), but after reconnect, the old ws reference is replaced at line 192 of mod.rs while the heartbeat thread still holds the old Arc. This could cause the heartbeat thread to send on a stale/closed connection.
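One possible fix for the stale reference is to swap the connection inside the shared handle rather than rebinding the handle itself, so every thread holding a clone of the Arc observes the replacement. This is a minimal sketch with hypothetical types, not the actual JobChannel from job.rs:

```rust
use std::sync::{Arc, Mutex};

// Hypothetical sketch: reconnect mutates *through* the existing
// Arc<Mutex<...>> instead of replacing the Arc, so a long-lived
// heartbeat thread always sees the current connection.

struct Conn {
    id: u32,
}

type SharedConn = Arc<Mutex<Conn>>;

fn reconnect(shared: &SharedConn, new_conn: Conn) {
    // All Arc clones, including the heartbeat thread's, observe this swap.
    *shared.lock().unwrap() = new_conn;
}

fn main() {
    let shared: SharedConn = Arc::new(Mutex::new(Conn { id: 1 }));
    let heartbeat_view = Arc::clone(&shared); // what the heartbeat thread holds
    reconnect(&shared, Conn { id: 2 });
    assert_eq!(heartbeat_view.lock().unwrap().id, 2); // no stale reference
    println!("heartbeat sees the new connection");
}
```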

  3. ReceiveTimeout during AwaitingTerminalAck retries on same connection (state_machine.rs:351-359): The first ACK timeout calls resolve_idle() which immediately resends the pending message. But the state machine doesn't emit Close + Connect first, so it resends on the same potentially-dead connection. The comment at line 348-350 says "if the connection is actually dead, the retry send will fail and produce ConnectionFailed", but there's a risk of the send succeeding while the server never processes it (half-open TCP).
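A defensive alternative, sketched below with hypothetical effect names, is to emit Close + Connect before the retry send, so the pending result always goes out on a fresh connection rather than a possibly half-open one:

```rust
// Hypothetical sketch: on ACK timeout, reconnect before resending the
// pending terminal message, bounded by a retry cap.

#[derive(Debug, PartialEq)]
enum Effect {
    Close,
    Connect,
    Resend,
}

fn on_ack_timeout(retries: u32, max_retries: u32) -> Vec<Effect> {
    if retries >= max_retries {
        // Give up: drop the pending result rather than retry forever.
        vec![Effect::Close]
    } else {
        // Fresh connection first, then the retry send.
        vec![Effect::Close, Effect::Connect, Effect::Resend]
    }
}

fn main() {
    assert_eq!(
        on_ack_timeout(0, 3),
        vec![Effect::Close, Effect::Connect, Effect::Resend]
    );
    assert_eq!(on_ack_timeout(3, 3), vec![Effect::Close]);
}
```

The trade-off is an extra reconnect on every ACK timeout, but it removes the half-open TCP window where a send "succeeds" without the server ever processing it.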

Code Quality

  1. std::mem::replace pattern in step() (state_machine.rs:177): Using mem::replace to take ownership of state is fine, but the temporary Disconnected sentinel could mask bugs if a handler forgets to set self.state. Consider using Option<ChannelState> with .take() and asserting self.state.is_some() at the end of step() to catch missing state assignments.
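The suggested `Option<ChannelState>` pattern might look like the following sketch (illustrative names only): a forgotten re-assignment now fails loudly instead of silently leaving a sentinel state behind.

```rust
// Hypothetical sketch of replacing the mem::replace sentinel with
// Option::take(), so a handler that forgets to set the state is caught.

#[derive(Debug)]
enum ChannelState {
    Idle,
    Executing,
}

struct StateMachine {
    state: Option<ChannelState>,
}

impl StateMachine {
    fn step(&mut self) {
        let state = self.state.take().expect("state already taken");
        // Every branch must put a state back...
        self.state = Some(match state {
            ChannelState::Idle => ChannelState::Executing,
            ChannelState::Executing => ChannelState::Idle,
        });
        // ...and this assertion catches any handler that forgot to.
        debug_assert!(self.state.is_some(), "handler failed to set state");
    }
}

fn main() {
    let mut sm = StateMachine { state: Some(ChannelState::Idle) };
    sm.step();
    assert!(matches!(sm.state, Some(ChannelState::Executing)));
    println!("state restored after step");
}
```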

  2. ServerMessage::Cancel handling inconsistency in wait_for_job (websocket.rs:102-105): Returns a Protocol error on Cancel during wait_for_job, but the variable name server_msg is shadowed by the match — the error message format string uses {server_msg:?} but server_msg has been moved into the match. This should still compile since it's the last use, but it's slightly confusing.

  3. Duplicated test_job_uuid / test_claimed_job helpers: The same UUID "550e8400-..." and similar JsonClaimedJob construction appears across state_machine.rs, websocket.rs, job.rs, and bencher_json/websocket.rs tests. Consider a shared test utilities module to reduce duplication.

Performance

  1. VecDeque effect processing clears on every input (mod.rs:155): effects.clear() is called before extending with new effects on every Input. If the state machine returns multiple effects and one produces an Input, the remaining effects are discarded. This is intentional (the state machine should handle all sequencing), but worth noting for correctness — it means effects must be ordered such that only the last one can produce an Input.

  2. Polling interval is hardcoded at 1 second (channel.rs:560): The server polls the database every 1 second during long-poll. For high-runner-count deployments, this could cause significant DB load. Consider a notification-based approach or at least making the interval configurable.
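Making the interval configurable could be as simple as the sketch below (hypothetical config type; the 1 s default matches the current hardcoded value):

```rust
use std::time::Duration;

// Hypothetical sketch: lift the hardcoded 1 s poll interval into a
// config struct so high-runner-count deployments can tune it.

struct ChannelConfig {
    poll_interval: Duration,
}

impl Default for ChannelConfig {
    fn default() -> Self {
        // Matches the current hardcoded behavior in channel.rs.
        Self { poll_interval: Duration::from_secs(1) }
    }
}

fn main() {
    let cfg = ChannelConfig::default();
    assert_eq!(cfg.poll_interval, Duration::from_secs(1));

    // A deployment that prefers lower latency over DB load.
    let tuned = ChannelConfig { poll_interval: Duration::from_millis(250) };
    assert!(tuned.poll_interval < cfg.poll_interval);
}
```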

Security

  1. Server-side wait_for_ready accepts terminal messages without rate limiting (channel.rs:924-966): A malicious runner could flood the server with Completed/Failed messages with different job UUIDs during the idle state, triggering DB lookups for each one. The lookup_runner_job function validates ownership, but the DB load could still be significant.
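A minimal fixed-window limiter, sketched below with hypothetical names, would cap how many terminal messages a runner can trigger DB lookups for per window:

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch: fixed-window rate limiter for terminal messages
// received while a runner sits in the idle state.

struct RateLimiter {
    window: Duration,
    max: u32,
    count: u32,
    window_start: Instant,
}

impl RateLimiter {
    fn new(max: u32, window: Duration) -> Self {
        Self { window, max, count: 0, window_start: Instant::now() }
    }

    // Returns true if this message may proceed to the DB lookup.
    fn allow(&mut self) -> bool {
        if self.window_start.elapsed() >= self.window {
            self.window_start = Instant::now();
            self.count = 0;
        }
        if self.count < self.max {
            self.count += 1;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut rl = RateLimiter::new(2, Duration::from_secs(60));
    assert!(rl.allow());
    assert!(rl.allow());
    assert!(!rl.allow()); // third message in the window is rejected
    println!("flood capped at 2 lookups per window");
}
```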

CLAUDE.md Compliance

  1. rand added to bencher_runner (Cargo.lock + Cargo.toml): Verify the version is declared in workspace Cargo.toml per the rule "All dependency versions go in the workspace Cargo.toml".

  2. #[expect(clippy::too_many_lines)] on runner_channel (channel.rs:744): The function is long but structurally simple (a loop with match arms). The suppress is properly using #[expect] not #[allow], which is correct per CLAUDE.md.

Minor Nits

  • channel.rs:313 has a TODO: Billing logic comment — should this be tracked in an issue?
  • The CloseReason variants (JobCompleted, JobFailed, etc.) are still defined but no longer used for close frames in the persistent channel (only used for heartbeat timeout). Consider documenting which are still in use.

Verdict

This is a well-architected refactoring with solid design decisions. The sans-IO pattern is the right call for testability, and the retry semantics are well thought through. The main concerns are around the wait_for_job timeout/error conflation (item 1) and the stale WebSocket reference during reconnect with an active heartbeat thread (item 2). Both should be addressed before merging.


Model: claude-opus-4-6

@github-actions

github-actions bot commented Mar 11, 2026

🐰 Bencher Report

Branch: u/ep/runner-channel
Testbed: ubuntu-22.04
| Benchmark | Latency (µs) | Result Δ% (Baseline µs) | Upper Boundary (µs) | Limit % |
| --- | --- | --- | --- | --- |
| Adapter::Json | 3.83 | +11.19% (3.44) | 4.62 | 82.84% |
| Adapter::Magic (JSON) | 3.83 | +11.67% (3.43) | 4.52 | 84.65% |
| Adapter::Magic (Rust) | 25.91 | +1.12% (25.62) | 31.03 | 83.49% |
| Adapter::Rust | 3.09 | +8.36% (2.85) | 3.35 | 92.31% |
| Adapter::RustBench | 3.09 | +8.36% (2.85) | 3.33 | 92.74% |
| head_version_insert/batch/10 | 108.37 | +8.32% (100.05) | 121.84 | 88.94% |
| head_version_insert/batch/100 | 249.43 | +5.09% (237.35) | 268.89 | 92.76% |
| head_version_insert/batch/255 | 471.85 | +2.30% (461.24) | 493.91 | 95.53% |
| head_version_insert/batch/50 | 168.72 | +5.09% (160.54) | 184.25 | 91.57% |
| threshold_query/join/10 | 160.13 | +10.93% (144.35) | 171.39 | 93.43% |
| threshold_query/join/20 | 176.24 | +11.10% (158.63) | 187.90 | 93.79% |
| threshold_query/join/5 | 149.23 | +9.55% (136.22) | 160.29 | 93.10% |
| threshold_query/join/50 | 220.12 | +9.95% (200.20) | 235.38 | 93.52% |
🐰 View full continuous benchmarking report in Bencher

@epompeii epompeii force-pushed the u/ep/runner-channel branch 3 times, most recently from a973c51 to 9dcac55 on March 13, 2026 at 04:55
@epompeii epompeii force-pushed the u/ep/runner-channel branch from 9dcac55 to e4f90aa on March 13, 2026 at 04:55
