Skip to content

docs: CI performance and warm Docker CI research#613

Open
paddymul wants to merge 230 commits intomainfrom
docs/ci-research
Open

docs: CI performance and warm Docker CI research#613
paddymul wants to merge 230 commits intomainfrom
docs/ci-research

Conversation

@paddymul
Copy link
Collaborator

@paddymul paddymul commented Mar 1, 2026

Summary

  • CI-performance.md: Analysis of current Depot CI — latency breakdown, runner tier comparison (2/4/8 CPU), per-job timing, path-gated optimization proposals
  • warm-docker-ci.md: Research into replacing Depot with a persistent Hetzner server running warm Docker containers — framework comparison, Dockerfile structure, sidecar pattern, CPU contention analysis, Hetzner Cloud vs Dedicated, provisioning automation

Context

Research/brainstorming docs, no code changes. Captures findings for future reference when implementing a faster CI setup.

🤖 Generated with Claude Code

Research into current Depot CI performance (latency breakdown, runner
tier comparison, path-gated optimizations) and a proposed warm Docker
CI setup on Hetzner (sidecar containers, lockfile-hash caching,
Playwright parallelism, Cloud vs Dedicated comparison).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link

github-actions bot commented Mar 1, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.12.dev22721997915

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.12.dev22721997915

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.12.12.dev22721997915" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 18a7fbd4de

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +150 to +152
pytest -vv tests/unit/ &
(cd packages/buckaroo-js-core && pnpm test) &
wait

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Propagate background test failures in trigger script

This script backgrounds both test commands and then calls wait with no job IDs; in Bash, help wait states that when no IDs are given it waits for all children and returns status 0, so a failing pytest or pnpm test can still produce an overall success status. In the CI-trigger context here, that creates false-green runs and can let broken commits appear healthy.

Useful? React with 👍 / 👎.


# 1. Activate rescue system (~5s API call)
curl -s -u "$AUTH" "$API/boot/$SERVER_NUM/rescue" \
-d "os=linux&authorized_key[]=$SSH_FINGERPRINT"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Define SSH key variable before invoking Robot rescue API

The rebuild script uses authorized_key[]=$SSH_FINGERPRINT but never initializes SSH_FINGERPRINT, so running the snippet as written sends an empty key and the later SSH wait loops cannot authenticate to the rescue system. This makes the documented wipe/reprovision flow fail unless callers add hidden external setup.

Useful? React with 👍 / 👎.

paddymul and others added 2 commits March 1, 2026 13:23
- Pin uv/node/pnpm versions (don't track releases, bump when needed)
- Bump Node 20 → 22 LTS
- Add HETZNER_SERVER_ID/IP to .env.example
- Add development verification section (how Claude tests each script locally)
- Add monitoring & alerting section (health endpoint, systemd watchdog, disk hygiene, dead man's switch)
- Expand testing & ongoing verification (Depot as canary, deprecation criteria)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
paddymul and others added 3 commits March 1, 2026 13:55
Adds ci/hetzner/ with everything needed to run CI on a persistent CCX33:

- Dockerfile: Ubuntu 24.04, uv 0.6.6, Python 3.11-3.14, Node 22 LTS,
  pnpm 9.10.0, all deps pre-installed, Playwright chromium
- docker-compose.yml: warm sidecar container (sleep infinity), bind-mounts
  repo + logs, named volume for Playwright browsers
- webhook.py: Flask on :9000, HMAC-SHA256, per-branch cancellation via
  pkill, /health + /logs/<sha> endpoints, systemd watchdog
- run-ci.sh: 5-phase orchestrator (parallel lint+test-js+test-py-3.13 →
  build-wheel → sequential py 3.11/3.12/3.14 → parallel mcp+smoke →
  sequential playwright) with lockfile-aware dep skipping
- lib/status.sh: GitHub commit status API helpers
- lib/lockcheck.sh: SHA256 lockfile comparison, rebuilds deps only on change
- cloud-init.yml: one-shot CCX33 provisioning
- .env.example: template for required secrets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add lib/status.sh (GitHub commit status API) and lib/lockcheck.sh
(lockfile hash comparison for warm dep skipping). Unblock them from
the lib/ gitignore rule which was intended for Python venv dirs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove owner:ci:ci from write_files (ci user doesn't exist yet at that stage)
- Fix echo runcmd entry with colon causing YAML dict parse error
- status.sh: skip GitHub API calls gracefully when GITHUB_TOKEN unset

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…it branch fix

- Add build-essential + libffi-dev + libssl-dev so cffi can compile
- cloud-init: clone --branch main (not default), add safe.directory

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e unused import

- Dockerfile: git config --system safe.directory /repo so git checkout works
  inside the container (bind-mount owned by ci on host, root in container)
- test_playwright_jupyter.sh: add --allow-root so JupyterLab starts as root
- webhook.py: remove unused import signal

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… SHA

Dockerfile COPYs ci/hetzner/run-ci.sh and lib/ into /opt/ci-runner/.
run-ci.sh sources lib from CI_RUNNER_DIR (/opt/ci-runner/) instead of
/repo/ci/hetzner/lib/, so they survive `git checkout <sha>` even when
the SHA has no ci/hetzner/ directory (e.g. commits on main branch).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
job_lint_python was running uv sync --dev --no-install-project on the 3.13
venv, which strips --all-extras packages (e.g. pl-series-hash) because
optional extras require the project to be installed. This ran in parallel
with job_test_python_3.13, causing a race condition that randomly removed
pl-series-hash from the venv before tests ran.

ruff is already installed in the venv from the image build — no sync needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
JupyterLab refuses to start as root without --allow-root. Rather than
patching every test script, bake c.ServerApp.allow_root = True into
/root/.jupyter/jupyter_lab_config.py in the image.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- mp_timeout tests: forkserver subprocess spawn takes >1s in Docker (timeout)
- test_server_killed_on_parent_death: SIGKILL propagation differs in containers
- Python 3.14.0a5: segfaults on pytest startup (CPython pre-release bug)

All three disabled with a note to revisit once timing/stability is known.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents all 9 bugs fixed during bringup, known Docker-incompatible
tests (disabled), and final timing: 8m59s wall time, all jobs passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Each version has its own venv at /opt/venvs/3.11-3.14 — no shared
state, safe to run concurrently. Saves ~70-80s wall time on CCX33.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Run 7 (warm, sequential Phase 3): 8m23s
Run 8 (warm, parallel Phase 3): 7m21s — saves 1m07s

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All 5 jobs bind to distinct ports (6006/8701/2718/8765/8889) — no
port conflicts. Redirect PLAYWRIGHT_HTML_OUTPUT_DIR per job to avoid
playwright-report/ write collisions. Expected saving: ~3m.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
paddymul and others added 30 commits March 4, 2026 11:08
New research doc documenting the investigation and fix:
- PARALLEL=5 caused batch 2 to reuse JupyterLab servers from batch 1
- Kernels on reused servers never reach idle from browser perspective
- Fix: PARALLEL=9 gives each notebook a dedicated server
- 4/4 b2b runs pass on VX1 32C

Updated ci-tuning-experiments.md and vx1-kernel-flakiness.md to mark
the VX1 blocker as resolved and debunk the hardware/version hypotheses.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Exp 53: overlap pw-marimo, pw-server, pw-wasm-marimo with pw-jupyter
(staggered 2s apart). Previously serialized after pw-jupyter due to
suspected CPU contention — debunked by cross-size testing (8C/16C/32C
all pass at P=9).

Add cross-size validation results to batch-reuse-fix.md:
- VX1 8C ($175/mo): ALL PASS, pw-jupyter 47s
- VX1 16C ($350/mo): 4/4 b2b, pw-jupyter 47s
- VX1 32C ($701/mo): 4/4 b2b, pw-jupyter 47s
pw-jupyter perf identical across all sizes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 2s stagger was added when batch server reuse was the root cause.
With P=9 each notebook gets a dedicated server — stagger may no longer
be needed. Previous 0s attempts failed due to server reuse, not
contention.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Reuse /opt/venvs/3.13 (Docker-built) instead of creating a fresh venv
  every run. Saves ~5s of uv venv + pip install.
- Poll all 9 JupyterLab servers in parallel (was sequential with sleep 1).
  Saves ~8s of serial polling.
- 0s stagger confirmed working with P=9 fix (2 runs, all pass).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Current best: ~1m07s total on VX1 16C (warm cache).
- 0s stagger: pw-jupyter 48s → 36s
- Warmup: 20s → 10s (reuse Docker venv + parallel polling)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
build-wheel was synchronous, blocking warmup → pw-jupyter. Now runs
async alongside test-js and storybook, with elevated priority (nice -10)
so it finishes faster. Wait for both wheel + warmup before proceeding.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add tmpfs mount at /ramdisk (10GB) in docker-compose.yml. At CI start,
rsync repo to ramdisk (excluding .git) and run all builds/tests from RAM.
Fallback to /repo if ramdisk not available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
rsync not in container image. Use tar cf/xf pipe instead.
Replace all hardcoded `cd /repo` with `cd "$REPO_DIR"` so jobs run
on ramdisk when available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Docker defaults tmpfs to noexec — add exec mount option so esbuild
binary can run. Set PNPM_CONFIG_PACKAGE_IMPORT_METHOD=copy to handle
cross-filesystem hardlinks (pnpm store on named volume, repo on tmpfs).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy the 375MB pnpm store to tmpfs alongside the repo. Both on the
same filesystem means pnpm can hardlink instead of copying 751 packages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The /ramdisk/repo path breaks the toolchain: editable Python install,
anywidget static paths, and pnpm hardlinks all assume /repo. Reverting
to try host-level tmpfs mount instead (same /repo path, zero code changes).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mount both repo and pnpm store from a single host tmpfs at
/opt/ci/ramdisk. Container sees /repo and /opt/pnpm-store on the
same filesystem — pnpm can hardlink, all I/O in RAM, zero path
changes in run-ci.sh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Host-level tmpfs saves 4s on wheel install but total CI time unchanged
(1m06s). Critical path is CPU-bound, not I/O-bound. iowait drops from
9.7%→8.8% mean but doesn't affect wallclock. Complexity (host mount,
pnpm store duplication, reboot fragility) not justified.

Reverted run-ci.sh and docker-compose.yml to last good disk state (2f44b86).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds --fast-fail flag that skips launching subsequent waves when a gate
job fails. Useful for fast iteration during development. Default off
so webhook/CI gets full results.

Gate points: after build-js (skip all downstream), after build-wheel
(skip wheel-dependent jobs).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…h test-js

build-wheel and test-js both start after build-js. full_build.sh was running
pnpm install unconditionally (line 30) which "Recreates" node_modules,
destroying them while test-js is reading them. Skip if node_modules exists.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Exp 55: --only=job1,job2 runs only listed jobs; --skip=job1,job2 skips them.
Dependencies not auto-resolved for simplicity. Skipped jobs log SKIP.

Also reduced CI_TIMEOUT from 240s to 180s per user request.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pkill -f 'marimo' matches args like --skip=playwright-wasm-marimo,
killing the CI script during its own cleanup phase. Use pgrep + grep -v
to exclude the current PID.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Exp 54 (fast-fail): --fast-fail flag + pnpm install race fix + ci_pkill fix
Exp 55 (--only/--skip): job filtering with self-kill protection
Exp 56 (GH CI): already passing, no action needed

Current best: 51s with --skip (4 low-value jobs), ~1m10s full run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3/3 synth commits: all 5 Playwright tests pass, failures only in
old app code (jest-util missing) and flaky timing assertions.
CI infrastructure validated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- --first-jobs/--first-testcases for phased CI runs (run priority jobs first)
- --only-testcases for pytest -k / PW --grep filtering
- --only/--skip renamed to --only-jobs/--skip-jobs (backward compat aliases)
- STAGGER_DELAY, DISABLE_RENICE, PYTEST_WORKERS env var overrides
- Extract run_dag() for two-phase execution
- PW_GREP support in all 5 Playwright test scripts
- maybe_renice() wrapper for Exp 60 A/B testing
- New scripts: tuning-sweep.sh, analyze-gh-failures.sh, test-renice.sh, compare-gh-hetzner.sh
- create-merge-commits.sh: --set=new with 50 deeper SHAs + skip-existing
- stress-test.sh: --set=new support
- Archive Exp 52-56, 58 to archive file

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mapping from create-merge-commits.sh --set=new, branches pushed
to origin as synth/*. Enables --set=new for stress testing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Exp 59 results: 80% GH CI pass rate on main, 5/8 failures are Release
workflow issues. Only 2 real test failures (pw-server, pw-marimo).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3+3 A/B test: pw-jupyter 35-37s with or without renice.
Failures are unrelated (flaky pytest timing, b2b pw-jupyter timeout).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Force-install pytest-xdist after uv sync so `-n 4 --dist load`
  works even on old commits that don't have it in their lockfile.
- Wipe packages/node_modules in rebuild_deps before pnpm install
  so switching between commits with different pnpm-lock.yaml files
  doesn't leave a corrupted/mixed node_modules state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stale symlinks in packages/js/node_modules/ and packages/buckaroo-js-core/node_modules/
point to old .pnpm paths after lockfile change, causing pnpm to attempt concurrent
recreation -> ENOTEMPTY race between build-wheel and test-js.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
build-js uses --store-dir /opt/pnpm-store, updating .modules.yaml storeDir.
full_build.sh's pnpm run commands have no --store-dir, so pnpm sees a store
mismatch and re-links node_modules concurrently with test-js reading it.

Exporting npm_config_store_dir makes all pnpm commands inherit the same
store, eliminating the race condition.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Old commits don't have tests/unit/server/test_mcp_uvx_install.py.
pytest exits 5 (no tests collected) which we treated as failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All 50 commits fail (expected: old code + new tests). Infrastructure
stable after 4 b2b fixes: pnpm store-dir mismatch, xdist missing,
node_modules ENOTEMPTY race, test-mcp-wheel false positive.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Exp 57: P<9 always times out (120s). Stagger has zero effect on pass
rate. P=9 failures are all test-python-3.13 timing flake under B2B load.
STAGGER=0 is safe to use.

Exp 62: pytest workers=8 saves 3s but triggers timing flake. Not worth it.

Exp 64: tsgo/vitest — test-js drops from ~4s to 2s, no regressions.
Branch ready to merge on clean run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant