Description
Summary
The agent heartbeat has multiple timing bugs that cause it to silently fall back to the 30-minute default interval instead of the configured 5-minute interval. Discovered via automated user-flow validation (pi-autoresearch).
Root Causes
1. SyncAgentBaseURL resets heartbeat on every obol sell http
```
obol sell http → EnsureTunnelForSell → SyncAgentBaseURL → helmfile sync
        ↓
ConfigMap re-rendered WITHOUT heartbeat config
        ↓
OpenClaw falls back to 30m default
```
Every `obol sell http` call triggers a helmfile sync that overwrites the `openclaw-config` ConfigMap. The Helm chart's `_helpers.tpl` does not render `agents.defaults.heartbeat` by default, so the heartbeat config is lost.
Fix: Added `patchHeartbeatAfterSync()`, which re-patches the ConfigMap with `every: "5m"` after each helmfile sync. Also added idempotency: the sync is skipped entirely when the tunnel URL hasn't changed.
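The idempotency guard amounts to a small state check before syncing. A minimal sketch (the `syncState` type and `ShouldSync` name are illustrative, not the actual code):

```go
package main

import "fmt"

// syncState remembers the last tunnel URL pushed through helmfile sync.
type syncState struct {
	lastTunnelURL string
}

// ShouldSync reports whether a helmfile sync is actually needed: only
// when the tunnel URL changed since the last sync. Skipping the sync
// avoids re-rendering the ConfigMap, which would drop the heartbeat.
func (s *syncState) ShouldSync(tunnelURL string) bool {
	if tunnelURL == s.lastTunnelURL {
		return false // unchanged URL: skip sync, heartbeat config survives
	}
	s.lastTunnelURL = tunnelURL
	return true
}

func main() {
	s := &syncState{}
	// prints: true false true
	fmt.Println(s.ShouldSync("https://a.example"), s.ShouldSync("https://a.example"), s.ShouldSync("https://b.example"))
}
```

Because the no-op path never touches helmfile, repeated `obol sell http` calls with the same tunnel URL can no longer clobber the heartbeat setting.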
2. Heartbeat not activated after `obol agent init`
`Init()` injects HEARTBEAT.md but doesn't ensure the ConfigMap has the heartbeat interval set. On a fresh cluster, the first `obol agent init` leaves the heartbeat at the chart default (30m or none).
Fix: Added `ensureHeartbeatActive()`, which reads the ConfigMap, checks for `agents.defaults.heartbeat`, and patches it if missing.
3. Chokidar hot-reload misses ConfigMap symlink swaps
Kubernetes updates ConfigMaps by swapping symlinks (`..data` → `..2026_03_19_...`). The chokidar file watcher inside OpenClaw uses inotify, which doesn't reliably detect symlink target changes. As a result, the pod starts with whatever config was present at boot, and ConfigMap patches applied later are silently ignored until the next pod restart.
Fix: After patching the heartbeat ConfigMap, perform a rollout restart so the new pod starts with the correct config loaded.
4. Incorrect pod restart on heartbeat patch (removed)
The old code restarted the pod after every heartbeat ConfigMap patch, even though OpenClaw's hot-reload should handle it. This caused unnecessary downtime during `obol agent init`.
Fix: Removed the pod restart from the patch path; the rollout restart in fix #3 handles the deterministic case.
Impact
Without these fixes, the heartbeat fires every 30 minutes instead of 5 minutes. This means:
- ServiceOffer reconciliation takes 30+ minutes instead of ~5 minutes
- Users see `obol sell status` stuck in a non-Ready state for extended periods
- The monetize guide tells users to "wait ~60s for agent heartbeat" but it actually takes 30 minutes
Files Changed
| File | Change |
|---|---|
| `internal/agent/agent.go` | `ensureHeartbeatActive()`, simplified `Init()` |
| `internal/tunnel/agent.go` | `patchHeartbeatAfterSync()`, idempotent sync |
| `internal/tunnel/tunnel.go` | Tunnel stop, storefront cleanup, state management |
| `internal/openclaw/openclaw.go` | Removed incorrect restart from heartbeat patch |
| `internal/stack/stack.go` | Backend detection for host IP resolution |
| `cmd/obol/sell.go` | Tunnel lifecycle (auto-start on sell, auto-stop on last delete) |
Verification
Validated by pi-autoresearch running 39 experiments with 90/90 flow steps passing, including:
- flow-06: `obol sell http` → poll `obol sell status` → all conditions Ready within 8 minutes
- flow-09: full lifecycle (sell → stop → delete → verify cleanup)
The heartbeat consistently fires within 5 minutes across all test runs.