fix: heartbeat timing bugs — reset on sell, missing activation, ConfigMap race #280

@bussyjd

Summary

The agent heartbeat has multiple timing bugs that cause it to silently fall back to the 30-minute default interval instead of the configured 5-minute interval. Discovered via automated user-flow validation (pi-autoresearch).

Root Causes

1. SyncAgentBaseURL resets heartbeat on every obol sell http

obol sell http → EnsureTunnelForSell → SyncAgentBaseURL → helmfile sync
                                                              ↓
                                              ConfigMap re-rendered WITHOUT heartbeat config
                                                              ↓
                                              OpenClaw falls back to 30m default

Every obol sell http call triggers a helmfile sync that overwrites the openclaw-config ConfigMap. The Helm chart's _helpers.tpl does not render agents.defaults.heartbeat by default, so the heartbeat config is lost.

Fix: Added patchHeartbeatAfterSync() — re-patches the ConfigMap with every: "5m" after each helmfile sync. Also added idempotency: skip sync entirely when the tunnel URL hasn't changed.
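
A minimal sketch of the control flow this fix describes, not the actual code in internal/tunnel/agent.go. The syncer type and its Sync and PatchHeartbeat fields are hypothetical stand-ins for the helmfile sync and for patchHeartbeatAfterSync(); only the ordering (skip when unchanged, otherwise sync and then re-patch) comes from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// syncer bundles the two side effects. Both function fields are hypothetical
// stand-ins: Sync for the helmfile sync, PatchHeartbeat for
// patchHeartbeatAfterSync().
type syncer struct {
	lastURL        string
	Sync           func(url string) error
	PatchHeartbeat func(every string) error
}

// SyncAgentBaseURL reports whether a sync actually ran.
func (s *syncer) SyncAgentBaseURL(url, every string) (bool, error) {
	if url == "" {
		return false, errors.New("empty tunnel URL")
	}
	// Idempotency: nothing to do when the tunnel URL is unchanged.
	if url == s.lastURL {
		return false, nil
	}
	if err := s.Sync(url); err != nil {
		return false, fmt.Errorf("helmfile sync: %w", err)
	}
	// The sync re-renders the ConfigMap without the heartbeat block,
	// so restore it before recording the URL as synced.
	if err := s.PatchHeartbeat(every); err != nil {
		return true, fmt.Errorf("patch heartbeat: %w", err)
	}
	s.lastURL = url
	return true, nil
}

func main() {
	calls := 0
	s := &syncer{
		Sync:           func(string) error { calls++; return nil },
		PatchHeartbeat: func(string) error { return nil },
	}
	for _, u := range []string{"https://a.example", "https://a.example", "https://b.example"} {
		ran, err := s.SyncAgentBaseURL(u, "5m")
		fmt.Println(u, ran, err)
	}
	fmt.Println("syncs:", calls)
}
```

In this sketch the URL is only recorded after the heartbeat patch succeeds, so a failed patch is retried on the next call instead of being skipped by the idempotency check.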

2. Heartbeat not activated after obol agent init

Init() injects HEARTBEAT.md but doesn't ensure the ConfigMap has the heartbeat interval set. On a fresh cluster, the first obol agent init leaves the heartbeat at the chart default (30m or none).

Fix: Added ensureHeartbeatActive() — reads the ConfigMap, checks for agents.defaults.heartbeat, patches if missing.
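
The check-then-patch step might look roughly like the sketch below. It assumes the ConfigMap carries the OpenClaw config as a JSON document with the key path agents.defaults.heartbeat and an every field, as named in this issue; the function name ensureHeartbeat and the helper child are hypothetical, and reading or writing the ConfigMap itself is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ensureHeartbeat sets agents.defaults.heartbeat.every when the heartbeat
// object is absent. It returns the (possibly updated) document and whether a
// patch is needed. The JSON layout is an assumption based on the key path
// named in the issue.
func ensureHeartbeat(raw []byte, every string) ([]byte, bool, error) {
	cfg := map[string]any{}
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, false, fmt.Errorf("parse config: %w", err)
	}
	if cfg == nil { // the document was a literal null
		cfg = map[string]any{}
	}
	defaults := child(child(cfg, "agents"), "defaults")
	if _, ok := defaults["heartbeat"].(map[string]any); ok {
		return raw, false, nil // already configured, leave untouched
	}
	defaults["heartbeat"] = map[string]any{"every": every}
	out, err := json.Marshal(cfg)
	return out, true, err
}

// child returns parent[key] as an object, creating it when it is missing or
// has another type.
func child(parent map[string]any, key string) map[string]any {
	if m, ok := parent[key].(map[string]any); ok {
		return m
	}
	m := map[string]any{}
	parent[key] = m
	return m
}

func main() {
	in := []byte(`{"agents":{"defaults":{"model":"x"}}}`)
	out, changed, err := ensureHeartbeat(in, "5m")
	fmt.Println(string(out), changed, err)
}
```

Returning a changed flag lets the caller skip the write (and any follow-up restart) when the heartbeat is already present.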

3. Chokidar hot-reload misses ConfigMap symlink swaps

Kubernetes updates a mounted ConfigMap by atomically swapping a symlink (..data → ..2026_03_19_...) rather than rewriting the files in place. The chokidar file watcher inside OpenClaw uses inotify, which doesn't reliably detect symlink target changes. Result: the pod keeps whatever config was present at boot, and ConfigMap patches applied later are silently ignored until the next pod restart.

Fix: After patching the heartbeat ConfigMap, perform a rollout restart to ensure the new pod starts with the correct config loaded.

4. Incorrect pod restart on heartbeat patch (removed)

The old code restarted the pod after every heartbeat ConfigMap patch, even though OpenClaw's hot-reload should handle it. This caused unnecessary downtime during obol agent init.

Fix: Removed the pod restart from the patch path. The rollout restart in fix #3 handles the deterministic case.

Impact

Without these fixes, the heartbeat fires every 30 minutes instead of 5 minutes. This means:

  • ServiceOffer reconciliation takes 30+ minutes instead of ~5 minutes
  • Users see obol sell status stuck in non-Ready state for extended periods
  • The monetize guide tells users to "wait ~60s for agent heartbeat", but the wait can actually be up to 30 minutes

Files Changed

File                            Change
internal/agent/agent.go         ensureHeartbeatActive(), simplified Init()
internal/tunnel/agent.go        patchHeartbeatAfterSync(), idempotent sync
internal/tunnel/tunnel.go       Tunnel stop, storefront cleanup, state management
internal/openclaw/openclaw.go   Removed incorrect restart from heartbeat patch
internal/stack/stack.go         Backend detection for host IP resolution
cmd/obol/sell.go                Tunnel lifecycle (auto-start on sell, auto-stop on last delete)

Verification

Validated by pi-autoresearch running 39 experiments with 90/90 flow steps passing, including:

  • flow-06: obol sell http → poll obol sell status → all conditions Ready within 8 minutes
  • flow-09: full lifecycle (sell → stop → delete → verify cleanup)

The heartbeat consistently fires within 5 minutes across all test runs.

Labels: bug, x402
