docs: partial recovery — add guide for host failure recovery and orphaned instance cleanup #261
base: main
Conversation
📝 Walkthrough

This pull request introduces partial recovery functionality for handling host failures in quorum-intact clusters. It adds comprehensive recovery documentation, implements a new CleanupInstance activity to remove orphaned database instances, integrates cleanup into host removal workflows, and enhances Docker task filtering for ready nodes.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 179-183: Replace the placeholder comments with concrete psql
commands that show how to insert test data on the primary and verify it on the
replica; for example, add a psql INSERT on n1 targeting the example database and
test_table (e.g., INSERT INTO test_table (id, value) VALUES (...)) and a psql
SELECT on n2 to confirm the replicated row (e.g., SELECT * FROM test_table WHERE
value = 'recovery_test'); ensure you include host/port/user flags (-h, -p, -U)
and use the same database name (example) and table name (test_table) as shown in
the suggested improvement so readers can run the steps end-to-end.
- Around line 413-416: Replace the placeholder comments "# Insert on n1" and "#
Verify on n3" with concrete psql commands: on the primary node run a psql
command to INSERT a test row into the example database (e.g., use psql -h host-1
-p 5432 -U admin -d example -c "INSERT INTO test_table (id, value) VALUES
(...);"), and on the restored node run a psql SELECT to verify the row exists
(e.g., psql -h host-3 -p 5432 -U admin -d example -c "SELECT * FROM test_table
WHERE value = '...';"); ensure the SQL matches the test_table schema and uses a
unique value (e.g., 'full_recovery_test') so the verification step is
unambiguous.
In `@server/internal/orchestrator/swarm/postgres_service.go`:
- Around line 171-185: The timeout branch currently falls through and lets
ServiceRemove proceed even though scale-down may not have completed; update the
isTimeoutError(err) case in the same function (the scale-down/ServiceRemove
sequence) to return a descriptive error instead of continuing, e.g., return an
error indicating the scale-down timed out (wrap the original err), so callers
must handle the timeout (until the TODO RemovingHosts/context distinction is
implemented); reference isTimeoutError, the scale-down logic, and the subsequent
ServiceRemove call when making the change.
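For illustration, a minimal sketch of the control flow this comment asks for: a scale-down timeout aborts the removal instead of falling through to ServiceRemove. The helper names here (scaleDown, serviceRemove, and the injected isTimeoutError) are placeholders, not the actual postgres_service.go code.

```go
package swarm

import "fmt"

// removeService sketches the desired ordering: if the scale-down times out,
// return a descriptive, wrapped error rather than continuing to removal.
func removeService(scaleDown, serviceRemove func() error, isTimeoutError func(error) bool) error {
	if err := scaleDown(); err != nil {
		if isTimeoutError(err) {
			return fmt.Errorf("scale-down timed out before service removal: %w", err)
		}
		return err
	}
	return serviceRemove()
}
```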
🧹 Nitpick comments (1)
server/internal/workflows/activities/cleanup_instance.go (1)
27-31: `MaxAttempts: 1` means no retries on transient failures.

With `MaxAttempts: 1`, any transient failure (network blip, etcd leader election) will fail the cleanup permanently. Consider allowing a small number of retries with backoff for resilience.

Optional: allow limited retries
    options := workflow.ActivityOptions{
        Queue: utils.HostQueue(a.Config.HostID),
        RetryOptions: workflow.RetryOptions{
    -       MaxAttempts: 1,
    +       MaxAttempts: 3,
        },
    }
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 271-273: Fix the typo and endpoint mismatches: remove the extra
slash in the curl invocation used to set JOIN_TOKEN so it calls
"/v1/cluster/join-token" (the line containing JOIN_TOKEN="$(curl
http://host-1:3000//v1/cluster/join-token)") and update the summary table
entries so they exactly match the routes used earlier—use
"/v1/cluster/join-token" for fetching the join token and "/v1/cluster/join" for
joining the cluster (also correct the same occurrences noted around lines
388-390).
In `@server/internal/docker/docker.go`:
- Around line 424-449: The checkNodesUnavailable function currently treats only
NodeStateDown and NodeStateUnknown as unavailable; update the availability check
in Docker.checkNodesUnavailable to also treat swarm.NodeStateDisconnected as
unavailable (i.e., include node.Status.State == swarm.NodeStateDisconnected in
the conditional alongside NodeStateDown and NodeStateUnknown) and update the
function comment to reflect "(status 'down', 'unknown' or 'disconnected')";
ensure you reference the NodeList call and the loop over nodes in
checkNodesUnavailable when making the change.
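For illustration, a hedged sketch of the widened availability predicate this comment describes; the real checkNodesUnavailable also lists nodes via the Docker API and filters to the stuck node IDs, which is omitted here.

```go
package docker

import "github.com/docker/docker/api/types/swarm"

// nodeUnavailable reports whether a swarm node should be treated as
// unreachable: status down, unknown, or disconnected.
func nodeUnavailable(n swarm.Node) bool {
	switch n.Status.State {
	case swarm.NodeStateDown, swarm.NodeStateUnknown, swarm.NodeStateDisconnected:
		return true
	default:
		return false
	}
}
```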
🧹 Nitpick comments (1)
docs/disaster-recovery/partial-recovery.md (1)
131-136: Add language identifiers to output blocks (markdownlint MD040).

The example output fences are missing a language; adding `text` (or `console`) after the opening backticks will satisfy MD040.

Example tweak, for the affected output blocks:

    ID                          HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
    4aoqjp3q8jcny4kec5nadcn6x * lima-host-1   Ready   Active        Leader
    959g9937i62judknmr40kcw9r   lima-host-2   Ready   Active        Reachable
    l0l51d890edg3f0ccd0xppw06   lima-host-3   Down    Active        Unreachable

    docker swarm join --token SWMTKN-1-xxx...xxx 10.0.0.1:2377

Also applies to: 222-224
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@server/internal/docker/docker.go`:
- Around line 424-432: The current loop swallows errors from
checkNodesUnavailable by continuing, which hides Docker API failures; change the
behavior in the block that calls d.checkNodesUnavailable(ctx, stuckNodeIDs) to
return the error (or wrap it with context) instead of continue so upstream sees
the real failure; update the call site that currently checks unavailable, err :=
d.checkNodesUnavailable(...) to handle err by returning fmt.Errorf("checking
node availability: %w", err) (or similar) and keep the existing
ErrNodeUnavailable return path when unavailable is true.
🧹 Nitpick comments (1)
docs/disaster-recovery/partial-recovery.md (1)
131-136: Add language identifiers to fenced output blocks (MD040).

Use `text` (or similar) to satisfy linting.

📝 Example fix: add `text` after the opening backticks of the example output block:

    ID                          HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
    4aoqjp3q8jcny4kec5nadcn6x * lima-host-1   Ready   Active        Leader
    959g9937i62judknmr40kcw9r   lima-host-2   Ready   Active        Reachable
    l0l51d890edg3f0ccd0xppw06   lima-host-3   Down    Active        Unreachable

Also applies to: 222-224
server/internal/docker/docker.go
Outdated
    // If we have tasks that should be stopping, check if their nodes are unavailable
    if len(stuckNodeIDs) > 0 {
        unavailable, err := d.checkNodesUnavailable(ctx, stuckNodeIDs)
        if err != nil {
            continue
        }
        if unavailable {
            return fmt.Errorf("%w: tasks cannot stop because their nodes are down", ErrNodeUnavailable)
        }
Don’t swallow node-list errors.
Continuing on checkNodesUnavailable errors hides persistent Docker API failures and turns them into a timeout instead of a root-cause error.
🔧 Suggested fix
    - if err != nil {
    -     continue
    - }
    + if err != nil {
    +     return fmt.Errorf("failed to check node availability: %w", err)
    + }

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    // If we have tasks that should be stopping, check if their nodes are unavailable
    if len(stuckNodeIDs) > 0 {
        unavailable, err := d.checkNodesUnavailable(ctx, stuckNodeIDs)
        if err != nil {
            return fmt.Errorf("failed to check node availability: %w", err)
        }
        if unavailable {
            return fmt.Errorf("%w: tasks cannot stop because their nodes are down", ErrNodeUnavailable)
        }
server/internal/docker/docker.go
Outdated
    }

    // Build a set of node IDs for tasks that should be stopping but aren't
    stuckNodeIDs := make(map[string]struct{})
Thanks for pinning down how this times out. I think we can improve this change a little bit. The way this is written, we're always going to call checkNodesUnavailable, so there's little point in making it conditional.
Instead, maybe we could add a method that returns a ds.Set of all node IDs where Status.State == swarm.NodeStateReady, like:
    func (d *Docker) readyNodeIDs(ctx context.Context) (ds.Set[string], error)

And call it up here before we loop over the tasks. Then, we change our loop to skip tasks that aren't on a ready node:

    for _, t := range tasks {
        if !readyNodes.Has(t.NodeID) {
            continue
        }
        // ...
    }

That way, they don't count toward our running or stopping totals. We'll still time out if we're trying to run a task on an unavailable node, which is appropriate, and we'll exit early without having to return an error if we're trying to stop a task on an unavailable node. Then, you won't need the sentinel error or the empty case in postgres_service.go.
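For reference, a minimal sketch of the helper described above, assuming the standard moby client API; a plain map stands in for the repo's ds.Set[string], and the node client is passed explicitly rather than read from the Docker struct.

```go
package docker

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/swarm"
	"github.com/docker/docker/client"
)

// readyNodeIDs returns the IDs of swarm nodes whose status is "ready", so the
// task-wait loop can skip tasks scheduled on unavailable nodes.
func readyNodeIDs(ctx context.Context, cli client.NodeAPIClient) (map[string]struct{}, error) {
	nodes, err := cli.NodeList(ctx, types.NodeListOptions{})
	if err != nil {
		return nil, fmt.Errorf("listing swarm nodes: %w", err)
	}
	ready := make(map[string]struct{}, len(nodes))
	for _, n := range nodes {
		if n.Status.State == swarm.NodeStateReady {
			ready[n.ID] = struct{}{}
		}
	}
	return ready, nil
}
```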
Summary
This PR adds documentation for partial failure recovery (quorum intact) and improves node unavailability detection during force host removal operations.
Changes
- Disaster recovery guide for partial failure scenarios (docs/disaster-recovery/partial-recovery.md)
- ErrNodeUnavailable sentinel error for precise node unavailability detection (a hedged sketch of this pattern follows this list)
- WaitForService enhanced to detect stuck tasks due to unavailable nodes instead of relying on generic timeouts
- postgres_service.go updated to use the new sentinel error during service deletion
- CleanupInstance activity to clean up orphaned instance records when force-removing hosts
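A hedged sketch of the sentinel-error pattern listed above; the error message is taken from the diff discussed in this PR, while the surrounding function and package are stand-ins rather than the real WaitForService code.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNodeUnavailable is the sentinel callers can branch on with errors.Is.
var ErrNodeUnavailable = errors.New("node unavailable")

// waitForService stands in for the real WaitForService; it wraps the sentinel
// when tasks are stuck because their nodes are down.
func waitForService() error {
	return fmt.Errorf("%w: tasks cannot stop because their nodes are down", ErrNodeUnavailable)
}

func main() {
	if err := waitForService(); errors.Is(err, ErrNodeUnavailable) {
		fmt.Println("detected unavailable node:", err)
	}
}
```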
Testing

Test scenario: failed host host-3, with quorum state Intact (2 healthy hosts remained).

Phase 1: Failed Host Removal (Verified)
✅ Step 1.1: Force Remove Host from Control Plane
Executed:
    curl -X DELETE http://192.168.105.3:3000/v1/hosts/host-3?force=true

Observed:
Verified via:
Matches documented behavior: forced host removal + automatic database cleanup
✅ Step 1.3: Docker Swarm Cleanup
On a healthy manager (host-1):
Observed:
Confirms documented Swarm cleanup procedure
Phase 2: Reduced-Capacity Operation (Verified)
✅ Step 2.1: Host Status Verification
    curl http://192.168.105.3:3000/v1/hosts

Observed:
✅ Step 2.2: Database Health Verification
    curl http://192.168.105.3:3000/v1/databases/storefront

Observed:
Database state: available

Instances:

- n1 on host-1
- n2 on host-2
- No references to host-3
✅ Step 2.3: Data Replication Verification
Executed:
Observed:
Confirms cluster remained fully operational with reduced capacity
Phase 3: Restored Host Preparation (Verified)
✅ Step 3.1: Clean Old Control Plane Data
On restored host-3:
Observed:
✅ Step 3.2: Rejoin Docker Swarm (Manager)
From host-1:
    docker swarm join-token manager

On host-3:

    docker swarm join --token <TOKEN> 192.168.104.1:2377

Observed:
✅ Step 3.4: Redeploy Control Plane Stack
    docker stack deploy -c /tmp/stack.yaml control-plane

Observed:
Verified via:
Phase 4: Control Plane Cluster Rejoin (Verified)
✅ Step 4.1: Generate Join Token
    curl http://192.168.105.3:3000/v1/cluster/join-token

Response included:

- token
- server_url

✅ Step 4.2: Join Cluster from Restored Host
Observed:
✅ Step 4.3: Host Verification
    curl http://192.168.105.3:3000/v1/hosts

Observed:

- host-3 present and healthy

Phase 5: Database Capacity Restoration (Verified)
✅ Step 5.1: Add Database Node Back
Observed:
✅ Step 5.3: Final Replication Verification
Executed:
- n3
- n2

Observed:
Verification Summary:
Checklist
PLAT-313