docs: add ACE recovery guide for catastrophic node failures. #341

Open
ibrarahmad wants to merge 19 commits into pgEdge:main from ibrarahmad:DOC-NODE-Failure

@ibrarahmad
Contributor

Add a new Spock documentation page describing how to recover a lagging node after catastrophic failure using ACE table-diff/table-repair, including single-node and multiple-node failure scenarios and per-origin recovery.

Document multi-table diff/repair workflows and require --preserve-origin to preserve origin ID and commit timestamp; add MkDocs navigation entry and cross-links from modify/index.md and FAQ.md.

@coderabbitai

coderabbitai bot commented Feb 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds documentation for node failure recovery: a new FAQ subsection, updates to node-adding docs introducing Spockctrl and recovery notes, and a new comprehensive catastrophic node failure recovery guide detailing ACE-based workflows, commands, and troubleshooting.

Changes

| Cohort / File(s) | Summary |
|---|---|
| FAQ & node-adding docs: `docs/FAQ.md`, `docs/modify/index.md` | Adds a “Node failure and recovery” FAQ subsection and three bullet points to node-adding docs introducing Spockctrl, recommending it for production clusters that cannot be taken offline, and noting ACE-based resync/recovery instructions. |
| Catastrophic recovery guide: `docs/recovery/catastrophic_node_failure.md` | Adds a new, detailed recovery guide covering single- and multi-node failure scenarios, a five-phase workflow (Assess, Spock cleanup, Identify missing data, Repair, Validate), example `table-diff`/`table-repair` commands, `--preserve-origin` guidance, diagrams, examples, and troubleshooting notes. |

Poem

🐇 I sniff the gaps where rows went stray,
I hop through origins, night and day.
With gentle paws and careful art,
I stitch each shard back to its heart.
ACE hums — the cluster finds its way. ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title 'docs: add ACE recovery guide for catastrophic node failures' accurately reflects the main change: a new documentation file describing ACE-based recovery procedures for catastrophic node failures, which is the primary addition in this changeset. |
| Description check | ✅ Passed | The PR description is directly related to the changeset, detailing the addition of a recovery guide for catastrophic node failures using ACE, single/multiple node scenarios, per-origin recovery, and cross-links in documentation. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

Updated the documentation for recovering from catastrophic node failure - wrapped at 79 characters, and now starting to edit...
Updated the documentation for recovering from catastrophic node failures, including improved clarity and formatting changes.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 144-149: The admonition block starting with "!!! note" is being
treated as an indented code block (MD046); fix it by either adding an empty line
immediately after the "!!! note" marker and removing leading indentation from
the paragraph so the content is treated as admonition text, or wrap the block
with a local lint suppression using <!-- markdownlint-disable MD046 --> before
the "!!! note" and <!-- markdownlint-enable MD046 --> after it; target the "!!!
note" admonition and the indented paragraph when applying the change.
- Around line 78-79: Update the documented `table-diff` command usages that
currently show `--preserve-origin` to use the correct flag `--against-origin`
instead; specifically change occurrences like `table-diff --preserve-origin n1
--until <n1_failure_time>` and the similar instance later to `table-diff
--against-origin ...`. Leave the `--preserve-origin` flag only on `table-repair`
examples (e.g., the `table-repair --preserve-origin` lines) so `table-diff` uses
`--against-origin` and `table-repair` retains `--preserve-origin`.

Comment on lines +78 to +79
`table-diff --preserve-origin n1 --until <n1_failure_time>`

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and get its size
wc -l docs/recovery/catastrophic_node_failure.md

Repository: pgEdge/spock

Length of output: 102


🏁 Script executed:

# Read the file to see lines around 78-79 and 133-134
cat -n docs/recovery/catastrophic_node_failure.md | sed -n '70,90p'

Repository: pgEdge/spock

Length of output: 1042


🏁 Script executed:

# Also check lines 125-145 for the second instance
cat -n docs/recovery/catastrophic_node_failure.md | sed -n '125,145p'

Repository: pgEdge/spock

Length of output: 914


🏁 Script executed:

# Search for documentation of table-diff and table-repair flags throughout the file
grep -n "table-diff\|table-repair\|--against-origin\|--preserve-origin" docs/recovery/catastrophic_node_failure.md

Repository: pgEdge/spock

Length of output: 5964


Fix table-diff flags: use --against-origin, not --preserve-origin.

Lines 78–79 and 133–134 show table-diff --preserve-origin, but the correct flag for table-diff is --against-origin. The --preserve-origin flag belongs with table-repair (as shown in lines 81 and 139). Using the wrong flag will cause invalid commands.

Proposed fix
-  `table-diff --preserve-origin n1 --until <n1_failure_time>`
+  `table-diff --against-origin n1 --until <n1_failure_time>`
-  `table-diff --preserve-origin n1 --until <n1_failure_time>`
-  `table-diff --preserve-origin n4 --until <n4_failure_time>`
+  `table-diff --against-origin n1 --until <n1_failure_time>`
+  `table-diff --against-origin n4 --until <n4_failure_time>`
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-  `table-diff --preserve-origin n1 --until <n1_failure_time>`
+  `table-diff --against-origin n1 --until <n1_failure_time>`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/recovery/catastrophic_node_failure.md` around lines 78 - 79, Update the
documented `table-diff` command usages that currently show `--preserve-origin`
to use the correct flag `--against-origin` instead; specifically change
occurrences like `table-diff --preserve-origin n1 --until <n1_failure_time>` and
the similar instance later to `table-diff --against-origin ...`. Leave the
`--preserve-origin` flag only on `table-repair` examples (e.g., the
`table-repair --preserve-origin` lines) so `table-diff` uses `--against-origin`
and `table-repair` retains `--preserve-origin`.
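
One way to verify this finding before and after the fix is a small grep check, in the spirit of the analysis scripts above. This is a minimal sketch: the function name `find_bad_diff_flags` is hypothetical, and the pattern only looks for `table-diff` followed by `--preserve-origin` on the same line, so a line that mentions both `table-diff` and a `table-repair --preserve-origin` example would be reported as a false positive.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PR): list lines in a file where a
# table-diff example still carries --preserve-origin.
# Exit status follows grep: 0 if a match was found, 1 if none.
find_bad_diff_flags() {
    # "--" ends option parsing so the pattern is never read as a flag.
    grep -n -- 'table-diff.*--preserve-origin' "$1"
}

# Example usage (path taken from the review comment):
# find_bad_diff_flags docs/recovery/catastrophic_node_failure.md
```

An empty result with exit status 1 would suggest every remaining `--preserve-origin` sits on a `table-repair` line.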

Comment on lines 144 to 149
!!! note

In the multiple-node case, you run diff and repair once per failed
origin. For each table, that means one diff (and one repair) for n1
and one diff (and one repair) for n4. The source of truth (n3) is the
same for all repairs.
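
The per-origin loop described in this note can be sketched as a script that only prints the planned steps. This is an illustration under assumptions: `print_recovery_plan` and the table name `public.orders` are hypothetical, the command fragments are copied from the examples quoted in this review (with `--against-origin` on `table-diff`, as the review recommends), and real ACE invocations need further arguments that are not shown in this conversation.

```shell
#!/bin/sh
# Hypothetical helper: print (do not execute) one diff and one repair step
# per failed origin for a single table.
# Usage: print_recovery_plan SOURCE_OF_TRUTH TABLE ORIGIN...
print_recovery_plan() {
    source_node="$1"
    table="$2"
    shift 2
    for origin in "$@"; do
        # Command fragments as quoted in the review; the cluster and other
        # arguments are intentionally left out because they are not shown.
        printf '%s: table-diff --against-origin %s --until <%s_failure_time>\n' \
            "$table" "$origin" "$origin"
        printf '%s: table-repair --preserve-origin (source of truth: %s)\n' \
            "$table" "$source_node"
    done
}

# Example usage for the n1/n4 scenario with n3 as the source of truth:
# print_recovery_plan n3 public.orders n1 n4
```

For two failed origins this prints four lines per table, which matches the "one diff and one repair per origin" count in the note.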

⚠️ Potential issue | 🟡 Minor

Address markdownlint MD046 in the admonition block.

The indented text inside the !!! note block is being interpreted as an indented code block by markdownlint (MD046). Consider adding a local lint disable/enable or adjusting lint config to allow indented admonition content.

🔧 Example lint suppress (if needed)
+<!-- markdownlint-disable MD046 -->
 !!! note
     In the multiple-node case, you run diff and repair once per failed
     origin. For each table, that means one diff (and one repair) for n1
     and one diff (and one repair) for n4. The source of truth (n3) is the
     same for all repairs.
+<!-- markdownlint-enable MD046 -->
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 146-146: Code block style
Expected: fenced; Actual: indented

(MD046, code-block-style)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/recovery/catastrophic_node_failure.md` around lines 144 - 149, The
admonition block starting with "!!! note" is being treated as an indented code
block (MD046); fix it by either adding an empty line immediately after the "!!!
note" marker and removing leading indentation from the paragraph so the content
is treated as admonition text, or wrap the block with a local lint suppression
using <!-- markdownlint-disable MD046 --> before the "!!! note" and <!--
markdownlint-enable MD046 --> after it; target the "!!! note" admonition and the
indented paragraph when applying the change.

@coderabbitai coderabbitai bot left a comment

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 78-79: Replace the incorrect table-diff flag `--preserve-origin`
with `--against-origin` in the `table-diff` examples (e.g., change `table-diff
--preserve-origin n1 --until <n1_failure_time>` to `table-diff --against-origin
n1 --until <n1_failure_time>`), and update any other `table-diff` occurrences
that still use `--preserve-origin` (the flag is intended for table-repair, not
table-diff).
- Around line 144-149: The admonition under the "!!! note" is being parsed as an
indented code block; fix MD046 by inserting a blank line immediately after the
"!!! note" marker and unindenting the following paragraph text (remove the
leading spaces before "In the multiple-node case..." through the final sentence)
so the content is treated as normal admonition body text; locate the "!!! note"
block in catastrophic_node_failure.md and apply this change to the paragraph
lines.

Trying to fix formatting lines 203 to 223

@coderabbitai coderabbitai bot left a comment

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 78-79: The document uses the wrong flag for table-diff: replace
instances of `--preserve-origin` with `--against-origin` in the `table-diff`
commands (e.g., the examples invoking `table-diff --preserve-origin n1 --until
<n1_failure_time>`); `--preserve-origin` is for `table-repair`, so update those
`table-diff` invocations to use `--against-origin` to make the commands valid.
- Around line 144-149: The MD046 warning is caused by the indented admonition
block starting with "!!! note"; fix it by converting the indented block to a
lint-safe format: either use a fenced/admonition syntax (e.g., replace the
indented "!!! note" block with a non-indented block such as a standard
blockquote or explicit HTML comment-based admonition) or add a local linter
suppression around the block (wrap the admonition with a
markdownlint-disable/enable MD046 pair). Locate the "!!! note" admonition in
catastrophic_node_failure.md and update that block accordingly so the content is
not parsed as an indented code block.

Updated the documentation for catastrophic node failure recovery, clarifying steps and improving readability.
Updated the instructions for handling catastrophic node failures to clarify the steps for dropping subscriptions and nodes. Improved the language for better readability and consistency.
Clarified instructions for handling multiple-node failures and scripted diff steps.
Clarify instructions and considerations for using ACE in node recovery scenarios, including preserving origin ID and timestamps, and troubleshooting steps.
Updated language for clarity and consistency in the catastrophic node failure documentation.
Updated language for clarity and consistency throughout the document regarding node failure recovery procedures.
Member

@susan-pgedge susan-pgedge left a comment

Passing this on to @mason-sharp for his input now...
