docs: add ACE recovery guide for catastrophic node failures. #341
ibrarahmad wants to merge 19 commits into pgEdge:main from
Add a new Spock documentation page describing how to recover a lagging node after catastrophic failure using ACE table-diff/table-repair, including single-node and multiple-node failure scenarios and per-origin recovery. Document multi-table diff/repair workflows and require --preserve-origin to preserve origin ID and commit timestamp; add MkDocs navigation entry and cross-links from modify/index.md and FAQ.md.
📝 Walkthrough
Adds documentation for node failure recovery: a new FAQ subsection, updates to node-adding docs introducing Spockctrl and recovery notes, and a new comprehensive catastrophic node failure recovery guide detailing ACE-based workflows, commands, and troubleshooting.
Updated the documentation for recovering from catastrophic node failure - wrapped at 79 characters, and now starting to edit...
Updated the documentation for recovering from catastrophic node failures, including improved clarity and formatting changes.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 144-149: The admonition block starting with "!!! note" is being
treated as an indented code block (MD046); fix it by either adding an empty line
immediately after the "!!! note" marker and removing leading indentation from
the paragraph so the content is treated as admonition text, or wrap the block
with a local lint suppression using <!-- markdownlint-disable MD046 --> before
the "!!! note" and <!-- markdownlint-enable MD046 --> after it; target the "!!!
note" admonition and the indented paragraph when applying the change.
- Around line 78-79: Update the documented `table-diff` command usages that
currently show `--preserve-origin` to use the correct flag `--against-origin`
instead; specifically change occurrences like `table-diff --preserve-origin n1
--until <n1_failure_time>` and the similar instance later to `table-diff
--against-origin ...`. Leave the `--preserve-origin` flag only on `table-repair`
examples (e.g., the `table-repair --preserve-origin` lines) so `table-diff` uses
`--against-origin` and `table-repair` retains `--preserve-origin`.
`table-diff --preserve-origin n1 --until <n1_failure_time>`
🧩 Analysis chain
🏁 Script executed:
# First, check if the file exists and get its size
wc -l docs/recovery/catastrophic_node_failure.md
Repository: pgEdge/spock
Length of output: 102
🏁 Script executed:
# Read the file to see lines around 78-79 and 133-134
cat -n docs/recovery/catastrophic_node_failure.md | sed -n '70,90p'
Repository: pgEdge/spock
Length of output: 1042
🏁 Script executed:
# Also check lines 125-145 for the second instance
cat -n docs/recovery/catastrophic_node_failure.md | sed -n '125,145p'
Repository: pgEdge/spock
Length of output: 914
🏁 Script executed:
# Search for documentation of table-diff and table-repair flags throughout the file
grep -n "table-diff\|table-repair\|--against-origin\|--preserve-origin" docs/recovery/catastrophic_node_failure.md
Repository: pgEdge/spock
Length of output: 5964
Fix table-diff flags: use --against-origin, not --preserve-origin.
Lines 78–79 and 133–134 show table-diff --preserve-origin, but the correct flag for table-diff is --against-origin. The --preserve-origin flag belongs with table-repair (as shown in lines 81 and 139). Using the wrong flag will cause invalid commands.
Proposed fix
- `table-diff --preserve-origin n1 --until <n1_failure_time>`
+ `table-diff --against-origin n1 --until <n1_failure_time>`

- `table-diff --preserve-origin n1 --until <n1_failure_time>`
- `table-diff --preserve-origin n4 --until <n4_failure_time>`
+ `table-diff --against-origin n1 --until <n1_failure_time>`
+ `table-diff --against-origin n4 --until <n4_failure_time>`

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- `table-diff --preserve-origin n1 --until <n1_failure_time>`
+ `table-diff --against-origin n1 --until <n1_failure_time>`
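If the flag correction above needs to be applied to more than the two spots called out, it could be done mechanically. The following is only a sketch in Python, not part of the PR: the function name `fix_table_diff_flags` is hypothetical, and it assumes each command sits on one line (inside inline backticks or a code block). It rewrites `--preserve-origin` to `--against-origin` only where the flag follows `table-diff`, and leaves `table-repair` lines alone.

```python
import re

# Match --preserve-origin only when it follows `table-diff` inside the same
# inline-code span (the gap may not cross a backtick or a newline).
_DIFF_FLAG = re.compile(r"(table-diff\b[^`\n]*?)--preserve-origin\b")


def fix_table_diff_flags(markdown: str) -> str:
    """Replace --preserve-origin with --against-origin on table-diff only."""
    fixed_lines = []
    for line in markdown.splitlines(keepends=True):
        # table-repair commands never match the pattern, so they keep
        # --preserve-origin as the review asks.
        fixed_lines.append(_DIFF_FLAG.sub(r"\1--against-origin", line))
    return "".join(fixed_lines)


if __name__ == "__main__":
    sample = (
        "`table-diff --preserve-origin n1 --until <n1_failure_time>`\n"
        "`table-repair --preserve-origin`\n"
    )
    print(fix_table_diff_flags(sample), end="")
```

Reviewing the resulting diff by hand is still advisable, since the pattern knows nothing about ACE beyond the two command names quoted in this thread.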
!!! note

    In the multiple-node case, you run diff and repair once per failed
    origin. For each table, that means one diff (and one repair) for n1
    and one diff (and one repair) for n4. The source of truth (n3) is the
    same for all repairs.
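The per-origin rule in this note can be made concrete with a small enumeration. The sketch below is illustrative only: `recovery_plan` and the table names are hypothetical, and the strings are just the command fragments quoted in this thread (with `--against-origin` on `table-diff`, as the review requests), not complete ACE invocations. Cluster, table, and source-of-truth arguments are not shown in this conversation and are deliberately left out.

```python
from itertools import product


def recovery_plan(tables, failed_origins):
    """Yield one (table, origin, diff, repair) step per table and failed origin.

    `failed_origins` maps an origin node name to its failure-time placeholder.
    """
    for table, (origin, failure_time) in product(tables, failed_origins.items()):
        # Fragments as quoted in the review thread; not full command lines.
        diff = f"table-diff --against-origin {origin} --until {failure_time}"
        repair = "table-repair --preserve-origin"
        yield table, origin, diff, repair


if __name__ == "__main__":
    # Hypothetical table names; n1 and n4 are the failed origins from the note.
    tables = ["public.orders", "public.customers"]
    failed = {"n1": "<n1_failure_time>", "n4": "<n4_failure_time>"}
    for table, origin, diff, repair in recovery_plan(tables, failed):
        print(f"{table} / {origin}: {diff}  then  {repair}")
```

With two tables and two failed origins this yields four diff/repair pairs, which matches the note's "one diff (and one repair)" per origin for each table.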
Address markdownlint MD046 in the admonition block.
The indented text inside the !!! note block is being interpreted as an indented code block by markdownlint (MD046). Consider adding a local lint disable/enable or adjusting lint config to allow indented admonition content.
🔧 Example lint suppress (if needed)
+<!-- markdownlint-disable MD046 -->
 !!! note
     In the multiple-node case, you run diff and repair once per failed
     origin. For each table, that means one diff (and one repair) for n1
     and one diff (and one repair) for n4. The source of truth (n3) is the
     same for all repairs.
+<!-- markdownlint-enable MD046 -->

🧰 Tools
🪛 markdownlint-cli2 (0.20.0)
[warning] 146-146: Code block style
Expected: fenced; Actual: indented
(MD046, code-block-style)
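If the suppression route is chosen and the page contains several admonitions, the wrapping could be scripted. This is a rough sketch rather than anything from the PR: `wrap_admonitions` is a hypothetical helper, it assumes admonitions start at column 0 with `!!! ` and have a body indented by four spaces, and it adds blank lines around the block defensively so the admonition remains its own Markdown block.

```python
DISABLE = "<!-- markdownlint-disable MD046 -->"
ENABLE = "<!-- markdownlint-enable MD046 -->"


def wrap_admonitions(markdown: str) -> str:
    """Wrap each top-level `!!!` admonition in MD046 disable/enable comments."""
    lines = markdown.split("\n")
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        # Skip admonitions that were already wrapped on a previous run.
        already_wrapped = out[-2:] == [DISABLE, ""]
        if line.startswith("!!! ") and not already_wrapped:
            # Collect the marker plus its indented (or blank) body lines.
            j = i + 1
            while j < len(lines) and (lines[j].startswith("    ") or lines[j] == ""):
                j += 1
            # Trailing blank lines belong after the block, not inside it.
            while j > i + 1 and lines[j - 1] == "":
                j -= 1
            out.extend([DISABLE, ""])
            out.extend(lines[i:j])
            out.extend(["", ENABLE])
            i = j
        else:
            out.append(line)
            i += 1
    return "\n".join(out)
```

The function is idempotent, so running it again after adding new admonitions does not double-wrap existing ones. Adjusting the markdownlint configuration for the whole repository would avoid the comments entirely, at the cost of disabling MD046 everywhere.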
Trying fix for Mermaid chart
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 78-79: Replace the incorrect table-diff flag `--preserve-origin`
with `--against-origin` in the `table-diff` examples (e.g., change `table-diff
--preserve-origin n1 --until <n1_failure_time>` to `table-diff --against-origin
n1 --until <n1_failure_time>`), and update any other `table-diff` occurrences
that still use `--preserve-origin` (the flag is intended for table-repair, not
table-diff).
- Around line 144-149: The admonition under the "!!! note" is being parsed as an
indented code block; fix MD046 by inserting a blank line immediately after the
"!!! note" marker and unindenting the following paragraph text (remove the
leading spaces before "In the multiple-node case..." through the final sentence)
so the content is treated as normal admonition body text; locate the "!!! note"
block in catastrophic_node_failure.md and apply this change to the paragraph
lines.
Trying to fix formatting lines 203 to 223
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@docs/recovery/catastrophic_node_failure.md`:
- Around line 78-79: The document uses the wrong flag for table-diff: replace
instances of `--preserve-origin` with `--against-origin` in the `table-diff`
commands (e.g., the examples invoking `table-diff --preserve-origin n1 --until
<n1_failure_time>`); `--preserve-origin` is for `table-repair`, so update those
`table-diff` invocations to use `--against-origin` to make the commands valid.
- Around line 144-149: The MD046 warning is caused by the indented admonition
block starting with "!!! note"; fix it by converting the indented block to a
lint-safe format: either use a fenced/admonition syntax (e.g., replace the
indented "!!! note" block with a non-indented block such as a standard
blockquote or explicit HTML comment-based admonition) or add a local linter
suppression around the block (wrap the admonition with a
markdownlint-disable/enable MD046 pair). Locate the "!!! note" admonition in
catastrophic_node_failure.md and update that block accordingly so the content is
not parsed as an indented code block.
Updated the documentation for catastrophic node failure recovery, clarifying steps and improving readability.
Updated the instructions for handling catastrophic node failures to clarify the steps for dropping subscriptions and nodes. Improved the language for better readability and consistency.
Clarified instructions for handling multiple-node failures and scripted diff steps.
Clarify instructions and considerations for using ACE in node recovery scenarios, including preserving origin ID and timestamps, and troubleshooting steps.
Updated language for clarity and consistency in the catastrophic node failure documentation.
Updated language for clarity and consistency throughout the document regarding node failure recovery procedures.
susan-pgedge left a comment
Passing this on to @mason-sharp for his input now...