[DO NOT MERGE]: HSM Sync Engine#69

Open
dtkav wants to merge 266 commits into main from merge-hsm

@dtkav dtkav commented Feb 19, 2026

No description provided.

dtkav and others added 30 commits February 16, 2026 16:00
MergeHSM._path and _guid were private with no public accessors,
causing hsm.path to be undefined and falling back to the guid.

Closes BUG-009.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
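
The accessor fix in the commit above can be pictured with a minimal sketch. `MergeHSMSketch` and `labelFor` are hypothetical stand-ins, not the real MergeHSM API; only the pattern of exposing private fields through read-only getters comes from the commit message.

```typescript
// Hypothetical, reduced stand-in for MergeHSM; only the accessor pattern is shown.
class MergeHSMSketch {
  private readonly _path: string;
  private readonly _guid: string;

  constructor(path: string, guid: string) {
    this._path = path;
    this._guid = guid;
  }

  // Public read-only accessors, so callers can read hsm.path and hsm.guid.
  get path(): string {
    return this._path;
  }
  get guid(): string {
    return this._guid;
  }
}

// Hypothetical caller: falls back to the guid only when no path is available.
function labelFor(hsm: { path?: string; guid: string }): string {
  return hsm.path ?? hsm.guid;
}
```

Without the getters, `hsm.path` is undefined on the instance and `labelFor` would return the guid, which is the symptom the commit describes.
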
- Document: Clean up existing HSM provider listeners before re-adding
  to prevent accumulation on repeated setup calls.
- HasProvider: Remove status listener once connected instead of relying
  on manual cleanup in destroy().
- LiveViews: Use releaseLock() instead of directly setting userLock.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
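
A rough sketch of the two listener patterns named in the commit above: removing the old listener before re-adding it, and a status listener that removes itself once connected. `ProviderSketch`, `DocumentSketch`, and `onceConnected` are illustrative names assumed for this example, not the plugin's classes.

```typescript
type Listener = (status: string) => void;

// Minimal hypothetical emitter standing in for the HSM provider.
class ProviderSketch {
  private listeners = new Set<Listener>();
  on(l: Listener): void {
    this.listeners.add(l);
  }
  off(l: Listener): void {
    this.listeners.delete(l);
  }
  emit(status: string): void {
    // Copy first so a listener may remove itself while being called.
    Array.from(this.listeners).forEach((l) => l(status));
  }
  get listenerCount(): number {
    return this.listeners.size;
  }
}

class DocumentSketch {
  private statusListener: Listener | null = null;
  private provider: ProviderSketch;
  readonly seen: string[] = [];

  constructor(provider: ProviderSketch) {
    this.provider = provider;
  }

  // Safe to call repeatedly: the previous listener is removed before a new one is added.
  setup(): void {
    if (this.statusListener) this.provider.off(this.statusListener);
    this.statusListener = (s) => {
      this.seen.push(s);
    };
    this.provider.on(this.statusListener);
  }
}

// One-shot listener: unsubscribes itself as soon as "connected" is observed.
function onceConnected(provider: ProviderSketch, cb: () => void): void {
  const l: Listener = (s) => {
    if (s === "connected") {
      provider.off(l);
      cb();
    }
  };
  provider.on(l);
}
```
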
Cherry-pick event subscription infrastructure from background_sync into
the YSweetProvider (CBOR-decoded messageEvent, subscribe/unsubscribe
protocol) and wire SharedFolder to forward document.updated events to
MergeManager.handleIdleRemoteUpdate() instead of the previous ydoc
update listener approach.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…sisted CRDT state

MergeManager was created without persistence callbacks, causing every
HSM to receive empty PERSISTENCE_LOADED events. This made every file
appear brand-new and triggered spurious merge conflicts on first edit.

Closes BUG-011.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… directly

RelayCanvasView.release() set canvas.userLock = false without notifying
MergeManager, so the HSM never received RELEASE_LOCK and stayed stuck
in active mode for canvas files. Added Canvas.releaseLock() matching
Document's pattern and wired it into RelayCanvasView.release().

Closes BUG-012.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
On first ACQUIRE_LOCK, if localDoc is empty (no prior CRDT data in
IndexedDB), read current disk contents via getDiskState and call
initializeLocalDoc to populate the CRDT. Also sends DISK_CHANGED so the
HSM tracks disk metadata. Without this, localDoc stayed empty despite
the editor showing file content, causing spurious merge conflicts.

Closes BUG-013.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Corrections 1-7 from specs/persistence-corrections.md:
- Fix Y.Text field name 'content' → 'contents' to match existing IndexedDB data
- Attach IndexeddbPersistence to localDoc via injectable CreatePersistence factory
- Destroy persistence in cleanupYDocs() on RELEASE_LOCK
- Gate active.entering → active.tracking on YDOCS_READY (persistence 'synced')
- Remove loadUpdates callback from MergeManager (persistence loads internally)
- Skip Document.ts IndexeddbPersistence when HSM active mode is enabled
- Retarget uploadDoc() to insert into localDoc via MergeHSM when HSM enabled

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When files are created in shared folders via uploadDoc, the LCA (Last
Common Ancestor) is now initialized to establish the baseline sync point.
This prevents merge conflicts with an empty base when editing newly
created files.

Changes:
- Add initializeLCA() method to MergeHSM for setting up sync baseline
- Call initializeLCA in both HSM active mode and standard upload paths
- Persist PERSIST_STATE effects to IndexedDB via saveMergeState

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Track pending write operations and wait for them to complete in
destroy() before closing the database. This prevents data loss when
the persistence layer is torn down while writes are still in flight.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
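
The pending-write tracking described above could look roughly like this. It is a sketch under stated assumptions: `PersistenceSketch`, `writeFn`, and `closeFn` are hypothetical, and the real persistence layer talks to IndexedDB rather than to injected callbacks.

```typescript
// Hypothetical persistence wrapper; writeFn and closeFn stand in for real database calls.
class PersistenceSketch {
  private pending = new Set<Promise<void>>();
  private closed = false;
  private writeFn: (data: string) => Promise<void>;
  private closeFn: () => void;

  constructor(writeFn: (data: string) => Promise<void>, closeFn: () => void) {
    this.writeFn = writeFn;
    this.closeFn = closeFn;
  }

  write(data: string): Promise<void> {
    if (this.closed) return Promise.reject(new Error("persistence destroyed"));
    // Track the write until it settles, whether it succeeds or fails.
    const p: Promise<void> = this.writeFn(data).then(
      () => {
        this.pending.delete(p);
      },
      (err: unknown) => {
        this.pending.delete(p);
        throw err;
      },
    );
    this.pending.add(p);
    return p;
  }

  async destroy(): Promise<void> {
    this.closed = true;
    // Wait for in-flight writes; a failed write must not block closing.
    await Promise.all(Array.from(this.pending, (p) => p.catch(() => undefined)));
    this.closeFn();
  }
}
```

The key ordering is that `closeFn` runs only after every tracked write has settled, so the database is never closed underneath a write that is still in flight.
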
…k error

When clicking the merge conflict banner on a local-only shared folder
(no relayId), checkStale() would call backgroundSync.downloadItem(),
which throws "Unable to decode S3RN" because there is no server to
download from.

Now checkStale() checks for relayId before attempting server download.
For local-only folders, it skips the download and just compares the
local CRDT with disk contents.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
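
The relayId guard from the commit above, reduced to a sketch. `FolderSketch`, `fetchServerText`, and `localDiffersFromDisk` are hypothetical names; the real checkStale() signature and return value are not shown in this PR text.

```typescript
// Hypothetical folder shape: a local-only shared folder has no relayId.
interface FolderSketch {
  relayId?: string;
}

// Returns the server copy, or null for a local-only folder.
async function fetchServerText(
  folder: FolderSketch,
  downloadItem: () => Promise<string>,
): Promise<string | null> {
  if (!folder.relayId) return null; // local-only: never attempt the download
  return downloadItem();
}

// Local-only staleness check: compare the local CRDT text with disk contents.
function localDiffersFromDisk(localText: string, diskText: string): boolean {
  return localText !== diskText;
}
```
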
The automatic diff resolution feature caused silent data loss when
merging with empty/uninitialized remote CRDTs. Rather than trying
to fix the complex merge logic, remove the feature entirely.

- Remove enableAutomaticDiffResolution flag from FeatureFlags
- Make Document.process() a no-op (called by vault.process patch)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, performIdleRemoteAutoMerge() applied pendingIdleUpdates to an
empty Y.Doc, causing data loss when the remote CRDT was empty/uninitialized.

The fix:
- Load local updates from IndexedDB using injectable loadUpdatesRaw function
- Merge local + remote updates using Y.mergeUpdates()
- Early exit if merge adds nothing new (compare state vectors)
- Only hydrate Y.Doc when needed to extract text content

Also fixed performIdleThreeWayMerge() which had the same issue.

Added awaitIdleAutoMerge() method for test synchronization since the
fix makes idle auto-merge async.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
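
The "early exit if merge adds nothing new" step above compares state vectors. A decoded Yjs state vector maps each client ID to the next expected clock, so the check can be sketched without Yjs itself. `mergeAddsNothing` is a hypothetical helper name, and decoding the vectors from the raw updates is assumed to happen elsewhere.

```typescript
// Client ID -> clock, the shape of a decoded Yjs state vector.
type StateVector = Map<number, number>;

// True when `merged` contains no operation that `local` lacks,
// i.e. every client clock in `merged` is already covered by `local`.
function mergeAddsNothing(local: StateVector, merged: StateVector): boolean {
  let nothingNew = true;
  merged.forEach((clock, client) => {
    if ((local.get(client) ?? 0) < clock) nothingNew = false;
  });
  return nothingNew;
}
```

When this returns true the idle auto-merge can return before hydrating a Y.Doc, which is the cost the commit is trying to avoid.
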
RESOLVE_ACCEPT_DISK was causing data loss because conflictData.remote
was empty when pendingDiskContents was null. This implements the
merge-hsm-v6 spec changes:

- ACQUIRE_LOCK event now requires editorContent parameter containing
  the current editor/disk content at the moment of opening
- handleYDocsReady() always compares localDoc vs pendingEditorContent
  to determine if merge is needed
- conflictData.remote is now always populated with actual disk content

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ACQUIRE_LOCK

Two changes to ensure LCA is always established:

1. SharedFolder.downloadDoc() now awaits download completion and calls
   hsm.initializeLCA() to establish the sync point when disk, local CRDT,
   and remote CRDT are all in agreement.

2. MergeHSM.handleYDocsReady() creates LCA if null when content matches
   during ACQUIRE_LOCK, handling edge cases like corrupted persistence.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
performIdleThreeWayMerge() was made async, but its promise was not assigned
to _pendingIdleAutoMerge, so awaitIdleAutoMerge() did not wait for it.

Also exposed awaitIdleAutoMerge() on TestHSM wrapper and added the
await call to the failing test.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…entering

Two related fixes for the HSM → CM6 update flow:

1. Track editor state during active.entering: CM6_CHANGE events now update
   lastKnownEditorText even in entering state, not just tracking. This prevents
   data loss when users type while IndexedDB persistence is loading.
   handleYDocsReady now uses lastKnownEditorText (current) ?? pendingEditorContent
   (fallback) for merge decisions.

2. Use Yjs deltas instead of string diffing: mergeRemoteToLocal() now uses a
   Y.Text observer that fires for remote-originated changes. The observer
   converts event.delta directly to PositionedChange[], producing accurate
   character positions. This avoids the incorrect assumption that CM6 content
   equals beforeText.

Changes:
- handleCM6Change: Always update lastKnownEditorText, gate localDoc apply
- handleAcquireLock: Initialize lastKnownEditorText in all code paths
- handleYDocsReady: Use lastKnownEditorText ?? pendingEditorContent
- createYDocs: Call setupLocalDocObserver after persistence syncs
- cleanupYDocs: Unobserve Y.Text observer
- Add setupLocalDocObserver: Subscribe to Y.Text with origin='remote' filter
- Add deltaToPositionedChanges: Convert Yjs delta to PositionedChange[]
- mergeRemoteToLocal: Simplified - observer handles DISPATCH_CM6

Tests added for delta-based positioned changes (insert/delete).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
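
A minimal sketch of the delta conversion named in the commit above. The `PositionedChange` shape (`from`, `to`, `insert`) is an assumption for illustration, since the real type is not shown here, and only string inserts are handled. Positions are tracked in the document as it was before the delta, which is also how CodeMirror 6 expresses change positions.

```typescript
// One op of a Y.Text delta (string inserts only in this sketch).
interface DeltaOp {
  insert?: string;
  delete?: number;
  retain?: number;
}

// Hypothetical shape; the real PositionedChange type may differ.
interface PositionedChange {
  from: number;
  to: number;
  insert: string;
}

function deltaToPositionedChanges(delta: DeltaOp[]): PositionedChange[] {
  const changes: PositionedChange[] = [];
  let pos = 0; // offset in the document *before* the delta is applied
  for (const op of delta) {
    if (op.retain !== undefined) {
      pos += op.retain; // skip unchanged text
    } else if (op.insert !== undefined) {
      // An insert consumes no original text, so pos does not advance.
      changes.push({ from: pos, to: pos, insert: op.insert });
    } else if (op.delete !== undefined) {
      changes.push({ from: pos, to: pos + op.delete, insert: "" });
      pos += op.delete; // deleted text existed in the original document
    }
  }
  return changes;
}
```

Because the positions come straight from the delta, nothing here depends on the editor content matching a remembered `beforeText`.
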
When a new file is created and immediately opened, acquireLock() would
deadlock waiting for idle state while the HSM was still in loading.awaitingLCA.

Fix: Send ACQUIRE_LOCK first (HSM queues it via pendingLockAcquisition),
then wait for active.tracking using new awaitActive() method.

Also adds awaitState(predicate) to HSM for general state waiting, and
refactors awaitIdle() to use it.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
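
One way to picture awaitState(predicate), with awaitIdle() and awaitActive() layered on top of it. `HSMSketch` and its `transition()` method are hypothetical; the state names are taken from the commit messages, but the predicates used for "idle" and "active" are assumptions.

```typescript
type StatePath = string;

class HSMSketch {
  private state: StatePath = "loading.awaitingLCA";
  private listeners = new Set<(s: StatePath) => void>();

  // Hypothetical stand-in for the real event-driven transitions.
  transition(next: StatePath): void {
    this.state = next;
    Array.from(this.listeners).forEach((l) => l(next));
  }

  // Resolves once the current or a future state satisfies the predicate.
  awaitState(predicate: (s: StatePath) => boolean): Promise<void> {
    if (predicate(this.state)) return Promise.resolve();
    return new Promise<void>((resolve) => {
      const l = (s: StatePath): void => {
        if (predicate(s)) {
          this.listeners.delete(l);
          resolve();
        }
      };
      this.listeners.add(l);
    });
  }

  awaitIdle(): Promise<void> {
    return this.awaitState((s) => s.startsWith("idle."));
  }

  awaitActive(): Promise<void> {
    return this.awaitState((s) => s === "active.tracking");
  }
}
```

With this shape, a caller can send ACQUIRE_LOCK first and then await `awaitActive()`, instead of waiting for an idle state the HSM cannot reach yet.
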
Add new helpers that drive HSM through real state transitions instead of
bypassing the state machine:

- loadAndActivate(hsm, content): Drive to active.tracking with content
- loadToIdle(hsm, opts): Drive to idle.clean state
- loadToAwaitingLCA(hsm): Drive to loading.awaitingLCA
- createYjsUpdate(content): Create Yjs update for testing

These helpers enable gradual migration away from the forTesting() factory
which directly mutates internal state. The new approach:
- Tests validate actual transition paths
- No internal state exposure (_statePath, _lca, etc.)
- Tests serve as executable documentation of valid event sequences

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace forTesting({ initialState: 'active.tracking', localDoc }) with
  loadAndActivate(t, content)
- Replace forTesting({ initialState: 'idle.clean' }) with loadToIdle(t)
- Add t.clearEffects() after loadAndActivate when tests count effects
- Add MIGRATION.md documenting migration progress and patterns

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Migrate 9 more tests using loadAndActivate/loadToIdle with mtime option
- Tests migrated:
  - SAVE_COMPLETE updates LCA mtime
  - active mode NEVER emits WRITE_DISK
  - DISK_CHANGED with identical content stays in tracking
  - DISK_CHANGED with disk-only changes stays in tracking
  - idle.diskAhead auto-merges when remote==lca
  - idle.diverged auto-merges when no conflicts
  - SAVE_COMPLETE emits PERSIST_STATE
  - STATUS_CHANGED emitted on state transition
  - creates serializable snapshot
- Document that lock cycle tests can't be migrated (state vector mismatch)
- Update MIGRATION.md: 48 migrated, 23 remaining

Progress: 37 → 23 forTesting usages in MergeHSM.test.ts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
dtkav and others added 30 commits March 18, 2026 19:30
Digital twin replays exact recorded events from both vaults through
full provider integration. Reproduces BUG-123: conflictData.theirs
is "" because old provider syncs to stale remoteDoc on reconnect.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
B's SYNC_TO_REMOTE effects fire but ops never reach the server.
Server still has original content after B finishes editing.
B's remoteDoc is empty — SyncBridge doesn't forward localDoc changes.

Added doc-content fixtures from TP-017 run for state assertions.
Added empty state vector assertion to ProviderIntegration.handleSync.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assert server, local, and remote content at conflict detection point,
matching TP-018 step 1.5.1 / TP-017 doc-content snapshots.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MockYjsProvider now registers an update handler on remoteDoc that
forwards changes to the server doc (matching y-websocket behavior).
Also connects providers when vaults reach active.tracking.

Server now correctly receives B's edits. BUG-123 confirmed as
receiver-side: A's old provider on reconnect doesn't deliver
server's updated state to fresh remoteDoc → theirs = "".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- MockYjsProvider now forwards remoteDoc updates to server (matches
  y-websocket behavior)
- Connect providers when vaults reach active.tracking
- Seed remoteDocs with multi-client content for SyncBridge deltas
- Add doc-content state assertions comparing twin vs production snapshot
- Twin local matches production exactly
- Twin server has B's edit (correct), production didn't (unknown cause)
- Test fails at remoteText === serverText (BUG-123: old provider)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The digital twin now reproduces the complete BUG-123 state:
- server: blank line 2 (1 client) — matches production exactly
- local: LOCAL DISK EDIT — matches production exactly
- conflictData.theirs: "" — matches production exactly

Root cause modeled: B's provider transport silently drops updates
(forwardingEnabled=false), server only has enrollment client 1's
ops, A's old provider on reconnect delivers partial state.

Test fails at TP-018 step 1.5.1: server MUST have B's edit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Log when registerMachineEdit fires so next E2E run reveals if
vault.process() is causing B's SYNC_TO_REMOTE ops to be deferred
as machine edits (never reaching the server).

Hypothesis: vault.process() fires during B's editing session,
registering a machine edit whose expectedText matches B's CM6_CHANGE
docText. The SyncBridge defers the update → never forwarded to server.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`node esbuild.config.mjs staging .` now builds and exits.
`npm run staging` passes --watch to preserve existing watch behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: y-websocket's broadcastMessage silently drops updates
when wsconnected is false (WebSocket not yet open). The provider
is created with connect:false, connection is async, and B types
before the WebSocket opens. All CM6_CHANGE updates are lost.

MockYjsProvider now has wsReady flag matching y-websocket's
wsconnected — false until deferred sync completes. B's provider
connects after RELEASE_LOCK (too late), reproducing the drop.

No more forwardingEnabled hack. Twin server content matches
production exactly (blank line 2, single client).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All three layers now match production exactly at the conflict
detection checkpoint (TP-018 step 1.5.1):
  local:  LOCAL DISK EDIT ✅
  server: blank line 2, 1 client ✅
  remote: blank line 2 (same as server) ✅

Test fails at TP-018 pass criteria (server should have B's edit,
theirs should have remote content) — correct, BUG-123 is present.

Sender-side root cause reproduced via wsReady timing:
MockYjsProvider drops updates when WebSocket isn't ready,
matching y-websocket's broadcastMessage behavior.

Doc-content state captured at bannerShown (before RESOLVE),
compared against production snapshot from tp017-doc-content.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Confirms BUG-123 sender-side root cause: broadcastMessage silently
drops updates when wsconnected is false or WebSocket not OPEN.
171+ drops observed on live2 during normal operation before any
test actions. The wsReady timing theory is confirmed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
broadcastMessage silently dropped sync updates when the WebSocket
wasn't ready (wsconnected=false or readyState!=OPEN). Confirmed
6810+ drops during normal operation via instrumentation.

Buffer dropped messages in _pendingMessages array. Flush the
buffer in onopen after syncStep1 is sent. This ensures ops
that arrive during connect/disconnect cycles are not lost.

Root cause of BUG-123 sender side: enrollment and editing ops
were silently dropped because the WebSocket wasn't open yet.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
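
The buffering fix above, as a sketch. `BufferingProviderSketch`, `SocketLike`, and `onOpen` are hypothetical names; `wsconnected`, `broadcastMessage`, and `_pendingMessages` follow the commit message, and the real provider's message framing is omitted.

```typescript
// Minimal socket interface; OPEN mirrors WebSocket.OPEN (1).
interface SocketLike {
  readyState: number;
  send(data: Uint8Array): void;
}
const OPEN = 1;

class BufferingProviderSketch {
  wsconnected = false;
  private _pendingMessages: Uint8Array[] = [];
  private ws: SocketLike;

  constructor(ws: SocketLike) {
    this.ws = ws;
  }

  broadcastMessage(msg: Uint8Array): void {
    if (this.wsconnected && this.ws.readyState === OPEN) {
      this.ws.send(msg);
    } else {
      // Previously the message was silently dropped here.
      this._pendingMessages.push(msg);
    }
  }

  // Called from the socket's onopen handler.
  onOpen(syncStep1: Uint8Array): void {
    this.wsconnected = true;
    this.ws.send(syncStep1);
    // Flush after syncStep1, preserving the original order.
    const queued = this._pendingMessages;
    this._pendingMessages = [];
    for (const m of queued) this.ws.send(m);
  }
}
```
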
Documents got stuck in idle.diverged because the global RelayMergeHSM IDB
states store accumulated orphaned entries. On doc deletion and GUID
remap, the per-doc y-indexeddb databases were deleted but the global state
entry was not, leaving stale data that prevented proper cold-start
classification.

- notifyHSMDestroyed: clear _lcaCache and _localStateVectorCache
- SharedFolder.deleteFile: delete global state entry for removed doc
- SharedFolders.delete: delete folder-level index entry
- Document.handleGuidRemap: delete old GUID's global state entry
- All IDB deletes registered with awaitOnReload for reload safety

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…immediately

The machine interpreter never cleared _activeInvoke for internal
self-transitions (no reenter), leaving a stale reference that caused
hibernate() to defer indefinitely. Docs stuck in idle.diverged after a
failed three-way merge consumed warm slots forever.

Two fixes:
- Clear _activeInvoke in handleInvokeEvent before transition resolution
- Emit REQUEST_HIBERNATE when idle.diverged merge fails, so MergeManager
  hibernates the doc immediately instead of waiting for the 60s timer

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add outbound safety net to flushOutbound — after queue drain, checks if
localDoc has state that remoteDoc doesn't. If so, syncs it and emits a
DIAGNOSTIC effect with the CM6 diff of what bypassed the queue.

Convert existing inbound safety net from console.error to the same
DIAGNOSTIC effect pattern with CM6 diff output.

Both use computeDiffChanges (PositionedChange[] format) for precise
reproduction of integrity violations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`staging --watch <dir>` was setting output to `--watch` instead of
the vault path because argv[3] grabbed the flag positionally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
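
The positional-argument bug above can be avoided by separating flags from positionals before indexing. This is a sketch only: `parseBuildArgs` and `BuildArgs` are hypothetical, and the actual esbuild.config.mjs may parse its arguments differently.

```typescript
interface BuildArgs {
  mode: string | undefined;
  outDir: string | undefined;
  watch: boolean;
}

// `args` is the argument list after the script name,
// e.g. ["staging", "--watch", "/path/to/vault"].
function parseBuildArgs(args: string[]): BuildArgs {
  // Flags are filtered out first, so they can never be mistaken for a path.
  const positionals = args.filter((a) => !a.startsWith("--"));
  return {
    mode: positionals[0],
    outDir: positionals[1],
    watch: args.indexOf("--watch") !== -1,
  };
}
```
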
installGlobal() created a plain API object without registerBridge,
so SharedFolder's debugAPI?.registerBridge was always undefined and
recording bridges were never registered. HSM recording was silently
broken — startRecording returned recording: false with 0 documents.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ke queue

Expands RelayMetrics with 13 new metrics covering LiveViews refresh,
FolderNav refresh, PostOffice delivery, BackgroundSync concurrency/timing,
and wake queue slot utilization. All no-op without obsidian-metrics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
vault.process() link repair after file rename now creates a fork in idle
mode to prevent CRDT duplication when both vaults independently repair
wikilinks.

The idle path sets the fork synchronously (using LCA content) to gate
REMOTE_UPDATE via hasFork, then awaits IDB persistence sync to set a
valid OpCapture captureMark. After fork-reconcile succeeds, the edit is
registered as a pending machine edit with a 5s TTL so late-arriving
remote CRDTs carrying the same edit are detected and skipped by
idle-merge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
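
The pending machine edit with a 5s TTL might be tracked along these lines. `PendingMachineEdits` and the injected clock are assumptions made for this sketch; whether a matched entry is consumed or kept until expiry is not stated in the commit, so entries are simply kept here.

```typescript
// Hypothetical registry of machine edits that late-arriving remote updates may repeat.
class PendingMachineEdits {
  private entries: { expectedText: string; expiresAt: number }[] = [];
  private ttlMs: number;
  private now: () => number;

  // `now` is injectable so tests can control time; 5000 ms mirrors the 5s TTL.
  constructor(ttlMs: number = 5000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  register(expectedText: string): void {
    this.entries.push({ expectedText, expiresAt: this.now() + this.ttlMs });
  }

  // True if `text` matches an unexpired pending edit; expired entries are pruned.
  matches(text: string): boolean {
    const t = this.now();
    this.entries = this.entries.filter((e) => e.expiresAt > t);
    return this.entries.some((e) => e.expectedText === text);
  }
}
```

An idle-merge step could call `matches()` on the text carried by an incoming remote update and skip it when the same edit was already applied locally within the TTL window.
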
3-bar indicator in the file explorer sidebar shows wake queue pressure:
- 1 green bar: low usage (<30%)
- green + yellow: moderate usage (30-70%)
- 3 red bars: at capacity (100%)

Hover popover shows exact slot counts and pending queue depth.
Gated behind enableResourceMeter feature flag.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Distinguishes background-woken docs (working) from registered/unloaded
docs (cached) for clearer concurrency accounting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
10s interval keeps the meter in sync when hibernation frees slots
outside of layout change events.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ct resolution

beginReleaseLock now reads the definitive editor content via
EditorViewRef.getViewData() before nulling the ref. This ensures
deactivateEditor computes the correct LCA even when DISPATCH_CM6
changes reached the editor without a CM6_CHANGE echo (suppressed
by ySyncAnnotation). Also removes the WRITE_DISK workaround from
resolveConflict — Obsidian's onUnloadFile guarantees disk flush.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the unconditional initializeIfReady() call on every CM6
update() with a single targeted call from LiveView.attach() when
acquireLock completes. This avoids redundant work on every keystroke,
cursor move, and scroll for passive observers. The docChanged guard
and GUID-remap re-init paths are preserved.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rgence

Content string equality (localText === diskText) was used to skip
three-way merge on reconcile entry and in performThreeWayMergeFromState.
This allowed identical text content to mask divergent CRDT state vectors,
leaving one peer permanently one op behind with no alarm or recovery.

diff3 on identical inputs produces a clean no-op merge and goes through
the normal flushOutbound + entry flushInbound path, converging state
vectors without the shortcut.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add assertConvergence as an entry action on active.tracking so that
after both flushOutbound (from merge) and flushInbound (from entry)
have run, any remaining state vector divergence is detected and
force-synced bidirectionally. Respects localOnly and fork gates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>