feat: non-blocking concurrent checkpoint (WAL rotation + MVCC snapshot) #332

Open

loganpowell wants to merge 16 commits into LadybugDB:main from loganpowell:feat/concurrent-writes

Conversation


@loganpowell loganpowell commented Mar 26, 2026

Summary

This PR ports Vela-Engineering/kuzu's concurrent multi-writer checkpoint implementation
into ladybug. The implementation was developed by Aikins Laryea (Vela Partners) across
4 phases plus a read-only lock fix; the full diff is 2,360 lines across 45 files.

Before: checkpoint drains all transactions (reads + writes) and blocks new writes
until the entire checkpoint completes.

After: checkpoint only drains write transactions, releases the write gate as soon
as the WAL is rotated, and reads continue concurrently via MVCC snapshot isolation.
Shadow pages are applied with per-page CAS locking so optimistic readers detect updates.

What changed

Core mechanism

  • WAL rotation: at checkpoint start, the active .wal is atomically renamed to
    .wal.checkpoint (frozen WAL). New write transactions immediately create a fresh
    active WAL, isolated from the checkpoint.
  • Write gate release: once the WAL is rotated, the gate is released; storage
    materialization runs concurrently with new writers via per-node-group locking.
  • MVCC catalog snapshots: Catalog::serializeSnapshot(snapshotTS) and
    CatalogSet::serializeSnapshot(snapshotTxn) use a pinned snapshot timestamp for
    consistent serialization after the gate is released.
  • Epoch-based change tracking: Table::hasChanges (bool) replaced by
    changeEpoch (atomic uint64). captureChangeEpochs() snapshots per-table
    watermarks under the write gate; Table::checkpoint() uses watermarks to skip
    tables that haven't changed since the last gate.

Thread safety

  • StorageManager::mtx upgraded from exclusive std::mutex to
    mutable std::shared_mutex (reads take shared lock, DDL takes unique lock).
  • NodeTable::schemaMtx protects the columns vector during concurrent
    checkpoint vacuum.
  • PageManager::version and Catalog::version use std::atomic<uint64_t>
    with a stable lastCheckpointVersion baseline.
  • ShadowFile::applyShadowPages now calls BufferManager::updateFrameIfPageIsInFrame
    (CAS-loop locked variant) instead of the lock-free variant.

WAL recovery

  • WALReplayer::replay now handles both frozen (.wal.checkpoint) and active
    (.wal) WAL files independently via replayFrozenWAL / replayActiveWAL.

Read-only databases

  • File lock is skipped when readOnly=true, enabling concurrent read-only opens.

Apple clang compatibility

  • Explicit constructors added to SegmentCheckpointState and LazySegmentData.

Testing

  • Updated test/transaction/checkpoint_test.cpp to use the new
    logCheckpointAndApplyShadowPages(bool walRotated) signature, WAL rotation-aware
    flaky checkpointer classes, and ShadowFileDatabaseIDMismatchExistingDB now
    validates the frozen WAL (.wal.checkpoint) rather than the active WAL.
  • CI workflow at embed-graph-db/.github/workflows/ladybug-concurrent-writes-ci.yml
    covers build (ubuntu + macos), full ctest suite, checkpoint/concurrent-write test
    filter, and benchmark comparison (blocks on >5% single-writer regression or
    zero read-during-checkpoint throughput).

Check out the latest benchmark run.

Attribution

Concurrent checkpoint implementation: Aikins Laryea, Vela Partners
(Vela-Engineering/kuzu, commits 3ad9071..d298e9c)

Ported and adapted for ladybug (lbug namespace, mainStorageManager field,
2-arg applyShadowPages API, std::format, graph catalog support) by Logan Powell.

@loganpowell loganpowell force-pushed the feat/concurrent-writes branch 6 times, most recently from a531858 to 49d4a85 Compare March 27, 2026 12:55
@adsharma
Contributor

@loganpowell we run clang-format inside a Docker container on CI. The Ubuntu version has odd patches applied on top of the upstream sources. Please use scripts/run-clang-format-docker.sh.

@adsharma
Contributor

clang-format-18 on Ubuntu is not the same as clang-format-18 on Debian, Alpine, or Homebrew. The version shipped with Ubuntu carries patches that annoy people for exactly this reason.

#207

Ports Vela-Engineering/kuzu concurrent multi-writer support into ladybug.
Based on vela-kuzu commits 3ad9071..d298e9c relative to upstream
kuzudb/kuzu fork point 89f0263cc.

Changes
-------
WAL rotation
  checkpoint renames active WAL to .wal.checkpoint (frozen), allowing new
  write transactions to create a fresh WAL immediately

Non-blocking checkpoint
  only drains write transactions; reads continue via MVCC snapshot isolation
  (stopNewWriteTransactionsAndWaitUntilAllWriteTransactionsLeave replaces
  stopNewTransactionsAndWaitUntilAllTransactionsLeave)

MVCC catalog snapshots
  serializeSnapshot() uses a pinned snapshotTS for consistent catalog
  serialization after the write gate is released

Concurrent storage checkpoint
  epoch-based change tracking (changeEpoch replaces hasChanges bool) with
  per-table watermarks captured at drain time

Thread-safe shadow page application
  updateFrameIfPageIsInFrame with CAS loop for concurrent optimistic readers

Read-only opens
  skip file lock (LOCK_EX) when readOnly=true, allowing --readonly processes
  to open the DB concurrently with an active writer/checkpoint

WAL replayer
  handles both frozen (.wal.checkpoint) and active (.wal)

Atomic version tracking
  PageManager and Catalog use atomic<uint64_t> with lastCheckpointVersion

NodeTable schemaMtx
  shared_mutex protecting columns vector during concurrent checkpoint

StorageManager
  shared_mutex (was exclusive mutex) for getTable/serialize

Apple clang compat
  constructors for SegmentCheckpointState and LazySegmentData

Attribution: Vela-Engineering/kuzu (Aikins Laryea, Vela Partners)
Adapted for ladybug (lbug namespace, mainStorageManager field,
2-arg applyShadowPages API, std::format, graph catalog support)
- run_benchmarks.py: 4-metric harness (single-writer throughput,
  checkpoint latency, concurrent write throughput via gtest, and
  reads-during-checkpoint via --readonly subprocess)
- compare_benchmarks.py: CI gate -- fails if single_writer regresses >5%
  or read_during_checkpoint_ops_per_sec < 0.5 (checkpoint still blocking)
- report.py: history table from results/ dir; emoji thresholds aligned
  with CI gate (5% for red, not 2%)
- baseline.json: captured on main (201a727ee) with auto_checkpoint=false
  in concurrent tests for a fair comparison
- results/: fresh benchmark run data for main and feat/concurrent-writes
- .github/workflows/concurrent-writes-ci.yml: build + test + benchmark
  jobs; posts PR comment with comparison table

Fix concurrent-write benchmark accuracy:
  The 4 concurrent gtest tests were missing CALL auto_checkpoint=false,
  causing auto-checkpoint to fire repeatedly during the concurrent phase.
  On feat, each checkpoint drain holds mtxForStartingNewTransactions,
  serialising all writer threads around every checkpoint event -- making
  feat appear ~2x slower than main. With auto_checkpoint=false added
  (plus an explicit CHECKPOINT after setup and after the concurrent
  section), throughput is within ~2% between main and feat (noise level).
@loganpowell loganpowell force-pushed the feat/concurrent-writes branch from c6e842d to 4363810 Compare March 28, 2026 21:22
@loganpowell
Author

@adsharma, I rebased and made the clang change as per your suggestion.

…e checkpoint timeout tests

- Add Catalog::getVersionSinceCheckpoint() returning version - lastCheckpointVersion
  so CALL catalog_version() shows 0 immediately after CHECKPOINT (fixes CallCatalogVersion)
- Update CheckpointTimeoutErrorTest and AutoCheckpointTimeoutErrorTest to expect
  success (---- ok) instead of a timeout error: non-blocking checkpoint only waits
  on write transactions, so an open read-only transaction no longer blocks it
@loganpowell loganpowell force-pushed the feat/concurrent-writes branch from b4f87b5 to 7e16a0a Compare March 29, 2026 00:27
@adsharma
Contributor

Overall: The "do not block readers while checkpointing" part is more interesting than concurrent write throughput. I don't think the ladybugdb user base cares about concurrent write throughput.

Historically, this project is very closely related to DuckDB, so I'd be careful about deviating substantially from DuckDB's concurrency model. DuckDB supports single writer, but allows multiple-readers during checkpointing.

https://duckdb.org/docs/stable/connect/concurrency

  • The CI doesn't need clang-format (already handled in ci-workflow.yml) and this doesn't need to run on every PR.
  • Running it as a part of nightly scheduled CI is a good idea

We should also do some antithesis based runs to flush out bugs arising out of this PR or pre-existing issues. Let me know if it's something you're excited about taking on after this is merged.

@loganpowell
Author

@adsharma the concurrency model is what allows reads during checkpointing. The Vela use case for concurrency is multi-agent architectures wherein agents can read/write from/to the same graph safely, without any app-level locking. I really liked kuzu and am just happy folks are coming out to keep it alive (🙏). If you want to merge this in and have some specific tests you'd like me to write, I'd be happy to. If you do merge it, I can reach out to the Vela team to see if we can roll them in instead of maintaining divergent forks. In the meantime, I can pull out the clang-formatting.

…d PR comment step

- Workflow now triggers on schedule (02:00 UTC) and workflow_dispatch only
- clang-format is handled by ci-workflow.yml + local pre-commit hook
- Removed PR comment step (no PR context in scheduled runs)
@adsharma
Contributor

multi-agent architectures wherein agents can read/write from/to the same graph safely, without any app-level locking.

This is an interesting use case, but the question is can LadybugDB offer competitive performance vs OLTP tabular databases?

The way I'm imagining data flow is:

{agents} -> pgembed -> duckdb -> ladybugdb -> icebug

The other three are open source projects hosted here: https://github.com/Ladybug-Memory

Also, you had the concurrent write throughput before vs after as a comment on this page. Am I reading this correctly?

baseline:

  1/3 single_writer_txn_per_sec ...
       → 609.3 txn/s
  2/3 checkpoint_latency_ms ...
       → 31 ms
       ...

after:

  1/3 single_writer_txn_per_sec ...
       → 611.4 txn/s
  2/3 checkpoint_latency_ms ...
       → 32 ms
  3/3 concurrent write gtests (4 tests × 4 threads × 1000 inserts) ...
       EmptyDBTransactionTest.ConcurrentNodeInsertions: 1.8s
       EmptyDBTransactionTest.ConcurrentNodeUpdates: 12.6s
       EmptyDBTransactionTest.ConcurrentRelationshipInsertions: 6.0s
       EmptyDBTransactionTest.ConcurrentRelationshipUpdates: 15.3s
       → 447.6 txn/s aggregate
  4/4 read_during_checkpoint_ops_per_sec ...
       → 26.80 ops/s  (0 = blocked/locked, >0 = non-blocking)

So the aggregate throughput is actually lower than single threaded write throughput? With a postgres storage engine in beta right now, I've measured 100k writes/sec in my past life. The only problem is it's stuck in beta for at least 3 years right now.

@adsharma
Contributor

Also I misread the duckdb docs around "single writer" model. Here's the relevant paragraph:

DuckDB supports concurrency within a single process according to the following rules. As long as there are no write conflicts, multiple concurrent writes will succeed. Appends will never conflict, even on the same table. Multiple threads can also simultaneously update separate tables or separate subsets of the same table. Optimistic concurrency control comes into play when two threads attempt to edit (update or delete) the same row at the same time. In that situation, the second thread to attempt the edit will fail with a conflict error.

Good news: this PR doesn't break the duckdb concurrency model. I have no objections to it at a conceptual level.

I have some disagreements with the duckdb folks about VFS level concurrency. I think they're stricter than they need to be to the point where a WAL replication setup is impossible on all platforms. But that's a separate topic.

For this particular PR, we just need to work through the 3 review comments and a post-merge MVCC testing plan to flush out any bugs.

@loganpowell
Author

Sorry, I don't actually see any in-line reviews/comments. Can you reiterate them?
WRT the benchmarks, yes, the concurrency feature seems to actually speed up single-writer TPS. You can see it more clearly here (sorry for the messy/duplicate job summaries). It also slightly increases checkpoint time.

@adsharma
Contributor

adsharma commented Mar 30, 2026

I think an earlier version of your benchmarking workflow added comments to this PR with output that looked like what you pointed to. But once you force pushed, they disappeared and more recent runs are failing because they don't have permissions to add comments. So I looked into the runs manually and extracted the relevant data.

Benchmarking on github runners is a bit tricky because of the amount of noise and all the problems github is going through.

So I ran some benchmark like this

wget -O bench.py https://gist.github.com/adsharma/36dd6ae67390ec0d612e8564082ad37a
uv run bench.py

baseline.txt

Generating 1,000 rows...

--- Cypher CREATE (vanilla insert) --- 3 iters ---
    Run 1: 7.546s
    Run 2: 7.533s
    Run 3: 7.547s
  Avg Time : 7.542s
  Rows     : 1,000
  Rate     : 133 rows/s

--- Cypher MERGE (no conflicts) --- 3 iters ---
    Run 1: 7.643s
    Run 2: 7.697s
    Run 3: 7.640s
  Avg Time : 7.660s
  Rows     : 1,000
  Rate     : 131 rows/s

--- Cypher MERGE (50% conflicts) --- 3 iters ---
    Run 1: 7.629s
    Run 2: 7.606s
    Run 3: 7.617s
  Avg Time : 7.617s
  Rows     : 1,000
  Rate     : 131 rows/s

branch.txt:

Generating 1,000 rows...

--- Cypher CREATE (vanilla insert) --- 3 iters ---
    Run 1: 7.505s
    Run 2: 7.516s
    Run 3: 7.537s
  Avg Time : 7.519s
  Rows     : 1,000
  Rate     : 133 rows/s

--- Cypher MERGE (no conflicts) --- 3 iters ---
    Run 1: 7.637s
    Run 2: 7.683s
    Run 3: 7.614s
  Avg Time : 7.645s
  Rows     : 1,000
  Rate     : 131 rows/s

--- Cypher MERGE (50% conflicts) --- 3 iters ---
    Run 1: 7.672s
    Run 2: 7.607s
    Run 3: 7.580s
  Avg Time : 7.620s
  Rows     : 1,000
  Rate     : 131 rows/s

I can't detect any benefit on single threaded writes. There are no regressions either. Which is good news.

Like your link says:

Informational only — MVCC write-pipeline overhead; ~2x slower than main is expected trade-off

The cost of enabling concurrent writes is that your throughput drops. In that particular benchmark run by 2x. The main benefit is that multiple writers are able to write at the same time and reads are not blocked. If you want optimal throughput, you're better off using a mutex in the application.

auto checkpointer = initCheckpointerFunc(clientContext);
try {
    checkpointer->writeCheckpoint();
    checkpointer->beginCheckpoint(lastTimestamp);
Contributor


lastTimestamp is written under a mutex, but read without a mutex (since this is checkpointNoLock()). This is UB under C++ memory rules. How is safety guaranteed here?

Author


Good catch. Fixed in 71cb2a7. lastTimestamp is a plain common::transaction_t (non-atomic). Every writer increments it inside mtxForSerializingPublicFunctionCalls (see commit()), but the old code read it bare in checkpointNoLock(). Even though the acquire/release pair on activeWriteTransactionCount gives ordering for the count itself, concurrent access to a non-atomic variable is still UB under the C++ memory model.

We now snapshot it under a brief lock before calling beginCheckpoint():

transaction_t snapshotTimestamp;
{
    std::unique_lock lck{mtxForSerializingPublicFunctionCalls};
    snapshotTimestamp = lastTimestamp;
}
checkpointer->beginCheckpoint(snapshotTimestamp);


-    lock, state.columnIDs, columnPtrs);
-    return insertChunkedGroup->flush(&DUMMY_CHECKPOINT_TRANSACTION, state.pageAllocator);
+    lock, state.columnIDs, columnPtrs, txn);
+    return insertChunkedGroup->flush(const_cast<Transaction*>(txn), state.pageAllocator);
Contributor


Is this const_cast<> safe?

Author


No, it was not safe — and both casts are now removed (71cb2a7, compile fix in 4d02d95).

The root cause: InMemChunkedNodeGroup::flush() and ChunkedNodeGroup::scanCommitted() took non-const Transaction* even though neither function mutates through the pointer. NodeGroupCheckpointState::transaction is const Transaction*, so both call sites needed a cast to paper over the mismatch.

Fix: propagated const Transaction* through the full class hierarchy.

-STATEMENT [conn2] CHECKPOINT;
---- error
Timeout waiting for active transactions to leave the system before checkpointing. If you have an open transaction, please close it and try again.
---- ok
Contributor


Assuming the WAL related code handle MVCC checkpoints, we should still review secondary data structures:

  • Hash index: index.checkpoint(context, pageAllocator) in NodeTable::checkpoint is still called without a snapshotTxn — it uses whatever live index state is present during materialization.
  • Stats: tableEntry->vacuumColumnIDs(0) modifies the catalog entry's column ID set while a read transaction may be using those columns.
  • serializeCatalogAndMetadata: The catalog snapshot path (serializeCatalogSnapshot) is gated on snapshotTS > 0, which is only set in the new beginCheckpoint(lastTimestamp) path. But the WAL rotation path calls beginCheckpoint then releases the write gate early — new catalog mutations (DDL)
    could race with serialization in finishCheckpoint.

Author


vacuumColumnIDs race — Fixed (71cb2a7). tableEntry->vacuumColumnIDs(0) is now called under a std::unique_lock on schemaMtx. Concurrent readers that iterate the column set also acquire schemaMtx, so they will block until the mutation completes.


DDL race in finishCheckpoint — Safe by timestamp exclusion. snapshotTS is captured under the write gate in beginCheckpoint() before the gate is released. finishCheckpoint() calls serializeCatalogSnapshot() (selected when snapshotTS > 0), which calls catalog.serializeSnapshot(serializer, snapshotTS) — only entries with commit timestamp <= snapshotTS are included. Any DDL that commits after the gate releases gets a strictly greater timestamp and is invisible to the serialized snapshot. We added a comment in finishCheckpoint() making this invariant explicit.


Hash index without snapshotTxn — Acknowledged as a pre-existing limitation. HashIndexLocalStorage has no per-entry timestamps, so there is no mechanism to bound the index to snapshotTS without a deeper refactor of the hash-index infrastructure. I added a TODO comment at the call site in NodeTable::checkpoint() and it is tracked as a follow-up; it is not introduced by this PR.


loganpowell added a commit to loganpowell/ladybug that referenced this pull request Mar 30, 2026
…amp race, const_cast, vacuumColumnIDs, DDL-race comment)
@loganpowell
Author

Give me some time to address your comments

…amp race, const_cast, vacuumColumnIDs, DDL-race comment)
- Wrap pre-reload CHECKPOINT result in inner scope to prevent
  use-after-free when createDBAndConn() destroys the database
- Explicitly reset conn2/r2 before reloading DB in drain test
- Update CI workflow with correct test filter names
…ationale

The write gate (mtxForStartingNewTransactions) is correctly released after
WAL rotation. Holding it through the full checkpointStoragePhase would cause
a deadlock in tests that restart the database mid-checkpoint (e.g.
ShadowFileDatabaseIDMismatch*), since Database::~Database() calls
transactionManager->checkpoint() which would block indefinitely.

The hash-index/node-data snapshot divergence is a pre-existing limitation
of HashIndexLocalStorage having no per-entry timestamps. The correct fix
requires adding timestamp-aware snapshotting to HashIndexLocalStorage and
is tracked as a follow-up; document it in transaction_manager.cpp.

Also:
- Add rationale comment to postCheckpointCleanup explaining why no
  try/catch is intentional (data already durable at finishCheckpoint)
- Rename HashIndexConsistentAfterCheckpoint test to
  HashIndexBasicRecoveryAfterCheckpoint to reflect what it actually
  tests (baseline recovery, not the ghost-key invariant)
Add git submodule update --init --depth 1 dataset to the test job so
the full e2e_test_transaction suite can run (copy/recovery tests need CSV
files from dataset/tinysnb/). Locally verified: 515/530 pass, 0 fail
(15 skipped are LDBC large-dataset tests disabled in CMake).

Also bump the Transaction tests timeout from 120s to 300s to accommodate
the concurrent transaction suite on slower CI runners.
- compare_benchmarks.py already appends to $GITHUB_STEP_SUMMARY;
  remove the manual 'cat comparison.md >> $GITHUB_STEP_SUMMARY' step
  and the second bare compare call (which was adding it a 3rd time).
  Use continue-on-error + outcome check instead.
- Both matrix legs (ubuntu + macos) wrote test tables to the run
  summary; redirect $GITHUB_STEP_SUMMARY to /dev/null for non-ubuntu
  so only one table appears.