feat: non-blocking concurrent checkpoint (WAL rotation + MVCC snapshot) #332

Open

loganpowell wants to merge 16 commits into LadybugDB:main from loganpowell:feat/concurrent-writes

Conversation


@loganpowell loganpowell commented Mar 26, 2026

Summary

This PR ports Vela-Engineering/kuzu's concurrent multi-writer checkpoint implementation
into ladybug. The implementation was developed by Aikins Laryea (Vela Partners) across
4 phases plus a read-only lock fix; the full diff is 2,360 lines across 45 files.

Before: checkpoint drains all transactions (reads + writes) and blocks new writes
until the entire checkpoint completes.

After: checkpoint only drains write transactions, releases the write gate as soon
as the WAL is rotated, and reads continue concurrently via MVCC snapshot isolation.
Shadow pages are applied with per-page CAS locking so optimistic readers detect updates.

What changed

Core mechanism

  • WAL rotation: at checkpoint start, the active .wal is atomically renamed to
    .wal.checkpoint (frozen WAL). New write transactions immediately create a fresh
    active WAL, isolated from the checkpoint.
  • Write gate release: once the WAL is rotated, the gate is released; storage
    materialization runs concurrently with new writers via per-node-group locking.
  • MVCC catalog snapshots: Catalog::serializeSnapshot(snapshotTS) and
    CatalogSet::serializeSnapshot(snapshotTxn) use a pinned snapshot timestamp for
    consistent serialization after the gate is released.
  • Epoch-based change tracking: Table::hasChanges (bool) replaced by
    changeEpoch (atomic uint64). captureChangeEpochs() snapshots per-table
    watermarks under the write gate; Table::checkpoint() uses watermarks to skip
    tables that haven't changed since the last gate.

Thread safety

  • StorageManager::mtx upgraded from exclusive std::mutex to
    mutable std::shared_mutex (reads take shared lock, DDL takes unique lock).
  • NodeTable::schemaMtx protects the columns vector during concurrent
    checkpoint vacuum.
  • PageManager::version and Catalog::version use std::atomic<uint64_t>
    with a stable lastCheckpointVersion baseline.
  • ShadowFile::applyShadowPages now calls BufferManager::updateFrameIfPageIsInFrame
    (CAS-loop locked variant) instead of the lock-free variant.

WAL recovery

  • WALReplayer::replay now handles both frozen (.wal.checkpoint) and active
    (.wal) WAL files independently via replayFrozenWAL / replayActiveWAL.

Read-only databases

  • File lock is skipped when readOnly=true, enabling concurrent read-only opens.

Apple clang compatibility

  • Explicit constructors added to SegmentCheckpointState and LazySegmentData.

Testing

  • Updated test/transaction/checkpoint_test.cpp to use the new
    logCheckpointAndApplyShadowPages(bool walRotated) signature, WAL rotation-aware
    flaky checkpointer classes, and ShadowFileDatabaseIDMismatchExistingDB now
    validates the frozen WAL (.wal.checkpoint) rather than the active WAL.
  • CI workflow at embed-graph-db/.github/workflows/ladybug-concurrent-writes-ci.yml
    covers build (ubuntu + macos), full ctest suite, checkpoint/concurrent-write test
    filter, and benchmark comparison (blocks on >5% single-writer regression or
    zero read-during-checkpoint throughput).

Check out the latest benchmark run.

Attribution

Concurrent checkpoint implementation: Aikins Laryea, Vela Partners
(Vela-Engineering/kuzu, commits 3ad9071..d298e9c)

Ported and adapted for ladybug (lbug namespace, mainStorageManager field,
2-arg applyShadowPages API, std::format, graph catalog support) by Logan Powell.

@loganpowell loganpowell force-pushed the feat/concurrent-writes branch 6 times, most recently from a531858 to 49d4a85 Compare March 27, 2026 12:55
@adsharma
Contributor

@loganpowell we run clang-format inside a Docker container on CI. The Ubuntu version has odd patches applied on top of the upstream sources. Please use scripts/run-clang-format-docker.sh.

@adsharma
Contributor

clang-format-18 on Ubuntu is not the same as clang-format-18 on Debian, Alpine, or Homebrew. The version shipped with Ubuntu carries patches that annoy people for exactly this reason.

#207

Ports Vela-Engineering/kuzu concurrent multi-writer support into ladybug.
Based on vela-kuzu commits 3ad9071..d298e9c relative to upstream
kuzudb/kuzu fork point 89f0263cc.

Changes
-------
WAL rotation
  checkpoint renames active WAL to .wal.checkpoint (frozen), allowing new
  write transactions to create a fresh WAL immediately

Non-blocking checkpoint
  only drains write transactions; reads continue via MVCC snapshot isolation
  (stopNewWriteTransactionsAndWaitUntilAllWriteTransactionsLeave replaces
  stopNewTransactionsAndWaitUntilAllTransactionsLeave)

MVCC catalog snapshots
  serializeSnapshot() uses a pinned snapshotTS for consistent catalog
  serialization after the write gate is released

Concurrent storage checkpoint
  epoch-based change tracking (changeEpoch replaces hasChanges bool) with
  per-table watermarks captured at drain time

Thread-safe shadow page application
  updateFrameIfPageIsInFrame with CAS loop for concurrent optimistic readers

Read-only opens
  skip file lock (LOCK_EX) when readOnly=true, allowing --readonly processes
  to open the DB concurrently with an active writer/checkpoint

WAL replayer
  handles both frozen (.wal.checkpoint) and active (.wal)

Atomic version tracking
  PageManager and Catalog use atomic<uint64_t> with lastCheckpointVersion

NodeTable schemaMtx
  shared_mutex protecting columns vector during concurrent checkpoint

StorageManager
  shared_mutex (was exclusive mutex) for getTable/serialize

Apple clang compat
  constructors for SegmentCheckpointState and LazySegmentData

Attribution: Vela-Engineering/kuzu (Aikins Laryea, Vela Partners)
Adapted for ladybug (lbug namespace, mainStorageManager field,
2-arg applyShadowPages API, std::format, graph catalog support)
- run_benchmarks.py: 4-metric harness (single-writer throughput,
  checkpoint latency, concurrent write throughput via gtest, and
  reads-during-checkpoint via --readonly subprocess)
- compare_benchmarks.py: CI gate -- fails if single_writer regresses >5%
  or read_during_checkpoint_ops_per_sec < 0.5 (checkpoint still blocking)
- report.py: history table from results/ dir; emoji thresholds aligned
  with CI gate (5% for red, not 2%)
- baseline.json: captured on main (201a727ee) with auto_checkpoint=false
  in concurrent tests for a fair comparison
- results/: fresh benchmark run data for main and feat/concurrent-writes
- .github/workflows/concurrent-writes-ci.yml: build + test + benchmark
  jobs; posts PR comment with comparison table

Fix concurrent-write benchmark accuracy:
  The 4 concurrent gtest tests were missing CALL auto_checkpoint=false,
  causing auto-checkpoint to fire repeatedly during the concurrent phase.
  On feat, each checkpoint drain holds mtxForStartingNewTransactions,
  serialising all writer threads around every checkpoint event -- making
  feat appear ~2x slower than main. With auto_checkpoint=false added
  (plus an explicit CHECKPOINT after setup and after the concurrent
  section), throughput is within ~2% between main and feat (noise level).
@loganpowell loganpowell force-pushed the feat/concurrent-writes branch from c6e842d to 4363810 Compare March 28, 2026 21:22
@loganpowell
Author

@adsharma, I rebased and made the clang change as per your suggestion.

…e checkpoint timeout tests

- Add Catalog::getVersionSinceCheckpoint() returning version - lastCheckpointVersion
  so CALL catalog_version() shows 0 immediately after CHECKPOINT (fixes CallCatalogVersion)
- Update CheckpointTimeoutErrorTest and AutoCheckpointTimeoutErrorTest to expect
  success (---- ok) instead of a timeout error: non-blocking checkpoint only waits
  on write transactions, so an open read-only transaction no longer blocks it
@loganpowell loganpowell force-pushed the feat/concurrent-writes branch from b4f87b5 to 7e16a0a Compare March 29, 2026 00:27
@adsharma
Contributor

Overall: The "do not block readers while checkpointing" part is more interesting than concurrent write throughput. I don't think the ladybugdb user base cares about concurrent write throughput.

Historically, this project is very closely related to DuckDB, so I'd be careful about deviating substantially from DuckDB's concurrency model. DuckDB supports single writer, but allows multiple-readers during checkpointing.

https://duckdb.org/docs/stable/connect/concurrency

  • The CI doesn't need clang-format (already handled in ci-workflow.yml) and this doesn't need to run on every PR.
  • Running it as a part of nightly scheduled CI is a good idea

We should also do some antithesis based runs to flush out bugs arising out of this PR or pre-existing issues. Let me know if it's something you're excited about taking on after this is merged.

@loganpowell
Author

@adsharma the concurrency model is what allows reads during checkpointing. The Vela use case for concurrency is multi-agent architectures wherein agents can read/write from/to the same graph safely, without any app-level locking. I really liked kuzu and am just happy folks are coming out to keep it alive (🙏). If you want to merge this in and have some specific tests you'd like me to write, I'd be happy to. If you do merge it, I can reach out to the Vela team to see if we can roll them in instead of maintaining divergent forks. In the meantime, I can pull out the clang-formatting.

…d PR comment step

- Workflow now triggers on schedule (02:00 UTC) and workflow_dispatch only
- clang-format is handled by ci-workflow.yml + local pre-commit hook
- Removed PR comment step (no PR context in scheduled runs)
@adsharma
Contributor

multi-agent architectures wherein agents can read/write from/to the same graph safely, without any app-level locking.

This is an interesting use case, but the question is can LadybugDB offer competitive performance vs OLTP tabular databases?

The way I'm imagining data flow is:

{agents} -> pgembed -> duckdb -> ladybugdb -> icebug

The other three are open source projects hosted here: https://github.com/Ladybug-Memory

Also, you had the concurrent write throughput before vs after as a comment on this page. Am I reading this correctly?

baseline:

  1/3 single_writer_txn_per_sec ...
       → 609.3 txn/s
  2/3 checkpoint_latency_ms ...
       → 31 ms
       ...

after:

  1/3 single_writer_txn_per_sec ...
       → 611.4 txn/s
  2/3 checkpoint_latency_ms ...
       → 32 ms
  3/3 concurrent write gtests (4 tests × 4 threads × 1000 inserts) ...
       EmptyDBTransactionTest.ConcurrentNodeInsertions: 1.8s
       EmptyDBTransactionTest.ConcurrentNodeUpdates: 12.6s
       EmptyDBTransactionTest.ConcurrentRelationshipInsertions: 6.0s
       EmptyDBTransactionTest.ConcurrentRelationshipUpdates: 15.3s
       → 447.6 txn/s aggregate
  4/4 read_during_checkpoint_ops_per_sec ...
       → 26.80 ops/s  (0 = blocked/locked, >0 = non-blocking)

So the aggregate throughput is actually lower than single threaded write throughput? With a postgres storage engine in beta right now, I've measured 100k writes/sec in my past life. The only problem is it's stuck in beta for at least 3 years right now.

@adsharma
Contributor

Also I misread the duckdb docs around "single writer" model. Here's the relevant paragraph:

DuckDB supports concurrency within a single process according to the following rules. As long as there are no write conflicts, multiple concurrent writes will succeed. Appends will never conflict, even on the same table. Multiple threads can also simultaneously update separate tables or separate subsets of the same table. Optimistic concurrency control comes into play when two threads attempt to edit (update or delete) the same row at the same time. In that situation, the second thread to attempt the edit will fail with a conflict error.

Good news: this PR doesn't break the duckdb concurrency model. I have no objections to it at a conceptual level.

I have some disagreements with the duckdb folks about VFS level concurrency. I think they're stricter than they need to be to the point where a WAL replication setup is impossible on all platforms. But that's a separate topic.

For this particular PR, we just need to work through the 3 review comments and a post-merge MVCC testing plan to flush out any bugs.

@loganpowell
Author

Sorry, I don't actually see any in-line reviews/comments. Can you reiterate them?
WRT the benchmarks, yes, the concurrency feature seems to actually speed up single-writer TPS. You can see it more clearly here (sorry for the messy/duplicate job summaries). It also slightly increases checkpoint time.

@adsharma
Contributor

adsharma commented Mar 30, 2026

I think an earlier version of your benchmarking workflow added comments to this PR with output that looked like what you pointed to. But once you force pushed, they disappeared and more recent runs are failing because they don't have permissions to add comments. So I looked into the runs manually and extracted the relevant data.

Benchmarking on github runners is a bit tricky because of the amount of noise and all the problems github is going through.

So I ran some benchmark like this

wget -O bench.py https://gist.github.com/adsharma/36dd6ae67390ec0d612e8564082ad37a
uv run bench.py

baseline.txt

Generating 1,000 rows...

--- Cypher CREATE (vanilla insert) --- 3 iters ---
    Run 1: 7.546s
    Run 2: 7.533s
    Run 3: 7.547s
  Avg Time : 7.542s
  Rows     : 1,000
  Rate     : 133 rows/s

--- Cypher MERGE (no conflicts) --- 3 iters ---
    Run 1: 7.643s
    Run 2: 7.697s
    Run 3: 7.640s
  Avg Time : 7.660s
  Rows     : 1,000
  Rate     : 131 rows/s

--- Cypher MERGE (50% conflicts) --- 3 iters ---
    Run 1: 7.629s
    Run 2: 7.606s
    Run 3: 7.617s
  Avg Time : 7.617s
  Rows     : 1,000
  Rate     : 131 rows/s

branch.txt:

Generating 1,000 rows...

--- Cypher CREATE (vanilla insert) --- 3 iters ---
    Run 1: 7.505s
    Run 2: 7.516s
    Run 3: 7.537s
  Avg Time : 7.519s
  Rows     : 1,000
  Rate     : 133 rows/s

--- Cypher MERGE (no conflicts) --- 3 iters ---
    Run 1: 7.637s
    Run 2: 7.683s
    Run 3: 7.614s
  Avg Time : 7.645s
  Rows     : 1,000
  Rate     : 131 rows/s

--- Cypher MERGE (50% conflicts) --- 3 iters ---
    Run 1: 7.672s
    Run 2: 7.607s
    Run 3: 7.580s
  Avg Time : 7.620s
  Rows     : 1,000
  Rate     : 131 rows/s

I can't detect any benefit on single threaded writes. There are no regressions either. Which is good news.

Like your link says:

Informational only — MVCC write-pipeline overhead; ~2x slower than main is expected trade-off

The cost of enabling concurrent writes is that your throughput drops. In that particular benchmark run by 2x. The main benefit is that multiple writers are able to write at the same time and reads are not blocked. If you want optimal throughput, you're better off using a mutex in the application.

auto checkpointer = initCheckpointerFunc(clientContext);
try {
    checkpointer->writeCheckpoint();
    checkpointer->beginCheckpoint(lastTimestamp);
Contributor


lastTimestamp is written under a mutex, but read without a mutex (since this is checkpointNoLock()). This is UB under C++ memory rules. How is safety guaranteed here?

Author


Good catch. Fixed in 71cb2a7. lastTimestamp is a plain common::transaction_t (non-atomic). Every writer increments it inside mtxForSerializingPublicFunctionCalls (see commit()), but the old code read it bare in checkpointNoLock(). Even though the acquire/release pair on activeWriteTransactionCount gives ordering for the count itself, concurrent access to a non-atomic variable is still UB under the C++ memory model.

We now snapshot it under a brief lock before calling beginCheckpoint():

transaction_t snapshotTimestamp;
{
    std::unique_lock lck{mtxForSerializingPublicFunctionCalls};
    snapshotTimestamp = lastTimestamp;
}
checkpointer->beginCheckpoint(snapshotTimestamp);


-    lock, state.columnIDs, columnPtrs);
-    return insertChunkedGroup->flush(&DUMMY_CHECKPOINT_TRANSACTION, state.pageAllocator);
+    lock, state.columnIDs, columnPtrs, txn);
+    return insertChunkedGroup->flush(const_cast<Transaction*>(txn), state.pageAllocator);
Contributor


Is this const_cast<> safe?

Author


No, it was not safe — and both casts are now removed (71cb2a7, compile fix in 4d02d95).

The root cause: InMemChunkedNodeGroup::flush() and ChunkedNodeGroup::scanCommitted() took non-const Transaction* even though neither function mutates through the pointer. NodeGroupCheckpointState::transaction is const Transaction*, so both call sites needed a cast to paper over the mismatch.

Fix: propagated const Transaction* through the full class hierarchy.

-STATEMENT [conn2] CHECKPOINT;
---- error
Timeout waiting for active transactions to leave the system before checkpointing. If you have an open transaction, please close it and try again.
---- ok
Contributor


Assuming the WAL related code handle MVCC checkpoints, we should still review secondary data structures:

  • Hash index: index.checkpoint(context, pageAllocator) in NodeTable::checkpoint is still called without a snapshotTxn — it uses whatever live index state is present during materialization.
  • Stats: tableEntry->vacuumColumnIDs(0) modifies the catalog entry's column ID set while a read transaction may be using those columns.
  • serializeCatalogAndMetadata: The catalog snapshot path (serializeCatalogSnapshot) is gated on snapshotTS > 0, which is only set in the new beginCheckpoint(lastTimestamp) path. But the WAL rotation path calls beginCheckpoint then releases the write gate early — new catalog mutations (DDL)
    could race with serialization in finishCheckpoint.

Author


vacuumColumnIDs race — Fixed (71cb2a7). tableEntry->vacuumColumnIDs(0) is now called under a std::unique_lock on schemaMtx. Concurrent readers that iterate the column set also acquire schemaMtx, so they will block until the mutation completes.


DDL race in finishCheckpoint — Safe by timestamp exclusion. snapshotTS is captured under the write gate in beginCheckpoint() before the gate is released. finishCheckpoint() calls serializeCatalogSnapshot() (selected when snapshotTS > 0), which calls catalog.serializeSnapshot(serializer, snapshotTS) — only entries with commit timestamp <= snapshotTS are included. Any DDL that commits after the gate releases gets a strictly greater timestamp and is invisible to the serialized snapshot. We added a comment in finishCheckpoint() making this invariant explicit.


Hash index without snapshotTxn — Acknowledged as a pre-existing limitation. HashIndexLocalStorage has no per-entry timestamps, so there is no mechanism to bound the index to snapshotTS without a deeper refactor of the hash-index infrastructure. I added a TODO comment at the call site in NodeTable::checkpoint() and it is tracked as a follow-up; it is not introduced by this PR.


loganpowell added a commit to loganpowell/ladybug that referenced this pull request Mar 30, 2026
…amp race, const_cast, vacuumColumnIDs, DDL-race comment)
@loganpowell
Author

Give me some time to address your comments

…amp race, const_cast, vacuumColumnIDs, DDL-race comment)
- Wrap pre-reload CHECKPOINT result in inner scope to prevent
  use-after-free when createDBAndConn() destroys the database
- Explicitly reset conn2/r2 before reloading DB in drain test
- Update CI workflow with correct test filter names
…ationale

The write gate (mtxForStartingNewTransactions) is correctly released after
WAL rotation. Holding it through the full checkpointStoragePhase would cause
a deadlock in tests that restart the database mid-checkpoint (e.g.
ShadowFileDatabaseIDMismatch*), since Database::~Database() calls
transactionManager->checkpoint() which would block indefinitely.

The hash-index/node-data snapshot divergence is a pre-existing limitation
of HashIndexLocalStorage having no per-entry timestamps. The correct fix
requires adding timestamp-aware snapshotting to HashIndexLocalStorage and
is tracked as a follow-up; document it in transaction_manager.cpp.

Also:
- Add rationale comment to postCheckpointCleanup explaining why no
  try/catch is intentional (data already durable at finishCheckpoint)
- Rename HashIndexConsistentAfterCheckpoint test to
  HashIndexBasicRecoveryAfterCheckpoint to reflect what it actually
  tests (baseline recovery, not the ghost-key invariant)
Add git submodule update --init --depth 1 dataset to the test job so
the full e2e_test_transaction suite can run (copy/recovery tests need CSV
files from dataset/tinysnb/). Locally verified: 515/530 pass, 0 fail
(15 skipped are LDBC large-dataset tests disabled in CMake).

Also bump the Transaction tests timeout from 120s to 300s to accommodate
the concurrent transaction suite on slower CI runners.
- compare_benchmarks.py already appends to $GITHUB_STEP_SUMMARY;
  remove the manual 'cat comparison.md >> $GITHUB_STEP_SUMMARY' step
  and the second bare compare call (which was adding it a 3rd time).
  Use continue-on-error + outcome check instead.
- Both matrix legs (ubuntu + macos) wrote test tables to the run
  summary; redirect $GITHUB_STEP_SUMMARY to /dev/null for non-ubuntu
  so only one table appears.