feat: non-blocking concurrent checkpoint (WAL rotation + MVCC snapshot) #332
loganpowell wants to merge 16 commits into LadybugDB:main
Conversation
@loganpowell we run clang-format inside a docker container on CI. The ubuntu version has weird patches applied on top of upstream sources. Please use
Ports Vela-Engineering/kuzu concurrent multi-writer support into ladybug. Based on vela-kuzu commits 3ad9071..d298e9c relative to upstream kuzudb/kuzu fork point 89f0263cc.

Changes
-------
- WAL rotation: checkpoint renames the active WAL to `.wal.checkpoint` (frozen), allowing new write transactions to create a fresh WAL immediately.
- Non-blocking checkpoint: only drains write transactions; reads continue via MVCC snapshot isolation (`stopNewWriteTransactionsAndWaitUntilAllWriteTransactionsLeave` replaces `stopNewTransactionsAndWaitUntilAllTransactionsLeave`).
- MVCC catalog snapshots: `serializeSnapshot()` uses a pinned `snapshotTS` for consistent catalog serialization after the write gate is released.
- Concurrent storage checkpoint: epoch-based change tracking (`changeEpoch` replaces the `hasChanges` bool) with per-table watermarks captured at drain time.
- Thread-safe shadow page application: `updateFrameIfPageIsInFrame` with a CAS loop for concurrent optimistic readers.
- Read-only opens: skip the file lock (`LOCK_EX`) when `readOnly=true`, allowing `--readonly` processes to open the DB concurrently with an active writer/checkpoint.
- WAL replayer: handles both frozen (`.wal.checkpoint`) and active (`.wal`) WAL files.
- Atomic version tracking: `PageManager` and `Catalog` use `atomic<uint64_t>` with `lastCheckpointVersion`.
- `NodeTable::schemaMtx`: `shared_mutex` protecting the columns vector during concurrent checkpoint.
- `StorageManager`: `shared_mutex` (was an exclusive mutex) for `getTable`/serialize.
- Apple clang compat: constructors for `SegmentCheckpointState` and `LazySegmentData`.

Attribution: Vela-Engineering/kuzu (Aikins Laryea, Vela Partners). Adapted for ladybug (`lbug` namespace, `mainStorageManager` field, 2-arg `applyShadowPages` API, `std::format`, graph catalog support).
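The epoch-based change tracking (a monotonically increasing `changeEpoch` replacing the `hasChanges` bool, with per-table watermarks captured at drain time) can be sketched as follows. Names such as `TableEpochSketch` and `WatermarkSketch` are illustrative, not the PR's actual classes:

```cpp
#include <atomic>
#include <cstdint>
#include <unordered_map>

// Sketch of epoch-based change tracking, with hypothetical names. A
// monotonically increasing epoch replaces a boolean dirty flag, so a
// checkpoint can capture per-table watermarks while writers are drained
// and later skip tables whose epoch has not advanced past the watermark.
struct TableEpochSketch {
    std::atomic<uint64_t> changeEpoch{0};
    // Every committed write bumps the epoch instead of setting a flag.
    void onWriteCommitted() { changeEpoch.fetch_add(1, std::memory_order_release); }
};

struct WatermarkSketch {
    std::unordered_map<uint64_t, uint64_t> watermarks; // tableID -> epoch at drain

    // Called under the write gate, while no write transactions are active.
    void capture(uint64_t tableID, const TableEpochSketch& t) {
        watermarks[tableID] = t.changeEpoch.load(std::memory_order_acquire);
    }

    // A table needs checkpointing only if its epoch moved past the watermark.
    bool changedSince(uint64_t tableID, const TableEpochSketch& t) const {
        return t.changeEpoch.load(std::memory_order_acquire) > watermarks.at(tableID);
    }
};
```

Unlike a boolean flag, the counter never needs to be "reset" racily: the checkpoint just remembers the value it observed at drain time.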
- run_benchmarks.py: 4-metric harness (single-writer throughput, checkpoint latency, concurrent write throughput via gtest, and reads-during-checkpoint via a --readonly subprocess)
- compare_benchmarks.py: CI gate -- fails if single_writer regresses >5% or read_during_checkpoint_ops_per_sec < 0.5 (i.e. the checkpoint is still blocking)
- report.py: history table from the results/ dir; emoji thresholds aligned with the CI gate (5% for red, not 2%)
- baseline.json: captured on main (201a727ee) with auto_checkpoint=false in the concurrent tests for a fair comparison
- results/: fresh benchmark run data for main and feat/concurrent-writes
- .github/workflows/concurrent-writes-ci.yml: build + test + benchmark jobs; posts a PR comment with the comparison table

Fix concurrent-write benchmark accuracy: the 4 concurrent gtest tests were missing CALL auto_checkpoint=false, causing auto-checkpoint to fire repeatedly during the concurrent phase. On feat, each checkpoint drain holds mtxForStartingNewTransactions, serialising all writer threads around every checkpoint event -- making feat appear ~2x slower than main. With auto_checkpoint=false added (plus an explicit CHECKPOINT after setup and after the concurrent section), throughput is within ~2% between main and feat (noise level).
…es that differed from clang-22)
@adsharma, I rebased and made the clang change as per your suggestion.
…e checkpoint timeout tests
- Add Catalog::getVersionSinceCheckpoint() returning version - lastCheckpointVersion so CALL catalog_version() shows 0 immediately after CHECKPOINT (fixes CallCatalogVersion)
- Update CheckpointTimeoutErrorTest and AutoCheckpointTimeoutErrorTest to expect success (---- ok) instead of a timeout error: the non-blocking checkpoint only waits on write transactions, so an open read-only transaction no longer blocks it
Overall: The "do not block readers while checkpointing" part is more interesting than concurrent write throughput. I don't think the ladybugdb user base cares about concurrent write throughput. Historically, this project is very closely related to DuckDB, so I'd be careful about deviating substantially from DuckDB's concurrency model. DuckDB supports single writer, but allows multiple-readers during checkpointing. https://duckdb.org/docs/stable/connect/concurrency
We should also do some antithesis-based runs to flush out bugs arising from this PR or pre-existing issues. Let me know if it's something you're excited about taking on after this is merged.
@adsharma the concurrency model is what allows reads during checkpointing. The Vela use-case for concurrency is multi-agent architectures wherein agents can read/write from/to the same graph safely, without any app-level locking. I really liked kuzu and am just happy folks are coming out to keep it alive (🙏). If you want to merge this in and have some specific tests you'd like me to write, I'd be happy to. If you do merge it, I can reach out to the Vela team to see if we can roll them in instead of maintaining divergent forks. In the meantime, I can pull out the clang-formatting.
…d PR comment step
- Workflow now triggers on schedule (02:00 UTC) and workflow_dispatch only
- clang-format is handled by ci-workflow.yml + a local pre-commit hook
- Removed the PR comment step (no PR context in scheduled runs)
This is an interesting use case, but the question is: can LadybugDB offer competitive performance vs OLTP tabular databases? The way I'm imagining the data flow is: {agents} -> pgembed -> duckdb -> ladybugdb -> icebug. The other three are open source projects hosted here: https://github.com/Ladybug-Memory

Also, you had the concurrent write throughput before vs after as a comment on this page. Am I reading this correctly?

baseline:
after:

So the aggregate throughput is actually lower than single-threaded write throughput? With a postgres storage engine in beta right now, I've measured 100k writes/sec in my past life. The only problem is that it's been stuck in beta for at least 3 years now.
Also I misread the duckdb docs around "single writer" model. Here's the relevant paragraph:
Good news: this PR doesn't break the duckdb concurrency model. I have no objections to it at a conceptual level.

I have some disagreements with the duckdb folks about VFS-level concurrency. I think they're stricter than they need to be, to the point where a WAL replication setup is impossible on all platforms. But that's a separate topic. For this particular PR, we just need to work through the 3 review comments and a post-merge MVCC testing plan to flush out any bugs.
Sorry, I don't actually see any in-line reviews/comments. Can you reiterate them?
I think an earlier version of your benchmarking workflow added comments to this PR with output that looked like what you pointed to. But once you force-pushed, they disappeared, and more recent runs are failing because they don't have permission to add comments. So I looked into the runs manually and extracted the relevant data.

Benchmarking on github runners is a bit tricky because of the amount of noise and all the problems github is going through. So I ran some benchmarks like this:

baseline.txt
branch.txt:

I can't detect any benefit on single-threaded writes. There are no regressions either, which is good news. Like your link says:
The cost of enabling concurrent writes is that your throughput drops (in that particular benchmark run, by 2x). The main benefit is that multiple writers are able to write at the same time and reads are not blocked. If you want optimal throughput, you're better off using a mutex in the application.
auto checkpointer = initCheckpointerFunc(clientContext);
try {
    checkpointer->writeCheckpoint();
    checkpointer->beginCheckpoint(lastTimestamp);
lastTimestamp is written under a mutex, but read without a mutex (since this is checkpointNoLock()). This is UB under C++ memory rules. How is safety guaranteed here?
Good catch. Fixed in 71cb2a7. lastTimestamp is a plain common::transaction_t (non-atomic). Every writer increments it inside mtxForSerializingPublicFunctionCalls (see commit()), but the old code read it bare in checkpointNoLock(). Even though the acquire/release pair on activeWriteTransactionCount gives ordering for the count itself, concurrent access to a non-atomic variable is still UB under the C++ memory model.
We now snapshot it under a brief lock before calling beginCheckpoint():
transaction_t snapshotTimestamp;
{
std::unique_lock lck{mtxForSerializingPublicFunctionCalls};
snapshotTimestamp = lastTimestamp;
}
checkpointer->beginCheckpoint(snapshotTimestamp);
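For context, an alternative that avoids even the brief lock would be to make the counter itself atomic, so the bare read in the no-lock checkpoint path is well-defined. This is a hypothetical sketch, not what the PR does:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical alternative sketch: store lastTimestamp in a std::atomic.
// Writers remain serialized by the public-function mutex; the atomic only
// makes the unguarded read in the checkpoint path defined behavior.
using transaction_t = uint64_t;

struct TimestampSketch {
    std::atomic<transaction_t> lastTimestamp{0};

    // Writer side: release store so a reader that observes the new value
    // also observes the writes that preceded it.
    void advance() { lastTimestamp.fetch_add(1, std::memory_order_release); }

    // Checkpoint side: a plain atomic load, no lock needed, no UB.
    transaction_t snapshot() const {
        return lastTimestamp.load(std::memory_order_acquire);
    }
};
```

The trade-off is that an atomic only protects the single word; the snapshot-under-lock approach in the actual fix additionally guarantees the value is consistent with everything else guarded by `mtxForSerializingPublicFunctionCalls`.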
src/storage/table/node_group.cpp
        lock, state.columnIDs, columnPtrs);
    return insertChunkedGroup->flush(&DUMMY_CHECKPOINT_TRANSACTION, state.pageAllocator);
        lock, state.columnIDs, columnPtrs, txn);
    return insertChunkedGroup->flush(const_cast<Transaction*>(txn), state.pageAllocator);
Is this const_cast<> safe?
No, it was not safe — and both casts are now removed (71cb2a7, compile fix in 4d02d95).
The root cause: InMemChunkedNodeGroup::flush() and ChunkedNodeGroup::scanCommitted() took non-const Transaction* even though neither function mutates through the pointer. NodeGroupCheckpointState::transaction is const Transaction*, so both call sites needed a cast to paper over the mismatch.
Fix: propagated const Transaction* through the full class hierarchy:
- ChunkedNodeGroup::flush / scanCommitted base class — signatures changed to const Transaction*
- InMemChunkedCSRNodeGroup::flush override — updated to match
- Both call sites in node_group.cpp now pass txn / transaction directly — no cast
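A minimal sketch of the const-propagation pattern (simplified, hypothetical class shapes rather than the real hierarchy):

```cpp
#include <cstdint>

// Sketch of the const-propagation fix: when the flush path takes
// const Transaction* end to end, the call sites no longer need a
// const_cast, which would be undefined behavior if the object were
// actually const and the callee wrote through the pointer.
// Names are simplified stand-ins for the classes discussed above.
struct Transaction {
    uint64_t id{0};
};

struct ChunkedNodeGroupSketch {
    virtual ~ChunkedNodeGroupSketch() = default;
    // Base signature takes const Transaction*: flush only reads through
    // the pointer, so the parameter type should say so.
    virtual uint64_t flush(const Transaction* txn) { return txn->id; }
};

struct InMemChunkedCSRNodeGroupSketch : ChunkedNodeGroupSketch {
    // Override matches the const-correct base signature exactly; a call
    // site holding a const Transaction* can pass it directly, no cast.
    uint64_t flush(const Transaction* txn) override { return txn->id + 1; }
};
```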
-STATEMENT [conn2] CHECKPOINT;
---- error
Timeout waiting for active transactions to leave the system before checkpointing. If you have an open transaction, please close it and try again.
---- ok
Assuming the WAL-related code handles MVCC checkpoints, we should still review secondary data structures:
- Hash index: index.checkpoint(context, pageAllocator) in NodeTable::checkpoint is still called without a snapshotTxn — it uses whatever live index state is present during materialization.
- Stats: tableEntry->vacuumColumnIDs(0) modifies the catalog entry's column ID set while a read transaction may be using those columns.
- serializeCatalogAndMetadata: the catalog snapshot path (serializeCatalogSnapshot) is gated on snapshotTS > 0, which is only set in the new beginCheckpoint(lastTimestamp) path. But the WAL rotation path calls beginCheckpoint and then releases the write gate early — new catalog mutations (DDL) could race with serialization in finishCheckpoint.
vacuumColumnIDs race — Fixed (71cb2a7). tableEntry->vacuumColumnIDs(0) is now called under a std::unique_lock on schemaMtx. Concurrent readers that iterate the column set also acquire schemaMtx, so they will block until the mutation completes:
DDL race in finishCheckpoint — Safe by timestamp exclusion. snapshotTS is captured under the write gate in beginCheckpoint() before the gate is released. finishCheckpoint() calls serializeCatalogSnapshot() (selected when snapshotTS > 0), which calls catalog.serializeSnapshot(serializer, snapshotTS) — only entries with commit timestamp <= snapshotTS are included. Any DDL that commits after the gate releases gets a strictly greater timestamp and is invisible to the serialized snapshot. We added a comment in finishCheckpoint() making this invariant explicit:
Hash index without snapshotTxn — Acknowledged as a pre-existing limitation. HashIndexLocalStorage has no per-entry timestamps, so there is no mechanism to bound the index to snapshotTS without a deeper refactor of the hash-index infrastructure. I added a TODO comment at the call site in NodeTable::checkpoint() and it is tracked as a follow-up; it is not introduced by this PR:
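The timestamp-exclusion invariant from the DDL-race answer above can be illustrated with a small sketch (hypothetical entry and function shapes, not the actual CatalogSet code):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch of timestamp-based snapshot exclusion, with hypothetical names:
// only entries whose commit timestamp is <= snapshotTS are serialized.
// DDL that commits after the write gate releases receives a strictly
// greater timestamp and is therefore invisible to the snapshot, even
// though serialization itself happens later, in finishCheckpoint.
struct CatalogEntrySketch {
    uint64_t commitTS;
    std::string name;
};

std::vector<std::string> serializeSnapshotSketch(
    const std::vector<CatalogEntrySketch>& entries, uint64_t snapshotTS) {
    std::vector<std::string> visible;
    for (const auto& e : entries) {
        if (e.commitTS <= snapshotTS) { // post-gate DDL is excluded here
            visible.push_back(e.name);
        }
    }
    return visible;
}
```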
…amp race, const_cast, vacuumColumnIDs, DDL-race comment)
Give me some time to address your comments
- Wrap the pre-reload CHECKPOINT result in an inner scope to prevent a use-after-free when createDBAndConn() destroys the database
- Explicitly reset conn2/r2 before reloading the DB in the drain test
- Update the CI workflow with correct test filter names
…ationale

The write gate (mtxForStartingNewTransactions) is correctly released after WAL rotation. Holding it through the full checkpointStoragePhase would cause a deadlock in tests that restart the database mid-checkpoint (e.g. ShadowFileDatabaseIDMismatch*), since Database::~Database() calls transactionManager->checkpoint(), which would block indefinitely.

The hash-index/node-data snapshot divergence is a pre-existing limitation of HashIndexLocalStorage having no per-entry timestamps. The correct fix requires adding timestamp-aware snapshotting to HashIndexLocalStorage and is tracked as a follow-up; document it in transaction_manager.cpp.

Also:
- Add a rationale comment to postCheckpointCleanup explaining why the absence of try/catch is intentional (data is already durable at finishCheckpoint)
- Rename the HashIndexConsistentAfterCheckpoint test to HashIndexBasicRecoveryAfterCheckpoint to reflect what it actually tests (baseline recovery, not the ghost-key invariant)
Add git submodule update --init --depth 1 dataset to the test job so the full e2e_test_transaction suite can run (copy/recovery tests need CSV files from dataset/tinysnb/). Locally verified: 515/530 pass, 0 fail (the 15 skipped are LDBC large-dataset tests disabled in CMake). Also bump the Transaction tests timeout from 120s to 300s to accommodate the concurrent transaction suite on slower CI runners.
- compare_benchmarks.py already appends to $GITHUB_STEP_SUMMARY; remove the manual 'cat comparison.md >> $GITHUB_STEP_SUMMARY' step and the second bare compare call (which was adding it a third time). Use continue-on-error + an outcome check instead.
- Both matrix legs (ubuntu + macos) wrote test tables to the run summary; redirect $GITHUB_STEP_SUMMARY to /dev/null for non-ubuntu so only one table appears.
Summary
This PR ports Vela-Engineering/kuzu's concurrent multi-writer checkpoint implementation
into ladybug. The implementation was developed by Aikins Laryea (Vela Partners) across
4 phases plus a read-only lock fix; the full diff is 2,360 lines across 45 files.
Before: checkpoint drains all transactions (reads + writes) and blocks new writes
until the entire checkpoint completes.
After: checkpoint only drains write transactions, releases the write gate as soon
as the WAL is rotated, and reads continue concurrently via MVCC snapshot isolation.
Shadow pages are applied with per-page CAS locking so optimistic readers detect updates.
What changed
Core mechanism
- `.wal` is atomically renamed to `.wal.checkpoint` (frozen WAL). New write transactions immediately create a fresh active WAL, isolated from the checkpoint.
- Materialization runs concurrently with new writers via per-node-group locking.
- `Catalog::serializeSnapshot(snapshotTS)` and `CatalogSet::serializeSnapshot(snapshotTxn)` use a pinned snapshot timestamp for consistent serialization after the gate is released.
- `Table::hasChanges` (bool) replaced by `changeEpoch` (atomic uint64). `captureChangeEpochs()` snapshots per-table watermarks under the write gate; `Table::checkpoint()` uses the watermarks to skip tables that haven't changed since the last gate.
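The WAL-rotation step can be sketched with plain std::filesystem operations. Paths and the helper name here are hypothetical, and the real implementation coordinates with the transaction manager rather than touching files directly:

```cpp
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Sketch of WAL rotation, with hypothetical file names: the active WAL is
// atomically renamed to a frozen ".wal.checkpoint" file, after which a
// fresh, empty active WAL is created for new write transactions while the
// checkpoint consumes the frozen one.
bool rotateWalSketch(const fs::path& dbDir) {
    const fs::path active = dbDir / "test.wal";
    const fs::path frozen = dbDir / "test.wal.checkpoint";
    if (!fs::exists(active)) {
        return false; // nothing to rotate
    }
    // rename is atomic on POSIX filesystems: replayers see either the old
    // name or the new name, never a half-moved file.
    fs::rename(active, frozen);
    // New writers get a fresh WAL immediately; the checkpoint works off
    // the frozen file without blocking them.
    std::ofstream{active}; // create an empty active WAL
    return true;
}
```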
Thread safety
- `StorageManager::mtx` upgraded from exclusive `std::mutex` to `mutable std::shared_mutex` (reads take a shared lock, DDL takes a unique lock).
- `NodeTable::schemaMtx` protects the `columns` vector during concurrent checkpoint vacuum.
- `PageManager::version` and `Catalog::version` use `std::atomic<uint64_t>` with a stable `lastCheckpointVersion` baseline.
- `ShadowFile::applyShadowPages` now calls `BufferManager::updateFrameIfPageIsInFrame` (CAS-loop locked variant) instead of the lock-free variant.
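The CAS-loop frame update can be sketched as a seqlock-style protocol. This is a simplified model with hypothetical names, not the actual BufferManager state machine:

```cpp
#include <atomic>
#include <cstdint>

// Sketch of a CAS-locked frame update visible to optimistic readers,
// assuming a seqlock-style version word per frame (hypothetical layout).
// An odd version means "update in progress"; optimistic readers re-check
// the version after reading and retry if it changed or was odd.
struct FrameSketch {
    std::atomic<uint64_t> version{0};
    uint64_t payload{0};

    // Writer: CAS the version from even to odd to lock, write the frame,
    // then bump to the next even value to publish the update.
    void applyShadowPage(uint64_t newPayload) {
        uint64_t v = version.load(std::memory_order_relaxed);
        for (;;) {
            if (v % 2 == 0 &&
                version.compare_exchange_weak(v, v + 1, std::memory_order_acquire)) {
                break; // acquired: version is now odd
            }
        }
        payload = newPayload;
        version.store(v + 2, std::memory_order_release); // publish, even again
    }

    // Optimistic reader: succeeds only if the version was even and
    // unchanged across the read; otherwise the caller falls back / retries.
    bool tryRead(uint64_t& out) const {
        uint64_t v1 = version.load(std::memory_order_acquire);
        if (v1 % 2 != 0) {
            return false; // update in progress
        }
        out = payload;
        uint64_t v2 = version.load(std::memory_order_acquire);
        return v1 == v2;
    }
};
```

The key property is that a reader can never observe a torn frame as valid: any concurrent `applyShadowPage` either leaves the version odd or changes it, and both cases fail the reader's validation.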
WAL recovery
- `WALReplayer::replay` now handles both frozen (`.wal.checkpoint`) and active (`.wal`) WAL files independently via `replayFrozenWAL` / `replayActiveWAL`.

Read-only databases
- The file lock is skipped when `readOnly=true`, enabling concurrent read-only opens.

Apple clang compatibility
- Explicit constructors for `SegmentCheckpointState` and `LazySegmentData`.

Testing
- `test/transaction/checkpoint_test.cpp` updated to use the new `logCheckpointAndApplyShadowPages(bool walRotated)` signature and WAL-rotation-aware flaky checkpointer classes; `ShadowFileDatabaseIDMismatchExistingDB` now validates the frozen WAL (`.wal.checkpoint`) rather than the active WAL.
- `embed-graph-db/.github/workflows/ladybug-concurrent-writes-ci.yml` covers build (ubuntu + macos), the full ctest suite, the checkpoint/concurrent-write test filter, and benchmark comparison (blocks on >5% single-writer regression or zero read-during-checkpoint throughput).
Check out the latest benchmark run
Attribution
Concurrent checkpoint implementation: Aikins Laryea, Vela Partners
(Vela-Engineering/kuzu, commits 3ad9071..d298e9c)
Ported and adapted for ladybug (`lbug` namespace, `mainStorageManager` field, 2-arg `applyShadowPages` API, `std::format`, graph catalog support) by Logan Powell.