Codecov Report

@@ Coverage Diff @@
## main #3077 +/- ##
==========================================
+ Coverage 58.43% 58.47% +0.03%
==========================================
Files 2088 2091 +3
Lines 172084 172275 +191
==========================================
+ Hits 100559 100733 +174
+ Misses 62599 62596 -3
- Partials 8926 8946 +20
|
func NewStandardCache(
	ctx context.Context,
	// The number of shards in the cache. Must be a power of two and greater than 0.
	shardCount uint64,
Is this a fixed size? How does the user choose shardCount?
In the PR that integrates this with FlatKV, I introduce configuration for FlatKV that allows for each of the 5 DBs to be individually configured, meaning that we'll be able to tune the configuration of each DB's cache for that DB's workload. That PR will follow this one.
var wg sync.WaitGroup
for shardIndex, shardEntries := range shardMap {
	wg.Add(1)
	err := c.miscPool.Submit(c.ctx, func() {
The closure passed to miscPool.Submit calls wg.Done() directly after c.shards[shardIndex].BatchSet(shardEntries). If the shard's BatchSet panics, wg.Done() is never called, and wg.Wait() hangs forever, deadlocking the calling goroutine?
My general ethos is to treat panic as an unrecoverable error, and to not worry too much about things like deadlocked goroutines and to just let the entire process tear itself down, since this sort of blocked goroutine doesn't stop the process from terminating. Sometimes it makes sense to catch a panic in order to do final logging/cleanup, but I think attempting to recover the go process after a panic is almost always a yucky pattern.
https://gobyexample.com/range-over-iterators
That being said, deferring release of the waitgroup isn't expensive, so no harm doing it. Change made.
for shardIndex, subMap := range work {
	wg.Add(1)

	err := c.miscPool.Submit(c.ctx, func() {
Just to be safe, each closure needs to capture the variables explicitly via local copies. Prior to Go 1.22, range variables are reused across iterations, so all closures could end up operating on the same (final) shard, which could be a bug.
Same for batchSet
I've got painful memories of that footgun. 😅
The current codebase cannot compile with any Go version before 1.22. This code uses the new iterator pattern, which was added as an experimental feature in 1.22 (requiring a special experimental flag at compile time) and became a standard feature in 1.23. This means it should be impossible to compile this code in a way that uses the old variable capture semantics.
With this in mind, would you still like me to make the change? Willing to make the change if you feel strongly on the topic.
Sounds good, if we will hit compile errors, probably OK to just keep it today
	wg.Done()
})
if err != nil {
	return fmt.Errorf("failed to submit batch set: %w", err)
How do we handle partial failures here, if some shards succeed, some shards fail, what should the user do?
Any failure from this method is unlikely to be recoverable. If paired with a DB, failure to update the cache can result in "read your writes" consistency breaking down.
From the documentation on the cache interface, the conditions under which the cache is allowed to return an error is limited:
// Although several methods on this interface return errors, the conditions when a cache
// is permitted to actually return an error is limited at the API level. A cache method
// may return an error under the following conditions:
// - malformed input (e.g. a nil key)
// - the Reader method returns an error (for methods that accept a Reader)
// - the cache is shutting down
// - the cache's work pools are shutting down
The actual place that might fail is submission to the work pool: err := c.miscPool.Submit(c.ctx, ...). Pools have similar semantics, and can only return errors under very specific circumstances. This wasn't documented, so I added the following to the Pool.Submit() godoc:
// ...
//
// This method is permitted to return an error only under the following conditions:
// - the pool is shutting down (i.e. its context is done/cancelled)
// - the provided ctx parameter is done/cancelled before this method returns
// - invalid input (e.g. the task is nil)
//
// If this method returns an error, the task may or may not have been executed.
Submit(ctx context.Context, task func()) error
I also added the following to the Cache documentation:
// ...
//
// Cache errors are generally not recoverable, and it should be assumed that a cache that has returned an error
// is in a corrupted state, and should be discarded.
type Cache interface {
sei-db/db_engine/dbcache/shard.go
Outdated
} else if result.value == nil {
	se.status = statusDeleted
	se.value = nil
	se.shard.gcQueue.Push(key, uint64(len(key)))
When a key is looked up but not found, injectValue marks it statusDeleted and pushes into the gcQueue. These tombstone entries consume cache capacity and can evict real data. Under a workload with many lookups of non-existent keys, tombstones could dominate the cache?
This was an intentional design decision, but I agree this is a topic that requires scrutiny.
Reading a non-existent value is potentially a high-latency operation, and the goal of caching a value's non-existence is to avoid paying this high latency cost over and over if somebody keeps trying to read the same non-existent value.
With respect to protecting the cache against attack, I don't think keeping non-existent values makes the problem worse. Presumably we charge the same amount of gas for any read, regardless of whether or not the key is in the DB. If an attacker reads many non-existent values, they will fill up less of the cache than if they read the same number of random but existing values.
As a middle ground, what would you think about me adding a configuration option that allows us to toggle the cache behavior of missing values? In the future if this ever became a problem, we could disable missing value storage.
FYI, I haven't implemented my proposed change yet, waiting to hear back before I implement.
We discussed this on a call, and decided together that the current behavior of caching non-existent values was ok for now.
sei-db/db_engine/dbcache/shard.go
Outdated
} else {
	se.status = statusAvailable
	se.value = result.value
	se.shard.gcQueue.Push(key, uint64(len(key)+len(result.value))) //nolint:gosec // G115: len is non-negative
The gcQueue tracks len(key) + len(value) bytes, but actual memory consumption per entry includes: Go map bucket overhead (~100–200 bytes per entry), the shardEntry struct itself, and any channel buffers for in-flight reads. Actual memory usage could be 2–5x the tracked size?
Good point.
I've added a configurable field that allows you to specify an estimated overhead, in bytes, for each cache entry. The LLM guesses that a sane default value is 256 bytes, so I'll use that for now. Later, I'll experimentally derive/verify this value.
Assuming 256 bytes is an accurate estimate, in the long run I'll probably revisit the cache implementation to make things more memory efficient. Key-value pairs are small, meaning that the memory footprint of the cache will be dominated by overhead. It should be possible to significantly reduce this overhead with careful restructuring though.
ctx context.Context

// A lock to protect the shard's data.
lock sync.Mutex
Using a sync.RWMutex and taking a read lock for the hot cache-hit path (with a write lock only for LRU updates and mutations) could significantly improve throughput under read-heavy workloads?
We discussed this on our call, summarizing my understanding of that discussion here for the record:
- Benefit of RW mutex is strongly dependent on workload and access pattern
- In production, this cache layer is mediated by a cache layer that lives in the execution part of the stack, potentially reducing contention on the storage layer for heavy read workloads.
- Although technically feasible, implementing a RW lock requires nontrivial changes and increased complexity.
Based on these facts, we decided not to change the locking pattern for the time being. I will document the changes needed to upgrade to a RW lock, but will avoid implementing them unless experimental evidence suggests that lock contention here is a major problem.
/*
This implementation currently uses a single exclusive lock, as opposed to a RW lock. This is a lot simpler than
using a RW lock, but it comes at higher risk of contention under certain workloads. If this contention ever
becomes a problem, we might consider switching to a RW lock. Below is a potential implementation strategy
for converting to a RW lock:
- Create a background goroutine that is responsible for garbage collection and updating the LRU.
- The GC goroutine should periodically wake up, grab the lock, and do garbage collection.
- When Get() is called, the calling goroutine should grab a read lock and attempt to read the value.
- If the value is present, send a message to the GC goroutine over a channel (so it can update the LRU)
and return the value. In this way, many readers can read from this shard concurrently.
- If the value is missing, drop the read lock and acquire a write lock. Then, handle the read
  as in the current implementation.
*/
sei-db/db_engine/dbcache/shard.go
Outdated
// Handles Get for a key whose value is already cached. Lock must be held; releases it.
func (s *shard) getAvailable(entry *shardEntry, key []byte, updateLru bool) ([]byte, bool, error) {
	value := bytes.Clone(entry.value)
The Cache interface already documents that callers must not mutate returned slices, and explicitly states the cache is not required to make defensive copies. Removing this clone would eliminate a per-hit allocation and copy?
Good point, removed the clone.
if entries[i].IsDelete() {
	s.deleteUnlocked(entries[i].Key)
} else {
	s.setUnlocked(entries[i].Key, entries[i].Value)
Every call to setUnlocked triggers evictUnlocked(), which loops through the LRU queue until size is within budget. In BatchSet, this means eviction runs N times (once per entry) instead of once at the end. Moving the evictUnlocked() call out of setUnlocked and running it once after the batch loop in BatchSet would be substantially more efficient for large batches?
Good point. Refactored so that BatchSet() only does eviction once at the end.
	return nil
}

func (c *cache) BatchGet(read Reader, keys map[string]types.BatchGetResult) error {
There is a semantic inconsistency between Get and BatchGet for nil-value entries:
- Set(key, nil) stores the entry with statusAvailable and a nil value (TestCacheSetNilValue).
- Get returns (nil, true, nil) — key found, value is nil.
- BatchGet stores BatchGetResult{Value: nil}, and IsFound() returns false — key not found.
Good catch. Calling Set(key, nil) is now equivalent to calling Delete(key).
wg.Add(1)
err := c.miscPool.Submit(c.ctx, func() {
	c.shards[shardIndex].BatchSet(shardEntries)
	wg.Done()
wg.Done() not in defer. If shard.BatchSet panics, wg.Done() is never called and wg.Wait() blocks.
In BatchGet we use defer wg.Done(); this should be consistent.
Is now deferred (once I push current diff).
sei-db/db_engine/dbcache/shard.go
Outdated
	maxSize uint64,
) (*shard, error) {

	if maxSize <= 0 {
nit: maxSize is uint64, so it can't be negative. Use == 0 for clarity?
sei-db/db_engine/dbcache/shard.go
Outdated
const (
	// The value is not known and we are not currently attempting to find it.
	statusUnknown valueStatus = iota
	// We've scheduled a read of the value but haven't yet finished the read.
}

// Delete a value. Caller is required to hold the lock.
func (s *shard) deleteUnlocked(key []byte) {
How about checking in deleteUnlocked whether the key exists before creating an entry? This would avoid having deletes of nonexistent keys pollute the cache.
func (u *CacheUpdate) IsDelete() bool {
	return u.Value == nil
}
How do we manage the cache context? E.g., do we need a Close()/Shutdown() on the cache interface for the collectLoop goroutine?
It's not in this PR (since integration will happen in the next PR), but the context is managed by FlatKV, and is cancelled when FlatKV is shut down.
// Close closes all database instances, cancels the store's context to
// stop background goroutines (pools, caches, metrics), and releases the
// file lock.
func (s *CommitStore) Close() error {
err := s.closeDBsOnly()
s.cancel()
// ...
The cancel call above cancels the ctx used by the worker pools and the DB caches, freeing all threading resources in these utilities.
Describe your changes and provide context
This PR contains the standard cache implementation. It intentionally does not integrate this cache with FlatKV, that will be done in the following PR.
Testing performed to validate your change
Tested locally; has been running in a benchmark for ~6 days so far.