rs/boundary_node/ic_boundary/src/http/middleware/subnet_read_state_cache.rs
```rust
#[derive(Clone)]
struct CachedResponse {
```
We're missing response headers here completely.
Not sure it's a problem, but I'd just drop CachedResponse and use normal Response<Bytes> as a cache value. You will still have to split it into parts after fetching from the cache and reassemble as Response<AxumBody> but it's unavoidable probably.
Very good, body mapping is even more concise than .into_parts()/from_parts() 👍
And we avoid panic-prone .expect() too, which would probably never happen, but still.
```rust
if response.status().is_success() {
    let (parts, body) = response.into_parts();
    let body_bytes = axum::body::to_bytes(body, 10 * 1024 * 1024)
```
Probably better to use buffer_body from ic-bn-lib and a configurable limit from the CLI instead of a hardcoded one. Maybe cache_max_item_size, which is applied to normal cached requests. Or add a separate option to your new CLI section (also add a timeout, needed for buffer_body).
Thanks, Igor! Used a 1 MB default for subnet_read_state_cache_max_item_size instead of cache_max_item_size because 10 MB seemed too high, but I'm happy to reuse cache_max_item_size for simplicity.
👍 I guess we also (separately) need to improve the caching CLI opts (cache_...) to add a timeout there too. Currently it falls back to a default of 1 minute and isn't configurable. But that's out of scope here.
```rust
pub subnet_read_state_cache_ttl: Duration,

/// Maximum number of cached subnet read_state entries
#[clap(env, long, default_value = "1000")]
```
I think it would be nicer to use bytes here, not number of entries. You can look at how that's done in generic http cache https://github.com/dfinity/ic-bn-lib/blob/main/ic-bn-lib/src/http/cache.rs using weigh_entry
```rust
/// TTL for cached subnet read_state responses.
/// Set to 0 to disable caching.
#[clap(env, long, default_value = "30s", value_parser = parse_duration)]
pub subnet_read_state_cache_ttl: Duration,
```
If we want this thing to be enabled by default, I'd add a separate bool flag e.g. subnet_read_state_cache_disable and rely on that to enable/disable it instead of zero TTL. This would be more explicit.
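A sketch of what that could look like, assuming the same clap conventions as the surrounding options (the flag name comes from the suggestion above; `parse_duration` is taken from the existing TTL option):

```rust
/// Disable the subnet read_state cache entirely.
/// A separate flag makes "off" explicit instead of
/// overloading a zero TTL.
#[clap(env, long)]
pub subnet_read_state_cache_disable: bool,

/// TTL for cached subnet read_state responses
#[clap(env, long, default_value = "30s", value_parser = parse_duration)]
pub subnet_read_state_cache_ttl: Duration,
```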
```rust
    }
}

fn build_cache_key(subnet_id: SubnetId, ctx: &RequestContext) -> Option<CacheKey> {
```
There's some path, @r-birkner told me, that leads to the metrics endpoint of the subnets. Probably it's worth figuring it out and bypassing the cache if it's metrics (so that we don't serve stale metrics). Not sure if that's critical, but probably worth doing.
The requests that we should cache are the following: https://github.com/dfinity/agent-rs/blob/fa746467213003de2d8316240e55a1cb3a3281ad/ic-agent/src/agent/mod.rs#L1327-L1328
Ah yes, I forgot the ticket context, that it should be an opt-in set of paths that we cache.
```rust
    .time_to_live(ttl)
    .build();

let hits = register_int_counter_with_registry!(
```
Once you change the limit from number of entries to memory - I'd also add a gauge that shows how much memory it consumes. And the number of entries too.
I think they're stored as atomics in Moka and should be cheap to read on each call.
https://docs.rs/moka/latest/moka/sync/struct.Cache.html#method.entry_count
https://docs.rs/moka/latest/moka/sync/struct.Cache.html#method.weighted_size
```rust
let (parts, body) = response.into_parts();
let body_bytes = buffer_body(body, state.max_item_size, state.body_timeout)
    .await
    .map_err(|e| ErrorCause::Other(format!("failed to buffer response body: {e}")))?;
```
nit: there are special ErrorCause::UnableToReadBody/ErrorCause::BodyTimedOut and ErrorCause::PayloadTooLarge for such cases.
Ideally we should map the library error causes to our local ones, like we do in ic/rs/boundary_node/ic_boundary/src/http/middleware/process.rs (lines 71 to 78 in d17cf0b).
Better yet, create a helper function that wraps buffer_body and returns the correct error cause, to avoid duplication.
P.S.
Damn, here we also have a hardcoded timeout, need to improve :)
```rust
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct CacheKey {
    subnet_id: SubnetId,
    paths: ReadStatePaths,
```
nit: how big can the paths be? I wonder if we should avoid storing them directly in the cache and instead calculate some hash over them before passing them to moka. We do something like this here:
https://github.com/dfinity/ic-bn-lib/blob/b58540c74c224266be8fce3f7ebc41840dbe16c4/ic-bn-lib/src/http/cache.rs#L696-L718
Not sure it will save us a lot, though. Especially since we will scope the paths that we cache to only a small subset.
```rust
pub hits: IntCounter,
pub misses: IntCounter,
```
nit: do these need to be public?
NODE-1863
This change adds an in-memory cache in the boundary node for subnet read_state responses, keyed by (subnet_id, request paths). When the same subnet/path request repeats within the TTL (default 30s), the BN serves the cached response directly instead of forwarding to the replica.