
Make ObjectStore backend generic over any obspec store#3698

Open
kylebarron wants to merge 2 commits intozarr-developers:mainfrom
kylebarron:kyle/zarr-generic-over-obspec

@kylebarron kylebarron commented Feb 9, 2026

In #1661 we merged an Obstore-based backend. It sounds like the Obstore backend has become very popular, and @maxrjones found that, at least in some situations, the Obstore backend can be significantly faster.

One problem with the Obstore backend, however, is that it's strictly tied to Obstore. The type hinting and runtime behavior all require exact instances from the Obstore package.

But people often want to insert middleware between the caller and the store, for example:

  • Caching: disk cache, memory cache of different flavors
  • Timing: how long are different requests taking
  • Request inspection, e.g. for debugging, or for evaluating async performance
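
As a rough illustration of the timing case above, here is a minimal sketch of a wrapper that records how long each range request takes. It assumes only the `get_range` signature shown later in this description; `GetRangeLike`, `InMemoryStore`, and `TimingWrapper` are hypothetical names for this example and are not part of obspec or obstore.

```python
from __future__ import annotations

import time
from typing import Protocol


class GetRangeLike(Protocol):
    """Local stand-in (assumption) mirroring the get_range signature used in this PR."""

    def get_range(
        self,
        path: str,
        *,
        start: int,
        end: int | None = None,
        length: int | None = None,
    ) -> bytes: ...


class InMemoryStore:
    """Hypothetical toy store backed by a dict, used only for this demo."""

    def __init__(self, objects: dict[str, bytes]) -> None:
        self.objects = objects

    def get_range(self, path, *, start, end=None, length=None) -> bytes:
        data = self.objects[path]
        if end is None:
            # Fall back to `length`, or read to the end of the object.
            end = len(data) if length is None else start + length
        return data[start:end]


class TimingWrapper:
    """Record the duration of each range request, then delegate to the wrapped client."""

    def __init__(self, client: GetRangeLike) -> None:
        self.client = client
        self.timings: list[tuple[str, float]] = []

    def get_range(self, path, *, start, end=None, length=None) -> bytes:
        t0 = time.perf_counter()
        try:
            return self.client.get_range(path, start=start, end=end, length=length)
        finally:
            # Recorded even if the wrapped call raises.
            self.timings.append((path, time.perf_counter() - t0))


store = TimingWrapper(InMemoryStore({"path.txt": b"hello world"}))
chunk = store.get_range("path.txt", start=0, end=5)
```

Because the wrapper exposes the same method as the object it wraps, it can in principle be stacked with other wrappers (for example a cache) in any order.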

The goal of Obspec is to define generic protocols to cleanly enable this. As described in the initial release post from last summer, Obspec should allow downstream libraries to depend on a protocol-based API that works with any implementation that provides the given signature.

For example, consider the simplest pseudocode example of a cache:

from __future__ import annotations
from typing_extensions import Buffer
from obspec import GetRange

class SimpleCache(GetRange):
    """A simple cache for synchronous range requests that never evicts data."""

    def __init__(self, client: GetRange):
        self.client = client
        self.cache: dict[tuple[str, int, int | None, int | None], Buffer] = {}

    def get_range(
        self,
        path: str,
        *,
        start: int,
        end: int | None = None,
        length: int | None = None,
    ) -> Buffer:
        cache_key = (path, start, end, length)
        if cache_key in self.cache:
            return self.cache[cache_key]

        response = self.client.get_range(
            path,
            start=start,
            end=end,
            length=length,
        )
        self.cache[cache_key] = response
        return response

Suppose a function expects an object implementing GetRange:

def my_function(client: GetRange, path: str, *, start: int, end: int):
    buffer = client.get_range(path, start=start, end=end)
    # Do something with the buffer
    print(len(memoryview(buffer)))

You can then pass in either the raw obstore backend or the backend wrapped by the cache:

from obstore.store import S3Store

store = S3Store("bucket")
caching_wrapper = SimpleCache(store)
my_function(caching_wrapper, "path.txt", start=0, end=10)
# second request will be cached by `SimpleCache`
my_function(caching_wrapper, "path.txt", start=0, end=10)

This architecture is much more tractable for end users than if Obstore implemented its own caching natively. Since users have full access to the cache (i.e. it isn't hidden away inside Rust), they can inspect SimpleCache, or add methods to it, to track how much memory the cache is using and to manually evict cache items if they want.
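
To make the point about inspecting and evicting concrete, here is a minimal sketch of a cache that exposes its memory use and an eviction method. `InspectableCache`, `nbytes`, `evict`, and the toy `InMemoryStore` are hypothetical names for this example; none of them is an existing obspec, obstore, or obspec-utils API.

```python
from __future__ import annotations


class InMemoryStore:
    """Hypothetical toy store backed by a dict, used only for this demo."""

    def __init__(self, objects: dict[str, bytes]) -> None:
        self.objects = objects

    def get_range(self, path, *, start, end=None, length=None) -> bytes:
        data = self.objects[path]
        if end is None:
            end = len(data) if length is None else start + length
        return data[start:end]


class InspectableCache:
    """Never-evicting range cache whose state is fully visible from Python."""

    def __init__(self, client) -> None:
        self.client = client
        self.cache: dict[tuple[str, int, int | None, int | None], bytes] = {}

    def get_range(self, path, *, start, end=None, length=None) -> bytes:
        key = (path, start, end, length)
        if key not in self.cache:
            self.cache[key] = self.client.get_range(
                path, start=start, end=end, length=length
            )
        return self.cache[key]

    def nbytes(self) -> int:
        """Total size in bytes of all cached buffers."""
        return sum(memoryview(buf).nbytes for buf in self.cache.values())

    def evict(self, path: str) -> int:
        """Drop every cached range for `path`; return how many entries were removed."""
        keys = [key for key in self.cache if key[0] == path]
        for key in keys:
            del self.cache[key]
        return len(keys)


cache = InspectableCache(InMemoryStore({"a.bin": bytes(100), "b.bin": bytes(50)}))
cache.get_range("a.bin", start=0, end=10)
cache.get_range("b.bin", start=0, length=20)
used_before = cache.nbytes()
evicted = cache.evict("a.bin")
used_after = cache.nbytes()
```

Since the cache is an ordinary Python object, the eviction policy is a user decision rather than something fixed by the store implementation.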

@maxrjones has started collecting utilities around obspec in https://github.com/virtual-zarr/obspec-utils.

This change is backwards-compatible, at least if you ignore the obstore version bump from 0.5.1 (released March 2025) to 0.7.0 (released June 2025).

Implementation notes

  • This relies on structural subtyping, i.e. the shape matters, not the name. This works for all protocols, but exceptions need special support, since exceptions don't support structural subtyping 🥲. As described in Exceptions in the obspec docs, the workaround I chose is to use well-defined names: map_exception converts any external exceptions to exceptions subclassing from obspec.exceptions.
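
The two halves of that note can be sketched as follows. This is only an illustration of the idea, not the code in this PR or in obspec: `GetRangeLike`, `NotFoundError`, `third_party`, and `map_exception_sketch` are hypothetical stand-ins, and the name-based lookup is an assumption about what "well-defined names" could look like in practice.

```python
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class GetRangeLike(Protocol):
    """Local stand-in (assumption) for a protocol with a get_range method."""

    def get_range(
        self, path: str, *, start: int, end: int | None = None, length: int | None = None
    ) -> bytes: ...


class UnrelatedStore:
    """Does not inherit from GetRangeLike, but has a method of the same shape."""

    def get_range(self, path, *, start, end=None, length=None) -> bytes:
        return b""


class NotFoundError(Exception):
    """Hypothetical stand-in for a shared, well-known exception class."""


class third_party:
    """Namespace simulating another package with its own NotFoundError."""

    class NotFoundError(Exception):
        pass


# Well-known exception names mapped to the shared classes (assumed convention).
_KNOWN = {"NotFoundError": NotFoundError}


def map_exception_sketch(exc: Exception) -> Exception:
    """Convert an exception to the shared class whose name matches, if any."""
    if isinstance(exc, tuple(_KNOWN.values())):
        return exc  # already one of the shared classes
    for cls in type(exc).__mro__:
        target = _KNOWN.get(cls.__name__)
        if target is not None:
            return target(*exc.args)
    return exc  # unknown exceptions pass through unchanged


# Protocols match on shape: no inheritance needed.
matches = isinstance(UnrelatedStore(), GetRangeLike)

# Exceptions match on class identity, so a same-named class must be converted.
foreign = third_party.NotFoundError("missing.txt")
caught_directly = isinstance(foreign, NotFoundError)
mapped = map_exception_sketch(foreign)
```

Note that `runtime_checkable` only checks that the method exists, not its signature; static type checkers are what verify the full shape.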

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) Feb 9, 2026
@TomNicholas
Member

+1 Xarray could make use of this. (When reading non-icechunk stores - icechunk already has a caching layer)

@kylebarron
Contributor Author

I'm not sure where else I need to declare obspec as a dependency.
