Conversation

@LucaMarconato (Member) commented Feb 6, 2026

  • Big performance boost: avoid calling get_element_instances() when the info is present in the metadata.
  • Fix a spuriously failing assert caused by a mismatched index dtype (range index vs string) in some Xenium versions, plus some minor code cleanup.
  • Remove an old deprecation warning (and finally change the default argument as anticipated!)

@marcovarrone I noticed that even after #337 the memory usage was not great. This was due to remaining calls to compute() (invoked twice) via get_element_instances(). For some Xenium versions (prototype versions released right after 2.0.0) it may be more difficult to avoid such calls, but for newer Xenium versions there is a massive performance boost. I will run systematic benchmarks soon, but I anticipate a ~95% decrease in peak memory and a modest time improvement as well.
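Peak-memory improvements like the one claimed above can be sanity-checked without a full benchmark harness; here is a minimal sketch using only the standard library (note the caveat that tracemalloc only sees Python-level allocations, so buffers held by native dask/zarr code are not counted):

```python
import tracemalloc


def peak_memory_mb(fn, *args, **kwargs):
    """Run fn and return the peak Python-allocated memory in MB.

    Caveat: tracemalloc tracks only Python allocations; memory held by
    native libraries (e.g. compiled dask/zarr buffers) is not counted,
    so a process-level tool is needed for the real benchmark.
    """
    tracemalloc.start()
    try:
        fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / 1e6


# Example: materializing a million ints peaks well above the baseline.
print(f"{peak_memory_mb(lambda: list(range(1_000_000))):.1f} MB")
```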

codecov-commenter commented Feb 6, 2026

Codecov Report

❌ Patch coverage is 59.09091% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.85%. Comparing base (28eacc9) to head (7143d8e).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/spatialdata_io/readers/xenium.py | 59.09% | 9 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #369      +/-   ##
==========================================
- Coverage   63.08%   62.85%   -0.23%     
==========================================
  Files          27       27              
  Lines        3153     3158       +5     
==========================================
- Hits         1989     1985       -4     
- Misses       1164     1173       +9     
| Files with missing lines | Coverage Δ |
|---|---|
| src/spatialdata_io/readers/xenium.py | 71.10% <59.09%> (-2.18%) ⬇️ |

@LucaMarconato LucaMarconato marked this pull request as draft February 7, 2026 11:14
@LucaMarconato LucaMarconato changed the title Xenium fix assert shapes id Xenium avoid calling get_element_instances(); fix deprecation warning Feb 7, 2026
@marcovarrone (Contributor)

Good catch @LucaMarconato!
Since get_element_instances() runs compute() only to find the unique elements, I assume it does not actually load the full image into RAM. The peak memory should then be the memory occupied by the chunks being processed in parallel at any given time, right?

What do you think about adding a num_workers parameter that could wrap the get_element_instances() call in with dask.config.set(num_workers=num_workers):? That would allow users to reduce peak memory at the cost of running time.
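The suggestion above can be sketched roughly as follows. This is illustrative only: count_unique_labels and the num_workers plumbing are hypothetical and not part of the spatialdata-io API; dask.config.set(num_workers=...) caps the worker-pool size of the default local scheduler, so fewer chunks are resident at once:

```python
import dask
import dask.array as da
import numpy as np


def count_unique_labels(labels, num_workers=None):
    # Hypothetical wrapper: limit the number of dask workers so fewer
    # chunks are processed (and held in memory) concurrently, trading
    # running time for a lower memory peak.
    with dask.config.set(num_workers=num_workers):
        return da.unique(labels).compute()


# Usage: a small chunked label image.
labels = da.from_array(np.array([[1, 1, 2], [2, 3, 3]]), chunks=(1, 3))
print(count_unique_labels(labels, num_workers=2))  # prints [1 2 3]
```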

@LucaMarconato (Member, Author)

Yes, I would expect behavior like what you describe. I did some experiments and the memory indeed decreases, but there is still quite a lot of peak usage, probably because of some rechunking operations performed beforehand.

I could optimize this further and expose the parameters, but since get_element_instances() is now triggered only for some rare Xenium versions (before 2.0.0 the unique labels are found in a zarr file, from 2.0.0 in a parquet file; the info is missing in some prototype 2.0.0 Xenium files, like the xenium_2.0.0_io dataset from spatialdata-sandbox), I prefer not to optimize this further in this PR.

If for some reason this data turns out to be less rare than anticipated, or if the fallback code is needed for other reasons, I'd be open to having the behavior you described implemented.
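The version-dependent lookup described above can be sketched roughly as follows; the function name and the exact cutoff handling are hypothetical, and only the zarr-before-2.0.0 / parquet-from-2.0.0 split comes from the comment:

```python
def unique_labels_source(version: str) -> str:
    """Where the unique cell labels live for a given Xenium analyzer version.

    Hypothetical helper: < 2.0.0 stores them in a zarr file, >= 2.0.0 in a
    parquet file. Some prototype 2.0.0 files lack the info entirely, in
    which case the reader must fall back to get_element_instances(),
    which triggers an eager dask compute().
    """
    parts = tuple(int(p) for p in version.split("."))
    parts += (0,) * (3 - len(parts))  # pad e.g. "2.0" -> (2, 0, 0)
    return "zarr" if parts < (2, 0, 0) else "parquet"


print(unique_labels_source("1.9.1"))  # zarr
print(unique_labels_source("2.1.0"))  # parquet
```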

@LucaMarconato (Member, Author) commented Feb 9, 2026

Unrelated failure: scverse/spatialdata#1065. It will be fixed later today with a pre-release.

@LucaMarconato (Member, Author)

Turns out the offending compute was somewhere else! Here is the culprit: https://github.com/spatial-image/multiscale-spatial-image/blob/6217a331d3e0897dcb1d857086f68a6ab2a50f72/multiscale_spatial_image/to_multiscale/_dask_image.py#L196

The latest multiscale-spatial-image seems to have changed that, but it now depends on ngff-zarr (https://github.com/fideus-labs/ngff-zarr/blob/main/py/ngff_zarr/methods/_dask_image.py#L108). I will think about what's best to do: use a different downsampling strategy, such as https://github.com/spatial-image/multiscale-spatial-image/blob/main/multiscale_spatial_image/to_multiscale/_xarray.py#L4 (and then change the way we read the data back, by loading each scale directly from disk), or start using ngff-zarr as a backend.
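For reference, the xarray-based alternative linked above downsamples by coarsening, i.e. averaging non-overlapping blocks. A minimal numpy sketch of that strategy (function name illustrative, 2D single-scale only; when applied chunkwise, e.g. via xarray's coarsen().mean() on dask-backed data, it needs no global compute() of the full-resolution image):

```python
import numpy as np


def block_mean_downsample(image: np.ndarray, factor: int = 2) -> np.ndarray:
    # Coarsen-style downsampling: average each factor x factor block.
    h, w = image.shape
    h2, w2 = h - h % factor, w - w % factor  # trim ragged edges
    trimmed = image[:h2, :w2]
    # Reshape to (row_block, row_in_block, col_block, col_in_block) and
    # average over the in-block axes.
    blocks = trimmed.reshape(h2 // factor, factor, w2 // factor, factor)
    return blocks.mean(axis=(1, 3))


img = np.arange(16, dtype=float).reshape(4, 4)
print(block_mean_downsample(img))  # 2x2 array of block means
```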

@LucaMarconato (Member, Author)

Considerations about the ngff-zarr dependency: fideus-labs/ngff-zarr#340

@LucaMarconato (Member, Author)

This PR scverse/spatialdata#1068 fixes the above and leads to a massive performance improvement.
