Performance optimisations and pandas 2.x compatibility fixes #135
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
A collection of performance optimisations, bug fixes, and pandas 2.x compatibility improvements for TopoPyScale. Tested on a full 2000-cluster Central Asia domain (33-44°N, 60-80°E) with 500m DEM resolution.
Changes
Bug Fixes
8a3a535): Pandas 2.xitertuples()creates dynamically-named namedtuple classes that cannot be pickled for multiprocessing. Converted toSimpleNamespacewhich provides the same attribute access while being fully picklable.e2b18bf): Prevents KMeans crash (Input X contains NaN) when DEM contains nodata pixels (e.g. from reprojected rasters). Nodata pixels are now automatically excluded from clustering, with or without a user-provided mask file.Performance Optimisations
41588dd): Extract worker function and add optionaln_workersparameter tosearch_number_of_clusters(). Default is sequential (backward compatible). Expected 4-8x speedup when enabled.03782b1): CreateTransformer.from_crs()once and pass via meta dict instead of creating per-point (2000 repeated calls). Includes fallback for backward compatibility.30c3bb3): Enableparallel=Trueinopen_mfdatasetcalls for concurrent NetCDF file opening. Safe for read operations.860e2b3): Replaceiterrows()withitertuples()intopo_scale.pyandtopo_param.py. Eliminates redundant loops. 10-15% faster for DataFrame iteration.2134b5b): Cachemonthly_coeffs.coef.sel(...)andelev_diffto avoid duplicate xarray operations. Minor (~2%) but runs 2000x per domain.Reverted
c966d2b): Reverted. The numpy fast-path targeted 1D geopotential height arrays, but ERA5 geopotential is always time-varying (2D), so the optimisation never triggered. Restored original xarray.where().argmax/argminapproach.Minor Fixes (fetch_era5.py)
isel(slice(None,None,-1))withsortby('level')for robustnesseraDirinstr()to supportPathobjectsBreaking Change Analysis
horizon_daDataArray, not from rowsrow.Indexmaps correctly. No pandas Series methods called on row objectsn_workers=None= sequential = identical to current behaviorTest Results
Full pipeline on 2000-cluster domain completed successfully:
🤖 Generated with Claude Code