perf: optimize shape_to_stop_matching validator #2
ryan-mahoney wants to merge 4 commits into master from
Conversation
…travel_speed)
- Optimize shape_to_stop_matching (2.631s) and stop_time_travel_speed (0.906s)
- Improve shape_to_stop_matching_util with more efficient algorithms
- Update dependencies in pyproject.toml for performance gains
- Add tests for optimized utility functions

…ays passing shape_arrays
Remove conditional use_vectorized checks in match_using_geo_distance and match_using_user_distance functions. The shape_arrays parameter is now passed directly to helper functions, letting them handle None cases.
- Introduce TripInfo dataclass to replace dict usage for trip data
- Defer CandidateMatch instantiation in find_potential_matches and find_closest_on_shape by tracking scalar values during iteration
- Simplify routes_by_id to route_type_by_id, storing only the needed field
- Remove Optional types and assertions from hot paths
- Early return when no trips have shape_id to avoid unnecessary processing
- Cache stop_id in a local variable in build_stop_points to avoid repeated dictionary lookups in the hot loop
Pull request overview
This PR focuses on improving validator throughput by reducing Python-side work in hot paths, primarily for shape_to_stop_matching, and also refactors parts of stop_time_travel_speed toward Polars-native computations.
Changes:
- Add numpy-backed vectorized geometry helpers, spatial indexing, and caching for shape_to_stop_matching.
- Refactor trip/route handling to reduce allocations (e.g., TripInfo, slimmer lookups) and add an early exit when no shape_id is present.
- Rework stop_time_travel_speed consecutive-stop checks using Polars expressions and joins.
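The numpy-backed geometry helper mentioned above could look roughly like the sketch below. This is a hedged illustration, not the PR's actual code — the real `_haversine_km_vectorized` signature may differ — but it shows the standard array-at-a-time haversine that replaces per-pair Python calls.

```python
import numpy as np


def haversine_km_vectorized(lat1, lon1, lat2, lon2):
    """Great-circle distance in km for arrays of coordinates in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(a, dtype=np.float64)) for a in (lat1, lon1, lat2, lon2)
    )
    dlat = (lat2 - lat1) / 2.0
    dlon = (lon2 - lon1) / 2.0
    a = np.sin(dlat) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon) ** 2
    # Guard against floating-point drift pushing `a` just past 1.0,
    # which would make arcsin return NaN.
    a = np.clip(a, 0.0, 1.0)
    return 6371.0 * 2.0 * np.arcsin(np.sqrt(a))
```

Operating on whole arrays of consecutive stop pairs keeps the math in compiled numpy code rather than the Python interpreter loop.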
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/gtfs_validator/validators/shape_to_stop_matching_util.py | Introduces numpy vectorization, spatial index, and caching keys for matching helpers. |
| src/gtfs_validator/validators/shape_to_stop_matching.py | Uses TripInfo, groups trips by shape_id, adds early exit, and wires in caches/vectorized helpers. |
| src/gtfs_validator/validators/stop_time_travel_speed.py | Moves consecutive-stop speed checking to a Polars-native pipeline and restructures far-stop computation inputs. |
| tests/test_validators/test_shape_to_stop_matching_util.py | Adds tests for fast-path assignment and vectorized/spatial-index cross-checks. |
| pyproject.toml | Adds numpy dependency. |
| uv.lock | Locks numpy dependency. |
| specops/slowest.md | Updates benchmark timings after performance work. |
```python
def test_find_closest_on_shape_vec_matches_scalar() -> None:
    """Vectorised find_closest_on_shape returns same result as scalar."""
    shape_points = build_shape_points(SHAPE_ROWS)
    shape_arrays = build_shape_arrays(shape_points)
    # A stop that's far from the shape
    stop = _make_stop(47.370, 8.530)

    scalar = find_closest_on_shape(stop, shape_points)
    vec = find_closest_on_shape(stop, shape_points, shape_arrays)
```
This test is labeled as exercising the vectorised closest-point path, but find_closest_on_shape only uses the numpy implementation when the shape has at least _VEC_MIN_SEGMENTS segments. With SHAPE_ROWS (5 points), the scalar path is still used even when shape_arrays is passed. Use a larger shape (>=512 segments) so the vectorised branch is actually executed.
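One way to build a fixture that exceeds the 512-segment threshold is to synthesize a long polyline. The row format below is an assumption (lat, lon, sequence tuples, mirroring the small SHAPE_ROWS fixture); the real fixture shape may differ.

```python
import numpy as np


def make_long_shape_rows(n_points: int = 600):
    """Hypothetical fixture builder: a straight-ish shape with n_points - 1 segments.

    Assumes rows are (lat, lon, sequence) tuples; adapt to the real format.
    """
    lats = 47.36 + 0.0001 * np.arange(n_points)
    lons = 8.52 + 0.0001 * np.arange(n_points)
    return [(float(lat), float(lon), i) for i, (lat, lon) in enumerate(zip(lats, lons))]


rows = make_long_shape_rows()
assert len(rows) - 1 >= 512  # segment count, not point count, must clear the threshold
```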
```python
from gtfs_validator.validators.shape_to_stop_matching_util import (
    CandidateMatch,
    ShapeArrays,
    ShapePoint,
    StopPoint,
    build_shape_arrays,
    build_shape_points,
    closest_point_on_edge,
```
ShapeArrays is imported but never used in this test module, which will fail Ruff (F401). Remove the unused import (or add a usage if intended).
```python
import numpy as np

from gtfs_validator.validators.shape_to_stop_matching_util import (
    _cross,
    _dot,
    _norm,
    _normalize,
    _seg_fraction,
)
```
There is a second import block mid-file (import numpy as np and helpers). Ruff enforces import-at-top-of-file and sorted imports (I001/E402). Move these imports to the top-level import section and remove any unused ones.
```python
from gtfs_validator.validators.shape_to_stop_matching_util import (
    _cross,
    _dot,
    _norm,
    _normalize,
    _seg_fraction,
)
```
_cross, _dot, _norm, and _normalize are imported here but not referenced anywhere in the file, which will fail Ruff unused-import checks. Drop these imports (keeping only the helpers actually used, e.g. _seg_fraction).
```diff
-from gtfs_validator.validators.shape_to_stop_matching_util import (
-    _cross,
-    _dot,
-    _norm,
-    _normalize,
-    _seg_fraction,
-)
+from gtfs_validator.validators.shape_to_stop_matching_util import _seg_fraction
```
```python
# --- Add resolved lat/lon via join ---
latlng_rows = [
    {"stop_id": sid, "_rlat": ll[0], "_rlon": ll[1]}
    for sid, ll in resolved_latlng_by_stop_id.items()
    if ll is not None
]
if latlng_rows:
    latlng_df = pl.DataFrame(latlng_rows).with_columns(
        pl.col("_rlat").cast(pl.Float64), pl.col("_rlon").cast(pl.Float64),
    )
    joined = joined.join(latlng_df, on="stop_id", how="left")
else:
    joined = joined.with_columns(
        pl.lit(None, dtype=pl.Float64).alias("_rlat"),
        pl.lit(None, dtype=pl.Float64).alias("_rlon"),
    )

seg_dists_raw = _haversine_km_vectorized(
    lats_arr[:-1], lons_arr[:-1], lats_arr[1:], lons_arr[1:]
)
seg_dists_safe = np.where(np.isnan(seg_dists_raw), 0.0, seg_dists_raw)
prefix_distances_km = np.concatenate([[0.0], np.cumsum(seg_dists_safe)])
total_trip_distance_km = float(prefix_distances_km[-1])

# Consecutive-stop check: iterate over consecutive valid-latlng pairs.
# Precompute distances between consecutive valid stops using NumPy.
valid_rel_indices = [i for i in range(trip_len) if trip_latlngs[i] is not None]
if len(valid_rel_indices) >= 2:
    v_lats = lats_arr[valid_rel_indices]
    v_lons = lons_arr[valid_rel_indices]
    consec_dists = _haversine_km_vectorized(
        v_lats[:-1], v_lons[:-1], v_lats[1:], v_lons[1:]
    )
    for j in range(len(valid_rel_indices) - 1):
        start_rel = valid_rel_indices[j]
        end_rel = valid_rel_indices[j + 1]
        start_idx = trip_start + start_rel
        end_idx = trip_start + end_rel

        departure_secs = departure_times[start_idx]
        arrival_secs = arrival_times[end_idx]
        if departure_secs is None or arrival_secs is None:
            continue

        distance_km = float(consec_dists[j])
        speed = get_speed_kph(distance_km, departure_secs, arrival_secs)
        if speed > max_speed:
            notices.append(
                Notice(
                    code="fast_travel_between_consecutive_stops",
                    severity=Severity.WARNING,
                    fields=_build_notice_fields(
                        int(trip_csv_row_number),
                        str(trip_id),
                        str(route_id),
                        speed,
                        distance_km,
                        start_idx,
                        end_idx,
                        csv_row_numbers,
                        stop_sequences,
                        stop_ids,
                        departure_times,
                        arrival_times,
                        stop_name_by_stop_id,
                    ),
                )
            )

# --- Add max speed per route_type ---
speed_rows = [{"route_type": k, "max_speed_kph": v} for k, v in MAX_SPEED_KPH.items()]
speed_df = pl.DataFrame(speed_rows).with_columns(
    pl.col("route_type").cast(joined["route_type"].dtype),
)
joined = joined.join(speed_df, on="route_type", how="left").with_columns(
    pl.col("max_speed_kph").fill_null(DEFAULT_MAX_SPEED_KPH),
)
```
Polars join() does not guarantee preserving the existing sort order. The consecutive-stop and far-stop logic relies on rows being ordered by trip_id, stop_sequence before using shift().over("trip_id") and later list-based iteration. Re-sort after the joins (or use maintain_order="left" on the joins) to ensure correct pairing.
```python
def _haversine_km_expr(
    lat1: str, lon1: str, lat2: str, lon2: str,
) -> pl.Expr:
    """Polars expression: haversine distance in km between two lat/lon column pairs."""
    R = 6371.0
    dlat = (pl.col(lat2) - pl.col(lat1)).radians() / 2
    dlon = (pl.col(lon2) - pl.col(lon1)).radians() / 2
    a = (
        dlat.sin() ** 2
        + pl.col(lat1).radians().cos() * pl.col(lat2).radians().cos() * dlon.sin() ** 2
    )
    return R * 2 * a.sqrt().arcsin()
```
_haversine_km_expr computes a for the haversine formula but does not clamp it into [0, 1]. Due to floating-point error, a can slightly exceed 1 and make arcsin() return NaN/Null, which would silently drop violations. Clamp a (e.g., with .clip(0.0, 1.0)) before taking sqrt/arcsin.
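The failure mode is easy to reproduce in plain numpy; `np.clip` below plays the role of the suggested Polars `.clip(0.0, 1.0)` on the intermediate term.

```python
import numpy as np

# Floating-point drift can push the haversine intermediate `a` just past 1.0,
# where arcsin(sqrt(a)) becomes NaN and the affected row silently drops out.
a_bad = np.float64(1.0 + 1e-12)

with np.errstate(invalid="ignore"):
    unclamped = np.arcsin(np.sqrt(a_bad))  # sqrt(a_bad) > 1 -> NaN

clamped = np.arcsin(np.sqrt(np.clip(a_bad, 0.0, 1.0)))  # exactly pi/2
```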
```python
def test_find_potential_matches_vec_matches_scalar() -> None:
    """Vectorised find_potential_matches returns same candidates as scalar."""
    shape_points = build_shape_points(SHAPE_ROWS)
    shape_arrays = build_shape_arrays(shape_points)
    stop = _make_stop(47.3659, 8.5254)

    scalar_matches = find_potential_matches(stop, shape_points, 150.0)
    vec_matches = find_potential_matches(stop, shape_points, 150.0, shape_arrays)
```
These “vec_matches” assertions won’t exercise the vectorised implementation because find_potential_matches only takes the numpy path when (len(shape_points) - 1) >= _VEC_MIN_SEGMENTS (currently 512). With SHAPE_ROWS (5 points), both calls will run the scalar path, so the test can’t catch vectorisation regressions. Adjust the fixture to exceed the segment threshold (e.g., repeat the shape rows) or parameterize the threshold for testing.
Summary

This PR optimizes the shape_to_stop_matching validator, which was identified as one of the slowest validators during performance profiling.

Changes

- Remove conditional use_vectorized checks and always pass shape_arrays to helper functions, letting them handle None cases directly
- Introduce TripInfo dataclass to replace dict usage for trip data
- Defer CandidateMatch instantiation by tracking scalar values during iteration
- Simplify routes_by_id to route_type_by_id, storing only the needed field
- Early return when no trips have shape_id to avoid unnecessary processing
- Cache stop_id in a local variable in the build_stop_points hot loop

Commits

Testing