perf: optimize shape_to_stop_matching validator #2
ryan-mahoney wants to merge 4 commits into master from
Conversation
…travel_speed)
- Optimize shape_to_stop_matching (2.631s) and stop_time_travel_speed (0.906s)
- Improve shape_to_stop_matching_util with more efficient algorithms
- Update dependencies in pyproject.toml for performance gains
- Add tests for optimized utility functions

…ays passing shape_arrays
Remove conditional use_vectorized checks in match_using_geo_distance and match_using_user_distance functions. The shape_arrays parameter is now passed directly to helper functions, letting them handle None cases.
- Introduce TripInfo dataclass to replace dict usage for trip data
- Defer CandidateMatch instantiation in find_potential_matches and find_closest_on_shape by tracking scalar values during iteration
- Simplify routes_by_id to route_type_by_id, storing only the needed field
- Remove Optional types and assertions from hot paths
- Early return when no trips have shape_id to avoid unnecessary processing
- Cache stop_id in a local variable in build_stop_points to avoid repeated dictionary lookups in the hot loop
Pull request overview
This PR focuses on improving validator throughput by reducing Python-side work in hot paths, primarily for shape_to_stop_matching, and also refactors parts of stop_time_travel_speed toward Polars-native computations.
Changes:
- Add numpy-backed vectorized geometry helpers, spatial indexing, and caching for shape_to_stop_matching.
- Refactor trip/route handling to reduce allocations (e.g., TripInfo, slimmer lookups) and add an early exit when no shape_id is present.
- Rework stop_time_travel_speed consecutive-stop checks using Polars expressions and joins.
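The numpy-backed geometry helper mentioned above could look roughly like the sketch below. This is a hedged illustration, not the PR's actual code — the real `_haversine_km_vectorized` signature may differ — but it shows the standard array-at-a-time haversine that replaces per-pair Python calls.

```python
import numpy as np


def haversine_km_vectorized(lat1, lon1, lat2, lon2):
    """Great-circle distance in km for arrays of coordinates in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(a, dtype=np.float64)) for a in (lat1, lon1, lat2, lon2)
    )
    dlat = (lat2 - lat1) / 2.0
    dlon = (lon2 - lon1) / 2.0
    a = np.sin(dlat) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon) ** 2
    # Guard against floating-point drift pushing `a` just past 1.0,
    # which would make arcsin return NaN.
    a = np.clip(a, 0.0, 1.0)
    return 6371.0 * 2.0 * np.arcsin(np.sqrt(a))
```

Operating on whole arrays of consecutive stop pairs keeps the math in compiled numpy code rather than the Python interpreter loop.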
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/gtfs_validator/validators/shape_to_stop_matching_util.py | Introduces numpy vectorization, spatial index, and caching keys for matching helpers. |
| src/gtfs_validator/validators/shape_to_stop_matching.py | Uses TripInfo, groups trips by shape_id, adds early exit, and wires in caches/vectorized helpers. |
| src/gtfs_validator/validators/stop_time_travel_speed.py | Moves consecutive-stop speed checking to a Polars-native pipeline and restructures far-stop computation inputs. |
| tests/test_validators/test_shape_to_stop_matching_util.py | Adds tests for fast-path assignment and vectorized/spatial-index cross-checks. |
| pyproject.toml | Adds numpy dependency. |
| uv.lock | Locks numpy dependency. |
| specops/slowest.md | Updates benchmark timings after performance work. |
```python
def test_find_closest_on_shape_vec_matches_scalar() -> None:
    """Vectorised find_closest_on_shape returns same result as scalar."""
    shape_points = build_shape_points(SHAPE_ROWS)
    shape_arrays = build_shape_arrays(shape_points)
    # A stop that's far from the shape
    stop = _make_stop(47.370, 8.530)

    scalar = find_closest_on_shape(stop, shape_points)
    vec = find_closest_on_shape(stop, shape_points, shape_arrays)
```
This test is labeled as exercising the vectorised closest-point path, but find_closest_on_shape only uses the numpy implementation when the shape has at least _VEC_MIN_SEGMENTS segments. With SHAPE_ROWS (5 points), the scalar path is still used even when shape_arrays is passed. Use a larger shape (>=512 segments) so the vectorised branch is actually executed.
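One way to build a fixture that exceeds the 512-segment threshold is to synthesize a long polyline. The row format below is an assumption (lat, lon, sequence tuples, mirroring the small SHAPE_ROWS fixture); the real fixture shape may differ.

```python
import numpy as np


def make_long_shape_rows(n_points: int = 600):
    """Hypothetical fixture builder: a straight-ish shape with n_points - 1 segments.

    Assumes rows are (lat, lon, sequence) tuples; adapt to the real format.
    """
    lats = 47.36 + 0.0001 * np.arange(n_points)
    lons = 8.52 + 0.0001 * np.arange(n_points)
    return [(float(lat), float(lon), i) for i, (lat, lon) in enumerate(zip(lats, lons))]


rows = make_long_shape_rows()
assert len(rows) - 1 >= 512  # segment count, not point count, must clear the threshold
```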
```python
from gtfs_validator.validators.shape_to_stop_matching_util import (
    CandidateMatch,
    ShapeArrays,
    ShapePoint,
    StopPoint,
    build_shape_arrays,
    build_shape_points,
    closest_point_on_edge,
```
ShapeArrays is imported but never used in this test module, which will fail Ruff (F401). Remove the unused import (or add a usage if intended).
```python
import numpy as np

from gtfs_validator.validators.shape_to_stop_matching_util import (
    _cross,
    _dot,
    _norm,
    _normalize,
    _seg_fraction,
)
```
There is a second import block mid-file (import numpy as np and helpers). Ruff enforces import-at-top-of-file and sorted imports (I001/E402). Move these imports to the top-level import section and remove any unused ones.
```python
from gtfs_validator.validators.shape_to_stop_matching_util import (
    _cross,
    _dot,
    _norm,
    _normalize,
    _seg_fraction,
)
```
_cross, _dot, _norm, and _normalize are imported here but not referenced anywhere in the file, which will fail Ruff unused-import checks. Drop these imports (keeping only the helpers actually used, e.g. _seg_fraction).
```diff
-from gtfs_validator.validators.shape_to_stop_matching_util import (
-    _cross,
-    _dot,
-    _norm,
-    _normalize,
-    _seg_fraction,
-)
+from gtfs_validator.validators.shape_to_stop_matching_util import _seg_fraction
```
```python
# --- Add resolved lat/lon via join ---
latlng_rows = [
    {"stop_id": sid, "_rlat": ll[0], "_rlon": ll[1]}
    for sid, ll in resolved_latlng_by_stop_id.items()
    if ll is not None
]
if latlng_rows:
    latlng_df = pl.DataFrame(latlng_rows).with_columns(
        pl.col("_rlat").cast(pl.Float64), pl.col("_rlon").cast(pl.Float64),
    )
    joined = joined.join(latlng_df, on="stop_id", how="left")
else:
    joined = joined.with_columns(
        pl.lit(None, dtype=pl.Float64).alias("_rlat"),
        pl.lit(None, dtype=pl.Float64).alias("_rlon"),
    )

seg_dists_raw = _haversine_km_vectorized(
    lats_arr[:-1], lons_arr[:-1], lats_arr[1:], lons_arr[1:]
)
seg_dists_safe = np.where(np.isnan(seg_dists_raw), 0.0, seg_dists_raw)
prefix_distances_km = np.concatenate([[0.0], np.cumsum(seg_dists_safe)])
total_trip_distance_km = float(prefix_distances_km[-1])

# Consecutive-stop check: iterate over consecutive valid-latlng pairs.
# Precompute distances between consecutive valid stops using NumPy.
valid_rel_indices = [i for i in range(trip_len) if trip_latlngs[i] is not None]
if len(valid_rel_indices) >= 2:
    v_lats = lats_arr[valid_rel_indices]
    v_lons = lons_arr[valid_rel_indices]
    consec_dists = _haversine_km_vectorized(
        v_lats[:-1], v_lons[:-1], v_lats[1:], v_lons[1:]
    )
    for j in range(len(valid_rel_indices) - 1):
        start_rel = valid_rel_indices[j]
        end_rel = valid_rel_indices[j + 1]
        start_idx = trip_start + start_rel
        end_idx = trip_start + end_rel

        departure_secs = departure_times[start_idx]
        arrival_secs = arrival_times[end_idx]
        if departure_secs is None or arrival_secs is None:
            continue

        distance_km = float(consec_dists[j])
        speed = get_speed_kph(distance_km, departure_secs, arrival_secs)
        if speed > max_speed:
            notices.append(
                Notice(
                    code="fast_travel_between_consecutive_stops",
                    severity=Severity.WARNING,
                    fields=_build_notice_fields(
                        int(trip_csv_row_number),
                        str(trip_id),
                        str(route_id),
                        speed,
                        distance_km,
                        start_idx,
                        end_idx,
                        csv_row_numbers,
                        stop_sequences,
                        stop_ids,
                        departure_times,
                        arrival_times,
                        stop_name_by_stop_id,
                    ),
                )
            )

# --- Add max speed per route_type ---
speed_rows = [{"route_type": k, "max_speed_kph": v} for k, v in MAX_SPEED_KPH.items()]
speed_df = pl.DataFrame(speed_rows).with_columns(
    pl.col("route_type").cast(joined["route_type"].dtype),
)
joined = joined.join(speed_df, on="route_type", how="left").with_columns(
    pl.col("max_speed_kph").fill_null(DEFAULT_MAX_SPEED_KPH),
)
```
Polars join() does not guarantee preserving the existing sort order. The consecutive-stop and far-stop logic relies on rows being ordered by trip_id, stop_sequence before using shift().over("trip_id") and later list-based iteration. Re-sort after the joins (or use maintain_order="left" on the joins) to ensure correct pairing.
```python
def _haversine_km_expr(
    lat1: str, lon1: str, lat2: str, lon2: str,
) -> pl.Expr:
    """Polars expression: haversine distance in km between two lat/lon column pairs."""
    R = 6371.0
    dlat = (pl.col(lat2) - pl.col(lat1)).radians() / 2
    dlon = (pl.col(lon2) - pl.col(lon1)).radians() / 2
    a = (
        dlat.sin() ** 2
        + pl.col(lat1).radians().cos() * pl.col(lat2).radians().cos() * dlon.sin() ** 2
    )
    return R * 2 * a.sqrt().arcsin()
```
_haversine_km_expr computes a for the haversine formula but does not clamp it into [0, 1]. Due to floating-point error, a can slightly exceed 1 and make arcsin() return NaN/Null, which would silently drop violations. Clamp a (e.g., with .clip(0.0, 1.0)) before taking sqrt/arcsin.
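The failure mode is easy to reproduce in plain numpy; `np.clip` below plays the role of the suggested Polars `.clip(0.0, 1.0)` on the intermediate term.

```python
import numpy as np

# Floating-point drift can push the haversine intermediate `a` just past 1.0,
# where arcsin(sqrt(a)) becomes NaN and the affected row silently drops out.
a_bad = np.float64(1.0 + 1e-12)

with np.errstate(invalid="ignore"):
    unclamped = np.arcsin(np.sqrt(a_bad))  # sqrt(a_bad) > 1 -> NaN

clamped = np.arcsin(np.sqrt(np.clip(a_bad, 0.0, 1.0)))  # exactly pi/2
```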
```python
def test_find_potential_matches_vec_matches_scalar() -> None:
    """Vectorised find_potential_matches returns same candidates as scalar."""
    shape_points = build_shape_points(SHAPE_ROWS)
    shape_arrays = build_shape_arrays(shape_points)
    stop = _make_stop(47.3659, 8.5254)

    scalar_matches = find_potential_matches(stop, shape_points, 150.0)
    vec_matches = find_potential_matches(stop, shape_points, 150.0, shape_arrays)
```
These “vec_matches” assertions won’t exercise the vectorised implementation because find_potential_matches only takes the numpy path when (len(shape_points) - 1) >= _VEC_MIN_SEGMENTS (currently 512). With SHAPE_ROWS (5 points), both calls will run the scalar path, so the test can’t catch vectorisation regressions. Adjust the fixture to exceed the segment threshold (e.g., repeat the shape rows) or parameterize the threshold for testing.
Summary

This PR optimizes the shape_to_stop_matching validator, which was identified as one of the slowest validators during performance profiling.

Changes

- Remove conditional use_vectorized checks and always pass shape_arrays to helper functions, letting them handle None cases directly
- Introduce TripInfo dataclass to replace dict usage for trip data
- Defer CandidateMatch instantiation by tracking scalar values during iteration
- Simplify routes_by_id to route_type_by_id, storing only the needed field
- Early return when no trips have shape_id to avoid unnecessary processing
- Cache stop_id in a local variable in the build_stop_points hot loop

Commits

Testing