
perf: optimize shape_to_stop_matching validator #2

Open
ryan-mahoney wants to merge 4 commits into master from rpm-migrate-to-python-polars

Conversation

@ryan-mahoney
Member

Summary

This PR optimizes the shape_to_stop_matching validator, which was identified as one of the slowest validators during performance profiling.

Changes

  • Simplify vectorization logic: Remove conditional use_vectorized checks and always pass shape_arrays to helper functions, letting them handle None cases directly
  • Reduce object allocation in hot loops:
    • Introduce TripInfo dataclass to replace dict usage for trip data
    • Defer CandidateMatch instantiation by tracking scalar values during iteration
    • Simplify routes_by_id to route_type_by_id storing only the needed field
  • Add early exit: Return early when no trips have shape_id to avoid unnecessary processing
  • Reduce dict lookups: Cache stop_id in local variable in build_stop_points hot loop

Commits

  • 5110a9f perf: optimize slowest validators (shape_to_stop_matching, stop_time_travel_speed)
  • 0c742c9 refactor(shape_to_stop_matching): simplify vectorization logic by always passing shape_arrays
  • 80898eb perf(shape_to_stop_matching): reduce object allocation in hot loops
  • 9197d50 perf(shape_to_stop_matching): add early exit and reduce dict lookups

Testing

  • All existing tests pass
  • Behavioral equivalence maintained with previous implementation

perf: optimize slowest validators (shape_to_stop_matching, stop_time_travel_speed)

- Optimize shape_to_stop_matching (2.631s) and stop_time_travel_speed (0.906s)
- Improve shape_to_stop_matching_util with more efficient algorithms
- Update dependencies in pyproject.toml for performance gains
- Add tests for optimized utility functions
refactor(shape_to_stop_matching): simplify vectorization logic by always passing shape_arrays

Remove conditional use_vectorized checks in match_using_geo_distance and
match_using_user_distance functions. The shape_arrays parameter is now
passed directly to helper functions, letting them handle None cases.
- Introduce TripInfo dataclass to replace dict usage for trip data
- Defer CandidateMatch instantiation in find_potential_matches and
  find_closest_on_shape by tracking scalar values during iteration
- Simplify routes_by_id to route_type_by_id storing only needed field
- Remove Optional types and assertions from hot paths
- Early return when no trips have shape_id to avoid unnecessary processing
- Cache stop_id in local variable in build_stop_points to avoid repeated
  dictionary lookups in hot loop
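The local-variable caching pattern from the last bullet can be sketched like this (the function shape, row dicts, and return format are illustrative; the real build_stop_points differs):

```python
def build_stop_points(stop_times, stops_by_id):
    """Illustrative hot loop: read stop_id once per iteration and reuse it."""
    points = []
    for st in stop_times:
        stop_id = st["stop_id"]          # cached in a local, looked up once
        stop = stops_by_id.get(stop_id)  # single dict access with the cached key
        if stop is not None:
            points.append((stop_id, stop["lat"], stop["lon"]))
    return points


stops_by_id = {"s1": {"lat": 47.37, "lon": 8.53}}
stop_times = [{"stop_id": "s1"}, {"stop_id": "missing"}]
print(build_stop_points(stop_times, stops_by_id))  # -> [('s1', 47.37, 8.53)]
```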

Copilot AI left a comment


Pull request overview

This PR focuses on improving validator throughput by reducing Python-side work in hot paths, primarily for shape_to_stop_matching, and also refactors parts of stop_time_travel_speed toward Polars-native computations.

Changes:

  • Add numpy-backed vectorized geometry helpers, spatial indexing, and caching for shape_to_stop_matching.
  • Refactor trip/route handling to reduce allocations (e.g., TripInfo, slimmer lookups) and add early exit when no shape_id is present.
  • Rework stop_time_travel_speed consecutive-stop checks using Polars expressions and joins.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 7 comments.

Summary per file:

  • src/gtfs_validator/validators/shape_to_stop_matching_util.py — Introduces numpy vectorization, spatial index, and caching keys for matching helpers.
  • src/gtfs_validator/validators/shape_to_stop_matching.py — Uses TripInfo, groups trips by shape_id, adds early exit, and wires in caches/vectorized helpers.
  • src/gtfs_validator/validators/stop_time_travel_speed.py — Moves consecutive-stop speed checking to a Polars-native pipeline and restructures far-stop computation inputs.
  • tests/test_validators/test_shape_to_stop_matching_util.py — Adds tests for fast-path assignment and vectorized/spatial-index cross-checks.
  • pyproject.toml — Adds numpy dependency.
  • uv.lock — Locks numpy dependency.
  • specops/slowest.md — Updates benchmark timings after performance work.

Comment on lines +477 to +486
def test_find_closest_on_shape_vec_matches_scalar() -> None:
    """Vectorised find_closest_on_shape returns same result as scalar."""
    shape_points = build_shape_points(SHAPE_ROWS)
    shape_arrays = build_shape_arrays(shape_points)
    # A stop that's far from the shape
    stop = _make_stop(47.370, 8.530)

    scalar = find_closest_on_shape(stop, shape_points)
    vec = find_closest_on_shape(stop, shape_points, shape_arrays)


Copilot AI Mar 10, 2026


This test is labeled as exercising the vectorised closest-point path, but find_closest_on_shape only uses the numpy implementation when the shape has at least _VEC_MIN_SEGMENTS segments. With SHAPE_ROWS (5 points), the scalar path is still used even when shape_arrays is passed. Use a larger shape (>=512 segments) so the vectorised branch is actually executed.
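One way to address this comment is a generated fixture that exceeds the segment threshold. This is a hedged sketch: the 512 threshold is taken from the review comment, and the GTFS column names are assumptions, not the project's actual test helpers.

```python
VEC_MIN_SEGMENTS = 512  # assumed threshold, per the review comment


def make_long_shape_rows(n_points: int = 600):
    """Evenly spaced points along a line; yields n_points - 1 segments."""
    return [
        {
            "shape_pt_lat": 47.36 + i * 1e-4,
            "shape_pt_lon": 8.52 + i * 1e-4,
            "shape_pt_sequence": i,
        }
        for i in range(n_points)
    ]


rows = make_long_shape_rows()
# 599 segments, enough to take a size-gated vectorised branch.
assert len(rows) - 1 >= VEC_MIN_SEGMENTS
```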

Comment on lines 9 to 16
from gtfs_validator.validators.shape_to_stop_matching_util import (
    CandidateMatch,
    ShapeArrays,
    ShapePoint,
    StopPoint,
    build_shape_arrays,
    build_shape_points,
    closest_point_on_edge,

Copilot AI Mar 10, 2026


ShapeArrays is imported but never used in this test module, which will fail Ruff (F401). Remove the unused import (or add a usage if intended).

Comment on lines +394 to +401
import numpy as np
from gtfs_validator.validators.shape_to_stop_matching_util import (
    _cross,
    _dot,
    _norm,
    _normalize,
    _seg_fraction,
)

Copilot AI Mar 10, 2026


There is a second import block mid-file (import numpy as np and helpers). Ruff enforces import-at-top-of-file and sorted imports (I001/E402). Move these imports to the top-level import section and remove any unused ones.

Comment on lines +395 to +401
from gtfs_validator.validators.shape_to_stop_matching_util import (
    _cross,
    _dot,
    _norm,
    _normalize,
    _seg_fraction,
)

Copilot AI Mar 10, 2026


_cross, _dot, _norm, and _normalize are imported here but not referenced anywhere in the file, which will fail Ruff unused-import checks. Drop these imports (keeping only the helpers actually used, e.g. _seg_fraction).

Suggested change
-from gtfs_validator.validators.shape_to_stop_matching_util import (
-    _cross,
-    _dot,
-    _norm,
-    _normalize,
-    _seg_fraction,
-)
+from gtfs_validator.validators.shape_to_stop_matching_util import _seg_fraction

Comment on lines +324 to +348
# --- Add resolved lat/lon via join ---
latlng_rows = [
    {"stop_id": sid, "_rlat": ll[0], "_rlon": ll[1]}
    for sid, ll in resolved_latlng_by_stop_id.items()
    if ll is not None
]
if latlng_rows:
    latlng_df = pl.DataFrame(latlng_rows).with_columns(
        pl.col("_rlat").cast(pl.Float64),
        pl.col("_rlon").cast(pl.Float64),
    )
    joined = joined.join(latlng_df, on="stop_id", how="left")
else:
    joined = joined.with_columns(
        pl.lit(None, dtype=pl.Float64).alias("_rlat"),
        pl.lit(None, dtype=pl.Float64).alias("_rlon"),
    )

seg_dists_raw = _haversine_km_vectorized(
    lats_arr[:-1], lons_arr[:-1], lats_arr[1:], lons_arr[1:]
)
seg_dists_safe = np.where(np.isnan(seg_dists_raw), 0.0, seg_dists_raw)
prefix_distances_km = np.concatenate([[0.0], np.cumsum(seg_dists_safe)])
total_trip_distance_km = float(prefix_distances_km[-1])

# Consecutive-stop check: iterate over consecutive valid-latlng pairs.
# Precompute distances between consecutive valid stops using NumPy.
valid_rel_indices = [i for i in range(trip_len) if trip_latlngs[i] is not None]
if len(valid_rel_indices) >= 2:
    v_lats = lats_arr[valid_rel_indices]
    v_lons = lons_arr[valid_rel_indices]
    consec_dists = _haversine_km_vectorized(
        v_lats[:-1], v_lons[:-1], v_lats[1:], v_lons[1:]
    )
    for j in range(len(valid_rel_indices) - 1):
        start_rel = valid_rel_indices[j]
        end_rel = valid_rel_indices[j + 1]
        start_idx = trip_start + start_rel
        end_idx = trip_start + end_rel

        departure_secs = departure_times[start_idx]
        arrival_secs = arrival_times[end_idx]
        if departure_secs is None or arrival_secs is None:
            continue

        distance_km = float(consec_dists[j])
        speed = get_speed_kph(distance_km, departure_secs, arrival_secs)
        if speed > max_speed:
            notices.append(
                Notice(
                    code="fast_travel_between_consecutive_stops",
                    severity=Severity.WARNING,
                    fields=_build_notice_fields(
                        int(trip_csv_row_number),
                        str(trip_id),
                        str(route_id),
                        speed,
                        distance_km,
                        start_idx,
                        end_idx,
                        csv_row_numbers,
                        stop_sequences,
                        stop_ids,
                        departure_times,
                        arrival_times,
                        stop_name_by_stop_id,
                    ),
                )
            )

# --- Add max speed per route_type ---
speed_rows = [{"route_type": k, "max_speed_kph": v} for k, v in MAX_SPEED_KPH.items()]
speed_df = pl.DataFrame(speed_rows).with_columns(
    pl.col("route_type").cast(joined["route_type"].dtype),
)
joined = joined.join(speed_df, on="route_type", how="left").with_columns(
    pl.col("max_speed_kph").fill_null(DEFAULT_MAX_SPEED_KPH),
)

Copilot AI Mar 10, 2026


Polars join() does not guarantee preserving the existing sort order. The consecutive-stop and far-stop logic relies on rows being ordered by trip_id, stop_sequence before using shift().over("trip_id") and later list-based iteration. Re-sort after the joins (or use maintain_order="left" on the joins) to ensure correct pairing.

Comment on lines +265 to +276
def _haversine_km_expr(
    lat1: str, lon1: str, lat2: str, lon2: str,
) -> pl.Expr:
    """Polars expression: haversine distance in km between two lat/lon column pairs."""
    R = 6371.0
    dlat = (pl.col(lat2) - pl.col(lat1)).radians() / 2
    dlon = (pl.col(lon2) - pl.col(lon1)).radians() / 2
    a = (
        dlat.sin() ** 2
        + pl.col(lat1).radians().cos() * pl.col(lat2).radians().cos() * dlon.sin() ** 2
    )
    return R * 2 * a.sqrt().arcsin()

Copilot AI Mar 10, 2026


_haversine_km_expr computes a for the haversine formula but does not clamp it into [0, 1]. Due to floating-point error, a can slightly exceed 1 and make arcsin() return NaN/Null, which would silently drop violations. Clamp a (e.g., with .clip(0.0, 1.0)) before taking sqrt/arcsin.
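The clamp the comment asks for is easiest to see in a scalar version of the same formula. This sketch mirrors the Polars expression above (the function name is illustrative):

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Scalar haversine with the clamp the review suggests."""
    r = 6371.0
    dlat = math.radians(lat2 - lat1) / 2
    dlon = math.radians(lon2 - lon1) / 2
    a = (
        math.sin(dlat) ** 2
        + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
        * math.sin(dlon) ** 2
    )
    a = min(max(a, 0.0), 1.0)  # clamp: float error can push a just past 1
    return r * 2 * math.asin(math.sqrt(a))


# Antipodal points stress the a ~= 1 edge case: without the clamp, a value
# slightly above 1 would make asin() raise (or NaN in the Polars version).
print(round(haversine_km(0.0, 0.0, 0.0, 180.0)))  # -> 20015
```

In the Polars expression, the equivalent is `a.clip(0.0, 1.0)` before `.sqrt().arcsin()`.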

Comment on lines +460 to +468
def test_find_potential_matches_vec_matches_scalar() -> None:
    """Vectorised find_potential_matches returns same candidates as scalar."""
    shape_points = build_shape_points(SHAPE_ROWS)
    shape_arrays = build_shape_arrays(shape_points)
    stop = _make_stop(47.3659, 8.5254)

    scalar_matches = find_potential_matches(stop, shape_points, 150.0)
    vec_matches = find_potential_matches(stop, shape_points, 150.0, shape_arrays)


Copilot AI Mar 10, 2026


These “vec_matches” assertions won’t exercise the vectorised implementation because find_potential_matches only takes the numpy path when (len(shape_points) - 1) >= _VEC_MIN_SEGMENTS (currently 512). With SHAPE_ROWS (5 points), both calls will run the scalar path, so the test can’t catch vectorisation regressions. Adjust the fixture to exceed the segment threshold (e.g., repeat the shape rows) or parameterize the threshold for testing.
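The "parameterize the threshold" option can be illustrated with a self-contained stand-in. Everything here (the function, the default of 512, the distance metric) is hypothetical; the point is only that a tunable threshold lets a tiny fixture reach the numpy branch:

```python
import numpy as np


def find_closest(points, stop, vec_min_segments=512):
    """Toy closest-point search with a size-gated vectorised branch."""
    if len(points) - 1 >= vec_min_segments:
        # numpy branch: vectorised distance to every point at once
        arr = np.asarray(points)
        d = np.hypot(arr[:, 0] - stop[0], arr[:, 1] - stop[1])
        return int(d.argmin())
    # scalar fallback: per-point loop
    return min(
        range(len(points)),
        key=lambda i: (points[i][0] - stop[0]) ** 2 + (points[i][1] - stop[1]) ** 2,
    )


pts = [(float(i), 0.0) for i in range(5)]
# Lowering the threshold in a test forces the numpy branch on 5 points,
# so both branches can be cross-checked with a small fixture.
assert find_closest(pts, (2.1, 0.0), vec_min_segments=2) == find_closest(pts, (2.1, 0.0))
```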
