Analysis

12: Route Geographic Span vs OTP

Route and Service Drivers

Coverage: 2019-01 to 2025-11 (from otp_monthly).

Built 2026-04-03 20:09 UTC · Commit 7c56b9a

Page Navigation

Analysis Navigation

Data Provenance

flowchart LR
  12_geographic_span(["12: Route Geographic Span vs OTP"])
  t_otp_monthly[("otp_monthly")] --> 12_geographic_span
  01_data_ingestion[["Data Ingestion"]] --> t_otp_monthly
  u1_01_data_ingestion[/"data/routes_by_month.csv"/] --> 01_data_ingestion
  u2_01_data_ingestion[/"data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv"/] --> 01_data_ingestion
  u3_01_data_ingestion[/"data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv"/] --> 01_data_ingestion
  u4_01_data_ingestion[/"data/PRT_Stop_Reference_Lookup_Table.csv"/] --> 01_data_ingestion
  u5_01_data_ingestion[/"data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv"/] --> 01_data_ingestion
  t_route_stops[("route_stops")] --> 12_geographic_span
  01_data_ingestion[["Data Ingestion"]] --> t_route_stops
  t_routes[("routes")] --> 12_geographic_span
  01_data_ingestion[["Data Ingestion"]] --> t_routes
  t_stops[("stops")] --> 12_geographic_span
  01_data_ingestion[["Data Ingestion"]] --> t_stops
  d1_12_geographic_span(("numpy (lib)")) --> 12_geographic_span
  d2_12_geographic_span(("polars (lib)")) --> 12_geographic_span
  d3_12_geographic_span(("scipy (lib)")) --> 12_geographic_span
  classDef page fill:#dbeafe,stroke:#1d4ed8,color:#1e3a8a,stroke-width:2px;
  classDef table fill:#ecfeff,stroke:#0e7490,color:#164e63;
  classDef dep fill:#fff7ed,stroke:#c2410c,color:#7c2d12,stroke-dasharray: 4 2;
  classDef file fill:#eef2ff,stroke:#6366f1,color:#3730a3;
  classDef api fill:#f0fdf4,stroke:#16a34a,color:#14532d;
  classDef pipeline fill:#f5f3ff,stroke:#7c3aed,color:#4c1d95;
  class 12_geographic_span page;
  class t_otp_monthly,t_route_stops,t_routes,t_stops table;
  class d1_12_geographic_span,d2_12_geographic_span,d3_12_geographic_span dep;
  class u1_01_data_ingestion,u2_01_data_ingestion,u3_01_data_ingestion,u4_01_data_ingestion,u5_01_data_ingestion file;
  class 01_data_ingestion pipeline;

Findings

Findings: Route Geographic Span vs OTP

Summary

Geographic span (the maximum distance between any two stops on a route) is a moderate negative predictor of OTP within bus routes (r = -0.38, p < 0.001), but stop count remains the stronger predictor after controlling for the other. Partial correlation analysis disentangles the two: stop count predicts OTP even after controlling for span (partial r = -0.41, p < 0.001), while span's independent contribution is smaller (partial r = -0.23, p = 0.03).

Key Numbers

  • Span vs OTP (bus only, primary): Pearson r = -0.38 (p < 0.001, n = 89), Spearman rho = -0.37 (p < 0.001)
  • Span vs OTP (all routes, secondary): r = -0.32 (p = 0.002, n = 92) -- includes Simpson's paradox risk
  • Stop density vs OTP (bus only): r = 0.04 (p = 0.71) -- not significant
  • Span vs OTP | stop count (bus, partial): r = -0.23 (p = 0.03)
  • Stop count vs OTP | span (bus, partial): r = -0.41 (p < 0.001)
  • Span-stop count collinearity: r = 0.41

Observations

  • The bus-only correlation (r = -0.38) is actually stronger than the all-mode correlation (r = -0.32). The pooled-mode result was muted by Simpson's paradox -- rail routes have moderate span but high OTP, pulling the all-mode trend line toward zero.
  • Span and stop count are moderately correlated (r = 0.41), but not so strongly as to make partial correlation unreliable.
  • After controlling for stop count, span still has a small but significant independent effect -- longer routes face additional challenges beyond just having more stops (longer exposure to traffic, more variance in conditions).
  • However, stop count's partial correlation (-0.41) is nearly twice span's (-0.23), confirming that the number of stops matters more than the distance covered.
  • Stop density (stops per km) shows no correlation with OTP at all (r = 0.04), meaning that tightly-packed stops are no worse than widely-spaced stops once total count and distance are accounted for.

Implication

Both stop count and route distance independently degrade OTP, but stop consolidation is the higher-leverage intervention. Shortening routes would help modestly, but eliminating stops on existing routes would have roughly twice the impact per unit of change.

Caveats

  • Geographic span (max pairwise distance) is a crude proxy for actual route length. GTFS shape data would provide a more accurate route-length measurement.
  • Routes with fewer than 12 months of data are excluded to reduce noise.

Review History

Output

Methods

Methods: Route Geographic Span vs OTP

Question

Does the geographic extent of a route predict on-time performance independently of stop count? Analysis 07 found stop count is the strongest OTP predictor (r = -0.53), but routes with many stops also tend to cover more distance. Disentangling the two could clarify whether the problem is "too many stops" or "too long a route."

Approach

  • For each route, collect all stop coordinates from route_stops joined to stops.
  • Compute geographic span as the maximum haversine distance between any pair of stops on the route (the diameter of the stop set in km).
  • Compute stop density as stops per km of span, to capture how tightly packed stops are.
  • Correlate span, stop density, and stop count separately with average OTP (Pearson and Spearman).
  • Use partial correlation to test whether span predicts OTP after controlling for stop count, and vice versa.
  • Scatter plots for span vs OTP and stop density vs OTP.

Data

Name Description Source
route_stops Links routes to stops prt.db table
stops Lat/lon coordinates prt.db table
otp_monthly Monthly OTP per route prt.db table
routes Mode classification prt.db table

Output

  • output/geographic_span.csv -- per-route span, stop density, stop count, avg OTP
  • output/span_vs_otp.png -- scatter plot of geographic span vs OTP
  • output/density_vs_otp.png -- scatter plot of stop density vs OTP

Source Code

"""Geographic span analysis: does route length predict OTP independently of stop count?"""

import math

import numpy as np
import polars as pl
from scipy import stats

from prt_otp_analysis.common import analysis_dir, correlate, mode_scatter, phase, query_to_polars, run_analysis, save_chart, save_csv, setup_plotting

OUT = analysis_dir(__file__)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Return the great-circle distance in km between two lat/lon points."""
    R = 6371.0
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlon / 2) ** 2)
    return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))


def compute_span(lats: list[float], lons: list[float]) -> float:
    """Return the max pairwise haversine distance (km) among a set of points."""
    max_dist = 0.0
    n = len(lats)
    for i in range(n):
        for j in range(i + 1, n):
            d = haversine_km(lats[i], lons[i], lats[j], lons[j])
            if d > max_dist:
                max_dist = d
    return max_dist


def load_data() -> pl.DataFrame:
    """Load per-route geographic span, stop count, and average OTP."""
    stops_by_route = query_to_polars("""
        SELECT rs.route_id, s.lat, s.lon
        FROM route_stops rs
        JOIN stops s ON rs.stop_id = s.stop_id
        WHERE s.lat IS NOT NULL AND s.lon IS NOT NULL
    """)
    avg_otp = query_to_polars("""
        SELECT o.route_id, r.route_name, r.mode,
               AVG(o.otp) AS avg_otp, COUNT(*) AS months
        FROM otp_monthly o
        JOIN routes r ON o.route_id = r.route_id
        GROUP BY o.route_id
        HAVING COUNT(*) >= 12
    """)
    stop_counts = query_to_polars("""
        SELECT route_id, COUNT(DISTINCT stop_id) AS stop_count
        FROM route_stops
        GROUP BY route_id
    """)

    # Compute geographic span per route
    spans = []
    for route_id in stops_by_route["route_id"].unique().sort().to_list():
        subset = stops_by_route.filter(pl.col("route_id") == route_id)
        lats = subset["lat"].to_list()
        lons = subset["lon"].to_list()
        span_km = compute_span(lats, lons)
        spans.append({"route_id": route_id, "span_km": span_km})

    span_df = pl.DataFrame(spans)
    df = avg_otp.join(stop_counts, on="route_id", how="inner")
    df = df.join(span_df, on="route_id", how="inner")
    df = df.with_columns(
        pl.when(pl.col("span_km") > 0)
        .then(pl.col("stop_count") / pl.col("span_km"))
        .otherwise(None)
        .alias("stop_density")
    )
    return df


def partial_corr(x: list, y: list, z: list) -> tuple[float, float]:
    """Compute partial Pearson correlation of x and y controlling for z."""
    x_arr, y_arr, z_arr = np.array(x), np.array(y), np.array(z)
    slope_xz = np.polyfit(z_arr, x_arr, 1)
    x_resid = x_arr - np.polyval(slope_xz, z_arr)
    slope_yz = np.polyfit(z_arr, y_arr, 1)
    y_resid = y_arr - np.polyval(slope_yz, z_arr)
    return stats.pearsonr(x_resid, y_resid)


def analyze(df: pl.DataFrame) -> dict:
    """Compute correlations between span, stop count, density, and OTP."""
    results = {}
    bus = df.filter(pl.col("mode") == "BUS")

    # --- Bus-only correlations (primary, avoids Simpson's paradox) ---
    bus_span_corr = correlate(bus, "span_km", "avg_otp")
    results["bus_span_r"] = bus_span_corr["pearson_r"]
    results["bus_span_p"] = bus_span_corr["pearson_p"]
    results["bus_span_rho"] = bus_span_corr["spearman_r"]
    results["bus_span_rho_p"] = bus_span_corr["spearman_p"]

    # All-mode (secondary, for reference)
    all_span_corr = correlate(df, "span_km", "avg_otp")
    results["all_span_r"] = all_span_corr["pearson_r"]
    results["all_span_p"] = all_span_corr["pearson_p"]

    # Stop density vs OTP (bus only, excluding zero-span routes)
    bus_dens = bus.filter(pl.col("stop_density").is_not_null())
    dens_corr = correlate(bus_dens, "stop_density", "avg_otp")
    results["density_r"] = dens_corr["pearson_r"]
    results["density_p"] = dens_corr["pearson_p"]

    # Partial: span vs OTP controlling for stop count (bus only)
    r, p = partial_corr(
        bus["span_km"].to_list(), bus["avg_otp"].to_list(), bus["stop_count"].to_list()
    )
    results["span_partial_r"] = r
    results["span_partial_p"] = p

    # Partial: stop count vs OTP controlling for span (bus only)
    r, p = partial_corr(
        bus["stop_count"].to_list(), bus["avg_otp"].to_list(), bus["span_km"].to_list()
    )
    results["stops_partial_r"] = r
    results["stops_partial_p"] = p

    # Span vs stop count (collinearity check)
    span_stops_corr = correlate(bus, "span_km", "stop_count")
    results["span_stops_r"] = span_stops_corr["pearson_r"]

    results["n_all"] = len(df)
    results["n_bus"] = len(bus)
    return results


def make_charts(df: pl.DataFrame, results: dict) -> None:
    """Generate scatter plots for span vs OTP and density vs OTP."""
    plt = setup_plotting()
    # Span vs OTP
    fig, ax = plt.subplots(figsize=(10, 7))
    mode_scatter(ax, df, "span_km", "avg_otp")
    ax.set_xlabel("Geographic Span (km)")
    ax.set_ylabel("Average OTP")
    ax.set_title("Route Geographic Span vs On-Time Performance")
    ax.legend(fontsize=9)
    ax.set_ylim(0, 1)
    save_chart(fig, OUT / "span_vs_otp.png")

    # Density vs OTP (bus only)
    fig, ax = plt.subplots(figsize=(10, 7))
    bus = df.filter(pl.col("mode") == "BUS")
    bus_dens = bus.filter(pl.col("stop_density").is_not_null())
    ax.scatter(bus_dens["stop_density"].to_list(), bus_dens["avg_otp"].to_list(),
               color="#3b82f6", s=40, alpha=0.7, edgecolors="white", linewidths=0.5)
    x, y = bus_dens["stop_density"].to_list(), bus_dens["avg_otp"].to_list()
    slope, intercept = np.polyfit(x, y, 1)
    x_line = [min(x), max(x)]
    ax.plot(x_line, [slope * xi + intercept for xi in x_line],
            color="#1e40af", linewidth=1.5, linestyle="--",
            label=f"BUS trend (r={results['density_r']:.3f})")
    ax.set_xlabel("Stop Density (stops / km)")
    ax.set_ylabel("Average OTP")
    ax.set_title("Stop Density vs On-Time Performance (Bus Only)")
    ax.legend(fontsize=9)
    ax.set_ylim(0, 1)
    save_chart(fig, OUT / "density_vs_otp.png")


@run_analysis(12, "Route Geographic Span vs OTP")
def main() -> None:
    """Entry point: load data, analyze, chart, and save."""
    with phase("Loading data and computing geographic spans"):
        df = load_data()
        print(f"  {len(df)} routes with span, stop count, and OTP data")

    with phase("Analyzing correlations"):
        results = analyze(df)
        print(f"\n  Bus-only (primary):")
        print(f"    Span vs OTP:  Pearson r = {results['bus_span_r']:.4f} (p = {results['bus_span_p']:.4f})")
        print(f"                  Spearman rho = {results['bus_span_rho']:.4f} (p = {results['bus_span_rho_p']:.4f})")
        print(f"    n = {results['n_bus']} bus routes")
        print(f"\n  All-mode (secondary, Simpson's paradox risk):")
        print(f"    Span vs OTP:  Pearson r = {results['all_span_r']:.4f} (p = {results['all_span_p']:.4f})")
        print(f"    n = {results['n_all']} routes")
        print(f"\n  Density vs OTP (bus):          r = {results['density_r']:.4f} (p = {results['density_p']:.4f})")
        print(f"  Span vs OTP | stop count (bus): r = {results['span_partial_r']:.4f} (p = {results['span_partial_p']:.4f})")
        print(f"  Stops vs OTP | span (bus):      r = {results['stops_partial_r']:.4f} (p = {results['stops_partial_p']:.4f})")
        print(f"  Span-stop count collinearity:  r = {results['span_stops_r']:.4f}")

        save_csv(df, OUT / "geographic_span.csv")

    with phase("Generating charts"):
        make_charts(df, results)


if __name__ == "__main__":
    main()

Sources

NameTypeWhy It MattersOwnerFreshnessCaveat
otp_monthly table Primary analytical table used in this page's computations. Produced by Data Ingestion. Updated when the producing pipeline step is rerun. Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5)
  • file data/routes_by_month.csv — Monthly route OTP source table in wide format.
  • file data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv — Current route metadata and mode classifications.
  • file data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv — Current stop-to-route coverage and trip counts.
  • file data/PRT_Stop_Reference_Lookup_Table.csv — Historical stop reference file with geography attributes.
  • file data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv — Average ridership by route and month.
route_stops table Primary analytical table used in this page's computations. Produced by Data Ingestion. Updated when the producing pipeline step is rerun. Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5)
  • file data/routes_by_month.csv — Monthly route OTP source table in wide format.
  • file data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv — Current route metadata and mode classifications.
  • file data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv — Current stop-to-route coverage and trip counts.
  • file data/PRT_Stop_Reference_Lookup_Table.csv — Historical stop reference file with geography attributes.
  • file data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv — Average ridership by route and month.
routes table Primary analytical table used in this page's computations. Produced by Data Ingestion. Updated when the producing pipeline step is rerun. Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5)
  • file data/routes_by_month.csv — Monthly route OTP source table in wide format.
  • file data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv — Current route metadata and mode classifications.
  • file data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv — Current stop-to-route coverage and trip counts.
  • file data/PRT_Stop_Reference_Lookup_Table.csv — Historical stop reference file with geography attributes.
  • file data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv — Average ridership by route and month.
stops table Primary analytical table used in this page's computations. Produced by Data Ingestion. Updated when the producing pipeline step is rerun. Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5)
  • file data/routes_by_month.csv — Monthly route OTP source table in wide format.
  • file data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv — Current route metadata and mode classifications.
  • file data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv — Current stop-to-route coverage and trip counts.
  • file data/PRT_Stop_Reference_Lookup_Table.csv — Historical stop reference file with geography attributes.
  • file data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv — Average ridership by route and month.
numpy dependency Runtime dependency required for this page's pipeline or analysis code. Open-source Python ecosystem maintainers. Version pinned by project environment until dependency updates are applied. Library updates may change behavior or defaults.
polars dependency Runtime dependency required for this page's pipeline or analysis code. Open-source Python ecosystem maintainers. Version pinned by project environment until dependency updates are applied. Library updates may change behavior or defaults.
scipy dependency Runtime dependency required for this page's pipeline or analysis code. Open-source Python ecosystem maintainers. Version pinned by project environment until dependency updates are applied. Library updates may change behavior or defaults.