Analysis

16: Transfer Hub Performance

Route and Service Drivers

Coverage: 2019-01 to 2025-11 (from otp_monthly).

Built 2026-04-03 20:09 UTC · Commit 7c56b9a

Data Provenance

flowchart LR
  16_transfer_hub_performance(["16: Transfer Hub Performance"])
  t_otp_monthly[("otp_monthly")] --> 16_transfer_hub_performance
  01_data_ingestion[["Data Ingestion"]] --> t_otp_monthly
  u1_01_data_ingestion[/"data/routes_by_month.csv"/] --> 01_data_ingestion
  u2_01_data_ingestion[/"data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv"/] --> 01_data_ingestion
  u3_01_data_ingestion[/"data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv"/] --> 01_data_ingestion
  u4_01_data_ingestion[/"data/PRT_Stop_Reference_Lookup_Table.csv"/] --> 01_data_ingestion
  u5_01_data_ingestion[/"data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv"/] --> 01_data_ingestion
  t_route_stops[("route_stops")] --> 16_transfer_hub_performance
  01_data_ingestion[["Data Ingestion"]] --> t_route_stops
  t_routes[("routes")] --> 16_transfer_hub_performance
  01_data_ingestion[["Data Ingestion"]] --> t_routes
  t_stops[("stops")] --> 16_transfer_hub_performance
  01_data_ingestion[["Data Ingestion"]] --> t_stops
  d1_16_transfer_hub_performance(("polars (lib)")) --> 16_transfer_hub_performance
  d2_16_transfer_hub_performance(("scipy (lib)")) --> 16_transfer_hub_performance
  classDef page fill:#dbeafe,stroke:#1d4ed8,color:#1e3a8a,stroke-width:2px;
  classDef table fill:#ecfeff,stroke:#0e7490,color:#164e63;
  classDef dep fill:#fff7ed,stroke:#c2410c,color:#7c2d12,stroke-dasharray: 4 2;
  classDef file fill:#eef2ff,stroke:#6366f1,color:#3730a3;
  classDef api fill:#f0fdf4,stroke:#16a34a,color:#14532d;
  classDef pipeline fill:#f5f3ff,stroke:#7c3aed,color:#4c1d95;
  class 16_transfer_hub_performance page;
  class t_otp_monthly,t_route_stops,t_routes,t_stops table;
  class d1_16_transfer_hub_performance,d2_16_transfer_hub_performance dep;
  class u1_01_data_ingestion,u2_01_data_ingestion,u3_01_data_ingestion,u4_01_data_ingestion,u5_01_data_ingestion file;
  class 01_data_ingestion pipeline;

Findings

Findings: Transfer Hub Performance

Summary

Higher-connectivity stops appear to have modestly worse OTP in stop-level data, but this finding does not survive correction for non-independence. At the route level (independent observations), the correlation between stop connectivity and OTP is not significant (r = -0.15, p = 0.16). The apparent "hub penalty" has two sources: a composition effect (which routes converge at hubs), and stop-level significance that is inflated by treating 6,209 non-independent stops as separate observations.

Key Numbers

Tier | Stops | Mean OTP | Median OTP
Simple (1 route) | 3,875 | 69.5% | 69.2%
Medium (2-4 routes) | 2,138 | 66.0% | 65.0%
Hub (5+ routes) | 196 | 66.4% | 65.2%

Stop-level (n=6,209 -- non-independent, inflated power):

  • Pearson r = -0.17 (p < 0.001)
  • Spearman rho = -0.32 (p < 0.001)

Route-level (n=93 -- independent observations):

  • Pearson r = -0.15 (p = 0.16)

Route-level, bus only (n=90):

  • Pearson r = -0.09 (p = 0.39)

Observations

  • The stop-level correlations (r = -0.17, rho = -0.32) are statistically significant but misleading: the 6,209 "stops" are not independent observations. Stops on the same route share the same underlying route OTP, so the effective sample size is roughly 90, on the order of the number of distinct routes. With n_eff ~ 90, a correlation of r = -0.15 yields p = 0.16, which is not significant at the conventional 0.05 level.
  • The route-level analysis confirms this: average stop connectivity per route has no significant relationship with route OTP (r = -0.15, p = 0.16). Within bus routes only, the relationship is even weaker (r = -0.09, p = 0.39).
  • The 3.1 pp gap between simple stops and hubs (69.5% vs 66.4%) is real in the raw data but reflects a composition effect: hubs are served by many routes, including poor-performing local bus routes, which drag down the average. The gap to medium stops (66.0%) is 3.5 pp and has the same explanation. The hub location itself is not causing worse OTP.
  • The busiest hub (East Busway Penn Station, 27 routes) actually outperforms the system average (72.1%) because it sits on dedicated right-of-way.
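The sample-size argument in the first observation can be checked with a rough calculation. The sketch below uses the Fisher z-transformation with a normal approximation, chosen here only to keep the example dependency-free. It is not the project's `correlate` helper, and exact t-based p-values will differ slightly. The function name `approx_corr_p_value` is hypothetical.

```python
import math


def approx_corr_p_value(r: float, n: int) -> float:
    """Approximate two-sided p-value for a Pearson r from n independent observations.

    Under the null hypothesis of zero correlation, atanh(r) * sqrt(n - 3)
    is approximately standard normal (Fisher z-transformation).
    """
    if n <= 3:
        raise ValueError("need n > 3 for the Fisher approximation")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided normal tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))


# Similar correlation sizes, very different evidence depending on n.
p_stops = approx_corr_p_value(-0.17, 6209)  # stops treated as if independent
p_routes = approx_corr_p_value(-0.15, 90)   # effective n near the route count

print(f"n=6209: p = {p_stops:.2e}")
print(f"n=90:   p = {p_routes:.2f}")
```

With thousands of nominal observations, even a weak correlation looks overwhelmingly significant. With about 90 effectively independent ones, a correlation of nearly the same size is not significant.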

Implication

Being a transfer hub does not independently predict worse OTP. The apparent hub penalty is driven by which routes converge there. Policy should focus on improving the poorly-performing routes themselves, not on the hub locations.

Caveats

  • This analysis uses route-level OTP projected onto stops (ecological fallacy). We don't have stop-level OTP data; a route's on-time performance may vary along its length.
  • The route-level analysis uses "average stop connectivity per route," which is itself an approximation. A more direct test would require stop-level arrival data.
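To make "average stop connectivity per route" concrete, here is a minimal plain-Python sketch of the same idea that `load_data` implements with Polars: count the distinct routes at each stop, then average those counts over the stops on each route. The route and stop IDs are made up for illustration, and `avg_stop_connectivity` is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (route_id, stop_id) pairs; real pairs would come from route_stops.
ROUTE_STOP_PAIRS = [
    ("R1", "A"), ("R1", "B"), ("R1", "C"),
    ("R2", "B"), ("R2", "C"),
    ("R3", "C"), ("R3", "D"),
]


def avg_stop_connectivity(pairs):
    """Return {route_id: mean number of distinct routes serving that route's stops}."""
    routes_at_stop = defaultdict(set)
    stops_on_route = defaultdict(set)
    for route_id, stop_id in pairs:
        routes_at_stop[stop_id].add(route_id)
        stops_on_route[route_id].add(stop_id)
    return {
        route_id: mean(len(routes_at_stop[stop_id]) for stop_id in stops)
        for route_id, stops in stops_on_route.items()
    }


connectivity = avg_stop_connectivity(ROUTE_STOP_PAIRS)
for route_id, value in sorted(connectivity.items()):
    print(route_id, value)
```

Each route contributes exactly one (connectivity, OTP) pair to the correlation, which is what makes the route-level observations independent in a way the stop-level ones are not.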

Methods

Methods: Transfer Hub Performance

Question

Do passengers at major transfer hubs -- stops served by many routes -- experience worse reliability than passengers at simpler stops? This matters because transfer hub passengers are disproportionately transit-dependent and a missed connection at a hub cascades into longer wait times.

Approach

  • Count distinct routes per stop from route_stops to measure connectivity (number of routes serving each stop).
  • For each stop, compute a trip-weighted average OTP across all routes serving it.
  • Classify stops as hubs (5+ routes), medium (2-4 routes), or simple (1 route).
  • Compare OTP distributions across these tiers.
  • Identify the busiest hubs and their OTP.
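  • Plot connectivity against stop-level OTP as a scatter plot.
  • Aggregate to the route level (average stop connectivity per route vs route OTP) and correlate, for all routes and for bus routes only, so the relationship is tested on independent observations.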
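The stop-level steps above can be sketched in plain Python on a handful of made-up rows. This is an illustration of the logic only, not the project's Polars implementation. The row values and the names `classify` and `summarize_stops` are assumptions; the tier thresholds follow the list above.

```python
from collections import defaultdict

# Hypothetical rows: (stop_id, route_id, weekday_trips, route_avg_otp)
ROWS = [
    ("S1", "R1", 40, 0.72),
    ("S2", "R1", 40, 0.72),
    ("S2", "R2", 60, 0.62),
    ("S3", "R1", 10, 0.72),
    ("S3", "R2", 20, 0.62),
    ("S3", "R3", 30, 0.66),
    ("S3", "R4", 20, 0.70),
    ("S3", "R5", 20, 0.60),
]


def classify(n_routes: int) -> str:
    """Map a route count to the tier labels used in this analysis."""
    if n_routes >= 5:
        return "hub (5+)"
    if n_routes >= 2:
        return "medium (2-4)"
    return "simple (1)"


def summarize_stops(rows):
    """Return {stop_id: {n_routes, avg_otp, tier}} with trip-weighted OTP."""
    by_stop = defaultdict(list)
    for stop_id, route_id, trips, otp in rows:
        by_stop[stop_id].append((route_id, trips, otp))

    summary = {}
    for stop_id, served in by_stop.items():
        n_routes = len({route_id for route_id, _, _ in served})
        total_trips = sum(trips for _, trips, _ in served)
        # Weight each route's OTP by its trips at this stop; skip if no trips.
        avg_otp = (
            sum(trips * otp for _, trips, otp in served) / total_trips
            if total_trips
            else None
        )
        summary[stop_id] = {
            "n_routes": n_routes,
            "avg_otp": avg_otp,
            "tier": classify(n_routes),
        }
    return summary


stops = summarize_stops(ROWS)
for stop_id, info in sorted(stops.items()):
    print(stop_id, info["tier"], round(info["avg_otp"], 3))
```

In this toy data every route has the same OTP at every stop it serves, so the lower average at the five-route stop comes purely from which routes serve it, not from the stop itself.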

Data

Name | Description | Source
route_stops | Which routes serve which stops, with trip counts | prt.db table
stops | Stop names and coordinates | prt.db table
otp_monthly | Monthly OTP per route | prt.db table
routes | Mode for context | prt.db table

Output

  • output/hub_performance.csv -- per-stop connectivity, OTP, and classification
  • output/connectivity_vs_otp.png -- scatter plot of routes-per-stop vs OTP
  • output/hub_tier_comparison.png -- box plot of OTP by hub tier
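For readers who want to work with the CSV directly, the following is a minimal sketch of re-deriving the tier table from it. It assumes the file carries the `avg_otp` and `tier` columns of the stop-level frame, which depends on how the project's `save_csv` helper writes the file. The sample rows are made up and `tier_summary` is a hypothetical name.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, median


def tier_summary(csv_file):
    """Return {tier: (n_stops, mean_otp, median_otp)} from an open CSV file."""
    by_tier = defaultdict(list)
    for row in csv.DictReader(csv_file):
        by_tier[row["tier"]].append(float(row["avg_otp"]))
    return {
        tier: (len(values), mean(values), median(values))
        for tier, values in by_tier.items()
    }


# Tiny made-up sample standing in for output/hub_performance.csv
# (only a subset of the expected columns is shown).
SAMPLE = io.StringIO(
    "stop_id,stop_name,n_routes,avg_otp,tier\n"
    "S1,Example A,1,0.70,simple (1)\n"
    "S2,Example B,1,0.68,simple (1)\n"
    "S3,Example C,6,0.66,hub (5+)\n"
)

summary = tier_summary(SAMPLE)
for tier, (n, avg, med) in summary.items():
    print(f"{tier}: n={n}, mean OTP={avg:.1%}, median OTP={med:.1%}")
```

To run it against the real output, pass `open("output/hub_performance.csv", newline="")` in place of `SAMPLE`.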

Source Code

"""Transfer hub analysis: do high-connectivity stops have worse OTP?"""

import math

import polars as pl

from prt_otp_analysis.common import (
    analysis_dir,
    correlate,
    phase,
    query_to_polars,
    run_analysis,
    save_chart,
    save_csv,
    setup_plotting,
    weighted_mean,
)

OUT = analysis_dir(__file__)


def load_data() -> tuple[pl.DataFrame, pl.DataFrame]:
    """Compute per-stop connectivity and trip-weighted OTP, plus route-level aggregation."""
    stop_route_otp = query_to_polars("""
        SELECT rs.stop_id, s.stop_name, s.lat, s.lon, s.muni,
               rs.route_id, rs.trips_wd, r.mode,
               AVG(o.otp) AS route_avg_otp
        FROM route_stops rs
        JOIN stops s ON rs.stop_id = s.stop_id
        JOIN otp_monthly o ON rs.route_id = o.route_id
        JOIN routes r ON rs.route_id = r.route_id
        GROUP BY rs.stop_id, s.stop_name, s.lat, s.lon, s.muni,
                 rs.route_id, rs.trips_wd, r.mode
    """)

    # Per-stop: trip-weighted OTP and route count
    stop_otp = (
        stop_route_otp
        .group_by("stop_id", "stop_name", "lat", "lon", "muni")
        .agg(
            avg_otp=weighted_mean("route_avg_otp", "trips_wd"),
            n_routes=pl.col("route_id").n_unique(),
            total_trips=pl.col("trips_wd").sum(),
        )
    )
    stop_otp = stop_otp.with_columns(
        pl.when(pl.col("n_routes") >= 5).then(pl.lit("hub (5+)"))
        .when(pl.col("n_routes") >= 2).then(pl.lit("medium (2-4)"))
        .otherwise(pl.lit("simple (1)"))
        .alias("tier")
    )
    stop_otp = stop_otp.drop_nulls("avg_otp").filter(pl.col("avg_otp").is_not_nan())

    # Route-level: mean connectivity of stops on each route, plus route OTP
    route_connectivity = (
        stop_route_otp
        .group_by("route_id")
        .agg(
            pl.col("stop_id").n_unique().alias("n_stops_on_route"),
            pl.col("route_avg_otp").first().alias("route_otp"),
            pl.col("mode").first().alias("mode"),
        )
    )
    # Count distinct routes per stop, then average per route
    stop_counts = (
        stop_route_otp
        .group_by("stop_id")
        .agg(pl.col("route_id").n_unique().alias("stop_n_routes"))
    )
    route_avg_connectivity = (
        stop_route_otp.select("route_id", "stop_id")
        .unique()
        .join(stop_counts, on="stop_id", how="left")
        .group_by("route_id")
        .agg(pl.col("stop_n_routes").mean().alias("avg_stop_connectivity"))
    )
    route_level = route_connectivity.join(route_avg_connectivity, on="route_id", how="left")
    route_level = route_level.drop_nulls("route_otp").filter(pl.col("route_otp").is_not_nan())

    return stop_otp, route_level


def analyze(stop_df: pl.DataFrame, route_df: pl.DataFrame) -> dict:
    """Compute tier statistics, stop-level and route-level correlations."""
    clean = stop_df
    results = {}
    results["n_stops"] = len(clean)

    for tier in ["hub (5+)", "medium (2-4)", "simple (1)"]:
        subset = clean.filter(pl.col("tier") == tier)
        key = tier.split(" ")[0]
        results[f"{key}_n"] = len(subset)
        results[f"{key}_mean_otp"] = subset["avg_otp"].mean()
        results[f"{key}_median_otp"] = subset["avg_otp"].median()

    # Stop-level correlation (inflated n, caveat in output)
    stop_corr = correlate(clean, "n_routes", "avg_otp")
    results["stop_connectivity_r"] = stop_corr["pearson_r"]
    results["stop_connectivity_p"] = stop_corr["pearson_p"]
    results["stop_connectivity_rho"] = stop_corr["spearman_r"]
    results["stop_connectivity_rho_p"] = stop_corr["spearman_p"]

    # Route-level correlation (independent observations)
    route_corr = correlate(route_df, "avg_stop_connectivity", "route_otp")
    results["route_connectivity_r"] = route_corr["pearson_r"]
    results["route_connectivity_p"] = route_corr["pearson_p"]
    results["n_routes"] = len(route_df)

    # Bus-only route-level
    bus = route_df.filter(pl.col("mode") == "BUS")
    bus_corr = correlate(bus, "avg_stop_connectivity", "route_otp", min_n=6)
    results["bus_route_connectivity_r"] = bus_corr["pearson_r"]
    results["bus_route_connectivity_p"] = bus_corr["pearson_p"]
    results["n_bus_routes"] = bus_corr["n"]

    return results


def make_charts(df: pl.DataFrame, results: dict) -> None:
    """Generate scatter and box plots."""
    plt = setup_plotting()

    # Connectivity vs OTP scatter
    fig, ax = plt.subplots(figsize=(10, 7))
    tier_colors = {"hub (5+)": "#ef4444", "medium (2-4)": "#f59e0b", "simple (1)": "#3b82f6"}
    for tier, color in tier_colors.items():
        subset = df.filter(pl.col("tier") == tier)
        ax.scatter(subset["n_routes"].to_list(), subset["avg_otp"].to_list(),
                   color=color, label=f"{tier} (n={len(subset)})",
                   s=15, alpha=0.4, edgecolors="none")
    ax.set_xlabel("Number of Routes Serving Stop")
    ax.set_ylabel("Trip-Weighted Average OTP")
    ax.set_title(f"Stop Connectivity vs OTP (stop-level r={results['stop_connectivity_r']:.3f}, "
                 f"route-level r={results['route_connectivity_r']:.3f})")
    ax.legend(fontsize=9)
    ax.set_ylim(0.4, 1.0)
    save_chart(fig, OUT / "connectivity_vs_otp.png")

    # Hub tier box plot
    fig, ax = plt.subplots(figsize=(8, 6))
    tiers = ["simple (1)", "medium (2-4)", "hub (5+)"]
    box_data = [df.filter(pl.col("tier") == t)["avg_otp"].to_list() for t in tiers]
    bp = ax.boxplot(box_data, tick_labels=[f"{t}\n(n={len(d)})" for t, d in zip(tiers, box_data)],
                    patch_artist=True)
    colors = ["#3b82f6", "#f59e0b", "#ef4444"]
    for patch, color in zip(bp["boxes"], colors):
        patch.set_facecolor(color)
        patch.set_alpha(0.6)
    ax.set_ylabel("Trip-Weighted Average OTP")
    ax.set_title("OTP by Stop Connectivity Tier")
    save_chart(fig, OUT / "hub_tier_comparison.png")


@run_analysis(16, "Transfer Hub Performance")
def main() -> None:
    """Entry point: load data, analyze, chart, and save."""
    with phase("Loading data"):
        stop_df, route_df = load_data()
        print(f"  {len(stop_df)} stops with OTP and connectivity data")
        print(f"  {len(route_df)} routes with avg stop connectivity")

    with phase("Analyzing"):
        results = analyze(stop_df, route_df)
        for tier, key in [("Hub (5+)", "hub"), ("Medium (2-4)", "medium"), ("Simple (1)", "simple")]:
            print(f"  {tier}: n={results[f'{key}_n']}, "
                  f"mean OTP={results[f'{key}_mean_otp']:.1%}, "
                  f"median OTP={results[f'{key}_median_otp']:.1%}")

        print(f"\n  Stop-level (n={results['n_stops']}, non-independent -- inflated power):")
        print(f"    Pearson r = {results['stop_connectivity_r']:.4f} (p = {results['stop_connectivity_p']:.4f})")
        print(f"    Spearman rho = {results['stop_connectivity_rho']:.4f} (p = {results['stop_connectivity_rho_p']:.4f})")
        print(f"\n  Route-level (n={results['n_routes']}, independent observations):")
        print(f"    Pearson r = {results['route_connectivity_r']:.4f} (p = {results['route_connectivity_p']:.4f})")
        if not math.isnan(results["bus_route_connectivity_r"]):
            print(f"\n  Route-level, bus only (n={results['n_bus_routes']}):")
            print(f"    Pearson r = {results['bus_route_connectivity_r']:.4f} "
                  f"(p = {results['bus_route_connectivity_p']:.4f})")

        # Top hubs
        print("\nBusiest hubs:")
        hubs = stop_df.filter(pl.col("tier") == "hub (5+)").sort("n_routes", descending=True).head(10)
        for row in hubs.iter_rows(named=True):
            print(f"  {row['stop_name']:<40s} {row['n_routes']} routes, "
                  f"OTP={row['avg_otp']:.1%}, {row['total_trips']} weekday trips")

        save_csv(stop_df, OUT / "hub_performance.csv")

    with phase("Generating charts"):
        make_charts(stop_df, results)


if __name__ == "__main__":
    main()

Sources

Name | Type | Why It Matters | Owner | Freshness | Caveat
otp_monthly | table | Primary analytical table used in this page's computations. | Produced by Data Ingestion. | Updated when the producing pipeline step is rerun. | Coverage depends on upstream source availability and ETL assumptions.
route_stops | table | Primary analytical table used in this page's computations. | Produced by Data Ingestion. | Updated when the producing pipeline step is rerun. | Coverage depends on upstream source availability and ETL assumptions.
routes | table | Primary analytical table used in this page's computations. | Produced by Data Ingestion. | Updated when the producing pipeline step is rerun. | Coverage depends on upstream source availability and ETL assumptions.
stops | table | Primary analytical table used in this page's computations. | Produced by Data Ingestion. | Updated when the producing pipeline step is rerun. | Coverage depends on upstream source availability and ETL assumptions.
polars | dependency | Runtime dependency required for this page's pipeline or analysis code. | Open-source Python ecosystem maintainers. | Version pinned by project environment until dependency updates are applied. | Library updates may change behavior or defaults.
scipy | dependency | Runtime dependency required for this page's pipeline or analysis code. | Open-source Python ecosystem maintainers. | Version pinned by project environment until dependency updates are applied. | Library updates may change behavior or defaults.

Upstream sources (5), listed identically for otp_monthly, route_stops, routes, and stops:
  • file data/routes_by_month.csv: Monthly route OTP source table in wide format.
  • file data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv: Current route metadata and mode classifications.
  • file data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv: Current stop-to-route coverage and trip counts.
  • file data/PRT_Stop_Reference_Lookup_Table.csv: Historical stop reference file with geography attributes.
  • file data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv: Average ridership by route and month.