Analysis

03 - Route Ranking

Core OTP Patterns

Coverage: 2019-01 to 2025-11 (from otp_monthly).

Built 2026-04-03 20:09 UTC · Commit 7c56b9a

Page Navigation

Analysis Navigation

Data Provenance

flowchart LR
  03_route_ranking(["03 - Route Ranking"])
  t_otp_monthly[("otp_monthly")] --> 03_route_ranking
  01_data_ingestion[["Data Ingestion"]] --> t_otp_monthly
  u1_01_data_ingestion[/"data/routes_by_month.csv"/] --> 01_data_ingestion
  u2_01_data_ingestion[/"data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv"/] --> 01_data_ingestion
  u3_01_data_ingestion[/"data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv"/] --> 01_data_ingestion
  u4_01_data_ingestion[/"data/PRT_Stop_Reference_Lookup_Table.csv"/] --> 01_data_ingestion
  u5_01_data_ingestion[/"data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv"/] --> 01_data_ingestion
  t_route_stops[("route_stops")] --> 03_route_ranking
  01_data_ingestion[["Data Ingestion"]] --> t_route_stops
  t_routes[("routes")] --> 03_route_ranking
  01_data_ingestion[["Data Ingestion"]] --> t_routes
  d1_03_route_ranking(("matplotlib (lib)")) --> 03_route_ranking
  d2_03_route_ranking(("numpy (lib)")) --> 03_route_ranking
  d3_03_route_ranking(("polars (lib)")) --> 03_route_ranking
  d4_03_route_ranking(("scipy (lib)")) --> 03_route_ranking
  classDef page fill:#dbeafe,stroke:#1d4ed8,color:#1e3a8a,stroke-width:2px;
  classDef table fill:#ecfeff,stroke:#0e7490,color:#164e63;
  classDef dep fill:#fff7ed,stroke:#c2410c,color:#7c2d12,stroke-dasharray: 4 2;
  classDef file fill:#eef2ff,stroke:#6366f1,color:#3730a3;
  classDef api fill:#f0fdf4,stroke:#16a34a,color:#14532d;
  classDef pipeline fill:#f5f3ff,stroke:#7c3aed,color:#4c1d95;
  class 03_route_ranking page;
  class t_otp_monthly,t_route_stops,t_routes table;
  class d1_03_route_ranking,d2_03_route_ranking,d3_03_route_ranking,d4_03_route_ranking dep;
  class u1_01_data_ingestion,u2_01_data_ingestion,u3_01_data_ingestion,u4_01_data_ingestion,u5_01_data_ingestion file;
  class 01_data_ingestion pipeline;

Findings

Findings: Route Ranking

Summary

94 routes had sufficient data (12+ months) to rank. Rankings use trailing 12-month average OTP to reflect current performance, and post-2022 slope to capture recent trajectory without COVID distortion. Slopes now include 95% confidence intervals via scipy.stats.linregress; 51 of 94 slopes are statistically significant (CI excludes zero). 3 routes were flagged as high-volatility. Routes are ranked both overall and within their mode (BUS, RAIL, UNKNOWN).

Regression to the Mean

Caution: Extreme-ranked routes -- both top/bottom performers and most-improving/declining -- are expected to regress toward the mean in subsequent periods. This is a statistical phenomenon, not an operational one: routes that happen to have unusually good or bad stretches will tend to look less extreme next time, even without any intervention. Rankings should be interpreted as snapshots, not predictions. A route appearing at the top or bottom of a list does not necessarily mean it will stay there. Formal empirical Bayes shrinkage could partially correct for this, but is beyond the scope of this analysis. Readers should weight these rankings accordingly, especially for routes with fewer observations or higher volatility.

Top 5 Routes (by trailing 12-month OTP)

Route Mode Mode Rank Recent OTP All-Time OTP Stops
G2 - West Busway BUS 1/89 88.4% 81.7% 24
18 - Manchester BUS 2/89 87.5% 88.4% 43
P1 - East Busway-All Stops BUS 3/89 83.9% 84.5% 24
39 - Brookline BUS 4/89 82.6% 78.9% 69
43 - Bailey BUS 5/89 81.8% 79.5% 65

All top 5 are BUS routes. The 3 RAIL routes rank 9th (SLVR, 78.8%), 12th (BLUE, 77.4%), and 30th (RED, 72.8%) overall.

Bottom 5 Routes (by trailing 12-month OTP)

Route Mode Mode Rank Recent OTP All-Time OTP Stops
71B - Highland Park BUS 89/89 41.9% 58.8% 107
61C - McKeesport-Homestead BUS 88/89 44.8% 56.8% 158
65 - Squirrel Hill BUS 87/89 46.5% 61.5% 70
58 - Greenfield BUS 86/89 49.8% 60.8% 102
61B - Braddock-Swissvale BUS 85/89 50.1% 58.4% 137

Post-COVID Trends (2022 onward)

Slopes are computed via OLS with standard errors. Only statistically significant slopes (95% CI excludes zero) are listed below. 43 of 94 routes have slopes that are not statistically significant -- their trend cannot be distinguished from flat.

Most improving (statistically significant):

Route Slope (pp/yr) 95% CI Months
P78 - Oakmont Flyer +6.6 [+5.0, +8.2] 47
71D - Hamilton +4.1 [+2.5, +5.7] 47
O1 - Ross Flyer +3.4 [+2.1, +4.6] 47
39 - Brookline +3.2 [+0.9, +5.5] 45
G31 - Bridgeville Flyer +2.6 [+1.6, +3.6] 47

Most declining (statistically significant):

Route Slope (pp/yr) 95% CI Months
65 - Squirrel Hill -8.3 [-11.7, -4.9] 46
71B - Highland Park -7.2 [-8.8, -5.5] 47
P12 - Holiday Park Flyer -5.8 [-7.4, -4.2] 47
81 - Oak Hill -5.8 [-7.3, -4.3] 47
61C - McKeesport-Homestead -5.4 [-6.6, -4.2] 47

Notable non-significant slopes: SWL (Outbound to SHJ) has a point estimate of -10.4 pp/yr but its 95% CI is [-34.7, +13.8] due to having only 13 observations over a 21-month span -- the apparent steep decline cannot be statistically distinguished from zero.

Observation Span

Most routes with slopes have the full 47-month post-2022 span. Two routes have notably narrow observation windows:

  • SWL (Outbound to SHJ): 13 observations over 21 months
  • P2 (East Busway Short): 21 observations over 21 months (significant slope, but over less than half the full window)

Routes with narrow spans may have slopes that are less representative of sustained trends.

Observations

  • Using trailing 12-month OTP shifts the rankings compared to all-time averages: G2 (West Busway) has improved markedly and now leads, while 71B (Highland Park) and 65 (Squirrel Hill) have deteriorated significantly.
  • The post-COVID slope isolates recent trajectory without the structural break caused by the 2020 COVID ridership drop, which dominated full-period slopes.
  • High-volatility routes (std > 2x median) include SWL, 15 (Charles), and 65 (Squirrel Hill), which have extreme month-to-month swings.
  • 4 routes were excluded from ranking for having fewer than 12 months of data.
  • 89 of the 94 ranked routes are BUS, 3 are RAIL, and 2 are UNKNOWN mode. Within-mode ranks are provided in the CSV to allow fair comparisons.

Missing Stop Count Data

5 routes lack stop count data because they have no entries in the route_stops table: SWL (Outbound to SHJ), 37 (Castle Shannon), 42 (Potomac), P2 (East Busway Short), and RLSH (Red Line Shuttle). Stop counts for these routes are reported as null.

Caveats

  • Stop counts come from current data; historical stop counts may have differed.
  • Post-COVID slope assumes a roughly linear trajectory from 2022 onward, which may not hold for all routes.
  • Regression to the mean (see dedicated section above) means that top/bottom rankings and most-improving/declining lists overstate the persistence of extreme performance. These lists identify current outliers, not necessarily future ones.
  • Routes with narrow observation spans (SWL, P2) have less reliable slope estimates even when statistically significant.

Review History

Output

Methods

Methods: Route Ranking

Question

Which routes are the best and worst performers? Which are improving or declining? Which are most volatile?

Approach

  • Compute per-route summary stats: mean OTP, standard deviation, min, max.
  • Compute trailing 12-month average OTP as "recent performance" metric.
  • Fit a simple linear slope (OTP vs. time) per route for the post-2022 period only (POST_COVID_START = "2022-01") to quantify recent trend direction without COVID distortion. Slopes are computed via scipy.stats.linregress, which provides standard errors.
  • For each slope, compute a 95% confidence interval and flag whether it is statistically significant (CI excludes zero).
  • A zero-variance guard prevents division-by-zero when all observations fall in the same month.
  • Compute actual observation span (first-to-last month range) alongside observation count to identify routes with narrow windows.
  • Rank routes three ways: by recent average OTP, by trend slope, and by volatility (std dev). Rankings are computed both overall and within mode (BUS, RAIL, etc.).
  • Flag routes with high volatility that may warrant anomaly investigation.

Data

Name Description Source
otp_monthly Monthly OTP per route prt.db table
routes Route metadata (name, mode) prt.db table
route_stops Stop count as a complexity proxy (5 routes lack entries in this table) prt.db table

Output

  • output/route_ranking.csv -- per-route summary stats, slopes with SEs/CIs, overall and within-mode rankings
  • output/top_bottom_routes.png -- best/worst performers chart

Source Code

"""Route ranking by average OTP, trend slope, and volatility."""

import numpy as np
import polars as pl
from scipy import stats

from prt_otp_analysis.common import MODE_COLORS, Z_CRITICAL_95, analysis_dir, phase, query_to_polars, run_analysis, save_chart, save_csv, setup_plotting

OUT = analysis_dir(__file__)

MIN_MONTHS = 12  # minimum months of data to include in rankings
POST_COVID_START = "2022-01"  # start of post-COVID period for slope calculation


def load_data() -> tuple[pl.DataFrame, pl.DataFrame]:
    """Load OTP data with route metadata, and stop counts per route."""
    otp = query_to_polars("""
        SELECT o.route_id, o.month, o.otp, r.route_name, r.mode
        FROM otp_monthly o
        JOIN routes r ON o.route_id = r.route_id
    """)
    stop_counts = query_to_polars("""
        SELECT route_id, COUNT(DISTINCT stop_id) AS stop_count
        FROM route_stops
        GROUP BY route_id
    """)
    return otp, stop_counts


def analyze(otp: pl.DataFrame, stop_counts: pl.DataFrame) -> pl.DataFrame:
    """Compute per-route summary stats, post-COVID slope, and rankings."""
    # Per-route all-time summary stats
    summary = (
        otp.group_by(["route_id", "route_name", "mode"])
        .agg(
            months=pl.col("otp").count(),
            mean_otp=pl.col("otp").mean(),
            std_otp=pl.col("otp").std(),
            min_otp=pl.col("otp").min(),
            max_otp=pl.col("otp").max(),
        )
        .sort("route_id")
    )

    # Trailing 12-month average OTP
    all_months = otp.select("month").unique().sort("month")["month"].to_list()
    trailing_start = all_months[-12] if len(all_months) >= 12 else all_months[0]
    recent_avg = (
        otp.filter(pl.col("month") >= trailing_start)
        .group_by("route_id")
        .agg(recent_mean_otp=pl.col("otp").mean())
    )
    summary = summary.join(recent_avg, on="route_id", how="left")

    # Post-COVID slope (per year) using scipy.stats.linregress for SEs and CIs
    post_covid = otp.filter(pl.col("month") >= POST_COVID_START)
    pc_months = post_covid.select("month").unique().sort("month")
    pc_months = pc_months.with_row_index("time_idx")
    post_covid = post_covid.join(pc_months, on="month")

    # Compute actual time span (in months) for each route's post-COVID observations
    span_df = (
        post_covid.group_by("route_id")
        .agg(
            slope_months=pl.col("otp").count(),
            first_month=pl.col("month").min(),
            last_month=pl.col("month").max(),
            time_idx_min=pl.col("time_idx").min(),
            time_idx_max=pl.col("time_idx").max(),
        )
        .with_columns(
            obs_span_months=(pl.col("time_idx_max") - pl.col("time_idx_min") + 1).cast(pl.Int64),
        )
    )

    # Only keep routes with enough post-COVID data
    span_df = span_df.filter(pl.col("slope_months") >= MIN_MONTHS)

    # Compute slope, stderr, and CI per route using linregress (with zero-variance guard)
    slope_records = []
    for route_id in span_df["route_id"].to_list():
        route_data = post_covid.filter(pl.col("route_id") == route_id).sort("time_idx")
        x = route_data["time_idx"].to_numpy().astype(float)
        y = route_data["otp"].to_numpy().astype(float)

        # Zero-variance guard: if all time_idx values are the same, slope is undefined
        if np.var(x) == 0:
            slope_records.append({
                "route_id": route_id,
                "slope": 0.0,
                "slope_stderr": float("nan"),
                "slope_ci_lo": float("nan"),
                "slope_ci_hi": float("nan"),
                "slope_significant": False,
            })
            continue

        result = stats.linregress(x, y)
        slope_per_month = result.slope
        stderr_per_month = result.stderr
        # Convert to per-year units
        slope_yr = slope_per_month * 12
        stderr_yr = stderr_per_month * 12
        ci_lo = slope_yr - Z_CRITICAL_95 * stderr_yr
        ci_hi = slope_yr + Z_CRITICAL_95 * stderr_yr
        # Significant if 95% CI excludes zero
        significant = (ci_lo > 0) or (ci_hi < 0)

        slope_records.append({
            "route_id": route_id,
            "slope": slope_yr,
            "slope_stderr": stderr_yr,
            "slope_ci_lo": ci_lo,
            "slope_ci_hi": ci_hi,
            "slope_significant": significant,
        })

    slope_df = pl.DataFrame(slope_records)
    slope_df = slope_df.join(
        span_df.select("route_id", "slope_months", "obs_span_months", "first_month", "last_month"),
        on="route_id",
        how="left",
    )

    summary = summary.join(
        slope_df.select(
            "route_id", "slope", "slope_stderr", "slope_ci_lo", "slope_ci_hi",
            "slope_significant", "slope_months", "obs_span_months",
        ),
        on="route_id",
        how="left",
    )

    # Join stop counts
    summary = summary.join(stop_counts, on="route_id", how="left")

    # Flag limited-data routes
    summary = summary.with_columns(
        limited_data=pl.col("months") < MIN_MONTHS,
    )

    # Flag high-volatility routes (std > 2x median std across all routes)
    median_std = summary.filter(~pl.col("limited_data"))["std_otp"].median()
    summary = summary.with_columns(
        high_volatility=pl.col("std_otp") > (2 * median_std),
    )

    # Rankings (only for routes with sufficient data)
    rankable = summary.filter(~pl.col("limited_data"))
    avg_ranks = rankable.select("route_id", pl.col("recent_mean_otp").rank(descending=True).alias("rank_avg"))
    slope_ranks = rankable.filter(pl.col("slope").is_not_null()).select(
        "route_id", pl.col("slope").rank(descending=True).alias("rank_slope")
    )
    vol_ranks = rankable.select("route_id", pl.col("std_otp").rank(descending=False).alias("rank_volatility"))

    # Within-mode rank (rank routes within their mode group)
    mode_avg_ranks = (
        rankable.with_columns(
            mode_rank=pl.col("recent_mean_otp").rank(descending=True).over("mode")
        )
        .select("route_id", "mode_rank")
    )

    summary = (
        summary
        .join(avg_ranks, on="route_id", how="left")
        .join(slope_ranks, on="route_id", how="left")
        .join(vol_ranks, on="route_id", how="left")
        .join(mode_avg_ranks, on="route_id", how="left")
    )

    return summary.sort("rank_avg", nulls_last=True)


def make_chart(df: pl.DataFrame) -> None:
    """Generate top/bottom routes bar charts."""
    plt = setup_plotting()
    from matplotlib.patches import Patch

    rankable = df.filter(~pl.col("limited_data"))

    fig, axes = plt.subplots(1, 2, figsize=(16, 8))

    # Left: Top 10 and Bottom 10 by recent average OTP
    ax = axes[0]
    has_recent = rankable.filter(pl.col("recent_mean_otp").is_not_null())
    top10 = has_recent.sort("recent_mean_otp", descending=True).head(10)
    bottom10 = has_recent.sort("recent_mean_otp").head(10).sort("recent_mean_otp", descending=True)
    combined = pl.concat([top10, bottom10])

    labels = [f"{r} - {n}" for r, n in zip(combined["route_id"].to_list(), combined["route_name"].to_list())]
    values = combined["recent_mean_otp"].to_list()
    colors = [MODE_COLORS.get(m, "#9ca3af") for m in combined["mode"].to_list()]

    y_pos = range(len(labels))
    ax.barh(y_pos, values, color=colors)
    ax.set_yticks(y_pos)
    ax.set_yticklabels(labels, fontsize=8)
    ax.set_xlabel("Average OTP (trailing 12 months)")
    ax.set_title("Top 10 & Bottom 10 Routes by Recent OTP")
    ax.invert_yaxis()
    ax.set_xlim(0, 1)

    # Right: Top 10 improving and Top 10 declining by post-COVID slope
    ax = axes[1]
    has_slope = rankable.filter(pl.col("slope").is_not_null())
    improving = has_slope.sort("slope", descending=True).head(10)
    declining = has_slope.sort("slope").head(10).sort("slope", descending=True)
    combined2 = pl.concat([improving, declining])

    labels2 = [f"{r} - {n}" for r, n in zip(combined2["route_id"].to_list(), combined2["route_name"].to_list())]
    values2 = combined2["slope"].to_list()
    colors2 = ["#22c55e" if v >= 0 else "#ef4444" for v in values2]

    y_pos2 = range(len(labels2))
    ax.barh(y_pos2, values2, color=colors2)
    ax.set_yticks(y_pos2)
    ax.set_yticklabels(labels2, fontsize=8)
    ax.set_xlabel("Slope (OTP change per year, post-2022)")
    ax.set_title("Top 10 Improving & Declining Routes (post-COVID)")
    ax.invert_yaxis()
    ax.axvline(0, color="black", linewidth=0.5)

    legend_patches = [Patch(facecolor=c, label=m) for m, c in MODE_COLORS.items() if m != "UNKNOWN"]
    axes[0].legend(handles=legend_patches, loc="lower right", fontsize=8)

    save_chart(fig, OUT / "top_bottom_routes.png")


@run_analysis(3, "Route Ranking")
def main() -> None:
    """Entry point: load data, analyze, chart, and save."""
    with phase("Loading data"):
        otp, stop_counts = load_data()
        print(f"  {len(otp):,} OTP observations, {len(stop_counts)} routes with stop data")

    with phase("Analyzing"):
        result = analyze(otp, stop_counts)
        rankable = result.filter(~pl.col("limited_data"))
        limited = result.filter(pl.col("limited_data"))
        print(f"  {len(rankable)} routes ranked ({MIN_MONTHS}+ months of data)")
        print(f"  {len(limited)} routes excluded (fewer than {MIN_MONTHS} months)")
        hv = result.filter(pl.col("high_volatility"))
        print(f"  {len(hv)} high-volatility routes flagged")
        print(f"  Slope period: {POST_COVID_START} onward (per-year units)")

        # Report slope significance
        has_slope = result.filter(pl.col("slope").is_not_null())
        sig_slopes = has_slope.filter(pl.col("slope_significant") == True)
        print(f"  {len(sig_slopes)} of {len(has_slope)} slopes are statistically significant (95% CI excludes zero)")

        # Report routes with null stop counts
        null_stops = result.filter(pl.col("stop_count").is_null())
        if len(null_stops) > 0:
            ids = null_stops["route_id"].to_list()
            print(f"  {len(null_stops)} routes lack stop count data: {', '.join(str(r) for r in ids)}")

        # Report modes
        for mode in sorted(rankable["mode"].unique().to_list()):
            mode_count = len(rankable.filter(pl.col("mode") == mode))
            print(f"  Mode {mode}: {mode_count} routes ranked")

    with phase("Saving CSV"):
        save_csv(result, OUT / "route_ranking.csv")

    with phase("Generating chart"):
        make_chart(result)


if __name__ == "__main__":
    main()

Sources

NameTypeWhy It MattersOwnerFreshnessCaveat
otp_monthly table Primary analytical table used in this page's computations. Produced by Data Ingestion. Updated when the producing pipeline step is rerun. Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5)
  • file data/routes_by_month.csv — Monthly route OTP source table in wide format.
  • file data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv — Current route metadata and mode classifications.
  • file data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv — Current stop-to-route coverage and trip counts.
  • file data/PRT_Stop_Reference_Lookup_Table.csv — Historical stop reference file with geography attributes.
  • file data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv — Average ridership by route and month.
route_stops table Primary analytical table used in this page's computations. Produced by Data Ingestion. Updated when the producing pipeline step is rerun. Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5)
  • file data/routes_by_month.csv — Monthly route OTP source table in wide format.
  • file data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv — Current route metadata and mode classifications.
  • file data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv — Current stop-to-route coverage and trip counts.
  • file data/PRT_Stop_Reference_Lookup_Table.csv — Historical stop reference file with geography attributes.
  • file data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv — Average ridership by route and month.
routes table Primary analytical table used in this page's computations. Produced by Data Ingestion. Updated when the producing pipeline step is rerun. Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5)
  • file data/routes_by_month.csv — Monthly route OTP source table in wide format.
  • file data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv — Current route metadata and mode classifications.
  • file data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv — Current stop-to-route coverage and trip counts.
  • file data/PRT_Stop_Reference_Lookup_Table.csv — Historical stop reference file with geography attributes.
  • file data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv — Average ridership by route and month.
matplotlib dependency Runtime dependency required for this page's pipeline or analysis code. Open-source Python ecosystem maintainers. Version pinned by project environment until dependency updates are applied. Library updates may change behavior or defaults.
numpy dependency Runtime dependency required for this page's pipeline or analysis code. Open-source Python ecosystem maintainers. Version pinned by project environment until dependency updates are applied. Library updates may change behavior or defaults.
polars dependency Runtime dependency required for this page's pipeline or analysis code. Open-source Python ecosystem maintainers. Version pinned by project environment until dependency updates are applied. Library updates may change behavior or defaults.
scipy dependency Runtime dependency required for this page's pipeline or analysis code. Open-source Python ecosystem maintainers. Version pinned by project environment until dependency updates are applied. Library updates may change behavior or defaults.