Analysis

03 - Route Ranking

Core OTP Patterns

Coverage: 2019-01 to 2025-11 (from otp_monthly).

Built 2026-06-15 11:52 UTC · Commit e5cf673

Page Navigation

Analysis Navigation

Data Provenance

flowchart LR
  03_route_ranking(["03 - Route Ranking"])
  t_otp_monthly[("otp_monthly")] --> 03_route_ranking
  01_data_ingestion[["Data Ingestion"]] --> t_otp_monthly
  u1_01_data_ingestion[/"data/routes_by_month.csv"/] --> 01_data_ingestion
  u2_01_data_ingestion[/"data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv"/] --> 01_data_ingestion
  u3_01_data_ingestion[/"data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv"/] --> 01_data_ingestion
  u4_01_data_ingestion[/"data/PRT_Stop_Reference_Lookup_Table.csv"/] --> 01_data_ingestion
  u5_01_data_ingestion[/"data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv"/] --> 01_data_ingestion
  t_route_stops[("route_stops")] --> 03_route_ranking
  01_data_ingestion[["Data Ingestion"]] --> t_route_stops
  t_routes[("routes")] --> 03_route_ranking
  01_data_ingestion[["Data Ingestion"]] --> t_routes
  d1_03_route_ranking(("matplotlib (lib)")) --> 03_route_ranking
  d2_03_route_ranking(("numpy (lib)")) --> 03_route_ranking
  d3_03_route_ranking(("polars (lib)")) --> 03_route_ranking
  d4_03_route_ranking(("scipy (lib)")) --> 03_route_ranking
  classDef page fill:#dbeafe,stroke:#1d4ed8,color:#1e3a8a,stroke-width:2px;
  classDef table fill:#ecfeff,stroke:#0e7490,color:#164e63;
  classDef dep fill:#fff7ed,stroke:#c2410c,color:#7c2d12,stroke-dasharray: 4 2;
  classDef file fill:#eef2ff,stroke:#6366f1,color:#3730a3;
  classDef api fill:#f0fdf4,stroke:#16a34a,color:#14532d;
  classDef pipeline fill:#f5f3ff,stroke:#7c3aed,color:#4c1d95;
  class 03_route_ranking page;
  class t_otp_monthly,t_route_stops,t_routes table;
  class d1_03_route_ranking,d2_03_route_ranking,d3_03_route_ranking,d4_03_route_ranking dep;
  class u1_01_data_ingestion,u2_01_data_ingestion,u3_01_data_ingestion,u4_01_data_ingestion,u5_01_data_ingestion file;
  class 01_data_ingestion pipeline;

Findings

Findings: Route Ranking

Summary

94 routes had sufficient data (12+ months) to rank. Rankings use trailing 12-month average OTP to reflect current performance, and post-2022 slope to capture recent trajectory without COVID distortion. Slopes now include 95% confidence intervals via scipy.stats.linregress; 51 of 94 slopes are statistically significant (CI excludes zero). 3 routes were flagged as high-volatility. Routes are ranked both overall and within their mode (BUS, RAIL, UNKNOWN).

Regression to the Mean

Caution: Extreme-ranked routes -- both top/bottom performers and most-improving/declining -- are expected to regress toward the mean in subsequent periods. This is a statistical phenomenon, not an operational one: routes that happen to have unusually good or bad stretches will tend to look less extreme next time, even without any intervention. Rankings should be interpreted as snapshots, not predictions. A route appearing at the top or bottom of a list does not necessarily mean it will stay there. Formal empirical Bayes shrinkage could partially correct for this, but is beyond the scope of this analysis. Readers should weight these rankings accordingly, especially for routes with fewer observations or higher volatility.

Top 5 Routes (by trailing 12-month OTP)

Route	Mode	Mode Rank	Recent OTP	All-Time OTP	Stops
G2 - West Busway	BUS	1/89	88.4%	81.7%	24
18 - Manchester	BUS	2/89	87.5%	88.4%	43
P1 - East Busway-All Stops	BUS	3/89	83.9%	84.5%	24
39 - Brookline	BUS	4/89	82.6%	78.9%	69
43 - Bailey	BUS	5/89	81.8%	79.5%	65

All top 5 are BUS routes. The 3 RAIL routes rank 9th (SLVR, 78.8%), 12th (BLUE, 77.4%), and 30th (RED, 72.8%) overall.

Bottom 5 Routes (by trailing 12-month OTP)

Route	Mode	Mode Rank	Recent OTP	All-Time OTP	Stops
71B - Highland Park	BUS	89/89	41.9%	58.8%	107
61C - McKeesport-Homestead	BUS	88/89	44.8%	56.8%	158
65 - Squirrel Hill	BUS	87/89	46.5%	61.5%	70
58 - Greenfield	BUS	86/89	49.8%	60.8%	102
61B - Braddock-Swissvale	BUS	85/89	50.1%	58.4%	137

Post-COVID Trends (2022 onward)

Slopes are computed via OLS with standard errors. Only statistically significant slopes (95% CI excludes zero) are listed below. 43 of 94 routes have slopes that are not statistically significant -- their trend cannot be distinguished from flat.

Most improving (statistically significant):

Route	Slope (pp/yr)	95% CI	Months
P78 - Oakmont Flyer	+6.6	[+5.0, +8.2]	47
71D - Hamilton	+4.1	[+2.5, +5.7]	47
O1 - Ross Flyer	+3.4	[+2.1, +4.6]	47
39 - Brookline	+3.2	[+0.9, +5.5]	45
G31 - Bridgeville Flyer	+2.6	[+1.6, +3.6]	47

Most declining (statistically significant):

Route	Slope (pp/yr)	95% CI	Months
65 - Squirrel Hill	-8.3	[-11.7, -4.9]	46
71B - Highland Park	-7.2	[-8.8, -5.5]	47
P12 - Holiday Park Flyer	-5.8	[-7.4, -4.2]	47
81 - Oak Hill	-5.8	[-7.3, -4.3]	47
61C - McKeesport-Homestead	-5.4	[-6.6, -4.2]	47

Notable non-significant slopes: SWL (Outbound to SHJ) has a point estimate of -10.4 pp/yr but its 95% CI is [-34.7, +13.8] due to having only 13 observations over a 21-month span -- the apparent steep decline cannot be statistically distinguished from zero.

Observation Span

Most routes with slopes have the full 47-month post-2022 span. Two routes have notably narrow observation windows:

SWL (Outbound to SHJ): 13 observations over 21 months
P2 (East Busway Short): 21 observations over 21 months (significant slope, but over less than half the full window)

Routes with narrow spans may have slopes that are less representative of sustained trends.

Observations

Using trailing 12-month OTP shifts the rankings compared to all-time averages: G2 (West Busway) has improved markedly and now leads, while 71B (Highland Park) and 65 (Squirrel Hill) have deteriorated significantly.
The post-COVID slope isolates recent trajectory without the structural break caused by the 2020 COVID ridership drop, which dominated full-period slopes.
High-volatility routes (std > 2x median) include SWL, 15 (Charles), and 65 (Squirrel Hill), which have extreme month-to-month swings.
4 routes were excluded from ranking for having fewer than 12 months of data.
89 of the 94 ranked routes are BUS, 3 are RAIL, and 2 are UNKNOWN mode. Within-mode ranks are provided in the CSV to allow fair comparisons.

Missing Stop Count Data

5 routes lack stop count data because they have no entries in the route_stops table: SWL (Outbound to SHJ), 37 (Castle Shannon), 42 (Potomac), P2 (East Busway Short), and RLSH (Red Line Shuttle). Stop counts for these routes are reported as null.

Caveats

Stop counts come from current data; historical stop counts may have differed.
Post-COVID slope assumes a roughly linear trajectory from 2022 onward, which may not hold for all routes.
Regression to the mean (see dedicated section above) means that top/bottom rankings and most-improving/declining lists overstate the persistence of extreme performance. These lists identify current outliers, not necessarily future ones.
Routes with narrow observation spans (SWL, P2) have less reliable slope estimates even when statistically significant.

Review History

2026-02-11: RED-TEAM-REPORTS/2026-02-11-analyses-01-05-07-11.md — 7 issues (1 significant). Added RTM caveat, slope standard errors and 95% CIs, observation span tracking, within-mode rankings, zero-variance guard, updated METHODS.md for post-2022 period, and documented null stop counts.

Output

image top_bottom_routes.png
best/worst performers chart.

No interactive outputs declared.

data route_ranking.csv

per-route summary stats, slopes with SEs/CIs, overall and within-mode rankings.

Preview CSV

Expand to load preview.

Methods

Methods: Route Ranking

Question

Which routes are the best and worst performers? Which are improving or declining? Which are most volatile?

Approach

Compute per-route summary stats: mean OTP, standard deviation, min, max.
Compute trailing 12-month average OTP as "recent performance" metric.
Fit a simple linear slope (OTP vs. time) per route for the post-2022 period only (POST_COVID_START = "2022-01") to quantify recent trend direction without COVID distortion. Slopes are computed via scipy.stats.linregress, which provides standard errors.
For each slope, compute a 95% confidence interval and flag whether it is statistically significant (CI excludes zero).
A zero-variance guard prevents division-by-zero when all observations fall in the same month.
Compute actual observation span (first-to-last month range) alongside observation count to identify routes with narrow windows.
Rank routes three ways: by recent average OTP, by trend slope, and by volatility (std dev). Rankings are computed both overall and within mode (BUS, RAIL, etc.).
Flag routes with high volatility that may warrant anomaly investigation.

Data

Name	Description	Source
`otp_monthly`	Monthly OTP per route	`prt.db` table
`routes`	Route metadata (name, mode)	`prt.db` table
`route_stops`	Stop count as a complexity proxy (5 routes lack entries in this table)	`prt.db` table

Output

output/route_ranking.csv -- per-route summary stats, slopes with SEs/CIs, overall and within-mode rankings
output/top_bottom_routes.png -- best/worst performers chart

  Source Code
    
      
      """Route ranking by average OTP, trend slope, and volatility."""

import numpy as np
import polars as pl
from scipy import stats

from prt_otp_analysis.common import MODE_COLORS, Z_CRITICAL_95, analysis_dir, phase, query_to_polars, run_analysis, save_chart, save_csv, setup_plotting

OUT = analysis_dir(__file__)

MIN_MONTHS = 12  # minimum months of data to include in rankings
POST_COVID_START = "2022-01"  # start of post-COVID period for slope calculation


def load_data() -> tuple[pl.DataFrame, pl.DataFrame]:
    """Load OTP data with route metadata, and stop counts per route."""
    otp = query_to_polars("""
        SELECT o.route_id, o.month, o.otp, r.route_name, r.mode
        FROM otp_monthly o
        JOIN routes r ON o.route_id = r.route_id
    """)
    stop_counts = query_to_polars("""
        SELECT route_id, COUNT(DISTINCT stop_id) AS stop_count
        FROM route_stops
        GROUP BY route_id
    """)
    return otp, stop_counts


def analyze(otp: pl.DataFrame, stop_counts: pl.DataFrame) -> pl.DataFrame:
    """Compute per-route summary stats, post-COVID slope, and rankings."""
    # Per-route all-time summary stats
    summary = (
        otp.group_by(["route_id", "route_name", "mode"])
        .agg(
            months=pl.col("otp").count(),
            mean_otp=pl.col("otp").mean(),
            std_otp=pl.col("otp").std(),
            min_otp=pl.col("otp").min(),
            max_otp=pl.col("otp").max(),
        )
        .sort("route_id")
    )

    # Trailing 12-month average OTP
    all_months = otp.select("month").unique().sort("month")["month"].to_list()
    trailing_start = all_months[-12] if len(all_months) >= 12 else all_months[0]
    recent_avg = (
        otp.filter(pl.col("month") >= trailing_start)
        .group_by("route_id")
        .agg(recent_mean_otp=pl.col("otp").mean())
    )
    summary = summary.join(recent_avg, on="route_id", how="left")

    # Post-COVID slope (per year) using scipy.stats.linregress for SEs and CIs
    post_covid = otp.filter(pl.col("month") >= POST_COVID_START)
    pc_months = post_covid.select("month").unique().sort("month")
    pc_months = pc_months.with_row_index("time_idx")
    post_covid = post_covid.join(pc_months, on="month")

    # Compute actual time span (in months) for each route's post-COVID observations
    span_df = (
        post_covid.group_by("route_id")
        .agg(
            slope_months=pl.col("otp").count(),
            first_month=pl.col("month").min(),
            last_month=pl.col("month").max(),
            time_idx_min=pl.col("time_idx").min(),
            time_idx_max=pl.col("time_idx").max(),
        )
        .with_columns(
            obs_span_months=(pl.col("time_idx_max") - pl.col("time_idx_min") + 1).cast(pl.Int64),
        )
    )

    # Only keep routes with enough post-COVID data
    span_df = span_df.filter(pl.col("slope_months") >= MIN_MONTHS)

    # Compute slope, stderr, and CI per route using linregress (with zero-variance guard)
    slope_records = []
    for route_id in span_df["route_id"].to_list():
        route_data = post_covid.filter(pl.col("route_id") == route_id).sort("time_idx")
        x = route_data["time_idx"].to_numpy().astype(float)
        y = route_data["otp"].to_numpy().astype(float)

        # Zero-variance guard: if all time_idx values are the same, slope is undefined
        if np.var(x) == 0:
            slope_records.append({
                "route_id": route_id,
                "slope": 0.0,
                "slope_stderr": float("nan"),
                "slope_ci_lo": float("nan"),
                "slope_ci_hi": float("nan"),
                "slope_significant": False,
            })
            continue

        result = stats.linregress(x, y)
        slope_per_month = result.slope
        stderr_per_month = result.stderr
        # Convert to per-year units
        slope_yr = slope_per_month * 12
        stderr_yr = stderr_per_month * 12
        ci_lo = slope_yr - Z_CRITICAL_95 * stderr_yr
        ci_hi = slope_yr + Z_CRITICAL_95 * stderr_yr
        # Significant if 95% CI excludes zero
        significant = (ci_lo > 0) or (ci_hi < 0)

        slope_records.append({
            "route_id": route_id,
            "slope": slope_yr,
            "slope_stderr": stderr_yr,
            "slope_ci_lo": ci_lo,
            "slope_ci_hi": ci_hi,
            "slope_significant": significant,
        })

    slope_df = pl.DataFrame(slope_records)
    slope_df = slope_df.join(
        span_df.select("route_id", "slope_months", "obs_span_months", "first_month", "last_month"),
        on="route_id",
        how="left",
    )

    summary = summary.join(
        slope_df.select(
            "route_id", "slope", "slope_stderr", "slope_ci_lo", "slope_ci_hi",
            "slope_significant", "slope_months", "obs_span_months",
        ),
        on="route_id",
        how="left",
    )

    # Join stop counts
    summary = summary.join(stop_counts, on="route_id", how="left")

    # Flag limited-data routes
    summary = summary.with_columns(
        limited_data=pl.col("months") < MIN_MONTHS,
    )

    # Flag high-volatility routes (std > 2x median std across all routes)
    median_std = summary.filter(~pl.col("limited_data"))["std_otp"].median()
    summary = summary.with_columns(
        high_volatility=pl.col("std_otp") > (2 * median_std),
    )

    # Rankings (only for routes with sufficient data)
    rankable = summary.filter(~pl.col("limited_data"))
    avg_ranks = rankable.select("route_id", pl.col("recent_mean_otp").rank(descending=True).alias("rank_avg"))
    slope_ranks = rankable.filter(pl.col("slope").is_not_null()).select(
        "route_id", pl.col("slope").rank(descending=True).alias("rank_slope")
    )
    vol_ranks = rankable.select("route_id", pl.col("std_otp").rank(descending=False).alias("rank_volatility"))

    # Within-mode rank (rank routes within their mode group)
    mode_avg_ranks = (
        rankable.with_columns(
            mode_rank=pl.col("recent_mean_otp").rank(descending=True).over("mode")
        )
        .select("route_id", "mode_rank")
    )

    summary = (
        summary
        .join(avg_ranks, on="route_id", how="left")
        .join(slope_ranks, on="route_id", how="left")
        .join(vol_ranks, on="route_id", how="left")
        .join(mode_avg_ranks, on="route_id", how="left")
    )

    return summary.sort("rank_avg", nulls_last=True)


def make_chart(df: pl.DataFrame) -> None:
    """Generate top/bottom routes bar charts."""
    plt = setup_plotting()
    from matplotlib.patches import Patch

    rankable = df.filter(~pl.col("limited_data"))

    fig, axes = plt.subplots(1, 2, figsize=(16, 8))

    # Left: Top 10 and Bottom 10 by recent average OTP
    ax = axes[0]
    has_recent = rankable.filter(pl.col("recent_mean_otp").is_not_null())
    top10 = has_recent.sort("recent_mean_otp", descending=True).head(10)
    bottom10 = has_recent.sort("recent_mean_otp").head(10).sort("recent_mean_otp", descending=True)
    combined = pl.concat([top10, bottom10])

    labels = [f"{r} - {n}" for r, n in zip(combined["route_id"].to_list(), combined["route_name"].to_list())]
    values = combined["recent_mean_otp"].to_list()
    colors = [MODE_COLORS.get(m, "#9ca3af") for m in combined["mode"].to_list()]

    y_pos = range(len(labels))
    ax.barh(y_pos, values, color=colors)
    ax.set_yticks(y_pos)
    ax.set_yticklabels(labels, fontsize=8)
    ax.set_xlabel("Average OTP (trailing 12 months)")
    ax.set_title("Top 10 & Bottom 10 Routes by Recent OTP")
    ax.invert_yaxis()
    ax.set_xlim(0, 1)

    # Right: Top 10 improving and Top 10 declining by post-COVID slope
    ax = axes[1]
    has_slope = rankable.filter(pl.col("slope").is_not_null())
    improving = has_slope.sort("slope", descending=True).head(10)
    declining = has_slope.sort("slope").head(10).sort("slope", descending=True)
    combined2 = pl.concat([improving, declining])

    labels2 = [f"{r} - {n}" for r, n in zip(combined2["route_id"].to_list(), combined2["route_name"].to_list())]
    values2 = combined2["slope"].to_list()
    colors2 = ["#22c55e" if v >= 0 else "#ef4444" for v in values2]

    y_pos2 = range(len(labels2))
    ax.barh(y_pos2, values2, color=colors2)
    ax.set_yticks(y_pos2)
    ax.set_yticklabels(labels2, fontsize=8)
    ax.set_xlabel("Slope (OTP change per year, post-2022)")
    ax.set_title("Top 10 Improving & Declining Routes (post-COVID)")
    ax.invert_yaxis()
    ax.axvline(0, color="black", linewidth=0.5)

    legend_patches = [Patch(facecolor=c, label=m) for m, c in MODE_COLORS.items() if m != "UNKNOWN"]
    axes[0].legend(handles=legend_patches, loc="lower right", fontsize=8)

    save_chart(fig, OUT / "top_bottom_routes.png")


@run_analysis(3, "Route Ranking")
def main() -> None:
    """Entry point: load data, analyze, chart, and save."""
    with phase("Loading data"):
        otp, stop_counts = load_data()
        print(f"  {len(otp):,} OTP observations, {len(stop_counts)} routes with stop data")

    with phase("Analyzing"):
        result = analyze(otp, stop_counts)
        rankable = result.filter(~pl.col("limited_data"))
        limited = result.filter(pl.col("limited_data"))
        print(f"  {len(rankable)} routes ranked ({MIN_MONTHS}+ months of data)")
        print(f"  {len(limited)} routes excluded (fewer than {MIN_MONTHS} months)")
        hv = result.filter(pl.col("high_volatility"))
        print(f"  {len(hv)} high-volatility routes flagged")
        print(f"  Slope period: {POST_COVID_START} onward (per-year units)")

        # Report slope significance
        has_slope = result.filter(pl.col("slope").is_not_null())
        sig_slopes = has_slope.filter(pl.col("slope_significant") == True)
        print(f"  {len(sig_slopes)} of {len(has_slope)} slopes are statistically significant (95% CI excludes zero)")

        # Report routes with null stop counts
        null_stops = result.filter(pl.col("stop_count").is_null())
        if len(null_stops) > 0:
            ids = null_stops["route_id"].to_list()
            print(f"  {len(null_stops)} routes lack stop count data: {', '.join(str(r) for r in ids)}")

        # Report modes
        for mode in sorted(rankable["mode"].unique().to_list()):
            mode_count = len(rankable.filter(pl.col("mode") == mode))
            print(f"  Mode {mode}: {mode_count} routes ranked")

    with phase("Saving CSV"):
        save_csv(result, OUT / "route_ranking.csv")

    with phase("Generating chart"):
        make_chart(result)


if __name__ == "__main__":
    main()

    

    

Sources

Name	Type	Why It Matters	Owner	Freshness	Caveat
otp_monthly	table	Primary analytical table used in this page's computations.	Produced by Data Ingestion.	Updated when the producing pipeline step is rerun.	Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5) file data/routes_by_month.csv — Monthly route OTP source table in wide format. file data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv — Current route metadata and mode classifications. file data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv — Current stop-to-route coverage and trip counts. file data/PRT_Stop_Reference_Lookup_Table.csv — Historical stop reference file with geography attributes. file data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv — Average ridership by route and month.
route_stops	table	Primary analytical table used in this page's computations.	Produced by Data Ingestion.	Updated when the producing pipeline step is rerun.	Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5) file data/routes_by_month.csv — Monthly route OTP source table in wide format. file data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv — Current route metadata and mode classifications. file data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv — Current stop-to-route coverage and trip counts. file data/PRT_Stop_Reference_Lookup_Table.csv — Historical stop reference file with geography attributes. file data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv — Average ridership by route and month.
routes	table	Primary analytical table used in this page's computations.	Produced by Data Ingestion.	Updated when the producing pipeline step is rerun.	Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5) file data/routes_by_month.csv — Monthly route OTP source table in wide format. file data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv — Current route metadata and mode classifications. file data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv — Current stop-to-route coverage and trip counts. file data/PRT_Stop_Reference_Lookup_Table.csv — Historical stop reference file with geography attributes. file data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv — Average ridership by route and month.
matplotlib	dependency	Runtime dependency required for this page's pipeline or analysis code.	Open-source Python ecosystem maintainers.	Version pinned by project environment until dependency updates are applied.	Library updates may change behavior or defaults.
numpy	dependency	Runtime dependency required for this page's pipeline or analysis code.	Open-source Python ecosystem maintainers.	Version pinned by project environment until dependency updates are applied.	Library updates may change behavior or defaults.
polars	dependency	Runtime dependency required for this page's pipeline or analysis code.	Open-source Python ecosystem maintainers.	Version pinned by project environment until dependency updates are applied.	Library updates may change behavior or defaults.
scipy	dependency	Runtime dependency required for this page's pipeline or analysis code.	Open-source Python ecosystem maintainers.	Version pinned by project environment until dependency updates are applied.	Library updates may change behavior or defaults.