Analysis

38 - Downtown Recovery Gap

Ridership & Recovery

Coverage: 2017-01 to 2024-10 (from ridership_monthly).

Built 2026-04-03 20:09 UTC · Commit 7c56b9a

Data Provenance

flowchart LR
  38_downtown_recovery(["38 - Downtown Recovery Gap"])
  f1_38_downtown_recovery[/"data/bus-stop-usage/wprdc_stop_data.csv"/] --> 38_downtown_recovery
  t_ridership_monthly[("ridership_monthly")] --> 38_downtown_recovery
  01_data_ingestion[["Data Ingestion"]] --> t_ridership_monthly
  u1_01_data_ingestion[/"data/routes_by_month.csv"/] --> 01_data_ingestion
  u2_01_data_ingestion[/"data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv"/] --> 01_data_ingestion
  u3_01_data_ingestion[/"data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv"/] --> 01_data_ingestion
  u4_01_data_ingestion[/"data/PRT_Stop_Reference_Lookup_Table.csv"/] --> 01_data_ingestion
  u5_01_data_ingestion[/"data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv"/] --> 01_data_ingestion
  d1_38_downtown_recovery(("numpy (lib)")) --> 38_downtown_recovery
  d2_38_downtown_recovery(("polars (lib)")) --> 38_downtown_recovery
  d3_38_downtown_recovery(("scipy (lib)")) --> 38_downtown_recovery
  classDef page fill:#dbeafe,stroke:#1d4ed8,color:#1e3a8a,stroke-width:2px;
  classDef table fill:#ecfeff,stroke:#0e7490,color:#164e63;
  classDef dep fill:#fff7ed,stroke:#c2410c,color:#7c2d12,stroke-dasharray: 4 2;
  classDef file fill:#eef2ff,stroke:#6366f1,color:#3730a3;
  classDef api fill:#f0fdf4,stroke:#16a34a,color:#14532d;
  classDef pipeline fill:#f5f3ff,stroke:#7c3aed,color:#4c1d95;
  class 38_downtown_recovery page;
  class t_ridership_monthly table;
  class d1_38_downtown_recovery,d2_38_downtown_recovery,d3_38_downtown_recovery dep;
  class f1_38_downtown_recovery,u1_01_data_ingestion,u2_01_data_ingestion,u3_01_data_ingestion,u4_01_data_ingestion,u5_01_data_ingestion file;
  class 01_data_ingestion pipeline;

Findings

Findings: Downtown Recovery Gap

Summary

Downtown dependence is a statistically significant predictor of worse ridership recovery, partially supporting PRT's claim that weak downtown business activity drags system-wide ridership. Routes with higher shares of pre-pandemic downtown boardings show worse recovery trajectories. However, the effect is modest (Spearman rho = -0.29) and explains only a fraction of the system-wide gap — even routes with minimal downtown exposure remain well below 2019 levels. Decomposing by service type reveals that the downtown effect is primarily a commuter/express route phenomenon: commuter routes recovered to just 34% of 2019 levels vs. 59% for local routes. Within local routes alone, downtown dependence is not a significant predictor of recovery.

Key Numbers

  • Spearman correlation between downtown boardings share and 2024 recovery: rho = -0.29, p = 0.005.
  • High downtown dependence (top tercile, 31 routes, median 53% downtown share): median 2024 recovery = 53.4% of 2019.
  • Medium downtown dependence (32 routes, median 41% downtown share): median recovery = 47.3%.
  • Low downtown dependence (31 routes, median 15% downtown share): median recovery = 63.8%.
  • Kruskal-Wallis across terciles: H = 8.64, p = 0.013.
  • Only the Medium vs Low pairwise comparison remains significant after Bonferroni correction (adjusted p = 0.02). High vs Low approaches significance (adjusted p = 0.11) but does not clear the threshold.
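
To make the testing recipe concrete, the sketch below runs the same two-stage procedure on toy recovery values (illustrative numbers, not the actual route data): an omnibus Kruskal-Wallis test across the three terciles, then pairwise Mann-Whitney tests with Bonferroni correction, mirroring run_tests in the Source Code section.

from scipy import stats

# Toy 2024 recovery percentages per downtown-dependence tercile (illustrative only).
groups = {
    "High": [48.2, 53.4, 61.0, 40.5, 55.1],
    "Medium": [35.0, 47.3, 44.8, 51.2, 39.9],
    "Low": [63.8, 70.2, 58.4, 66.1, 72.5],
}

# Omnibus test: does recovery differ across the three groups at all?
h_stat, kw_p = stats.kruskal(*groups.values())
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={kw_p:.4f}")

# If significant, follow up with pairwise Mann-Whitney tests,
# Bonferroni-corrected for the three comparisons.
for a, b in [("High", "Medium"), ("High", "Low"), ("Medium", "Low")]:
    u_stat, raw_p = stats.mannwhitneyu(groups[a], groups[b], alternative="two-sided")
    adj_p = min(raw_p * 3, 1.0)  # Bonferroni over 3 pairwise comparisons
    print(f"{a} vs {b}: U={u_stat:.1f}, adjusted p={adj_p:.4f}")

With only five values per toy group the printed p-values are illustrative; the real analysis uses 31-32 routes per tercile.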

By Service Type

  • Local routes (n=69): median recovery = 58.6%, median downtown share = 41.1%.
  • Limited routes (n=4): median recovery = 51.4%, median downtown share = 42.8%.
  • Commuter/Express routes (n=21): median recovery = 33.7%, median downtown share = 41.6%.
  • Kruskal-Wallis across service types: H = 12.71, p = 0.002 — recovery differs significantly by service type.
  • Within-group Spearman (downtown share vs recovery): Local rho = -0.19 (p = 0.11, NS); Commuter/Express rho = -0.56 (p = 0.008).
  • Partial Spearman controlling for service type: rho = -0.26, p = 0.010 — downtown dependence still predicts worse recovery even after accounting for route type.

Observations

  • PRT's claim is directionally correct but insufficient as a full explanation. The negative correlation confirms that downtown-oriented routes have recovered less ridership. But the effect is modest — downtown dependence alone accounts for roughly 8% of the rank variance in recovery (rho-squared ≈ 0.08).
  • Even non-downtown routes are far from recovery. The low-dependence tercile sits at median 63.8% of 2019, meaning routes with almost no downtown exposure still lost over a third of their riders. This is not purely a downtown story.
  • The medium tercile recovered worst, not the high tercile. This appears driven by commuter express routes (flyers like P12, P67, P71, P76) and express services (Y1, O1, G3) that fall in the medium range for downtown stop share but serve exactly the demographic most likely to shift to remote work. These routes lost 70-85% of ridership.
  • Large local routes in the high tercile held up better than expected. Routes like 6, 12, 17, 81, 83 have >50% downtown share but recovered to 65-79% — likely because they serve diverse trip purposes (shopping, medical, transfers) beyond commuting.
  • The best-recovering routes are high-ridership trunk lines with low downtown shares: 61A (84%), 61C (79%), 71B (80%), 59 (90%). These serve corridor travel that is less sensitive to office occupancy.
  • The trajectory chart shows all three groups tracking closely until mid-2021, after which the low-dependence group pulls ahead. This timing aligns with the return of non-commute travel (errands, school, medical) while office occupancy remained depressed.

Service Type Decomposition

  • Commuter/express routes recovered far less than local routes: median 33.7% vs 58.6% of 2019 levels. This is statistically significant (Kruskal-Wallis H = 12.71, p = 0.002) and represents the single largest explanatory split in the data.
  • Downtown dependence and service type are partially confounded but not redundant. Commuter/express routes have similar median downtown shares to local routes (~41%), but their recovery is 25pp worse. The downtown effect operates through route type — commuter routes serve downtown commuters specifically, while local routes serve downtown for diverse purposes.
  • Within local routes, downtown dependence has no significant effect (rho = -0.19, p = 0.11). Among local bus routes, having more downtown stops does not meaningfully predict worse recovery. The downtown recovery gap is primarily a commuter route phenomenon.
  • Within commuter/express routes, downtown dependence is strongly predictive (rho = -0.56, p = 0.008). Among commuter routes, those with higher downtown share recovered even less — suggesting that peak-direction downtown express services were hit hardest.
  • After controlling for service type, downtown dependence still predicts recovery (partial rho = -0.26, p = 0.010), but the effect is attenuated from rho = -0.29 to -0.26. Service type explains part but not all of the downtown dependence effect.
  • The faceted trajectory chart reveals starkly different recovery shapes. Local routes show a gradual climb back toward 60-70% across all terciles. Commuter/express routes show a much flatter recovery, with high-downtown-dependence commuter routes plateauing around 30-40%.

Caveats

  • Downtown share is computed from pre-pandemic stop-level data (FY2019), not current patterns. Routes may have shifted service since 2019.
  • The stop-level CSV provides boarding counts, not trip purpose. A downtown boarding could be a commuter, a shopper, a transfer, or a medical visitor. We cannot isolate the "office worker" effect directly.
  • Ecological inference limitation. Route-level associations do not prove that individual downtown commuters failed to return — only that routes serving more downtown stops recovered less ridership in aggregate.
  • Ridership data ends Oct 2024. More recent recovery may differ.
  • The 2 km downtown radius captures the Golden Triangle and adjacent areas (e.g., Strip District edge, North Shore). A tighter boundary around the CBD might sharpen the signal.

Validation

  1. Data source verified. Downtown scores computed from wprdc_stop_data.csv (stop-level ridership with coordinates). Recovery trajectories from ridership_monthly (route-level weekday ridership). Both checked against prior analyses (33, 21).
  2. Geographic scope matches. Downtown centroid and 2 km radius match analysis 33's definition. Pre-pandemic baseline (2019) matches analyses 21 and 37.
  3. Null handling. Routes with zero 2019 baseline excluded from indexing. Stop records with null lat/lon excluded from downtown scoring.
  4. Aggregates sanity-checked. System-wide recovery around 55-60% is consistent with analysis 37's finding of 57.6% via NTD data. Individual route recoveries match analysis 21's findings.
  5. Direction of effects checked. More downtown dependence associates with worse recovery — consistent with the remote-work hypothesis. Low-dependence routes recovering better is consistent with known patterns of non-commute ridership being more resilient.
  6. Small-sample routes present. Some routes (O5, 7, 71, 18) have very low baselines (<200 avg daily riders). These contribute equally to tercile medians despite less reliable estimates.

Methods

Methods: Downtown Recovery Gap

Question

Does downtown-dependent ridership explain Pittsburgh's poor system-wide recovery relative to peer cities? PRT claims that weak downtown business recovery is a primary driver of the system's lagging ridership. If true, routes that depend heavily on downtown stops should show worse recovery trajectories than routes serving other parts of the network.

Approach

  1. Compute downtown-dependence score per route. Using pre-pandemic stop-level ridership, calculate the fraction of each route's weekday boardings that occur at downtown stops (within 2 km of the Golden Triangle centroid at 40.4406, -79.9959 — matching analysis 33's definition).
  2. Classify routes into terciles by downtown-dependence share: high (top third), medium, and low (bottom third). This avoids arbitrary cutoffs and ensures roughly equal group sizes.
  3. Build monthly ridership recovery trajectories. Using ridership_monthly (weekday data, 2017-01 through 2024-10), index each route's ridership to its 2019 monthly average. Aggregate indexed ridership by downtown-dependence tercile to produce three recovery curves.
  4. Statistical test. Compare 2024 recovery ratios across the three groups using Kruskal-Wallis (non-parametric, no normality assumption). If significant, apply pairwise Mann-Whitney with Bonferroni correction.
  5. Scatter plot. Show the relationship between downtown-dependence share (continuous) and 2024 recovery ratio at the route level, with Spearman correlation.
  6. Service type segmentation. Classify each route by service type using route ID naming conventions (classify_bus_route): local (plain numbers), limited (L-suffix), and commuter/express (P/O/G-prefix flyers, X-suffix express, busway). Produce faceted recovery trajectories and scatter plots by service type.
  7. Service type as covariate. Test whether downtown dependence predicts recovery after controlling for service type using: (a) within-group Spearman correlations for each service type, (b) Kruskal-Wallis across service types, and (c) partial Spearman correlation residualizing both downtown share and recovery on service type rank.
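
Step 7(c) can be sketched compactly. The snippet below uses toy arrays (not the real route data) and the same group-mean residualization as run_service_type_analysis in the Source Code section: residualize downtown share and recovery on service-type means, then rank-correlate the residuals.

import numpy as np
from scipy import stats

# Toy inputs: downtown share, 2024 recovery, and service-type rank per route.
dt_share = np.array([0.55, 0.41, 0.15, 0.48, 0.38, 0.20, 0.52, 0.44])
recovery = np.array([53.0, 47.0, 64.0, 34.0, 30.0, 58.0, 61.0, 36.0])
type_rank = np.array([0, 0, 0, 2, 2, 0, 0, 2])  # 0=Local, 1=Limited, 2=Commuter/Express

# Residualize each variable on its service-type group mean.
dt_resid = dt_share - np.array([dt_share[type_rank == t].mean() for t in type_rank])
rec_resid = recovery - np.array([recovery[type_rank == t].mean() for t in type_rank])

# Spearman on the residuals gives the partial rank association
# after removing service-type level differences.
rho, p = stats.spearmanr(dt_resid, rec_resid)
print(f"partial Spearman rho={rho:.2f}, p={p:.3f}")

Note that residualizing raw values and then ranking is an approximation to a textbook partial Spearman (which ranks before residualizing); the findings report the group-mean-residual version used here.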

Data

  • data/bus-stop-usage/wprdc_stop_data.csv — stop-level ridership with lat/lon. Filtered to time_period == 'Pre-pandemic' and serviceday == 'Weekday'. Used to compute downtown-dependence scores.
  • ridership_monthly table — route-level monthly ridership. Filtered to day_type == 'WEEKDAY'. Used for recovery trajectories.
  • Downtown centroid: (40.4406, -79.9959), radius 2 km (Haversine).
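
For reference, this is the same Haversine computation as haversine_km in the Source Code section, applied to a hypothetical stop coordinate (chosen for illustration, not taken from the data):

import math

DT_LAT, DT_LON = 40.4406, -79.9959  # Golden Triangle centroid
DT_RADIUS_KM = 2.0

def haversine_km(lat: float, lon: float) -> float:
    """Great-circle distance in km from the downtown centroid."""
    R = 6371  # Earth mean radius in km
    dlat = math.radians(lat - DT_LAT)
    dlon = math.radians(lon - DT_LON)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(DT_LAT)) * math.cos(math.radians(lat))
         * math.sin(dlon / 2) ** 2)
    return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

# Hypothetical stop ~0.7 km northeast of the centroid: classified as downtown.
d = haversine_km(40.4450, -79.9900)
print(f"{d:.2f} km from centroid, downtown={d < DT_RADIUS_KM}")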

Output

  • output/recovery_trajectories.png — Monthly indexed ridership (2019=100) for high/medium/low downtown-dependence terciles.
  • output/scatter_dt_share_vs_recovery.png — Route-level scatter of downtown share vs. 2024 recovery ratio.
  • output/recovery_by_service_type.png — Recovery trajectories faceted by service type (local, limited, commuter/express).
  • output/scatter_by_service_type.png — Downtown share vs. recovery scatter colored by service type.
  • output/route_downtown_scores.csv — Per-route downtown-dependence scores, service type, and recovery ratios.
  • output/statistical_tests.csv — Kruskal-Wallis and pairwise test results.
  • output/service_type_tests.csv — Within-group correlations, Kruskal-Wallis by service type, and partial correlation results.

Source Code

"""Test whether downtown-dependent routes explain Pittsburgh's poor ridership recovery."""

import math

import numpy as np
import polars as pl
from scipy import stats

from prt_otp_analysis.common import (
    DATA_DIR,
    PRE_COVID_BASELINE_YEAR,
    analysis_dir,
    classify_bus_route,
    correlate,
    phase,
    query_to_polars,
    run_analysis,
    save_chart,
    save_csv,
    setup_plotting,
    weighted_mean,
)

OUT = analysis_dir(__file__)

# Downtown Pittsburgh centroid (matches analysis 33)
DT_LAT, DT_LON = 40.4406, -79.9959
DT_RADIUS_KM = 2.0

BASELINE_YEAR = PRE_COVID_BASELINE_YEAR

# Commuter-type subtypes (flyer, express, busway) vs local vs limited
COMMUTER_SUBTYPES = {"express", "flyer", "busway"}


def add_service_type(df: pl.DataFrame) -> pl.DataFrame:
    """Add subtype and service_type columns based on route_id naming conventions."""
    return df.with_columns(
        pl.col("route_id")
        .map_elements(classify_bus_route, return_dtype=pl.Utf8)
        .alias("subtype"),
    ).with_columns(
        pl.when(pl.col("subtype").is_in(COMMUTER_SUBTYPES))
        .then(pl.lit("Commuter/Express"))
        .when(pl.col("subtype") == "limited")
        .then(pl.lit("Limited"))
        .otherwise(pl.lit("Local"))
        .alias("service_type"),
    )


def haversine_km(lat: float, lon: float) -> float:
    """Return distance in km from downtown centroid."""
    R = 6371
    rlat1, rlat2 = math.radians(DT_LAT), math.radians(lat)
    dlat = math.radians(lat - DT_LAT)
    dlon = math.radians(lon - DT_LON)
    a = math.sin(dlat / 2) ** 2 + math.cos(rlat1) * math.cos(rlat2) * math.sin(dlon / 2) ** 2
    return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))


def compute_downtown_scores() -> pl.DataFrame:
    """Compute downtown-dependence score for each route using stop-level data."""
    csv_path = DATA_DIR / "bus-stop-usage" / "wprdc_stop_data.csv"
    df = pl.read_csv(csv_path, null_values=["NA", ""])

    # Pre-pandemic weekday boardings by stop and route
    pre = df.filter(
        (pl.col("time_period") == "Pre-pandemic")
        & (pl.col("serviceday") == "Weekday")
    )
    stop_route = (
        pre.group_by(["stop_id", "route_name", "latitude", "longitude"])
        .agg(pl.col("avg_ons").mean().alias("avg_boardings"))
        .drop_nulls(subset=["latitude", "longitude", "avg_boardings"])
    )

    # Classify stops as downtown or not
    dists = [
        haversine_km(lat, lon)
        for lat, lon in zip(
            stop_route["latitude"].to_list(), stop_route["longitude"].to_list()
        )
    ]
    stop_route = stop_route.with_columns(
        pl.Series("dist_km", dists),
    ).with_columns(
        (pl.col("dist_km") < DT_RADIUS_KM).alias("is_downtown"),
    )

    # Downtown boardings share per route
    stop_route = stop_route.with_columns(
        (pl.col("avg_boardings") * pl.col("is_downtown").cast(pl.Float64)).alias(
            "dt_boardings"
        )
    )
    route_scores = (
        stop_route.group_by("route_name")
        .agg(
            pl.col("avg_boardings").sum().alias("total_boardings"),
            pl.col("dt_boardings").sum().alias("dt_boardings"),
            pl.col("stop_id").n_unique().alias("n_stops"),
        )
        .with_columns(
            (pl.col("dt_boardings") / pl.col("total_boardings")).alias("dt_share")
        )
        .rename({"route_name": "route_id"})
        .sort("dt_share", descending=True)
    )
    return route_scores


def load_ridership() -> pl.DataFrame:
    """Load monthly weekday ridership from database."""
    return query_to_polars(
        "SELECT route_id, month, avg_riders, route_name "
        "FROM ridership_monthly WHERE day_type = 'WEEKDAY'"
    )


def compute_recovery(
    ridership: pl.DataFrame, route_scores: pl.DataFrame
) -> pl.DataFrame:
    """Compute 2019-indexed ridership and merge with downtown scores."""
    # Compute 2019 baseline per route
    baseline = (
        ridership.filter(pl.col("month").str.starts_with(BASELINE_YEAR))
        .group_by("route_id")
        .agg(pl.col("avg_riders").mean().alias("baseline_2019"))
    )

    # Join baseline and compute index
    df = ridership.join(baseline, on="route_id", how="inner")
    df = df.filter(pl.col("baseline_2019") > 0).with_columns(
        (pl.col("avg_riders") / pl.col("baseline_2019") * 100).alias("indexed")
    )

    # Merge downtown scores
    df = df.join(
        route_scores.select("route_id", "dt_share", "total_boardings"),
        on="route_id",
        how="inner",
    )
    return df


def assign_terciles(route_scores: pl.DataFrame) -> pl.DataFrame:
    """Assign downtown-dependence terciles."""
    thirds = route_scores["dt_share"].quantile(1 / 3), route_scores["dt_share"].quantile(2 / 3)
    return route_scores.with_columns(
        pl.when(pl.col("dt_share") <= thirds[0])
        .then(pl.lit("Low"))
        .when(pl.col("dt_share") <= thirds[1])
        .then(pl.lit("Medium"))
        .otherwise(pl.lit("High"))
        .alias("dt_tercile")
    )


def plot_trajectories(df: pl.DataFrame) -> None:
    """Plot recovery trajectories by downtown-dependence tercile."""
    plt = setup_plotting()

    # Aggregate: ridership-weighted mean index per tercile per month
    monthly = (
        df.group_by(["month", "dt_tercile"])
        .agg(
            weighted_index=weighted_mean("indexed", "avg_riders"),
            n_routes=pl.col("route_id").n_unique(),
        )
        .sort("month")
    )

    fig, ax = plt.subplots(figsize=(12, 6))

    colors = {"High": "#dc2626", "Medium": "#f59e0b", "Low": "#2563eb"}
    for tercile in ["High", "Medium", "Low"]:
        sub = monthly.filter(pl.col("dt_tercile") == tercile).sort("month")
        n = sub["n_routes"].max()
        ax.plot(
            sub["month"].to_list(),
            sub["weighted_index"].to_list(),
            label=f"{tercile} downtown dependence (n={n})",
            color=colors[tercile],
            linewidth=2,
        )

    ax.axhline(100, color="gray", linewidth=0.8, linestyle="--", alpha=0.5)
    ax.axvspan("2020-03", "2020-06", alpha=0.08, color="gray", label="Initial lockdown")

    # Thin x-axis labels
    ticks = [m for m in monthly["month"].unique().sort().to_list() if m.endswith("-01")]
    ax.set_xticks(ticks)
    ax.set_xticklabels([m[:4] for m in ticks], rotation=0)

    ax.set_ylabel("Indexed weekday ridership (2019 = 100)")
    ax.set_xlabel("")
    ax.set_title("Ridership Recovery by Downtown Dependence")
    ax.legend(loc="lower right")
    ax.set_ylim(0, 130)

    save_chart(fig, OUT / "recovery_trajectories.png", dpi=150)


def plot_trajectories_by_service_type(df: pl.DataFrame) -> None:
    """Plot recovery trajectories faceted by service type."""
    plt = setup_plotting()

    service_types = ["Local", "Limited", "Commuter/Express"]
    fig, axes = plt.subplots(1, 3, figsize=(18, 6), sharey=True)

    colors = {"High": "#dc2626", "Medium": "#f59e0b", "Low": "#2563eb"}

    for ax, stype in zip(axes, service_types):
        sub_df = df.filter(pl.col("service_type") == stype)
        n_routes = sub_df["route_id"].n_unique()

        monthly = (
            sub_df.group_by(["month", "dt_tercile"])
            .agg(
                weighted_index=weighted_mean("indexed", "avg_riders"),
                n_routes=pl.col("route_id").n_unique(),
            )
            .sort("month")
        )

        for tercile in ["High", "Medium", "Low"]:
            tsub = monthly.filter(pl.col("dt_tercile") == tercile).sort("month")
            if len(tsub) == 0:
                continue
            n = tsub["n_routes"].max()
            ax.plot(
                tsub["month"].to_list(),
                tsub["weighted_index"].to_list(),
                label=f"{tercile} (n={n})",
                color=colors[tercile],
                linewidth=2,
            )

        ax.axhline(100, color="gray", linewidth=0.8, linestyle="--", alpha=0.5)
        ax.axvspan("2020-03", "2020-06", alpha=0.08, color="gray")
        ticks = [m for m in monthly["month"].unique().sort().to_list() if m.endswith("-01")]
        ax.set_xticks(ticks)
        ax.set_xticklabels([m[:4] for m in ticks], rotation=0, fontsize=8)
        ax.set_title(f"{stype} routes (n={n_routes})")
        ax.legend(loc="lower right", fontsize=8)
        ax.set_ylim(0, 130)

    axes[0].set_ylabel("Indexed weekday ridership (2019 = 100)")
    fig.suptitle("Recovery by Downtown Dependence — Faceted by Service Type", fontsize=14)
    fig.tight_layout()
    save_chart(fig, OUT / "recovery_by_service_type.png", dpi=150)


def plot_scatter(route_df: pl.DataFrame) -> None:
    """Scatter plot of downtown share vs 2024 recovery."""
    plt = setup_plotting()

    fig, ax = plt.subplots(figsize=(8, 6))

    x = route_df["dt_share"].to_numpy() * 100
    y = route_df["recovery_2024"].to_numpy()
    sizes = np.clip(route_df["baseline_2019"].to_numpy() / 100, 10, 200)

    ax.scatter(x, y, s=sizes, alpha=0.5, color="#3b82f6", edgecolors="white", linewidth=0.5)

    # Regression line
    mask = np.isfinite(x) & np.isfinite(y)
    if mask.sum() > 5:
        corr = correlate(route_df, "dt_share", "recovery_2024")
        slope, intercept = np.polyfit(x[mask], y[mask], 1)
        x_line = np.linspace(x[mask].min(), x[mask].max(), 100)
        ax.plot(x_line, slope * x_line + intercept, color="#dc2626", linewidth=1.5,
                linestyle="--", label=f"Spearman ρ = {corr['spearman_r']:.2f} (p = {corr['spearman_p']:.3f})")

    ax.axhline(100, color="gray", linewidth=0.8, linestyle="--", alpha=0.5)
    ax.set_xlabel("Downtown boardings share (% of route total)")
    ax.set_ylabel("2024 ridership as % of 2019")
    ax.set_title("Downtown Dependence vs. Ridership Recovery")
    ax.legend()

    save_chart(fig, OUT / "scatter_dt_share_vs_recovery.png", dpi=150)


def plot_scatter_by_service_type(route_df: pl.DataFrame) -> None:
    """Scatter plot of downtown share vs recovery, colored by service type."""
    plt = setup_plotting()

    fig, ax = plt.subplots(figsize=(9, 6))

    type_colors = {"Local": "#2563eb", "Limited": "#f59e0b", "Commuter/Express": "#dc2626"}

    for stype, color in type_colors.items():
        sub = route_df.filter(pl.col("service_type") == stype)
        if len(sub) == 0:
            continue
        x = sub["dt_share"].to_numpy() * 100
        y = sub["recovery_2024"].to_numpy()
        sizes = np.clip(sub["baseline_2019"].to_numpy() / 100, 10, 200)
        ax.scatter(
            x, y, s=sizes, alpha=0.6, color=color, edgecolors="white",
            linewidth=0.5, label=f"{stype} (n={len(sub)})",
        )

    ax.axhline(100, color="gray", linewidth=0.8, linestyle="--", alpha=0.5)
    ax.set_xlabel("Downtown boardings share (% of route total)")
    ax.set_ylabel("2024 ridership as % of 2019")
    ax.set_title("Downtown Dependence vs. Recovery — By Service Type")
    ax.legend()

    save_chart(fig, OUT / "scatter_by_service_type.png", dpi=150)


def run_tests(route_df: pl.DataFrame) -> pl.DataFrame:
    """Kruskal-Wallis and pairwise Mann-Whitney tests on recovery by tercile."""
    results = []

    groups = {}
    for t in ["High", "Medium", "Low"]:
        vals = route_df.filter(pl.col("dt_tercile") == t)["recovery_2024"].drop_nulls().to_list()
        groups[t] = vals

    if all(len(v) >= 3 for v in groups.values()):
        h_stat, kw_p = stats.kruskal(*groups.values())
        results.append({
            "test": "Kruskal-Wallis",
            "comparison": "High vs Medium vs Low",
            "statistic": round(h_stat, 3),
            "p_value": round(kw_p, 4),
            "significant": kw_p < 0.05,
        })

        pairs = [("High", "Medium"), ("High", "Low"), ("Medium", "Low")]
        for a, b in pairs:
            u_stat, mw_p = stats.mannwhitneyu(groups[a], groups[b], alternative="two-sided")
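            # Bonferroni: scale the raw p-value by the number of pairwise comparisons (3), capped at 1.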
            adj_p = min(mw_p * 3, 1.0)
            results.append({
                "test": "Mann-Whitney (Bonferroni)",
                "comparison": f"{a} vs {b}",
                "statistic": round(u_stat, 1),
                "p_value": round(adj_p, 4),
                "significant": adj_p < 0.05,
            })

    return pl.DataFrame(results)


def run_service_type_analysis(route_df: pl.DataFrame) -> pl.DataFrame:
    """Test downtown dependence effect within each service type and overall with covariate."""
    results = []

    # Within-group Spearman correlations
    for stype in ["Local", "Limited", "Commuter/Express"]:
        sub = route_df.filter(pl.col("service_type") == stype).drop_nulls(
            subset=["dt_share", "recovery_2024"]
        )
        if len(sub) >= 5:
            r, p = stats.spearmanr(sub["dt_share"].to_numpy(), sub["recovery_2024"].to_numpy())
            results.append({
                "test": "Spearman (within group)",
                "group": stype,
                "n": len(sub),
                "statistic": round(r, 3),
                "p_value": round(p, 4),
                "significant": p < 0.05,
            })

    # Kruskal-Wallis across service types (is recovery different by type?)
    groups_by_type = {}
    for stype in ["Local", "Limited", "Commuter/Express"]:
        vals = route_df.filter(pl.col("service_type") == stype)["recovery_2024"].drop_nulls().to_list()
        if len(vals) >= 3:
            groups_by_type[stype] = vals

    if len(groups_by_type) >= 2:
        h_stat, kw_p = stats.kruskal(*groups_by_type.values())
        results.append({
            "test": "Kruskal-Wallis (service type)",
            "group": " vs ".join(groups_by_type.keys()),
            "n": sum(len(v) for v in groups_by_type.values()),
            "statistic": round(h_stat, 3),
            "p_value": round(kw_p, 4),
            "significant": kw_p < 0.05,
        })

    # Partial correlation: rank-based approach
    # Encode service_type as ordinal for partial correlation
    type_rank = {"Local": 0, "Limited": 1, "Commuter/Express": 2}
    valid_df = route_df.drop_nulls(subset=["dt_share", "recovery_2024"]).with_columns(
        pl.col("service_type").replace_strict(type_rank).cast(pl.Float64).alias("type_rank")
    )
    if len(valid_df) >= 10:
        # Partial Spearman: residualize both dt_share and recovery on service type rank
        dt_vals = valid_df["dt_share"].to_numpy()
        rec_vals = valid_df["recovery_2024"].to_numpy()
        type_vals = valid_df["type_rank"].to_numpy()

        # Group-mean residuals (Spearman on these gives a rank-based partial association)
        dt_resid = dt_vals - np.array([np.mean(dt_vals[type_vals == t]) for t in type_vals])
        rec_resid = rec_vals - np.array([np.mean(rec_vals[type_vals == t]) for t in type_vals])

        r_partial, p_partial = stats.spearmanr(dt_resid, rec_resid)
        results.append({
            "test": "Partial Spearman (controlling service type)",
            "group": "All",
            "n": len(valid_df),
            "statistic": round(r_partial, 3),
            "p_value": round(p_partial, 4),
            "significant": p_partial < 0.05,
        })

    return pl.DataFrame(results)


@run_analysis(38, "Downtown Recovery Gap")
def main() -> None:
    """Entry point."""

    # Step 1: Downtown-dependence scores + service type classification
    with phase("Computing downtown-dependence scores from stop-level data"):
        route_scores = compute_downtown_scores()
        route_scores = assign_terciles(route_scores)
        route_scores = add_service_type(route_scores)
        print(f"  {len(route_scores)} routes scored")
        for t in ["High", "Medium", "Low"]:
            sub = route_scores.filter(pl.col("dt_tercile") == t)
            print(f"  {t:6s}: n={len(sub)}, median dt_share={sub['dt_share'].median():.1%}")
        print("\n  Service type breakdown:")
        for stype in ["Local", "Limited", "Commuter/Express"]:
            sub = route_scores.filter(pl.col("service_type") == stype)
            if len(sub) > 0:
                print(f"  {stype:20s}: n={len(sub)}, median dt_share={sub['dt_share'].median():.1%}")

    # Step 2: Load ridership and compute recovery
    with phase("Loading monthly ridership"):
        ridership = load_ridership()
        df = compute_recovery(ridership, route_scores)
        # Merge tercile labels and service type
        df = df.join(
            route_scores.select("route_id", "dt_tercile", "service_type"),
            on="route_id",
            how="left",
        )
        print(f"  {df['route_id'].n_unique()} routes with both ridership and downtown scores")

        # Step 3: Compute 2024 recovery ratio per route
        recovery_2024 = (
            df.filter(pl.col("month").str.starts_with("2024"))
            .group_by("route_id")
            .agg(pl.col("indexed").mean().alias("recovery_2024"))
        )
        route_df = route_scores.join(recovery_2024, on="route_id", how="inner")
        # Also get baseline for scatter sizing
        baseline = (
            ridership.filter(pl.col("month").str.starts_with(BASELINE_YEAR))
            .group_by("route_id")
            .agg(pl.col("avg_riders").mean().alias("baseline_2019"))
        )
        route_df = route_df.join(baseline, on="route_id", how="inner")

        print("\nRecovery by tercile (2024 avg as % of 2019):")
        for t in ["High", "Medium", "Low"]:
            sub = route_df.filter(pl.col("dt_tercile") == t)
            med = sub["recovery_2024"].median()
            mean = sub["recovery_2024"].mean()
            print(f"  {t:6s}: median={med:.1f}%, mean={mean:.1f}%, n={len(sub)}")

        # Step 4: Statistical tests
        print("\nStatistical tests:")
        test_results = run_tests(route_df)
        for row in test_results.iter_rows(named=True):
            sig = "*" if row["significant"] else ""
            print(f"  {row['test']:30s} {row['comparison']:25s} "
                  f"stat={row['statistic']:.3f}, p={row['p_value']:.4f} {sig}")

        # Spearman correlation (continuous)
        corr = correlate(route_df, "dt_share", "recovery_2024")
        print(f"\n  Spearman (dt_share vs recovery): ρ={corr['spearman_r']:.3f}, p={corr['spearman_p']:.4f}")

    # Step 4b: Service type analysis
    with phase("Analyzing recovery by service type"):
        print("\nRecovery by service type (2024 avg as % of 2019):")
        for stype in ["Local", "Limited", "Commuter/Express"]:
            sub = route_df.filter(pl.col("service_type") == stype)
            if len(sub) > 0:
                med = sub["recovery_2024"].median()
                mean = sub["recovery_2024"].mean()
                dt_med = sub["dt_share"].median()
                print(
                    f"  {stype:20s}: median recovery={med:.1f}%, "
                    f"mean={mean:.1f}%, median dt_share={dt_med:.1%}, n={len(sub)}"
                )

        service_type_results = run_service_type_analysis(route_df)
        print("\nService type statistical tests:")
        for row in service_type_results.iter_rows(named=True):
            sig = "*" if row["significant"] else ""
            print(
                f"  {row['test']:45s} {row['group']:20s} "
                f"n={row['n']:3d}, stat={row['statistic']:.3f}, "
                f"p={row['p_value']:.4f} {sig}"
            )

    # Step 5: Charts
    with phase("Generating charts"):
        plot_trajectories(df)
        plot_scatter(route_df)
        plot_trajectories_by_service_type(df)
        plot_scatter_by_service_type(route_df)

    # Step 6: Save CSV
    with phase("Saving CSVs"):
        sorted_route_df = route_df.sort("dt_share", descending=True)
        save_csv(sorted_route_df, OUT / "route_downtown_scores.csv")
        save_csv(test_results, OUT / "statistical_tests.csv")
        save_csv(service_type_results, OUT / "service_type_tests.csv")


if __name__ == "__main__":
    main()

Sources

Name | Type | Why It Matters | Owner | Freshness | Caveat
data/bus-stop-usage/wprdc_stop_data.csv | file | Stop-level ridership with lat/lon, used to compute downtown-dependence scores per route from pre-pandemic weekday boardings. | Local project data owner not specified. | Snapshot file; refresh by rerunning its pipeline step. | May lag upstream source updates.
ridership_monthly | table | Primary analytical table used in this page's computations. | Produced by Data Ingestion. | Updated when the producing pipeline step is rerun. | Coverage depends on upstream source availability and ETL assumptions.
numpy | dependency | Runtime dependency required for this page's pipeline or analysis code. | Open-source Python ecosystem maintainers. | Version pinned by project environment until dependency updates are applied. | Library updates may change behavior or defaults.
polars | dependency | Runtime dependency required for this page's pipeline or analysis code. | Open-source Python ecosystem maintainers. | Version pinned by project environment until dependency updates are applied. | Library updates may change behavior or defaults.
scipy | dependency | Runtime dependency required for this page's pipeline or analysis code. | Open-source Python ecosystem maintainers. | Version pinned by project environment until dependency updates are applied. | Library updates may change behavior or defaults.

Upstream sources (5)
  • file data/routes_by_month.csv — Monthly route OTP source table in wide format.
  • file data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv — Current route metadata and mode classifications.
  • file data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv — Current stop-to-route coverage and trip counts.
  • file data/PRT_Stop_Reference_Lookup_Table.csv — Historical stop reference file with geography attributes.
  • file data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv — Average ridership by route and month.