26 - Ridership in Multivariate OTP Model

Ridership and External Factors

Coverage: 2017-01 to 2025-11 (from otp_monthly, ridership_monthly).

Built 2026-04-03 20:09 UTC · Commit 7c56b9a


Data Provenance

flowchart LR
  26_ridership_multivariate(["26 - Ridership in Multivariate OTP Model"])
  t_otp_monthly[("otp_monthly")] --> 26_ridership_multivariate
  01_data_ingestion[["Data Ingestion"]] --> t_otp_monthly
  u1_01_data_ingestion[/"data/routes_by_month.csv"/] --> 01_data_ingestion
  u2_01_data_ingestion[/"data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv"/] --> 01_data_ingestion
  u3_01_data_ingestion[/"data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv"/] --> 01_data_ingestion
  u4_01_data_ingestion[/"data/PRT_Stop_Reference_Lookup_Table.csv"/] --> 01_data_ingestion
  u5_01_data_ingestion[/"data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv"/] --> 01_data_ingestion
  t_ridership_monthly[("ridership_monthly")] --> 26_ridership_multivariate
  01_data_ingestion[["Data Ingestion"]] --> t_ridership_monthly
  t_route_stops[("route_stops")] --> 26_ridership_multivariate
  01_data_ingestion[["Data Ingestion"]] --> t_route_stops
  t_routes[("routes")] --> 26_ridership_multivariate
  01_data_ingestion[["Data Ingestion"]] --> t_routes
  t_stops[("stops")] --> 26_ridership_multivariate
  01_data_ingestion[["Data Ingestion"]] --> t_stops
  d1_26_ridership_multivariate(("numpy (lib)")) --> 26_ridership_multivariate
  d2_26_ridership_multivariate(("polars (lib)")) --> 26_ridership_multivariate
  d3_26_ridership_multivariate(("scipy (lib)")) --> 26_ridership_multivariate
  classDef page fill:#dbeafe,stroke:#1d4ed8,color:#1e3a8a,stroke-width:2px;
  classDef table fill:#ecfeff,stroke:#0e7490,color:#164e63;
  classDef dep fill:#fff7ed,stroke:#c2410c,color:#7c2d12,stroke-dasharray: 4 2;
  classDef file fill:#eef2ff,stroke:#6366f1,color:#3730a3;
  classDef api fill:#f0fdf4,stroke:#16a34a,color:#14532d;
  classDef pipeline fill:#f5f3ff,stroke:#7c3aed,color:#4c1d95;
  class 26_ridership_multivariate page;
  class t_otp_monthly,t_ridership_monthly,t_route_stops,t_routes,t_stops table;
  class d1_26_ridership_multivariate,d2_26_ridership_multivariate,d3_26_ridership_multivariate dep;
  class u1_01_data_ingestion,u2_01_data_ingestion,u3_01_data_ingestion,u4_01_data_ingestion,u5_01_data_ingestion file;
  class 01_data_ingestion pipeline;

Findings

Findings: Ridership in Multivariate OTP Model

Summary

Adding log-transformed average weekday ridership to the Analysis 18 multivariate model does not significantly improve explanatory power. The F-test for the ridership term is not significant (F = 2.53, p = 0.116), and R² increases by only 1.5 pp (from 0.499 to 0.514). Ridership is not collinear with stop count or span (VIF = 1.73), but it is largely redundant with the existing predictors -- once route structure is controlled for, knowing how many people ride a route tells you almost nothing additional about its OTP.

Key Numbers

Model R² Adj R² n
Base (6 features, Analysis 18 replication) 0.499 0.464 92
Expanded (+ log_riders) 0.514 0.473 92
Bus-only base (5 features) 0.392 0.355 89
Bus-only expanded (+ log_riders) 0.410 0.367 89
Ridership-only (log_riders + is_rail) 0.232 0.215 92
  • F-test for log_riders (all routes): F = 2.53, p = 0.116
  • F-test for log_riders (bus only): F = 2.55, p = 0.114
  • log_riders beta weight: -0.16 (weakly negative, not significant)
  • VIF for log_riders: 1.73 (no multicollinearity concern)
  • Ridership-only model explains just 23.2% of variance (vs 49.9% for structural model)
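
The nested F-statistic can be approximately recovered from the summary numbers alone. A back-of-envelope sketch (using the rounded R² values above, so it lands near, not exactly on, the reported F = 2.53):

```python
# Nested F-test for one added predictor, reconstructed from R-squared values:
# F = ((R2_full - R2_base) / k_diff) / ((1 - R2_full) / (n - k_full - 1))
n = 92                     # routes in the sample
k_full = 7                 # predictors in the expanded model
k_diff = 1                 # one added predictor (log_riders)
r2_base, r2_full = 0.499, 0.514

f_stat = ((r2_full - r2_base) / k_diff) / ((1 - r2_full) / (n - k_full - 1))
print(round(f_stat, 2))    # ~2.59 from rounded inputs; the exact fit gives 2.53
```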

Expanded model coefficients

Feature Beta Weight p-value
stop_count -0.43 <0.001
span_km -0.47 <0.001
is_rail +0.43 <0.001
n_munis +0.35 0.005
log_riders -0.16 0.116
is_premium_bus +0.08 0.511
weekend_ratio +0.06 0.605
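
The beta weights in this table are standardized coefficients: the raw OLS coefficient scaled by sd(x)/sd(y), as computed in the `fit_ols` helper below. A minimal sketch on synthetic data (not the route data) illustrating the scaling; with a single predictor the beta weight reduces to the Pearson r:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)   # synthetic data, for illustration only

b = np.polyfit(x, y, 1)[0]                          # raw OLS slope
beta = b * np.std(x, ddof=1) / np.std(y, ddof=1)    # standardized beta weight
r = np.corrcoef(x, y)[0, 1]

print(abs(beta - r) < 1e-9)  # True: with one predictor, beta equals Pearson r
```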

Observations

  • The base model replicates Analysis 18's results closely (R² = 0.499 here vs 0.472 there); the small difference arises because this analysis is restricted to routes with ridership data, which excludes the 6 OTP-only routes and leaves the 92-route sample.
  • Log ridership has a weakly negative coefficient (beta = -0.16): higher-ridership routes tend to have slightly worse OTP after controls, but the effect is not statistically significant. This is consistent with Analysis 19's finding that ridership-weighted OTP is slightly higher than trip-weighted -- the two observations are not contradictory because the relationship is weak and confounded.
  • Log ridership is surprisingly uncorrelated with stop count (r = +0.15, p = 0.16) and span (r = -0.00, p = 1.00). It is most strongly correlated with weekend_ratio (r = +0.57) -- routes with more riders tend to have relatively more weekend service.
  • The ridership-only model (log_riders + is_rail, R² = 0.232) explains less than half the variance of the structural model (0.499), confirming that ridership is a weak proxy for the real drivers of OTP (stop count and span).
  • All VIFs remain below 3.0 in the expanded model, so multicollinearity is not a concern.
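
For intuition, a VIF can be inverted to recover the auxiliary R² of regressing that predictor on all the others, since VIF = 1 / (1 − R²_aux). Applying this to the reported VIF of 1.73 for log_riders:

```python
# Invert VIF = 1 / (1 - R2_aux) to recover the auxiliary R-squared.
vif = 1.73
r2_aux = 1 - 1 / vif
print(round(r2_aux, 3))  # 0.422: other predictors explain ~42% of log_riders' variance
```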

Discussion

This is a clean null result. Ridership does not add meaningful information to the OTP model once route structure is accounted for. The practical implication is that PRT does not need to worry that high ridership per se degrades OTP -- the routes with poor OTP happen to have high ridership because they are long, many-stop local bus corridors, not because passenger volumes cause delays. This is consistent with Analysis 10's finding that trip frequency does not predict OTP.

The weak negative direction of the ridership coefficient (-0.16 beta) is suggestive but not significant, and it could reflect residual confounding (high-ridership routes serve congested urban corridors) rather than a causal ridership-to-delay mechanism.

Caveats

  • Ridership is measured as average daily weekday riders per route, not per-trip load. Per-trip crowding data would better test the hypothesis that passenger volumes cause delays through dwell time.
  • The log transform assumes diminishing marginal effects; a linear specification also fails to reach significance (not shown).
  • The sample is restricted to 92 routes with both OTP and ridership data; 6 OTP-only routes are excluded.
  • As with Analysis 18, the model's unexplained 50% of variance likely requires operational data (traffic, staffing, schedule slack) not available in this dataset.

Methods

Methods: Ridership in Multivariate OTP Model

Question

Does average ridership add explanatory power to the Analysis 18 OLS model (stop count, span, mode, n_munis, premium, weekend ratio) or is it collinear with existing predictors?

Approach

  • Replicate the Analysis 18 six-feature OLS model on routes with ridership data available.
  • Add log-transformed average weekday ridership as a seventh predictor. The log transform is used because ridership is right-skewed and its effect on OTP is expected to diminish at scale (an increase from 100 to 200 riders should matter more than one from 5,000 to 5,100).
  • Compare adjusted R² between the six-feature and seven-feature models using a nested F-test.
  • Check VIF for the expanded model to assess multicollinearity with stop count and span.
  • If ridership is significant, test a reduced model (ridership + mode only) to see if ridership proxies for stop count.
  • Report beta weights, p-values, and model comparison statistics.
  • Repeat with bus-only subset.
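
The diminishing-effect rationale behind the log transform is easy to quantify: on the log scale, equal ratios (not equal rider counts) receive equal weight. A quick check:

```python
import math

# The same +100 riders is a doubling on a 100-rider route but only +2% on a
# 5,000-rider route; the log transform encodes exactly this asymmetry.
small_route = math.log(200) - math.log(100)    # log(2)    ~ 0.693
large_route = math.log(5100) - math.log(5000)  # log(1.02) ~ 0.020
print(round(small_route, 3), round(large_route, 3))
```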

Data

Name Description Source
otp_monthly route_id, month, otp (averaged to route-level) prt.db table
ridership_monthly route_id, month, avg_riders (averaged across all months); filtered to day_type='WEEKDAY' prt.db table
route_stops route_id, stop_id for stop counts prt.db table
stops lat, lon for geographic span computation prt.db table
routes route_id, mode for subtype classification prt.db table

Notes: Overlap period (Jan 2019 -- Oct 2024); exclude routes with fewer than 12 months of paired data.

Output

  • output/model_comparison.csv -- side-by-side regression results (base vs expanded vs bus-only)
  • output/vif_table.csv -- VIF values for expanded model
  • output/coefficient_comparison.png -- beta weight comparison between base and expanded models
  • output/partial_residual.png -- partial residual plot for log_ridership

Source Code

"""Analysis 26: Test whether ridership adds explanatory power to the Analysis 18 multivariate OTP model."""

import math

import numpy as np
import polars as pl
from scipy import stats

from prt_otp_analysis.common import (
    analysis_dir,
    classify_bus_route,
    correlate,
    phase,
    query_to_polars,
    run_analysis,
    save_chart,
    save_csv,
    setup_plotting,
)

OUT = analysis_dir(__file__)

MIN_MONTHS = 12


# ---------------------------------------------------------------------------
# Helpers (replicated from Analysis 18 to maintain independence)
# ---------------------------------------------------------------------------

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Return the great-circle distance in km between two lat/lon points."""
    R = 6371.0
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlon / 2) ** 2)
    return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))


def compute_span(lats: list[float], lons: list[float]) -> float:
    """Return the max pairwise haversine distance (km) among a set of points."""
    max_dist = 0.0
    n = len(lats)
    for i in range(n):
        for j in range(i + 1, n):
            d = haversine_km(lats[i], lons[i], lats[j], lons[j])
            if d > max_dist:
                max_dist = d
    return max_dist


def compute_vif(X_raw: np.ndarray, feature_names: list[str]) -> dict[str, float]:
    """Compute Variance Inflation Factor for each predictor."""
    n, k = X_raw.shape
    vifs = {}
    for j in range(k):
        y_j = X_raw[:, j]
        X_other = np.delete(X_raw, j, axis=1)
        X_other = np.column_stack([np.ones(n), X_other])
        beta, _, _, _ = np.linalg.lstsq(X_other, y_j, rcond=None)
        y_hat = X_other @ beta
        ss_res = np.sum((y_j - y_hat) ** 2)
        ss_tot = np.sum((y_j - np.mean(y_j)) ** 2)
        r2_j = 1 - ss_res / ss_tot if ss_tot > 0 else 0.0
        vifs[feature_names[j]] = 1.0 / (1.0 - r2_j) if r2_j < 1.0 else float("inf")
    return vifs


def fit_ols(y: np.ndarray, X_raw: np.ndarray, feature_names: list[str]) -> dict:
    """Fit OLS regression and return results dict."""
    n, k = X_raw.shape
    X = np.column_stack([np.ones(n), X_raw])

    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    residuals = y - y_hat

    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r_squared = 1 - ss_res / ss_tot
    adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
    mse = ss_res / (n - k - 1)

    XtX_inv = np.linalg.pinv(X.T @ X)
    se = np.sqrt(np.diag(XtX_inv) * mse)
    t_vals = beta / se
    p_vals = [2 * stats.t.sf(abs(t), df=n - k - 1) for t in t_vals]

    # Standardized beta weights
    x_stds = np.std(X_raw, axis=0, ddof=1)
    y_std = np.std(y, ddof=1)
    beta_weights = beta[1:] * x_stds / y_std

    return {
        "r_squared": r_squared,
        "adj_r_squared": adj_r_squared,
        "ss_res": ss_res,
        "n": n,
        "k": k,
        "features": ["intercept"] + feature_names,
        "coefficients": beta.tolist(),
        "std_errors": se.tolist(),
        "t_values": t_vals.tolist(),
        "p_values": p_vals,
        "beta_weights": [None] + beta_weights.tolist(),
        "y_hat": y_hat,
        "residuals": residuals,
    }


def f_test_nested(base: dict, full: dict) -> tuple[float, float]:
    """F-test comparing nested models. Returns (F_stat, p_value)."""
    k_diff = full["k"] - base["k"]
    n = full["n"]
    f_stat = ((base["ss_res"] - full["ss_res"]) / k_diff) / (full["ss_res"] / (n - full["k"] - 1))
    f_p = stats.f.sf(f_stat, k_diff, n - full["k"] - 1)
    return f_stat, f_p


def print_model(results: dict, label: str) -> None:
    """Print formatted model results."""
    print(f"\n  {label}:")
    print(f"  R2 = {results['r_squared']:.4f}, Adj R2 = {results['adj_r_squared']:.4f}, "
          f"n = {results['n']}, k = {results['k']}")
    print(f"  {'Feature':<20s} {'Coeff':>10s} {'Std Err':>10s} {'p-value':>10s} {'Beta':>10s}")
    print(f"  {'-'*60}")
    for i, feat in enumerate(results["features"]):
        coeff = results["coefficients"][i]
        se = results["std_errors"][i]
        p = results["p_values"][i]
        beta = results["beta_weights"][i]
        beta_str = f"{beta:>10.4f}" if beta is not None else f"{'--':>10s}"
        sig = "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else ""
        print(f"  {feat:<20s} {coeff:>10.6f} {se:>10.6f} {p:>10.4f} {beta_str} {sig}")


# ---------------------------------------------------------------------------
# Data loading
# ---------------------------------------------------------------------------

def load_features() -> pl.DataFrame:
    """Assemble all features including ridership for the regression model."""
    # Route-level average OTP (from paired months only)
    avg_otp = query_to_polars(f"""
        SELECT o.route_id, rt.route_name, rt.mode,
               AVG(o.otp) AS avg_otp, COUNT(*) AS months
        FROM otp_monthly o
        JOIN ridership_monthly r
            ON o.route_id = r.route_id AND o.month = r.month
            AND r.day_type = 'WEEKDAY'
        JOIN routes rt ON o.route_id = rt.route_id
        WHERE r.avg_riders IS NOT NULL
        GROUP BY o.route_id
        HAVING COUNT(*) >= {MIN_MONTHS}
    """)

    # Average weekday ridership
    avg_riders = query_to_polars("""
        SELECT route_id, AVG(avg_riders) AS avg_riders
        FROM ridership_monthly
        WHERE day_type = 'WEEKDAY' AND avg_riders IS NOT NULL
        GROUP BY route_id
    """)

    # Structural features
    stop_counts = query_to_polars("""
        SELECT route_id, COUNT(DISTINCT stop_id) AS stop_count
        FROM route_stops GROUP BY route_id
    """)
    trips = query_to_polars("""
        SELECT route_id,
               MAX(trips_wd) AS max_wd,
               MAX(trips_sa) AS max_sa,
               MAX(trips_su) AS max_su
        FROM route_stops GROUP BY route_id
    """)
    munis = query_to_polars("""
        SELECT rs.route_id, COUNT(DISTINCT s.muni) AS n_munis
        FROM route_stops rs
        JOIN stops s ON rs.stop_id = s.stop_id
        WHERE s.muni IS NOT NULL AND s.muni != '0'
        GROUP BY rs.route_id
    """)
    stops_by_route = query_to_polars("""
        SELECT rs.route_id, s.lat, s.lon
        FROM route_stops rs
        JOIN stops s ON rs.stop_id = s.stop_id
        WHERE s.lat IS NOT NULL AND s.lon IS NOT NULL
    """)
    spans = []
    for route_id in stops_by_route["route_id"].unique().sort().to_list():
        subset = stops_by_route.filter(pl.col("route_id") == route_id)
        span_km = compute_span(subset["lat"].to_list(), subset["lon"].to_list())
        spans.append({"route_id": route_id, "span_km": span_km})
    span_df = pl.DataFrame(spans)

    # Assemble
    df = avg_otp
    df = df.join(avg_riders, on="route_id", how="inner")
    df = df.join(stop_counts, on="route_id", how="left")
    df = df.join(trips, on="route_id", how="left")
    df = df.join(munis, on="route_id", how="left")
    df = df.join(span_df, on="route_id", how="left")

    # Derived features
    df = df.with_columns(
        pl.when(pl.col("max_wd") > 0)
        .then((pl.col("max_sa") + pl.col("max_su")) / (2.0 * pl.col("max_wd")))
        .otherwise(0.0)
        .alias("weekend_ratio"),
    )
    df = df.with_columns(
        pl.when(pl.col("mode") == "RAIL").then(1.0).otherwise(0.0).alias("is_rail"),
    )
    df = df.with_columns(
        pl.when(pl.col("mode") == "BUS")
        .then(pl.col("route_id").map_elements(classify_bus_route, return_dtype=pl.Utf8))
        .otherwise(pl.lit("non_bus"))
        .alias("bus_subtype"),
    )
    df = df.with_columns(
        pl.when(pl.col("bus_subtype").is_in(["busway", "flyer", "express", "limited"]))
        .then(1.0)
        .otherwise(0.0)
        .alias("is_premium_bus"),
    )
    df = df.with_columns(
        pl.col("avg_riders").log().alias("log_riders"),
    )

    df = df.drop_nulls(subset=["stop_count", "span_km", "weekend_ratio", "n_munis", "avg_riders"])

    return df


# ---------------------------------------------------------------------------
# Charts
# ---------------------------------------------------------------------------

def make_coefficient_chart(base: dict, expanded: dict) -> None:
    """Compare standardized coefficients between base and expanded models."""
    plt = setup_plotting()
    fig, ax = plt.subplots(figsize=(10, 7))

    # Get shared features (skip intercept)
    base_feats = base["features"][1:]
    exp_feats = expanded["features"][1:]
    base_betas = {f: b for f, b in zip(base_feats, base["beta_weights"][1:])}
    exp_betas = {f: b for f, b in zip(exp_feats, expanded["beta_weights"][1:])}
    exp_pvals = {f: p for f, p in zip(exp_feats, expanded["p_values"][1:])}

    all_feats = exp_feats  # expanded has all features
    y_pos = np.arange(len(all_feats))
    width = 0.35

    base_vals = [base_betas.get(f, 0.0) for f in all_feats]
    exp_vals = [exp_betas[f] for f in all_feats]

    bars1 = ax.barh(y_pos + width / 2, base_vals, width, label="Base (6 features)",
                     color="#9ca3af", alpha=0.7)
    bars2 = ax.barh(y_pos - width / 2, exp_vals, width, label="Expanded (+ridership)",
                     color="#2563eb", alpha=0.7)

    # Significance markers for expanded model
    for i, f in enumerate(all_feats):
        p = exp_pvals[f]
        marker = "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else ""
        val = exp_vals[i]
        ax.text(val + 0.01 if val >= 0 else val - 0.01, i - width / 2, marker,
                ha="left" if val >= 0 else "right", va="center", fontsize=9)

    ax.set_yticks(y_pos)
    ax.set_yticklabels(all_feats)
    ax.axvline(0, color="black", linewidth=0.5)
    ax.set_xlabel("Standardized Coefficient (Beta Weight)")
    ax.set_title(f"Model Comparison: Base R2={base['r_squared']:.3f} vs "
                 f"Expanded R2={expanded['r_squared']:.3f}")
    ax.legend(loc="lower right", fontsize=9)
    ax.invert_yaxis()

    save_chart(fig, OUT / "coefficient_comparison.png")


def make_partial_residual_chart(df: pl.DataFrame, base: dict) -> None:
    """Partial residual plot: base model residuals vs log_riders."""
    plt = setup_plotting()
    fig, ax = plt.subplots(figsize=(8, 6))

    residuals = base["residuals"]
    log_riders = df["log_riders"].to_numpy()

    ax.scatter(log_riders, residuals, alpha=0.5, s=30, color="#2563eb",
               edgecolors="white", linewidth=0.5)

    # Trend line
    slope, intercept, r, p, _ = stats.linregress(log_riders, residuals)
    x_line = np.array([log_riders.min(), log_riders.max()])
    ax.plot(x_line, slope * x_line + intercept, color="#e11d48", linewidth=1.5,
            linestyle="--", alpha=0.7, label=f"r={r:.3f}, p={p:.4f}")

    ax.axhline(0, color="gray", linewidth=0.5, alpha=0.5)
    ax.set_xlabel("Log(Average Weekday Ridership)")
    ax.set_ylabel("Base Model Residual (OTP)")
    ax.set_title("Partial Residual: Does Ridership Explain Remaining OTP Variance?")
    ax.legend(fontsize=9)

    save_chart(fig, OUT / "partial_residual.png")


# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------

@run_analysis(26, "Ridership in Multivariate OTP Model")
def main() -> None:
    """Entry point: load features, fit models, compare, chart, and save."""

    with phase("Loading and assembling features"):
        df = load_features()
        n_rail = len(df.filter(pl.col("is_rail") == 1.0))
        n_bus = len(df.filter(pl.col("mode") == "BUS"))
        print(f"  {len(df)} routes with complete feature set ({n_bus} BUS, {n_rail} RAIL)")
        print(f"  Ridership range: {df['avg_riders'].min():.0f} -- {df['avg_riders'].max():.0f}")
        print(f"  Log ridership range: {df['log_riders'].min():.2f} -- {df['log_riders'].max():.2f}")

    y = df["avg_otp"].to_numpy()

    # --- Model 1: Base (Analysis 18 replication, 6 features) ---
    base_features = ["stop_count", "span_km", "is_rail", "is_premium_bus",
                     "weekend_ratio", "n_munis"]
    X_base = np.column_stack([df[f].to_numpy().astype(float) for f in base_features])
    with phase("Fitting base model (6 features, Analysis 18 replication)"):
        base = fit_ols(y, X_base, base_features)
        print_model(base, "Base model (6 features)")

    # --- Model 2: Expanded (+ log_riders) ---
    exp_features = base_features + ["log_riders"]
    X_exp = np.column_stack([df[f].to_numpy().astype(float) for f in exp_features])
    with phase("Fitting expanded model (+ log_riders)"):
        expanded = fit_ols(y, X_exp, exp_features)
        print_model(expanded, "Expanded model (+ log_riders)")

        # F-test
        f_stat, f_p = f_test_nested(base, expanded)
        print(f"\n  F-test for log_riders: F = {f_stat:.3f}, p = {f_p:.4f}")
        print(f"  R2 change: {base['r_squared']:.4f} -> {expanded['r_squared']:.4f} "
              f"(+{expanded['r_squared'] - base['r_squared']:.4f})")
        print(f"  Adj R2 change: {base['adj_r_squared']:.4f} -> {expanded['adj_r_squared']:.4f} "
              f"(+{expanded['adj_r_squared'] - base['adj_r_squared']:.4f})")
        if f_p < 0.05:
            print("  => Ridership IS significant after controlling for structural features.")
        else:
            print("  => Ridership is NOT significant after controlling for structural features.")

        # --- VIF for expanded model ---
        print("\n--- VIF (Expanded Model) ---")
        vifs = compute_vif(X_exp, exp_features)
        for feat, vif in vifs.items():
            flag = " ** HIGH" if vif > 5 else ""
            print(f"  {feat:<20s} VIF = {vif:.2f}{flag}")

    # --- Model 3: Ridership-only (log_riders + is_rail) ---
    rider_features = ["log_riders", "is_rail"]
    X_rider = np.column_stack([df[f].to_numpy().astype(float) for f in rider_features])
    with phase("Fitting ridership-only model (log_riders + is_rail)"):
        rider_only = fit_ols(y, X_rider, rider_features)
        print_model(rider_only, "Ridership-only model")

    # --- Model 4: Bus-only expanded ---
    bus_df = df.filter(pl.col("mode") == "BUS")
    y_bus = bus_df["avg_otp"].to_numpy()
    bus_base_feats = ["stop_count", "span_km", "is_premium_bus", "weekend_ratio", "n_munis"]
    bus_exp_feats = bus_base_feats + ["log_riders"]

    X_bus_base = np.column_stack([bus_df[f].to_numpy().astype(float) for f in bus_base_feats])
    X_bus_exp = np.column_stack([bus_df[f].to_numpy().astype(float) for f in bus_exp_feats])

    with phase(f"Fitting bus-only models ({len(bus_df)} routes)"):
        bus_base = fit_ols(y_bus, X_bus_base, bus_base_feats)
        print_model(bus_base, "Bus-only base (5 features)")

        bus_expanded = fit_ols(y_bus, X_bus_exp, bus_exp_feats)
        print_model(bus_expanded, "Bus-only expanded (+ log_riders)")

        f_bus, fp_bus = f_test_nested(bus_base, bus_expanded)
        print(f"\n  Bus-only F-test for log_riders: F = {f_bus:.3f}, p = {fp_bus:.4f}")
        print(f"  R2 change: {bus_base['r_squared']:.4f} -> {bus_expanded['r_squared']:.4f} "
              f"(+{bus_expanded['r_squared'] - bus_base['r_squared']:.4f})")

        # --- Correlation between ridership and existing predictors ---
        print("\n--- Correlations: log_riders vs structural features ---")
        for feat in base_features:
            corr = correlate(df, feat, "log_riders")
            sig = "*" if corr["pearson_p"] < 0.05 else ""
            print(f"  log_riders vs {feat:<20s}: r = {corr['pearson_r']:+.3f}, p = {corr['pearson_p']:.4f} {sig}")

    with phase("Saving CSVs"):
        # Model comparison
        rows = []
        for model, label in [(base, "base_6feat"), (expanded, "expanded_7feat"),
                              (rider_only, "ridership_only"), (bus_base, "bus_base"),
                              (bus_expanded, "bus_expanded")]:
            for i, feat in enumerate(model["features"]):
                rows.append({
                    "model": label,
                    "feature": feat,
                    "coefficient": model["coefficients"][i],
                    "std_error": model["std_errors"][i],
                    "p_value": model["p_values"][i],
                    "beta_weight": model["beta_weights"][i] if model["beta_weights"][i] is not None else float("nan"),
                    "r_squared": model["r_squared"],
                    "adj_r_squared": model["adj_r_squared"],
                    "n": model["n"],
                })
        model_comparison_df = pl.DataFrame(rows)
        save_csv(model_comparison_df, OUT / "model_comparison.csv")

        # VIF table
        vif_rows = [{"feature": f, "vif": v} for f, v in vifs.items()]
        vif_df = pl.DataFrame(vif_rows)
        save_csv(vif_df, OUT / "vif_table.csv")

    with phase("Generating charts"):
        make_coefficient_chart(base, expanded)
        make_partial_residual_chart(df, base)


if __name__ == "__main__":
    main()

Sources

Name Type Why It Matters Owner Freshness Caveat
otp_monthly table Primary analytical table used in this page's computations. Produced by Data Ingestion. Updated when the producing pipeline step is rerun. Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5)
  • file data/routes_by_month.csv — Monthly route OTP source table in wide format.
  • file data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv — Current route metadata and mode classifications.
  • file data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv — Current stop-to-route coverage and trip counts.
  • file data/PRT_Stop_Reference_Lookup_Table.csv — Historical stop reference file with geography attributes.
  • file data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv — Average ridership by route and month.
ridership_monthly table Primary analytical table used in this page's computations. Produced by Data Ingestion. Updated when the producing pipeline step is rerun. Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5): same five files as listed under otp_monthly.
route_stops table Primary analytical table used in this page's computations. Produced by Data Ingestion. Updated when the producing pipeline step is rerun. Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5): same five files as listed under otp_monthly.
routes table Primary analytical table used in this page's computations. Produced by Data Ingestion. Updated when the producing pipeline step is rerun. Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5): same five files as listed under otp_monthly.
stops table Primary analytical table used in this page's computations. Produced by Data Ingestion. Updated when the producing pipeline step is rerun. Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5): same five files as listed under otp_monthly.
numpy dependency Runtime dependency required for this page's pipeline or analysis code. Open-source Python ecosystem maintainers. Version pinned by project environment until dependency updates are applied. Library updates may change behavior or defaults.
polars dependency Runtime dependency required for this page's pipeline or analysis code. Open-source Python ecosystem maintainers. Version pinned by project environment until dependency updates are applied. Library updates may change behavior or defaults.
scipy dependency Runtime dependency required for this page's pipeline or analysis code. Open-source Python ecosystem maintainers. Version pinned by project environment until dependency updates are applied. Library updates may change behavior or defaults.