
18: Multivariate OTP Model

Route and Service Drivers

Coverage: 2019-01 to 2025-11 (from otp_monthly).

Built 2026-04-03 20:09 UTC · Commit 7c56b9a


Data Provenance

flowchart LR
  18_multivariate_model(["18: Multivariate OTP Model"])
  t_otp_monthly[("otp_monthly")] --> 18_multivariate_model
  01_data_ingestion[["Data Ingestion"]] --> t_otp_monthly
  u1_01_data_ingestion[/"data/routes_by_month.csv"/] --> 01_data_ingestion
  u2_01_data_ingestion[/"data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv"/] --> 01_data_ingestion
  u3_01_data_ingestion[/"data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv"/] --> 01_data_ingestion
  u4_01_data_ingestion[/"data/PRT_Stop_Reference_Lookup_Table.csv"/] --> 01_data_ingestion
  u5_01_data_ingestion[/"data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv"/] --> 01_data_ingestion
  t_route_stops[("route_stops")] --> 18_multivariate_model
  01_data_ingestion[["Data Ingestion"]] --> t_route_stops
  t_routes[("routes")] --> 18_multivariate_model
  01_data_ingestion[["Data Ingestion"]] --> t_routes
  t_stops[("stops")] --> 18_multivariate_model
  01_data_ingestion[["Data Ingestion"]] --> t_stops
  d1_18_multivariate_model(("numpy (lib)")) --> 18_multivariate_model
  d2_18_multivariate_model(("polars (lib)")) --> 18_multivariate_model
  d3_18_multivariate_model(("scipy (lib)")) --> 18_multivariate_model
  classDef page fill:#dbeafe,stroke:#1d4ed8,color:#1e3a8a,stroke-width:2px;
  classDef table fill:#ecfeff,stroke:#0e7490,color:#164e63;
  classDef dep fill:#fff7ed,stroke:#c2410c,color:#7c2d12,stroke-dasharray: 4 2;
  classDef file fill:#eef2ff,stroke:#6366f1,color:#3730a3;
  classDef api fill:#f0fdf4,stroke:#16a34a,color:#14532d;
  classDef pipeline fill:#f5f3ff,stroke:#7c3aed,color:#4c1d95;
  class 18_multivariate_model page;
  class t_otp_monthly,t_route_stops,t_routes,t_stops table;
  class d1_18_multivariate_model,d2_18_multivariate_model,d3_18_multivariate_model dep;
  class u1_01_data_ingestion,u2_01_data_ingestion,u3_01_data_ingestion,u4_01_data_ingestion,u5_01_data_ingestion file;
  class 01_data_ingestion pipeline;

Findings: Multivariate OTP Model

Summary

A six-feature OLS model explains 47.2% of OTP variance (adjusted R² = 0.435) across 92 routes. Four features reach statistical significance: stop count, geographic span, rail mode, and number of municipalities (the last acting as a suppressor; see Observations). The model confirms that structural route characteristics explain nearly half of all performance variation, with the remaining ~53% attributable to unmeasured factors (traffic, staffing, weather, schedule design).

Key Numbers

  • R² = 0.472, Adjusted R² = 0.435
  • n = 92 routes, k = 6 features
Feature          Coefficient   p-value   Beta Weight   Significant?
stop_count           -0.0006    <0.001         -0.49   ***
span_km              -0.0049    <0.001         -0.47   ***
is_rail              +0.133     <0.001         +0.34   ***
n_munis              +0.009      0.001         +0.41   **
is_premium_bus       +0.007      0.70          +0.05
weekend_ratio        -0.006      0.79          -0.03
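As a sanity check, the reported adjusted R² follows directly from R², n, and k via the standard adjustment formula; the values below are taken from the table above.

```python
# Adjusted R² from the reported R², n (routes), and k (features).
r2, n, k = 0.472, 92, 6
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 3))  # -> 0.435, matching the reported value
```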

Observations

  • Stop count (beta = -0.49) and geographic span (beta = -0.47) are the two strongest predictors, with nearly equal standardized effects. Each additional stop costs roughly 0.06 pp of OTP; each additional km of span costs roughly 0.49 pp.
  • Rail mode (beta = +0.34) provides a +13.3 pp OTP advantage over bus after controlling for structural factors. This is partly inherent to rail (dedicated right-of-way) and not fully captured by other features.
  • Number of municipalities has a surprising positive coefficient (+0.009 per muni, beta = +0.41). This is likely a suppressor effect: after controlling for stop count and span, routes that serve more municipalities tend to be express/busway routes that skip stops, which perform better.
  • Premium bus (busway/flyer/express/limited) is not significant once stop count and span are controlled -- these routes' advantage is fully explained by having fewer stops and shorter spans, not by their service type per se.
  • Weekend ratio is not significant, confirming the null finding from Analysis 17.
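The per-unit effects quoted above are just the raw coefficients rescaled, assuming OTP is stored as a proportion (0-1) as it is in the model code:

```python
# Raw coefficients (OTP as a proportion, 0-1) -> percentage-point effects
# per unit of each feature. Values come from the findings table above.
coeffs = {"stop_count": -0.0006, "span_km": -0.0049, "is_rail": 0.133}
effects_pp = {feat: coef * 100 for feat, coef in coeffs.items()}
# stop_count: -0.06 pp per stop; span_km: -0.49 pp per km; is_rail: +13.3 pp
```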

Model Interpretation

The model's R² of 0.47 means that a route's stop count, length, and mode alone account for nearly half of the between-route variation in OTP. The unexplained 53% likely reflects:

  • Traffic conditions and road geometry
  • Schedule design (running time padding, layover adequacy)
  • Driver staffing and experience
  • Passenger volume and dwell time variation
  • Weather and construction effects

These factors would require operational data not present in this dataset.

Caveats

  • OLS assumes linear relationships and independent errors. Some non-linearity is visible in the residuals.
  • The n_munis coefficient is a suppressor and should not be interpreted as "more municipalities = better OTP."
  • With only 92 observations and 6 predictors, the model is at the edge of stable estimation.
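The residual non-linearity noted above can be checked numerically: regress the residuals on the squared (centered) fitted values and look for a nonzero slope. This is a sketch using synthetic stand-ins for the fitted values and residuals, with curvature deliberately built in; in the real analysis y_hat and the residuals would come from fit_ols.

```python
import numpy as np

rng = np.random.default_rng(1)
y_hat = rng.uniform(0.55, 0.80, size=92)      # stand-in fitted OTP values
resid = 2.0 * (y_hat - y_hat.mean()) ** 2 + rng.normal(scale=0.005, size=92)

z = (y_hat - y_hat.mean()) ** 2               # curvature regressor
slope = np.polyfit(z, resid, 1)[0]            # ~0 if residuals are flat
print(slope > 0)  # True: positive slope flags the built-in curvature
```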


Methods: Multivariate OTP Model

Question

How much of OTP variation can we explain with available structural variables? Which factors matter most when all are considered simultaneously? Previous analyses found effects for stop count (Analysis 07), mode (Analysis 02), and geographic span (Analysis 12), but these were tested individually. A multivariate model quantifies relative importance and reveals how much variance remains unexplained.

Approach

  • For each route with OTP data, assemble features: stop count, geographic span (km), mode (BUS/RAIL dummy), bus subtype (local/limited/express/busway/flyer dummies), weekend service ratio, and number of municipalities served.
  • Fit an OLS regression with average OTP as the dependent variable.
  • Report R-squared, adjusted R-squared, and per-feature coefficients with p-values.
  • Use standardized coefficients (beta weights) to compare relative importance across features with different scales.
  • Generate a coefficient plot and a predicted-vs-actual scatter plot.
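The beta-weight step above can be sketched in a few lines: fit OLS on the raw features, then rescale each slope by sd(x_j)/sd(y). The toy X and y below are illustrative, not real route data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(92, 2)) * [50.0, 10.0]    # two features on very different scales
y = -0.0006 * X[:, 0] - 0.005 * X[:, 1] + rng.normal(scale=0.02, size=92)

Xd = np.column_stack([np.ones(len(y)), X])     # add intercept column
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)  # raw OLS coefficients
beta_weights = beta[1:] * X.std(axis=0, ddof=1) / y.std(ddof=1)
# beta_weights are now scale-free and directly comparable across features
```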

Data

Name          Description                                                 Source
otp_monthly   Monthly OTP per route (averaged)                            prt.db table
route_stops   Stop count, trip counts per route                           prt.db table
stops         Lat/lon for span calculation, muni for jurisdiction count   prt.db table
routes        Mode classification                                         prt.db table

Output

  • output/model_coefficients.csv -- feature, coefficient, std error, p-value, beta weight
  • output/coefficient_plot.png -- horizontal bar chart of standardized coefficients
  • output/predicted_vs_actual.png -- scatter plot with 1:1 line

Source Code

"""Multivariate OLS model combining structural predictors of OTP."""

import math

import numpy as np
import polars as pl
from scipy import stats

from prt_otp_analysis.common import (
    analysis_dir,
    classify_bus_route,
    phase,
    query_to_polars,
    run_analysis,
    save_chart,
    save_csv,
    setup_plotting,
)

OUT = analysis_dir(__file__)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Return the great-circle distance in km between two lat/lon points."""
    R = 6371.0
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlon / 2) ** 2)
    return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))


def compute_span(lats: list[float], lons: list[float]) -> float:
    """Return the max pairwise haversine distance (km) among a set of points."""
    max_dist = 0.0
    n = len(lats)
    for i in range(n):
        for j in range(i + 1, n):
            d = haversine_km(lats[i], lons[i], lats[j], lons[j])
            if d > max_dist:
                max_dist = d
    return max_dist


def load_features() -> pl.DataFrame:
    """Assemble all features for the regression model."""
    avg_otp = query_to_polars("""
        SELECT o.route_id, r.route_name, r.mode,
               AVG(o.otp) AS avg_otp, COUNT(*) AS months
        FROM otp_monthly o
        JOIN routes r ON o.route_id = r.route_id
        GROUP BY o.route_id
        HAVING COUNT(*) >= 12
    """)
    stop_counts = query_to_polars("""
        SELECT route_id, COUNT(DISTINCT stop_id) AS stop_count
        FROM route_stops
        GROUP BY route_id
    """)
    trips = query_to_polars("""
        SELECT route_id,
               MAX(trips_wd) AS max_wd,
               MAX(trips_sa) AS max_sa,
               MAX(trips_su) AS max_su
        FROM route_stops
        GROUP BY route_id
    """)
    munis = query_to_polars("""
        SELECT rs.route_id, COUNT(DISTINCT s.muni) AS n_munis
        FROM route_stops rs
        JOIN stops s ON rs.stop_id = s.stop_id
        WHERE s.muni IS NOT NULL AND s.muni != '0'
        GROUP BY rs.route_id
    """)
    stops_by_route = query_to_polars("""
        SELECT rs.route_id, s.lat, s.lon
        FROM route_stops rs
        JOIN stops s ON rs.stop_id = s.stop_id
        WHERE s.lat IS NOT NULL AND s.lon IS NOT NULL
    """)
    spans = []
    for route_id in stops_by_route["route_id"].unique().sort().to_list():
        subset = stops_by_route.filter(pl.col("route_id") == route_id)
        span_km = compute_span(subset["lat"].to_list(), subset["lon"].to_list())
        spans.append({"route_id": route_id, "span_km": span_km})
    span_df = pl.DataFrame(spans)

    df = avg_otp
    df = df.join(stop_counts, on="route_id", how="left")
    df = df.join(trips, on="route_id", how="left")
    df = df.join(munis, on="route_id", how="left")
    df = df.join(span_df, on="route_id", how="left")

    df = df.with_columns(
        pl.when(pl.col("max_wd") > 0)
        .then((pl.col("max_sa") + pl.col("max_su")) / (2.0 * pl.col("max_wd")))
        .otherwise(0.0)
        .alias("weekend_ratio")
    )
    df = df.with_columns(
        pl.when(pl.col("mode") == "RAIL").then(1.0).otherwise(0.0).alias("is_rail")
    )
    df = df.with_columns(
        pl.when(pl.col("mode") == "BUS")
        .then(pl.col("route_id").map_elements(classify_bus_route, return_dtype=pl.Utf8))
        .otherwise(pl.lit("non_bus"))
        .alias("bus_subtype")
    )
    df = df.with_columns(
        pl.when(pl.col("bus_subtype").is_in(["busway", "flyer", "express", "limited"]))
        .then(1.0)
        .otherwise(0.0)
        .alias("is_premium_bus")
    )
    df = df.drop_nulls(subset=["stop_count", "span_km", "weekend_ratio", "n_munis"])

    return df


def compute_vif(X_raw: np.ndarray, feature_names: list[str]) -> dict[str, float]:
    """Compute Variance Inflation Factor for each predictor."""
    n, k = X_raw.shape
    vifs = {}
    for j in range(k):
        y_j = X_raw[:, j]
        X_other = np.delete(X_raw, j, axis=1)
        X_other = np.column_stack([np.ones(n), X_other])
        beta, residuals, _, _ = np.linalg.lstsq(X_other, y_j, rcond=None)
        y_hat = X_other @ beta
        ss_res = np.sum((y_j - y_hat) ** 2)
        ss_tot = np.sum((y_j - np.mean(y_j)) ** 2)
        r2_j = 1 - ss_res / ss_tot if ss_tot > 0 else 0.0
        vifs[feature_names[j]] = 1.0 / (1.0 - r2_j) if r2_j < 1.0 else float("inf")
    return vifs


def fit_ols(y: np.ndarray, X_raw: np.ndarray, feature_names: list[str]) -> tuple[dict, np.ndarray]:
    """Fit OLS regression using lstsq for numerical stability."""
    n, k = X_raw.shape
    X = np.column_stack([np.ones(n), X_raw])

    # OLS via least squares (more stable than matrix inversion)
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    residuals = y - y_hat

    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r_squared = 1 - ss_res / ss_tot
    adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
    mse = ss_res / (n - k - 1)

    # Standard errors via (X'X)^-1 * MSE (using pseudoinverse for stability)
    XtX_inv = np.linalg.pinv(X.T @ X)
    se = np.sqrt(np.diag(XtX_inv) * mse)
    t_vals = beta / se
    p_vals = [2 * stats.t.sf(abs(t), df=n - k - 1) for t in t_vals]  # sf avoids precision loss from 1 - cdf

    # Standardized beta weights (using sample std, ddof=1)
    x_stds = np.std(X_raw, axis=0, ddof=1)
    y_std = np.std(y, ddof=1)
    beta_weights = beta[1:] * x_stds / y_std

    results = {
        "r_squared": r_squared,
        "adj_r_squared": adj_r_squared,
        "n": n,
        "k": k,
        "features": ["intercept"] + feature_names,
        "coefficients": beta.tolist(),
        "std_errors": se.tolist(),
        "t_values": t_vals.tolist(),
        "p_values": p_vals,
        "beta_weights": [None] + beta_weights.tolist(),
    }
    return results, y_hat


def print_model(results: dict, label: str) -> None:
    """Print formatted model results."""
    print(f"\n  {label}:")
    print(f"  R² = {results['r_squared']:.4f}, Adj R² = {results['adj_r_squared']:.4f}, "
          f"n = {results['n']}, k = {results['k']}")
    print(f"  {'Feature':<20s} {'Coeff':>10s} {'Std Err':>10s} {'p-value':>10s} {'Beta':>10s}")
    print(f"  {'-'*60}")
    for i, feat in enumerate(results["features"]):
        coeff = results["coefficients"][i]
        se = results["std_errors"][i]
        p = results["p_values"][i]
        beta = results["beta_weights"][i]
        beta_str = f"{beta:>10.4f}" if beta is not None else f"{'--':>10s}"
        sig = "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else ""
        print(f"  {feat:<20s} {coeff:>10.6f} {se:>10.6f} {p:>10.4f} {beta_str} {sig}")


def make_charts(df: pl.DataFrame, results: dict, y_hat: np.ndarray) -> None:
    """Generate coefficient plot and predicted vs actual scatter."""
    plt = setup_plotting()

    # Coefficient plot (standardized)
    fig, ax = plt.subplots(figsize=(10, 6))
    features = results["features"][1:]
    betas = results["beta_weights"][1:]
    p_vals = results["p_values"][1:]
    colors = ["#22c55e" if p < 0.05 else "#9ca3af" for p in p_vals]

    y_pos = range(len(features))
    ax.barh(y_pos, betas, color=colors, edgecolor="white")
    ax.set_yticks(y_pos)
    ax.set_yticklabels(features)
    ax.axvline(0, color="black", linewidth=0.5)
    ax.set_xlabel("Standardized Coefficient (Beta Weight)")
    ax.set_title(f"Multivariate OTP Model: Feature Importance\n"
                 f"R²={results['r_squared']:.3f}, Adj R²={results['adj_r_squared']:.3f}, n={results['n']}")

    for i, (b, p) in enumerate(zip(betas, p_vals)):
        marker = "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else ""
        ax.text(b + 0.01 if b >= 0 else b - 0.01, i, marker,
                ha="left" if b >= 0 else "right", va="center", fontsize=10)

    ax.invert_yaxis()
    save_chart(fig, OUT / "coefficient_plot.png")

    # Predicted vs actual
    fig, ax = plt.subplots(figsize=(8, 8))
    y_actual = df["avg_otp"].to_list()
    ax.scatter(y_hat, y_actual, s=30, alpha=0.6, color="#3b82f6", edgecolors="white", linewidths=0.5)
    lims = [min(min(y_hat), min(y_actual)) - 0.02, max(max(y_hat), max(y_actual)) + 0.02]
    ax.plot(lims, lims, "--", color="#dc2626", linewidth=1, label="Perfect prediction")
    ax.set_xlabel("Predicted OTP")
    ax.set_ylabel("Actual OTP")
    ax.set_title(f"Predicted vs Actual OTP (R²={results['r_squared']:.3f})")
    ax.legend()
    ax.set_xlim(lims)
    ax.set_ylim(lims)
    ax.set_aspect("equal")
    save_chart(fig, OUT / "predicted_vs_actual.png")


@run_analysis(18, "Multivariate OTP Model")
def main() -> None:
    """Entry point: load features, fit models, report, and visualize."""
    with phase("Loading and assembling features"):
        df = load_features()
        n_rail = len(df.filter(pl.col("is_rail") == 1.0))
        n_bus = len(df.filter(pl.col("mode") == "BUS"))
        print(f"  {len(df)} routes with complete feature set ({n_bus} BUS, {n_rail} RAIL)")

    # --- Full model (all 6 features) ---
    full_features = ["stop_count", "span_km", "is_rail", "is_premium_bus",
                     "weekend_ratio", "n_munis"]
    y = np.array(df["avg_otp"].to_list())
    X_full = np.column_stack([np.array(df[f].to_list(), dtype=float) for f in full_features])

    with phase("Computing VIF (Variance Inflation Factors)"):
        vifs = compute_vif(X_full, full_features)
        for feat, vif in vifs.items():
            flag = " ** HIGH" if vif > 5 else ""
            print(f"  {feat:<20s} VIF = {vif:.2f}{flag}")

    with phase("Fitting full model (6 features)"):
        full_results, y_hat_full = fit_ols(y, X_full, full_features)
        print_model(full_results, "Full model")

    # --- Reduced model (without n_munis suppressor) ---
    reduced_features = ["stop_count", "span_km", "is_rail", "is_premium_bus", "weekend_ratio"]
    X_reduced = np.column_stack([np.array(df[f].to_list(), dtype=float) for f in reduced_features])
    with phase("Fitting reduced model (without n_munis)"):
        reduced_results, _ = fit_ols(y, X_reduced, reduced_features)
        print_model(reduced_results, "Reduced model (no n_munis)")

    # --- Bus-only model ---
    bus_df = df.filter(pl.col("mode") == "BUS")
    bus_features = ["stop_count", "span_km", "is_premium_bus", "weekend_ratio", "n_munis"]
    y_bus = np.array(bus_df["avg_otp"].to_list())
    X_bus = np.column_stack([np.array(bus_df[f].to_list(), dtype=float) for f in bus_features])
    with phase(f"Fitting bus-only model ({len(bus_df)} routes)"):
        bus_results, _ = fit_ols(y_bus, X_bus, bus_features)
        print_model(bus_results, "Bus-only model")

    # Save full model coefficients
    with phase("Saving full model coefficients"):
        coeff_df = pl.DataFrame({
            "feature": full_results["features"],
            "coefficient": full_results["coefficients"],
            "std_error": full_results["std_errors"],
            "t_value": full_results["t_values"],
            "p_value": full_results["p_values"],
            "beta_weight": [b if b is not None else float("nan") for b in full_results["beta_weights"]],
        })
        save_csv(coeff_df, OUT / "model_coefficients.csv")

    with phase("Generating charts"):
        make_charts(df, full_results, y_hat_full)


if __name__ == "__main__":
    main()

Sources

Name · Type · Why It Matters · Owner · Freshness · Caveat
otp_monthly table Primary analytical table used in this page's computations. Produced by Data Ingestion. Updated when the producing pipeline step is rerun. Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5)
  • file data/routes_by_month.csv — Monthly route OTP source table in wide format.
  • file data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv — Current route metadata and mode classifications.
  • file data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv — Current stop-to-route coverage and trip counts.
  • file data/PRT_Stop_Reference_Lookup_Table.csv — Historical stop reference file with geography attributes.
  • file data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv — Average ridership by route and month.
route_stops table Primary analytical table used in this page's computations. Produced by Data Ingestion. Updated when the producing pipeline step is rerun. Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5): same five files as listed under otp_monthly above.
routes table Primary analytical table used in this page's computations. Produced by Data Ingestion. Updated when the producing pipeline step is rerun. Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5): same five files as listed under otp_monthly above.
stops table Primary analytical table used in this page's computations. Produced by Data Ingestion. Updated when the producing pipeline step is rerun. Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5): same five files as listed under otp_monthly above.
numpy dependency Runtime dependency required for this page's pipeline or analysis code. Open-source Python ecosystem maintainers. Version pinned by project environment until dependency updates are applied. Library updates may change behavior or defaults.
polars dependency Runtime dependency required for this page's pipeline or analysis code. Open-source Python ecosystem maintainers. Version pinned by project environment until dependency updates are applied. Library updates may change behavior or defaults.
scipy dependency Runtime dependency required for this page's pipeline or analysis code. Open-source Python ecosystem maintainers. Version pinned by project environment until dependency updates are applied. Library updates may change behavior or defaults.