Analysis

13: Cross-Route Correlation Clustering

Route and Service Drivers

Coverage: 2019-01 to 2025-11 (from otp_monthly).

Built 2026-04-03 20:09 UTC · Commit 7c56b9a

Page Navigation

Analysis Navigation

Data Provenance

flowchart LR
  13_correlation_clustering(["13: Cross-Route Correlation Clustering"])
  t_otp_monthly[("otp_monthly")] --> 13_correlation_clustering
  01_data_ingestion[["Data Ingestion"]] --> t_otp_monthly
  u1_01_data_ingestion[/"data/routes_by_month.csv"/] --> 01_data_ingestion
  u2_01_data_ingestion[/"data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv"/] --> 01_data_ingestion
  u3_01_data_ingestion[/"data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv"/] --> 01_data_ingestion
  u4_01_data_ingestion[/"data/PRT_Stop_Reference_Lookup_Table.csv"/] --> 01_data_ingestion
  u5_01_data_ingestion[/"data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv"/] --> 01_data_ingestion
  t_route_stops[("route_stops")] --> 13_correlation_clustering
  01_data_ingestion[["Data Ingestion"]] --> t_route_stops
  t_routes[("routes")] --> 13_correlation_clustering
  01_data_ingestion[["Data Ingestion"]] --> t_routes
  d1_13_correlation_clustering(("numpy (lib)")) --> 13_correlation_clustering
  d2_13_correlation_clustering(("polars (lib)")) --> 13_correlation_clustering
  d3_13_correlation_clustering(("scipy (lib)")) --> 13_correlation_clustering
  classDef page fill:#dbeafe,stroke:#1d4ed8,color:#1e3a8a,stroke-width:2px;
  classDef table fill:#ecfeff,stroke:#0e7490,color:#164e63;
  classDef dep fill:#fff7ed,stroke:#c2410c,color:#7c2d12,stroke-dasharray: 4 2;
  classDef file fill:#eef2ff,stroke:#6366f1,color:#3730a3;
  classDef api fill:#f0fdf4,stroke:#16a34a,color:#14532d;
  classDef pipeline fill:#f5f3ff,stroke:#7c3aed,color:#4c1d95;
  class 13_correlation_clustering page;
  class t_otp_monthly,t_route_stops,t_routes table;
  class d1_13_correlation_clustering,d2_13_correlation_clustering,d3_13_correlation_clustering dep;
  class u1_01_data_ingestion,u2_01_data_ingestion,u3_01_data_ingestion,u4_01_data_ingestion,u5_01_data_ingestion file;
  class 01_data_ingestion pipeline;

Findings

Findings: Cross-Route Correlation Clustering

Summary

After detrending (subtracting system-wide monthly mean OTP), hierarchical clustering on pairwise OTP correlations produced 6 clusters (selected by silhouette score optimization). Detrending removes the dominant COVID/seasonal signal that otherwise causes all routes to appear positively correlated, revealing genuine differential behavior.

Key Numbers

Cluster Routes Avg OTP Avg Stops Character
1 23 66.8% 118 Lowest-performing cluster, high stop count
2 9 74.0% 114 Best-performing cluster
3 27 71.8% 105 Largest cluster, above-average performance
4 11 69.4% 106 Mid-performing
5 13 67.5% 108 Below-average
6 10 68.9% 146 Highest stop count
  • Silhouette scores ranged from 0.13 to 0.18 (k=6 was optimal at 0.178)
  • 0 route pairs excluded for insufficient overlap

Methodology

  • Detrending: System-wide monthly mean OTP is subtracted from each route's OTP before computing correlations. This ensures clusters reflect differential route behavior rather than common system-wide trends (COVID, seasonal cycles).
  • Linkage: Average linkage (valid for non-Euclidean correlation distance), replacing the original Ward's method.
  • Cluster count: Selected by silhouette score optimization over k=3..10, replacing the hardcoded k=6.
  • Imputation: Route pairs with insufficient overlap are imputed with median correlation rather than 0.0.

Observations

  • With detrending, the clusters are more evenly sized and differentiated by OTP level.
  • The highest stop-count cluster (Cluster 6, 146 avg stops) does not have the worst OTP (68.9%), suggesting that after removing system-wide trends, route complexity alone doesn't determine co-movement.
  • Low silhouette scores (0.13-0.18) indicate moderate cluster separation -- routes don't form tightly distinct groups. This is expected given that detrended OTP residuals have limited signal.
  • Without depot or corridor data, the clusters remain descriptive -- we can see which routes move together but cannot explain why.

Caveats

  • Silhouette scores below 0.25 indicate weak cluster structure. The 6-cluster solution is the best available, but route co-movement may be more continuous than discrete.
  • All modes are pooled. With only ~3 rail routes, stratification by mode is not feasible.

Review History

Output

Methods

Methods: Cross-Route Correlation Clustering

Question

Which routes' OTP values move together over time? Identifying co-moving clusters could reveal shared causal factors -- same corridor, same depot, same traffic bottleneck, or common external shocks.

Approach

  • Build a route x month matrix of OTP values from otp_monthly.
  • Filter to routes with at least 36 months of data to ensure meaningful correlations.
  • Compute pairwise Pearson correlation between every pair of routes' monthly OTP series (using only overlapping months).
  • Convert correlation to distance (1 - r) and apply Ward's hierarchical clustering via scipy.cluster.hierarchy.
  • Cut the dendrogram to produce a manageable number of clusters (target ~5-8).
  • Characterize each cluster by mode, average OTP, stop count, and geographic location.
  • Visualize with a dendrogram and a clustered correlation heatmap.

Data

Name Description Source
otp_monthly Monthly OTP per route (time series) prt.db table
routes Mode and name for labeling prt.db table
route_stops Stop count per route for cluster characterization prt.db table
stops Geography for cluster characterization prt.db table

Notes: route_stops is joined to stops to derive stop count and geographic location per route.

Output

  • output/cluster_membership.csv -- route-to-cluster assignment with cluster characteristics
  • output/dendrogram.png -- hierarchical clustering dendrogram
  • output/correlation_heatmap.png -- pairwise correlation matrix ordered by cluster

Source Code

"""Cluster routes by detrended OTP time-series correlation to find co-moving groups."""

import numpy as np
import polars as pl
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import squareform

from prt_otp_analysis.common import analysis_dir, phase, query_to_polars, run_analysis, save_chart, save_csv, setup_plotting

OUT = analysis_dir(__file__)

MIN_MONTHS = 36
MIN_OVERLAP = 24


def load_data() -> tuple[pl.DataFrame, pl.DataFrame]:
    """Load OTP time series and route metadata."""
    otp = query_to_polars("""
        SELECT o.route_id, r.route_name, r.mode, o.month, o.otp
        FROM otp_monthly o
        JOIN routes r ON o.route_id = r.route_id
    """)
    stop_counts = query_to_polars("""
        SELECT route_id, COUNT(DISTINCT stop_id) AS stop_count
        FROM route_stops
        GROUP BY route_id
    """)
    return otp, stop_counts


def detrend(otp: pl.DataFrame) -> pl.DataFrame:
    """Subtract the system-wide monthly mean OTP from each route's OTP."""
    monthly_mean = (
        otp.group_by("month")
        .agg(pl.col("otp").mean().alias("system_otp"))
    )
    otp = otp.join(monthly_mean, on="month", how="left")
    otp = otp.with_columns(
        (pl.col("otp") - pl.col("system_otp")).alias("otp_detrended")
    )
    return otp


def build_correlation_matrix(otp: pl.DataFrame) -> tuple[np.ndarray, list[str], list[str], int]:
    """Build a pairwise Pearson correlation matrix on detrended OTP."""
    # Filter to routes with sufficient months
    route_months = otp.group_by("route_id").len().filter(pl.col("len") >= MIN_MONTHS)
    valid_routes = sorted(route_months["route_id"].to_list())

    otp_filtered = otp.filter(pl.col("route_id").is_in(valid_routes))

    # Pivot on detrended values
    pivot = otp_filtered.pivot(on="month", index="route_id", values="otp_detrended")
    route_ids = pivot["route_id"].to_list()

    month_cols = [c for c in pivot.columns if c != "route_id"]
    matrix = pivot.select(month_cols).to_numpy()

    n = len(route_ids)
    corr = np.full((n, n), np.nan)
    n_excluded = 0

    for i in range(n):
        for j in range(i, n):
            mask = ~np.isnan(matrix[i]) & ~np.isnan(matrix[j])
            if mask.sum() >= MIN_OVERLAP:
                r = np.corrcoef(matrix[i][mask], matrix[j][mask])[0, 1]
                corr[i, j] = r
                corr[j, i] = r
            else:
                # Leave as NaN instead of imputing 0.0
                n_excluded += 1
        corr[i, i] = 1.0

    # For clustering, replace remaining NaN pairs with median correlation
    median_corr = np.nanmedian(corr[np.triu_indices(n, k=1)])
    nan_mask = np.isnan(corr)
    corr[nan_mask] = median_corr

    return corr, route_ids, month_cols, n_excluded


def silhouette_score(dist_matrix: np.ndarray, labels: np.ndarray) -> float:
    """Compute mean silhouette score from a precomputed distance matrix."""
    n = len(labels)
    scores = np.zeros(n)
    for i in range(n):
        own_cluster = labels[i]
        own_mask = labels == own_cluster
        if own_mask.sum() <= 1:
            scores[i] = 0.0
            continue
        a_i = dist_matrix[i, own_mask].sum() / (own_mask.sum() - 1)
        b_i = np.inf
        for c in np.unique(labels):
            if c == own_cluster:
                continue
            other_mask = labels == c
            b_c = dist_matrix[i, other_mask].mean()
            b_i = min(b_i, b_c)
        scores[i] = (b_i - a_i) / max(a_i, b_i) if max(a_i, b_i) > 0 else 0.0
    return scores.mean()


def cluster_routes(corr: np.ndarray) -> tuple[np.ndarray, np.ndarray, int, dict]:
    """Cluster with average linkage; pick k by silhouette score."""
    dist = np.clip(1 - corr, 0, 2)
    np.fill_diagonal(dist, 0)
    condensed = squareform(dist)
    linkage_matrix = linkage(condensed, method="average")

    # Find best k via silhouette score
    sil_scores = {}
    for k in range(3, 11):
        labels_k = fcluster(linkage_matrix, t=k, criterion="maxclust")
        sil_scores[k] = silhouette_score(dist, labels_k)

    best_k = max(sil_scores, key=sil_scores.get)
    labels = fcluster(linkage_matrix, t=best_k, criterion="maxclust")

    return labels, linkage_matrix, best_k, sil_scores


def make_dendrogram(linkage_matrix: np.ndarray, route_ids: list[str],
                    route_meta: pl.DataFrame) -> None:
    """Generate a dendrogram of route clusters."""
    plt = setup_plotting()

    labels = []
    for rid in route_ids:
        row = route_meta.filter(pl.col("route_id") == rid)
        if len(row) > 0:
            labels.append(f"{rid}")
        else:
            labels.append(rid)

    fig, ax = plt.subplots(figsize=(16, 8))
    dendrogram(linkage_matrix, labels=labels, leaf_rotation=90, leaf_font_size=7, ax=ax)
    ax.set_title("Route OTP Clustering (Average Linkage on Detrended Correlation Distance)")
    ax.set_ylabel("Distance (1 - r)")
    save_chart(fig, OUT / "dendrogram.png")


def make_heatmap(corr: np.ndarray, route_ids: list[str], labels: np.ndarray) -> None:
    """Generate a correlation heatmap ordered by cluster assignment."""
    plt = setup_plotting()

    order = np.argsort(labels)
    corr_sorted = corr[np.ix_(order, order)]
    sorted_ids = [route_ids[i] for i in order]

    fig, ax = plt.subplots(figsize=(14, 12))
    im = ax.imshow(corr_sorted, cmap="RdBu_r", vmin=-1, vmax=1, aspect="auto")
    ax.set_title("Pairwise Detrended OTP Correlation (Ordered by Cluster)")

    tick_positions = list(range(0, len(sorted_ids), 5))
    ax.set_xticks(tick_positions)
    ax.set_xticklabels([sorted_ids[i] for i in tick_positions], rotation=90, fontsize=6)
    ax.set_yticks(tick_positions)
    ax.set_yticklabels([sorted_ids[i] for i in tick_positions], fontsize=6)

    fig.colorbar(im, ax=ax, label="Pearson r (detrended)", shrink=0.8)
    save_chart(fig, OUT / "correlation_heatmap.png")


@run_analysis(13, "Cross-Route Correlation Clustering")
def main() -> None:
    """Entry point: load, detrend, correlate, cluster, and visualize."""
    with phase("Loading data"):
        otp, stop_counts = load_data()

    with phase("Detrending (subtracting system-wide monthly mean)"):
        otp = detrend(otp)

    with phase("Building detrended correlation matrix"):
        corr, route_ids, month_cols, n_excluded = build_correlation_matrix(otp)
        print(f"  {len(route_ids)} routes with {MIN_MONTHS}+ months of data")
        print(f"  Correlation matrix: {corr.shape}")
        print(f"  {n_excluded} route pairs excluded for <{MIN_OVERLAP} overlapping months (imputed with median)")

    with phase("Clustering routes (silhouette-optimized k)"):
        labels, linkage_matrix, best_k, sil_scores = cluster_routes(corr)
        print(f"  Silhouette scores by k:")
        for k, s in sorted(sil_scores.items()):
            marker = " <-- best" if k == best_k else ""
            print(f"    k={k}: {s:.4f}{marker}")
        print(f"  Selected k={best_k}")

        # Build metadata for output
        route_meta = query_to_polars("""
            SELECT r.route_id, r.route_name, r.mode
            FROM routes r
        """)
        avg_otp = otp.group_by("route_id").agg(pl.col("otp").mean().alias("avg_otp"))

        membership = pl.DataFrame({
            "route_id": route_ids,
            "cluster": labels.tolist(),
        })
        membership = membership.join(route_meta, on="route_id", how="left")
        membership = membership.join(avg_otp, on="route_id", how="left")
        membership = membership.join(stop_counts, on="route_id", how="left")
        membership = membership.sort(["cluster", "route_id"])

        print("\nCluster summary:")
        for c in sorted(membership["cluster"].unique().to_list()):
            cluster = membership.filter(pl.col("cluster") == c)
            n = len(cluster)
            avg = cluster["avg_otp"].mean()
            modes = cluster["mode"].value_counts().sort("count", descending=True)
            top_mode = modes["mode"][0] if len(modes) > 0 else "?"
            avg_stops = cluster["stop_count"].mean()
            print(f"  Cluster {c}: {n} routes, avg OTP={avg:.1%}, "
                  f"primary mode={top_mode}, avg stops={avg_stops:.0f}")

        save_csv(membership, OUT / "cluster_membership.csv")

    with phase("Generating charts"):
        make_dendrogram(linkage_matrix, route_ids, route_meta)
        make_heatmap(corr, route_ids, labels)


if __name__ == "__main__":
    main()

Sources

NameTypeWhy It MattersOwnerFreshnessCaveat
otp_monthly table Primary analytical table used in this page's computations. Produced by Data Ingestion. Updated when the producing pipeline step is rerun. Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5)
  • file data/routes_by_month.csv — Monthly route OTP source table in wide format.
  • file data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv — Current route metadata and mode classifications.
  • file data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv — Current stop-to-route coverage and trip counts.
  • file data/PRT_Stop_Reference_Lookup_Table.csv — Historical stop reference file with geography attributes.
  • file data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv — Average ridership by route and month.
route_stops table Primary analytical table used in this page's computations. Produced by Data Ingestion. Updated when the producing pipeline step is rerun. Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5)
  • file data/routes_by_month.csv — Monthly route OTP source table in wide format.
  • file data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv — Current route metadata and mode classifications.
  • file data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv — Current stop-to-route coverage and trip counts.
  • file data/PRT_Stop_Reference_Lookup_Table.csv — Historical stop reference file with geography attributes.
  • file data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv — Average ridership by route and month.
routes table Primary analytical table used in this page's computations. Produced by Data Ingestion. Updated when the producing pipeline step is rerun. Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5)
  • file data/routes_by_month.csv — Monthly route OTP source table in wide format.
  • file data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv — Current route metadata and mode classifications.
  • file data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv — Current stop-to-route coverage and trip counts.
  • file data/PRT_Stop_Reference_Lookup_Table.csv — Historical stop reference file with geography attributes.
  • file data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv — Average ridership by route and month.
numpy dependency Runtime dependency required for this page's pipeline or analysis code. Open-source Python ecosystem maintainers. Version pinned by project environment until dependency updates are applied. Library updates may change behavior or defaults.
polars dependency Runtime dependency required for this page's pipeline or analysis code. Open-source Python ecosystem maintainers. Version pinned by project environment until dependency updates are applied. Library updates may change behavior or defaults.
scipy dependency Runtime dependency required for this page's pipeline or analysis code. Open-source Python ecosystem maintainers. Version pinned by project environment until dependency updates are applied. Library updates may change behavior or defaults.