Analysis

34 - Ridership Concentration (Pareto)

Equity and Strategic Planning

Coverage: 2019-01 to 2025-11 (from otp_monthly).

Built 2026-04-03 20:09 UTC · Commit 7c56b9a

Data Provenance

flowchart LR
  34_ridership_concentration(["34 - Ridership Concentration (Pareto)"])
  f1_34_ridership_concentration[/"data/bus-stop-usage/wprdc_stop_data.csv"/] --> 34_ridership_concentration
  t_otp_monthly[("otp_monthly")] --> 34_ridership_concentration
  01_data_ingestion[["Data Ingestion"]] --> t_otp_monthly
  u1_01_data_ingestion[/"data/routes_by_month.csv"/] --> 01_data_ingestion
  u2_01_data_ingestion[/"data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv"/] --> 01_data_ingestion
  u3_01_data_ingestion[/"data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv"/] --> 01_data_ingestion
  u4_01_data_ingestion[/"data/PRT_Stop_Reference_Lookup_Table.csv"/] --> 01_data_ingestion
  u5_01_data_ingestion[/"data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv"/] --> 01_data_ingestion
  t_routes[("routes")] --> 34_ridership_concentration
  01_data_ingestion[["Data Ingestion"]] --> t_routes
  d1_34_ridership_concentration(("numpy (lib)")) --> 34_ridership_concentration
  d2_34_ridership_concentration(("polars (lib)")) --> 34_ridership_concentration
  d3_34_ridership_concentration(("scipy (lib)")) --> 34_ridership_concentration
  classDef page fill:#dbeafe,stroke:#1d4ed8,color:#1e3a8a,stroke-width:2px;
  classDef table fill:#ecfeff,stroke:#0e7490,color:#164e63;
  classDef dep fill:#fff7ed,stroke:#c2410c,color:#7c2d12,stroke-dasharray: 4 2;
  classDef file fill:#eef2ff,stroke:#6366f1,color:#3730a3;
  classDef api fill:#f0fdf4,stroke:#16a34a,color:#14532d;
  classDef pipeline fill:#f5f3ff,stroke:#7c3aed,color:#4c1d95;
  class 34_ridership_concentration page;
  class t_otp_monthly,t_routes table;
  class d1_34_ridership_concentration,d2_34_ridership_concentration,d3_34_ridership_concentration dep;
  class f1_34_ridership_concentration,u1_01_data_ingestion,u2_01_data_ingestion,u3_01_data_ingestion,u4_01_data_ingestion,u5_01_data_ingestion file;
  class 01_data_ingestion pipeline;

Findings

Findings: Ridership Concentration (Pareto)

Summary

PRT ridership is extremely concentrated: about 2% of stops account for 50% of weekday stop usage (boardings plus alightings), and about 14% of stops account for 80%. The system-wide Gini coefficient is 0.82, indicating very high inequality in stop-level usage. However, among bus routes, per-route ridership concentration (Gini) shows essentially no correlation with that route's OTP (Pearson r = -0.016, p = 0.88; Spearman rho = 0.103, p = 0.34), so these data give no evidence that clustering riders at a few stops, rather than spreading them evenly, affects schedule reliability.

Key Numbers

  • 2.2% of stops serve 50% of ridership
  • 13.9% of stops serve 80% of ridership
  • 27.9% of stops serve 90% of ridership
  • System-wide Gini = 0.824
  • Per-route Gini range: 0.338 - 0.890 (median 0.649)
  • 95 routes with >= 3 stops analyzed
  • 90 routes matched to OTP data
  • Gini vs OTP (bus-only): Pearson r = -0.016 (p = 0.879), Spearman rho = 0.103 (p = 0.339)
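
The threshold figures above come from ranking stops by usage and walking down the cumulative share. A minimal sketch of that calculation with NumPy follows; the function name `stops_needed_pct` and the toy usage values are hypothetical and are not taken from the WPRDC file.

```python
import numpy as np


def stops_needed_pct(usage, target_pct):
    """Smallest share of stops (in %) whose combined usage reaches target_pct of the total."""
    vals = np.sort(np.asarray(usage, dtype=float))[::-1]  # busiest stops first
    cum_pct = np.cumsum(vals) * 100 / vals.sum()          # cumulative share of usage
    # Index of the first stop at which the cumulative share meets the target,
    # converted to a 1-based count of stops.
    k = int(np.searchsorted(cum_pct, target_pct, side="left")) + 1
    return k * 100 / len(vals)


# Toy data (hypothetical): one very busy stop, a few moderate ones, several quiet ones
toy_usage = [500, 200, 100, 50, 50, 25, 25, 20, 20, 10]
for target in (50, 80, 90):
    pct = stops_needed_pct(toy_usage, target)
    print(f"{target}% of usage served by top {pct:.1f}% of stops")
```

Applied to the real per-stop totals, the same ranking logic is what yields figures such as 2.2% of stops for 50% of ridership.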

Observations

  • The Pareto curve is steep: the top ~150 stops (out of 6,700+) account for half of all weekday boardings and alightings. This is more extreme than a classic 80/20 rule -- it's closer to a 2/50 pattern.
  • Most stops see very little usage: the median stop handles only ~7 boardings and alightings per day, while the top stops see 2,000-5,800 per day. The bottom ~72% of stops collectively account for only 10% of ridership (the complement of the 27.9% of stops that serve 90%).
  • Route-level concentration varies widely: some routes have Gini as low as 0.34 (relatively even usage across stops) while others reach 0.89 (nearly all ridership at a few stops). Flyer/express routes tend to have higher Gini since ridership clusters at downtown endpoints.
  • Concentration does not predict OTP. The scatter plot shows no trend at all -- the regression line is essentially flat. Routes with highly concentrated ridership perform no better or worse than those with evenly distributed usage.

Discussion

The extreme system-wide concentration (Gini = 0.82) reinforces the stop consolidation finding from Analysis 31: most stops contribute very little ridership, and removing the lowest-usage ones would affect few riders while potentially improving OTP by reducing stop count.

The null result for Gini vs OTP is notable. One might hypothesize that routes with concentrated ridership would have better OTP (less dwell time at most stops), but this doesn't hold. This suggests that dwell time at individual stops is not a dominant factor in OTP variance -- the time cost of stopping (deceleration, door opening, acceleration) matters more than the time cost of boarding passengers. This aligns with the Analysis 07 finding that raw stop count, not passenger volume, drives OTP.

The 2/50 concentration ratio has resource allocation implications: if PRT focused infrastructure investment (shelters, real-time signs, ADA upgrades) on roughly 150 stops (the top ~2%), it would reach the stops that account for half of all weekday ridership. The current shelter coverage of 7% (Analysis 32) suggests significant room to target the highest-impact locations.

Caveats

  • Stop-level usage data is from FY2019; current patterns may differ, especially post-pandemic.
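  • Per-route Gini is computed from stop-route combinations, not physical stops: a stop shared by several routes appears once per route, carrying only that route's share of its usage, which may distort the apparent concentration on routes that share stops. The system-wide Pareto curve and Gini aggregate to physical stops first, so the two sets of figures are not directly comparable.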
  • The OTP data covers a longer time range (2019-2025) than the usage snapshot (2019), so the correlation compares static usage structure against time-averaged OTP.
  • Very short routes (< 3 stops) are excluded from the Gini analysis, which drops a few incline and shuttle routes.

Output

Methods

Methods: Ridership Concentration (Pareto)

Question

How concentrated is ridership across stops? What fraction of stops serves 80% of riders, and does ridership concentration on a route correlate with that route's OTP?

Approach

  • Aggregate pre-pandemic weekday stop-level ridership (datekeys 201909, 202001) to physical-stop level and per-route level.
  • System-wide Pareto: sort all stops by usage, compute cumulative share, and find the fraction of stops that serve 50%, 80%, and 90% of total ridership.
  • Per-route Gini coefficient: for each route, compute the Gini coefficient of stop-level usage as a concentration metric (0 = perfectly even, 1 = all ridership at one stop).
  • Join per-route Gini with route-level average OTP from the database (routes with at least 12 months of OTP data) and test for correlation among bus routes (Pearson, Spearman).
  • Generate a system-wide Pareto curve and a scatter plot of Gini vs OTP by route.
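
The `gini` helper used in this analysis is imported from `prt_otp_analysis.common`, and its implementation is not shown on this page. One common formulation, offered here as an assumption rather than as the project's actual code, sorts the values in ascending order and computes G = 2 * sum(i * x_i) / (n * sum(x_i)) - (n + 1) / n. The name `gini_sketch` is hypothetical.

```python
import numpy as np


def gini_sketch(values):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    x = np.sort(np.asarray(values, dtype=float))  # ascending order
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0  # convention chosen for this sketch, not taken from the source
    ranks = np.arange(1, n + 1)
    return float(2.0 * np.sum(ranks * x) / (n * total) - (n + 1) / n)


print(gini_sketch([10, 10, 10, 10]))  # perfectly even usage -> 0.0
print(gini_sketch([0, 0, 0, 100]))    # all usage at one of four stops -> 0.75
```

Under this formulation a route with n stops tops out at (n - 1) / n, so the value approaches 1 only as the number of stops grows. That is worth keeping in mind when comparing Gini across routes with very different stop counts.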

Data

Name                 | Description                      | Source
wprdc_stop_data.csv  | Stop-level boardings/alightings  | Local CSV (data/bus-stop-usage/)
otp_monthly          | Monthly OTP per route            | prt.db table
routes               | Route name and mode              | prt.db table

Output

  • output/pareto_system.csv -- cumulative ridership share by stop rank
  • output/route_gini.csv -- per-route Gini coefficient and OTP
  • output/pareto_curve.png -- system-wide Pareto curve
  • output/gini_vs_otp.png -- scatter plot of route Gini vs average OTP

Source Code

"""Quantify ridership concentration across stops and test whether it correlates with OTP."""

import polars as pl

from prt_otp_analysis.common import (
    DATA_DIR,
    analysis_dir,
    correlate,
    gini,
    mode_scatter,
    phase,
    query_to_polars,
    run_analysis,
    save_chart,
    save_csv,
    setup_plotting,
)

OUT = analysis_dir(__file__)


def load_stop_usage() -> pl.DataFrame:
    """Load pre-pandemic weekday stop-route usage from the WPRDC CSV."""
    csv_path = DATA_DIR / "bus-stop-usage" / "wprdc_stop_data.csv"
    df = pl.read_csv(csv_path, null_values=["NA", ""])
    df = df.filter(
        (pl.col("time_period") == "Pre-pandemic")
        & (pl.col("serviceday") == "Weekday")
    )
    # Average across datekeys per stop-route, then compute total usage
    usage = (
        df.group_by(["stop_id", "route_name"])
        .agg(
            pl.col("avg_ons").mean().alias("avg_ons"),
            pl.col("avg_offs").mean().alias("avg_offs"),
            pl.col("stop_name").first().alias("stop_name"),
            pl.col("latitude").first().alias("lat"),
            pl.col("longitude").first().alias("lon"),
        )
        .with_columns(
            (pl.col("avg_ons") + pl.col("avg_offs")).alias("avg_daily_usage")
        )
    )
    return usage


def system_pareto(usage: pl.DataFrame) -> pl.DataFrame:
    """Compute system-wide Pareto curve at the physical-stop level."""
    # Aggregate to physical stop
    per_stop = (
        usage.group_by("stop_id")
        .agg(
            pl.col("avg_daily_usage").sum(),
            pl.col("stop_name").first(),
        )
        .sort("avg_daily_usage", descending=True)
    )

    total = per_stop["avg_daily_usage"].sum()
    cum = per_stop["avg_daily_usage"].cum_sum()
    n = len(per_stop)

    pareto = per_stop.with_columns(
        (pl.Series("rank", range(1, n + 1)) / n * 100).alias("pct_stops"),
        (cum / total * 100).alias("cum_pct_ridership"),
    )
    return pareto


def route_gini(usage: pl.DataFrame) -> pl.DataFrame:
    """Compute Gini coefficient per route from stop-level usage."""
    routes = usage["route_name"].unique().to_list()
    rows = []
    for rt in routes:
        sub = usage.filter(pl.col("route_name") == rt)
        vals = sub["avg_daily_usage"].drop_nulls().to_list()
        if len(vals) < 3:
            continue
        g = gini(vals)
        rows.append({
            "route_name": rt,
            "gini": g,
            "n_stops": len(vals),
            "total_usage": sum(vals),
            "max_stop_usage": max(vals),
        })
    return pl.DataFrame(rows)


def load_route_otp() -> pl.DataFrame:
    """Load average OTP per route from the database."""
    return query_to_polars("""
        SELECT o.route_id, r.route_name, r.mode,
               AVG(o.otp) AS avg_otp
        FROM otp_monthly o
        JOIN routes r ON o.route_id = r.route_id
        GROUP BY o.route_id
        HAVING COUNT(*) >= 12
    """)


def make_charts(pareto: pl.DataFrame, gini_otp: pl.DataFrame) -> None:
    """Generate Pareto curve and Gini vs OTP scatter plot."""
    plt = setup_plotting()

    # --- System-wide Pareto curve ---
    fig, ax = plt.subplots(figsize=(10, 7))
    pct_stops = pareto["pct_stops"].to_list()
    cum_rider = pareto["cum_pct_ridership"].to_list()

    ax.plot(pct_stops, cum_rider, color="#3b82f6", linewidth=2)
    ax.plot([0, 100], [0, 100], color="#94a3b8", linewidth=1, linestyle="--", label="Perfect equality")

    # Find key thresholds
    for target in [50, 80, 90]:
        for i, cr in enumerate(cum_rider):
            if cr >= target:
                ax.axhline(target, color="#d1d5db", linewidth=0.5)
                ax.axvline(pct_stops[i], color="#d1d5db", linewidth=0.5)
                ax.plot(pct_stops[i], target, "o", color="#ef4444", markersize=8)
                ax.annotate(f"{pct_stops[i]:.0f}% of stops -> {target}% of riders",
                            xy=(pct_stops[i], target),
                            xytext=(pct_stops[i] + 5, target - 5),
                            fontsize=9, color="#ef4444")
                break

    ax.set_xlabel("% of Stops (ranked by usage)")
    ax.set_ylabel("Cumulative % of Ridership")
    ax.set_title("System-Wide Ridership Pareto Curve")
    ax.set_xlim(0, 100)
    ax.set_ylim(0, 100)
    ax.legend(loc="lower right")
    save_chart(fig, OUT / "pareto_curve.png")

    # --- Gini vs OTP scatter ---
    fig, ax = plt.subplots(figsize=(10, 7))
    mode_scatter(ax, gini_otp, "gini", "avg_otp")
    ax.set_xlabel("Gini Coefficient (ridership concentration)")
    ax.set_ylabel("Average OTP")
    ax.set_title("Route-Level Ridership Concentration vs On-Time Performance")
    ax.legend(fontsize=9)
    ax.set_ylim(0, 1)
    save_chart(fig, OUT / "gini_vs_otp.png")


@run_analysis(34, "Ridership Concentration (Pareto)")
def main() -> None:
    """Entry point: load data, compute Pareto and Gini, correlate with OTP."""

    with phase("Loading stop-level usage (pre-pandemic weekday)"):
        usage = load_stop_usage()
        print(f"  {len(usage):,} stop-route combinations")

    with phase("Computing system-wide Pareto curve"):
        pareto = system_pareto(usage)
        cum = pareto["cum_pct_ridership"].to_list()
        pct = pareto["pct_stops"].to_list()
        for target in [50, 80, 90]:
            for i, cr in enumerate(cum):
                if cr >= target:
                    print(f"  {target}% of ridership served by top {pct[i]:.1f}% of stops")
                    break

        sys_gini = gini(
            usage.group_by("stop_id").agg(pl.col("avg_daily_usage").sum())["avg_daily_usage"].to_list()
        )
        print(f"  System-wide Gini: {sys_gini:.3f}")

    with phase("Computing per-route Gini coefficients"):
        route_g = route_gini(usage)
        print(f"  {len(route_g)} routes with >= 3 stops")
        print(f"  Gini range: {route_g['gini'].min():.3f} - {route_g['gini'].max():.3f}")
        print(f"  Median Gini: {route_g['gini'].median():.3f}")

    with phase("Loading route OTP"):
        route_otp = load_route_otp()

        # Join Gini with OTP (CSV route_name = DB route_id)
        gini_otp = route_g.join(route_otp, left_on="route_name", right_on="route_id", how="inner")
        print(f"  {len(gini_otp)} routes matched")

        bus = gini_otp.filter(pl.col("mode") == "BUS").drop_nulls(subset=["gini", "avg_otp"])
        if len(bus) >= 3:
            corr = correlate(bus, "gini", "avg_otp")
            print(f"  Bus-only Pearson r = {corr['pearson_r']:.3f} (p = {corr['pearson_p']:.3f})")
            print(f"  Bus-only Spearman rho = {corr['spearman_r']:.3f} (p = {corr['spearman_p']:.3f})")

    with phase("Saving CSVs"):
        save_csv(pareto, OUT / "pareto_system.csv")
        save_csv(gini_otp, OUT / "route_gini.csv")

    with phase("Generating charts"):
        make_charts(pareto, gini_otp)


if __name__ == "__main__":
    main()
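
The script relies on `correlate` from `prt_otp_analysis.common`, whose implementation is not shown here. Based only on the keys the script reads (`pearson_r`, `pearson_p`, `spearman_r`, `spearman_p`), a stand-in could look like the sketch below. It assumes SciPy's `pearsonr` and `spearmanr` and takes plain sequences instead of a DataFrame; the name `correlate_sketch` and the toy values are hypothetical.

```python
from scipy import stats


def correlate_sketch(x, y):
    """Return Pearson and Spearman statistics under the key names the script reads."""
    pearson_r, pearson_p = stats.pearsonr(x, y)
    spearman_r, spearman_p = stats.spearmanr(x, y)
    return {
        "pearson_r": float(pearson_r),
        "pearson_p": float(pearson_p),
        "spearman_r": float(spearman_r),
        "spearman_p": float(spearman_p),
    }


# Toy values (hypothetical): a monotonic but non-linear relationship
gini_vals = [0.35, 0.50, 0.60, 0.70, 0.85]
otp_vals = [0.60, 0.61, 0.63, 0.70, 0.90]
result = correlate_sketch(gini_vals, otp_vals)
print(f"Pearson r = {result['pearson_r']:.3f}, Spearman rho = {result['spearman_r']:.3f}")
```

Pearson measures linear association while Spearman works on ranks, so the two can differ on the same data. That is why the findings report both statistics.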

Sources

Name | Type | Why It Matters | Owner | Freshness | Caveat
data/bus-stop-usage/wprdc_stop_data.csv | file | Referenced via DATA_DIR path composition in analysis script. | Local project data owner not specified. | Snapshot file; refresh by rerunning its pipeline step. | May lag upstream source updates.
otp_monthly | table | Primary analytical table used in this page's computations. | Produced by Data Ingestion. | Updated when the producing pipeline step is rerun. | Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5)
  • file data/routes_by_month.csv -- Monthly route OTP source table in wide format.
  • file data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv -- Current route metadata and mode classifications.
  • file data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv -- Current stop-to-route coverage and trip counts.
  • file data/PRT_Stop_Reference_Lookup_Table.csv -- Historical stop reference file with geography attributes.
  • file data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv -- Average ridership by route and month.
routes | table | Primary analytical table used in this page's computations. | Produced by Data Ingestion. | Updated when the producing pipeline step is rerun. | Coverage depends on upstream source availability and ETL assumptions.
Upstream sources (5)
  • file data/routes_by_month.csv -- Monthly route OTP source table in wide format.
  • file data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv -- Current route metadata and mode classifications.
  • file data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv -- Current stop-to-route coverage and trip counts.
  • file data/PRT_Stop_Reference_Lookup_Table.csv -- Historical stop reference file with geography attributes.
  • file data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv -- Average ridership by route and month.
numpy | dependency | Runtime dependency required for this page's pipeline or analysis code. | Open-source Python ecosystem maintainers. | Version pinned by project environment until dependency updates are applied. | Library updates may change behavior or defaults.
polars | dependency | Runtime dependency required for this page's pipeline or analysis code. | Open-source Python ecosystem maintainers. | Version pinned by project environment until dependency updates are applied. | Library updates may change behavior or defaults.
scipy | dependency | Runtime dependency required for this page's pipeline or analysis code. | Open-source Python ecosystem maintainers. | Version pinned by project environment until dependency updates are applied. | Library updates may change behavior or defaults.