Pipeline

Data Ingestion

Coverage: 2017-01 to 2025-11 (from otp_monthly, ridership_monthly).

Built 2026-03-03 02:23 UTC · Commit defd5c8

Data Provenance

flowchart LR
  01_data_ingestion(["Data Ingestion"])
  f1_01_data_ingestion[/"data/routes_by_month.csv"/] --> 01_data_ingestion
  f2_01_data_ingestion[/"data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv"/] --> 01_data_ingestion
  f3_01_data_ingestion[/"data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv"/] --> 01_data_ingestion
  f4_01_data_ingestion[/"data/PRT_Stop_Reference_Lookup_Table.csv"/] --> 01_data_ingestion
  f5_01_data_ingestion[/"data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv"/] --> 01_data_ingestion
  01_data_ingestion --> tp_routes[("routes")]
  01_data_ingestion --> tp_stops[("stops")]
  01_data_ingestion --> tp_route_stops[("route_stops")]
  01_data_ingestion --> tp_stop_reference[("stop_reference")]
  01_data_ingestion --> tp_otp_monthly[("otp_monthly")]
  01_data_ingestion --> tp_ridership_monthly[("ridership_monthly")]
  classDef page fill:#dbeafe,stroke:#1d4ed8,color:#1e3a8a,stroke-width:2px;
  classDef table fill:#ecfeff,stroke:#0e7490,color:#164e63;
  classDef dep fill:#fff7ed,stroke:#c2410c,color:#7c2d12,stroke-dasharray: 4 2;
  classDef file fill:#eef2ff,stroke:#6366f1,color:#3730a3;
  classDef api fill:#f0fdf4,stroke:#16a34a,color:#14532d;
  classDef pipeline fill:#f5f3ff,stroke:#7c3aed,color:#4c1d95;
  class 01_data_ingestion page;
  class tp_otp_monthly,tp_ridership_monthly,tp_route_stops,tp_routes,tp_stop_reference,tp_stops table;
  class f1_01_data_ingestion,f2_01_data_ingestion,f3_01_data_ingestion,f4_01_data_ingestion,f5_01_data_ingestion file;

Findings

Findings: Data Ingestion

Summary

The ingestion step rebuilds data/prt.db from the source CSV inputs, then checks the expected core tables (row counts and sanity checks).

Notes

  • This step is deterministic for a fixed set of input files.
  • Analyses assume this database exists and is readable.
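
The determinism note depends on the input files being unchanged between runs. One way to check that condition is to fingerprint the inputs before each rebuild. The sketch below is illustrative only: fingerprint_inputs is a hypothetical helper, not part of the pipeline, and it assumes the inputs are small enough to read into memory.

```python
import hashlib
from pathlib import Path


def fingerprint_inputs(paths):
    """Return one SHA-256 hex digest covering the names and bytes of all inputs.

    Hypothetical helper for illustration; not part of the documented pipeline.
    """
    digest = hashlib.sha256()
    # Sort so the result does not depend on the order the paths were listed in.
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

If two runs report the same fingerprint, they saw the same set of input files, so any difference in the resulting database would point at the ingestion code rather than the data.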

Methods

Methods: Data Ingestion

Question

How do we reproducibly build a normalized SQLite database used by all analyses?

Approach

  1. Read canonical CSV sources from data/.
  2. Normalize and reshape route, stop, OTP, and ridership records.
  3. Rebuild data/prt.db with expected tables and constraints.
  4. Emit basic verification output (row counts and sanity checks).
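
The steps above can be sketched for a single table as follows. This is a minimal illustration, not the pipeline's actual code: the function name rebuild_database, the column names route_id and route_name, and the constraints shown are all assumptions, since the real schema is not described here.

```python
import csv
import sqlite3
from pathlib import Path


def rebuild_database(db_path, routes_csv):
    """Recreate the database from scratch and load a routes table from a CSV.

    Column names (route_id, route_name) are assumed for illustration.
    """
    db_path = Path(db_path)
    # Start from an empty file so leftovers from an earlier run cannot leak in.
    db_path.unlink(missing_ok=True)

    with open(routes_csv, newline="", encoding="utf-8") as handle:
        rows = [(r["route_id"], r["route_name"]) for r in csv.DictReader(handle)]

    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE routes ("
                " route_id TEXT PRIMARY KEY,"
                " route_name TEXT NOT NULL)"
            )
            conn.executemany("INSERT INTO routes VALUES (?, ?)", rows)
        count = conn.execute("SELECT COUNT(*) FROM routes").fetchone()[0]
    finally:
        conn.close()

    # Basic verification output: one row count per table.
    print(f"routes: {count} rows")
    return count
```

Deleting the file before loading, rather than updating rows in place, is what makes a rerun with the same inputs produce the same tables.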

Data

  • data/routes_by_month.csv
  • data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv
  • data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv
  • data/PRT_Stop_Reference_Lookup_Table.csv
  • data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv

Output

  • data/prt.db -- normalized SQLite database for downstream analyses.
  • Console verification logs (table counts and sample diagnostics).
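
A downstream analysis can confirm that the database is usable before querying it. The sketch below checks that the six tables listed on this page exist and prints their row counts. The function name verify_database is hypothetical, and the pipeline's own verification output may look different.

```python
import sqlite3

# Table names as listed on this page.
EXPECTED_TABLES = (
    "routes",
    "stops",
    "route_stops",
    "stop_reference",
    "otp_monthly",
    "ridership_monthly",
)


def verify_database(db_path):
    """Raise if an expected table is missing; otherwise return row counts."""
    # Note: sqlite3.connect creates an empty file if db_path does not exist,
    # in which case every expected table is reported as missing.
    conn = sqlite3.connect(db_path)
    try:
        present = {
            name
            for (name,) in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        }
        missing = [t for t in EXPECTED_TABLES if t not in present]
        if missing:
            raise RuntimeError(f"missing tables: {', '.join(missing)}")
        counts = {}
        for table in EXPECTED_TABLES:
            # Names come from the fixed tuple above, never from user input.
            query = f'SELECT COUNT(*) FROM "{table}"'
            counts[table] = conn.execute(query).fetchone()[0]
    finally:
        conn.close()

    for table, count in counts.items():
        print(f"{table}: {count} rows")
    return counts
```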

Tables Produced

| Table | Description |
| --- | --- |
| routes | Route dimension table. |
| stops | Physical stop dimension table. |
| route_stops | Route-stop bridge with service frequency metrics. |
| stop_reference | Historical stop metadata and geography. |
| otp_monthly | Monthly OTP values by route. |
| ridership_monthly | Monthly ridership by route and day type. |
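
The coverage range reported for this page is drawn from otp_monthly and ridership_monthly. One way such a range could be computed is sketched below. It assumes each table has a text column named month holding values like 2017-01, and it takes the union of the two tables' ranges. Both the column name and the union rule are assumptions, not documented behavior.

```python
import sqlite3


def coverage_range(conn, tables=("otp_monthly", "ridership_monthly"),
                   month_column="month"):
    """Return (earliest, latest) month across the tables, or None if all are empty.

    Assumes months are stored as zero-padded YYYY-MM text, so that string
    ordering matches chronological ordering. The column name is hypothetical.
    """
    months = []
    for table in tables:
        query = (
            f'SELECT MIN("{month_column}"), MAX("{month_column}") '
            f'FROM "{table}"'
        )
        low, high = conn.execute(query).fetchone()
        if low is not None:  # MIN and MAX return NULL for an empty table
            months.extend([low, high])
    if not months:
        return None
    return min(months), max(months)
```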

Sources

| Name | Type | Why It Matters | Owner | Freshness | Caveat |
| --- | --- | --- | --- | --- | --- |
| data/routes_by_month.csv | file | Monthly route OTP source table in wide format. | Local project data owner not specified. | Snapshot file; refresh by rerunning its pipeline step. | May lag upstream source updates. |
| data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv | file | Current route metadata and mode classifications. | Local project data owner not specified. | Snapshot file; refresh by rerunning its pipeline step. | May lag upstream source updates. |
| data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv | file | Current stop-to-route coverage and trip counts. | Local project data owner not specified. | Snapshot file; refresh by rerunning its pipeline step. | May lag upstream source updates. |
| data/PRT_Stop_Reference_Lookup_Table.csv | file | Historical stop reference file with geography attributes. | Local project data owner not specified. | Snapshot file; refresh by rerunning its pipeline step. | May lag upstream source updates. |
| data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv | file | Average ridership by route and month. | Local project data owner not specified. | Snapshot file; refresh by rerunning its pipeline step. | May lag upstream source updates. |
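
The OTP source is described as wide format, while otp_monthly holds monthly values by route, so ingestion has to reshape it. The sketch below shows one common wide-to-long reshape. It assumes one row per route, an identifier column named route_id, and one column per month. The actual layout of data/routes_by_month.csv is not documented here, so these are illustrative assumptions.

```python
import csv
import io


def wide_to_long(wide_csv_text, id_column="route_id"):
    """Turn one-column-per-month rows into (route, month, value) tuples.

    The id_column name and the month-per-column layout are assumptions.
    """
    reader = csv.DictReader(io.StringIO(wide_csv_text))
    long_rows = []
    for row in reader:
        route = row[id_column]
        for column, value in row.items():
            # Skip the identifier and any overflow fields beyond the header.
            if column is None or column == id_column:
                continue
            # Skip months with no reported value instead of storing blanks.
            if value is None or not value.strip():
                continue
            long_rows.append((route, column, float(value)))
    return long_rows
```

For example, a row with route_id 1 and columns 2017-01 and 2017-02 becomes two long rows, one per month, each of which maps directly onto a row in a monthly table.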