Pipeline
Data Ingestion
Coverage: 2017-01 to 2025-11 (from otp_monthly, ridership_monthly).
Built 2026-03-03 02:23 UTC · Commit defd5c8
Data Provenance
flowchart LR
01_data_ingestion(["Data Ingestion"])
f1_01_data_ingestion[/"data/routes_by_month.csv"/] --> 01_data_ingestion
f2_01_data_ingestion[/"data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv"/] --> 01_data_ingestion
f3_01_data_ingestion[/"data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv"/] --> 01_data_ingestion
f4_01_data_ingestion[/"data/PRT_Stop_Reference_Lookup_Table.csv"/] --> 01_data_ingestion
f5_01_data_ingestion[/"data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv"/] --> 01_data_ingestion
01_data_ingestion --> tp_routes[("routes")]
01_data_ingestion --> tp_stops[("stops")]
01_data_ingestion --> tp_route_stops[("route_stops")]
01_data_ingestion --> tp_stop_reference[("stop_reference")]
01_data_ingestion --> tp_otp_monthly[("otp_monthly")]
01_data_ingestion --> tp_ridership_monthly[("ridership_monthly")]
classDef page fill:#dbeafe,stroke:#1d4ed8,color:#1e3a8a,stroke-width:2px;
classDef table fill:#ecfeff,stroke:#0e7490,color:#164e63;
classDef dep fill:#fff7ed,stroke:#c2410c,color:#7c2d12,stroke-dasharray: 4 2;
classDef file fill:#eef2ff,stroke:#6366f1,color:#3730a3;
classDef api fill:#f0fdf4,stroke:#16a34a,color:#14532d;
classDef pipeline fill:#f5f3ff,stroke:#7c3aed,color:#4c1d95;
class 01_data_ingestion page;
class tp_otp_monthly,tp_ridership_monthly,tp_route_stops,tp_routes,tp_stop_reference,tp_stops table;
class f1_01_data_ingestion,f2_01_data_ingestion,f3_01_data_ingestion,f4_01_data_ingestion,f5_01_data_ingestion file;
Findings
Findings: Data Ingestion
Summary
The ingestion step rebuilds data/prt.db from source CSV inputs and validates expected core tables.
Notes
- This step is deterministic for a fixed set of input files.
- Analyses assume this database exists and is readable.
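One way to check the "fixed set of input files" condition before comparing two builds is to fingerprint the inputs. The sketch below is an illustration only and is not part of the documented pipeline: the helper name fingerprint_inputs and the choice of SHA-256 are assumptions, and the demo uses temporary files rather than the real CSVs in data/.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint_inputs(paths):
    """Return one SHA-256 hex digest over file names and contents.

    Hypothetical helper: paths are sorted first so the result does not
    depend on the order in which the inputs are listed.
    """
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")  # separator between file name and content
        digest.update(path.read_bytes())
    return digest.hexdigest()


# Demo with throwaway files standing in for the CSV inputs.
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.csv"
    b = Path(tmp) / "b.csv"
    a.write_text("x\n1\n")
    b.write_text("y\n2\n")

    first = fingerprint_inputs([a, b])
    second = fingerprint_inputs([b, a])  # same files, different order

    b.write_text("y\n3\n")  # simulate a refreshed upstream snapshot
    changed = fingerprint_inputs([a, b])
```

If two runs report the same fingerprint, any difference in data/prt.db would point at the ingestion code rather than at the inputs.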
Methods
Methods: Data Ingestion
Question
How do we reproducibly build a normalized SQLite database used by all analyses?
Approach
- Read canonical CSV sources from data/.
- Normalize and reshape route, stop, OTP, and ridership records.
- Rebuild data/prt.db with expected tables and constraints.
- Emit basic verification output (row counts and sanity checks).
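The reshape-and-rebuild steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the column names (route_id and month headers such as 2017-01), the sample values, and the otp_monthly schema are all assumptions, since the real layout of data/routes_by_month.csv is not shown here. The sketch uses an in-memory database instead of data/prt.db.

```python
import csv
import io
import sqlite3

# Hypothetical wide-format sample: one row per route, one column per month.
SAMPLE_WIDE_CSV = """route_id,2017-01,2017-02
61C,0.71,0.68
P1,0.85,
"""


def wide_to_long(csv_text):
    """Yield (route_id, month, otp) tuples, skipping empty cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        route_id = row["route_id"]
        for column, value in row.items():
            if column == "route_id" or value is None or value.strip() == "":
                continue
            yield (route_id, column, float(value))


def rebuild_otp_table(conn, rows):
    """Drop and recreate otp_monthly so every run starts from a clean state."""
    conn.execute("DROP TABLE IF EXISTS otp_monthly")
    conn.execute(
        "CREATE TABLE otp_monthly ("
        " route_id TEXT NOT NULL,"
        " month TEXT NOT NULL,"
        " otp REAL NOT NULL,"
        " PRIMARY KEY (route_id, month))"
    )
    conn.executemany("INSERT INTO otp_monthly VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for data/prt.db
rebuild_otp_table(conn, wide_to_long(SAMPLE_WIDE_CSV))
otp_rows = conn.execute(
    "SELECT route_id, month, otp FROM otp_monthly ORDER BY route_id, month"
).fetchall()
```

Dropping and recreating the table, rather than appending, is what keeps a rerun on the same inputs from accumulating duplicate rows; the primary key on (route_id, month) is one example of the kind of constraint the rebuild could enforce.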
Data
- data/routes_by_month.csv
- data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv
- data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv
- data/PRT_Stop_Reference_Lookup_Table.csv
- data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv
Output
- data/prt.db: normalized SQLite database for downstream analyses.
- Console verification logs (table counts and sample diagnostics).
Tables Produced
| Table | Description |
|---|---|
| routes | Route dimension table. |
| stops | Physical stop dimension table. |
| route_stops | Route-stop bridge with service frequency metrics. |
| stop_reference | Historical stop metadata and geography. |
| otp_monthly | Monthly OTP values by route. |
| ridership_monthly | Monthly ridership by route and day type. |
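A downstream analysis could confirm that these six tables are present before querying them. The sketch below is one possible check, under stated assumptions: the function name check_tables is hypothetical, the demo database is built in memory with made-up columns, and the real step's verification output may look different.

```python
import sqlite3

# Table names taken from the list above.
EXPECTED_TABLES = {
    "routes",
    "stops",
    "route_stops",
    "stop_reference",
    "otp_monthly",
    "ridership_monthly",
}


def check_tables(conn):
    """Return (missing table names, row counts for the tables that exist)."""
    found = {
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    missing = sorted(EXPECTED_TABLES - found)
    counts = {}
    for table in sorted(EXPECTED_TABLES & found):
        # Table names come from the fixed set above, not from user input.
        counts[table] = conn.execute(
            f'SELECT COUNT(*) FROM "{table}"'
        ).fetchone()[0]
    return missing, counts


# Demo on a deliberately incomplete in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE routes (route_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE stops (stop_id TEXT PRIMARY KEY)")
conn.execute("INSERT INTO routes VALUES ('61C')")

missing, counts = check_tables(conn)
```

Against a real build, the same function would be called on sqlite3.connect("data/prt.db"), and a non-empty missing list would indicate that the ingestion step needs to be rerun.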
Sources
| Name | Type | Why It Matters | Owner | Freshness | Caveat |
|---|---|---|---|---|---|
| data/routes_by_month.csv | file | Monthly route OTP source table in wide format. | Local project data owner not specified. | Snapshot file; refresh by rerunning its pipeline step. | May lag upstream source updates. |
| data/PRT_Current_Routes_Full_System_de0e48fcbed24ebc8b0d933e47b56682.csv | file | Current route metadata and mode classifications. | Local project data owner not specified. | Snapshot file; refresh by rerunning its pipeline step. | May lag upstream source updates. |
| data/Transit_stops_(current)_by_route_e040ee029227468ebf9d217402a82fa9.csv | file | Current stop-to-route coverage and trip counts. | Local project data owner not specified. | Snapshot file; refresh by rerunning its pipeline step. | May lag upstream source updates. |
| data/PRT_Stop_Reference_Lookup_Table.csv | file | Historical stop reference file with geography attributes. | Local project data owner not specified. | Snapshot file; refresh by rerunning its pipeline step. | May lag upstream source updates. |
| data/average-ridership/12bb84ed-397e-435c-8d1b-8ce543108698.csv | file | Average ridership by route and month. | Local project data owner not specified. | Snapshot file; refresh by rerunning its pipeline step. | May lag upstream source updates. |