DOT Airline On-Time Performance Dataset 2018–Present
BTS Reporting Carrier On-Time Performance (USDOT Form 41)
Under 14 CFR Part 234, US carriers with 1% or more of domestic scheduled service must report monthly on-time performance to the Bureau of Transportation Statistics. This dataset captures every reported domestic flight from 2018 through 2024 — scheduled and actual departure/arrival times, delay cause breakdowns, cancellation reasons, taxi times, and aircraft routing — unified across 84 monthly files into a single analysis-ready table.
Why not pull it directly from BTS?
You can — it's public domain. Here's what we saved you:
- ✓ 84 monthly ZIP files unified into one table — BTS publishes separate monthly downloads; we download, extract, and concatenate all of them with consistent column names
- ✓ Typed numeric columns — raw BTS CSVs contain mixed types, empty strings, and trailing commas; every delay, time, and distance column is cast to the correct numeric type
- ✓ Cancellation reason decoded — raw files have single-letter codes (A/B/C/D); we add a human-readable
cancellation_reasonfield (carrier / weather / national_air_system / security) - ✓ Computed fields:
is_delayedandroute— standard ML target and route key derived from existing columns - ✓ Deduplication — occasional BTS re-publications of corrected monthly files are identified and de-duplicated on the natural flight key
- ✓ Year-partitioned Parquet — in addition to a single flat file, we provide Parquet partitioned by year for efficient range-filtered queries on large datasets
⏱ Skip the 84-file download, unzip, column reconciliation, and type-casting. Ready for pd.read_parquet() in minutes.
35M+
Total Flights
84
Monthly Files Merged
20+
Active Carriers
7
Years of Data
Use Cases
Train models to predict departure and arrival delays using historical patterns by carrier, route, time of day, and season. 35M+ labeled examples across 7 years of post-COVID and pre-COVID operations.
Compare on-time performance across carriers on specific routes, airports, or time periods. Compute carrier-level delay rates, cancellation rates, and delay cause breakdowns.
Analyze NAS and weather delay propagation through hub airports. Identify which airports generate the most 'late aircraft' delay minutes that cascade downstream.
Power connection time recommendations, flight-risk scoring, and rebooking triggers. Identify routes with chronic delay variance vs. reliable on-time records.
Study the collapse and recovery of US aviation from 2020 through 2024 — carrier by carrier, route by route. 2020 data captures the sharpest demand shock in commercial aviation history.
Build flight delay insurance pricing models using historical delay distributions by carrier, route, and season. Compute expected delay costs at the route and carrier level.
Schema
Single table (dot_airline_ontime), delivered as dot_airline_ontime.csv, dot_airline_ontime.parquet, and year-partitioned Parquet under by_year/. One row per individual domestic flight segment.
| Field | Type | Description |
|---|---|---|
| flight_date | date | Date of the flight (YYYY-MM-DD) |
| year | int | Year of flight |
| quarter | int | Quarter (1–4) |
| month | int | Month (1–12) |
| day_of_month | int | Day of month (1–31) |
| day_of_week | int | Day of week (1=Monday … 7=Sunday) |
| carrier | string | IATA reporting carrier code (e.g. AA, DL, WN) |
| tail_number | string | Aircraft tail number (FAA registration) |
| flight_number | string | Carrier-assigned flight number |
| origin | string | Origin airport IATA code (e.g. JFK, LAX) |
| origin_city | string | Origin city and state (e.g. New York, NY) |
| origin_state | string | Origin US state abbreviation |
| origin_airport_id | int | BTS origin airport ID |
| dest | string | Destination airport IATA code |
| dest_city | string | Destination city and state |
| dest_state | string | Destination US state abbreviation |
| dest_airport_id | int | BTS destination airport ID |
| route | string | Route key (e.g. JFK-LAX) — computed |
| crs_dep_time | int | Scheduled departure time (HHMM) |
| dep_time | int | Actual departure time (HHMM) |
| dep_delay | float | Departure delay in minutes (negative = early) |
| dep_delay_minutes | float | Departure delay minutes (0 if early) |
| dep_del15 | int | 1 if departure delay >= 15 min |
| taxi_out | float | Taxi-out time in minutes (gate to wheels off) |
| wheels_off | int | Wheels-off time (HHMM) |
| wheels_on | int | Wheels-on time (HHMM) |
| taxi_in | float | Taxi-in time in minutes (wheels on to gate) |
| crs_arr_time | int | Scheduled arrival time (HHMM) |
| arr_time | int | Actual arrival time (HHMM) |
| arr_delay | float | Arrival delay in minutes (negative = early) |
| arr_delay_minutes | float | Arrival delay minutes (0 if early) |
| arr_del15 | int | 1 if arrival delay >= 15 min |
| is_delayed | int | 1 if arr_delay_minutes >= 15 and not cancelled — computed |
| cancelled | int | 1 if the flight was cancelled |
| cancellation_code | string | Raw cancellation code (A/B/C/D) |
| cancellation_reason | string | Cancellation reason: carrier / weather / national_air_system / security — computed |
| diverted | int | 1 if the flight was diverted |
| crs_elapsed_time | float | Scheduled elapsed time in minutes |
| actual_elapsed_time | float | Actual elapsed time in minutes |
| air_time | float | Airborne time in minutes (wheels off to wheels on) |
| distance | float | Great-circle distance between airports in miles |
| carrier_delay | float | Delay minutes attributable to the carrier |
| weather_delay | float | Delay minutes attributable to weather |
| nas_delay | float | Delay minutes attributable to the National Air System |
| security_delay | float | Delay minutes attributable to security |
| late_aircraft_delay | float | Delay minutes from a late incoming aircraft |
Quick Start
import pandas as pd
df = pd.read_parquet("dot_airline_ontime.parquet")
# On-time rate by carrier
carrier_perf = (
df[df["cancelled"] == 0]
.groupby("carrier")["is_delayed"]
.agg(flights="count", delayed="sum")
.assign(delay_rate=lambda x: (x["delayed"] / x["flights"] * 100).round(1))
.sort_values("delay_rate")
)
print(carrier_perf)
# Worst routes by average arrival delay
route_delays = (
df[(df["cancelled"] == 0) & df["arr_delay"].notna()]
.groupby("route")["arr_delay"]
.agg(flights="count", avg_delay="mean")
.query("flights >= 500")
.sort_values("avg_delay", ascending=False)
.head(20)
)
print(route_delays)
# COVID impact: monthly flight volume 2018-2024
monthly = (
df.groupby(["year", "month"])
.size()
.reset_index(name="flights")
.sort_values(["year", "month"])
)
# Delay cause attribution (minutes) by carrier
delay_causes = (
df[df["arr_del15"] == 1]
.groupby("carrier")[
["carrier_delay", "weather_delay", "nas_delay",
"security_delay", "late_aircraft_delay"]
]
.mean()
.round(1)
)
# Best connecting airports by NAS delay rate
hub_nas = (
df[df["cancelled"] == 0]
.groupby("origin")["nas_delay"]
.agg(flights="count", avg_nas="mean")
.query("flights >= 10000")
.sort_values("avg_nas")
.head(15)
)
# Year-partitioned read (much faster for single year)
df_2023 = pd.read_parquet("by_year/year=2023/data.parquet")Pairs Well With
Join on-time performance data against NTSB accident records by carrier, airport, and date to study whether operational pressure (chronic delays, high cancellation rates) correlates with incident frequency.
Merge BTS weather delay columns with actual NOAA/FAA weather observations at origin airports to build causal delay models and validate carrier-reported weather attribution.
Pricing
$99
Full dataset — 35M+ flights (2018–2024), CSV + Parquet + year-partitioned Parquet
Commercial License
Buy CompleteData Provenance
Source: U.S. Department of Transportation — Bureau of Transportation Statistics (BTS), Reporting Carrier On-Time Performance (Form 41 Traffic)
Portal: BTS TranStats On-Time Performance Download
Coverage: 2018–2024 (84 monthly files). BTS data goes back to 1987; this dataset scopes to 2018-present for a manageable size (~35M flights) while covering the critical COVID disruption period and full post-pandemic recovery.
Who reports: US air carriers with at least 1% of total domestic scheduled service revenues are required by 14 CFR Part 234 to report monthly on-time performance data to BTS. This covers all major US carriers (American, Delta, United, Southwest, Alaska, JetBlue, Spirit, Frontier, etc.).
Computed fields: is_delayed (1 if arr_delay_minutes ≥ 15 and not cancelled), route (origin-dest IATA pair), and cancellation_reason (decoded from single-letter BTS code) are computed by ClarityStorm.
Update frequency: BTS releases each month's data approximately 2–3 months after the flight month closes. Annual subscribers receive pipeline re-runs when new monthly data is available.
License: BTS on-time performance data is a US federal government work in the public domain. Paid tiers are licensed under the ClarityStorm Commercial Data License covering our pipeline and enrichment work (monthly file unification, type-casting, field normalization, computed fields, Parquet conversion).
Need custom date ranges, full 1987–present history, or bulk licensing?
Contact Sales