← All Datasets

DOT Airline On-Time Performance Dataset 2018–Present

BTS Reporting Carrier On-Time Performance (USDOT Form 41)

Under 14 CFR Part 234, US carriers with 1% or more of domestic scheduled service must report monthly on-time performance to the Bureau of Transportation Statistics. This dataset captures every reported domestic flight from 2018 through 2024 — scheduled and actual departure/arrival times, delay cause breakdowns, cancellation reasons, taxi times, and aircraft routing — unified across 84 monthly files into a single analysis-ready table.

Sample: Public DomainPaid: Commercial LicenseCSV + ParquetAnnual Updates2018–2024Year-Partitioned Parquet

Why not pull it directly from BTS?

You can — it's public domain. Here's what we saved you:

  • 84 monthly ZIP files unified into one table — BTS publishes separate monthly downloads; we download, extract, and concatenate all of them with consistent column names
  • Typed numeric columns — raw BTS CSVs contain mixed types, empty strings, and trailing commas; every delay, time, and distance column is cast to the correct numeric type
  • Cancellation reason decoded — raw files have single-letter codes (A/B/C/D); we add a human-readable cancellation_reason field (carrier / weather / national_air_system / security)
  • Computed fields: is_delayed and route — standard ML target and route key derived from existing columns
  • Deduplication — occasional BTS re-publications of corrected monthly files are identified and de-duplicated on the natural flight key
  • Year-partitioned Parquet — in addition to a single flat file, we provide Parquet partitioned by year for efficient range-filtered queries on large datasets

⏱ Skip the 84-file download, unzip, column reconciliation, and type-casting. Ready for pd.read_parquet() in minutes.

35M+

Total Flights

84

Monthly Files Merged

20+

Active Carriers

7

Years of Data

Use Cases

Flight Delay Prediction (ML)

Train models to predict departure and arrival delays using historical patterns by carrier, route, time of day, and season. 35M+ labeled examples across 7 years of post-COVID and pre-COVID operations.

Airline Benchmarking

Compare on-time performance across carriers on specific routes, airports, or time periods. Compute carrier-level delay rates, cancellation rates, and delay cause breakdowns.

Airport Operations Research

Analyze NAS and weather delay propagation through hub airports. Identify which airports generate the most 'late aircraft' delay minutes that cascade downstream.

Travel Product Optimization

Power connection time recommendations, flight-risk scoring, and rebooking triggers. Identify routes with chronic delay variance vs. reliable on-time records.

COVID-19 Aviation Impact

Study the collapse and recovery of US aviation from 2020 through 2024 — carrier by carrier, route by route. 2020 data captures the sharpest demand shock in commercial aviation history.

Insurance & Risk Pricing

Build flight delay insurance pricing models using historical delay distributions by carrier, route, and season. Compute expected delay costs at the route and carrier level.

Schema

Single table (dot_airline_ontime), delivered as dot_airline_ontime.csv, dot_airline_ontime.parquet, and year-partitioned Parquet under by_year/. One row per individual domestic flight segment.

FieldTypeDescription
flight_datedateDate of the flight (YYYY-MM-DD)
yearintYear of flight
quarterintQuarter (1–4)
monthintMonth (1–12)
day_of_monthintDay of month (1–31)
day_of_weekintDay of week (1=Monday … 7=Sunday)
carrierstringIATA reporting carrier code (e.g. AA, DL, WN)
tail_numberstringAircraft tail number (FAA registration)
flight_numberstringCarrier-assigned flight number
originstringOrigin airport IATA code (e.g. JFK, LAX)
origin_citystringOrigin city and state (e.g. New York, NY)
origin_statestringOrigin US state abbreviation
origin_airport_idintBTS origin airport ID
deststringDestination airport IATA code
dest_citystringDestination city and state
dest_statestringDestination US state abbreviation
dest_airport_idintBTS destination airport ID
routestringRoute key (e.g. JFK-LAX) — computed
crs_dep_timeintScheduled departure time (HHMM)
dep_timeintActual departure time (HHMM)
dep_delayfloatDeparture delay in minutes (negative = early)
dep_delay_minutesfloatDeparture delay minutes (0 if early)
dep_del15int1 if departure delay >= 15 min
taxi_outfloatTaxi-out time in minutes (gate to wheels off)
wheels_offintWheels-off time (HHMM)
wheels_onintWheels-on time (HHMM)
taxi_infloatTaxi-in time in minutes (wheels on to gate)
crs_arr_timeintScheduled arrival time (HHMM)
arr_timeintActual arrival time (HHMM)
arr_delayfloatArrival delay in minutes (negative = early)
arr_delay_minutesfloatArrival delay minutes (0 if early)
arr_del15int1 if arrival delay >= 15 min
is_delayedint1 if arr_delay_minutes >= 15 and not cancelled — computed
cancelledint1 if the flight was cancelled
cancellation_codestringRaw cancellation code (A/B/C/D)
cancellation_reasonstringCancellation reason: carrier / weather / national_air_system / security — computed
divertedint1 if the flight was diverted
crs_elapsed_timefloatScheduled elapsed time in minutes
actual_elapsed_timefloatActual elapsed time in minutes
air_timefloatAirborne time in minutes (wheels off to wheels on)
distancefloatGreat-circle distance between airports in miles
carrier_delayfloatDelay minutes attributable to the carrier
weather_delayfloatDelay minutes attributable to weather
nas_delayfloatDelay minutes attributable to the National Air System
security_delayfloatDelay minutes attributable to security
late_aircraft_delayfloatDelay minutes from a late incoming aircraft

Quick Start

import pandas as pd

df = pd.read_parquet("dot_airline_ontime.parquet")

# On-time rate by carrier
carrier_perf = (
    df[df["cancelled"] == 0]
    .groupby("carrier")["is_delayed"]
    .agg(flights="count", delayed="sum")
    .assign(delay_rate=lambda x: (x["delayed"] / x["flights"] * 100).round(1))
    .sort_values("delay_rate")
)
print(carrier_perf)

# Worst routes by average arrival delay
route_delays = (
    df[(df["cancelled"] == 0) & df["arr_delay"].notna()]
    .groupby("route")["arr_delay"]
    .agg(flights="count", avg_delay="mean")
    .query("flights >= 500")
    .sort_values("avg_delay", ascending=False)
    .head(20)
)
print(route_delays)

# COVID impact: monthly flight volume 2018-2024
monthly = (
    df.groupby(["year", "month"])
    .size()
    .reset_index(name="flights")
    .sort_values(["year", "month"])
)

# Delay cause attribution (minutes) by carrier
delay_causes = (
    df[df["arr_del15"] == 1]
    .groupby("carrier")[
        ["carrier_delay", "weather_delay", "nas_delay",
         "security_delay", "late_aircraft_delay"]
    ]
    .mean()
    .round(1)
)

# Best connecting airports by NAS delay rate
hub_nas = (
    df[df["cancelled"] == 0]
    .groupby("origin")["nas_delay"]
    .agg(flights="count", avg_nas="mean")
    .query("flights >= 10000")
    .sort_values("avg_nas")
    .head(15)
)

# Year-partitioned read (much faster for single year)
df_2023 = pd.read_parquet("by_year/year=2023/data.parquet")

Pairs Well With

Join on-time performance data against NTSB accident records by carrier, airport, and date to study whether operational pressure (chronic delays, high cancellation rates) correlates with incident frequency.

External: FAA ATIS / Weather Data

Merge BTS weather delay columns with actual NOAA/FAA weather observations at origin airports to build causal delay models and validate carrier-reported weather attribution.

Pricing

Sample

Free

1,000 rows (CSV) + schema docs

Public Domain

Download Sample
Complete

$99

Full dataset — 35M+ flights (2018–2024), CSV + Parquet + year-partitioned Parquet

Commercial License

Buy Complete
Annual

$199/yr

Full dataset + annual updates as BTS releases new monthly data

Commercial License

Subscribe

Data Provenance

Source: U.S. Department of Transportation — Bureau of Transportation Statistics (BTS), Reporting Carrier On-Time Performance (Form 41 Traffic)

Portal: BTS TranStats On-Time Performance Download

Coverage: 2018–2024 (84 monthly files). BTS data goes back to 1987; this dataset scopes to 2018-present for a manageable size (~35M flights) while covering the critical COVID disruption period and full post-pandemic recovery.

Who reports: US air carriers with at least 1% of total domestic scheduled service revenues are required by 14 CFR Part 234 to report monthly on-time performance data to BTS. This covers all major US carriers (American, Delta, United, Southwest, Alaska, JetBlue, Spirit, Frontier, etc.).

Computed fields: is_delayed (1 if arr_delay_minutes ≥ 15 and not cancelled), route (origin-dest IATA pair), and cancellation_reason (decoded from single-letter BTS code) are computed by ClarityStorm.

Update frequency: BTS releases each month's data approximately 2–3 months after the flight month closes. Annual subscribers receive pipeline re-runs when new monthly data is available.

License: BTS on-time performance data is a US federal government work in the public domain. Paid tiers are licensed under the ClarityStorm Commercial Data License covering our pipeline and enrichment work (monthly file unification, type-casting, field normalization, computed fields, Parquet conversion).

Need custom date ranges, full 1987–present history, or bulk licensing?

Contact Sales