DOT Airline On-Time Performance Dataset 2018–Present

Name: DOT Airline On-Time Performance Dataset 2018–Present
Creator: ClarityStorm
License: https://www.claritystorm.com/license

BTS Reporting Carrier On-Time Performance (USDOT Form 41)

Under 14 CFR Part 234, US carriers with 1% or more of domestic scheduled service must report monthly on-time performance to the Bureau of Transportation Statistics. This dataset captures every reported domestic flight from 2018 through 2024 — scheduled and actual departure/arrival times, delay cause breakdowns, cancellation reasons, taxi times, and aircraft routing — unified across 84 monthly files into a single analysis-ready table.

Sample: Public DomainPaid: Commercial LicenseCSV + ParquetAnnual Updates2018–2024Year-Partitioned Parquet

Why not pull it directly from BTS?

You can — it's public domain. Here's what we saved you:

✓ 84 monthly ZIP files unified into one table — BTS publishes separate monthly downloads; we download, extract, and concatenate all of them with consistent column names
✓ Typed numeric columns — raw BTS CSVs contain mixed types, empty strings, and trailing commas; every delay, time, and distance column is cast to the correct numeric type
✓ Cancellation reason decoded — raw files have single-letter codes (A/B/C/D); we add a human-readable cancellation_reason field (carrier / weather / national_air_system / security)
✓ Computed fields: is_delayed and route — standard ML target and route key derived from existing columns
✓ Deduplication — occasional BTS re-publications of corrected monthly files are identified and de-duplicated on the natural flight key
✓ Year-partitioned Parquet — in addition to a single flat file, we provide Parquet partitioned by year for efficient range-filtered queries on large datasets

⏱ Skip the 84-file download, unzip, column reconciliation, and type-casting. Ready for pd.read_parquet() in minutes.

35M+

Total Flights

Monthly Files Merged

20+

Active Carriers

Years of Data

Use Cases

Flight Delay Prediction (ML)

Train models to predict departure and arrival delays using historical patterns by carrier, route, time of day, and season. 35M+ labeled examples across 7 years of post-COVID and pre-COVID operations.

Airline Benchmarking

Compare on-time performance across carriers on specific routes, airports, or time periods. Compute carrier-level delay rates, cancellation rates, and delay cause breakdowns.

Airport Operations Research

Analyze NAS and weather delay propagation through hub airports. Identify which airports generate the most 'late aircraft' delay minutes that cascade downstream.

Travel Product Optimization

Power connection time recommendations, flight-risk scoring, and rebooking triggers. Identify routes with chronic delay variance vs. reliable on-time records.

COVID-19 Aviation Impact

Study the collapse and recovery of US aviation from 2020 through 2024 — carrier by carrier, route by route. 2020 data captures the sharpest demand shock in commercial aviation history.

Insurance & Risk Pricing

Build flight delay insurance pricing models using historical delay distributions by carrier, route, and season. Compute expected delay costs at the route and carrier level.

Schema

Single table (dot_airline_ontime), delivered as dot_airline_ontime.csv, dot_airline_ontime.parquet, and year-partitioned Parquet under by_year/. One row per individual domestic flight segment.

Field	Type	Description
flight_date	date	Date of the flight (YYYY-MM-DD)
year	int	Year of flight
quarter	int	Quarter (1–4)
month	int	Month (1–12)
day_of_month	int	Day of month (1–31)
day_of_week	int	Day of week (1=Monday … 7=Sunday)
carrier	string	IATA reporting carrier code (e.g. AA, DL, WN)
tail_number	string	Aircraft tail number (FAA registration)
flight_number	string	Carrier-assigned flight number
origin	string	Origin airport IATA code (e.g. JFK, LAX)
origin_city	string	Origin city and state (e.g. New York, NY)
origin_state	string	Origin US state abbreviation
origin_airport_id	int	BTS origin airport ID
dest	string	Destination airport IATA code
dest_city	string	Destination city and state
dest_state	string	Destination US state abbreviation
dest_airport_id	int	BTS destination airport ID
route	string	Route key (e.g. JFK-LAX) — computed
crs_dep_time	int	Scheduled departure time (HHMM)
dep_time	int	Actual departure time (HHMM)
dep_delay	float	Departure delay in minutes (negative = early)
dep_delay_minutes	float	Departure delay minutes (0 if early)
dep_del15	int	1 if departure delay >= 15 min
taxi_out	float	Taxi-out time in minutes (gate to wheels off)
wheels_off	int	Wheels-off time (HHMM)
wheels_on	int	Wheels-on time (HHMM)
taxi_in	float	Taxi-in time in minutes (wheels on to gate)
crs_arr_time	int	Scheduled arrival time (HHMM)
arr_time	int	Actual arrival time (HHMM)
arr_delay	float	Arrival delay in minutes (negative = early)
arr_delay_minutes	float	Arrival delay minutes (0 if early)
arr_del15	int	1 if arrival delay >= 15 min
is_delayed	int	1 if arr_delay_minutes >= 15 and not cancelled — computed
cancelled	int	1 if the flight was cancelled
cancellation_code	string	Raw cancellation code (A/B/C/D)
cancellation_reason	string	Cancellation reason: carrier / weather / national_air_system / security — computed
diverted	int	1 if the flight was diverted
crs_elapsed_time	float	Scheduled elapsed time in minutes
actual_elapsed_time	float	Actual elapsed time in minutes
air_time	float	Airborne time in minutes (wheels off to wheels on)
distance	float	Great-circle distance between airports in miles
carrier_delay	float	Delay minutes attributable to the carrier
weather_delay	float	Delay minutes attributable to weather
nas_delay	float	Delay minutes attributable to the National Air System
security_delay	float	Delay minutes attributable to security
late_aircraft_delay	float	Delay minutes from a late incoming aircraft

Quick Start

import pandas as pd

df = pd.read_parquet("dot_airline_ontime.parquet")

# On-time rate by carrier
carrier_perf = (
    df[df["cancelled"] == 0]
    .groupby("carrier")["is_delayed"]
    .agg(flights="count", delayed="sum")
    .assign(delay_rate=lambda x: (x["delayed"] / x["flights"] * 100).round(1))
    .sort_values("delay_rate")
)
print(carrier_perf)

# Worst routes by average arrival delay
route_delays = (
    df[(df["cancelled"] == 0) & df["arr_delay"].notna()]
    .groupby("route")["arr_delay"]
    .agg(flights="count", avg_delay="mean")
    .query("flights >= 500")
    .sort_values("avg_delay", ascending=False)
    .head(20)
)
print(route_delays)

# COVID impact: monthly flight volume 2018-2024
monthly = (
    df.groupby(["year", "month"])
    .size()
    .reset_index(name="flights")
    .sort_values(["year", "month"])
)

# Delay cause attribution (minutes) by carrier
delay_causes = (
    df[df["arr_del15"] == 1]
    .groupby("carrier")[
        ["carrier_delay", "weather_delay", "nas_delay",
         "security_delay", "late_aircraft_delay"]
    ]
    .mean()
    .round(1)
)

# Best connecting airports by NAS delay rate
hub_nas = (
    df[df["cancelled"] == 0]
    .groupby("origin")["nas_delay"]
    .agg(flights="count", avg_nas="mean")
    .query("flights >= 10000")
    .sort_values("avg_nas")
    .head(15)
)

# Year-partitioned read (much faster for single year)
df_2023 = pd.read_parquet("by_year/year=2023/data.parquet")

Pairs Well With

NTSB Aviation Accidents 1982–Present

Join on-time performance data against NTSB accident records by carrier, airport, and date to study whether operational pressure (chronic delays, high cancellation rates) correlates with incident frequency.

External: FAA ATIS / Weather Data

Merge BTS weather delay columns with actual NOAA/FAA weather observations at origin airports to build causal delay models and validate carrier-reported weather attribution.

Pricing

Sample

Free

1,000 rows (CSV) + schema docs

Public Domain

Download Sample

Complete

$99

Full dataset — 35M+ flights (2018–2024), CSV + Parquet + year-partitioned Parquet

Commercial License

Buy Complete Dataset

Annual

$199/yr

Full dataset + annual updates as BTS releases new monthly data

Commercial License

Subscribe Annual

Data Provenance

Source: U.S. Department of Transportation — Bureau of Transportation Statistics (BTS), Reporting Carrier On-Time Performance (Form 41 Traffic)

Portal: BTS TranStats On-Time Performance Download

Coverage: 2018–2024 (84 monthly files). BTS data goes back to 1987; this dataset scopes to 2018-present for a manageable size (~35M flights) while covering the critical COVID disruption period and full post-pandemic recovery.

Who reports: US air carriers with at least 1% of total domestic scheduled service revenues are required by 14 CFR Part 234 to report monthly on-time performance data to BTS. This covers all major US carriers (American, Delta, United, Southwest, Alaska, JetBlue, Spirit, Frontier, etc.).

Computed fields: is_delayed (1 if arr_delay_minutes ≥ 15 and not cancelled), route (origin-dest IATA pair), and cancellation_reason (decoded from single-letter BTS code) are computed by ClarityStorm.

Update frequency: BTS releases each month's data approximately 2–3 months after the flight month closes. Annual subscribers receive pipeline re-runs when new monthly data is available.

License: BTS on-time performance data is a US federal government work in the public domain. Paid tiers are licensed under the ClarityStorm Commercial Data License covering our pipeline and enrichment work (monthly file unification, type-casting, field normalization, computed fields, Parquet conversion).

Need custom date ranges, full 1987–present history, or bulk licensing?

Contact Sales