Building a Flight Delay Predictor with ML and DOT Airline Data

Every year, US airlines collectively delay over 20% of their flights by 15 minutes or more. For passengers, that's a missed connection. For carriers, it's a cascading operational cost. For ML engineers, it's 35 million labeled training examples — one for every commercial flight operated by major US carriers since 2018, courtesy of the Bureau of Transportation Statistics.

In this tutorial we'll build a binary flight delay classifier (delayed ≥15 minutes or not) using the ClarityStorm DOT Airline On-Time Performance dataset. We'll do real feature engineering from the raw flight fields, train a gradient boosting model, and evaluate it on a held-out test set — all in under 100 lines of Python.

What's in the Dataset

The BTS Reporting Carrier On-Time Performance data is one of the most ML-ready government datasets available. Every record is a single commercial flight — scheduled and actual departure/arrival times, delay decomposed into five causes, cancellation code, aircraft tail number, route, and distance. The ClarityStorm release unifies 2018–2024 monthly BTS files into a single cleaned Parquet, with a computed route key (e.g. JFK-LAX) and derived time-of-day fields added.

35M+ flight records across 2018–2024
Delay breakdown by cause: carrier, weather, NAS, security, late_aircraft
dep_del15 binary label — 1 if departure delay ≥ 15 min (ready-made training target)
Cancellation codes: A (carrier), B (weather), C (NAS/ATC), D (security)
Route, origin, destination, carrier, and aircraft tail number for feature engineering
Scheduled vs. actual departure/arrival times for temporal feature extraction

Loading and Exploring the Data

python

import pandas as pd

df = pd.read_parquet("dot_airline_ontime.parquet")

print(f"Flights: {len(df):,}")
print(f"Delayed (≥15 min): {df['dep_del15'].mean():.1%}")
print(f"Cancelled: {df['cancelled'].mean():.1%}")

# Delay rate by carrier
carrier_delay = (
    df.dropna(subset=["dep_del15"])
    .groupby("carrier")["dep_del15"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "delay_rate", "count": "flights"})
    .sort_values("delay_rate", ascending=False)
)
print(carrier_delay.head(10))

Feature Engineering

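Raw flight data contains strong predictive signal, but most of it needs transformation. Scheduled departure time arrives as an HHMM integer (e.g. 1430 for 2:30 PM) and needs to become a cyclical hour feature, so that 23:00 and 00:00 end up adjacent rather than 23 units apart. Carrier is a categorical that benefits from target encoding rather than one-hot encoding at 35M rows. Route-level historical delay rates are among the strongest predictors available, provided they are computed from the training years only so that test-period outcomes do not leak into the features.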
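To make the case for cyclical encoding concrete, here is a minimal sketch using only the standard library. It maps an HHMM departure time onto a 24-hour circle and compares distances between encoded times. The helper names `encode_hour` and `circle_distance` are illustrative and are not part of the dataset or of any library.

```python
import math


def encode_hour(hhmm: int) -> tuple[float, float]:
    """Map an HHMM integer (e.g. 1430) to (sin, cos) on a 24-hour circle."""
    hour = (hhmm // 100) % 24  # % 24 folds a stray 2400 onto midnight
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


def circle_distance(a: int, b: int) -> float:
    """Euclidean distance between two encoded departure times."""
    return math.dist(encode_hour(a), encode_hour(b))


# Once encoded, 23:00 vs 00:00 is exactly as close as 00:00 vs 01:00,
# while noon and midnight sit on opposite sides of the circle.
late_to_midnight = circle_distance(2300, 0)
midnight_to_one = circle_distance(0, 100)
noon_to_midnight = circle_distance(1200, 0)

print(f"23:00 -> 00:00: {late_to_midnight:.3f}")
print(f"00:00 -> 01:00: {midnight_to_one:.3f}")
print(f"12:00 -> 00:00: {noon_to_midnight:.3f}")
```

As a raw integer, 2300 and 0 would look maximally different to the model even though the flights leave an hour apart; the sin/cos pair removes that artificial jump at midnight.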

python

import numpy as np

# Drop cancelled flights (we're predicting delays, not cancellations)
# and any rows where the dep_del15 label is missing
flights = df[df["cancelled"] == 0].dropna(subset=["dep_del15"]).copy()

# Extract hour from scheduled departure (HHMM integer)
flights["dep_hour"] = flights["crs_dep_time"] // 100

# Cyclical encoding for hour and day-of-week
flights["dep_hour_sin"] = np.sin(2 * np.pi * flights["dep_hour"] / 24)
flights["dep_hour_cos"] = np.cos(2 * np.pi * flights["dep_hour"] / 24)
flights["dow_sin"] = np.sin(2 * np.pi * flights["day_of_week"] / 7)
flights["dow_cos"] = np.cos(2 * np.pi * flights["day_of_week"] / 7)

# Month cyclical (seasonality)
flights["month_sin"] = np.sin(2 * np.pi * flights["month"] / 12)
flights["month_cos"] = np.cos(2 * np.pi * flights["month"] / 12)

# Route historical delay rate, computed on the training years only (2018–2022)
# so that 2023–2024 outcomes never leak into the features
history = flights[flights["year"] <= 2022]
route_delay = history.groupby("route")["dep_del15"].mean().rename("route_delay_rate")
flights = flights.join(route_delay, on="route")

# Carrier historical delay rate, from the same training-years subset
carrier_delay_rate = history.groupby("carrier")["dep_del15"].mean().rename("carrier_delay_rate")
flights = flights.join(carrier_delay_rate, on="carrier")

print(flights[["dep_hour_sin","dep_hour_cos","route_delay_rate","carrier_delay_rate","dep_del15"]].head())
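The historical-rate features are a form of target encoding, and they only support an honest evaluation if the rates are learned from the training years alone. The sketch below shows one way to do that on a toy table, with additive smoothing so that routes with few flights are pulled toward the overall delay rate. The function names, the `prior_weight` parameter, and the toy values are assumptions made for illustration; they are not part of the dataset or of pandas.

```python
import pandas as pd


def fit_target_encoding(train, key, target, prior_weight=50.0):
    """Smoothed per-category mean of `target`, shrunk toward the global mean."""
    global_mean = train[target].mean()
    stats = train.groupby(key)[target].agg(["sum", "count"])
    smoothed = (stats["sum"] + prior_weight * global_mean) / (stats["count"] + prior_weight)
    return smoothed.rename(f"{key}_delay_rate")


def apply_target_encoding(frame, encoding, key, fallback):
    """Look up each row's encoded value; unseen categories get `fallback`."""
    return frame[key].map(encoding).fillna(fallback)


# Toy data standing in for the real flights table (values are made up).
toy = pd.DataFrame({
    "year": [2021, 2021, 2021, 2022, 2022, 2023, 2023],
    "route": ["JFK-LAX", "JFK-LAX", "ORD-DEN", "ORD-DEN", "JFK-LAX", "JFK-LAX", "SEA-BOS"],
    "dep_del15": [1, 0, 0, 0, 1, 1, 0],
})
train_rows = toy[toy["year"] <= 2022]
test_rows = toy[toy["year"] >= 2023]

# Fit on training years only, then apply the same table to the test years.
encoding = fit_target_encoding(train_rows, "route", "dep_del15", prior_weight=2.0)
global_rate = train_rows["dep_del15"].mean()
test_encoded = apply_target_encoding(test_rows, encoding, "route", fallback=global_rate)
print(test_encoded)
```

The 2023 label for JFK-LAX never touches the encoding, and SEA-BOS, which has no training history, falls back to the overall training delay rate instead of a misleading 0.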

Training the Model

We'll use a gradient boosting classifier — specifically LightGBM, which handles the 35M-row scale efficiently. We split on year (train on 2018–2022, test on 2023–2024) rather than a random split, to evaluate true out-of-sample performance on unseen time periods.

python

from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report, roc_auc_score

FEATURES = [
    "dep_hour_sin", "dep_hour_cos",
    "dow_sin", "dow_cos",
    "month_sin", "month_cos",
    "distance",
    "route_delay_rate",
    "carrier_delay_rate",
]
TARGET = "dep_del15"

# Time-based split — critical for flight data
train = flights[flights["year"] <= 2022]
test  = flights[flights["year"] >= 2023]

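# LightGBM handles missing values natively, so leave NaNs in place rather than
# filling with 0 (a 0 delay rate would read as "this route is never delayed")
X_train = train[FEATURES]
y_train = train[TARGET]
X_test  = test[FEATURES]
y_test  = test[TARGET]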

model = LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=63, n_jobs=-1)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")

With these features alone, with no external weather data and no real-time airport congestion, a tuned LightGBM model typically achieves an ROC-AUC of 0.73–0.76 on held-out 2023–2024 data. That's meaningfully better than the 0.50 that random scoring would achieve, and comparable to commercial flight delay APIs that use proprietary inputs.
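For readers less familiar with the metric, ROC-AUC can be read as the probability that a randomly chosen delayed flight receives a higher score than a randomly chosen on-time flight, which is why uninformative scores land at 0.50. The following is a minimal standard-library sketch of that pairwise definition. The function name `pairwise_auc` and the six example flights are made up for illustration; on real data you would keep using `roc_auc_score`.

```python
from itertools import product


def pairwise_auc(labels, scores):
    """ROC-AUC as the share of (delayed, on-time) pairs ranked correctly.

    Ties count as half a correct pair. This is O(n_pos * n_neg), so it is
    only meant for small illustrations, not for millions of flights.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for six flights.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.80, 0.30, 0.40, 0.50, 0.10, 0.70]

auc = pairwise_auc(labels, scores)
constant_auc = pairwise_auc(labels, [0.2] * len(labels))
print(f"informative scores: {auc:.3f}")
print(f"constant score:     {constant_auc:.3f}")
```

Because the metric depends only on ranking, it is unaffected by the roughly 80/20 class imbalance in the label, unlike accuracy, where always predicting "on time" already scores about 80%.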

Feature Importance

python

import matplotlib.pyplot as plt

importance = pd.Series(model.feature_importances_, index=FEATURES).sort_values()
importance.plot(kind="barh", figsize=(9, 5), color="#0ea5e9")
plt.title("LightGBM Feature Importance — Flight Delay Model")
plt.xlabel("Importance score")
plt.tight_layout()
plt.savefig("flight_delay_feature_importance.png", dpi=150)

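Route historical delay rate and carrier delay rate dominate feature importance, which makes intuitive sense: structural route delays (congested airspace, slot-controlled airports) are the strongest predictor of whether a given flight will be late. Time-of-day features add meaningful lift, particularly for late-afternoon and evening departures, where delays from earlier in the day cascade through late-arriving aircraft. Keep in mind that LightGBM's feature_importances_ counts splits by default; pass importance_type="gain" to the classifier to rank features by loss reduction instead.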

Going Further

Add NOAA weather at origin/destination airports on the departure date for a significant AUC lift
Use the delay_cause breakdown (carrier_delay, weather_delay, nas_delay) to train separate models per cause
Build a multi-class model predicting delay bucket: on-time / <30 min / 30–120 min / >2 hr
Join against NTSB Aviation data to flag high-incident aircraft tail numbers as a risk feature
Deploy as a REST API: take flight_date, carrier, origin, dest, crs_dep_time → return P(delay ≥ 15 min)
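For the REST API idea, the main pitfall is building features at request time differently from how they were built in training. Below is a minimal standard-library sketch of the request-to-features step. Everything named here is hypothetical: the `build_feature_row` function, the `lookups` dict (tables you would save at training time), and the numeric values, which are made up. It also assumes that `day_of_week` in the dataset runs 1 = Monday through 7 = Sunday; verify that against the data before relying on it.

```python
import math
from datetime import date

# Same order as the FEATURES list used for training.
FEATURE_ORDER = [
    "dep_hour_sin", "dep_hour_cos",
    "dow_sin", "dow_cos",
    "month_sin", "month_cos",
    "distance",
    "route_delay_rate",
    "carrier_delay_rate",
]


def cyclical(value, period):
    """Return (sin, cos) of `value` on a circle of length `period`."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def build_feature_row(flight_date, carrier, origin, dest, crs_dep_time, lookups):
    """Turn one request into a feature dict in training order.

    `lookups` is a hypothetical dict of tables saved at training time:
    route_rates, carrier_rates, route_distances, and a global default_rate.
    """
    route = f"{origin}-{dest}"
    hour_sin, hour_cos = cyclical((crs_dep_time // 100) % 24, 24)
    # isoweekday() gives 1 = Monday ... 7 = Sunday; assumed to match day_of_week.
    dow_sin, dow_cos = cyclical(flight_date.isoweekday(), 7)
    month_sin, month_cos = cyclical(flight_date.month, 12)
    default = lookups["default_rate"]
    values = [
        hour_sin, hour_cos, dow_sin, dow_cos, month_sin, month_cos,
        lookups["route_distances"].get(route, float("nan")),  # unknown route: leave missing
        lookups["route_rates"].get(route, default),
        lookups["carrier_rates"].get(carrier, default),
    ]
    return dict(zip(FEATURE_ORDER, values))


# Made-up lookup values for illustration only.
lookups = {
    "route_rates": {"JFK-LAX": 0.24},
    "carrier_rates": {"AA": 0.21},
    "route_distances": {"JFK-LAX": 2475.0},
    "default_rate": 0.20,
}

row = build_feature_row(date(2024, 7, 1), "AA", "JFK", "LAX", 1800, lookups)
unseen = build_feature_row(date(2024, 7, 1), "ZZ", "SEA", "BOS", 630, lookups)
print(row)
```

A web handler would then wrap `row` in a one-row DataFrame and pass it to the trained model's `predict_proba`; keeping the column order in one shared constant helps the serving code stay in sync with training.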

Get the Full Dataset

The free sample contains 1,000 rows. The complete DOT Airline On-Time Performance dataset covers 35M+ flights from 2018 to 2024, available as CSV and Parquet with a commercial license for $99.
