Every year, US airlines collectively delay over 20% of their flights by 15 minutes or more. For passengers, that's a missed connection. For carriers, it's a cascading operational cost. For ML engineers, it's 35 million labeled training examples — one for every commercial flight operated by major US carriers since 2018, courtesy of the Bureau of Transportation Statistics.
In this tutorial we'll build a binary flight delay classifier (delayed ≥15 minutes or not) using the ClarityStorm DOT Airline On-Time Performance dataset. We'll do real feature engineering from the raw flight fields, train a gradient boosting model, and evaluate it on a held-out test set — all in under 100 lines of Python.
What's in the Dataset
The BTS Reporting Carrier On-Time Performance data is one of the most ML-ready government datasets available. Every record is a single commercial flight — scheduled and actual departure/arrival times, delay decomposed into five causes, cancellation code, aircraft tail number, route, and distance. The ClarityStorm release unifies 2018–2024 monthly BTS files into a single cleaned Parquet, with a computed route key (e.g. JFK-LAX) and derived time-of-day fields added.
- 35M+ flight records across 2018–2024
- Delay breakdown by cause: carrier, weather, NAS, security, late_aircraft
- dep_del15 binary label — 1 if departure delay ≥ 15 min (ready-made training target)
- Cancellation codes: A (carrier), B (weather), C (NAS/ATC), D (security)
- Route, origin, destination, carrier, and aircraft tail number for feature engineering
- Scheduled vs. actual departure/arrival times for temporal feature extraction
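A quick sanity check after loading is to rebuild the label from the raw delay in minutes and confirm it matches. The sketch below runs on a small hand-made frame rather than the real file, and the column name `dep_delay` is an assumption (it is not listed above), so substitute whatever the delay-minutes field is called in your copy.

```python
import pandas as pd

# Toy rows standing in for real flights; `dep_delay` (minutes) is an assumed column name.
toy = pd.DataFrame({
    "dep_delay": [-3.0, 0.0, 14.0, 15.0, 47.0, None],  # None mimics a cancelled flight
    "dep_del15": [0.0, 0.0, 0.0, 1.0, 1.0, None],
})

# Rebuild the label only where a delay was actually recorded.
known = toy.dropna(subset=["dep_delay"])
rebuilt = (known["dep_delay"] >= 15).astype(float)

# Count rows where the shipped label disagrees with the rebuilt one.
mismatches = int((rebuilt != known["dep_del15"]).sum())
print(f"Label mismatches: {mismatches}")
```

A non-zero count on real data would suggest the label was derived differently (for example from arrival delay) and is worth investigating before training.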
Loading and Exploring the Data
import pandas as pd
df = pd.read_parquet("dot_airline_ontime.parquet")
print(f"Flights: {len(df):,}")
print(f"Delayed (≥15 min): {df['dep_del15'].mean():.1%}")
print(f"Cancelled: {df['cancelled'].mean():.1%}")
# Delay rate by carrier
carrier_delay = (
    df.dropna(subset=["dep_del15"])
    .groupby("carrier")["dep_del15"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "delay_rate", "count": "flights"})
    .sort_values("delay_rate", ascending=False)
)
print(carrier_delay.head(10))
Feature Engineering
Raw flight data contains excellent predictive signal, but most of it needs transformation. Scheduled departure time as a HHMM integer (e.g. 1430 for 2:30 PM) needs to become a cyclical hour feature. Carrier is a categorical that benefits from target-encoding rather than one-hot encoding at 35M rows. Route-level historical delay rates are among the strongest predictors available.
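Before applying the encoding to the full table, it may help to see why the cyclical form is worth the extra columns. As a raw integer, hour 23 and hour 0 look 23 units apart even though they are neighbors on the clock. The standalone sketch below (helper names are illustrative, not part of the dataset or any library) maps hours onto the unit circle and compares distances.

```python
import math

def cyclical(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def circle_distance(a, b, period=24):
    """Euclidean distance between two encoded values."""
    (s1, c1), (s2, c2) = cyclical(a, period), cyclical(b, period)
    return math.hypot(s1 - s2, c1 - c2)

crs_dep_time = 2345              # HHMM integer, i.e. 11:45 PM
dep_hour = crs_dep_time // 100   # integer division keeps the hour: 23

late_vs_midnight = circle_distance(dep_hour, 0)   # one hour apart on the clock
late_vs_noon = circle_distance(dep_hour, 12)      # eleven hours apart on the clock

print(f"23h vs 0h:  {late_vs_midnight:.3f}")
print(f"23h vs 12h: {late_vs_noon:.3f}")
```

After encoding, 23:00 sits much closer to midnight than to noon, which is the relationship a model should see. The same idea applies to day of week (period 7) and month (period 12).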
import numpy as np
# Drop cancelled flights — we're predicting delays, not cancellations
flights = df[df["cancelled"] == 0].copy()
# Extract hour from scheduled departure (HHMM integer)
flights["dep_hour"] = flights["crs_dep_time"] // 100
# Cyclical encoding for hour and day-of-week
flights["dep_hour_sin"] = np.sin(2 * np.pi * flights["dep_hour"] / 24)
flights["dep_hour_cos"] = np.cos(2 * np.pi * flights["dep_hour"] / 24)
flights["dow_sin"] = np.sin(2 * np.pi * flights["day_of_week"] / 7)
flights["dow_cos"] = np.cos(2 * np.pi * flights["day_of_week"] / 7)
# Month cyclical (seasonality)
flights["month_sin"] = np.sin(2 * np.pi * flights["month"] / 12)
flights["month_cos"] = np.cos(2 * np.pi * flights["month"] / 12)
# Route historical delay rate, computed on the training years only so that
# 2023–2024 outcomes never leak into a feature used to predict them
train_mask = flights["year"] <= 2022
route_delay = flights[train_mask].groupby("route")["dep_del15"].mean().rename("route_delay_rate")
flights = flights.join(route_delay, on="route")
# Carrier historical delay rate (training years only, same reason)
carrier_delay_rate = flights[train_mask].groupby("carrier")["dep_del15"].mean().rename("carrier_delay_rate")
flights = flights.join(carrier_delay_rate, on="carrier")
print(flights[["dep_hour_sin","dep_hour_cos","route_delay_rate","carrier_delay_rate","dep_del15"]].head())
Training the Model
We'll use a gradient boosting classifier — specifically LightGBM, which handles the 35M-row scale efficiently. We split on year (train on 2018–2022, test on 2023–2024) rather than a random split, to evaluate true out-of-sample performance on unseen time periods.
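Because the split is by year, any historical delay-rate feature has to be learned from the training years and only looked up for the test years. The sketch below shows one way to do that on toy data. The additive smoothing and the fallback to the global rate for unseen routes are additions of this sketch, not part of the tutorial's pipeline, and the helper names are hypothetical.

```python
import pandas as pd

def fit_rate_encoding(train, key, target, smoothing=20.0):
    """Smoothed per-category mean of `target`, learned from training rows only."""
    global_rate = train[target].mean()
    stats = train.groupby(key)[target].agg(["sum", "count"])
    # Shrink categories with few flights toward the global rate.
    rates = (stats["sum"] + smoothing * global_rate) / (stats["count"] + smoothing)
    return rates, global_rate

def apply_rate_encoding(frame, key, rates, global_rate):
    """Look up learned rates; categories never seen in training get the global rate."""
    return frame[key].map(rates).fillna(global_rate)

# Toy data: three routes in the training years, one unseen route in the test year.
toy = pd.DataFrame({
    "year":      [2021, 2021, 2022, 2022, 2022, 2023, 2023],
    "route":     ["JFK-LAX", "JFK-LAX", "ORD-DFW", "ORD-DFW", "SEA-SFO", "JFK-LAX", "BOS-MIA"],
    "dep_del15": [1, 1, 0, 0, 1, 0, 1],
})
train_toy = toy[toy["year"] <= 2022]
test_toy = toy[toy["year"] >= 2023]

rates, global_rate = fit_rate_encoding(train_toy, "route", "dep_del15", smoothing=2.0)
test_encoded = apply_rate_encoding(test_toy, "route", rates, global_rate)
print(test_encoded.round(3).tolist())
```

The test-year label for JFK-LAX plays no part in its encoded value, and BOS-MIA, which never appears in training, receives the overall training delay rate instead of a missing value.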
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report, roc_auc_score
FEATURES = [
"dep_hour_sin", "dep_hour_cos",
"dow_sin", "dow_cos",
"month_sin", "month_cos",
"distance",
"route_delay_rate",
"carrier_delay_rate",
]
TARGET = "dep_del15"
# Time-based split — critical for flight data
train = flights[flights["year"] <= 2022]
test = flights[flights["year"] >= 2023]
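# LightGBM handles missing values natively, so leave NaNs in place rather than
# filling with 0, which would read as "this route is never delayed"
X_train = train[FEATURES]
y_train = train[TARGET]
X_test = test[FEATURES]
y_test = test[TARGET]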
model = LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=63, n_jobs=-1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
With these features alone (no external weather data, no real-time airport congestion), a tuned LightGBM model typically achieves an ROC-AUC of 0.73–0.76 on held-out 2023–2024 data. That's meaningfully better than the 0.50 you would get from random guessing, and comparable to commercial flight delay APIs that use proprietary inputs.
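The 0.50 reference point follows from what ROC-AUC measures: the probability that a randomly chosen delayed flight receives a higher score than a randomly chosen on-time flight. The sketch below computes that quantity by brute force on toy labels. The `roc_auc` helper and all scores are made up for illustration; on real data `roc_auc_score` from scikit-learn, as used above, is the practical choice.

```python
def roc_auc(labels, scores):
    """Share of (delayed, on-time) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 1, 0, 1, 0, 1]                        # toy outcomes, 1 = delayed
constant = [0.2] * len(labels)                           # identical score for every flight
informative = [0.1, 0.3, 0.2, 0.6, 0.5, 0.4, 0.2, 0.9]   # hypothetical model scores

auc_constant = roc_auc(labels, constant)
auc_informative = roc_auc(labels, informative)
print(f"constant: {auc_constant:.2f}, informative: {auc_informative:.2f}")
```

A scorer that cannot tell flights apart lands at exactly 0.5, so the gap above 0.5 is the part of the ranking the features actually explain. The pairwise loop is quadratic and only suitable for small examples like this one.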
Feature Importance
import matplotlib.pyplot as plt
importance = pd.Series(model.feature_importances_, index=FEATURES).sort_values()
importance.plot(kind="barh", figsize=(9, 5), color="#0ea5e9")
plt.title("LightGBM Feature Importance — Flight Delay Model")
plt.xlabel("Importance score")
plt.tight_layout()
plt.savefig("flight_delay_feature_importance.png", dpi=150)
Route historical delay rate and carrier delay rate dominate feature importance, which makes intuitive sense. Structural route delays (congested airspace, slot-controlled airports) are the strongest predictor of whether a given flight will be late. Time-of-day features add meaningful lift, particularly for late-afternoon and evening departures, where late-aircraft delays from earlier in the day tend to cascade.
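The time-of-day effect is easy to inspect directly by computing the empirical delay rate per scheduled departure hour. The sketch below does this on a handful of invented rows, so the numbers it prints say nothing about real delays. On the full data the same groupby would run against the `flights` frame.

```python
import pandas as pd

# Invented rows for illustration only; replace `toy` with the real `flights` frame.
toy = pd.DataFrame({
    "crs_dep_time": [600, 615, 630, 1210, 1745, 1750, 1805, 2230],
    "dep_del15":    [0,   0,   1,   0,    1,    0,    1,    1],
})
toy["dep_hour"] = toy["crs_dep_time"] // 100

# Delay rate and flight count per scheduled departure hour.
hourly = (
    toy.groupby("dep_hour")["dep_del15"]
    .agg(delay_rate="mean", flights="count")
    .sort_index()
)
print(hourly)
```

Keeping the flight count next to the rate makes it clear which hours have enough traffic for the rate to be trusted.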
Going Further
- Add NOAA weather at origin/destination airports on the departure date for a significant AUC lift
- Use the delay_cause breakdown (carrier_delay, weather_delay, nas_delay) to train separate models per cause
- Build a multi-class model predicting delay bucket: on-time / <30 min / 30–120 min / >2 hr
- Join against NTSB Aviation data to flag high-incident aircraft tail numbers as a risk feature
- Deploy as a REST API: take flight_date, carrier, origin, dest, crs_dep_time → return P(delay ≥ 15 min)
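As a starting point for the multi-class idea, the delay buckets can be derived from the delay in minutes. This is a minimal sketch under two assumptions: the delay-minutes column is called `dep_delay` (a hypothetical name), and "on-time" means under 15 minutes, to stay consistent with dep_del15.

```python
import pandas as pd

BUCKETS = ["on-time", "<30 min", "30–120 min", ">2 hr"]

def delay_bucket(minutes):
    """Return the index into BUCKETS for a departure delay in minutes."""
    if minutes < 15:
        return 0   # below the 15-minute threshold used by dep_del15
    if minutes < 30:
        return 1
    if minutes <= 120:
        return 2
    return 3

# `dep_delay` is an assumed column name; values are chosen to hit each boundary.
toy = pd.DataFrame({"dep_delay": [-5, 14, 15, 29, 30, 120, 121, 300]})
toy["bucket_id"] = toy["dep_delay"].map(delay_bucket)
toy["bucket"] = toy["bucket_id"].map(lambda i: BUCKETS[i])
print(toy)
```

The integer `bucket_id` column can serve as the target for a multi-class classifier, while the string column is only there for readability.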
The free sample contains 1,000 rows. The complete DOT Airline On-Time Performance dataset covers 35M+ flights from 2018 to 2024, available as CSV and Parquet with a commercial license for $99.