Commercial aviation is statistically the safest form of transportation per mile traveled — yet accidents still happen, and when they do, the NTSB investigates exhaustively. The National Transportation Safety Board has recorded every US aviation accident since 1982 in its accident database: over 30,000 events covering general aviation, commercial carriers, and everything in between. Each record links the accident event to the aircraft involved, the injuries sustained, the probable causes found, and in many cases a full narrative of what went wrong.
This tutorial walks through loading the ClarityStorm NTSB Aviation dataset, computing accident rate trends, profiling high-risk aircraft categories, and extracting patterns from the findings table.
Six Tables, One Accident Database
The NTSB database is relational. ClarityStorm ships all six core tables with consistent schema and primary key linkage: events (one row per accident), aircraft (one row per aircraft in each event), engines, narratives (full-text cause descriptions), findings (structured probable cause codes), and occurrences (phase-of-flight and occurrence type flags).
- events: ~30K accident records, 1982–present
- aircraft: aircraft type, category, make, model, amateur-built flag
- narratives: full-text probable cause and factual narratives
- findings: structured cause codes (human factor, mechanical, environmental)
- occurrences: phase-of-flight and occurrence sequence
- engines: engine type, number, and horsepower
Loading and Joining the Tables
import pandas as pd
events = pd.read_parquet("ntsb_events.parquet")
aircraft = pd.read_parquet("ntsb_aircraft.parquet")
findings = pd.read_parquet("ntsb_findings.parquet")
print(f"Events: {len(events):,}")
print(f"Aircraft: {len(aircraft):,}")
print(f"Findings: {len(findings):,}")
# Primary join: events ← aircraft via ev_id
df = events.merge(aircraft, on="ev_id", how="left")
print(f"Joined rows: {len(df):,}")
# Quick profile
print(df.groupby("ev_highest_injury")["ev_id"].count())Accident Rate Trends
The story the NTSB data tells is one of consistent improvement. General aviation accident rates have fallen sharply since the 1980s, driven by improved avionics (GPS, terrain-awareness systems), better pilot training standards, and airspace modernization. The data makes this visible.
import matplotlib.pyplot as plt
annual = (
events.groupby("ev_year")
.agg(
total_events=("ev_id", "count"),
fatal_events=("ev_highest_injury", lambda x: (x == "FATAL").sum()),
)
.reset_index()
)
annual["fatal_rate"] = annual["fatal_events"] / annual["total_events"]
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8), sharex=True)
ax1.bar(annual["ev_year"], annual["total_events"], color="#0ea5e9", alpha=0.7)
ax1.set_title("Total Aviation Accidents per Year (NTSB 1982–Present)")
ax1.set_ylabel("Accidents")
ax2.plot(annual["ev_year"], annual["fatal_rate"], color="#ef4444", linewidth=2)
ax2.fill_between(annual["ev_year"], annual["fatal_rate"], alpha=0.2, color="#ef4444")
ax2.set_title("Fatal Accident Rate")
ax2.set_ylabel("Fraction Fatal")
ax2.set_xlabel("Year")
plt.tight_layout()
plt.savefig("ntsb_trend.png", dpi=150)Profiling High-Risk Aircraft Categories
Not all aircraft categories carry equal risk. Single-engine piston aircraft dominate the accident count by volume (they also dominate the general aviation fleet), but rotorcraft and amateur-built aircraft show disproportionate fatal-accident rates. The aircraft table makes this breakdown straightforward.
# Fatal accident rate by aircraft category
category_risk = (
df.groupby("acft_category")
.agg(
total=("ev_id", "count"),
fatal=("ev_highest_injury", lambda x: (x == "FATAL").sum()),
)
.assign(fatal_rate=lambda d: d["fatal"] / d["total"])
.sort_values("fatal_rate", ascending=False)
.query("total >= 100") # filter low-sample categories
)
print(category_risk[["total", "fatal", "fatal_rate"]].to_string())Mining Probable Cause Findings
The findings table encodes the NTSB's structured causal analysis for each accident: human performance issues, mechanical failures, environmental factors, and their relationships. Aggregating findings lets you quantify how often pilot error, maintenance failures, or weather conditions appear as causes — and how those proportions have shifted over four decades.
# Top finding categories
cause_counts = (
findings.groupby("finding_code")["ev_id"]
.count()
.sort_values(ascending=False)
.head(20)
.reset_index()
.rename(columns={"ev_id": "occurrences"})
)
print(cause_counts)
# Merge with narratives for a specific finding
pilot_error = findings[findings["finding_code"].str.startswith("22")] # human performance
print(f"Human performance findings: {len(pilot_error):,}")The free sample includes 1,000 rows from the events table. The complete dataset ships all six tables (events, aircraft, narratives, findings, occurrences, engines) as CSV and Parquet.