← All Datasets

Vehicle Safety Profile Dataset

NHTSA Complaints + NHTSA Recalls + FARS Fatal Crashes — Joined by Make / Model / Year

A researcher studying Toyota Camry safety currently has to download 3 separate government datasets, figure out join keys, and reconcile schema mismatches. We did it once. This dataset joins 2.2M NHTSA complaints, 176K recall campaigns, and 2.7M FARS vehicle records into a single 33K+ row profile table — one row per make/model/year with complaint counts, recall flags, do-not-drive alerts, and fatal crash counts. The only pre-built vehicle safety dataset of its kind.

Sample: Public DomainPaid: Commercial LicenseCSV + ParquetAnnual Updates1985–20253 Sources Cross-LinkedDo-Not-Drive Flag

Why not build this yourself from NHTSA + FARS?

Technically possible — here's what it takes:

  • 3 separate datasets downloaded and parsed — NHTSA complaints (2.2M rows, 53 columns), NHTSA recalls (176K campaigns), FARS vehicles (2.7M rows, 278 columns) each require separate pipelines
  • Make/model name normalisation across 3 different schemas — complaints use all-caps, recalls use mixed case, FARS uses numeric codes with text decodes via VIN lookup; we normalise all three to a consistent uppercase key
  • FARS model year format cleaned — FARS stores model years as 2-digit codes (73 = 1973), 4-digit years, and special codes (9999 = unknown); we handle all cases
  • Outer join across all three — vehicles appear in some sources but not others; we preserve full coverage with NULL-filled counts so you know when data is absent vs. zero
  • Complaint trend computed per vehicle — increasing/decreasing/stable label based on annual complaint counts over time
  • Parquet output — the raw sources are CSV; columnar Parquet is 5–10× faster for queries across 33K vehicle-year rows

⏱ Skip 3 dataset pipelines, schema reconciliation, multi-key join logic, and trend computation. Profile ready in seconds.

33K+

Vehicle-Year Profiles

2.2M

NHTSA Complaints Joined

176K

Recall Campaigns Joined

2.7M

FARS Fatalities Joined

Use Cases

Used Car Risk Scoring

Score any used vehicle by make/model/year before purchase. Know instantly: how many complaints, how many recalls, whether there was a do-not-drive advisory, and how many fatal crashes involved that specific vehicle.

Automotive Litigation Support

Expert witnesses and plaintiffs' attorneys need pre-built complaint, recall, and fatality histories by vehicle. This dataset replaces weeks of manual NHTSA/FARS lookups with a single Parquet join.

Insurance Underwriting

Incorporate vehicle safety profile scores into auto insurance pricing models. Vehicles with high complaint counts, safety recalls, and fatality rates represent elevated claims risk beyond what standard actuarial tables capture.

Fleet Safety Management

Fleet managers maintaining hundreds of vehicles need to know which models have open safety recalls, do-not-drive flags, or escalating complaint trends. This dataset enables automated fleet-level safety monitoring.

Safety Research & Journalism

Identify which vehicles have the highest complaint-to-recall ratios (NHTSA may be missing patterns), which models have declining fatality rates over model years, or which components drive the most field failures.

Automotive ML Features

Use complaint counts, recall history, and fatality rates as features for vehicle value prediction, insurance pricing, or fraud detection models. Pre-joined, normalized, and ready for feature engineering.

Schema

Two tables delivered as CSV + Parquet. Join on (make, model, model_year).

Profile Table — vehicle_safety_profile

One row per (make, model, year). 33K+ rows. Primary lookup table.

FieldTypeDescription
makestringVehicle manufacturer (normalized uppercase, e.g. FORD, TOYOTA)
modelstringVehicle model (normalized uppercase, e.g. CAMRY, F-150)
model_yearintModel year of the vehicle (e.g. 2019)
complaint_countintTotal NHTSA consumer complaints filed for this vehicle
complaints_with_crashintComplaints where a crash was reported
complaints_with_fireintComplaints where a fire was reported
complaints_with_injuryintComplaints reporting at least one injury
avg_miles_at_failurefloatAverage vehicle mileage when failure was reported
top_complaint_componentstringMost commonly cited component in complaints (e.g. ENGINE, STEERING, AIRBAGS)
complaint_trendstringComplaint trend over time: increasing, decreasing, or stable
first_complaint_datestringDate of earliest complaint (YYYY-MM-DD)
recall_countintNumber of distinct recall campaigns issued for this vehicle
total_affected_unitsintTotal units affected across all recall campaigns
safety_recall_countintNumber of recalls classified as safety-critical (vehicle, equipment, or tire recalls)
has_safety_recallint1 if any safety-critical recall was issued, 0 otherwise
do_not_drive_flagint1 if any recall issued a do-not-drive advisory, 0 otherwise
first_recall_datestringDate of earliest recall (YYYY-MM-DD)
fatality_countintTotal FARS-reported fatalities in crashes involving this vehicle make/model/year
fatal_crash_countintNumber of distinct fatal crash cases involving this vehicle

Timeline Table — vehicle_safety_timeline

Time-series per (make, model, year, quarter). Track complaint escalation and recall history over time.

FieldTypeDescription
makestringVehicle manufacturer (normalized)
modelstringVehicle model (normalized)
model_yearintModel year
quarterstringQuarter when events were reported (e.g. 2020Q3)
complaints_this_quarterintComplaints added to NHTSA database this quarter
recalls_this_quarterintRecall campaigns issued this quarter
fatalities_this_periodintFatalities recorded in FARS for this period

Quick Start

import pandas as pd

profile = pd.read_parquet("vehicle_safety_profile.parquet")

# Look up a specific vehicle
camry_2019 = profile[
    (profile["make"] == "TOYOTA") &
    (profile["model"] == "CAMRY") &
    (profile["model_year"] == 2019)
]
print(camry_2019[[
    "complaint_count", "recall_count", "fatality_count",
    "has_safety_recall", "do_not_drive_flag",
    "top_complaint_component", "complaint_trend"
]])

# Most dangerous vehicles by combined score
profile["safety_score"] = (
    profile["complaint_count"].rank(pct=True) * 0.4 +
    profile["fatality_count"].rank(pct=True) * 0.4 +
    profile["has_safety_recall"] * 0.2
)
top_risk = profile.nlargest(20, "safety_score")[[
    "make", "model", "model_year",
    "complaint_count", "recall_count", "fatality_count",
    "has_safety_recall", "do_not_drive_flag"
]]
print(top_risk)

# Which component categories drive the most complaints?
print(profile.groupby("top_complaint_component")["complaint_count"].sum()
      .sort_values(ascending=False).head(15))

# Vehicles with escalating complaint trends
escalating = profile[profile["complaint_trend"] == "increasing"]
print(f"Vehicles with increasing complaints: {len(escalating):,}")

# Average fatalities by make (all years)
make_risk = (
    profile.groupby("make")["fatality_count"].sum()
    .sort_values(ascending=False)
    .head(15)
)
print(make_risk)

# Year-over-year safety improvement for a specific model
fusion = profile[profile["model"] == "FUSION"].sort_values("model_year")
print(fusion[["model_year", "complaint_count", "recall_count", "fatality_count"]])

Pairs Well With

Need the full complaint text and granular detail? Use the raw complaints dataset for NLP on complaint narratives, component-level drill-downs, or queries not answerable from the profile aggregates.

External: Kelley Blue Book / Black Book Pricing

Join vehicle safety profiles against used car pricing data to study whether safety issues (high complaint rates, major recalls, fatality exposure) suppress resale values — or whether the market underprices safety risk in certain segments.

Pricing

Sample

Free

Top 1,000 most-complained-about vehicles (CSV) + schema docs

Public Domain

Download Sample
Complete

$149

Full dataset — 33K+ vehicle-year profiles, complaint timeline, CSV + Parquet

Commercial License

Buy Complete
Annual

$249/yr

Full dataset + annual updates as NHTSA and FARS release new data

Commercial License

Subscribe

Data Provenance

Source datasets: (1) NHTSA Complaints — National Highway Traffic Safety Administration consumer complaint database; (2) NHTSA Recalls — NHTSA recall campaign database; (3) FARS — NHTSA Fatality Analysis Reporting System vehicle file

Join methodology: Three datasets are joined on normalized (make, model, model_year) keys. Make and model names are uppercased and common aliases resolved. FARS uses VIN-decoded make/model names (vpicmakename /vpicmodelname) when available, with fallback to coded field text. An outer join preserves all vehicle-year combinations appearing in any source.

Coverage: Model years 1985–2025 (configurable). NHTSA complaints begin 1983; recalls begin 1949; FARS begins 1975. Intersection coverage is strongest for model years 2000–present.

Computed fields: complaint_trend (increasing/decreasing/stable based on annual complaint filing rate), has_safety_recall (derived from recall type codes V/E/X), and all count aggregates are computed by ClarityStorm from raw source data.

License: All source data is published by the US federal government and is in the public domain. Paid tiers are licensed under the ClarityStorm Commercial Data License covering our cross-source join, normalisation, trend computation, and enrichment work.

Need custom joins, additional fields, or a specific vehicle subset?

Contact Sales