Vehicle Safety Profile Dataset
NHTSA Complaints + NHTSA Recalls + FARS Fatal Crashes — Joined by Make / Model / Year
A researcher studying Toyota Camry safety currently has to download 3 separate government datasets, figure out join keys, and reconcile schema mismatches. We did it once. This dataset joins 2.2M NHTSA complaints, 176K recall campaigns, and 2.7M FARS vehicle records into a single 33K+ row profile table — one row per make/model/year with complaint counts, recall flags, do-not-drive alerts, and fatal crash counts. The only pre-built vehicle safety dataset of its kind.
Why not build this yourself from NHTSA + FARS?
Technically possible — here's what it takes:
- ✓ 3 separate datasets downloaded and parsed — NHTSA complaints (2.2M rows, 53 columns), NHTSA recalls (176K campaigns), FARS vehicles (2.7M rows, 278 columns) each require separate pipelines
- ✓ Make/model name normalisation across 3 different schemas — complaints use all-caps, recalls use mixed case, FARS uses numeric codes with text decodes via VIN lookup; we normalise all three to a consistent uppercase key
- ✓ FARS model year format cleaned — FARS stores model years as 2-digit codes (73 = 1973), 4-digit years, and special codes (9999 = unknown); we handle all cases
- ✓ Outer join across all three — vehicles appear in some sources but not others; we preserve full coverage with NULL-filled counts so you know when data is absent vs. zero
- ✓ Complaint trend computed per vehicle — increasing/decreasing/stable label based on annual complaint counts over time
- ✓ Parquet output — the raw sources are CSV; columnar Parquet is 5–10× faster for queries across 33K vehicle-year rows
⏱ Skip 3 dataset pipelines, schema reconciliation, multi-key join logic, and trend computation. Profile ready in seconds.
33K+
Vehicle-Year Profiles
2.2M
NHTSA Complaints Joined
176K
Recall Campaigns Joined
2.7M
FARS Fatalities Joined
Use Cases
Score any used vehicle by make/model/year before purchase. Know instantly: how many complaints, how many recalls, whether there was a do-not-drive advisory, and how many fatal crashes involved that specific vehicle.
Expert witnesses and plaintiffs' attorneys need pre-built complaint, recall, and fatality histories by vehicle. This dataset replaces weeks of manual NHTSA/FARS lookups with a single Parquet join.
Incorporate vehicle safety profile scores into auto insurance pricing models. Vehicles with high complaint counts, safety recalls, and fatality rates represent elevated claims risk beyond what standard actuarial tables capture.
Fleet managers maintaining hundreds of vehicles need to know which models have open safety recalls, do-not-drive flags, or escalating complaint trends. This dataset enables automated fleet-level safety monitoring.
Identify which vehicles have the highest complaint-to-recall ratios (NHTSA may be missing patterns), which models have declining fatality rates over model years, or which components drive the most field failures.
Use complaint counts, recall history, and fatality rates as features for vehicle value prediction, insurance pricing, or fraud detection models. Pre-joined, normalized, and ready for feature engineering.
Schema
Two tables delivered as CSV + Parquet. Join on (make, model, model_year).
Profile Table — vehicle_safety_profile
One row per (make, model, year). 33K+ rows. Primary lookup table.
| Field | Type | Description |
|---|---|---|
| make | string | Vehicle manufacturer (normalized uppercase, e.g. FORD, TOYOTA) |
| model | string | Vehicle model (normalized uppercase, e.g. CAMRY, F-150) |
| model_year | int | Model year of the vehicle (e.g. 2019) |
| complaint_count | int | Total NHTSA consumer complaints filed for this vehicle |
| complaints_with_crash | int | Complaints where a crash was reported |
| complaints_with_fire | int | Complaints where a fire was reported |
| complaints_with_injury | int | Complaints reporting at least one injury |
| avg_miles_at_failure | float | Average vehicle mileage when failure was reported |
| top_complaint_component | string | Most commonly cited component in complaints (e.g. ENGINE, STEERING, AIRBAGS) |
| complaint_trend | string | Complaint trend over time: increasing, decreasing, or stable |
| first_complaint_date | string | Date of earliest complaint (YYYY-MM-DD) |
| recall_count | int | Number of distinct recall campaigns issued for this vehicle |
| total_affected_units | int | Total units affected across all recall campaigns |
| safety_recall_count | int | Number of recalls classified as safety-critical (vehicle, equipment, or tire recalls) |
| has_safety_recall | int | 1 if any safety-critical recall was issued, 0 otherwise |
| do_not_drive_flag | int | 1 if any recall issued a do-not-drive advisory, 0 otherwise |
| first_recall_date | string | Date of earliest recall (YYYY-MM-DD) |
| fatality_count | int | Total FARS-reported fatalities in crashes involving this vehicle make/model/year |
| fatal_crash_count | int | Number of distinct fatal crash cases involving this vehicle |
Timeline Table — vehicle_safety_timeline
Time-series per (make, model, year, quarter). Track complaint escalation and recall history over time.
| Field | Type | Description |
|---|---|---|
| make | string | Vehicle manufacturer (normalized) |
| model | string | Vehicle model (normalized) |
| model_year | int | Model year |
| quarter | string | Quarter when events were reported (e.g. 2020Q3) |
| complaints_this_quarter | int | Complaints added to NHTSA database this quarter |
| recalls_this_quarter | int | Recall campaigns issued this quarter |
| fatalities_this_period | int | Fatalities recorded in FARS for this period |
Quick Start
import pandas as pd
profile = pd.read_parquet("vehicle_safety_profile.parquet")
# Look up a specific vehicle
camry_2019 = profile[
(profile["make"] == "TOYOTA") &
(profile["model"] == "CAMRY") &
(profile["model_year"] == 2019)
]
print(camry_2019[[
"complaint_count", "recall_count", "fatality_count",
"has_safety_recall", "do_not_drive_flag",
"top_complaint_component", "complaint_trend"
]])
# Most dangerous vehicles by combined score
profile["safety_score"] = (
profile["complaint_count"].rank(pct=True) * 0.4 +
profile["fatality_count"].rank(pct=True) * 0.4 +
profile["has_safety_recall"] * 0.2
)
top_risk = profile.nlargest(20, "safety_score")[[
"make", "model", "model_year",
"complaint_count", "recall_count", "fatality_count",
"has_safety_recall", "do_not_drive_flag"
]]
print(top_risk)
# Which component categories drive the most complaints?
print(profile.groupby("top_complaint_component")["complaint_count"].sum()
.sort_values(ascending=False).head(15))
# Vehicles with escalating complaint trends
escalating = profile[profile["complaint_trend"] == "increasing"]
print(f"Vehicles with increasing complaints: {len(escalating):,}")
# Average fatalities by make (all years)
make_risk = (
profile.groupby("make")["fatality_count"].sum()
.sort_values(ascending=False)
.head(15)
)
print(make_risk)
# Year-over-year safety improvement for a specific model
fusion = profile[profile["model"] == "FUSION"].sort_values("model_year")
print(fusion[["model_year", "complaint_count", "recall_count", "fatality_count"]])Pairs Well With
Need the full complaint text and granular detail? Use the raw complaints dataset for NLP on complaint narratives, component-level drill-downs, or queries not answerable from the profile aggregates.
Join vehicle safety profiles against used car pricing data to study whether safety issues (high complaint rates, major recalls, fatality exposure) suppress resale values — or whether the market underprices safety risk in certain segments.
Pricing
$149
Full dataset — 33K+ vehicle-year profiles, complaint timeline, CSV + Parquet
Commercial License
Buy CompleteData Provenance
Source datasets: (1) NHTSA Complaints — National Highway Traffic Safety Administration consumer complaint database; (2) NHTSA Recalls — NHTSA recall campaign database; (3) FARS — NHTSA Fatality Analysis Reporting System vehicle file
Join methodology: Three datasets are joined on normalized (make, model, model_year) keys. Make and model names are uppercased and common aliases resolved. FARS uses VIN-decoded make/model names (vpicmakename /vpicmodelname) when available, with fallback to coded field text. An outer join preserves all vehicle-year combinations appearing in any source.
Coverage: Model years 1985–2025 (configurable). NHTSA complaints begin 1983; recalls begin 1949; FARS begins 1975. Intersection coverage is strongest for model years 2000–present.
Computed fields: complaint_trend (increasing/decreasing/stable based on annual complaint filing rate), has_safety_recall (derived from recall type codes V/E/X), and all count aggregates are computed by ClarityStorm from raw source data.
License: All source data is published by the US federal government and is in the public domain. Paid tiers are licensed under the ClarityStorm Commercial Data License covering our cross-source join, normalisation, trend computation, and enrichment work.
Need custom joins, additional fields, or a specific vehicle subset?
Contact Sales