Vehicle Safety Profile Dataset

Name: Vehicle Safety Profile Dataset — Complaints + Recalls + Fatal Crashes
Creator: ClarityStorm
License: https://www.claritystorm.com/license

NHTSA Complaints + NHTSA Recalls + FARS Fatal Crashes — Joined by Make / Model / Year

A researcher studying Toyota Camry safety currently has to download 3 separate government datasets, figure out join keys, and reconcile schema mismatches. We did it once. This dataset joins 2.2M NHTSA complaints, 176K recall campaigns, and 2.7M FARS vehicle records into a single 33K+ row profile table — one row per make/model/year with complaint counts, recall flags, do-not-drive alerts, and fatal crash counts. The only pre-built vehicle safety dataset of its kind.

Sample: Public DomainPaid: Commercial LicenseCSV + ParquetAnnual Updates1985–20253 Sources Cross-LinkedDo-Not-Drive Flag

Why not build this yourself from NHTSA + FARS?

Technically possible — here's what it takes:

✓ 3 separate datasets downloaded and parsed — NHTSA complaints (2.2M rows, 53 columns), NHTSA recalls (176K campaigns), FARS vehicles (2.7M rows, 278 columns) each require separate pipelines
✓ Make/model name normalisation across 3 different schemas — complaints use all-caps, recalls use mixed case, FARS uses numeric codes with text decodes via VIN lookup; we normalise all three to a consistent uppercase key
✓ FARS model year format cleaned — FARS stores model years as 2-digit codes (73 = 1973), 4-digit years, and special codes (9999 = unknown); we handle all cases
✓ Outer join across all three — vehicles appear in some sources but not others; we preserve full coverage with NULL-filled counts so you know when data is absent vs. zero
✓ Complaint trend computed per vehicle — increasing/decreasing/stable label based on annual complaint counts over time
✓ Parquet output — the raw sources are CSV; columnar Parquet is 5–10× faster for queries across 33K vehicle-year rows

⏱ Skip 3 dataset pipelines, schema reconciliation, multi-key join logic, and trend computation. Profile ready in seconds.

33K+

Vehicle-Year Profiles

2.2M

NHTSA Complaints Joined

176K

Recall Campaigns Joined

2.7M

FARS Fatalities Joined

Use Cases

Used Car Risk Scoring

Score any used vehicle by make/model/year before purchase. Know instantly: how many complaints, how many recalls, whether there was a do-not-drive advisory, and how many fatal crashes involved that specific vehicle.

Automotive Litigation Support

Expert witnesses and plaintiffs' attorneys need pre-built complaint, recall, and fatality histories by vehicle. This dataset replaces weeks of manual NHTSA/FARS lookups with a single Parquet join.

Insurance Underwriting

Incorporate vehicle safety profile scores into auto insurance pricing models. Vehicles with high complaint counts, safety recalls, and fatality rates represent elevated claims risk beyond what standard actuarial tables capture.

Fleet Safety Management

Fleet managers maintaining hundreds of vehicles need to know which models have open safety recalls, do-not-drive flags, or escalating complaint trends. This dataset enables automated fleet-level safety monitoring.

Safety Research & Journalism

Identify which vehicles have the highest complaint-to-recall ratios (NHTSA may be missing patterns), which models have declining fatality rates over model years, or which components drive the most field failures.

Automotive ML Features

Use complaint counts, recall history, and fatality rates as features for vehicle value prediction, insurance pricing, or fraud detection models. Pre-joined, normalized, and ready for feature engineering.

Schema

Two tables delivered as CSV + Parquet. Join on (make, model, model_year).

Profile Table — `vehicle_safety_profile`

One row per (make, model, year). 33K+ rows. Primary lookup table.

Field	Type	Description
make	string	Vehicle manufacturer (normalized uppercase, e.g. FORD, TOYOTA)
model	string	Vehicle model (normalized uppercase, e.g. CAMRY, F-150)
model_year	int	Model year of the vehicle (e.g. 2019)
complaint_count	int	Total NHTSA consumer complaints filed for this vehicle
complaints_with_crash	int	Complaints where a crash was reported
complaints_with_fire	int	Complaints where a fire was reported
complaints_with_injury	int	Complaints reporting at least one injury
avg_miles_at_failure	float	Average vehicle mileage when failure was reported
top_complaint_component	string	Most commonly cited component in complaints (e.g. ENGINE, STEERING, AIRBAGS)
complaint_trend	string	Complaint trend over time: increasing, decreasing, or stable
first_complaint_date	string	Date of earliest complaint (YYYY-MM-DD)
recall_count	int	Number of distinct recall campaigns issued for this vehicle
total_affected_units	int	Total units affected across all recall campaigns
safety_recall_count	int	Number of recalls classified as safety-critical (vehicle, equipment, or tire recalls)
has_safety_recall	int	1 if any safety-critical recall was issued, 0 otherwise
do_not_drive_flag	int	1 if any recall issued a do-not-drive advisory, 0 otherwise
first_recall_date	string	Date of earliest recall (YYYY-MM-DD)
fatality_count	int	Total FARS-reported fatalities in crashes involving this vehicle make/model/year
fatal_crash_count	int	Number of distinct fatal crash cases involving this vehicle

Timeline Table — `vehicle_safety_timeline`

Time-series per (make, model, year, quarter). Track complaint escalation and recall history over time.

Field	Type	Description
make	string	Vehicle manufacturer (normalized)
model	string	Vehicle model (normalized)
model_year	int	Model year
quarter	string	Quarter when events were reported (e.g. 2020Q3)
complaints_this_quarter	int	Complaints added to NHTSA database this quarter
recalls_this_quarter	int	Recall campaigns issued this quarter
fatalities_this_period	int	Fatalities recorded in FARS for this period

Quick Start

import pandas as pd

profile = pd.read_parquet("vehicle_safety_profile.parquet")

# Look up a specific vehicle
camry_2019 = profile[
    (profile["make"] == "TOYOTA") &
    (profile["model"] == "CAMRY") &
    (profile["model_year"] == 2019)
]
print(camry_2019[[
    "complaint_count", "recall_count", "fatality_count",
    "has_safety_recall", "do_not_drive_flag",
    "top_complaint_component", "complaint_trend"
]])

# Most dangerous vehicles by combined score
profile["safety_score"] = (
    profile["complaint_count"].rank(pct=True) * 0.4 +
    profile["fatality_count"].rank(pct=True) * 0.4 +
    profile["has_safety_recall"] * 0.2
)
top_risk = profile.nlargest(20, "safety_score")[[
    "make", "model", "model_year",
    "complaint_count", "recall_count", "fatality_count",
    "has_safety_recall", "do_not_drive_flag"
]]
print(top_risk)

# Which component categories drive the most complaints?
print(profile.groupby("top_complaint_component")["complaint_count"].sum()
      .sort_values(ascending=False).head(15))

# Vehicles with escalating complaint trends
escalating = profile[profile["complaint_trend"] == "increasing"]
print(f"Vehicles with increasing complaints: {len(escalating):,}")

# Average fatalities by make (all years)
make_risk = (
    profile.groupby("make")["fatality_count"].sum()
    .sort_values(ascending=False)
    .head(15)
)
print(make_risk)

# Year-over-year safety improvement for a specific model
fusion = profile[profile["model"] == "FUSION"].sort_values("model_year")
print(fusion[["model_year", "complaint_count", "recall_count", "fatality_count"]])

Pairs Well With

ClarityStorm: NHTSA Complaints (raw)

Need the full complaint text and granular detail? Use the raw complaints dataset for NLP on complaint narratives, component-level drill-downs, or queries not answerable from the profile aggregates.

External: Kelley Blue Book / Black Book Pricing

Join vehicle safety profiles against used car pricing data to study whether safety issues (high complaint rates, major recalls, fatality exposure) suppress resale values — or whether the market underprices safety risk in certain segments.

Pricing

Sample

Free

Top 1,000 most-complained-about vehicles (CSV) + schema docs

Public Domain

Download Sample

Complete

$149

Full dataset — 33K+ vehicle-year profiles, complaint timeline, CSV + Parquet

Commercial License

Buy Complete

Annual

$249/yr

Full dataset + annual updates as NHTSA and FARS release new data

Commercial License

Data Provenance

Source datasets: (1) NHTSA Complaints — National Highway Traffic Safety Administration consumer complaint database; (2) NHTSA Recalls — NHTSA recall campaign database; (3) FARS — NHTSA Fatality Analysis Reporting System vehicle file

Join methodology: Three datasets are joined on normalized (make, model, model_year) keys. Make and model names are uppercased and common aliases resolved. FARS uses VIN-decoded make/model names (vpicmakename /vpicmodelname) when available, with fallback to coded field text. An outer join preserves all vehicle-year combinations appearing in any source.

Coverage: Model years 1985–2025 (configurable). NHTSA complaints begin 1983; recalls begin 1949; FARS begins 1975. Intersection coverage is strongest for model years 2000–present.

Computed fields: complaint_trend (increasing/decreasing/stable based on annual complaint filing rate), has_safety_recall (derived from recall type codes V/E/X), and all count aggregates are computed by ClarityStorm from raw source data.

License: All source data is published by the US federal government and is in the public domain. Paid tiers are licensed under the ClarityStorm Commercial Data License covering our cross-source join, normalisation, trend computation, and enrichment work.

Need custom joins, additional fields, or a specific vehicle subset?

Contact Sales