When a driver notices their car shaking at highway speed, smoke coming from the dashboard, or brakes that feel soft, they can file a complaint with the NHTSA Office of Defects Investigation. Since 1995, drivers have filed over 2.2 million such complaints — creating an extraordinary corpus of real-world vehicle failure reports. Each record includes the vehicle make, model, model year, component system, mileage at failure, crash/fire/injury flags, and a free-text narrative describing exactly what happened.
In this tutorial we'll use the ClarityStorm NHTSA complaints dataset to build a defect clustering pipeline, profile high-risk vehicle-component combinations, and train a simple recall-likelihood classifier.
Dataset Overview
The ClarityStorm release includes all 2.2M complaints from 1995 through the current NHTSA data feed, with parsed component hierarchies (system → component → part) and derived fields like fail_year. Both CSV and Parquet formats are included.
- 2.2M complaint records spanning 30+ years
- 100+ vehicle makes and thousands of models
- Component hierarchy: system / component / part (500+ unique parts)
- Crash, fire, injury, and death flags for severity scoring
- Free-text complaint_desc field — rich NLP corpus
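The crash/fire/injury/death flags lend themselves to a simple per-complaint severity score. The sketch below combines them with illustrative weights of my own choosing (the weighting is an assumption, not part of the dataset); the column names `crash`, `fire`, `injured`, and `deaths` follow the field list above, and the toy rows stand in for real records.

```python
import pandas as pd

# Toy rows standing in for real complaint records.
complaints = pd.DataFrame({
    "crash": [1, 0, 0],
    "fire": [0, 1, 0],
    "injured": [2, 0, 0],  # count of people injured
    "deaths": [0, 0, 0],   # count of fatalities
})

# Illustrative weights (an assumption): deaths dominate, then
# injuries, then fires, then bare crash involvement.
complaints["severity"] = (
    1.0 * complaints["crash"]
    + 2.0 * complaints["fire"]
    + 3.0 * (complaints["injured"] > 0).astype(int)
    + 10.0 * (complaints["deaths"] > 0).astype(int)
)
print(complaints["severity"].tolist())  # [4.0, 2.0, 0.0]
```

Any monotone combination of the flags works here; the point is to collapse four booleans/counts into one sortable number before aggregating by vehicle.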
Loading and Profiling
import pandas as pd
df = pd.read_parquet("nhtsa_complaints.parquet")
print(f"Total complaints: {len(df):,}")
print(f"Date range: {df['fail_year'].min()} – {df['fail_year'].max()}")
# Top makes by complaint volume
top_makes = (
    df.groupby("make")
    .agg(
        complaints=("cmplid", "count"),
        crash_pct=("crash", "mean"),
        injury_avg=("injured", "mean"),
    )
    .sort_values("complaints", ascending=False)
    .head(15)
)
print(top_makes)

NLP: Clustering Complaint Narratives
The complaint_desc field is where the real signal lives. A TF-IDF vectorization followed by K-Means clustering surfaces natural complaint theme groups — steering failures, transmission slipping, airbag non-deployment, fire risk — without needing labeled training data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans
import numpy as np
# Sample for quick prototyping; use full dataset for production
sample = df[df["complaint_desc"].notna()].sample(50_000, random_state=42)
vectorizer = TfidfVectorizer(
    max_features=10_000,
    ngram_range=(1, 2),
    stop_words="english",
    min_df=5,
)
X = vectorizer.fit_transform(sample["complaint_desc"])
# Mini-batch K-Means scales to the full 2.2M corpus
kmeans = MiniBatchKMeans(n_clusters=20, random_state=42, batch_size=5000)
kmeans.fit(X)
sample = sample.copy()
sample["cluster"] = kmeans.labels_
# Top terms per cluster
terms = vectorizer.get_feature_names_out()
for i, centroid in enumerate(kmeans.cluster_centers_):
    top_idx = np.argsort(centroid)[-8:][::-1]
    top_terms = ", ".join(terms[top_idx])
    print(f"Cluster {i:>2}: {top_terms}")

Recall Likelihood Classifier
Not every complaint cluster becomes a recall, but complaints that involve crashes, fires, or injuries at high rates, or that concentrate on specific component failures, are strong recall precursors. A simple gradient-boosted classifier trained on crash, fire, injury, and death rates, aggregated by make, model, model year, and component system, can flag high-risk combinations. True recall outcomes would need to be joined in from NHTSA's separate recalls dataset, so the example below uses a complaint-volume threshold as a proxy label.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Feature engineering: aggregate complaints to (make, model, model_year, comp_system)
agg = (
    df.groupby(["make", "model", "model_year", "comp_system"])
    .agg(
        complaint_count=("cmplid", "count"),
        crash_rate=("crash", "mean"),
        fire_rate=("fire", "mean"),
        injury_rate=("injured", lambda x: (x > 0).mean()),
        death_rate=("deaths", lambda x: (x > 0).mean()),
    )
    .reset_index()
)
# Label: did this combo accumulate 50+ complaints? (proxy for investigation threshold)
agg["high_volume"] = (agg["complaint_count"] >= 50).astype(int)
# complaint_count defines the label above, so exclude it from the
# features to avoid target leakage
FEATURES = ["crash_rate", "fire_rate", "injury_rate", "death_rate"]
X = agg[FEATURES].fillna(0)
y = agg["high_volume"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

Insurance and Risk Scoring Applications
Beyond defect detection, the NHTSA complaints dataset is a powerful signal for insurance actuaries. By computing complaint volume, crash-involvement rate, and injury rate by make/model/year, you can build a vehicle risk index that supplements traditional actuarial tables. High complaint rates on specific components (fuel systems, steering) correlate with elevated claim frequencies — and complaints often surface 12–24 months before formal NHTSA investigations open.
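A vehicle risk index of the kind described above can be sketched as a weighted sum of min-max-normalized complaint metrics. The weights below are illustrative assumptions, not calibrated actuarial values, and the toy DataFrame stands in for real complaint records; in practice you would aggregate the full dataset and calibrate the weights against claims data.

```python
import pandas as pd

# Toy complaint records; real data comes from the complaints file.
df = pd.DataFrame({
    "make": ["acme"] * 3 + ["zephyr"] * 3,
    "model": ["alpha"] * 3 + ["beta"] * 3,
    "model_year": [2018] * 3 + [2020] * 3,
    "crash": [1, 0, 1, 0, 0, 0],
    "injured": [2, 0, 1, 0, 0, 0],
})

# Aggregate to make/model/year.
risk = (
    df.groupby(["make", "model", "model_year"])
    .agg(
        complaints=("crash", "size"),
        crash_rate=("crash", "mean"),
        injury_rate=("injured", lambda x: (x > 0).mean()),
    )
    .reset_index()
)

# Min-max normalize each metric so the weighted sum is comparable
# across vehicles (a constant column normalizes to 0).
for col in ["complaints", "crash_rate", "injury_rate"]:
    rng = risk[col].max() - risk[col].min()
    risk[col + "_norm"] = (risk[col] - risk[col].min()) / rng if rng else 0.0

# Illustrative weights -- an actuary would calibrate these on claims data.
risk["risk_index"] = (
    0.3 * risk["complaints_norm"]
    + 0.4 * risk["crash_rate_norm"]
    + 0.3 * risk["injury_rate_norm"]
)
print(risk[["make", "model", "model_year", "risk_index"]])
```

The resulting index lives in [0, 1] and ranks vehicles relative to the cohort being scored, which is usually what an underwriting model wants as a supplemental feature.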
The free sample contains 1,000 rows. The complete dataset covers 2.2M complaints with all fields including full complaint narratives — available as CSV and Parquet.