
Building a Vehicle Defect Detection Model with NHTSA Complaints

Use 2.2M NHTSA vehicle complaint narratives to build an NLP defect detection model in Python. Cluster complaint text, predict recall likelihood, and profile high-risk makes.

Tags: NHTSA, NLP, vehicle safety, scikit-learn, Python, tutorial

When a driver notices their car shaking at highway speed, smoke coming from the dashboard, or brakes that feel soft, they can file a complaint with the NHTSA Office of Defects Investigation. Since 1995, drivers have filed over 2.2 million such complaints — creating an extraordinary corpus of real-world vehicle failure reports. Each record includes the vehicle make, model, model year, component system, mileage at failure, crash/fire/injury flags, and a free-text narrative describing exactly what happened.

In this tutorial we'll use the ClarityStorm NHTSA complaints dataset to build a defect clustering pipeline, profile high-risk vehicle-component combinations, and train a simple recall-likelihood classifier.

Dataset Overview

The ClarityStorm release includes all 2.2M complaints from 1995 through the current NHTSA data feed, with parsed component hierarchies (system → component → part) and derived fields like fail_year. Both CSV and Parquet formats are included.

  • 2.2M complaint records spanning 30+ years
  • 100+ vehicle makes and thousands of models
  • Component hierarchy: system / component / part (500+ unique parts)
  • Crash, fire, injury, and death flags for severity scoring
  • Free-text complaint_desc field — rich NLP corpus
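The crash, fire, injury, and death flags listed above lend themselves to a single severity score per complaint. A minimal sketch, using a tiny synthetic frame in place of the real table and purely illustrative weights (the weights are not part of the dataset):

```python
import pandas as pd

# Tiny synthetic stand-in for the real complaints table; column names
# follow the dataset description, values are made up.
df = pd.DataFrame({
    "cmplid": [1, 2, 3, 4],
    "crash":   [1, 0, 0, 1],
    "fire":    [0, 1, 0, 0],
    "injured": [2, 0, 0, 1],
    "deaths":  [0, 0, 0, 1],
})

# Illustrative severity weights -- tune these for your own analysis.
df["severity"] = (
    5 * df["crash"]
    + 5 * df["fire"]
    + 10 * (df["injured"] > 0).astype(int)
    + 50 * (df["deaths"] > 0).astype(int)
)
print(df[["cmplid", "severity"]])
```

Sorting on a score like this is a quick way to triage the corpus before any modeling.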

Loading and Profiling

```python
import pandas as pd

df = pd.read_parquet("nhtsa_complaints.parquet")

print(f"Total complaints: {len(df):,}")
print(f"Date range: {df['fail_year'].min()} – {df['fail_year'].max()}")

# Top makes by complaint volume
top_makes = (
    df.groupby("make")
    .agg(
        complaints=("cmplid", "count"),
        crash_pct=("crash", "mean"),
        injury_avg=("injured", "mean"),
    )
    .sort_values("complaints", ascending=False)
    .head(15)
)
print(top_makes)
```
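The same groupby pattern gives complaint volume over time via the derived fail_year field. A sketch on a small synthetic stand-in for df (the real frame comes from the parquet load above):

```python
import pandas as pd

# Synthetic stand-in for the loaded complaints frame.
df = pd.DataFrame({
    "cmplid": range(6),
    "fail_year": [2019, 2019, 2020, 2020, 2020, 2021],
})

# Complaint volume per failure year -- useful for spotting spikes.
by_year = df.groupby("fail_year")["cmplid"].count()
print(by_year)
```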

NLP: Clustering Complaint Narratives

The complaint_desc field is where the real signal lives. A TF-IDF vectorization followed by K-Means clustering surfaces natural complaint theme groups — steering failures, transmission slipping, airbag non-deployment, fire risk — without needing labeled training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans
import numpy as np

# Sample for quick prototyping; use full dataset for production
sample = df[df["complaint_desc"].notna()].sample(50_000, random_state=42)

vectorizer = TfidfVectorizer(
    max_features=10_000,
    ngram_range=(1, 2),
    stop_words="english",
    min_df=5,
)
X = vectorizer.fit_transform(sample["complaint_desc"])

# Mini-batch K-Means scales to the full 2.2M corpus
kmeans = MiniBatchKMeans(n_clusters=20, random_state=42, batch_size=5000)
kmeans.fit(X)
sample = sample.copy()
sample["cluster"] = kmeans.labels_

# Top terms per cluster
terms = vectorizer.get_feature_names_out()
for i, centroid in enumerate(kmeans.cluster_centers_):
    top_idx = np.argsort(centroid)[-8:][::-1]
    top_terms = ", ".join(terms[top_idx])
    print(f"Cluster {i:>2}: {top_terms}")
```
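Once cluster labels are assigned, cross-tabulating them against the severity flags shows which complaint themes are the most dangerous. A sketch using synthetic cluster/flag columns standing in for the clustered sample above:

```python
import pandas as pd

# Synthetic stand-in for the clustered sample (cluster labels and
# severity flags are made up for illustration).
sample = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1, 2],
    "crash":   [1, 0, 1, 1, 0, 0],
    "fire":    [0, 0, 0, 1, 0, 0],
})

# Rank clusters by crash/fire involvement to surface high-risk themes.
cluster_risk = (
    sample.groupby("cluster")[["crash", "fire"]]
    .mean()
    .sort_values("crash", ascending=False)
)
print(cluster_risk)
```

On the real data, the clusters that rise to the top of this table are natural candidates for the recall classifier in the next section.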

Recall Likelihood Classifier

Not every complaint cluster becomes a recall. But combinations that involve crashes, fires, or injuries at high rates, or that concentrate complaints on specific component failures, are strong recall precursors. A simple gradient-boosted classifier trained on crash, fire, injury, and death rates per (make, model, model year, component system) combination can predict whether that combination will accumulate a high complaint volume, a rough proxy for investigation likelihood, with reasonable precision.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Feature engineering: aggregate complaints to (make, model, model_year, comp_system)
agg = (
    df.groupby(["make", "model", "model_year", "comp_system"])
    .agg(
        complaint_count=("cmplid", "count"),
        crash_rate=("crash", "mean"),
        fire_rate=("fire", "mean"),
        injury_rate=("injured", lambda x: (x > 0).mean()),
        death_rate=("deaths", lambda x: (x > 0).mean()),
    )
    .reset_index()
)

# Label: did this combo accumulate 50+ complaints? (proxy for investigation threshold)
agg["high_volume"] = (agg["complaint_count"] >= 50).astype(int)

# complaint_count defines the label, so it must be excluded from the
# features to avoid label leakage.
FEATURES = ["crash_rate", "fire_rate", "injury_rate", "death_rate"]
X = agg[FEATURES].fillna(0)
y = agg["high_volume"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
```
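After fitting, the model's feature_importances_ attribute shows which severity signals drive the prediction. A sketch on synthetic features rather than the real aggregates (the feature names here are illustrative; the first two synthetic columns are constructed to carry the signal):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic features/labels standing in for the aggregated table:
# the label depends only on the first two columns by construction.
rng = np.random.default_rng(42)
X = rng.random((200, 4))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

clf = GradientBoostingClassifier(n_estimators=50, random_state=42)
clf.fit(X, y)

names = ["crash_rate", "fire_rate", "injury_rate", "death_rate"]
for name, imp in sorted(zip(names, clf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name:12s} {imp:.3f}")
```

On the real aggregates, a similar ranking tells you whether crash involvement or fire risk is doing more of the predictive work.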

Insurance and Risk Scoring Applications

Beyond defect detection, the NHTSA complaints dataset is a powerful signal for insurance actuaries. By computing complaint volume, crash-involvement rate, and injury rate by make/model/year, you can build a vehicle risk index that supplements traditional actuarial tables. High complaint rates on specific components (fuel systems, steering) correlate with elevated claim frequencies — and complaints often surface 12–24 months before formal NHTSA investigations open.
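A minimal sketch of such a risk index, on made-up per-model-year aggregates with an illustrative min-max composite (the makes, the numbers, and the complaints_per_1k column are all hypothetical, not dataset fields):

```python
import pandas as pd

# Hypothetical per-model-year aggregates standing in for the
# groupby output shown earlier; all values are invented.
agg = pd.DataFrame({
    "make":        ["AcmeMotors", "AcmeMotors", "Roadster Co"],
    "model_year":  [2018, 2019, 2019],
    "complaints_per_1k": [4.0, 12.0, 2.0],
    "crash_rate":  [0.02, 0.10, 0.01],
    "injury_rate": [0.01, 0.05, 0.00],
})

# Illustrative composite: min-max normalize each signal, then average.
cols = ["complaints_per_1k", "crash_rate", "injury_rate"]
norm = (agg[cols] - agg[cols].min()) / (agg[cols].max() - agg[cols].min())
agg["risk_index"] = norm.mean(axis=1).round(3)
print(agg[["make", "model_year", "risk_index"]])
```

Weighting the components by observed claim severity, rather than averaging them equally, is the obvious next refinement for actuarial use.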

The free sample contains 1,000 rows. The complete dataset covers 2.2M complaints with all fields including full complaint narratives — available as CSV and Parquet.

Get the Full Dataset

NHTSA Vehicle Complaints 1995–Present

From $79