When a driver notices their car shaking at highway speed, smoke coming from the dashboard, or brakes that feel soft, they can file a complaint with the NHTSA Office of Defects Investigation. Since 1995, drivers have filed over 2.2 million such complaints — creating an extraordinary corpus of real-world vehicle failure reports. Each record includes the vehicle make, model, model year, component system, mileage at failure, crash/fire/injury flags, and a free-text narrative describing exactly what happened.
In this tutorial we'll use the ClarityStorm NHTSA complaints dataset to build a defect clustering pipeline, profile high-risk vehicle-component combinations, and train a simple recall-likelihood classifier.
Dataset Overview
The ClarityStorm release includes all 2.2M complaints from 1995 through the current NHTSA data feed, with parsed component hierarchies (system → component → part) and derived fields like fail_year. Both CSV and Parquet formats are included.
- 2.2M complaint records spanning 30+ years
- 100+ vehicle makes and thousands of models
- Component hierarchy: system / component / part (500+ unique parts)
- Crash, fire, injury, and death flags for severity scoring
- Free-text complaint_desc field — rich NLP corpus
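The crash/fire/injury/death flags lend themselves to a simple per-complaint severity score. The sketch below combines them with illustrative weights of my own choosing (the weighting is an assumption, not part of the dataset); the column names `crash`, `fire`, `injured`, and `deaths` follow the field list above, and the toy rows stand in for real records.

```python
import pandas as pd

# Toy rows standing in for real complaint records.
complaints = pd.DataFrame({
    "crash": [1, 0, 0],
    "fire": [0, 1, 0],
    "injured": [2, 0, 0],  # count of people injured
    "deaths": [0, 0, 0],   # count of fatalities
})

# Illustrative weights (an assumption): deaths dominate, then
# injuries, then fires, then bare crash involvement.
complaints["severity"] = (
    1.0 * complaints["crash"]
    + 2.0 * complaints["fire"]
    + 3.0 * (complaints["injured"] > 0).astype(int)
    + 10.0 * (complaints["deaths"] > 0).astype(int)
)
print(complaints["severity"].tolist())  # [4.0, 2.0, 0.0]
```

Any monotone combination of the flags works here; the point is to collapse four booleans/counts into one sortable number before aggregating by vehicle.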
Loading and Profiling
import pandas as pd
df = pd.read_parquet("nhtsa_complaints.parquet")
print(f"Total complaints: {len(df):,}")
print(f"Date range: {df['fail_year'].min()} – {df['fail_year'].max()}")
# Top makes by complaint volume
top_makes = (
    df.groupby("make")
    .agg(
        complaints=("cmplid", "count"),
        crash_pct=("crash", "mean"),
        injury_avg=("injured", "mean"),
    )
    .sort_values("complaints", ascending=False)
    .head(15)
)
print(top_makes)

NLP: Clustering Complaint Narratives
The complaint_desc field is where the real signal lives. A TF-IDF vectorization followed by K-Means clustering surfaces natural complaint theme groups — steering failures, transmission slipping, airbag non-deployment, fire risk — without needing labeled training data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans
import numpy as np
# Sample for quick prototyping; use full dataset for production
sample = df[df["complaint_desc"].notna()].sample(50_000, random_state=42)
vectorizer = TfidfVectorizer(
    max_features=10_000,
    ngram_range=(1, 2),
    stop_words="english",
    min_df=5,
)
X = vectorizer.fit_transform(sample["complaint_desc"])
# Mini-batch K-Means scales to the full 2.2M corpus
kmeans = MiniBatchKMeans(n_clusters=20, random_state=42, batch_size=5000)
kmeans.fit(X)
sample = sample.copy()
sample["cluster"] = kmeans.labels_
# Top terms per cluster
terms = vectorizer.get_feature_names_out()
for i, centroid in enumerate(kmeans.cluster_centers_):
    top_idx = np.argsort(centroid)[-8:][::-1]
    top_terms = ", ".join(terms[top_idx])
    print(f"Cluster {i:>2}: {top_terms}")

Recall Likelihood Classifier
Not every complaint cluster becomes a recall, but complaints that involve crashes, fires, or injuries at high rates, or that concentrate on specific component failures, are strong recall precursors. A simple gradient-boosted classifier trained on crash, fire, injury, and death rates, aggregated by make, model, model year, and component system, can flag high-risk combinations. True recall outcomes would need to be joined in from NHTSA's separate recalls dataset, so the example below uses a complaint-volume threshold as a proxy label.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Feature engineering: aggregate complaints to (make, model, model_year, comp_system)
agg = (
    df.groupby(["make", "model", "model_year", "comp_system"])
    .agg(
        complaint_count=("cmplid", "count"),
        crash_rate=("crash", "mean"),
        fire_rate=("fire", "mean"),
        injury_rate=("injured", lambda x: (x > 0).mean()),
        death_rate=("deaths", lambda x: (x > 0).mean()),
    )
    .reset_index()
)
# Label: did this combo accumulate 50+ complaints? (proxy for investigation threshold)
agg["high_volume"] = (agg["complaint_count"] >= 50).astype(int)
# complaint_count defines the label above, so exclude it from the
# features to avoid target leakage
FEATURES = ["crash_rate", "fire_rate", "injury_rate", "death_rate"]
X = agg[FEATURES].fillna(0)
y = agg["high_volume"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

Insurance and Risk Scoring Applications
Beyond defect detection, the NHTSA complaints dataset is a powerful signal for insurance actuaries. By computing complaint volume, crash-involvement rate, and injury rate by make/model/year, you can build a vehicle risk index that supplements traditional actuarial tables. High complaint rates on specific components (fuel systems, steering) correlate with elevated claim frequencies — and complaints often surface 12–24 months before formal NHTSA investigations open.
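A vehicle risk index of the kind described above can be sketched as a weighted sum of min-max-normalized complaint metrics. The weights below are illustrative assumptions, not calibrated actuarial values, and the toy DataFrame stands in for real complaint records; in practice you would aggregate the full dataset and calibrate the weights against claims data.

```python
import pandas as pd

# Toy complaint records; real data comes from the complaints file.
df = pd.DataFrame({
    "make": ["acme"] * 3 + ["zephyr"] * 3,
    "model": ["alpha"] * 3 + ["beta"] * 3,
    "model_year": [2018] * 3 + [2020] * 3,
    "crash": [1, 0, 1, 0, 0, 0],
    "injured": [2, 0, 1, 0, 0, 0],
})

# Aggregate to make/model/year.
risk = (
    df.groupby(["make", "model", "model_year"])
    .agg(
        complaints=("crash", "size"),
        crash_rate=("crash", "mean"),
        injury_rate=("injured", lambda x: (x > 0).mean()),
    )
    .reset_index()
)

# Min-max normalize each metric so the weighted sum is comparable
# across vehicles (a constant column normalizes to 0).
for col in ["complaints", "crash_rate", "injury_rate"]:
    rng = risk[col].max() - risk[col].min()
    risk[col + "_norm"] = (risk[col] - risk[col].min()) / rng if rng else 0.0

# Illustrative weights -- an actuary would calibrate these on claims data.
risk["risk_index"] = (
    0.3 * risk["complaints_norm"]
    + 0.4 * risk["crash_rate_norm"]
    + 0.3 * risk["injury_rate_norm"]
)
print(risk[["make", "model", "model_year", "risk_index"]])
```

The resulting index lives in [0, 1] and ranks vehicles relative to the cohort being scored, which is usually what an underwriting model wants as a supplemental feature.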
The free sample contains 1,000 rows. The complete dataset covers 2.2M complaints with all fields including full complaint narratives — available as CSV and Parquet.