New Jersey has one of the highest traffic densities of any US state — and one of the best-maintained public crash databases to go along with it. The New Jersey Department of Transportation (NJDOT) Crash Records System tracks every reportable motor vehicle crash on NJ roads dating back to 2001: over 4 million crash events with details on crash date and time, road type, municipality, contributing factors, weather conditions, and the severity of every injury involved.
In this tutorial we'll load the ClarityStorm NJ Crash Records dataset, analyze injury trends over time, map crash hotspots by county and road type, and build a simple feature matrix for injury-severity prediction.
What's in the Dataset
The ClarityStorm NJ Crash release consolidates annual NJDOT files into a single cleaned table spanning 2001 to the present. Each row represents one crash event, with columns covering the crash itself (date, time, weather, light conditions, road surface), the location (county, municipality, road name, route prefix), and the outcomes (crash severity, number injured, number killed, property damage flag).
- 4M+ crash records from 2001 to the present
- 21 NJ counties with municipality-level granularity
- Contributing factors: distracted driving, DUI, speeding, failure to yield, and more
- Weather and road surface conditions for environmental analysis
- Crash severity: fatal, injury (A/B/C), and property-damage-only
- Road classification: interstate, US/state route, county road, local street
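Before running any analysis, it's worth confirming that the columns the examples below rely on are actually present in your download. A minimal sketch (the column names here are the ones used throughout this tutorial; verify them against your own file):

```python
import pandas as pd

# Columns this tutorial's examples rely on
EXPECTED_COLUMNS = [
    "crash_id", "crash_date", "crash_time", "crash_severity",
    "county_name", "total_injured", "total_killed",
    "weather_condition", "light_condition", "road_surface_condition",
    "route_type_code",
]

def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are absent from the frame."""
    return [c for c in EXPECTED_COLUMNS if c not in df.columns]

# Tiny synthetic frame standing in for the real parquet load
sample = pd.DataFrame({c: [] for c in EXPECTED_COLUMNS[:5]})
print(missing_columns(sample))
```

Running this against the real frame (instead of `sample`) should print an empty list; anything else means a rename happened upstream.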
Loading the Data
import pandas as pd
df = pd.read_parquet("nj_crash_records.parquet")
print(f"Total crashes: {len(df):,}")
print(f"Date range: {df['crash_date'].min()} – {df['crash_date'].max()}")
print(f"Columns: {list(df.columns)}")
# Quick severity breakdown
print(df['crash_severity'].value_counts())

Injury Trend Analysis
Total crash volumes in NJ have fluctuated with traffic patterns — notably dropping sharply during the COVID-19 lockdowns of 2020 — but the long-run trend in injury rates per crash tells a more nuanced story. Mandatory hands-free phone laws (2008), enhanced DUI penalties, and Vision Zero municipal initiatives all leave visible marks in the time series.
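One way to check whether an intervention such as the 2008 hands-free law shows up in the series is to compare mean injuries per crash before and after the policy year. A minimal sketch on synthetic annual figures (the real `annual` frame is built by the groupby below; these numbers are made up for illustration):

```python
import pandas as pd

# Synthetic annual injury rates; substitute the real `annual` frame
annual = pd.DataFrame({
    "crash_year": range(2004, 2013),
    "injury_rate": [0.42, 0.41, 0.41, 0.40, 0.38, 0.36, 0.35, 0.35, 0.34],
})

POLICY_YEAR = 2008  # NJ hands-free phone law took effect

# Mean injuries per crash before and after the policy year
pre = annual.loc[annual["crash_year"] < POLICY_YEAR, "injury_rate"].mean()
post = annual.loc[annual["crash_year"] >= POLICY_YEAR, "injury_rate"].mean()
print(f"Pre-{POLICY_YEAR} mean:  {pre:.3f}")
print(f"Post-{POLICY_YEAR} mean: {post:.3f}")
print(f"Change: {(post - pre) / pre:+.1%}")
```

A before/after mean difference is only suggestive: it doesn't control for the pre-existing downward trend or for confounders like fleet safety improvements, so treat it as a first look, not a causal estimate.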
import matplotlib.pyplot as plt
df["crash_year"] = pd.to_datetime(df["crash_date"]).dt.year
annual = (
df.groupby("crash_year")
.agg(
total_crashes=("crash_id", "count"),
total_injured=("total_injured", "sum"),
total_killed=("total_killed", "sum"),
)
.reset_index()
)
annual["injury_rate"] = annual["total_injured"] / annual["total_crashes"]
fig, axes = plt.subplots(2, 1, figsize=(12, 8), sharex=True)
axes[0].bar(annual["crash_year"], annual["total_crashes"], color="#0ea5e9", alpha=0.75)
axes[0].set_title("NJ Annual Crash Volume 2001–Present")
axes[0].set_ylabel("Crashes")
axes[1].plot(annual["crash_year"], annual["injury_rate"], color="#f59e0b", linewidth=2, marker="o", ms=4)
axes[1].set_title("Injuries per Crash")
axes[1].set_ylabel("Avg Injuries / Crash")
axes[1].set_xlabel("Year")
plt.tight_layout()
plt.savefig("nj_crash_trends.png", dpi=150)

County-Level Hotspot Mapping
NJ's 21 counties vary dramatically in crash frequency, but raw counts are misleading: population and vehicle miles traveled differ sharply between urban Essex and Hudson counties and rural Sussex and Warren counties. Normalizing by county area or AADT (annual average daily traffic) reveals which counties are genuinely safer than their exposure would predict. The dataset's county field enables this comparison directly.
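The code below ranks counties by per-crash fatality rate; normalizing by land area is a similar one-liner once you join in county areas. A sketch with made-up crash counts and placeholder area figures (the `areas` values here are illustrative stand-ins, not authoritative; substitute real land areas or AADT):

```python
import pandas as pd

# Synthetic county crash counts; the real counts come from the groupby below
county = pd.DataFrame({
    "county_name": ["Essex", "Hudson", "Sussex", "Warren"],
    "crashes": [250_000, 200_000, 30_000, 25_000],
})

# Placeholder land areas in square miles -- substitute real figures
areas = {"Essex": 126.0, "Hudson": 46.0, "Sussex": 519.0, "Warren": 358.0}

county["land_sq_mi"] = county["county_name"].map(areas)
county["crashes_per_sq_mi"] = county["crashes"] / county["land_sq_mi"]
print(county.sort_values("crashes_per_sq_mi", ascending=False))
```

Dense urban counties rise to the top of a per-area ranking even when their per-crash severity is low, which is exactly why the two normalizations answer different questions.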
# Crashes and fatalities by county
county_stats = (
df.groupby("county_name")
.agg(
crashes=("crash_id", "count"),
killed=("total_killed", "sum"),
injured=("total_injured", "sum"),
)
.assign(fatality_rate=lambda d: d["killed"] / d["crashes"] * 1000)
.sort_values("crashes", ascending=False)
)
print(county_stats.to_string())
# Worst fatal-rate counties (min 10k crashes for significance)
worst_fatal = (
county_stats[county_stats["crashes"] >= 10_000]
.nlargest(5, "fatality_rate")
)
print("\nHighest fatality rate counties:")
print(worst_fatal[["crashes", "killed", "fatality_rate"]])

Road Type and Time-of-Day Risk
Crash risk in NJ isn't uniform across road classifications. Local streets account for the largest share of total crashes, simply because they carry the most intersections and conflict points, but interstates produce the highest fatality rate per crash because of higher travel speeds. The time-of-day breakdown is equally instructive: late-night crashes on Friday and Saturday involve disproportionately high DUI rates, while weekday morning crashes skew toward rear-end collisions in rush-hour congestion.
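The late-night weekend claim can be checked directly once `crash_hour` and `crash_dow` exist (both are derived below). A self-contained sketch on a few synthetic rows:

```python
import pandas as pd

# Synthetic crash rows; real values come from parsing crash_date/crash_time
df = pd.DataFrame({
    "crash_dow": ["Friday", "Saturday", "Tuesday", "Saturday", "Monday", "Friday"],
    "crash_hour": [23, 1, 8, 2, 17, 22],
})

# Late night = 10 PM through 3:59 AM; weekend nights = Friday/Saturday
late = df["crash_hour"].isin([22, 23, 0, 1, 2, 3])
weekend = df["crash_dow"].isin(["Friday", "Saturday"])

share = (late & weekend).mean()
print(f"Late-night Fri/Sat share of crashes: {share:.0%}")
```

One refinement worth making on the real data: a 1 AM Sunday crash belongs to Saturday night, so shifting early-morning hours back one calendar day before grouping gives a cleaner "weekend night" definition.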
# Parse crash time
df["crash_hour"] = pd.to_datetime(df["crash_time"], format="%H%M", errors="coerce").dt.hour
df["crash_dow"] = pd.to_datetime(df["crash_date"]).dt.day_name()
# Hourly crash volume and severity
hourly = (
df.groupby("crash_hour")
.agg(
total=("crash_id", "count"),
fatal=("total_killed", lambda x: (x > 0).sum()),
)
.assign(fatal_pct=lambda d: d["fatal"] / d["total"] * 100)
.reset_index()
)
import matplotlib.pyplot as plt
fig, ax1 = plt.subplots(figsize=(12, 5))
ax1.bar(hourly["crash_hour"], hourly["total"], color="#0ea5e9", alpha=0.6, label="Total crashes")
ax2 = ax1.twinx()
ax2.plot(hourly["crash_hour"], hourly["fatal_pct"], color="#ef4444", linewidth=2, label="Fatal %")
ax1.set_xlabel("Hour of Day")
ax1.set_ylabel("Total Crashes", color="#0ea5e9")
ax2.set_ylabel("Fatal %", color="#ef4444")
ax1.set_title("NJ Crashes by Hour: Volume vs. Fatality Rate")
fig.legend(loc="upper left", bbox_to_anchor=(0.1, 0.95))
plt.tight_layout()
plt.savefig("nj_hourly_risk.png", dpi=150)

Building an Injury-Severity Classifier
The structured fields in the NJ crash dataset (weather, light conditions, road surface, route type, contributing factors, crash time) form a solid feature set for predicting injury severity. Even a simple baseline, whether a logistic regression or the random forest trained below, provides useful signal for insurance actuaries, traffic engineers, and road safety planners estimating crash outcomes from pre-crash conditions.
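A logistic-regression baseline is worth running before the random forest, since its coefficients are directly interpretable. A minimal sketch on synthetic rows that reuse this tutorial's column names (the injury probabilities are toy values, not estimates from the real data):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2_000

# Synthetic stand-in for the real crash frame
df = pd.DataFrame({
    "weather_condition": rng.choice(["Clear", "Rain", "Snow"], n),
    "light_condition": rng.choice(["Daylight", "Dark"], n),
    "crash_hour": rng.integers(0, 24, n),
})
# Toy signal: injury more likely in the dark and in bad weather
p = 0.25 + 0.2 * (df["light_condition"] == "Dark") \
         + 0.15 * (df["weather_condition"] != "Clear")
df["injured_any"] = (rng.random(n) < p).astype(int)

X = pd.get_dummies(df.drop(columns="injured_any"), drop_first=True)
y = df["injured_any"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

logit = LogisticRegression(max_iter=1000)
logit.fit(X_train, y_train)
print(f"Baseline accuracy: {logit.score(X_test, y_test):.3f}")
# Signed coefficients show direction of each factor's effect
print(pd.Series(logit.coef_[0], index=X.columns).sort_values())
```

On the synthetic rows the `light_condition_Daylight` coefficient comes out negative, matching the planted signal; on the real data the same readout tells you which pre-crash conditions push injury odds up or down.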
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd
# Binary target: injury or worse (vs. property-damage-only)
df["injured_any"] = (df["total_injured"] > 0).astype(int)
FEATURES = [
"road_surface_condition",
"light_condition",
"weather_condition",
"crash_hour",
"route_type_code",
]
# One-hot encode categoricals
X = pd.get_dummies(df[FEATURES].fillna("Unknown"), drop_first=True)
y = df["injured_any"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
# Feature importance
importances = pd.Series(clf.feature_importances_, index=X.columns)
print("\nTop 10 features:")
print(importances.nlargest(10))

The free sample contains 1,000 rows. The complete NJ Crash Records dataset ships 4M+ crash events spanning 2001 to the present as CSV and Parquet with a commercial license for production use.