Every year, hundreds of thousands of US employers submit their workplace injury and illness counts to OSHA's Injury Tracking Application. The data covers total injuries, days away from work, job transfers, fatalities, and illness types — broken down by establishment, industry, state, and year. It's one of the most underutilised public datasets in the labour and insurance space, and it's been public since 2016.
In this tutorial we'll load the ClarityStorm OSHA Workplace Injury dataset, compute DART and TCIR rates across industries, identify the sectors with the worst safety records, and chart year-over-year improvement trends. All with pandas and matplotlib in under 80 lines.
What's in the Dataset
The ClarityStorm OSHA release combines eight years of ITA annual files (2016–2023) into a single cleaned Parquet, with pre-computed DART and TCIR rate columns added. Each row is one establishment in one survey year. The NAICS code links to the full industry taxonomy, and the size_class field enables comparisons across small, mid-size, and large employers. Note that OSHA publishes ZIP3 (first 3 digits) rather than full ZIP codes.
- 4M+ establishment-year records (2016–2023)
- tcir_rate — Total Case Incident Rate per 200,000 hours worked (pre-computed)
- dart_rate — Days Away, Restricted, Transfer rate per 200,000 hours (pre-computed)
- Injury breakdown: dafw_cases, djtr_cases, other_cases, deaths
- Illness breakdown: skin_disorders, resp_conditions, poisonings, hearing_loss
- 6-digit NAICS code + industry_description for sector analysis
- size_class for large vs. small employer comparisons
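The two pre-computed rate columns follow the standard OSHA incidence-rate formula: cases multiplied by 200,000, divided by total hours worked. The sketch below recomputes both rates from raw counts on a small invented frame. It assumes DART counts dafw_cases plus djtr_cases and that TCIR counts all recordable cases including deaths. Whether the ClarityStorm columns were built exactly this way is an assumption, and the sample values are made up.

```python
import pandas as pd

HOURS_BASE = 200_000  # 100 full-time workers x 40 hours x 50 weeks

# Invented sample rows; column names follow the dataset description above
sample = pd.DataFrame({
    "dafw_cases": [2, 0, 5],
    "djtr_cases": [1, 0, 3],
    "other_cases": [3, 1, 4],
    "deaths": [0, 0, 0],
    "total_hours_worked": [400_000, 50_000, 1_200_000],
})

def incidence_rate(cases, hours):
    """Cases per 200,000 hours worked; NaN where hours are zero or negative."""
    return cases * HOURS_BASE / hours.where(hours > 0)

dart_cases = sample["dafw_cases"] + sample["djtr_cases"]
total_cases = dart_cases + sample["other_cases"] + sample["deaths"]

sample["dart_rate_check"] = incidence_rate(dart_cases, sample["total_hours_worked"])
sample["tcir_rate_check"] = incidence_rate(total_cases, sample["total_hours_worked"])
print(sample[["dart_rate_check", "tcir_rate_check"]].round(2))
```

Comparing these check columns against tcir_rate and dart_rate on a few real rows is a quick way to confirm how the published rates were derived.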
Loading the Data
import pandas as pd
df = pd.read_parquet("osha_workplace_injuries.parquet")
print(f"Records: {len(df):,}")
print(f"Years: {sorted(df['survey_year'].unique())}")
print(f"Establishments: {df['estab_name'].nunique():,}")
print(f"States: {df['state'].nunique()}")
print(df[["survey_year","estab_name","state","naics_code","tcir_rate","dart_rate"]].head(5))
DART Rate by Industry: Which Sectors Are Most Dangerous?
The DART rate (Days Away, Restricted, Transfer per 200,000 hours worked) is the standard OSHA benchmark for comparing injury severity across industries of different sizes. Higher is worse. Agriculture, forestry, animal production, warehousing, and nursing care facilities consistently rank at the top. Technology and finance sectors sit near zero.
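To see why sector rates are best weighted by hours worked, the toy example below compares a simple mean of establishment rates with an hours-weighted mean. The numbers are invented, and dart_cases is a hypothetical helper column (not a dataset field) standing in for dafw_cases plus djtr_cases. The hours-weighted mean of the rates equals the pooled rate, which is total cases times 200,000 divided by total hours.

```python
import pandas as pd

# Invented numbers: one tiny site with a single case, one large site with ten
toy = pd.DataFrame({
    "dart_cases": [1, 10],  # hypothetical column, for illustration only
    "total_hours_worked": [10_000, 2_000_000],
})
toy["dart_rate"] = toy["dart_cases"] * 200_000 / toy["total_hours_worked"]

# Unweighted: the tiny site counts as much as the large one
simple_mean = toy["dart_rate"].mean()

# Weighted by hours: each site contributes in proportion to its exposure
weighted_mean = (
    (toy["dart_rate"] * toy["total_hours_worked"]).sum()
    / toy["total_hours_worked"].sum()
)

# Pooled rate computed directly from total cases and total hours
pooled_rate = toy["dart_cases"].sum() * 200_000 / toy["total_hours_worked"].sum()

print(f"Simple mean of rates: {simple_mean:.2f}")
print(f"Hours-weighted mean:  {weighted_mean:.2f}")
print(f"Pooled rate:          {pooled_rate:.2f}")
```

A single case at a very small site produces a rate of 20 here, which dominates the simple mean even though the site accounts for well under 1% of the hours.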
# Aggregate DART rate by 2-digit NAICS sector (first 2 digits)
# astype(str) keeps the slice working if naics_code was loaded as an integer
df["naics2"] = df["naics_code"].astype(str).str[:2]

# Weighted average DART rate (weight by hours worked to avoid small-establishment bias)
def weighted_dart(group):
    # Skip rows with a missing rate or missing hours instead of counting them as zero
    valid = group.dropna(subset=["dart_rate", "total_hours_worked"])
    total_hours = valid["total_hours_worked"].sum()
    if total_hours <= 0:
        return float("nan")
    return (valid["dart_rate"] * valid["total_hours_worked"]).sum() / total_hours

# Group on the 2-digit code only, so each sector yields exactly one weighted rate
sector_dart = (
    df.groupby("naics2")[["dart_rate", "total_hours_worked"]]
    .apply(weighted_dart)
    .reset_index(name="weighted_dart_rate")
    .sort_values("weighted_dart_rate", ascending=False)
    .head(20)
)

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 8))
# Bars are labelled with the 2-digit NAICS code
sector_dart.set_index("naics2")["weighted_dart_rate"].sort_values().plot(
    kind="barh", ax=ax, color="#0ea5e9", alpha=0.85
)
ax.set_xlabel("Weighted DART Rate (per 200,000 hours)")
ax.set_ylabel("2-digit NAICS sector")
ax.set_title("Top 20 Sectors by DART Rate, OSHA ITA 2016–2023")
plt.tight_layout()
plt.savefig("osha_dart_by_industry.png", dpi=150)
Year-Over-Year Safety Trends
One of the most valuable uses of the longitudinal OSHA dataset is tracking whether industries are actually getting safer over time. The data spans 2016–2023, so it captures the COVID-19 pandemic years (2020–2021). Those years produced notable spikes in illness cases, particularly respiratory conditions and hearing loss in certain sectors, while injury cases dropped as workplaces closed or reduced staffing.
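Before looking at the overall rate, the illness columns can be summed per year to check where any pandemic-era spike actually shows up. The sketch below uses the illness column names from the dataset description. The helper illness_by_year and the demo rows are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

ILLNESS_COLS = ["skin_disorders", "resp_conditions", "poisonings", "hearing_loss"]

def illness_by_year(frame):
    """Sum each illness column per survey year and add year-over-year % change."""
    totals = frame.groupby("survey_year")[ILLNESS_COLS].sum()
    pct = totals.pct_change().mul(100).add_suffix("_yoy_pct")
    return totals.join(pct)

# Invented rows standing in for the real data
demo = pd.DataFrame({
    "survey_year": [2019, 2019, 2020, 2020],
    "skin_disorders": [2, 1, 1, 1],
    "resp_conditions": [1, 1, 6, 4],
    "poisonings": [0, 1, 0, 0],
    "hearing_loss": [3, 1, 2, 2],
})
print(illness_by_year(demo).round(1))
```

On the full dataset, illness_by_year(df) gives one row per survey year, which makes it easy to see which illness categories moved in 2020 and 2021.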
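# National TCIR trend by year, weighted by hours worked
def weighted_tcir(g):
    # Use only rows where both the rate and the hours are present, so the
    # numerator and denominator cover the same establishments
    valid = g.dropna(subset=["tcir_rate", "total_hours_worked"])
    total_hours = valid["total_hours_worked"].sum()
    if total_hours <= 0:
        return float("nan")
    return (valid["tcir_rate"] * valid["total_hours_worked"]).sum() / total_hours

annual = (
    df.groupby("survey_year")[["tcir_rate", "total_hours_worked"]]
    .apply(weighted_tcir)
    .reset_index(name="national_tcir")
)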
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(annual["survey_year"], annual["national_tcir"],
        marker="o", linewidth=2, color="#0ea5e9")
ax.set_xlabel("Survey Year")
ax.set_ylabel("National TCIR (per 200,000 hours)")
ax.set_title("US Workplace Injury Rate Trend — OSHA ITA 2016–2023")
ax.axvspan(2020, 2021, alpha=0.15, color="#f59e0b", label="COVID years")
ax.legend()
plt.tight_layout()
plt.savefig("osha_tcir_trend.png", dpi=150)
Size Class Analysis: Large vs. Small Employers
Larger establishments consistently report lower injury rates per hour worked — a well-documented pattern attributed to more formal safety programmes, dedicated safety officers, and greater OSHA inspection exposure. The OSHA data makes this easy to verify and quantify. For insurance underwriting, the size_class field is a useful risk segmentation variable independent of NAICS code.
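# The column list above names this field "deaths"; fall back to "total_deaths" otherwise
death_col = "deaths" if "deaths" in df.columns else "total_deaths"
size_stats = (
    df.groupby("size_class")
    .agg(
        # Each row is one establishment in one year, so this counts establishment-years
        establishment_years=("estab_name", "count"),
        avg_tcir=("tcir_rate", "mean"),
        avg_dart=("dart_rate", "mean"),
        total_deaths=(death_col, "sum"),
    )
    .reset_index()
    .sort_values("avg_dart", ascending=False)
)
print(size_stats.to_string(index=False))
Identifying High-Risk Establishments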
The establishment-level granularity enables targeted screening, which is useful for workers' comp underwriting, ESG due diligence, or safety research. Filtering for establishments whose DART rate is at least 3× their NAICS sector median in three or more survey years surfaces a list of persistently high-risk sites that an insurer or investor would want to flag.
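A stricter variant of this screen requires the flagged years to be consecutive rather than any three years in the window. The sketch below shows one way to measure that. The helper longest_streak and the flags frame are hypothetical, built on invented data.

```python
import pandas as pd

def longest_streak(years):
    """Length of the longest run of consecutive integers in an iterable of years."""
    ordered = sorted(set(int(y) for y in years))
    best = current = 0
    previous = None
    for year in ordered:
        # Extend the run if this year directly follows the previous one
        if previous is not None and year == previous + 1:
            current += 1
        else:
            current = 1
        best = max(best, current)
        previous = year
    return best

# Invented flags: site A is flagged 2019-2021, site B in three scattered years
flags = pd.DataFrame({
    "estab_name": ["A", "A", "A", "B", "B", "B"],
    "survey_year": [2019, 2020, 2021, 2016, 2019, 2022],
})
streaks = flags.groupby("estab_name")["survey_year"].agg(longest_streak)
print(streaks[streaks >= 3])
```

Both sites have three flagged years, but only site A passes the consecutive-year test. Applied to real data, the same aggregation would run on the rows that exceed the 3× threshold.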
# Compute sector median DART by NAICS2 and year
sector_median = (
    df.groupby(["naics2", "survey_year"])["dart_rate"]
    .median()
    .rename("sector_median_dart")
    .reset_index()
)
flagged = df.merge(sector_median, on=["naics2", "survey_year"])
flagged["dart_ratio"] = flagged["dart_rate"] / flagged["sector_median_dart"].replace(0, float("nan"))
# Establishments at 3× the sector median or more in 3+ survey years
# Group by name and state so same-named sites in different states stay separate
high_risk = (
    flagged[flagged["dart_ratio"] >= 3.0]
    .groupby(["estab_name", "state"])
    .agg(flagged_years=("survey_year", "nunique"), naics=("naics_code", "first"))
    .query("flagged_years >= 3")
    .sort_values("flagged_years", ascending=False)
)
print(f"Persistently high-risk establishments: {len(high_risk):,}")
print(high_risk.head(20).to_string())
Use Cases
- Workers' compensation underwriting: DART and TCIR rates as risk inputs alongside NAICS and size class
- ESG scoring: rank industries and benchmark portfolio company sectors on workplace safety trajectory
- OSHA compliance analytics: identify sectors with deteriorating rates ahead of inspection cycles
- Academic research: study how unionisation, minimum wage, or inspection intensity correlates with injury rates
- Supply chain risk: screen suppliers by NAICS code and state for elevated workplace safety exposure
The free sample contains 1,000 rows. The complete OSHA Workplace Injury & Illness dataset covers 4M+ establishment-year records from 2016 to 2023, available as CSV and Parquet with a commercial license for $79.