What Consumer Complaints Tell Us About Fintech

Since 2011, the Consumer Financial Protection Bureau has collected complaints from US consumers about financial products and published them in a public database, covering everything from predatory mortgage servicers to credit card billing errors to debt collectors calling at midnight. The result is a dataset of 14 million+ records spanning credit reporting, mortgages, student loans, credit cards, bank accounts, and more. For fintech analysts, compliance teams, and NLP researchers, this dataset is a goldmine.

In this tutorial we'll load the ClarityStorm CFPB dataset, explore complaint trends across products and companies, map geographic complaint density, and run a simple topic model on the 3.75 million free-text consumer narratives.

What's in the Dataset

The ClarityStorm CFPB release is a single flat file — one row per complaint — covering 2011 to present. Key fields include the financial product, issue type, company named in the complaint, US state, submission channel, company response, and the optional free-text consumer narrative. Our pipeline normalises the product taxonomy across CFPB's evolving category names, so you get consistent long-horizon comparisons without manual string mapping.

14M+ complaints (2011–present), updated annually
22 fields including product_normalised for consistent category analysis
3.75M records include consumer_narrative free text — ideal for NLP
Timely response flag and consumer_disputed flag for company performance scoring
State and ZIP fields for geographic drill-down
CSV and Parquet formats — Parquet loads ~10× faster for the full dataset
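
The product normalisation mentioned above can be pictured as a lookup from raw CFPB labels to one stable category. The sketch below is only an illustration of that idea, not the ClarityStorm pipeline itself: the `PRODUCT_MAP` dictionary, the `normalise_product` helper, and the raw `product` column name are all assumptions made for this example.

```python
import pandas as pd

# Illustrative mapping only (an assumption, not the actual pipeline):
# several raw CFPB labels collapse into one stable category.
PRODUCT_MAP = {
    "Credit reporting": "Credit reporting",
    "Credit reporting, credit repair services, or other personal consumer reports": "Credit reporting",
    "Credit reporting or other personal consumer reports": "Credit reporting",
    "Bank account or service": "Bank account",
    "Checking or savings account": "Bank account",
}

def normalise_product(raw: pd.Series) -> pd.Series:
    """Map raw product labels to a stable category; unmapped labels pass through."""
    cleaned = raw.str.strip()
    return cleaned.map(PRODUCT_MAP).fillna(cleaned)

# Hypothetical raw column name "product", used only for this demo.
demo = pd.DataFrame({"product": [
    "Credit reporting",
    "Credit reporting, credit repair services, or other personal consumer reports",
    "Checking or savings account",
    "Mortgage",
]})
demo["product_normalised"] = normalise_product(demo["product"])
print(demo["product_normalised"].value_counts())
```

With a mapping like this, a long-horizon count for "Credit reporting" includes complaints filed under every historical name for that category.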

Loading the Data

python

import pandas as pd

# Parquet is strongly preferred for the 14M-row full dataset
df = pd.read_parquet("cfpb_complaints.parquet")

print(f"Total complaints: {len(df):,}")
print(f"Date range: {df['date_received'].min()} → {df['date_received'].max()}")
print(f"Complaints with narrative: {df['has_narrative'].sum():,}")
print(df[["date_received","product_normalised","company","state","timely_response"]].head(5))
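
If only the CSV is at hand, memory use can be reduced by reading just the columns an analysis needs and storing repetitive text columns as categoricals. The following is a minimal sketch under the assumption that the CSV uses the same column names as this tutorial; `load_complaints_csv` is a hypothetical helper, and the in-memory sample stands in for the real file.

```python
import io
import pandas as pd

# Column names follow this tutorial; the dtype choices are assumptions.
USECOLS = ["date_received", "product_normalised", "company", "state", "timely_response"]
DTYPES = {"product_normalised": "category", "company": "category", "state": "category"}

def load_complaints_csv(source) -> pd.DataFrame:
    """Read only the needed columns, with low-cardinality text stored as categoricals."""
    return pd.read_csv(
        source,
        usecols=USECOLS,
        dtype=DTYPES,
        parse_dates=["date_received"],
    )

# Tiny in-memory CSV so the sketch runs without the real file.
sample_csv = io.StringIO(
    "date_received,product_normalised,company,state,timely_response,consumer_narrative\n"
    "2023-01-05,Credit reporting,Example Bureau,CA,1,some text\n"
    "2023-02-11,Mortgage,Example Servicer,TX,0,\n"
)
small = load_complaints_csv(sample_csv)
print(small.dtypes)
print(small.memory_usage(deep=True).sum(), "bytes")
```

On the real file, pass the path (for example "cfpb_complaints.csv", an assumed filename) instead of the `StringIO` object.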

Which Products Generate the Most Complaints?

Credit reporting has dominated CFPB complaint volumes since 2017, driven by consumers disputing inaccurate items on their Equifax, Experian, and TransUnion reports. The Equifax data breach in 2017 and the COVID-19 pandemic's effect on credit both appear as sharp spikes in the data.

python

import matplotlib.pyplot as plt

# Complaint volume by normalised product
product_counts = (
    df.groupby("product_normalised")
    .size()
    .sort_values(ascending=False)
    .head(10)
)

fig, ax = plt.subplots(figsize=(10, 6))
product_counts.sort_values().plot(kind="barh", ax=ax, color="#0ea5e9")
ax.set_xlabel("Complaint count")
ax.set_title("CFPB Complaints by Product Category (2011–Present)")
plt.tight_layout()
plt.savefig("cfpb_by_product.png", dpi=150)
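
Total counts hide when a product became dominant. One way to check a claim such as "credit reporting has dominated since 2017" is to look at each product's share of complaints per year. The sketch below uses a small invented DataFrame in place of `df`; `product_share_by_year` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-in for the full DataFrame; the rows are invented for illustration.
toy = pd.DataFrame({
    "date_received": pd.to_datetime(
        ["2016-03-01", "2016-07-15", "2018-01-10", "2018-05-02", "2018-09-30", "2018-11-11"]
    ),
    "product_normalised": [
        "Mortgage", "Credit reporting",
        "Credit reporting", "Credit reporting", "Mortgage", "Credit reporting",
    ],
})

def product_share_by_year(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a year x product table in which each row sums to 1."""
    years = frame["date_received"].dt.year.rename("year")
    return pd.crosstab(years, frame["product_normalised"], normalize="index")

share = product_share_by_year(toy)
print(share.round(2))
```

Calling `share.plot(kind="area", stacked=True)` on the real data would be one way to visualise the shift over time.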

Annual Trends: Spotting Market Events in the Data

Plotting annual complaint volumes reveals the macro story: the 2017 Equifax breach, the 2020 COVID forbearance wave, and the 2022–23 student loan servicer transition all produce visible spikes. This temporal signal can make CFPB data a useful leading indicator for regulatory risk: issues that surface in complaint data often precede enforcement action, in some cases by 12–24 months.

python

# Annual complaint trend
df["year"] = pd.to_datetime(df["date_received"]).dt.year
annual = df.groupby("year").size().reset_index(name="complaints")

fig, ax = plt.subplots(figsize=(12, 5))
ax.bar(annual["year"], annual["complaints"] / 1e3, color="#0ea5e9", alpha=0.8)
ax.set_xlabel("Year")
ax.set_ylabel("Complaints (thousands)")
ax.set_title("CFPB Annual Complaint Volume 2011–Present")

# Annotate key events
ax.axvline(2017, color="#ef4444", linestyle="--", alpha=0.6, label="Equifax breach")
ax.axvline(2020, color="#f59e0b", linestyle="--", alpha=0.6, label="COVID forbearance")
ax.legend()
plt.tight_layout()
plt.savefig("cfpb_annual_trend.png", dpi=150)
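
Annual bars are coarse, so shorter spikes can disappear inside a year. A rough alternative is to count complaints per month and flag months that sit far above the trailing 12-month average. The sketch below is one possible approach under simple assumptions (a z-score threshold of 3, synthetic dates with an injected spike); `flag_spikes` is a hypothetical helper, not part of the dataset tooling.

```python
import numpy as np
import pandas as pd

def flag_spikes(dates: pd.Series, window: int = 12, threshold: float = 3.0) -> pd.DataFrame:
    """Count complaints per month and flag months far above the trailing average."""
    idx = pd.DatetimeIndex(pd.to_datetime(dates))
    monthly = pd.Series(1, index=idx).resample("MS").sum().rename("complaints").to_frame()
    # shift(1) keeps the current month out of its own baseline.
    trailing = monthly["complaints"].shift(1).rolling(window, min_periods=window)
    monthly["zscore"] = (monthly["complaints"] - trailing.mean()) / trailing.std()
    monthly["spike"] = monthly["zscore"] > threshold
    return monthly

# Synthetic data: about 100 complaints a month, with one injected spike.
rng = np.random.default_rng(0)
months = pd.date_range("2015-01-01", periods=36, freq="MS")
counts = rng.integers(95, 106, size=36)
counts[30] = 300  # July 2017, purely synthetic
dates = pd.Series(np.repeat(months.values, counts))

result = flag_spikes(dates)
print(result[result["spike"]])
```

On the real data, `flag_spikes(df["date_received"])` would apply the same logic; the window and threshold are tuning choices rather than fixed rules.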

Company Complaint Rankings

Which companies appear most frequently? Raw complaint counts favour large institutions with more customers, so normalise by customer base or revenue before benchmarking. The raw ranking is still useful for spotting outliers: a smaller institution that ranks alongside much larger ones is worth investigating.

python

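# Top 15 companies by complaint volume
# (assumes timely_response is a 0/1 flag, so its mean is the timely share)
top_companies = (
    df.groupby("company")
    .agg(
        complaints=("company", "count"),
        timely_pct=("timely_response", "mean"),
    )
    .sort_values("complaints", ascending=False)
    .head(15)
)
# Convert the 0-1 share into a percentage so the column name matches its values
top_companies["timely_pct"] = (top_companies["timely_pct"] * 100).round(1)

print(top_companies.to_string())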
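
Normalising by customer base needs data that is not in the complaint file. The sketch below shows the arithmetic with a hypothetical `customers` table; the company names and every figure in it are invented, and real customer counts would have to come from filings or another external source.

```python
import pandas as pd

# Complaint counts per company, e.g. df.groupby("company").size() on the real data.
complaints = pd.Series(
    {"Big Bank": 12_000, "Mid Lender": 1_500, "Small Servicer": 900},
    name="complaints",
)

# Hypothetical customer counts: NOT part of the dataset, figures are invented.
customers = pd.Series(
    {"Big Bank": 40_000_000, "Mid Lender": 2_000_000, "Small Servicer": 300_000},
    name="customers",
)

# Align on company name and drop companies missing from either table.
rates = pd.concat([complaints, customers], axis=1).dropna()
rates["per_100k_customers"] = rates["complaints"] / rates["customers"] * 100_000
print(rates.sort_values("per_100k_customers", ascending=False))
```

In this toy example the smallest company has the fewest complaints but the highest rate per 100,000 customers, which is the kind of outlier a raw ranking can miss.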

NLP on the Free-Text Narratives

The real research value sits in the 3.75 million consumer_narrative fields. These are consumers' own descriptions of their experience, published with personal details redacted by the CFPB: complaint categories, company names, dollar amounts, and emotional language all appear. Running even a basic topic model can surface clusters that line up with known enforcement patterns, sometimes before the CFPB itself publishes supervisory findings.

python

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Filter to records with narratives, limit to 100K for speed
narratives = df[df["has_narrative"] == 1]["consumer_narrative"].dropna().head(100_000)

# TF-IDF vectorise (skip common English stop words)
vectorizer = TfidfVectorizer(max_features=5_000, stop_words="english", min_df=5)
X = vectorizer.fit_transform(narratives)

# NMF topic model: 10 topics
model = NMF(n_components=10, random_state=42)
model.fit(X)

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(model.components_):
    top_terms = [terms[j] for j in topic.argsort()[::-1][:10]]  # strongest terms first
    print(f"Topic {i}: {' | '.join(top_terms)}")

Typical topics that emerge include: credit report dispute language ("inquiry", "account", "dispute", "inaccurate"), debt collection contact patterns ("calls", "collector", "cease", "robo"), and mortgage servicing complaints ("escrow", "payment", "modification", "foreclosure"). These clusters tend to line up with known CFPB enforcement priorities.
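
Once topics exist, each narrative can be given a dominant topic by taking the largest entry in its row of the document-topic matrix. The sketch below shows that step on a tiny invented corpus, with 3 topics instead of 10; the `init` and `max_iter` settings are choices made for this small example rather than recommendations for the full data.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny invented corpus standing in for consumer_narrative.
docs = [
    "inaccurate account on my credit report, dispute ignored",
    "credit report shows an inquiry I never made, filed a dispute",
    "debt collector calls every day after I asked them to cease",
    "collector keeps making calls about a debt that is not mine",
    "escrow payment error on my mortgage, facing foreclosure",
    "mortgage modification denied and escrow payment went up",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

model = NMF(n_components=3, init="nndsvda", random_state=42, max_iter=500)
W = model.fit_transform(X)   # rows: documents, columns: topic weights

dominant = W.argmax(axis=1)  # index of the strongest topic for each document
for doc, topic in zip(docs, dominant):
    print(topic, "|", doc[:50])
```

On the real data, the `dominant` array could be attached to the narrative rows and then grouped by company or year to see which themes are growing.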

Use Cases

Compliance monitoring: track competitor complaint rates to anticipate regulatory scrutiny
NLP training data: 3.75M real-world consumer finance narratives, each labelled with product and issue fields, for sentiment and classification models
Geographic risk: map complaint density by ZIP to correlate with demographic or economic indicators
Fintech due diligence: assess complaint history before partnerships or acquisitions
Regulatory intelligence: identify emerging issues before they become enforcement actions
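
For the geographic use case, a first step before any mapping is to aggregate complaints to a region key. The sketch below groups by state and 3-digit ZIP prefix, which also tolerates partially masked ZIP values. The `zip_code` column name, the masked "902XX" style values, and the `zip3_density` helper are assumptions for this example; check the actual field names and formats in the file.

```python
import pandas as pd

# Toy rows; the zip_code column name and the masked values are assumptions.
toy = pd.DataFrame({
    "state": ["CA", "CA", "CA", "TX", "TX", "NY"],
    "zip_code": ["90210", "902XX", "94105", "75001", "750XX", None],
})

def zip3_density(frame: pd.DataFrame, zip_col: str = "zip_code") -> pd.DataFrame:
    """Count complaints per state and 3-digit ZIP prefix, skipping unusable ZIPs."""
    zip3 = frame[zip_col].fillna("").astype(str).str[:3]
    valid = zip3.str.fullmatch(r"\d{3}")  # keep only prefixes made of three digits
    return (
        frame.loc[valid]
        .assign(zip3=zip3[valid])
        .groupby(["state", "zip3"])
        .size()
        .rename("complaints")
        .reset_index()
        .sort_values("complaints", ascending=False, ignore_index=True)
    )

density = zip3_density(toy)
print(density)
```

The resulting table can be joined to demographic or economic indicators keyed on the same region before plotting.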

Get the Full Dataset

The free sample contains 1,000 rows. The complete CFPB Consumer Financial Complaints dataset covers 14M+ complaints from 2011 to present, with 3.75M free-text narratives. It is available as CSV and Parquet with a commercial license for $79.
