
Large language models struggle to solve research-level math questions. It takes a human to assess just how poorly they perform.
Extracted Claims
Large language models struggle to solve research-level math questions.
Confidence: 90%
It takes a human to assess just how poorly large language models perform on research-level math questions.
Confidence: 90%