Benchmarks in Leipzig

A recent paper describes how 49 mathematicians wrote 100 research-level math problems, and state-of-the-art LLMs went on to solve all but two. The result sparked a lively Hacker News debate on whether the models are truly reasoning or merely regurgitating advanced material from their training data. It also raises the question of what still counts as a challenging benchmark.

First Seen: Jun 6, 2:00 PM
Last Seen: Jun 6, 5:00 PM

The Lowdown

A group of 49 mathematicians convened in Leipzig to create a novel dataset of 100 research-level mathematics questions with known answers. The primary goal was to test the advanced mathematical reasoning capabilities of modern Large Language Models (LLMs).

Between April 1 and May 15, 2026, mathematicians compiled 100 questions, with a significant portion developed during a 3-day workshop.
The questions were evaluated in three stages against five state-of-the-art LLMs.
Initially, 41 questions remained unsolved after a single attempt by the models.
In the later stages, which added multiple runs per model and 'heavy-thinking' models, the number of unsolved questions fell sharply.
Ultimately, only 2 out of the 100 research-level mathematics questions remained completely unsolved.
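
The staged evaluation summarized above can be illustrated with a small sketch. It is not the paper's actual harness: the names `run_stage`, `staged_evaluation`, and `make_flaky` are hypothetical, the toy solvers stand in for real models, and it assumes that only questions left unsolved in one stage are passed on to the next.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical type: a solver takes a question id and returns True on a correct answer.
Solver = Callable[[str], bool]


def run_stage(unsolved: Set[str], solvers: Iterable[Solver], attempts: int) -> Set[str]:
    """Return the questions that no solver answers correctly within `attempts` tries."""
    solver_list = list(solvers)
    still_unsolved = set()
    for question in unsolved:
        solved = any(
            solver(question) for solver in solver_list for _ in range(attempts)
        )
        if not solved:
            still_unsolved.add(question)
    return still_unsolved


def staged_evaluation(
    questions: Iterable[str], stages: List[Tuple[List[Solver], int]]
) -> List[int]:
    """Run each (solvers, attempts) stage in order; return the unsolved count after each."""
    unsolved = set(questions)
    counts = []
    for solvers, attempts in stages:
        # Assumption: a question solved in an earlier stage is never re-tested.
        unsolved = run_stage(unsolved, solvers, attempts)
        counts.append(len(unsolved))
    return counts


def make_flaky(solvable_after: Dict[str, int]) -> Solver:
    """Toy stand-in for a model: solves question q on its n-th attempt, n = solvable_after[q]."""
    calls: Dict[str, int] = {}

    def solver(question: str) -> bool:
        calls[question] = calls.get(question, 0) + 1
        return question in solvable_after and calls[question] >= solvable_after[question]

    return solver


# Synthetic data for illustration only; these are not the paper's questions or models.
questions = ["q0", "q1", "q2", "q3", "q4"]
base_model = make_flaky({"q0": 1, "q1": 1, "q2": 3})
heavy_model = make_flaky({"q3": 1})
stages = [
    ([base_model], 1),   # stage 1: a single attempt
    ([base_model], 3),   # stage 2: multiple runs per model
    ([heavy_model], 1),  # stage 3: a 'heavy-thinking' model
]
unsolved_counts = staged_evaluation(questions, stages)
print(unsolved_counts)
```

With this toy data the unsolved count shrinks stage by stage, mirroring in miniature how the reported figure went from 41 unsolved questions after a single attempt to 2 at the end.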

The study concludes that the mathematical reasoning capabilities of LLMs have become strikingly impressive, suggesting that traditional benchmark design for exercise-style problems based on public research may be reaching its limits against top-performing models.

The Gossip

Data Dependency Debate

The most prominent discussion point is whether the LLMs are genuinely performing mathematical reasoning or simply retrieving and synthesizing information present in their vast training data. Commenters asked whether 'known answers' meant that solutions were directly accessible to the models. Some acknowledged that the paper's methodology attempted to filter out trivially contained solutions (by ensuring not all models could solve them initially), but the debate highlights how hard it remains to distinguish true comprehension from sophisticated pattern matching in AI.

Benchmark Breakthroughs & Bottlenecks

Users expressed amazement at the LLMs' performance, particularly the reduction to only two unsolved questions, comparing it to earlier benchmarks where LLMs struggled with simpler math problems. There was a consensus that this achievement signifies a major leap in AI's capabilities. However, discussions also focused on the limitations of current benchmarking approaches, with the paper itself stating that 'the concept of writing exercise-style benchmark questions... has reached its limits.' This implies a need for new, more challenging benchmarks to test 'frontier challenges' that LLMs cannot simply 'find' answers to.