HN
Today

Learning from context is harder than we thought

Tencent HY Research argues that despite their impressive benchmark performance, current large language models fundamentally struggle with 'context learning,' the human ability to absorb and apply new information in real time. They introduce CL-bench, a rigorous new benchmark, and show that even state-of-the-art models fail on the vast majority of tasks requiring true context-dependent learning. This deep technical dive highlights a critical limitation in current AI capabilities and points to a new research direction, making it highly relevant to HN's AI enthusiasts.

Score: 3 · Comments: 0 · Highest Rank: #5 · Time on Front Page: 5h
First Seen: Feb 6, 6:00 PM · Last Seen: Feb 6, 10:00 PM

The Lowdown

Despite recent advancements, Large Language Models (LLMs) still exhibit a significant gap in their ability to learn effectively from new contexts, unlike humans, who constantly adapt to novel information. Tencent HY Research posits that LLMs are primarily 'test-takers' reliant on pre-trained knowledge rather than dynamic 'context learners.' This fundamental mismatch limits their utility in real-world, constantly evolving environments. To address this, they've developed a new evaluation framework, CL-bench.

  • The Problem: Current LLMs operate largely on parametric knowledge encoded during pre-training, struggling to integrate and apply new information presented in real-time context. They frequently revert to pre-trained assumptions even when new rules are explicitly defined.
  • Introducing CL-bench: This benchmark aims to measure true context learning, comprising 500 complex contexts, 1,899 tasks, and over 31,000 verification rubrics, all crafted by domain experts. Each task demands learning new knowledge directly from the provided context (a schematic sketch of such a task record follows this list).
  • Contamination-Free Design: To ensure genuine context learning rather than memorization, CL-bench utilizes entirely fictional content, modified real-world content, and niche/emerging information unlikely to be in pre-training datasets. Without context, models like GPT-5.1 (High) solve less than 1% of tasks (see the with/without-context ablation sketched after this list).
  • Key Findings: State-of-the-art LLMs perform poorly on CL-bench, with an average success rate of just 17.2%. Even the top model, GPT-5.1 (High), only achieves 23.7%. The dominant failure mode is ignoring or misusing contextual information.
  • Failure Analysis: Long-context reasoning and instruction following are necessary but not sufficient. Inductive reasoning (discovering patterns from data) is significantly harder for models than deductive application of given rules; the contrasting task examples below illustrate the distinction. Higher reasoning effort generally improves performance, though not universally.
  • Context Complexity: Task difficulty correlates with context length, but complex, dense short contexts also pose significant challenges.
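
To make the benchmark's shape concrete, here is a minimal Python sketch of what a CL-bench-style task record and its rubric-based notion of "solved" might look like. The field names and the all-rubrics-must-pass aggregation are assumptions for illustration, not the paper's actual schema.

    from dataclasses import dataclass, field

    @dataclass
    class Rubric:
        """One expert-written verification check applied to a model's response."""
        description: str       # e.g. "answer uses the rule defined in the context"
        passed: bool = False

    @dataclass
    class CLTask:
        """A CL-bench-style task: the required knowledge lives only in `context`."""
        context: str           # fictional, modified, or niche material to learn from
        prompt: str            # the question or instruction about that material
        rubrics: list[Rubric] = field(default_factory=list)

        def solved(self) -> bool:
            # Assumption: a task counts as solved only when every rubric passes;
            # the summary does not specify the paper's exact aggregation rule.
            return bool(self.rubrics) and all(r.passed for r in self.rubrics)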
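
The contamination check described above amounts to a simple ablation: run the same tasks with and without the context and compare solve rates. Continuing the sketch, `model` is any text-in/text-out callable and `grade` is a hypothetical placeholder grader; neither is from the paper, which uses expert-crafted rubrics.

    def grade(rubric: Rubric, response: str) -> bool:
        """Hypothetical grader; in practice an expert or LLM judge applies the rubric."""
        return rubric.description.lower() in response.lower()  # crude placeholder

    def solve_rate(model, tasks: list[CLTask], include_context: bool = True) -> float:
        """Fraction of tasks solved, optionally withholding the context."""
        solved = 0
        for task in tasks:
            prompt = f"{task.context}\n\n{task.prompt}" if include_context else task.prompt
            response = model(prompt)  # any text-in/text-out callable
            if task.rubrics and all(grade(r, response) for r in task.rubrics):
                solved += 1
        return solved / len(tasks)

    # The reported numbers imply a large gap between the two settings, e.g.
    # 23.7% with context vs. under 1% without for GPT-5.1 (High):
    # gap = solve_rate(model, tasks, True) - solve_rate(model, tasks, False)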
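
The inductive-vs-deductive split from the failure analysis can be illustrated with two hypothetical tasks over an invented mini-language ("Velar" is made up for this sketch; CL-bench's real contexts are far longer and denser):

    # Deductive: the rule is stated outright; the model only has to apply it.
    deductive = CLTask(
        context="In Velar, a fictional language, plurals are formed by prefixing 'ka-'.",
        prompt="Give the plural of 'lun'.",
        rubrics=[Rubric(description="answer is 'kalun'")],
    )

    # Inductive: the rule must first be discovered from examples -- the harder
    # case according to the paper's failure analysis.
    inductive = CLTask(
        context="Velar examples: lun -> kalun, mir -> kamir, tos -> katos.",
        prompt="Give the plural of 'rel'.",
        rubrics=[Rubric(description="answer is 'karel'")],
    )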

The findings from CL-bench make it clear that today's frontier LLMs are still far from being reliable context learners. This limitation explains many frustrations in real-world AI deployments where models fail subtly despite sophisticated context engineering. Improving context learning is crucial for LLMs to function effectively in dynamic environments, potentially shifting human roles in AI from 'training data providers' to 'context providers,' though the challenge of making context-learned knowledge persistent remains a deeper question.