HN Today

Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7

Simon Willison humorously pits a locally-run Qwen model against Anthropic's new Claude Opus 4.7 using his whimsical "pelican riding a bicycle" benchmark. Surprisingly, the quantized Qwen3.6-35B-A3B running on a laptop produced better, more accurate illustrations than the highly anticipated commercial model. The unexpected result neatly captures the absurdities and complexities of current AI model evaluation: sometimes the underdog wins at a specific, niche task.

Score: 8
Comments: 0
Highest Rank: #8
Time on Front Page: 3h
First Seen: Apr 16, 6:00 PM
Last Seen: Apr 16, 8:00 PM
Rank Over Time: [chart]

The Lowdown

Simon Willison, known for his unique AI model evaluation methods, has once again used his "pelican riding a bicycle" benchmark to compare two new models: Alibaba's Qwen3.6-35B-A3B and Anthropic's Claude Opus 4.7.

  • The test prompted both models to generate an SVG illustration of a "pelican riding a bicycle." A secondary test used "flamingo riding a unicycle."
  • Qwen3.6-35B-A3B, specifically a 21GB quantized version, was run locally on a MacBook Pro M5 using LM Studio (a sketch of this setup follows the list).
  • Claude Opus 4.7, Anthropic's newly released proprietary model, was accessed as a hosted service.
  • In both tasks, the locally-run Qwen model produced illustrations that the author judged significantly better and more accurate than Claude Opus 4.7's, which struggled with basic elements like the bicycle frame.
  • Willison acknowledges that the benchmark is largely a joke satirizing how hard it is to compare AI models, but notes that good pelicans have historically correlated with general model utility. This result breaks the correlation: he doubts Qwen is actually the stronger model overall.
  • The practical takeaway is that for the very specific task of illustrating pelicans on bicycles (or flamingos on unicycles), Qwen running on a laptop currently outperforms Opus.
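
For readers who want to try this at home, here is a minimal sketch of the comparison, not Willison's own code. It assumes LM Studio is serving the quantized Qwen model on its default OpenAI-compatible endpoint (http://localhost:1234/v1), that ANTHROPIC_API_KEY is set in the environment, and that the two model identifiers shown are illustrative placeholders for whatever names the services actually expose.

import os
import re

import anthropic
from openai import OpenAI

PROMPT = "Generate an SVG of a pelican riding a bicycle"

def extract_svg(text: str) -> str:
    # Pull the first <svg>...</svg> block out of a model response.
    match = re.search(r"<svg.*?</svg>", text, re.DOTALL)
    if match is None:
        raise ValueError("no SVG found in response")
    return match.group(0)

# Local Qwen, served by LM Studio's OpenAI-compatible API.
local = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
qwen_reply = local.chat.completions.create(
    model="qwen3.6-35b-a3b",  # assumed name; use whatever LM Studio lists
    messages=[{"role": "user", "content": PROMPT}],
)
qwen_svg = extract_svg(qwen_reply.choices[0].message.content)

# Hosted Claude Opus, via the Anthropic SDK.
claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
claude_reply = claude.messages.create(
    model="claude-opus-4-7",  # assumed identifier for Opus 4.7
    max_tokens=4096,
    messages=[{"role": "user", "content": PROMPT}],
)
claude_svg = extract_svg(claude_reply.content[0].text)

# Write both drawings out for a side-by-side look.
for name, svg in (("qwen_pelican.svg", qwen_svg), ("claude_pelican.svg", claude_svg)):
    with open(name, "w") as f:
        f.write(svg)

Opening the two SVG files next to each other in a browser is the entire "evaluation harness" this benchmark calls for.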

This amusing comparison underscores the unpredictable nature of AI model performance and how specialized, even seemingly trivial, benchmarks can yield surprising results that challenge conventional expectations about model capabilities and scaling.