HN Today

Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7

Simon Willison humorously pits a locally-run Qwen model against Anthropic's new Claude Opus 4.7 using his whimsical "pelican riding a bicycle" benchmark. Surprisingly, the quantized Qwen3.6-35B-A3B running on a laptop produced better, more accurate illustrations than the highly anticipated commercial model. The unexpected result neatly captures the absurdities and complexities of current AI model evaluation: sometimes the underdog wins at a specific, niche task.

Score: 8
Comments: 0
Highest Rank: #8
Time on Front Page: 3h
First Seen: Apr 16, 6:00 PM
Last Seen: Apr 16, 8:00 PM
Rank Over Time: [chart]

The Lowdown

Simon Willison, known for his unique AI model evaluation methods, has once again used his "pelican riding a bicycle" benchmark to compare two new models: Alibaba's Qwen3.6-35B-A3B and Anthropic's Claude Opus 4.7.

  • The test prompted both models to generate an SVG illustration of a "pelican riding a bicycle." A secondary test used "flamingo riding a unicycle."
  • Qwen3.6-35B-A3B, specifically a 21GB quantized version, was run locally on a MacBook Pro M5 using LM Studio (a sketch of this setup follows the list).
  • Claude Opus 4.7, Anthropic's newly released proprietary model, was accessed as a hosted service.
  • In both tasks, the locally-run Qwen model produced illustrations that the author judged significantly better and more accurate than Claude Opus 4.7's, which struggled with basic elements like the bicycle frame.
  • Willison acknowledges that the benchmark is largely a joke satirizing how hard it is to compare AI models, but notes that good pelicans have historically correlated with general model utility. This result breaks the correlation: he doubts Qwen is actually the stronger model overall.
  • The practical takeaway is that for the very specific task of illustrating pelicans on bicycles (or flamingos on unicycles), Qwen running on a laptop currently outperforms Opus.
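
For readers who want to try this at home, here is a minimal sketch of the comparison, not Willison's own code. It assumes LM Studio is serving the quantized Qwen model on its default OpenAI-compatible endpoint (http://localhost:1234/v1), that ANTHROPIC_API_KEY is set in the environment, and that the two model identifiers shown are illustrative placeholders for whatever names the services actually expose.

import os
import re

import anthropic
from openai import OpenAI

PROMPT = "Generate an SVG of a pelican riding a bicycle"

def extract_svg(text: str) -> str:
    # Pull the first <svg>...</svg> block out of a model response.
    match = re.search(r"<svg.*?</svg>", text, re.DOTALL)
    if match is None:
        raise ValueError("no SVG found in response")
    return match.group(0)

# Local Qwen, served by LM Studio's OpenAI-compatible API.
local = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
qwen_reply = local.chat.completions.create(
    model="qwen3.6-35b-a3b",  # assumed name; use whatever LM Studio lists
    messages=[{"role": "user", "content": PROMPT}],
)
qwen_svg = extract_svg(qwen_reply.choices[0].message.content)

# Hosted Claude Opus, via the Anthropic SDK.
claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
claude_reply = claude.messages.create(
    model="claude-opus-4-7",  # assumed identifier for Opus 4.7
    max_tokens=4096,
    messages=[{"role": "user", "content": PROMPT}],
)
claude_svg = extract_svg(claude_reply.content[0].text)

# Write both drawings out for a side-by-side look.
for name, svg in (("qwen_pelican.svg", qwen_svg), ("claude_pelican.svg", claude_svg)):
    with open(name, "w") as f:
        f.write(svg)

Opening the two SVG files next to each other in a browser is the entire "evaluation harness" this benchmark calls for.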

This amusing comparison underscores the unpredictable nature of AI model performance and how specialized, even seemingly trivial, benchmarks can yield surprising results that challenge conventional expectations about model capabilities and scaling.