StepFun 3.5 Flash is #1 cost-effective model for OpenClaw tasks (300 battles)
A new benchmark from UniClaw's OpenClaw Arena highlights AI model performance, with a surprising focus on cost-effectiveness. StepFun 3.5 Flash emerges as the most cost-effective model, even outperforming pricier alternatives on some agentic tasks. The results spark a technical discussion on model architecture and release strategies, plus a brief meta-commentary on AI-generated discussion itself.
The Lowdown
UniClaw's OpenClaw Arena introduces a novel approach to benchmarking AI models, providing 'real tasks, real agents, real results' to evaluate performance and cost-effectiveness across a range of LLMs. The platform distinguishes between raw performance and value, revealing a significant divergence between top models in each category.
- The Arena features two distinct leaderboards: one for raw performance and another for cost-effectiveness.
- StepFun 3.5 Flash unexpectedly secures the #1 spot for cost-effectiveness, demonstrating strong capabilities even though it's significantly cheaper than competitors.
- Conversely, models like Claude Opus 4.6 lead in raw performance but rank much lower in cost-effectiveness.
- Several mid-tier models, including GLM-5 Turbo, Xiaomi MiMo v2 Pro, and MiniMax M2.7, surprisingly outrank more established names like Gemini 3.1 Pro in performance.
- The benchmarking methodology uses relative ordering and a grouped Plackett-Luce model with bootstrap confidence intervals, similar to Chatbot Arena, to ensure robust rankings.
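The Plackett-Luce approach mentioned above can be sketched in a few lines. This is a minimal illustration, not OpenClaw's actual code: it fits item "strengths" by gradient ascent on the Plackett-Luce log-likelihood (the probability of a ranking is a product of softmax choices over the remaining items) and estimates confidence intervals by resampling rankings with replacement. Function names and the synthetic data are hypothetical.

```python
import math
import random

def fit_plackett_luce(rankings, n_items, iters=200, lr=0.1):
    """Fit Plackett-Luce strengths by gradient ascent on the log-likelihood.

    rankings: list of orderings, each a list of item indices, best first.
    Returns a list of strengths, centered at zero for identifiability.
    """
    s = [0.0] * n_items
    for _ in range(iters):
        grad = [0.0] * n_items
        for r in rankings:
            # Each ranking is a sequence of choices over the remaining items;
            # the final position contributes nothing, so stop one short.
            for j in range(len(r) - 1):
                rest = r[j:]
                z = sum(math.exp(s[i]) for i in rest)
                grad[r[j]] += 1.0                     # chosen item
                for i in rest:                        # softmax pull-down
                    grad[i] -= math.exp(s[i]) / z
        s = [si + lr * g for si, g in zip(s, grad)]
        mean = sum(s) / len(s)                        # re-center each step
        s = [si - mean for si in s]
    return s

def bootstrap_ci(rankings, n_items, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for each item's strength."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_boot):
        resampled = [rng.choice(rankings) for _ in rankings]
        samples.append(fit_plackett_luce(resampled, n_items))
    lo, hi = [], []
    for i in range(n_items):
        vals = sorted(b[i] for b in samples)
        lo.append(vals[int(alpha / 2 * n_boot)])
        hi.append(vals[int((1 - alpha / 2) * n_boot) - 1])
    return lo, hi

# Hypothetical battles among 3 models: model 0 usually ranks first.
battles = [[0, 1, 2]] * 8 + [[1, 0, 2]] * 2 + [[0, 2, 1]] * 2
strengths = fit_plackett_luce(battles, 3)
lo, hi = bootstrap_ci(battles, 3, n_boot=50)
```

Overlapping bootstrap intervals are the usual signal that two models cannot be confidently separated, which is why arena-style leaderboards report intervals rather than point estimates alone.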
This analysis underscores the growing importance of weighing both performance and cost when selecting AI models for practical applications: top-tier performance doesn't always command a premium price, nor does a premium price guarantee top-tier performance.
The Gossip
StepFun's Surprising Strengths
Commenters delved into StepFun 3.5 Flash's unexpected top ranking in cost-effectiveness. Discussion centered on its historical availability (it was free for a period, which may have skewed its perceived popularity), its open-source release details (base model, midtrain checkpoint, and training pipeline), and its strong showing in agentic tasks despite its low cost. The author noted they were surprised by StepFun's agentic performance, having initially found it unimpressive in 'arena.ai'-style tasks before running detailed benchmarks.
Authorial AI Accusations & Arena Accountability
A brief but notable tangent occurred when a commenter accused the author of using AI to write their initial Hacker News comments. The author apologized, stating they were unaware of the HN guideline against it, and promptly provided a human-written summary. They also responded to questions about data transparency, clarifying that data for all 300+ battles, including raw conversation histories and judge verdicts, is publicly available on the OpenClaw site, and emphasized the integrity of their methodology.