Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a coding challenge
An open-weights Chinese model, Kimi K2.6, surprisingly outperformed proprietary giants like GPT-5.5 and Claude Opus 4.7 in a novel AI coding challenge. This victory, attributed to its aggressive, greedy strategy in a unique Word Gem Puzzle, has ignited discussions about the true capabilities of open-source AI and the relevance of specialized benchmarks. Hacker News is buzzing about whether this signals a genuine shift in AI leadership or simply a well-played game.
The Lowdown
The AI Coding Contest (AICC) recently hosted a 'Word Gem Puzzle' challenge, pitting ten major language models against each other in real-time programming tasks with objective scoring. The results defied expectations, with Kimi K2.6, an open-weights model from Chinese startup Moonshot AI, clinching the top spot, leaving prominent Western models in its wake.
The challenge involved a sliding-tile letter puzzle on a grid (ranging from 10x10 to 30x30). Models had to slide adjacent tiles into a blank space and claim valid English words formed horizontally or vertically. Scoring rewarded longer words (7+ letters) and penalized shorter ones (under 7 letters).
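The scoring rule above is simple enough to sketch. A minimal illustration in Python, assuming the 7-letter threshold from the article; the exact point values (word length as the reward or penalty) and the helper names `score_word` and `grid_lines` are assumptions for illustration, not the contest's actual implementation:

```python
def score_word(word: str) -> int:
    """Score a claimed word: the article says 7+ letters are rewarded
    and shorter words penalized; using +/- word length is an assumption."""
    n = len(word)
    return n if n >= 7 else -n

def grid_lines(grid):
    """Yield every horizontal row and vertical column as a string --
    the lines in which claimable words can appear on the board."""
    for row in grid:
        yield "".join(row)
    for col in zip(*grid):
        yield "".join(col)
```

Under this rule, claiming a seven-letter word nets +7, while claiming a three-letter word costs 3 points, which is what makes indiscriminate claiming so costly.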
- Kimi K2.6's Strategy: Kimi won by aggressively sliding tiles. Its greedy approach scored each possible move and executed the best one. While sometimes inefficient on smaller boards due to 'edge-oscillation,' this high-volume movement proved highly effective on larger 30x30 grids where static words were rare, leading to the highest cumulative score of 77.
- MiMo V2-Pro (2nd Place): This model relied on a static scan, claiming intact seed words early on. It scored well on smaller grids but failed on heavily scrambled 30x30 boards as it never moved tiles.
- Other Frontier Models: GPT-5.5 slid tiles more conservatively and performed best on the 15x15 and 30x30 boards. Claude Opus 4.7, like MiMo, never slid tiles at all, which severely hampered its performance on larger grids requiring active tile manipulation.
- Catastrophic Failures: DeepSeek V4 sent malformed data and scored almost nothing, while Muse Spark misread the scoring rules entirely, claiming every word regardless of length and finishing with a disastrous cumulative score of -15,309.
- The 30x30 Separator: The author notes that the largest grids dramatically separated models. Those capable of dynamic tile manipulation (like Kimi) thrived, while static scanners faltered.
- Bigger Picture: While acknowledging this is just one benchmark, the author points out that open-weights models like Kimi are rapidly closing a capability gap with frontier models that once seemed unbridgeable. Kimi's performance, just a few points behind GPT-5.5 and Claude on the Artificial Analysis Intelligence Index, signals a significant competitive shift.
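The greedy policy attributed to Kimi (score every legal slide, commit to the best one) can be sketched in a few lines. This is a hedged illustration, not Kimi's actual code: the board representation (list of lists with `"."` as the blank), the helper names, and the pluggable `evaluate` function are all assumptions. It also shows, by contrast, why a static scanner that never calls anything like `greedy_step` stalls on scrambled boards.

```python
from itertools import product

BLANK = "."  # assumed marker for the empty cell

def legal_moves(grid):
    """Yield (src, dst) pairs: each tile adjacent to the blank (src)
    can slide into the blank cell (dst)."""
    rows, cols = len(grid), len(grid[0])
    br, bc = next((r, c) for r, c in product(range(rows), range(cols))
                  if grid[r][c] == BLANK)
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        r, c = br + dr, bc + dc
        if 0 <= r < rows and 0 <= c < cols:
            yield (r, c), (br, bc)

def apply_move(grid, move):
    """Return a new grid with the tile at src slid into the blank at dst."""
    (sr, sc), (dr, dc) = move
    new = [row[:] for row in grid]
    new[dr][dc], new[sr][sc] = new[sr][sc], new[dr][dc]
    return new

def greedy_step(grid, evaluate):
    """One step of a greedy policy: evaluate the board after every
    legal slide and return the highest-scoring move (None if stuck)."""
    return max(legal_moves(grid),
               key=lambda m: evaluate(apply_move(grid, m)),
               default=None)
```

A real agent would plug in an `evaluate` that counts claimable words on the resulting board; repeatedly taking one-step-best moves is exactly the kind of myopic policy that can oscillate at board edges on small grids yet generate the constant tile churn that paid off on 30x30.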
This specific challenge highlights the importance of real-time decision-making and robust code generation for novel tasks, rather than just long-context reasoning. The emergence of strong, openly available models challenges the established order and offers new competitive dynamics in the AI landscape.
The Gossip
Benchmark Brouhaha
Many commenters debated the true significance of this specific challenge, arguing it might not be a general indicator of coding prowess. Some suggested Kimi's win stemmed more from a strategy optimized for this particular game's mechanics than from superior overall coding ability, likening the contest to a LeetCode-style puzzle with a TCP layer. Others appreciated the shift towards objectively scored tests, even if specialized, as providing a tangible comparison point.
Open-Weights Overtaking
A dominant theme was the excitement and implications of an open-weights model nearing or matching the performance of closed, frontier models. Users highlighted the benefits of open weights for cost, stability (avoiding 'enshittification' or silent model changes), and fostering a competitive ecosystem. There was a strong desire for more accessible, powerful open models, even if they currently require significant hardware or cloud infrastructure to run at scale.
Kimi K2.6's Characteristics and Cost-Effectiveness
Users shared their practical experiences with Kimi K2.6, praising its 'personality' and effectiveness for focused, iterative coding tasks. However, some noted its verbosity and tendency to fill up its context window on larger planning tasks, leading to higher token usage. There was also significant discussion around the cost of using Kimi, with many recommending specific coding plans or services as more economical alternatives to general API calls, emphasizing its value for money compared to more expensive frontier models.