Things that make me go hmm
Every week there’s a new post ranking the coding models: which one is smartest, which one is cheapest, which one you should switch to today. Most of them come down to someone reading the output and going “yeah, that looks better”. Which is fine, but I wanted to score it without the eyeballing. So I built a little thing: a handful of small coding tasks, each with a hidden test that decides pass or fail. A deadlock to fix, a glob matcher, an event bus, that sort of thing. The model works on a clean checkout and never sees the answer.
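Roughly, the scoring loop has this shape. This is a minimal sketch, not the code from the repo: `score_task`, `apply_model_edit` and the file layout are names made up for illustration, and it assumes the hidden test is a Python script whose exit code signals pass or fail.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def score_task(task_dir: Path, hidden_test: Path, apply_model_edit) -> bool:
    """Return True if the hidden test passes after the model edits a fresh copy.

    apply_model_edit is a hypothetical callable that stands in for the model:
    it receives the working directory and changes files in place.
    """
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "checkout"
        shutil.copytree(task_dir, work)   # clean copy, hidden test not included
        apply_model_edit(work)            # the model only ever sees this copy
        # The hidden test is dropped in only after the model is done.
        shutil.copy(hidden_test, work / hidden_test.name)
        result = subprocess.run(
            [sys.executable, hidden_test.name],
            cwd=work,
            capture_output=True,
            timeout=60,
        )
        return result.returncode == 0     # the exit code is the whole verdict
```

The point of copying the test in afterwards is that nothing the model can read hints at what will be checked.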
One script kicks off the lot:
./run_all.sh # runs everything in models.txt
That let me throw Claude’s models and a bunch of free Ollama cloud ones at the same tasks. Then I ran it, and mostly what I got back was hmm.

The two models alone at the top are glm-5.2 and kimi-k2.7-code, both free. Every Claude model landed a row below. Fable cost me over five dollars and scored the same as haiku at sixty cents. Not what I expected.
But the thing that really got me was this. I also have a runner that feeds an Ollama model the task agentically, over many turns, letting it read files and poke around. So I ran the same free models that way. And they got worse. glm-5.2 swept everything in a single shot, then dropped two tasks once it had all those turns to look around. The whole bet right now is that more turns means more help. Here it was the opposite, and I don’t have an answer for it.
And I should be honest about the test. Nine tiny tasks, one run each, and I built the whole thing with Claude’s help, the tasks and the hidden tests and all, so who knows which way it leans. It’s a toy. Which is why I want to point it at a real project next and see if any of this survives.
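To put a number on how little nine tasks can tell you, here is a small sketch using the Wilson score interval for a pass rate. It is my own back-of-the-envelope check, not part of the benchmark, and it assumes each task is an independent pass/fail trial, which is generous for a hand-picked set.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate (z = 1.96)."""
    p = passes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Single-shot sweep (9 of 9) versus the agentic run that dropped two (7 of 9).
for passes in (9, 7):
    low, high = wilson_interval(passes, 9)
    print(f"{passes}/9: {low:.2f} to {high:.2f}")
```

The interval comes out at roughly 70% to 100% for nine out of nine and roughly 45% to 94% for seven out of nine. Those ranges overlap a lot, so one run each can't really separate the two.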
Grab it at github.com/epatel/model-benchmark if you want to poke at it…