What is Ornith-1.0? The open coding model developers are testing now

Ornith-1.0 landed with exactly the kind of claim that makes the local model world stop scrolling: an open-source model family for agentic coding, trained to improve not just the answer, but the scaffold used to get there.
The announcement spread quickly because the numbers were loud. The X launch post claimed state-of-the-art results among comparable open models, including 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified. Google Trends exports for the week after launch show breakout interest across "ornith 35b", "ornith model", "ornith ai", "ornith 9b", "ollama ornith", "ornith hugging face", and "ornith 31b".
That last query matters. Ornith is announced as a family spanning 9B dense, 31B dense, 35B MoE, and 397B MoE. But the practical release story is messier than the headline.
If you are trying it today, the first question is not "is Ornith the best coding model?" It is simpler: which Ornith can you actually download, run, and evaluate on your own work?
What Ornith-1.0 is claiming
Ornith is from DeepReinforce, and the core technical claim is that it is more than another coding fine-tune. The project describes Ornith-1.0 as a self-improving training framework built on top of pretrained Gemma 4 and Qwen 3.5 models.
The important distinction: "self-improving" here refers to the training process, not a local model that keeps modifying its own weights while you use it.
The training idea is that reinforcement learning should optimize two things together:
- the solution rollouts
- the task-specific harnesses or scaffolds that guide those rollouts
That framing is interesting because coding agents do not succeed only by emitting code. They succeed by choosing good steps: inspect the right files, run the right tests, write the right scratch scripts, use tools in a productive order, and avoid loops. A model that is better at scaffolding its own work could be better in an agent harness even if it is not dramatically better in a plain chat window.
That is also why the benchmark discussion is hard to interpret. A tool-use model can look bad in a no-tool chat benchmark and still be useful in a real coding harness. It can also score well on coding benchmarks and still fail in your repo because it loops, over-searches, ignores project conventions, or optimizes for the benchmark shape.
What is actually available
The launch copy names four sizes:
| Model | Shape | Practical status |
|---|---|---|
| Ornith-1.0 9B | Dense | Available on Ollama and Hugging Face |
| Ornith-1.0 31B | Dense | Announced, but not visible in the public release set at the time of writing |
| Ornith-1.0 35B | MoE | Available on Ollama and Hugging Face |
| Ornith-1.0 397B | MoE | Visible in the Hugging Face collection, including FP8/GGUF-related releases |
For most developers, the immediately relevant models are the 9B and 35B variants. Ollama currently exposes 9b and 35b tags, which makes it the fastest path for a quick local test.
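For a quick smoke test, a short script against Ollama's local HTTP API may be enough. The following is a minimal sketch, assuming an Ollama server on its default port (11434) and a model you have already pulled. The `ornith:35b` tag is inferred from the published tag names, and the helper names `build_request` and `ask` are illustrative, not part of any Ornith tooling.

```python
import json
import urllib.request

# Default endpoint of a locally running Ollama server (assumed setup).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt and return the generated text.

    Requires a running Ollama server with the given model already pulled.
    """
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask("ornith:9b", "Write a function that reverses a string.")` and then repeating the call with the 35B tag gives a first impression of speed and style. It says little about agentic behavior, which needs a real tool loop.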
The missing 31B dense model is not a footnote. Several people in the Hacker News thread and the LocalLLaMA discussion asked the same thing: if the 31B dense model is part of the family, where are the weights and benchmarks?
That gap matters because 31B dense is exactly the size many local developers care about. It is large enough to be meaningfully more capable than a 9B model, but still plausible on high-memory desktop setups or aggressive quantization. The 35B MoE release is useful, but it is not the same tradeoff.
Why people are skeptical
The skeptical read is straightforward: Ornith may be a strong benchmark-tuned fine-tune of Qwen/Gemma bases, but the launch language makes it sound more novel than it is.
On HN, the discussion quickly centered on what "self-improving" means. The useful interpretation is training-time self-improvement: the system learns better scaffolds and better rollouts during RL. The misleading interpretation is runtime self-improvement: the model learns while you use it. It does not appear to mean that.
There is also concern around benchmark transfer. The Reddit thread is mixed, but the center of gravity is cautious:
- some users report Ornith 9B and 35B falling behind comparable Qwen models in real coding tasks
- several describe it as "benchmaxxed" or too tuned to the charts
- one recurring complaint is doom loops in longer agent runs
- others say the 35B model is genuinely useful, fast, or better for certain Hermes/tool-calling workflows
- at least one 397B tester reported a good early experience on multi-turn refactors
That is the normal pattern for a hyped local coding model. The benchmarks create the first wave. The second wave is people running it against private codebases, weird toolchains, messy repos, and personal eval suites. The second wave is usually more useful.
The will-it-mythos caveat
The will-it-mythos post is worth reading because it is a good example of how benchmark framing can clash with practical testing. Ornith is mentioned there as performing poorly in a chat setup without tools, finding only the common bug most models found and showing a tendency to hallucinate.
That does not settle the question. The author explicitly notes that a replication with full tool access could change the result.
For Ornith, that caveat is central. If the model is trained for agentic coding, then a no-tools chat test is not the environment it is designed for. At the same time, a model that hallucinates confidently without tools is still telling you something about its failure modes. A good agent harness can reduce those failures, but it cannot make them irrelevant.
The right conclusion is not "Ornith failed" or "the benchmark is invalid." It is: test it in the mode you intend to use.
How to test Ornith without fooling yourself
If you want to know whether Ornith belongs in your local coding rotation, do not start with the launch chart. Start with your own workload.
A useful test should include:
- one small greenfield task where success is easy to inspect
- one change to an existing repo with real conventions
- one bug fix where the model has to read surrounding code before editing
- one tool-use task that requires running tests or a script
- one longer task where loops and path hallucinations can appear
Then compare Ornith against the model you already use, not against a screenshot. For many local developers, that means Qwen 3.5 or Qwen 3.6 variants, Kimi, GLM, DeepSeek, Hermes, or whatever currently sits in your agent harness.
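One way to keep that comparison honest is to record every run in the same shape and summarize per model. The sketch below is illustrative only: the `RunResult` fields, the task kind names, and the `summarize` helper are assumptions made for this example, and `passed` stands for whatever inspection or test you trust for your own work.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical labels for the five task types listed above.
TASK_KINDS = ("greenfield", "repo_change", "bug_fix", "tool_use", "long_task")


@dataclass(frozen=True)
class RunResult:
    model: str      # e.g. an Ornith tag or your current baseline
    task: str       # one of TASK_KINDS
    passed: bool    # did the result survive your own checks?
    seconds: float  # wall-clock time for the run


def summarize(results):
    """Return {model: (pass_rate, mean_seconds)} over all recorded runs."""
    by_model = defaultdict(list)
    for r in results:
        if r.task not in TASK_KINDS:
            raise ValueError(f"unknown task kind: {r.task}")
        by_model[r.model].append(r)
    return {
        model: (
            sum(r.passed for r in runs) / len(runs),
            sum(r.seconds for r in runs) / len(runs),
        )
        for model, runs in by_model.items()
    }
```

With only five tasks per model the pass rate is a rough signal, not a statistic. The point is to run both models on identical tasks and look at the results side by side.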
Pay attention to the boring metrics:
- did it edit the right files?
- did it run the right checks?
- did it recover from failures?
- did it repeat commands after they failed?
- did it invent paths, APIs, or test results?
- was it enough faster than your current model to matter?
That last question is important. A model can be slightly worse on final quality and still worth using if it is much faster, cheaper, or easier to run locally. Several positive reports about Ornith 35B focus on speed and compact thinking traces rather than raw intelligence.
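Two of those checks, repeated failing commands and invented paths, are easy to automate if your agent harness logs what the model did. The following is a rough sketch under that assumption. The log format (a list of command and exit-code pairs) and both helper names are hypothetical, so adapt them to whatever your harness actually records.

```python
from collections import Counter
from pathlib import Path


def repeated_failures(commands, threshold=2):
    """Return commands that failed at least `threshold` times in one run.

    `commands` is a list of (command_string, exit_code) pairs in run order.
    A non-zero exit code counts as a failure.
    """
    failures = Counter(cmd.strip() for cmd, code in commands if code != 0)
    return {cmd: n for cmd, n in failures.items() if n >= threshold}


def invented_paths(referenced, repo_root):
    """Return the referenced relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in referenced if not (root / p).exists()]
```

A non-empty result from either helper does not prove the model is bad, since some retries are legitimate and some paths are meant to be created. It does tell you which transcripts are worth reading first.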
Which model should you try first?
Start with ornith:9b only if you are constrained by hardware or want a fast smoke test. Small coding models can be useful for narrow tasks, but the Reddit thread has multiple reports of 9B not transferring cleanly to real agentic coding.
The 35B MoE model is the more interesting practical release. It is the one to test if you want to know whether Ornith has a place in a local coding agent stack right now.
The 397B release is a different category. It may be interesting for people running vLLM on serious multi-GPU hardware, but it is not the model most developers are going to casually evaluate over lunch.
The 31B dense model is the one to watch. Search interest is already there, and community comments keep circling back to it. If DeepReinforce releases a strong Gemma 4 31B dense tune with reliable tool behavior, that could be the most practical member of the family for high-end local development.
The honest read
Ornith-1.0 is interesting for the right reason: it points at the part of coding agents that matters most, the scaffold around the solution. Better tool-use trajectories, better harness choices, and fewer wasted loops would be valuable even without a giant leap in base model intelligence.
But the release also deserves the skepticism it is getting. The model family is not fully available in the shape announced. The "self-improving" language is easy to overread. The community reports are mixed. And benchmark wins do not automatically mean the model will survive your repo, your tests, and your agent loop.
So the practical answer is:
- try the 35B release if you care about local agentic coding today
- use the 9B release for fast experiments, not final judgment
- wait for the 31B dense release if that is the hardware/performance sweet spot you actually want
- treat the launch benchmarks as a lead, not a verdict
Ornith might become a serious local coding model. It is not proven by the chart. It will be proven, or not, by what happens when developers put it inside real tool loops and ask it to change real code.
