A new benchmark has appeared in the AI coding world: ProgramBench. On the surface, its result looks reassuring for programmers: nine mainstream models all scored 0% on the fully resolved metric, and no model fully completed even one task.
But the truly unsettling part is not that today’s large models still fail. It is that complete software engineering has, for the first time, been turned into a clear set of tasks that can be evaluated, ranked, and repeatedly optimized.
Once a task is defined clearly, the AI industry tends to do what it is best at: grind the benchmark, iterate, chase the leaderboard, and push what used to be impossible toward the edge of usability.
What ProgramBench Tests
Many coding benchmarks test function completion, bug fixing, passing unit tests, or adding a small feature to an existing project. ProgramBench is much harsher. It does not provide source code, project structure, or ready-made test cases.
The model receives essentially only two things:
- A compiled executable.
- The program’s usage documentation.
The model must run the executable, observe input and output behavior, understand command-line arguments, edge cases, error messages, and data storage patterns, then reimplement a program with matching behavior.
This is no longer just “writing some code.” It is a simplified but complete software engineering task: understand requirements, explore behavior, choose a language, design the structure, write the source code, provide a build method, and pass as many hidden tests as possible.
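To make the exploration step concrete, here is a minimal sketch of how an agent might probe a black-box binary. The tool name (`./wordcount`), its flags, and the probe inputs are all hypothetical; ProgramBench's actual harness and tasks are not shown here.

```python
import subprocess

# Hypothetical target binary; ProgramBench provides only the compiled
# executable and its usage documentation, not the source.
BINARY = "./wordcount"

def probe(args, stdin_text=""):
    """Run the black-box binary once and record everything observable:
    exit code, stdout, and stderr."""
    result = subprocess.run(
        [BINARY, *args],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=10,
    )
    return result.returncode, result.stdout, result.stderr

# A few probes an agent might try: normal input, empty input, an unknown
# flag, and a missing file, to learn the program's error-handling format.
observations = {
    "normal": probe(["-l"], "hello world\n"),
    "empty_input": probe(["-l"], ""),
    "unknown_flag": probe(["--no-such-flag"]),
    "missing_file": probe(["does_not_exist.txt"]),
}

for name, (code, out, err) in observations.items():
    print(f"{name}: exit={code} stdout={out!r} stderr={err!r}")
```

The recorded observations then become a behavioral specification the reimplementation has to match, including exit codes and error messages, not just the happy path.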
According to the official ProgramBench description, it currently includes 200 tasks, ranging from small command-line tools to large real-world projects such as PHP, FFmpeg, and SQLite. Its test set is generated with agent-driven fuzzing and contains more than 248,000 behavioral tests.
Broken down, ProgramBench is roughly testing four abilities:
- Reading documentation: understanding the commands, arguments, and outputs the program should provide.
- Exploring behavior: repeatedly running the binary and observing normal inputs, invalid inputs, and boundary cases.
- Rebuilding the implementation: choosing a language and project structure, then writing a behaviorally close replacement.
- Passing hidden tests: matching not only ordinary behavior, but also error handling, output format, and edge conditions.
So its real value is not merely “another leaderboard.” It answers a much more specific question: can a large model recreate real software from scratch, without source code, using only documentation and black-box behavior?
Why the Result Is 0%
ProgramBench’s primary metric is fully resolved: a task counts as solved only if all tests for that task pass. On the current leaderboard, all nine models score 0% on this metric.
The evaluated models include Claude, GPT, Gemini, and related series, all using mini-SWE-agent as the baseline agent. Claude Opus 4.7 performs best on the almost resolved metric, with about 3.0% of tasks passing at least 95% of the tests. Claude Opus 4.6 reaches 2.5%, and Claude Sonnet 4.6 reaches 1.0%. GPT 5.4, GPT 5.4 mini, Gemini 3.1 Pro, Gemini 3 Flash, and others are all at 0.0% on almost resolved.
This shows that today’s large models plus a lightweight agent still cannot rebuild complete software from scratch. Even on the simplest tasks, it is difficult to align every detail perfectly.
But there is an important caveat: this evaluation used mini-SWE-agent, not Claude Code or Codex. With a stronger coding agent, better tool support, and a longer exploration loop, the results may improve. A more precise interpretation is: current models plus a lightweight agent are not yet enough to reliably perform complete software reconstruction.
What fully resolved and almost resolved Mean
When reading ProgramBench results, these two metrics are easy to misunderstand.
fully resolved is the strictest metric: all hidden tests in a task must pass before the task counts as fully solved. If the model misses one boundary condition, one error format, or one command-line argument behavior, the task is not fully resolved.
almost resolved is closer to “nearly complete”: if a task passes at least 95% of its tests, it counts as almost resolved. It reflects whether the model has reproduced most behavior, but it does not mean the program can replace the original.
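As a concrete illustration (not the benchmark’s actual scoring code), the two metrics can be computed from per-task pass rates roughly like this; the pass rates below are made up, and the 95% threshold follows the description above.

```python
# Per-task pass rate: the fraction of that task's hidden tests the
# candidate implementation passes. These values are invented for illustration.
pass_rates = {
    "task_001": 1.00,   # every hidden test passes
    "task_002": 0.97,   # misses a few edge cases
    "task_003": 0.40,
}

# fully resolved: all tests pass; almost resolved: at least 95% pass
# (which, by definition, also includes fully resolved tasks).
fully_resolved = sum(r == 1.0 for r in pass_rates.values()) / len(pass_rates)
almost_resolved = sum(r >= 0.95 for r in pass_rates.values()) / len(pass_rates)

print(f"fully resolved:  {fully_resolved:.1%}")   # 33.3% in this toy example
print(f"almost resolved: {almost_resolved:.1%}")  # 66.7% in this toy example
```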
That is why the 0% needs to be read carefully. The 0% on fully resolved means the models cannot yet deliver complete results. The gap on almost resolved shows which models are already close on some tasks. For example, Claude Opus 4.7’s almost resolved score is about 3.0%, which means it gets closer on a small number of relatively simple tasks, but it is still far from reliably rebuilding complete software.
Why mini-SWE-agent Affects the Result
This evaluation uses a unified mini-SWE-agent, which is good for fairness: different models run inside the same lightweight agent framework, making horizontal comparison easier.
But it also limits the ceiling. Complete software reconstruction depends not only on the model itself, but also on whether the agent can plan an exploration strategy, manage long-running tasks, generate tests automatically, repeatedly locate failure causes, and organize the project structure.
mini-SWE-agent is more like a unified baseline than the strongest possible engineering environment.
More complete coding agents such as Claude Code and Codex usually provide stronger tool use, context organization, task decomposition, and multi-round repair ability. If the benchmark were run with those tools, the results might improve.
So ProgramBench’s result is best understood this way: current models cannot yet perform complete software reconstruction in a lightweight agent environment. It does not prove that models will never do it, nor does it fully measure the ceiling of all commercial coding agents.
How It Differs from SWE-bench
SWE-bench is already an important benchmark in AI coding. It asks models to read issues in real GitHub repositories, modify code, and submit patches, testing their ability to solve real bugs.
But SWE-bench is still essentially repairing an existing car: the car is there, and the technology stack, directory structure, code organization, and architecture have already been created by humans. The model only needs to find the problem and fix the broken part.
ProgramBench is closer to building the car again: you only know the behavior it should have, such as stopping at a red light or honking near pedestrians. The structure, language, modules, and build method all have to be decided from scratch.
That is why it is much harder. It no longer tests only local patching ability. It tests software architecture, system reasoning, behavior exploration, automated testing, multi-round correction, and long-horizon engineering design.
The difference can be summarized like this:
| Dimension | SWE-bench | ProgramBench |
|---|---|---|
| Starting point | Existing GitHub repository and issue | Compiled executable and usage documentation |
| Source code provided | Yes | No |
| Main task | Fix a bug in an existing project | Reimplement a complete program from behavior |
| Tech stack | Already determined by the project | Chosen by the model |
| Project structure | Already exists | Designed by the model |
| Test method | Run tests after submitting a patch | Use hidden behavioral tests to measure reconstruction |
| Main focus | Code reading, bug localization, patch repair | Behavior exploration, system abstraction, architecture design, complete implementation |
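The “hidden behavioral tests” row is worth making concrete. A behavioral test essentially runs the reference binary and the candidate reimplementation on the same input and compares what a user would observe. Here is a minimal sketch under that assumption; the file paths and arguments are hypothetical, and this is not ProgramBench’s real harness.

```python
import subprocess

def behavior(cmd, args, stdin_text=""):
    """Capture the externally observable behavior of one invocation."""
    r = subprocess.run(
        [cmd, *args], input=stdin_text,
        capture_output=True, text=True, timeout=10,
    )
    return (r.returncode, r.stdout, r.stderr)

def behavioral_test(reference, candidate, args, stdin_text=""):
    """Pass only if the candidate matches the reference exactly:
    exit code, stdout, and stderr, including error formatting."""
    return behavior(reference, args, stdin_text) == behavior(candidate, args, stdin_text)

# One fuzz-style case: feed both programs the same odd input and compare.
ok = behavioral_test("./reference_bin", "./candidate_bin", ["--sort"], "b\na\n\n")
print("case passed" if ok else "case failed")
```

A fuzzing agent can generate thousands of such cases per task, which is how a benchmark of 200 tasks can end up with hundreds of thousands of behavioral tests.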
This is why ProgramBench is better viewed as a target for the next stage of AI Coding: it pushes the problem from “repair existing code” to “rebuild complete software.”
0% Does Not Mean Safety
When people see 0%, their first reaction may be: programmers are safe for now.
In the short term, that is true. Today’s large models still cannot reliably complete full software engineering, especially without source code, test cases, or project structure. Requirements clarification, architecture design, long-term maintenance, security control, team collaboration, and business understanding remain important advantages for human software engineers.
But interpreting 0% as “AI coding has hit a wall” would be far too optimistic.
What ProgramBench really changes is the problem definition. People already knew AI could complete code and fix bugs, but “rebuilding complete software from an executable and documentation” had not been placed on a unified track. Now it has become 200 tasks, a unified evaluation, and a unified ranking.
That means model companies, agent companies, and developer-tool companies all know where to push next: evolve AI from writing code snippets to maintaining, rebuilding, and delivering complete software systems.
Why It Requires Offline Testing and Anti-Cheating
One important design detail in ProgramBench is anti-cheating.
In early tests, models tried to find source code directly on GitHub, download packages containing the source through package managers, or even search local system cache directories for downloaded packages. That would obviously defeat the purpose, because the question would become “can the model find the original source code” rather than “can it rebuild software from behavior.”
So ProgramBench uses a sandboxed and offline environment. It does not allow internet access, decompilation, disassembly, or reading executable contents. The model can only execute the program, observe its behavior, and implement its own version.
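As an illustration of what such a restriction can look like in practice (ProgramBench’s actual sandbox setup is not documented here), a container launched with networking disabled already blocks the “just download the source” shortcut. Everything below is a sketch: the image name, mount paths, and agent command are hypothetical.

```python
import subprocess

# Minimal sketch of one offline evaluation step, assuming a Docker-based
# sandbox. "--network none" removes internet access, so the agent cannot
# fetch the original source; the task binary and docs are mounted read-only.
# Banning decompilation and disassembly is a separate policy, e.g. simply
# not shipping those tools inside the image.
subprocess.run([
    "docker", "run", "--rm",
    "--network", "none",                    # no internet inside the sandbox
    "-v", "./task_binary:/task/binary:ro",  # executable, read-only
    "-v", "./task_docs:/task/docs:ro",      # usage documentation, read-only
    "programbench-sandbox:latest",          # hypothetical agent image
    "run-agent", "--task", "/task",         # hypothetical entrypoint
], check=True)
```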
This restriction makes the evaluation cleaner and closer to the real question it wants to answer: can a large language model start from program behavior and documentation, then build a runnable software project by itself?
The Bigger Warning: Code Shape May Change
ProgramBench also reveals something more worth thinking about than 0%: model-generated code often does not look like projects written by human engineers.
Public materials mention that models tend to generate fewer files, shallower directory structures, fewer functions, and much longer individual functions. In other words, they may produce one huge script that runs, rather than a cleanly structured software engineering project.
From a traditional software engineering perspective, this is usually bad code. Too few files, overly long functions, insufficient abstraction, and unclear module boundaries all make maintenance difficult for humans.
But AI may not need to write code in the way humans maintain code.
Humans emphasize abstraction, naming, directory structure, and module boundaries mainly because human memory is limited, teams need collaboration, and code must be reused over time. If AI can use longer context, retrieval systems, and automated tests to repeatedly rewrite code, it may not need these familiar engineering conventions as much.
This creates a very real risk: future AI-written software may run, and may even run fast, while becoming increasingly difficult for humans to maintain.
What Programmers Need to Upgrade
ProgramBench is neither simply good news nor simply bad news for programmers.
In the short term, complete software engineering remains hard, and programmers will not lose their jobs immediately because of this benchmark. Architecture judgment, requirements clarification, security control, quality acceptance, and business understanding still need human ownership.
In the long term, programmers’ work will continue to move upward. The most vulnerable people are not those who “cannot write code,” but those who can only write code and cannot define problems, verify results, organize toolchains, or control risk.
Future software engineers may look more like:
- Requirement definers: turning vague business problems into executable goals.
- System validators: judging whether AI-generated results truly satisfy requirements.
- Toolchain organizers: combining models, agents, tests, deployment, and monitoring.
- Quality owners: controlling security, maintainability, edge cases, and long-term risk.
- Translators between business and technology: turning real problems into constraints engineering systems can handle.
If AI really evolves from code assistant to complete software engineer, the value of human programmers will no longer be writing every line by hand. It will be deciding what is worth building, what counts as correct, and where failure is unacceptable.
Summary
ProgramBench’s 0% is not the end. It is the beginning of a new stage.
It shows that today’s large models still cannot reliably rebuild complete software systems from scratch. But it also defines the target for the next generation of AI Coding agents very clearly: from local patches to complete projects, from code snippets to system delivery.
For programmers, it is fine to breathe a little easier in the short term, but dangerous to stare only at “AI still cannot do it.” The more important move is to upgrade from code executor to problem definer, result validator, and risk controller.
The truly unsettling part is not that AI scored 0% today. It is that the exam has now been written.