Deterministic vs probabilistic code generation

May 17, 2026

Bun recently vibe-coded a million-line change to their codebase, converting it from Zig to Rust. While they might see this as a magical win, I see it as the collapse of software engineering.

Deterministic code generation

A deterministic system, when given the same set of inputs, will perform the same operations. Programming languages are largely deterministic. There are some languages which allow for undefined behaviour, but on the whole every time code is run, it operates in the same way. Uncertainty or confusion in behaviour leads to bugs, often security bugs.

There are automated yet deterministic ways to convert code from one language to another.

The majority of languages I create have transpiling support out of the box. Derw can produce JavaScript, TypeScript, or English. Tegan can produce JavaScript or Go. Mojie can produce JavaScript, Python, or English[1]. json-to-elm produced JSON parsing code for multiple versions of Elm, with optional library support.

These are all based on building ASTs by parsing code and producing new code based on the AST. This is a deterministic process: given the same input, you’ll get the same output.
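As a rough sketch of the idea, and not the actual internals of Derw, Tegan, or Mojie, the following walks a tiny hand-built AST and emits source text for two targets. The `Expr` and `FunctionDecl` types and the `emitJavaScript` and `emitPython` functions are hypothetical names made up for this example, and the parsing step is left out to keep it short.

```typescript
// Hypothetical mini-AST for illustration only.
type Expr =
    | { kind: "number"; value: number }
    | { kind: "variable"; name: string }
    | { kind: "add"; left: Expr; right: Expr };

type FunctionDecl = {
    name: string;
    params: string[];
    body: Expr;
};

// Emit an expression as source text. Both targets share this expression syntax.
function emitExpr(expr: Expr): string {
    switch (expr.kind) {
        case "number":
            return String(expr.value);
        case "variable":
            return expr.name;
        case "add":
            return `${emitExpr(expr.left)} + ${emitExpr(expr.right)}`;
    }
}

// Emit the function as JavaScript source.
function emitJavaScript(fn: FunctionDecl): string {
    return `function ${fn.name}(${fn.params.join(", ")}) {\n    return ${emitExpr(fn.body)};\n}`;
}

// Emit the same function as Python source.
function emitPython(fn: FunctionDecl): string {
    return `def ${fn.name}(${fn.params.join(", ")}):\n    return ${emitExpr(fn.body)}`;
}

// The AST a parser might produce for an `add` function.
const addAst: FunctionDecl = {
    name: "add",
    params: ["a", "b"],
    body: {
        kind: "add",
        left: { kind: "variable", name: "a" },
        right: { kind: "variable", name: "b" },
    },
};

console.log(emitJavaScript(addAst));
console.log(emitPython(addAst));
```

There is no randomness anywhere in the emitters, so calling them twice on the same AST gives byte-for-byte identical output. A real transpiler has far more node types, but the property is the same.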

Deterministic tooling has existed for years. Python’s 2to3 is a well-known example: it automates conversion from Python 2 to Python 3 in a deterministic way. The same Python 2 script run through 2to3 will produce the same Python 3 script. Languages that transpile, like Elm, PureScript, and TypeScript, all target JavaScript and produce the same JavaScript each time. That makes them predictable.

Deterministic systems have a forced structure: they will be consistent. Consistency is crucial in technical systems. If a bug is consistent, we can fix it. If a bug is inconsistent, it becomes far more difficult to fix. Simply reproducing an inconsistent bug will take more time. It is the role of software engineering to make systems consistent. Even small inconsistencies can lead to severe damage to a system.

Even with deterministic code generation, I still do not trust the process to be fully automated. There will always be edge cases. It still requires validation and correction.

Probabilistic code generation

Generative AI takes an input and produces an output. However, that output varies. Sometimes it’s A, other times it’s B. This introduces uncertainty into the process. It is no longer consistent. Code generally should be predictable. APIs should be intuitive. It is impossible to build intuition about LLM-generated code which you did not review, because it could be different each time.
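To make the contrast concrete, here is a toy sketch of sampling. It is a deliberate simplification: a real model samples token by token rather than choosing between whole function bodies, and the `candidates` list, its probabilities, and the `sample` function are all invented for illustration.

```typescript
// Hypothetical candidate outputs with made-up probabilities.
type Candidate = { code: string; probability: number };

const candidates: Candidate[] = [
    { code: "return a + b;", probability: 0.6 },
    { code: "let z = a; z += b; return z;", probability: 0.3 },
    { code: "return a - b;", probability: 0.1 }, // a buggy output
];

// Pick a candidate by walking the cumulative distribution.
// `random` is injectable so the behaviour can be pinned down in a test.
function sample(options: Candidate[], random: () => number = Math.random): string {
    const roll = random();
    let cumulative = 0;
    let last = "";
    for (const option of options) {
        cumulative += option.probability;
        last = option.code;
        if (roll < cumulative) {
            return option.code;
        }
    }
    // Guard against floating point rounding at the top of the range.
    return last;
}

// Same input every time, yet the output can differ from run to run.
for (let run = 1; run <= 3; run++) {
    console.log(`run ${run}: ${sample(candidates)}`);
}
```

The input never changes between runs, but the output depends on the random roll. With these made-up weights, roughly one run in ten would return the buggy subtraction.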

I created neuro-lingo 3 years ago: a programming language where a human only writes function signatures and comments, and the implementation code is entirely generated by LLMs.

function add(a: number, b: number): number {
    // Add two numbers together
}

function main() {
    // Print "Hello World" to the console
    // Print the result of add(2, 3)
}

An example from neuro-lingo.

Every time neuro-lingo is compiled, the code is generated afresh by the LLMs. It’s slightly different each time. Sometimes it introduces bugs. Sometimes it’s clean and simple. Sometimes it’s chaotic. Neuro-lingo was intended as a parody, but fully AI-driven flows for producing code are doing the exact same thing.

When code is shipped, humans are accountable for that code. Not always legally, but morally and ethically. While open source licenses intentionally provide no warranty, the fact remains: code which is pushed into the open source ecosystem has an impact on the industry. Both in open source, and in corporate enterprises. The Harvard Business Review estimated the economic worth of open source to be $8.8 trillion.

It is not possible for a human to review 1 million lines of changes in 9 days. Let’s be clear about that: Bun has not reviewed the code they have merged to master.

The “there are tests” fallacy

Tests have never been enough to single-handedly measure the quality of code. Consider SQLite, widely considered to be the most tested codebase:

As of version 3.42.0 (2023-05-16), the SQLite library consists of approximately 155.8 KSLOC of C code. (KSLOC means thousands of “Source Lines Of Code” or, in other words, lines of code excluding blank lines and comments.) By comparison, the project has 590 times as much test code and test scripts - 92053.1 KSLOC.

They list a wide range of different tests they have:

Four independently developed test harnesses
100% branch test coverage in an as-deployed configuration
Millions and millions of test cases
Out-of-memory tests
I/O error tests
Crash and power loss tests
Fuzz tests
Boundary value tests
Disabled optimization tests
Regression tests
Malformed database tests
Extensive use of assert() and run-time checks
Valgrind analysis
Undefined behavior checks
Checklists

And yet they do not automate the entire process. They create tools for humans to review and verify changes.

The release checklist is not automated: developers run each item on the checklist manually. We find that it is important to keep a human in the loop. Sometimes problems are found while running a checklist item even though the test itself passed. It is important to have a human reviewing the test output at the highest level, and constantly asking “Is this really right?”

Tests are simply one tool to help teams build systems. It is not enough to depend entirely on tests. While tests may verify some behaviours, they don’t capture the design of a system.

function add(x: number, y: number): number {
    let z = x;
    z += y;
    return z;
}

The above function will pass all tests for add. And yet, it’s badly designed. We can do better. So let’s simplify it:

function add(x: number, y: number): number {
    return x + y;
}

The simpler version will pass the exact same test suite. Tests did not capture that the design of the first was strange. There are deterministic tools which can help, like linters. But this is a trivial example, and the real complexity starts to grow as the codebase grows. Every little change can lead to death by a thousand cuts.
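As a small illustration, both versions of add can be run against one shared suite. The names `addVerbose`, `addSimple`, `cases`, and `passesSuite` are hypothetical, chosen for this sketch, and the test cases are invented.

```typescript
type Add = (x: number, y: number) => number;

// The strangely designed version.
const addVerbose: Add = (x, y) => {
    let z = x;
    z += y;
    return z;
};

// The simplified version.
const addSimple: Add = (x, y) => x + y;

// Hypothetical test cases: [x, y, expected].
const cases: Array<[number, number, number]> = [
    [2, 3, 5],
    [0, 0, 0],
    [-1, 1, 0],
    [0.5, 0.25, 0.75],
];

// Returns true only if every case passes for the given implementation.
function passesSuite(add: Add): boolean {
    return cases.every(([x, y, expected]) => add(x, y) === expected);
}

console.log(passesSuite(addVerbose)); // true
console.log(passesSuite(addSimple)); // true
```

The suite only observes inputs and outputs, so it cannot tell the two designs apart. Telling them apart takes something that reads the code itself: a linter, or a human reviewer.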

The uncertainty introduced by using AI to convert between languages makes the code unpredictable. Without the review of a human, the design of the system drifts from intentional architecture into random chance. No human is intimately familiar with Bun’s codebase now. No number of tests can replace that. DORA provides a long list of metrics that, when used in combination, tell a story about the health of a project.

DORA calls out a related common pitfall:

Having one metric to rule them all. Attempting to measure complex systems with the idea that only one metric matters. Teams should identify multiple metrics, including some with a healthy amount of tension between them

So no, tests alone are not enough.

The “humans make mistakes too” distraction

Yes, they do. And they learn. They adapt. They do things differently in the future. Nobody is debating whether humans are perfect. They aren’t. But AI is just a tool, and humans are responsible for the tools they use. No amount of automated AI workflows absolves the human from being responsible.

Leaders in particular should be concerned, because when people on the floor are automating things they shouldn’t, the organisations or companies themselves are at risk: financially, legally, or in terms of security. In two years’ time, we’ll see a true boom of consultancies that specialize in cleaning up messes created by automated coding flows. Or, society will accept that software is more dangerous, more risky, and less reliable, with nobody understanding the code they ship to customers.

Do better

Open source is an ecosystem under stress. Between GitHub’s severe drop in reliability[2], AI drive-by slop pull requests, and supply chain attacks, open source is changing. Even collapsing. Interacting with open source communities has been how I’ve spent a considerable portion of my career and personal time. I’ve never felt so pessimistic about the future of open source. And when a widely used project abandons software engineering in favour of vibes, what hope is there?

There is hope. But we need humans to stop treating bad practices as if they’re something that can be justified. We have mechanical tools that can provide deterministic automation. Probabilistic generation of code should make everyone worried.

[1] I created Mojie specifically for my “Computer Science For Vibe Coders” blog series.

[2] A self-created problem.

The Tech Enabler
