
A user reported a bug in the Rush app. They clicked a button labeled Report. That click filed a ticket to our Linear board with the relevant application logs, anonymized. No human triaged it.
An agent picked up the ticket, read the logs, traced the data path across four files, and wrote an implementation plan. That plan was posted as a comment on the ticket.
A swarm of three models reviewed the plan. Claude wrote it. Codex and Gemini critiqued it. The original agent revised the plan based on their feedback, then began implementation.
The agent created a branch, wrote the fix, wrote tests, ran them. A second agent reviewed the pull request independently, reading the diff without knowledge of the plan. It found an inconsistent database query filter that wasn't in the original bug report. A third agent fixed that finding, ran the tests again, and labeled the PR qa-ready.
The ticket now shows the original bug report, the agent's plan, a link to the full reasoning trace, the PR, the review, and the fix. Twelve minutes passed between the user clicking Report and the PR being ready to merge.
No engineer was paged. No standup discussed it. No one context-switched.
The code was incidental. The organizational cognition became executable.
The state machine hiding in every organization
Every productive organization runs the same loop. A signal arrives. Someone decides it matters. Someone plans what to do about it. The plan gets challenged. Work happens. The output gets checked. It ships. The shipped result generates new signals. The loop continues.
This loop usually lives in people's heads, in Slack threads, in meetings that could have been emails. It is implicit, undocumented, and fragile. When a key person leaves, part of the loop walks out the door with them.
Look at what those twelve minutes actually produced. Every step of the organizational judgment loop became an explicit, auditable artifact:
- Perception is explicit. The bug report handler creates a Linear ticket with structured logs attached. The system saw something and filed it.
- Hypothesis is explicit. The agent's plan is required as a comment headed "Implementation Plan" before any code is written. The system proposed an interpretation.
- Critique is explicit. Multiple models reviewed the plan before implementation. The system challenged its own hypothesis.
- Memory is explicit. The agent's full reasoning trace is uploaded as a gist and linked from the PR. The system recorded how it thought.
- Judgment is explicit. A separate review agent posted one structured comment covering requirements, impact analysis, test results, and coverage gaps. The system evaluated its own output.
- The exit condition is explicit. A qa-ready label marks the point where the system believes the change is ready. The system declared when it was done.
These are not implementation details of a coding pipeline. These are the data structures of organizational cognition. And the moment they are data structures, they are programmable.
A pattern with six names
While building this system, we kept running into the same constraint. Generating code was fast. Knowing whether the code was correct took all the engineering effort. We assumed this was a problem unique to AI agents. It is not. It is one of the oldest problems in organized work, and every domain solved it independently.
In banking, it is called the Maker-Checker principle, sometimes the Four-Eyes rule. One person initiates a transaction. A second person verifies it before it clears. This practice predates computers by centuries. It is how money has moved safely since double-entry bookkeeping.
In audit and compliance, it is called Separation of Duties. The person who writes the check cannot approve the check. SOX, NIST, and every internal controls framework codifies this.
In manufacturing, Toyota called it Jidoka: build quality in. Any worker on the line can pull the cord and stop production when they detect a defect. The verification step is not downstream. It is embedded in the flow.
In software, we call it code review. One person writes the code, another reads it before it merges. Google requires three independent approval bits: correctness, ownership, and language readability.
Every single one is the same structure: one role produces, another role verifies, and the depth of verification scales with the trust you have in the producer.
This gives us an equation. The net output of any productive system is V = min(P, R) × Q − (1 − Q) × F, where P is the production rate, R is the review rate, Q is quality, and F is the cost of failure. Throughput is bounded by whichever is slower. The depth of review scales with how much you trust the producer and how much a failure costs.
Banking understood this before electricity. Toyota understood it before software. Goldratt gave it math. What AI changes is the ratio. AI made P very large. It barely changed R. Every organization now faces three options:
- Review everything. Throughput equals R. Quality holds, but you are bottlenecked by verification capacity. This is most engineering teams today.
- Skip review. Throughput equals P. Fast, but quality drops. This is the internet right now: more content, lower average quality per item.
- Automate review. Throughput approaches P with quality maintained. This is the goal. It is only possible in domains where verification can be expressed as code.
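The three options can be made concrete with a toy calculation against the equation above. The numbers here are illustrative assumptions, not measurements; the point is the shape of the tradeoff, not the specific values.

```python
def net_output(P: float, R: float, Q: float, F: float) -> float:
    """V = min(P, R) * Q - (1 - Q) * F.

    P: production rate, R: review rate, Q: quality (fraction correct),
    F: cost of one failure. All values are illustrative units.
    """
    return min(P, R) * Q - (1 - Q) * F

# Assumed scenario: AI makes P large, human review capacity R stays small,
# and a shipped failure is expensive.
P, F = 100.0, 300.0

review_everything = net_output(P, R=10.0, Q=0.99, F=F)  # bottlenecked by R
skip_review       = net_output(P, R=P,    Q=0.70, F=F)  # fast, quality drops
automate_review   = net_output(P, R=90.0, Q=0.95, F=F)  # R approaches P
```

With these numbers, skipping review is net negative (the failure term dominates), reviewing everything holds quality but caps throughput at R, and automating review is the only option where V scales with P.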
The third option explains why software is transforming first. Tests are automated reviewers. A test suite is a verification queue that scales for free. When you can write assert result == expected, you have turned organizational judgment into an executable program.
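Concretely, a single unit test is a reviewer that runs for free on every change. This toy example (the function and its expected value are invented for illustration) shows verification written in the same language as the product:

```python
def apply_discount(price: float, fraction: float) -> float:
    """Hypothetical production code: apply a fractional discount."""
    return round(price * (1 - fraction), 2)

def test_apply_discount():
    # An automated reviewer: runs in milliseconds, never tires, scales for free.
    result = apply_discount(100.0, 0.15)
    expected = 85.0
    assert result == expected  # organizational judgment, executable

test_apply_discount()
```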
The evidence
The production/verification gap is measurable across domains.
Academic research. arXiv submissions grew 47% in 2024 and 61% in 2025, five to seven times faster than the historical 8-9% annual growth in global publications (Zhang et al., 2025). Peer review cycle times remained flat: 12 to 14 weeks for medical and natural sciences, 25 or more weeks for economics and business (Drozdz & Ladomery, 2024). The top 10% of reviewers handle 50% of all reviews. ACL, one of the largest NLP conferences, saw submissions grow 8.7x from 2015 to 2024 while acceptance rates fell from 30% to 19.6% (Ma et al., 2025).
The consequences are showing. Retraction rates in stem cell research peaked at 0.84% in 2023, a 16 to 84x increase from historical averages (Song et al., 2025). Papers generated by SCIgen, a nonsense paper generator, passed peer review and were published. 96% of citations to retracted papers fail to mention the retraction (Schneider et al., 2020).
An honest caveat. Not all of this is caused by AI. COVID-era research funding surges, geographic expansion of the research workforce, and institutional publish-or-perish pressure were already straining the system before ChatGPT existed. The more accurate framing: exponential content growth from multiple causes has exposed the fundamental inability of elite gatekeeping systems to scale. AI accelerated a crisis that was already forming.
Software. The production side exploded. GitHub hosts hundreds of millions of repositories. npm processes millions of package publishes per year. Cursor crossed $2 billion in annual recurring revenue. Lovable went from zero to $400 million ARR in eight months.
The verification side is where it gets interesting. For code, automated testing means R can scale with P. Our pipeline demonstrates this: the review agent runs the test suite, checks coverage on changed lines, traces impact across related files. The exit condition is not "a human approved it." The exit condition is "tests pass, coverage is adequate, no critical gaps, no requirement is unaddressed." These are executable criteria.
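A minimal sketch of what such an exit condition might look like as a predicate. The field names and the coverage threshold are illustrative assumptions, not our pipeline's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReviewResult:
    # Hypothetical fields a review agent might populate.
    tests_passed: bool
    changed_line_coverage: float  # fraction of changed lines hit by tests
    critical_gaps: List[str] = field(default_factory=list)
    unaddressed_requirements: List[str] = field(default_factory=list)

def qa_ready(r: ReviewResult, min_coverage: float = 0.8) -> bool:
    """Exit condition as code: not 'a human approved it', but a
    conjunction of executable criteria."""
    return (r.tests_passed
            and r.changed_line_coverage >= min_coverage
            and not r.critical_gaps
            and not r.unaddressed_requirements)
```

A PR with passing tests and 90% changed-line coverage clears the predicate; one with an open critical gap does not, no matter how fast it was produced.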
This is why five companies building autonomous coding tools have crossed $100 million in annual revenue while autonomous research tools and autonomous medical tools remain largely theoretical. The difference is not the model. The difference is that software has assert.
Why software is first, not last
Software outputs are uniquely inspectable. You can diff them. You can test them. You can run them and observe the result. You can have a second program check the first program's output. No other domain has this property at the same level.
In medicine, verification is a 12-year clinical trial with a 90% failure rate. The cost of failure is catastrophic. No amount of AI acceleration on the production side changes the verification constraint.
In legal work, verification requires a senior partner who has practiced for decades to read a contract and judge whether it protects the client. That judgment cannot yet be expressed as assert.
In research, verification is three overworked professors reading a paper for free over six months. The system was never designed to handle exponential growth. Peer review is intentionally selective. The 12-to-25-week timeline is not a bug. It is a prestige mechanism that creates scarcity.
In each case, F is high and R cannot be automated. The equation predicts this directly. When (1 − Q) × F is large and R is fixed, increasing P does not increase V. It increases the size of the queue.
Software is not the last domain AI will transform. It is the first, because it is the one domain where the verification queue can be written in the same language as the production output.
The firm as a state machine
The deeper argument is not that agents can write code. It is that organizations are becoming programmable.
An AI factory is a system that turns signals into verified change through explicit, auditable loops. Not just task to code, but: signal to intent to plan to critique to implementation to verification to shipped change to new signal.
The coding pipeline we built is one instantiation. But the pattern does not require code as the output. It requires signals that can be captured digitally, intent that can be expressed as a structured task, a plan that can be critiqued before execution, execution that produces an inspectable artifact, verification with explicit criteria, and an auditable trace of how the system decided to act.
When all six exist, the organizational judgment loop is software. It can be versioned, debugged, optimized, and measured. The unit of competition shifts from headcount to loop speed: how fast can the organization go from signal to verified adaptation?
A task board like Linear is not the theory. It is one interface for an explicit organizational state machine. The webhook that triggers an agent is a state transition. The label that marks a PR ready is an exit condition. The reasoning trace is the audit log.
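The loop can be written down literally. This sketch uses hypothetical state names to show the shape; a real implementation would hang webhooks, agents, and labels off these transitions:

```python
from enum import Enum, auto

class Stage(Enum):
    SIGNAL = auto()
    INTENT = auto()
    PLAN = auto()
    CRITIQUE = auto()
    IMPLEMENTATION = auto()
    VERIFICATION = auto()
    SHIPPED = auto()

# Each edge is a state transition a webhook could fire on.
# VERIFICATION -> SHIPPED is the exit condition a qa-ready label marks.
TRANSITIONS = {
    Stage.SIGNAL: Stage.INTENT,
    Stage.INTENT: Stage.PLAN,
    Stage.PLAN: Stage.CRITIQUE,
    Stage.CRITIQUE: Stage.IMPLEMENTATION,
    Stage.IMPLEMENTATION: Stage.VERIFICATION,
    Stage.VERIFICATION: Stage.SHIPPED,
    Stage.SHIPPED: Stage.SIGNAL,  # shipped change generates new signals
}

def advance(stage: Stage) -> Stage:
    return TRANSITIONS[stage]
```

The loop closes: seven transitions from SIGNAL return to SIGNAL, which is the point. A shipped change is not an endpoint but the next input.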
Every organization already runs this loop. Most run it implicitly, in meetings and Slack threads and institutional memory that evaporates when people leave. The programmable organization runs it explicitly, in data structures that persist, compound, and improve.
When the loop breaks
This system fails in specific, instructive ways.
The agent declares done too early. It writes code that compiles, passes the tests it wrote, and opens a PR. But the tests are shallow. They verify the happy path and miss the edge case that caused the original bug. The review agent catches some of these, but not all. When both the implementer and the reviewer share the same blind spot, the defect ships.
The plan is wrong. The agent reads the logs, misdiagnoses the root cause, and writes a plan that fixes the wrong thing. The swarm critiques the plan, but if the misdiagnosis is plausible, the critique confirms it. Three models agreeing does not mean they are right. It means the error was convincing.
The fix introduces a new bug. The third agent addresses the review findings, but its fix touches code the original agent did not touch. No test covers the new path. The test suite passes. The defect is invisible until a user encounters it.
Each failure has the same shape: the verification was not deep enough for the complexity of the change. The equation captures this directly. When Q drops and F is significant, the (1 − Q) × F term dominates. You shipped faster and lost more than you gained.
The lesson is not that autonomous loops are unreliable. The lesson is that verification depth must be proportional to the cost of failure. A label change can auto-merge after lint passes. An architecture change requires a human who understands the system. The pipeline is not a replacement for judgment. It is a way to make the judgment explicit, so you can decide where to apply it.
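That policy, verification depth proportional to the cost of failure, can itself be made explicit. The tiers and thresholds below are illustrative assumptions an organization would calibrate for itself:

```python
def review_depth(failure_cost: float) -> str:
    """Map an estimated cost of failure F to a review tier.
    Thresholds are illustrative, not prescriptive."""
    if failure_cost < 10:
        return "auto-merge after lint"        # e.g. a label change
    if failure_cost < 1000:
        return "agent review + full test suite"
    return "human review required"            # e.g. an architecture change
```

The pipeline does not remove the human from the third tier. It routes the scarce resource, human judgment, to the changes where the (1 − Q) × F term is largest.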
What this means
The important invention is not the coding agent. It is the harness that makes work legible, reviewable, and governable. The harness is the organizational operating system. The agent is a worker that plugs into it.
One consequence follows from the equation and the pattern, and it follows with near-certainty: every domain will build its own version of assert.
The equation demands it. If V = min(P, R) × Q − (1 − Q) × F, and AI has made P large, then the only way to increase V is to increase R. The only way to increase R at scale is to automate verification. Software did it with test suites. The domains that have not done it yet are stuck at their current R, watching the queue grow.
Research will develop automated replication checks and statistical verification. Legal will develop compliance engines that read contracts against regulatory databases. Medicine will develop accelerated trial simulations that compress the 12-year timeline. Finance already has real-time reconciliation. Each domain will find its own language for assert, and the domains that find it fastest will be the first to see their R scale with P.
This is not a prediction about technology. It is arithmetic. When production is cheap and verification is the constraint, the organizations that automate verification win. The rest are bottlenecked by a queue that grows every day.
The equation V = min(P, R) × Q − (1 − Q) × F synthesizes concepts that appear independently across banking (Maker-Checker), audit (Separation of Duties), manufacturing (Toyota's Jidoka), operations research (Goldratt's Drum-Buffer-Rope, 1984), and quality management (Juran's Cost of Quality). The individual components are centuries old. Their unification as a single model for AI-era organizational throughput, connecting the production/verification gap across domains, appears to be new. We welcome pointers to prior work that we missed.
Data sources: Zhang et al. (2025), SSRN #5287265; Drozdz & Ladomery (2024), British Journal of Biomedical Science, Vol. 81; Song et al. (2025), BMC Medicine; Ma et al. (2025), arXiv:2512.04448; Schneider et al. (2020), cited in Drozdz & Ladomery (2024).


