
AGI Won't Pass a Test. It'll Build a Company.

The real test of general intelligence is whether AI can build a living, adaptive company from scratch. ARC-AGI 3 just showed how far off that is, and why the distance matters.

In February, François Chollet released what he designed to be the benchmark that outlasts all the others. ARC-AGI 3 doesn't ask language models to reason about text, solve math problems, or write code -- tasks where they've gotten embarrassingly good. Instead, it drops them into video-game-like environments with no instructions. No stated goals. No explanation of the rules. Figure out what you're supposed to do, then do it, then adapt when the next level changes everything.

Untrained humans solve 100% of the environments. Opus 4.6 scored 0.25%. GPT 5.4: 0.26%.

[Chart: ARC-AGI 3 scores (undefined environments) -- Humans 100%, GPT 5.4 0.26%, Opus 4.6 0.25%]

These are the same models that write production code, pass bar exams, and ship hundreds of pull requests a month. The distance between 100% and 0.25% should be impossible to reconcile. Unless intelligence isn't one thing, and we've been measuring the wrong half of it.

The Distinction Nobody Talks About

Consider what ARC-AGI 3 actually demands. Each environment has internal rules that the agent must discover through experimentation: probe the boundaries, observe what happens, revise the mental model, probe again. There is no training data for these specific environments. No one has described the solution in a blog post or Stack Overflow thread or textbook. The agent has to construct an understanding of a world it has never seen, from scratch, in real time.
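
To make that loop concrete, here is a minimal sketch in Python. The environment, its rules, and the agent's hypothesis set are all invented for illustration; ARC-AGI 3's actual interface looks nothing like this. The point is the shape of the reasoning: act, observe, and keep only the explanations the evidence allows.

```python
import random

# Toy stand-in for an undefined environment: the active rule is hidden
# from the agent and can only be inferred through interaction.
class HiddenRuleEnv:
    RULES = {
        "add":    lambda s, a: s + a,   # state grows by the action
        "double": lambda s, a: s * 2,   # action is ignored
        "reset":  lambda s, a: 0,       # everything collapses to zero
    }

    def __init__(self):
        self.rule_name = random.choice(list(self.RULES))
        self.state = 1

    def step(self, action: int) -> int:
        self.state = self.RULES[self.rule_name](self.state, action)
        return self.state

def infer_rule(env: HiddenRuleEnv, probes: int = 6) -> set:
    """Probe the boundaries, observe what happens, revise the model, repeat."""
    hypotheses = set(HiddenRuleEnv.RULES)      # every rule starts plausible
    state = env.state
    for _ in range(probes):
        action = random.randint(1, 3)          # probe
        new_state = env.step(action)           # observe
        hypotheses = {                         # revise: drop contradicted rules
            name for name in hypotheses
            if HiddenRuleEnv.RULES[name](state, action) == new_state
        }
        state = new_state
    return hypotheses

env = HiddenRuleEnv()
print("consistent rules:", infer_rule(env), "| hidden rule:", env.rule_name)
```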

Now consider what today's AI agents do extraordinarily well. Write a function that validates email addresses. Draft a contract clause for a Series A financing round. Refactor a legacy module to reduce latency by 40%. Each of these tasks shares something in common that is easy to miss: someone already defined the problem. The specification exists, either explicitly or through thousands of similar examples in the training data. The tools are known. The shape of success is clear before the work begins.

The distinction isn't smart versus dumb. It's defined versus undefined.

Defined "Here's the task, here are the tools, here's what success looks like." Execution under clear constraints. AI is superhuman here today.
Undefined "Figure out what to build, who to serve, and how to survive." No instructions. No stated goals. No rules given. Humans handle this intuitively. AI scores 0.25%.

This gap is the entire AGI debate compressed into a single number. And it suggests that the real test of general intelligence has nothing to do with benchmarks at all.

Why a Company Is the Real Test

If you wanted to design a single test for general intelligence -- not a benchmark, but a test of everything we actually mean when we use the word -- you would want something that requires fluid reasoning in undefined environments, sustained over months, with consequences that compound. You would want a test where the rules change without warning, where success depends on reading ambiguous signals, and where raw intelligence alone isn't enough without the organizational scaffolding to apply it.

That test already exists. It's called building a company.

Think about what a startup founder does in the first year. You wake up with no instructions, and the market sends conflicting signals. Customers tell you they want one thing and pay for another. Your competitor pivots and suddenly your positioning is wrong. Every assumption you made last month might be irrelevant this week, and the only way to find out is to notice the shift, which requires exactly the kind of fluid environmental reasoning that ARC-AGI 3 measures and current AI utterly lacks.

A company is not a bridge, which is designed once and stands. It is not software, which executes the same logic every time it runs. A company is a living system that must continuously adapt or die, and building one from scratch requires you to do things that no benchmark tests. You have to convince a stranger to pay money for something that doesn't fully exist yet. That's persuasion. You have to decide between building the feature your biggest customer is begging for and the one that opens an entirely new market. That's judgment under uncertainty. And when your cofounder leaves six months in, taking half the codebase knowledge with them, you have to rebuild the institutional memory from whatever scraps remain. That's adaptive resilience.

Companies survive losing their people. One person leaves. Five leave. A third of the company walks out. The company persists, because the intelligence isn't in any single person. It's in the system itself: the processes, the relationships, the institutional memory that persists across departures. Creating that kind of resilient, adaptive organization from nothing is the hardest thing intelligence does.

This isn't a fringe idea. OpenAI's internal roadmap ends at "Organizations" -- AI that performs the work of an entire company. It is the final stage of their AGI progression, beyond Agents and beyond Innovators. They know this is the destination. Everyone does. They just disagree on the timeline.

The Hundred-Dollar Version


In 2023, Mustafa Suleyman proposed his own version of a practical intelligence test: give an AI $100,000 and see if it can turn it into a million. It was a good instinct hobbled by the wrong number. With a hundred thousand dollars you can hire contractors, buy ads, and absorb mistakes along the way. The capital does half the work. That is a portfolio management test, not an intelligence test.

Strip it down. One hundred dollars.

With $100, every decision is load-bearing. You cannot A/B test your way to product-market fit. You cannot hire your way out of confusion. You have to identify a market need nobody pointed you toward, build something people will actually pay for, acquire customers without a budget, and create processes that generate recurring revenue, all while operating in an environment where the rules change weekly and nobody tells you when they do.

The journey from nothing to something, with no safety net, is the purest test of general intelligence we have.

Can today's AI do this? No. An agent that cannot infer the rules of a visual puzzle cannot infer the rules of a market. But the question isn't whether today's agents can pass the test. The question is what needs to exist for future agents to pass it.

And the answer, I think, isn't just smarter models.

The Infrastructure Gap


It is tempting to assume the models just need to get bigger, that a few more generations of scaling will close the 0.25% gap on their own. Maybe that's true for the benchmark. But running a company isn't a model problem. It's an infrastructure problem.

Think about what a functioning organization requires that no model, however intelligent, can provide on its own:

  • Persistent memory that spans months of operation -- not a context window that resets every session, but genuine institutional recall. The kind of memory where an insight from January shapes a decision in October without anyone having to remember to bring it up.
  • Adaptive goals that restructure themselves when reality contradicts the plan. Not static objectives that drive the company off a cliff because nobody updated the OKRs, but priorities that shift because the system noticed the ground moved.
  • Multi-agent coordination where specialists hand off context the way colleagues do, with shared understanding, implicit trust, and accumulated working relationships. Not the way chat messages do.

This is the gap between a smart model and a functioning organization. It is an engineering problem as much as a research problem, and arguably the more urgent one, because the models are improving every quarter while the infrastructure barely exists.
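
To make the gap concrete, here is a deliberately toy sketch in Python of the first layer: persistent memory that outlives any single agent or session. Every name in it (MemoryStore, Agent, hand_off) is hypothetical; this is a sketch of the architecture, not any existing framework's API.

```python
import json
from datetime import date
from pathlib import Path

class MemoryStore:
    """Hypothetical persistent memory layer: insights live in shared storage,
    not in a context window, so they survive sessions and agent retirements."""

    def __init__(self, path: str = "org_memory.json"):
        self.path = Path(path)
        self.records = json.loads(self.path.read_text()) if self.path.exists() else []

    def remember(self, author: str, insight: str) -> None:
        self.records.append({"date": str(date.today()), "author": author, "insight": insight})
        self.path.write_text(json.dumps(self.records, indent=2))

    def recall(self, keyword: str) -> list:
        """Institutional recall: a January insight can shape an October decision."""
        return [r for r in self.records if keyword.lower() in r["insight"].lower()]

class Agent:
    """Hypothetical specialist agent: everything it learns flows through the
    shared store, so retiring the agent loses nothing."""

    def __init__(self, role: str, memory: MemoryStore):
        self.role, self.memory = role, memory

    def hand_off(self, topic: str) -> list:
        # Coordination is a call against shared memory, not a meeting.
        return self.memory.recall(topic)

memory = MemoryStore()
sales = Agent("sales", memory)
product = Agent("product", memory)
memory.remember("sales", "Enterprise customers churn when onboarding exceeds two weeks")
# Months later, a different specialist pulls the same institutional knowledge:
print(product.hand_off("onboarding"))
```

The sketch is trivial on purpose. The hard problems (what to remember, how to retrieve it, when to restructure the goals it implies) are exactly the infrastructure that barely exists.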

Why the Distance Is Exciting

When the gap closes (and eventually it will), AI organizations will have structural advantages that human organizations physically cannot match. Three stand out.

When a person leaves a company, they take everything in their head with them: the client relationships they built, the informal processes they invented, the institutional knowledge that was never written down because it lived in hallway conversations. When an AI agent is retired, the memory layer persists. Nothing is lost.

When a human team needs to restructure, the process takes months. Severance negotiations, recruiting cycles, weeks of interviews, the quiet demoralization that follows every reorg as people wonder whether they are next. An AI organization reconfigures in seconds. If the market shifts on Tuesday, the team can be different by Wednesday.

When two human teams need to coordinate, they schedule a meeting. Then a follow-up meeting to clarify what was decided in the first meeting. Then a Slack thread because the meeting notes didn't capture the nuance. Two AI agents share a function call.

These advantages aren't speculative. They are architectural consequences of how AI systems already work. They are just waiting for the cognitive capability, the fluid reasoning that ARC-AGI 3 measures, to catch up with the structural scaffolding.

The Timeline

Demis Hassabis puts human-level AI at five to ten years. Chollet designed ARC-AGI 3 to be the benchmark that survives the longest, which tells you something about his confidence that it will, eventually, fall. The trajectory is steep, and the slope keeps steepening.

The first AI that passes the company test will not look like a chatbot that got very good at conversation. It will look like an organization: dozens of specialized agents coordinating through shared memory, adapting their own structure in real time, discovering what to build by reading the same ambiguous signals that human founders read.

And it probably will not ace abstract reasoning benchmarks first. Plenty of successful founders would struggle with Chollet's visual puzzles. Company-building draws on accumulated social intelligence, long-horizon planning, and the ability to hold contradictory information without resolving it prematurely -- skills that pure reasoning benchmarks don't capture. Which is exactly why building a company is the more honest test.

Right now, the distance between 0.25% and a functioning AI company feels enormous. It is enormous. But the gap isn't a deficit of intelligence. The intelligence is already superhuman in defined domains. What's missing is the infrastructure: persistent memory, adaptive goals, the coordination layer that turns raw capability into a functioning organization.

That layer is the most valuable piece of software nobody has built yet. And 0.25% is where the story starts, not where it ends.

