
Nobody can agree on what AGI means. That's the tell.
The labs call it a sloppy term. The researchers say 5-10 years. Francois Chollet just launched ARC-AGI 3, and every frontier model scored below 1%. Meanwhile, those same models write production code, pass bar exams, and ship hundreds of pull requests a month.
So which is it? Superhuman or subhuman?
Both. It depends entirely on what you're measuring. And that's the problem -- we've been measuring the wrong thing.
Benchmarks Test the Wrong Thing
Every benchmark that was supposed to settle the AGI question has either been crushed and dismissed or has revealed a gap nobody expected.
ARC-AGI 3 is the latest. Unlike previous benchmarks that tested static pattern matching, it drops agents into video-game-like environments with no instructions, no stated goals, no rules. Figure out what you're supposed to do. Then do it. Then adapt when the next level changes everything.
Untrained humans solve 100% of the environments. The best AI scored 12.58% -- and it wasn't even a language model. It was a CNN with structured search. Frontier LLMs? Opus 4.6: 0.25%. GPT 5.4: 0.26%.
The same agents that write production code and land hundreds of pull requests per month cannot figure out the rules of a simple visual environment that any human navigates intuitively.
This is important. Not because it proves AI is dumb -- it obviously isn't. But because it shows that current AI is extraordinarily capable in defined environments and nearly helpless in undefined ones.
And that distinction is everything.
Defined vs. Undefined
Today's AI agents excel when someone defines the problem. Here's the task, here are the tools, here's what success looks like. Go. Under those conditions, they're superhuman.
But the most impressive thing humans do isn't executing defined tasks. It's operating in undefined environments. Waking up with no instructions. Reading ambiguous signals. Inferring what matters. Deciding what to build, who to serve, and how to survive -- without anyone telling you the rules.
That's what ARC-AGI 3 tests, stripped to its purest form. And that's what building a company demands every single day.
The distance between these two is the distance between a powerful tool and a general intelligence.
The Real Test
If you wanted one test for general intelligence -- not a benchmark, but a test of everything we actually mean when we use the word -- it would be this: build a company from scratch.
A company is the most complex thing humans build. Not a bridge -- a bridge is designed once and stands. Not software -- software executes the same logic every time. A company is a living system that must adapt continuously or die.
Building one requires every cognitive capability at once. Strategy. Persuasion. Resource allocation. Judgment under uncertainty. Adaptation to novelty. And the creation of structures that outlast any single person or decision. It requires genuine creativity and fluid reasoning and the sustained execution that no benchmark measures.
Here's what makes it an especially honest test: companies survive losing their people. One person leaves. Five leave. A third of the company walks out. The company keeps going. The intelligence isn't in any single person. It's in the system -- the processes, the relationships, the institutional memory that persists. Creating that kind of resilient, adaptive system from nothing is the hardest thing intelligence does.
This isn't a fringe idea. OpenAI's internal roadmap ends at "Organizations" -- AI that performs the work of an entire company. It's the final stage, beyond Agents and beyond Innovators. They know this is the destination. Everyone does. They just disagree on the timeline.
The $100 Version
Make it concrete. Give an AI $100. Not $10 million -- a hundred dollars.
Mustafa Suleyman proposed something similar in 2023: give an AI $100,000 and see if it can turn it into $1 million. But that's a portfolio manager test, not an intelligence test. With $100K you can hire contractors, buy ads, absorb mistakes. The capital does half the work.
The real test is closer to what the best founders do: start with almost nothing and build a unicorn. Not because the billion-dollar outcome matters, but because the journey from nothing to something -- with no safety net -- is the purest test of general intelligence we have.
With $100, every decision is load-bearing. What to build. Who to serve. What to skip entirely. The AI has to identify a market need nobody pointed it toward, build something people will pay for, acquire customers, and create processes that generate recurring revenue.
Not through trading. Not through gambling. Through the unglamorous work of building foundations -- the kind that lets a business survive its first year and compound from there.
Can today's AI do this? No. An agent that can't infer the rules of a visual puzzle can't infer the rules of a market. But the question isn't whether today's agents can do it. The question is what needs to exist for future agents to do it.
And the answer isn't just smarter models.
What's Missing Isn't Intelligence
The models get smarter every quarter. The raw reasoning keeps improving. But running a company isn't a model problem. It's an infrastructure problem.
Think about what a functioning company needs that no model provides:
- Persistent memory across agents, sessions, and months of operation -- not just a context window but genuine institutional memory
- Adaptive goals that evolve as conditions change -- not static objectives but priorities that restructure themselves when reality contradicts the plan
- Multi-agent coordination where specialists hand off context the way colleagues do, not as disconnected chat messages
- World models that let agents understand cause and effect in their environment, not just pattern-match on text
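To make the first three requirements concrete, here is a toy sketch of agents coordinating through a persistent shared memory layer. Every class and method name is invented for illustration; this is a minimal thought experiment, not a real framework.

```python
import time
from dataclasses import dataclass

class SharedMemory:
    """Institutional memory: one store that every agent reads and writes."""
    def __init__(self):
        self._facts = {}
        self._log = []  # (timestamp, agent, note) -- an audit trail

    def remember(self, agent, key, value):
        self._facts[key] = value
        self._log.append((time.time(), agent, f"set {key}"))

    def recall(self, key, default=None):
        return self._facts.get(key, default)

    def snapshot(self):
        # Full copy of everything the organization currently knows.
        return dict(self._facts)

@dataclass
class Agent:
    name: str
    role: str
    memory: SharedMemory

    def hand_off(self, other, task):
        """Hand off a task with full shared context -- a function call, not a meeting."""
        context = {"task": task, "from": self.name, "known": self.memory.snapshot()}
        return other.receive(context)

    def receive(self, context):
        # The receiving agent starts with everything the organization knows,
        # and records its acceptance back into the shared store.
        self.memory.remember(self.name, f"accepted:{context['task']}", True)
        return {"by": self.name, "task": context["task"], "done": True}

# Usage: two specialists coordinating through one memory layer.
memory = SharedMemory()
research = Agent("research", "find the market need", memory)
builder = Agent("builder", "ship the product", memory)
memory.remember("research", "market_need", "invoicing for freelancers")
result = research.hand_off(builder, "build MVP")
```

The point of the sketch is the hand-off: the receiving agent gets the sender's entire context in one call, and everything either agent learns lands in a store that outlives both of them.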
This is the gap between "smart model" and "functioning organization." It's an engineering problem as much as a research problem.
Why the Distance Matters
Here's what makes this exciting rather than discouraging: when the gap closes, AI organizations will have a structural advantage that human organizations physically cannot match.
Every time a human is in the loop, things slow down. Meetings. Alignment sessions. Reorgs. Performance reviews. Large companies spend enormous energy just keeping everyone pointed in roughly the same direction.
An AI organization doesn't have alignment meetings. It has shared memory and instant coordination. The bandwidth between agents isn't a conference room -- it's a function call.
Hiring a human takes weeks. Onboarding takes months. Restructuring a team costs severance, recruiting cycles, and institutional trauma. An AI organization spins up a new agent in seconds. If the problem changes, kill the agent and spin up a different one. No lost institutional knowledge, because the memory layer persists across everything.
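That asymmetry can be shown in a few lines: the agent is disposable, the memory layer is not. A toy sketch, with all names hypothetical:

```python
class MemoryLayer:
    """Persists across every agent's lifetime -- the institutional knowledge."""
    def __init__(self):
        self.store = {}

class Worker:
    """An agent: cheap to create, cheap to retire."""
    def __init__(self, name, memory):
        self.name = name
        self.memory = memory

    def learn(self, key, value):
        # Anything learned goes to the shared layer, not the agent itself.
        self.memory.store[key] = value

memory = MemoryLayer()
v1 = Worker("sales-v1", memory)
v1.learn("best_channel", "cold email")
del v1  # retire the agent: seconds, no severance, no lost knowledge

v2 = Worker("sales-v2", memory)  # the replacement inherits everything
channel = memory.store["best_channel"]
```

Retiring `sales-v1` destroys nothing, because the agent never owned what it learned; the replacement reads the same store the moment it exists.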
This structural advantage is real. It's just waiting for the cognitive capability to catch up. The organizational scaffolding for AI companies is being built now, even though the AI isn't ready to fill it.
The Honest Timeline
We're in the age of AI agents -- systems that act autonomously on defined tasks. The next frontier is AI that innovates. Beyond that is AI that runs organizations. That last stage is not next year. ARC-AGI 3 makes the gap viscerally clear.
But the trajectory is steep and accelerating. Demis Hassabis gives it 5-10 years. Chollet designed ARC-AGI 3 to be the benchmark that survives the longest -- which means he expects it to eventually fall.
The first AI that passes the company test won't look like a chatbot that got really good. It'll look like an organization -- dozens of specialized agents coordinating through shared memory, adapting their own structure in real time, discovering what to build by reading the same ambiguous signals that human founders read.
And it probably won't ace abstract reasoning benchmarks first. Plenty of successful founders would bomb a pattern-matching test. Company-building draws on accumulated knowledge, social intelligence, and long-horizon planning that pure reasoning benchmarks don't capture. The test for general intelligence is broader than any single benchmark -- which is exactly why building a company is the right one.
The Race
If company-building is the real AGI test, then the race isn't about bigger models or better benchmark scores. It's about the orchestration layer: the infrastructure that turns intelligence into a functioning organization.
Agent coordination. Persistent memory. Adaptive goal management. World models. And an interface that lets a human founder direct the entire operation without writing code.
The models are getting smarter. The infrastructure for AI-run organizations doesn't exist yet.
That's the most valuable piece of software nobody has built.


