Even the Believers Underestimated AI
- Oliver Nowak
- 8 hours ago
- 9 min read
Every major AI paper seems to open with some version of "AI progress has been fast." And while that is obviously true, I have always found the claim frustratingly vague. Fast compared to what? Faster than we expected? This week, a report from Epoch AI gave us something rare: actual numbers. And the results are, frankly, extraordinary.
The AI Digest and Epoch AI ran a forecasting survey throughout late 2024 and early 2025. They asked 421 people, many with serious forecasting credentials, to predict AI progress across benchmarks, risk levels, and real-world metrics. These were not casual observers. Three-quarters had professional or academic AI experience. Half expected machines capable of outperforming humans at any task by 2030. 75% gave a greater than 10% probability of AI leading to catastrophic outcomes, including extinction.
In other words, these forecasters were already the bulls. They were the short-timelines, high-risk crowd. And yet, when the dust settled on 2025, the most striking finding was not that they were wrong about the technology. It was that they were wrong about the money.

The Revenue Surprise
The forecasters expected OpenAI, Anthropic, and xAI to reach combined annualised revenue of $16 billion by the end of 2025. The actual figure came in at $30.4 billion, nearly double. That is not a small miss. That is a fundamental misread of commercial momentum, coming from the very people who are most plugged into the AI discourse.
If you had told these same forecasters in early 2024 that annualised revenue would grow from roughly $6.4 billion to $30.4 billion in a single year, approximately a 4.8x increase, many would have dismissed it as hype. Yet here we are. Even the believers underestimated.
For anyone advising enterprises on AI strategy, this data point is sobering. We are not in a speculative bubble that has yet to produce commercial traction. We are in a market where the revenue reality is outpacing even the most aggressive expectations of informed forecasters. Global enterprise AI spending exceeded $600 billion in 2025, according to the D2A2 Insights Report. The money is very real.
Benchmarks: Where Predictions Mostly Held
The forecasters did much better predicting benchmark performance. On RE-Bench, a set of AI R&D tasks, the median forecast was 1.1 and the actual score was 1.13. On FrontierMath, an extremely challenging mathematics benchmark, the median forecast was 40% and the actual performance was 40.7%. These are remarkably close.
There were misses, of course. Forecasters substantially underestimated progress on Cybench, a cybersecurity benchmark, by about 20 percentage points. They overestimated SWE-Bench Verified, a coding benchmark, by about 7 points. But on balance, this crowd understood where the technology was heading.
The interesting question is why the benchmark predictions were so accurate while the commercial predictions were so far off. One explanation is that benchmarks are what this community pays closest attention to. They track model releases, read the system cards, follow the leaderboard updates. Revenue figures are noisier, harder to verify, and less central to the AI safety discourse that many of these forecasters inhabit.
The Benchmark-to-Business Gap
This brings us to a theme that I keep returning to in this blog: benchmark progress does not automatically translate into business value.
MIT's Project NANDA released a report in mid-2025 that found 95% of enterprise AI investments had produced zero measurable return. Now, that figure has been disputed, and some critics argue the methodology was too narrow. But the broad pattern is consistent with what I see every day in the field. Companies are piloting AI at unprecedented scale and struggling to convert those pilots into production systems that move the P&L.
The D2A2 Insights Report, published this month by a team of industry analysts, validates this from a practitioner perspective. Their prediction from early 2025 that "Agentic AI will NOT instantly boost business productivity" scored a perfect 5 out of 5 on their realisation scale. The report notes that early agentic systems faced "reliability, governance, and orchestration challenges" and that productivity gains required significant workflow redesign. This is not a technology failure. It is a translation failure.
The Epoch AI study reinforces this dynamic from the opposite direction. The forecasters correctly predicted that AI models would get dramatically better at coding, maths, and research tasks. But they did not anticipate just how quickly that capability would translate into commercial demand. It turns out that even without "transformation" at most enterprises, raw adoption has been explosive. ChatGPT, Claude, and similar tools are being used daily by over 90% of workers at surveyed organisations, often through personal subscriptions rather than enterprise deals.
The implication is uncomfortable for those of us who advise on transformation. The market is growing faster than the value realisation. This is not inherently bad, but it creates pressure. Eventually, the commercial momentum will need to be justified by measurable outcomes.
Two Types of AI, Two Types of Measurement
One of the most useful frameworks I have encountered for navigating this tension comes from the D2A2 report. They argue that enterprise AI investment has entered a "crisis of accountability" and propose a critical distinction: Productivity AI versus Engineered AI.
Productivity AI includes tools like ChatGPT, Copilot, and coding assistants. These warrant qualitative and adoption metrics rather than hard ROI calculations. The report is blunt about why: "Sales teams write faster emails but still send the same number. Developers code better but follow the same release cycle. Hard ROI calculation produces false precision." What matters here is whether employees find tools useful and whether adoption is growing.
Engineered AI includes process automation, system redesign, and custom applications. These investments consume significant capital and should deliver measurable impact. They require defined cost caps, explicit value targets, and clear accountability.
This distinction helps explain the gap between commercial momentum and business transformation. Most of the revenue growth at OpenAI, Anthropic, and others is likely driven by Productivity AI adoption, not by deeply engineered solutions. The 95% failure rate from MIT is measuring the wrong thing for the wrong category.
For enterprise leaders, the implication is clear: stop forcing productivity tools through ROI frameworks they cannot satisfy. Instead, focus your measurement discipline on the engineered solutions where capital is at stake and outcomes are attributable.
Classical AI Is Not Dead
Another theme from the D2A2 report that deserves attention: in the rush towards generative AI, classical AI has been "grossly neglected." The hype surrounding LLMs has led many in the C-suite to prioritise launching GenAI initiatives without recognising that many critical business use cases are better served by traditional machine learning.
Classical AI, the kind that predicts and optimises business processes based on structured data, remains indispensable for risk prediction, demand forecasting, operational optimisation, and process automation. The D2A2 analysts recommend applying a "separation of concerns" pattern: use classical AI for data-driven decision-making, including risk assessment, pattern recognition, and operational optimisation. Use generative AI for language, creativity, personalisation, and communication.
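To make the pattern concrete, here is a minimal sketch of what that separation of concerns might look like as a simple routing layer. The task categories and handler names are my own illustrative assumptions, not anything prescribed by the D2A2 report.

```python
# Hypothetical sketch of a "separation of concerns" routing layer:
# structured, predictive work goes to classical ML; language-centric work
# goes to a generative model. Task kinds and handlers are illustrative only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    name: str
    kind: str  # e.g. "risk_scoring", "demand_forecast", "email_draft"

CLASSICAL_KINDS = {"risk_scoring", "demand_forecast", "anomaly_detection"}
GENERATIVE_KINDS = {"email_draft", "summary", "personalised_reply"}

def route(task: Task,
          classical_handler: Callable[[Task], str],
          generative_handler: Callable[[Task], str]) -> str:
    """Send structured-data decisions to classical AI, language work to GenAI."""
    if task.kind in CLASSICAL_KINDS:
        return classical_handler(task)
    if task.kind in GENERATIVE_KINDS:
        return generative_handler(task)
    raise ValueError(f"No handler registered for task kind: {task.kind}")

# Example usage with stub handlers.
print(route(Task("Q3 churn model", "risk_scoring"),
            classical_handler=lambda t: f"[classical ML] scored {t.name}",
            generative_handler=lambda t: f"[LLM] drafted {t.name}"))
```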
This matters for the forecasting discussion because benchmark progress on LLMs does not tell you anything about classical AI capabilities. And for most enterprises, classical AI is where the proven, measurable value lives. The shiny new reasoning models are impressive, but they are not a substitute for a well-tuned demand forecasting system.
The Shift Toward Smaller Models and Local AI
The D2A2 report makes two related predictions that speak to where enterprise AI is heading: the rise of Small Language Models (SLMs) and the critical mass of local and offline AI.
On SLMs, the argument is straightforward. Large Language Models come with infrastructure costs, energy consumption, hallucinations, and precision issues that make them suboptimal for many business challenges. Small Language Models use fewer parameters, are trained on narrower data, and are often more reliable for domain-specific tasks. Microsoft's Phi model family is already available on Azure for businesses seeking efficient and cost-effective alternatives.
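As a concrete illustration, here is a minimal sketch of running a small model locally. It assumes the Hugging Face transformers library and the openly released Phi-3 mini checkpoint; the model ID, hardware settings, and prompt are my assumptions rather than anything specified in the D2A2 report.

```python
# Minimal sketch: running a small language model locally with Hugging Face
# transformers. Model ID and generation settings are illustrative assumptions;
# the D2A2 report itself only notes the Phi family's availability on Azure.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",  # ~3.8B-parameter SLM
    device_map="auto",          # use a GPU if one is available
    trust_remote_code=True,     # some Phi releases ship custom model code
)

prompt = "Classify this support ticket as billing, technical, or other:\n..."
result = generator(prompt, max_new_tokens=64, do_sample=False)
print(result[0]["generated_text"])
```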
On local AI, the drivers are economic, regulatory, and operational. For high-volume, repetitive tasks like document classification or data extraction, cloud API pricing becomes prohibitively expensive. In regulated sectors like healthcare, legal, and finance, strict privacy requirements make on-premise AI essential. And for operational resilience, local infrastructure avoids the fragility of cloud outages and rate-limit changes.
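A rough back-of-envelope calculation shows why the economics tilt this way at high volume. Every number below is a hypothetical placeholder to illustrate the shape of the comparison, not real vendor pricing or hardware costs.

```python
# Hypothetical cost comparison for a high-volume document-classification
# workload: cloud API per-token pricing vs an amortised local server.
# All figures are placeholders; substitute your own pricing.
DOCS_PER_DAY = 500_000
TOKENS_PER_DOC = 800                  # prompt + completion, assumed
API_PRICE_PER_1M_TOKENS = 2.50        # USD, placeholder
LOCAL_MONTHLY_COST = 6_000.0          # USD, amortised GPU server + power, placeholder

api_monthly = DOCS_PER_DAY * 30 * TOKENS_PER_DOC / 1_000_000 * API_PRICE_PER_1M_TOKENS
print(f"Cloud API (hypothetical): ${api_monthly:,.0f} per month")
print(f"Local (hypothetical):     ${LOCAL_MONTHLY_COST:,.0f} per month")
```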
Both trends point towards a more diversified AI landscape. The future is not one giant model in the cloud. It is a hybrid architecture with workloads allocated based on sensitivity, economics, latency, and capability needs.
Governance Is Becoming Quantifiable
The Epoch AI forecasters underestimated CBRN (chemical, biological, radiological, and nuclear) risk. OpenAI's internal assessment moved from "medium" to "high" after GPT-5.2 demonstrated meaningful capability to assist non-experts in creating biological threats. The Epoch AI authors note that a separate study found superforecasters and biology experts drastically underestimated this kind of "uplift" potential.
This is the uncomfortable edge of AI forecasting. We are collectively not very good at predicting the harmful applications of improving capabilities. It is one thing to track benchmark scores; it is another to anticipate the second-order effects of general-purpose tools in the hands of motivated actors.
The D2A2 report argues that AI governance is shifting from principles to evidence-based accountability. Regulatory pressure from the EU AI Act, US executive directives, and emerging global standards now requires ongoing monitoring, documentation, and proof of human oversight. According to a Gartner survey cited in the report, organisations that perform regular audits of AI system performance are over three times more likely to achieve high GenAI business value.
The message here is that governance is not a drag on innovation. It is a driver of ROI. Quantified governance, including fairness stability, model drift, explainability scores, and human oversight coverage, is becoming a differentiator between organisations that can scale AI and those that cannot.
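To show what "quantified" can mean in practice, here is one hypothetical way to put a number on model drift: the Population Stability Index, a common drift statistic. The report names drift monitoring as a metric category; the specific formula and threshold below are my own illustration, not something the report prescribes.

```python
# One hypothetical way to quantify "model drift" as a governance metric:
# the Population Stability Index (PSI) between a model's score distribution
# at deployment time and its distribution today.
import numpy as np

def population_stability_index(expected: np.ndarray,
                               actual: np.ndarray,
                               bins: int = 10) -> float:
    """PSI = sum((actual% - expected%) * ln(actual% / expected%)) over bins."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid division by zero / log(0) in sparsely populated bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Example: compare the score distribution at deployment with this quarter's.
rng = np.random.default_rng(0)
baseline = rng.beta(2, 5, size=10_000)     # scores at deployment
current = rng.beta(2.5, 4.5, size=10_000)  # slightly shifted scores today
print(f"PSI = {population_stability_index(baseline, current):.3f}")
# A common rule of thumb treats PSI > 0.2 as a signal of material drift.
```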
Public Awareness Lagged Behind
One final finding from the Epoch AI study stands out. The forecasters expected public concern about AI to rise significantly. The median forecast was that 2% of Americans would identify computers and technology as the nation's most important problem by the end of 2025. The actual figure was closer to 0.4%.
This is a meaningful gap. It suggests that the AI safety community has overestimated how visible these concerns are to the broader public. The narrative on X, in Substacks, and at AI conferences does not reflect mainstream attention. For better or worse, most people are using AI tools without existential dread.
For those of us working on AI education, this is worth noting. There is a genuine gap between insider concern and public awareness. Bridging that gap requires meeting people where they are, not where the narrative assumes they should be.
What This Means for Enterprise Leaders
The Dwarkesh Patel observation quoted in the Epoch AI report captures something important: "Models keep getting more impressive at the rate the short timelines people predict, but more useful at the rate the long timelines people predict." The revenue data complicates this slightly, with adoption outpacing even the bulls, but the MIT transformation data suggests the deeper impact is still lagging.
For enterprise leaders, this creates a specific challenge. You are under pressure to adopt AI because the market is moving. Your peers are experimenting, your vendors are pitching agentic solutions, and your board is asking about your AI strategy. But the evidence suggests that most AI pilots fail to deliver measurable returns unless they are focused, narrowly scoped, and executed with external partners rather than internal builds.
The 5% of organisations that are succeeding, according to MIT, share a common trait: they pick one pain point, execute well, and partner smartly. They are not just buying everyone a Copilot licence. They are grounding AI in specific workflows with clear feedback loops.
The D2A2 report offers a complementary set of principles: separate Productivity AI from Engineered AI in your measurement frameworks; do not neglect classical AI for structured data problems; consider smaller models and local deployment for cost, privacy, and resilience; and treat governance as a strategic capability, not a compliance burden.
Looking Ahead
The Epoch AI study is part of a broader effort to bring rigour to AI forecasting. The AI Digest has launched a 2026 survey, and the Forecasting Research Institute continues to track expert and superforecaster predictions across a range of AI-related questions. The D2A2 group has now published six years of predictions, with an impressive track record of execution-aware, value-focused insights.
What strikes me most is the humility embedded in the findings. Even among the most informed observers, there is significant uncertainty and frequent error. The forecasters were roughly right about benchmark progress, mixed on risks, and substantially wrong about revenue. That is not a failure of methodology; it is a reflection of how genuinely uncertain the near future is.
For those of us working in digital transformation, this uncertainty is not an excuse for inaction. It is a call for adaptive strategy. Build systems that can learn and improve. Ground AI in real data and real workflows. Measure outcomes, not just adoption. And remain sceptical of both hype and doom, because even the believers got it wrong.
References
Ho, A. (2026, January 16). How well did forecasters predict 2025 AI progress? Epoch AI Gradient Updates. https://epochai.substack.com/p/how-well-did-forecasters-predict
AI Digest. (2025). 2025 AI Forecasting Survey Results. https://ai2025.org/
Challapally, A., Pease, C., Raskar, R., & Chari, P. (2025, July). The GenAI Divide: State of AI in Business 2025. MIT Project NANDA. https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf
Southekal, P., Marar, A., Singh, S., Balasubramanian, V., Zwingmann, T., & Tomov, V. (2026). D2A2 Insights: 2026 Digital, Data, Analytics, and AI Predictions and Prescriptions. DBP Institute & D2A2.ai. https://www.d2a2.ai
Forecasting Research Institute. (2026, January 14). What Experts and Superforecasters Think About the Future of AI Research and Development: LEAP Wave 4. https://forecastingresearch.substack.com/p/what-experts-and-superforecasters
Gartner. (2025, November 4). Gartner Survey Finds Regular AI System Assessments Triple the Likelihood of High GenAI Value. [Press release]. Referenced in D2A2 Insights Report 2026.
Kumar, A. et al. (2025, September 8). The Case for Using Small Language Models. Harvard Business Review. Referenced in D2A2 Insights Report 2026.
Furr, N. & Shipilov, A. (2025, August 29). Beware the AI Experimentation Trap. Harvard Business Review. https://hbr.org/2025/08/beware-the-ai-experimentation-trap
Patel, D. (2025, December). Thoughts on AI Progress. Dwarkesh Podcast. https://open.substack.com/pub/dwarkesh/p/thoughts-on-ai-progress-dec-2025
OpenAI. (2025). GPT-5.2 System Card. https://cdn.openai.com/pdf/3a4133c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf
