
Is AI Confidence Your Biggest Enterprise Risk?

  • Writer: Oliver Nowak
  • Dec 16
  • 5 min read
[Cartoon: a man questions a confident computer screen, which insists, "Because I am trained on the entire internet."]

Anyone who has spent the last year playing with ChatGPT, Gemini, Claude or their peers knows the double act: a model that can answer exam questions and write boilerplate code at lightning speed, yet just as easily makes up a policy for your own business. When the response looks polished, it is tempting to believe it. That is why a recent paper, AA‑Omniscience, fascinated me so much.


The authors built a benchmark of 6,000 questions across 42 economically significant topics and six domains, deliberately penalising incorrect guesses and rewarding abstention. They call the resulting score the Omniscience Index. It ranges from –100 to 100 and is designed to capture both factual recall and the humility of saying “I don’t know”. Of all the frontier models tested, only three managed to score above zero. Accuracy alone was not the problem; hallucinations were. As Grok 4 and GPT‑5 demonstrated, you can be the most accurate model tested, at around 39%, and still have your overall score dragged down by hallucination rates north of 60%.
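
To make the scoring concrete, here is a minimal sketch of how an index like this could be computed. It assumes the simple scheme described above (credit for a correct answer, a penalty for a wrong one, nothing for an abstention); the exact weighting used by AA‑Omniscience may differ.

```python
def omniscience_index(correct: int, incorrect: int, abstained: int) -> float:
    """Toy Omniscience-style score on a -100..100 scale.

    Assumes +1 for a correct answer, -1 for an incorrect answer and 0 for an
    honest "I don't know". The real AA-Omniscience weighting may differ; this
    only illustrates why hallucinations drag the score down.
    """
    total = correct + incorrect + abstained
    if total == 0:
        return 0.0
    return 100 * (correct - incorrect) / total


# A model that is right 39% of the time but guesses on everything else ends
# up below zero; a humbler model with the same accuracy stays well above it.
print(omniscience_index(correct=39, incorrect=61, abstained=0))   # -22.0
print(omniscience_index(correct=39, incorrect=10, abstained=51))  # 29.0
```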


I have written a lot about the “iron triangle” of AI adoption: data, context and grounding. This research confirms a long-held suspicion about AI: parametric memory is a trap. The knowledge in a model’s weights might answer a trivia question, but it will fail when you ask about your own systems. Worse, it will fail confidently.


Head vs Tail - where value lives

If you take a walk through your organisation’s service desk, you will notice two types of questions. Head questions: the capital of France, basic contract terms, widely known coding syntax; these are the ones we have all been optimising AI to answer, and models have memorised them. Tail questions are where the real work lives: the bizarre error code thrown out by a 20-year-old mainframe, the specific clause your legal team negotiated back in 2019, the local regulation that only applies to waste management in the Highlands. These do not show up on Wikipedia.


When the AA‑Omniscience researchers tested models across law, business, software engineering, health, humanities and science, no single model dominated. Claude 4.1 Opus scored highest overall thanks to low hallucination and moderate accuracy, while GPT‑5.1 topped business queries and Grok 4 excelled in health and science. In other words, you cannot pick a model by looking at its general intelligence; you must understand how it behaves in your domain.


[Cartoon: two men compare notes on an AI whose screen confidently cites unicorns and fairy dust.]

The danger here is not ignorance; ignorance is fixable with retrieval. The danger is confident ignorance. AA‑Omniscience deliberately rewards models for abstaining: a wrong answer is penalised while an honest “I don’t know” costs nothing, so a model that guesses and hallucinates on every question it cannot answer plunges towards –100. In enterprise settings, a polite “I don’t know, let me check the documentation” is priceless.


Four quadrants of reliability

One of the most interesting findings in the paper is the four‑quadrant view of reliability. Think of it as a scatter chart: knowledge on one axis, calibration on the other. You get four archetypes, with a rough classification sketch after the list:


  1. High knowledge / high reliability - these rare models (Claude 4.1 Opus, GPT‑5.1, Grok 4) answer more correctly than incorrectly and know when to admit uncertainty. They are the current gold standard.

  2. Low knowledge / high reliability - models like Claude 4.5 Haiku and Llama 3.1 405B answer fewer questions correctly overall but refrain from hallucinating. They are safe orchestrators and excellent at routing tasks to tools.

  3. High knowledge / low reliability - this is where models like GPT‑5 (minimal) and Gemini 2.5 Flash sit. They know a lot but guess often. In enterprise contexts they are potentially dangerous without guardrails.

  4. Low knowledge / low reliability - combine ignorance with confidence, and you get a model that should not be anywhere near your business protocols.
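
As a rough illustration of how you might bucket models yourself, here is a sketch that places a model in one of the four quadrants from two published metrics: accuracy (knowledge) and hallucination rate (the inverse of calibration). The 0.5 thresholds and the example numbers are my own illustrative assumptions, not values from the paper.

```python
def quadrant(accuracy: float, hallucination_rate: float) -> str:
    """Classify a model into one of four reliability archetypes.

    accuracy: share of questions answered correctly (0-1).
    hallucination_rate: share of non-correct answers given confidently
    instead of abstaining (0-1). The 0.5 cut-offs are illustrative
    assumptions, not thresholds taken from AA-Omniscience.
    """
    high_knowledge = accuracy >= 0.5
    high_reliability = hallucination_rate <= 0.5
    if high_knowledge and high_reliability:
        return "high knowledge / high reliability (gold standard)"
    if not high_knowledge and high_reliability:
        return "low knowledge / high reliability (safe orchestrator)"
    if high_knowledge and not high_reliability:
        return "high knowledge / low reliability (needs guardrails)"
    return "low knowledge / low reliability (keep away from production)"


# Hypothetical numbers, purely for illustration:
print(quadrant(accuracy=0.62, hallucination_rate=0.2))
# -> high knowledge / high reliability (gold standard)
```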


The separation of knowledge and calibration is a sobering reminder that sheer parameter count does not guarantee reliability. The paper shows that smaller models sometimes outperform larger ones on reliability. Scale improves accuracy but does not automatically reduce hallucinations.


From Oracle to Orchestrator

So what do we do with this information? We must stop treating LLMs as omniscient oracles. Instead, we should treat them as orchestrators: reasoning engines sitting on top of our own data and processes.


  1. Shift from parametric recall to retrieval‑augmented generation

Models will not miraculously learn your bespoke outage procedure. Connect them to the right data. Retrieval‑augmented generation (RAG) fetches the actual policy text from your knowledge base and uses it as context. It is the only way to improve the Omniscience Index within your four walls. As the paper notes, embedded knowledge can help with efficiency, but domain‑specific information remains essential.
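
Here is a minimal RAG sketch to make the idea concrete. The in-memory knowledge base, the keyword-overlap retriever and the prompt wording are all simplified assumptions; in practice you would plug in your own vector store and model API.

```python
# Minimal retrieval-augmented generation sketch: fetch the relevant policy
# text first, then instruct the model to answer only from that context.

KNOWLEDGE_BASE = [
    {"id": "policy-017", "text": "Major incidents must be escalated to the on-call manager within 15 minutes."},
    {"id": "policy-042", "text": "Contract renewals over 50k require sign-off from both legal and finance."},
]

def retrieve(question: str, top_k: int = 1) -> list[dict]:
    """Rank documents by naive keyword overlap with the question."""
    words = set(question.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(words & set(doc["text"].lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_grounded_prompt(question: str) -> str:
    """Fetch supporting passages and instruct the model to stay inside them."""
    context = "\n".join(f"[{doc['id']}] {doc['text']}" for doc in retrieve(question))
    return (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

print(build_grounded_prompt("How quickly must a major incident be escalated?"))
```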


  2. Engineer for humility

Set system prompts that reward caution. The best agents are those that politely admit when they cannot answer. They ask for more context or go off to check a document. The AA‑Omniscience evaluation penalises guessing; you should, too.
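
One way to put that into practice is a system prompt that makes abstention the expected behaviour, plus a cheap check that routes uncertain answers somewhere useful. The wording and the UNCERTAIN marker below are illustrative assumptions, not a standard API.

```python
# Sketch: reward caution by making "I don't know" an explicit, machine-readable
# outcome, then escalate it instead of passing a guess to the user.

SYSTEM_PROMPT = (
    "You are an internal support assistant. Only answer when the provided "
    "context supports the answer. If you are not sure, reply exactly with "
    "'UNCERTAIN:' followed by what you would need to check."
)

def handle_response(model_reply: str) -> str:
    """Escalate uncertain answers instead of passing guesses to users."""
    if model_reply.startswith("UNCERTAIN:"):
        return "Routed to documentation search / human review: " + model_reply
    return model_reply

print(handle_response("UNCERTAIN: I need the 2019 framework agreement to confirm this clause."))
```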


  3. Audit the tail

Stop measuring new models against generic trivia. If you are in transport, test the agent on your rail‑scheduling edge cases. If you are in healthcare, ask about obscure compliance rules. As the research shows, performance varies wildly by domain. The tail is where hallucinations hide and where value emerges.
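
A tail audit can be as simple as the sketch below: run your agent over your own edge cases and score it the abstention-aware way, so a confident wrong answer hurts more than an honest “I don’t know”. The example cases and the ask_agent stub are placeholders for your real data and integration.

```python
# Domain-specific "tail audit" sketch with abstention-aware scoring.

TAIL_CASES = [
    {"question": "What does mainframe error QX-7731 mean?", "answer": "tape library offline"},
    {"question": "Which clause covers early termination in the 2019 framework agreement?", "answer": "clause 14.2"},
]

def ask_agent(question: str) -> str:
    """Placeholder: call your actual agent or model here."""
    return "I don't know"

def audit(cases: list[dict]) -> float:
    """Return a -100..100 score: +1 correct, 0 abstention, -1 wrong answer."""
    score = 0
    for case in cases:
        reply = ask_agent(case["question"]).lower()
        if "i don't know" in reply:
            continue                      # abstention: no reward, no penalty
        score += 1 if case["answer"].lower() in reply else -1
    return 100 * score / len(cases)

print(audit(TAIL_CASES))  # 0.0 for an agent that abstains on everything
```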


  4. Consider cost vs. reliability

AA‑Omniscience also measures the cost of running each model. Some high‑performing models require substantial token budgets. Smaller, well‑calibrated models might offer better value for narrow tasks. Do not just reach for the newest or largest; match the model to the job and your budget.
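
One hedged way to frame that trade-off is cost per reliable answer rather than cost per token. The figures below are made-up placeholders, purely to show how a cheaper, well-calibrated model can come out ahead for a narrow task.

```python
# Back-of-the-envelope comparison: cost divided by the share of answers that
# are actually correct. All numbers are illustrative, not benchmark results.

def cost_per_reliable_answer(cost_per_answer: float, accuracy: float) -> float:
    return cost_per_answer / accuracy

print(cost_per_reliable_answer(cost_per_answer=0.020, accuracy=0.55))  # large frontier model
print(cost_per_reliable_answer(cost_per_answer=0.002, accuracy=0.40))  # small calibrated model
```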


Why this matters

In digital transformation projects, we often chase automation on autopilot. We talk about “agentic AI,” letting the bots handle tasks like incident resolution, approval flows and knowledge search while we focus on strategy. That is a worthy goal. But there is a risk: if our agents confidently hallucinate, we automate at scale the very mistakes we used to catch manually.


Models with high general intelligence but poor calibration may pass surface‑level demos but fail outright in your operational reality. The path forward is to ground them in the reality of your systems: use your CMDB and your vendor documentation, and build feedback loops so the model learns when it is outside its comfort zone.


Looking ahead

By the time you read this, there will likely be a new model with even more impressive parameter counts. If the past year is any guide, it will be faster, more coherent and, paradoxically, more believable when it hallucinates. The Omniscience Index gives us a way to cut through the hype. It reminds us that intelligence and reliability are not the same thing.


In 2026, the most successful AI projects will not be the ones that boast the highest IQ scores. They will be the ones built on humility, retrieval and human oversight. They will know that true intelligence is not pretending to know everything. It is knowing when to say "I don't know".

