
If AI just predicts the next word, how does it think?

  • Writer: Oliver Nowak
  • 3 days ago
  • 11 min read

I was recently asking myself a really obvious but strangely complex question: if all a large language model does is predict the next word, how can it actually reason?


At their mathematical core, large language models are simply next-token prediction engines. During training, they are shown vast quantities of text and asked, over and over again: given everything that came before, what comes next? Then billions of internal parameters are adjusted until they get better and better at that prediction, and that is essentially the entire training objective.
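
To make the objective concrete, here is a deliberately tiny sketch of "predict the next word". A real model learns billions of neural parameters; this toy just counts which word follows which in a miniature corpus. But the question asked at every position is the same one the real training loop asks: given what came before, what comes next?

```python
from collections import Counter, defaultdict

# Toy illustration of the training objective. A real model adjusts
# billions of neural parameters; here we just tally bigram counts.
corpus = "the capital of france is paris . the capital of italy is rome .".split()

next_word_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_word_counts[prev][nxt] += 1   # "training": record what followed what

def predict_next(word):
    # Return the word most often seen after `word` during "training".
    return next_word_counts[word].most_common(1)[0][0]

print(predict_next("capital"))  # → of
```

Scale this idea up from counting pairs to a deep network conditioning on the entire preceding context, and you have the training objective of a large language model.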


There is no explicit module labelled "reasoning". There is no logic engine bolted on the side. There is no database of facts being queried in the background. For all intents and purposes, it is simply really sophisticated autocomplete.


In the early days, when the outputs were impressive but shallow, this framing held up reasonably well. But today, calling a large language model "just autocomplete" feels misleading. It's a bit like saying the human brain "just fires neurons". That's technically true, but it's definitely not telling the full story.


Why Next-Token Prediction Undersells What Is Happening

To fully understand what is going on, you have to appreciate what next-token prediction actually demands at scale.


Take a simple sentence: "The capital of France is..." Most people can predict the next word. Paris. That is autocomplete the way your phone keyboard does it. It's pattern matching against common sequences.


But now try something harder: "Sally has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?"


Ask a modern large language model this question and it will reliably answer correctly. The answer is 1 because Sally is one of the 2 sisters. The brothers share the same set of sisters, and Sally is included in that set.
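
The logic of the puzzle is small enough to write down explicitly. Here it is as a few lines of Python (the brothers' names are invented for illustration):

```python
# The puzzle's hidden constraint, made explicit: all the siblings
# belong to one family, so the brothers share a single set of sisters.
brothers = {"Ben", "Carl", "Dan"}   # Sally has 3 brothers (names invented)
sisters_per_brother = 2             # each brother has 2 sisters

shared_sisters = sisters_per_brother     # size of that one shared set
sallys_sisters = shared_sisters - 1      # everyone in the set except Sally

print(len(brothers), sallys_sisters)  # → 3 1
```

The trap in the puzzle is treating "2 sisters" as 2 sisters *per brother* rather than one shared set that includes Sally herself.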


This is not a case of the model matching a memorised question-and-answer pair. You can rephrase it a thousand ways, change the names, change the numbers, wrap it in a story, and the model will still get it right. Something is happening inside that network that looks, functionally, a lot like reasoning.

So how?


To predict the next token accurately across the extraordinary breadth of human text, from mathematics and logic puzzles to legal arguments, scientific papers, and philosophical debates, the model is forced to build internal representations that capture something far deeper than surface-level word patterns.


Imagine being asked to predict the next word in millions of mathematical proofs. You would eventually need to learn the mathematics. Not because anyone told you to, but because the patterns in the text are generated by mathematical reasoning, and the only way to reliably predict those patterns is to develop an internal model that mirrors the process that produced them.


That is precisely what happens inside a large language model. The training objective is prediction. The capability that emerges, the structure the model has to build internally to achieve that objective, is something considerably richer.


[Infographic: "The dismissal is correct. And misleading." Comparing the surface view of next-token prediction with what emerges at scale.]

What the Model Actually Builds

Researchers have put serious effort into understanding what is happening inside these networks, and the picture is, frankly, extraordinary.


During training, the model develops what are sometimes called internal representations: high-dimensional mathematical structures that encode relationships between concepts. These are not stored as explicit rules. They are not lookup tables. They are distributed across the network's parameters in ways that are still not fully understood.


But we know enough to say a few important things.


First, the model develops something that functions as a world model. When it processes the Sally puzzle, it is not matching against a template of similar puzzles it has seen before. It is constructing an internal representation of the scenario, including the family structure, the relationships, and the constraints, and using that representation to generate a coherent answer. Work from Anthropic's interpretability team (Templeton et al., Scaling Monosemanticity, 2024) and from MIT (Li et al., Emergent World Representations, 2022; Gurnee and Tegmark, Language Models Represent Space and Time, 2023) has shown these internal representations to be surprisingly structured. They encode spatial relationships, temporal sequences, logical dependencies, and even something that looks like causal reasoning.


Second, the model learns to decompose problems. When it encounters a multi-step reasoning problem, it does not arrive at the answer in a single computational step. The architecture itself, the transformer, processes information through multiple layers, and each layer performs additional computation on the representation built by the previous one. In a meaningful sense, each layer is a step in a reasoning chain, progressively refining the model's understanding before it generates an output.
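
As a schematic sketch (the "layers" below are placeholder functions, not real attention blocks), that layer-by-layer refinement looks like this:

```python
# Schematic only: a transformer pushes its input through a stack of
# layers, each refining the representation built by the one before.
# These stand-in layers just accumulate labels to show the shape of it.
def layer_1(state):
    return state + ["entities"]     # e.g. work out who is being discussed

def layer_2(state):
    return state + ["relations"]    # e.g. relate the entities to each other

def layer_3(state):
    return state + ["constraints"]  # e.g. apply the constraint that matters

def forward_pass(tokens):
    state = list(tokens)
    for layer in (layer_1, layer_2, layer_3):
        state = layer(state)        # each layer = one step of refinement
    return state

print(forward_pass(["sally", "puzzle"]))
```

The point of the sketch is only the structure: the computation is sequential and layered, so a single forward pass already contains something like a short, fixed-length reasoning chain.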


Third, and this is the part most people recognise as "thinking models", the model's reasoning gets dramatically better when you give it room to think out loud.


The Chain-of-Thought Revolution

A transformer generates one token at a time. It takes your entire input, processes it through the network in a single forward pass, and produces one token. Then it takes the input plus that new token and runs another forward pass to produce the next one. And so on, token by token.


Each token gets one forward pass. That is a fixed amount of computation. No matter how hard the problem is, a single token receives the same computational budget.
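
The generation loop above can be sketched in a few lines. The `model` below is a stub that follows a script rather than running a real network, but the shape of the process, one forward pass per token, with each new token appended to the context, is the real one:

```python
def generate(prompt, model, max_tokens):
    # Autoregressive loop: every new token costs one forward pass, and
    # immediately becomes part of the context for the next prediction.
    context = list(prompt)
    for _ in range(max_tokens):
        token = model(context)   # one fixed-compute forward pass
        context.append(token)
    return context

# Stub model: a real transformer would return the most probable next
# token given the context; this one reads from a script so the loop runs.
script = ["Sally", "has", "one", "sister", "."]
prompt = ["How", "many", "sisters", "does", "Sally", "have", "?"]
stub = lambda context: script[len(context) - len(prompt)]

out = generate(prompt, stub, max_tokens=len(script))
print(out[len(prompt):])  # → ['Sally', 'has', 'one', 'sister', '.']
```

Notice that the loop itself has no notion of problem difficulty: each pass through it does the same amount of work whether the next token is trivial or load-bearing.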


So if you ask "what is 2 + 2?" and the model answers "4", one forward pass was sufficient. The reasoning fits comfortably within that fixed compute budget.


But if you ask the Sally puzzle and demand a one-word answer, you are asking the model to resolve the entire logical chain (parsing the family structure, recognising that brothers share sisters, identifying that Sally is one of those sisters, counting correctly) all within that same fixed computational budget. The same budget it used for "2 + 2 = 4". For a smaller or older model, it often cannot do it. The problem needs more sequential processing steps than a single forward pass can provide.


It might say 6. It might say 3. Not because it lacks the knowledge. Because it lacks the compute.


Chain-of-thought reasoning solves this. When you ask the model to think step by step, it starts producing intermediate tokens: "Sally has 3 brothers. Those brothers are all in the same family. Each brother has 2 sisters. Since they are in the same family, they share the same sisters..."


Each of those generated tokens becomes part of the context for the next prediction. Each intermediate conclusion gets baked into the context window, and the next token is predicted with the benefit of that accumulated reasoning. The model is not "showing its working" for your benefit. Each written step is a computational step. The text is the reasoning, not a description of reasoning happening somewhere else.


When you ask a model to "think step by step", you are not prompting it to display its method like a maths teacher insisting a student show their working. You are changing the computational process. You are giving the model more forward passes, more opportunities for each intermediate conclusion to inform the next one. The text it generates becomes a cognitive scaffold, a form of working memory that the architecture does not otherwise have.


This is why chain of thought works even when no one has told the model what the steps should be. It is also why forcing a short answer on a hard problem does not just hide the working. It removes the compute. You are not asking for a neater answer. You are cutting off the reasoning before it can happen.
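
You can see the compute argument directly by counting tokens. Each generated token is one forward pass, so the length of the answer *is* the compute budget (the token counts below are illustrative, not measured from a real model):

```python
# One forward pass per generated token: the length of the output is the
# amount of sequential computation the model gets to spend on the problem.
short_answer = ["1"]
chain_of_thought = (
    "Sally has 3 brothers . the brothers share the same sisters . "
    "each brother has 2 sisters . Sally is one of them . "
    "so Sally has 1 sister ."
).split()

passes_short = len(short_answer)
passes_cot = len(chain_of_thought)

# The step-by-step answer buys dozens of extra forward passes, each one
# conditioned on the intermediate conclusions already in the context.
print(passes_short, passes_cot)
```

Demanding the one-token answer on the left is demanding that the entire logical chain be resolved in a single pass.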


Research has shown (Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2022) that this technique can take a model from near-chance performance on complex reasoning tasks to human-competitive accuracy. Not by changing the model. Not by adding a reasoning module. Simply by giving the model room to reason through language, one token at a time, each one earning the next.


[Infographic: a fixed compute budget per token, shown in three panels: easy problem, overloaded problem, chain of thought.]

What Happens When You Select a "Thinking" Model

If chain of thought is the model reasoning through its visible output, the obvious question follows. What if you gave it a dedicated space to reason before it had to say anything at all?


This is what is happening when you select a "thinking" model. With a standard model, you send a prompt and the model starts generating visible tokens immediately. Every token it produces is both part of the reasoning and part of the response you see. Essentially, the reasoning is performed in public.


With a thinking model, there is an additional phase. Before the model produces any visible output, it generates potentially thousands of tokens of internal deliberation. It explores the problem space. It decomposes the question. It considers alternatives. It checks its own logic. Sometimes it backtracks and tries a different path entirely. Only once this hidden reasoning is complete does it produce the clean answer you actually see.


Mechanically, this is not a different architecture. It is the same transformer, the same prediction mechanism, generating tokens in exactly the same way. What has changed is that the model has been trained, typically through reinforcement learning, to use that hidden reasoning space productively. It has learnt that spending more tokens thinking before answering produces better outcomes, and it has been rewarded for that behaviour during training.
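
A minimal sketch of the two phases, reusing the same token-by-token idea with a hidden segment (the `</think>` marker and the scripted tokens here are invented for illustration; real thinking models use their own internal delimiters):

```python
# Sketch of a "thinking" model: the same token-by-token mechanism, but
# tokens before the end-of-thinking marker are hidden deliberation and
# only the tokens after it reach the user.
END_THINKING = "</think>"   # illustrative marker, not a real API detail

scripted = [
    "parse", "family", "brothers", "share", "sisters",
    "check", "count", END_THINKING,       # hidden deliberation ends here
    "Sally", "has", "1", "sister", ".",   # visible answer begins
]

def respond(tokens):
    hidden, visible, thinking = [], [], True
    for token in tokens:                  # one forward pass per token
        if token == END_THINKING:
            thinking = False              # phase switch; same mechanism
            continue
        (hidden if thinking else visible).append(token)
    return hidden, visible

hidden, visible = respond(scripted)
print(len(hidden), "hidden tokens ->", " ".join(visible))
```

The architecture does not change between the two phases. What changes is what the model has been trained to do with the hidden tokens, and what you are shown.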


There is a second, subtler difference. In a standard chain-of-thought response, the reasoning has to look coherent and presentable, because you are reading it. In the hidden thinking space, it can be messy. Exploratory. Redundant. It can circle back, contradict itself, try three different approaches to the same sub-problem, and discard two of them.


When you solve a hard problem yourself, you do not think in clean, presentable paragraphs. You scribble. You cross things out. You go down a dead end and come back. That is what these models are doing in their hidden reasoning space, and it turns out that freedom makes them considerably better at getting to the right answer.


The results are striking. These reasoning models can solve competition-level mathematics, write and debug complex software, work through multi-step logical puzzles, and handle ambiguity at a level that was not on the table two years ago. They do not always get it right. But the failure modes are becoming increasingly subtle, and increasingly similar to the kinds of mistakes a competent but imperfect human reasoner would make.


[Infographic: standard model vs thinking model, comparing public and private reasoning.]

So Is It "Real" Reasoning?

There are two camps: enthusiasts who claim the machine is thinking, and sceptics who insist it is just statistics.


For me, it depends entirely on what you mean by reasoning.


If you mean conscious, deliberate, self-aware thought, the kind of reasoning that involves knowing that you are reasoning, then no. There is no credible evidence that these models have any form of subjective experience. They do not know they are solving a puzzle. They do not experience the satisfaction of finding an answer. And anyone who tells you otherwise is probably selling something.


But if you mean the ability to take a novel problem, decompose it into its constituent parts, apply relevant principles, handle constraints and exceptions, and arrive at a correct conclusion, then yes. These models do that.


The point that matters for anyone making decisions about AI deployment is this: the functional outcome is the same regardless of which philosophical position you hold.


Whether the model is "truly" reasoning or performing an extraordinarily sophisticated simulation of reasoning, the output either solves your problem or it does not. The contract it drafts either holds up to legal scrutiny or it does not. The code it writes either passes its tests or it does not.


The philosophical question is interesting. The operational question (can I rely on this output for this task, with this level of oversight?) is the one that actually matters.


The Sally Problem Is the Easy Part

I have used the Sally puzzle throughout this piece because it is a clean, self-contained example. But the Sally puzzle is easy for modern models. It was hard two years ago. It is not hard now.


The real frontier, and the real source of risk, is the messy, ambiguous, context-dependent reasoning that characterises most enterprise work. The kind of reasoning where there is not a single correct answer but a range of defensible positions. The kind where relevant information is spread across twelve documents and three conversations, and the model has to pull it all together. The kind where the question itself is poorly formed, and part of the job is figuring out what the right question actually is.


That is where there is still enormous human value. Someone says "we want AI to handle X", and X turns out to be one of those problems where the hard part is not the computation. It is the judgement. Knowing what matters. Understanding the organisational context. Reading the room. Distinguishing between what is technically correct and what is practically useful.


This is where models still struggle. Not because the prediction mechanism fails, but because the task demands something that pure prediction, however sophisticated, cannot yet reliably provide.


What This Means for How You Deploy AI

Understanding the mechanism properly, not just the headline version, changes how you should think about three things.


Prompt design is not cosmetic. If each token gets a fixed computational budget, and if the model reasons by generating intermediate tokens, then how you structure your prompts directly affects the quality of its reasoning. Asking for a one-line answer to a complex question is not just unhelpful. It is actively removing the computational steps the model needs to reason well. Giving the model space to decompose, to consider alternatives, and to check its own logic is not a nice-to-have. It is an architectural decision about how much reasoning capacity you are giving the system. Selecting a thinking model for a complex task is the same principle applied at a deeper level.


The model's confidence tells you almost nothing about its correctness.

Because the model is optimised to produce coherent, well-structured text, its outputs always sound confident. A wrong answer sounds just as fluent as a right one. This is not a bug that will be fixed. It is a structural property of how the system works. Every deployment needs a verification layer. Every process that uses AI reasoning needs a human or automated check at the points where errors would actually hurt.
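
A minimal sketch of that verification layer, with a deliberately wrong stub model and an independent arithmetic checker standing in for the tests, schemas, or human review a real deployment would use:

```python
# Verification-layer pattern: never act on a model's answer until an
# independent check has passed. The model here is a stub that sounds
# confident but is wrong; the checker recomputes the answer itself.
def model_answer(question):
    # Stand-in for a model call: fluent, confident, and incorrect.
    return "5" if "2 + 2" in question else "1"

def arithmetic_check(question, answer):
    expression = question.split("?")[0].replace("what is", "").strip()
    return str(eval(expression)) == answer   # independent recomputation

def verified(question, check):
    answer = model_answer(question)
    if not check(question, answer):
        raise ValueError(f"unverified model output: {answer!r}")
    return answer

try:
    verified("what is 2 + 2 ?", arithmetic_check)
except ValueError as err:
    print(err)   # the fluent-but-wrong answer is caught, not trusted
```

The important design choice is that the check does not ask the model whether it is right; it recomputes or tests the claim through a separate path, at exactly the points where an error would actually hurt.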


The gap between "can reason" and "can be trusted to reason unsupervised" is considerable.

This is the gap most organisations underestimate. The model can solve the Sally puzzle every time. But can it be trusted to reason through your procurement policy exceptions without oversight? Can it handle a customer complaint that has regulatory implications? Can it reason about when it should stop reasoning and escalate to a human?

These are not theoretical questions. They are the questions that separate organisations getting real value from AI reasoning from the organisations that have an expensive and occasionally embarrassing chatbot.


Retiring the "Just Autocomplete" Frame

The "just predicts the next word" framing is not wrong. It has become one of those explanations that is technically accurate and practically useless, a bit like describing a car as "just controlled explosions".


Underneath that surface-level description, there is a training process that forces the development of internal representations far richer than surface patterns. There are capabilities, including decomposition, abstraction, and multi-step inference, that nobody explicitly designed but that arise because they are necessary to predict well. There is a mechanism, chain of thought, that turns the model's own output into additional compute. And there is a new generation of thinking models that take this principle and build it into the training itself, dedicating thousands of hidden tokens to deliberation before producing a single visible word.


None of this means the model thinks the way you and I think. It does not. None of it means you can trust it blindly. You cannot. And none of it means the hard problems of enterprise deployment, including governance, verification, contextual judgement, and organisational readiness, are somehow solved because the reasoning is impressive.


But it does mean that dismissing these systems as "just autocomplete" is no longer a defensible position. What emerges from next-token prediction, at sufficient scale and with the right scaffolding, is something new in kind. Not a replacement for human reasoning, but a different kind of cognitive tool that needs its own operating manual.


The question is not whether these models are genuinely reasoning. It is whether your organisation is building the infrastructure around them to make their reasoning usable.


The organisations getting this right are not the ones most impressed by the capability. They are the ones that understand the mechanism well enough to know where it works, where it fails, and where the human still needs to be in the loop.


Further Reading

Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS.

Li, K. et al. (2022). Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. ICLR.

Gurnee, W. and Tegmark, M. (2023). Language Models Represent Space and Time. ICLR.

Templeton, A. et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Anthropic.



©2026 by The Digital Iceberg
