The Search Retriever is not the LLM
- Oliver Nowak

There's been a lot of talk about "context engineering" in 2026, i.e. giving AI systems access to documents and information. But what actually happens when an AI system "finds a document"? To judge whether one AI tool "finds documents" better than another, you first need to understand how the finding works.
The retriever is not the LLM
This is the first thing to get clear, because the conflation is almost universal.
When an AI assistant searches your SharePoint, your knowledge base, or your internal documentation, the large language model (LLM) is not doing the searching. It cannot do the searching. LLMs generate text based on context they've been given; they don't reach into external systems and pull documents. Something else has to do that work first.
That something else is called a retriever. It's a distinct piece of software: a separate service, running its own logic, with its own architecture, whose entire job is to take a query and return the most relevant chunks of content from a corpus (the body of source material the retrieval system has been set up to search across). The LLM only enters the picture once the retriever has done its job. It receives the retrieved content as context, and then generates a response.
Why does this matter? Because when two AI tools produce different quality answers from the same data source, your first instinct might be to attribute the difference to the model. Almost always, it's the retriever. The model is only as good as what it was given to work with.

Keyword retrieval
The first and oldest retrieval strategy is keyword search. Understanding how it actually works is important, because the limitation follows directly from the mechanism.
When a corpus is set up for keyword search, a data structure called an inverted index is built. Think of it like the index at the back of a book, but comprehensive. Every word in every document is catalogued, along with a list of which documents contain it and how frequently. The index doesn't store documents as documents, it stores words as pointers to documents.
When a query arrives, the search engine (something like Elasticsearch, Apache Solr, or Azure Cognitive Search running in keyword mode) tokenises it, i.e. breaks it into individual terms, and looks up each term in the inverted index. Documents containing those terms are returned, ranked by scoring algorithms like BM25, which weight results based on how often the query terms appear in a document relative to how common those terms are across the whole corpus.
So when I say "it looks for documents that contain the words in your query", what's really happening is a search engine service running BM25 scoring against an inverted index it pre-built from your document corpus.
The limitation is structural, not incidental. If a user asks "how do we handle major incidents?" and the relevant document uses the phrase "P1 escalation procedure", the keyword retriever finds nothing useful. The concept is the same. The words are different. The mechanism has no way to bridge that gap.
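The mechanism above can be sketched in a few lines. This is a toy illustration, not Elasticsearch: a hand-rolled inverted index over a two-document corpus, scored with the standard BM25 formula (the `k1` and `b` values are the usual defaults). The documents and queries are invented to mirror the article's example.

```python
import math
from collections import Counter, defaultdict

# Toy corpus: document id -> text.
docs = {
    "doc1": "P1 escalation procedure for on-call engineers",
    "doc2": "how to book annual leave in the HR portal",
}

def tokenise(text):
    return text.lower().split()

# Build the inverted index: term -> {doc_id: term frequency}.
index = defaultdict(dict)
doc_lengths = {}
for doc_id, text in docs.items():
    terms = tokenise(text)
    doc_lengths[doc_id] = len(terms)
    for term, tf in Counter(terms).items():
        index[term][doc_id] = tf

avg_len = sum(doc_lengths.values()) / len(doc_lengths)

def bm25_score(query, doc_id, k1=1.5, b=0.75):
    """Score one document against a query with BM25."""
    n_docs = len(docs)
    score = 0.0
    for term in tokenise(query):
        postings = index.get(term, {})
        tf = postings.get(doc_id, 0)
        if tf == 0:
            continue  # term absent from this document: contributes nothing
        df = len(postings)  # how many documents contain the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_lengths[doc_id] / avg_len))
        score += idf * norm
    return score

def search(query):
    return sorted(docs, key=lambda d: bm25_score(query, d), reverse=True)

# An exact-vocabulary query matches doc1; the paraphrase
# "major incident handling" shares no terms with it and scores zero.
```

Note the structural limit falls straight out of the code: if `tf == 0` for every query term, the score is exactly zero, no matter how related the concepts are.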
Semantic retrieval
Semantic retrieval solves the vocabulary mismatch problem, but through a mechanism that needs careful explanation.
The central idea is that text can be converted into numbers, specifically, a list of numbers called a vector. This is done in such a way that text with similar meaning produces vectors that are numerically close together. A query about "P1 escalation procedures" and a document about "handling major incidents" end up near each other in this numerical space, even though they share almost no words. Proximity in vector space serves as a proxy for conceptual similarity.
The component that performs this conversion is called an embedding model. And this, again, is not the LLM.
An embedding model is a separate neural network, architecturally distinct from the generative models used to produce text. Where an LLM is trained to predict and generate sequences of tokens, an embedding model is trained specifically to map text into this numerical space in a semantically coherent way. A sequence of text goes in; a fixed-length list of numbers comes out. These are typically somewhere between 768 and 3,072 numbers depending on the model. That list is the embedding, and it represents the meaning of the input as the model has learnt to interpret it.
Examples of embedding models in common use: OpenAI's text-embedding-3-large, Cohere Embed, and a wide family of open-source BERT-based models from the sentence-transformers library. These are separate products from the LLMs with separate APIs, separate pricing, and separate performance characteristics.
When a user submits a query, the same embedding model converts it to a vector. The retrieval system then searches a database of pre-computed vectors built from the document corpus and looks for the vectors most similar to the query vector. Similarity is measured mathematically, typically using cosine similarity (the angle between two vectors). The documents corresponding to the closest vectors are returned as the retrieval results.
The "closeness" here is meaningful because of how the embedding model was trained. It has seen vast amounts of text and learnt that "P1 escalation" and "major incident handling" tend to appear in similar contexts, so they end up in similar regions of the vector space. The retrieval system is essentially asking: what content, when mapped through this model's understanding of meaning, sits nearest to this query?
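To make "proximity in vector space" concrete, here is cosine similarity in plain Python. The three-dimensional vectors are invented stand-ins for illustration only: real embeddings come from a model such as text-embedding-3-large and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Angle-based similarity between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-picked toy vectors: the two incident-related texts point in
# roughly the same direction; the leave policy points elsewhere.
vectors = {
    "P1 escalation procedure": [0.9, 0.1, 0.2],
    "handling major incidents": [0.85, 0.15, 0.25],
    "annual leave policy": [0.1, 0.9, 0.3],
}

query_vec = vectors["P1 escalation procedure"]
ranked = sorted(vectors,
                key=lambda t: cosine_similarity(query_vec, vectors[t]),
                reverse=True)
# "handling major incidents" ranks above "annual leave policy"
# despite sharing no words with the query.
```

This is the whole trick: the retriever never compares words at query time, only angles between vectors.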
Indexing
This brings us to indexing. It is the most consequential design step in the entire retrieval architecture.
Indexing is the offline process of preparing a document corpus for retrieval. It involves three steps, and each one shapes the quality of everything that follows.
Step one: chunking. Raw documents (SharePoint pages, PDFs, knowledge articles, wiki entries) are broken into smaller pieces. These pieces are called chunks. This is necessary because embedding models have token limits (they can only process so much text at once), and because returning an entire 40-page document as "relevant context" is not useful. The goal is to return the specific section that answers the query.
Step two: embedding. Each chunk is passed through the embedding model. Out comes a vector representing that chunk's meaning.
Step three: storage. Each vector, along with the original chunk text and any metadata (source document, date, author, etc.), is stored in a vector database: a purpose-built system designed to store large numbers of vectors and perform fast nearest-neighbour searches across them. Pinecone, Weaviate, Qdrant, and Azure AI Search in vector mode are common examples.
The result is a searchable map: every chunk of your corpus has a precise location in a high-dimensional space, determined by its meaning as understood by the embedding model. At query time, the system locates the query on that same map and returns whatever sits nearby.
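The three steps can be sketched end to end. Everything here is a stand-in: `embed` fakes an embedding model with a character-bigram hash (a real model captures meaning; this only keeps the pipeline runnable), and the "vector database" is a plain list, but the chunk, embed, store shape is the real pipeline.

```python
import math

def embed(text):
    """Stand-in for a real embedding model call (e.g. an API request).
    Hashes character bigrams into a small fixed-length unit vector."""
    vec = [0.0] * 16
    for i in range(len(text) - 1):
        vec[hash(text[i:i + 2]) % 16] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def chunk(document, size=40):
    """Naive fixed-size chunking by characters (deliberately simple)."""
    return [document[i:i + size] for i in range(0, len(document), size)]

# Step three's "vector database": here just a list of records.
store = []

def index_document(doc_id, text):
    """Offline indexing: chunk, embed, store with metadata."""
    for n, piece in enumerate(chunk(text)):
        store.append({"vector": embed(piece), "text": piece,
                      "meta": {"source": doc_id, "chunk": n}})

def query(text, k=2):
    """Online retrieval: embed the query, return the k nearest chunks."""
    qv = embed(text)
    scored = sorted(store,
                    key=lambda e: sum(a * b for a, b in zip(qv, e["vector"])),
                    reverse=True)
    return scored[:k]

index_document("runbook", "P1 escalation procedure: page the on-call engineer, "
                          "open a bridge, and notify the incident manager.")
```

Swap `embed` for a real model and `store` for Pinecone or Qdrant and this is, structurally, the production architecture.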

Why chunking is the most underestimated variable
Each chunk only receives one vector. That means that single vector has to represent the meaning of everything in the chunk. If the chunk contains a coherent, focused piece of information, for example, a complete explanation of one concept, a single procedure, or a specific policy, then the vector will sit cleanly in the part of the space where that concept lives, and retrieval for relevant queries will be reliable.
But if the chunk is poorly constructed, the vector suffers for it. A chunk that cuts halfway through a table, or spans two unrelated topics because the document was split at arbitrary character counts, produces a vector that doesn't sit cleanly anywhere. It gets pulled in multiple semantic directions at once. It will be retrieved for some of the right things, missed for others, and returned for irrelevant queries it happens to partially match.
In other words, the chunking strategy determines embedding quality. Embedding quality determines where content sits in vector space. Where content sits in vector space determines whether it gets retrieved. Whether it gets retrieved determines what context the LLM receives. And what context the LLM receives determines the quality of the output.
Every answer a retrieval-augmented AI produces is, at some level, a downstream consequence of a chunking decision that was made when the corpus was first indexed. Most people using these tools have no idea that decision was even made, let alone what it was.
Good chunking respects semantic boundaries. It keeps complete arguments, procedures, or explanations together. More sophisticated strategies use hierarchical chunking: small, precise child chunks for accurate retrieval, with pointers to larger parent chunks that provide full context once the relevant section is found. The best implementations are document-structure-aware, treating headers, tables, and lists differently from flowing prose.
Poor chunking uses naive fixed-size splits, for example, every 500 tokens, regardless of where that lands in the content's logical structure. It's easy to implement, but it produces mediocre retrieval at best.
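The difference is easy to demonstrate on a hypothetical two-section document: a naive fixed-size split lands wherever the character count says, while a structure-aware split follows the document's own headings.

```python
# A toy document with two unrelated sections, as in the article's example.
document = """## Major incident handling
Page the on-call engineer and open a bridge.

## Annual leave policy
Submit requests in the HR portal two weeks ahead."""

def fixed_size_chunks(text, size=60):
    """Naive strategy: split every `size` characters, wherever that lands."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def heading_chunks(text):
    """Structure-aware strategy: split on the document's own headings,
    so each chunk covers exactly one topic."""
    return ["## " + part.strip() for part in text.split("## ") if part.strip()]

naive = fixed_size_chunks(document)
aware = heading_chunks(document)
# The naive split cuts mid-sentence and mixes the end of the incident
# section with the start of the leave policy in a single chunk; the
# structure-aware split keeps each section intact.
```

The mixed-topic chunk in `naive` is exactly the kind whose vector "gets pulled in multiple semantic directions at once".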
Hybrid search and the reranker
Neither keyword nor semantic search alone is sufficient, which is why the strongest retrieval architectures run both in parallel.
Keyword search finds precise matches that semantic search can miss, for example, specific product codes, named processes, or exact policy references. Semantic search finds conceptually relevant content that keyword search cannot reach. Running both and combining their result sets is called hybrid retrieval, and it consistently outperforms either approach individually.
But combining two ranked lists of results creates a new problem: how do you decide which of the combined results are actually most useful? This is where reranking enters. A reranker is, again, a separate component: typically a smaller model trained specifically for relevance scoring rather than for generation or embedding. It takes the combined retrieval candidates and scores each one for how well it actually addresses the specific query. It doesn't find documents; it judges the quality of the documents already found. The top-scored results after reranking are what get passed to the LLM as context.
The difference this makes is significant. Without reranking, retrieved results are ordered by how closely their vector matched the query vector, or how frequently query terms appeared. With reranking, they're ordered by how well they actually answer the question. These are not the same thing.
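One common way to merge the two result lists is reciprocal rank fusion, sketched below with hypothetical document IDs. The `rerank` function shows only the shape of the reranking step: `score_fn` is a stand-in for where a trained relevance model would sit.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Combine several ranked result lists into one. Each document earns
    1/(k + rank) per list it appears in, so documents ranked well by
    either retriever rise to the top of the fused list."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs from the two retrievers for one query.
keyword_results = ["policy-ref-7", "doc-a", "doc-b"]
semantic_results = ["doc-a", "doc-c", "policy-ref-7"]

fused = reciprocal_rank_fusion([keyword_results, semantic_results])
# "doc-a" and "policy-ref-7" appear in both lists, so they lead the fusion.

def rerank(query, candidates, score_fn):
    """Re-order candidates by how well each answers the query.
    `score_fn(query, doc)` stands in for a trained relevance model."""
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
```

Fusion only reconciles the two retrievers' rankings; the reranker then re-orders the survivors by actual relevance to the question, which is the step that changes answer quality most.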

MCP: Pre-built retrievers and the control trade-off
The Model Context Protocol is worth understanding in this context because it changes the architecture of retrieval in a specific way.
MCP is an open standard that allows a language model to connect to external tools and data sources through a consistent interface. In the context of retrieval, an MCP server can expose a pre-built retriever for a given data source (for example, SharePoint, a database, or a ticketing system) so that an AI client (the interface you're used to working from) can invoke it without having to build the retrieval pipeline from scratch.
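At the wire level, an MCP tool invocation is a JSON-RPC 2.0 request using the protocol's `tools/call` method. The tool name `search_documents` and its arguments below are hypothetical: the actual names depend entirely on what the particular MCP server exposes.

```python
import json

# The wire shape of an MCP tool call (JSON-RPC 2.0, per the MCP spec).
# Tool name and arguments are invented for illustration.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_documents",
        "arguments": {"query": "P1 escalation procedure", "top_k": 5},
    },
}

payload = json.dumps(request)
# The client sends this to the MCP server. Everything behind the tool
# (chunking, embedding, search configuration, reranking) happens on the
# server side, invisible to the caller.
```

Notice what the request does not contain: no chunking parameters, no embedding model choice, no search weights. That opacity is precisely the trade-off discussed below.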
This is genuinely useful. It reduces the development overhead of connecting an AI tool to an enterprise knowledge base. Instead of building and maintaining a custom indexing pipeline, a vector database, and a search service, you point the client at an MCP server and the retriever is already there.
But there is a trade-off. When you use a pre-built retriever exposed via MCP, you are accepting the indexing and retrieval decisions of whoever built that MCP server. Their chunking strategy. Their embedding model. Their search configuration. Their reranking logic.
You get faster integration. You lose control over the design variables that determine retrieval quality. If the MCP server for your SharePoint connector was built with naive fixed-size chunking and no reranking, that is what you get, and there is no configuration surface to change it. The ceiling on retrieval quality is set by someone else's implementation decisions.
A bespoke retrieval pipeline built and owned by your organisation gives you full control over every layer: embedding model selection, chunking strategy, vector database configuration, hybrid search weighting, reranker choice. It means more development overhead, but a substantially higher ceiling for retrieval quality, and the ability to diagnose and improve it over time.
This is the architectural lens through which differences in AI tool performance often become explicable. Two tools can be connected to the same SharePoint site, using models from the same underlying provider, and produce dramatically different answers, because one is querying a thoughtfully designed retrieval pipeline and the other is querying a default connector with no structural investment in retrieval quality. The model gets the blame, but the retriever is the cause.
What this means for how you evaluate AI tools
The practical implication of all of this is that the right question when evaluating an AI knowledge tool is not "which LLM does it use?" It is: "what does the retrieval layer look like, and how much control do I have over it?"
Most vendors will not volunteer this information. The retrieval architecture is considered implementation detail, not customer-facing. But it is the primary determinant of output quality for any retrieval-augmented application. Ask the question explicitly. Ask what chunking strategy is used. Ask whether retrieval is hybrid or semantic-only. Ask whether a reranker is applied. Ask whether the indexing pipeline is configurable.
If the answer is "we handle all of that", understand that you are accepting someone else's design decisions as the quality boundary for your AI investment.
The organisations making genuine progress with AI in 2026 are not doing so because they found the best model. They are doing so because they have understood that the quality of AI output is an infrastructure question, not a model question. The retrieval layer is that infrastructure.