RAG in 2025: From Retrieval to Context Engines

What Naive RAG Got Wrong

The original RAG recipe was seductive in its simplicity: embed your documents, store the vectors, embed the query, pull the top-k chunks by cosine similarity, stuff them into the prompt. Dozens of frameworks sprouted to make this five-step process even easier. For demos, it worked beautifully.
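The recipe above can be sketched in a few lines of Python. This is a minimal illustration rather than any framework's implementation: the embed function is a toy bag-of-words stand-in for a real embedding model, and the names top_k, build_prompt, and CHUNKS are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine_similarity(q, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]

def build_prompt(query, chunks, k=2):
    """Stuff the retrieved chunks into a prompt, as the naive recipe does."""
    context = "\n".join(f"- {c}" for c in top_k(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical corpus, already split into chunks.
CHUNKS = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "Return requests must include the original order number.",
]

prompt = build_prompt("refund for a return request", CHUNKS)
```

Note that the toy embedding treats "refund" and "refunds" as unrelated terms. A learned embedding model would not, but the ranking logic around it is the same.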

Production was another story. Chunk boundaries sliced sentences mid-thought. Top-k retrieval surfaced passages that were semantically similar but contextually irrelevant. Long documents lost their structural coherence when broken into 512-token shards. Models hallucinated confidently when the retrieved context addressed only part of the question.
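The chunk-boundary problem is easy to reproduce. The sketch below splits by a fixed word count as a simplified stand-in for fixed-token chunking. The function name and the sample text are made up for illustration.

```python
def fixed_size_chunks(text, size):
    """Split text into chunks of `size` words, ignoring sentence boundaries."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Hypothetical source text with three short sentences.
TEXT = (
    "The warranty covers parts for two years. "
    "Labour is covered for ninety days only. "
    "Claims require proof of purchase."
)

chunks = fixed_size_chunks(TEXT, size=10)

# A chunk that does not end in sentence punctuation was cut mid-sentence.
cut_mid_sentence = [c for c in chunks if not c.endswith((".", "!", "?"))]
```

Here the first chunk ends with "Labour is covered" and the second begins with "for ninety days only.", so neither chunk on its own states the labour terms correctly.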

By early 2025, the industry had a vocabulary for the failure modes: context pollution, retrieval recall gaps, positional bias (models preferring content at the start and end of context), and semantic drift when query embeddings and document embeddings lived in misaligned spaces.

The Context Engineering Turn

The paradigm shift that defined 2025 RAG was reconceptualising the entire pipeline as context engineering rather than retrieval engineering. The question stopped being "how do I find the most similar text?" and became "how do I construct the optimal context window for this query?"

This reframe unlocked a wave of architectural innovations. Hybrid retrieval, which combines dense vector search with sparse BM25 keyword matching, dramatically improved recall on specific terms, product names, and codes that embeddings flatten into vagueness. Re-ranking with cross-encoder models became a standard second pass: the query and each retrieved candidate are scored together in a single model pass rather than compared through independently computed embeddings.
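One way to combine a sparse and a dense ranking is reciprocal rank fusion (RRF). The article does not say which fusion method any particular system uses, so treat this as one common option rather than the method. The sketch assumes a small in-memory corpus, implements Okapi BM25 directly instead of calling a search library, and takes the dense ranking as a given list, since producing it would need an embedding model.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep hyphenated codes such as 'xr-200' as one term."""
    return re.findall(r"[a-z0-9\-]+", text.lower())

def bm25_ranking(query, docs, k1=1.5, b=0.75):
    """Return document indices ordered by Okapi BM25 score, best first."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for t in tokenized for term in set(t))
    n = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first rankings; each adds 1 / (k + rank) per document."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda d: fused[d], reverse=True)

# Hypothetical corpus.
DOCS = [
    "Troubleshooting guide for the XR-200 router firmware.",
    "General advice on fixing home network problems.",
    "Release notes for the XR-100 router.",
]

sparse = bm25_ranking("XR-200 router firmware", DOCS)
dense = [1, 0, 2]  # assumed output of a vector search, best first
fused = reciprocal_rank_fusion([sparse, dense])
```

In this example the assumed dense ranking prefers the general networking document, while BM25 matches the exact product code. After fusion, the XR-200 guide ranks first.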

Contextual compression emerged as a first-class concern: instead of feeding raw chunks verbatim, systems began post-processing retrieved text to strip irrelevant sentences, preserving only the spans directly relevant to the query. The context window is expensive — every token counts.
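As a rough illustration of contextual compression, the sketch below keeps only the sentences in a chunk that share a content word with the query. Lexical overlap is a deliberately crude stand-in: production systems typically score sentences with a model. The stopword list, function names, and sample chunk are all hypothetical.

```python
import re

# Small illustrative stopword list, not a complete one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "on", "for",
             "what", "how", "do", "i"}

def content_terms(text):
    """Lowercased terms with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}

def compress_chunk(query, chunk, min_overlap=1):
    """Keep only sentences sharing at least min_overlap content terms with the query."""
    query_terms = content_terms(query)
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences
            if len(query_terms & content_terms(s)) >= min_overlap]
    return " ".join(kept)

CHUNK = (
    "The P40 battery is rated for 1500 charge cycles. "
    "Our headquarters moved to Leeds in 2019. "
    "Battery replacement takes about two hours."
)

compressed = compress_chunk("How long does battery replacement take?", CHUNK)
```

The sentence about the headquarters is dropped and the two battery sentences survive, which shortens the context without touching the relevant spans.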

Agentic RAG and the Multi-Step Retrieval Loop

Static single-shot retrieval was the first thing to go in more demanding applications. Agentic RAG — where the model itself decides what to retrieve, evaluates the result, and issues follow-up queries — gained serious traction through 2025. The pattern looks less like a database lookup and more like a research workflow.

The model receives the initial query, formulates a retrieval strategy (what collections to search, what filters to apply, what keywords to emphasise), evaluates whether the returned context is sufficient to answer, and iterates if not. This adds latency but dramatically improves answer quality for complex questions that require synthesising information across multiple documents or resolving ambiguous queries.
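The loop described above can be sketched in plain Python. This is not the API of any framework. The three callables plan_query, search, and is_sufficient stand in for model calls and a retrieval backend, and the stub versions below are hard-coded so that the example runs on its own.

```python
from dataclasses import dataclass, field

@dataclass
class RetrievalState:
    question: str
    context: list = field(default_factory=list)
    queries_tried: list = field(default_factory=list)

def agentic_retrieve(question, plan_query, search, is_sufficient, max_steps=3):
    """Plan a query, retrieve, check sufficiency, and repeat up to max_steps times."""
    state = RetrievalState(question)
    for _ in range(max_steps):
        query = plan_query(state)
        state.queries_tried.append(query)
        for passage in search(query):
            if passage not in state.context:
                state.context.append(passage)
        if is_sufficient(state):
            break
    return state

# Hypothetical stubs. In a real system each would be a model or index call.
CORPUS = {
    "acme ceo": ["Acme's CEO is Dana Reyes."],
    "dana reyes education": ["Dana Reyes studied physics at Leeds."],
}

def plan_query(state):
    # Stand-in for the model choosing its next query from what it has so far.
    return "acme ceo" if not state.context else "dana reyes education"

def search(query):
    return CORPUS.get(query, [])

def is_sufficient(state):
    # Stand-in for the model judging whether the context answers the question.
    return any("studied" in passage for passage in state.context)

result = agentic_retrieve("Where did Acme's CEO study?", plan_query, search, is_sufficient)
```

The max_steps cap is a design choice worth keeping in any real version, since it bounds both latency and token spend when the sufficiency check never passes.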

Frameworks like LangGraph and LlamaIndex Workflows formalised this loop, giving developers explicit control over the retrieval-reasoning cycle without having to build the state machine from scratch. The tradeoff is cost: every extra iteration adds another model call and another retrieval round, and agentic RAG can run 3–5x more per query in tokens and latency than single-shot retrieval.

Multimodal and Structured Data RAG

Text-over-text RAG was only ever a partial solution for enterprises whose knowledge lived in PDFs with charts, spreadsheets, databases, and slide decks. By mid-2025, multimodal RAG had moved from research curiosity to production-viable pattern.

Vision-language models enabled document understanding that preserved layout and visual structure rather than degrading everything to raw OCR text. Tables could be queried as tables. Charts could be interpreted visually. This mattered enormously for financial, legal, and technical domains where structure carries as much meaning as text content.

Structured data RAG, which uses language models to generate SQL or API queries from natural language, executes them against live databases, and feeds the results back into the model, also matured significantly. On the text-to-SQL benchmark Spider, state-of-the-art accuracy climbed above 90% on several leaderboards by late 2025, making natural language database interfaces genuinely viable for internal enterprise tooling.
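A minimal sketch of the generate, execute, and feed-back cycle, using Python's built-in sqlite3 module and an in-memory database. The generate_sql function is a hypothetical stand-in for a model call and simply returns a fixed query. The orders table and its rows are invented for the example.

```python
import sqlite3

def generate_sql(question, schema):
    """Hypothetical stand-in for a model call that writes SQL for the question."""
    return ("SELECT region, SUM(amount) AS total FROM orders "
            "GROUP BY region ORDER BY total DESC")

def run_readonly(conn, sql):
    """Execute generated SQL, rejecting anything that is not a SELECT."""
    # Crude guard for illustration; a real system should also use a
    # read-only connection or database role.
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    cursor = conn.execute(sql)
    columns = [d[0] for d in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)

SCHEMA = "orders(region TEXT, amount REAL)"
rows = run_readonly(conn, generate_sql("Which region has the highest sales?", SCHEMA))

# The query result becomes the context handed back to the model.
context = "\n".join(f"{r['region']}: {r['total']}" for r in rows)
```

Passing the schema to the generation step matters in practice, because the model can only write correct SQL for tables and columns it has been shown.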

When RAG Isn't the Answer

The maturation of RAG also brought a clearer understanding of when not to use it. For knowledge that is small, stable, and relevant to every query, fine-tuning or placing the knowledge directly in the system prompt often outperforms retrieval on both quality and latency. RAG shines when the knowledge base is large, dynamic, or needs to be attributable, as when users want to see which source an answer came from.

The emergence of million-token context windows in models like Claude and Gemini 2.5 Pro also changed the calculus for medium-sized knowledge bases. If your entire corpus fits in 200K tokens, stuffing it directly into the context can outperform RAG because there is no retrieval step to miss a relevant passage. The positional bias described earlier still applies to long prompts, and the compute cost per query is higher, but the implementation complexity is dramatically lower.
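A quick way to check whether a corpus is a candidate for direct stuffing is to estimate its token count against a budget. The sketch below assumes roughly four characters per token, which is only a rule of thumb for English text. A real check should use the target model's own tokenizer. The 200K budget comes from the paragraph above, and the reserve for the question and answer is a hypothetical figure.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate, rounded up; assumes about 4 characters per token."""
    return -(-len(text) // chars_per_token)  # ceiling division

def fits_in_context(documents, context_budget=200_000, reserve=8_000):
    """Return (fits, total_tokens), leaving `reserve` tokens for question and answer."""
    total = sum(estimate_tokens(d) for d in documents)
    return total <= context_budget - reserve, total

# Hypothetical corpus: one document of 400,000 characters.
small_corpus = ["a" * 400_000]
fits, total = fits_in_context(small_corpus)
```

If the check fails, or the corpus changes often enough that re-sending it on every query is wasteful, retrieval remains the better fit.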

The honest answer for 2025 is that RAG is now one tool in a broader context engineering toolkit rather than the default answer to every knowledge integration problem. Choosing it deliberately — and implementing it well — is what separates the teams getting real value from the ones debugging retrieval pipelines indefinitely.