Agent Builder is available now as a tech preview. Get started with an Elastic Cloud Trial, and check out the documentation for Agent Builder here.
For the last few years, prompt engineering has dominated the conversation in AI. The focus was on finding the perfect words and phrasing to coax the right response from an LLM. But as we build more sophisticated AI agents designed to tackle complex, multi-step tasks, the focus is shifting towards the challenges of managing context.
An LLM's context is like a human's short-term memory. Trying to stuff it with every potentially relevant piece of information backfires. It leads to context rot, a phenomenon in which the model's 'attention budget' gets exhausted, causing it to lose focus and derail its own reasoning. For example, the recent "needle in a haystack" (NIAH)-style benchmark NOLIMA: Long-Context Evaluation Beyond Literal Matching noted that performance "degrades significantly as context length increases," observing that at 32K tokens, "11 models drop below 50% of their strong short-length baselines." Inefficiency within the context window directly affects the agent's reliability.
The solution isn't just better prompts, but an emerging discipline: context engineering. It’s the art of managing the model's limited attention.
Beyond RAG: the shift to "just-in-time" context for AI agents
Previously, the standard approach was RAG: retrieval happened before inference, with all potentially relevant data processed and fed into the prompt upfront for the model to reason over.

As reasoning models become more intelligent, cheaper, and better at handling long contexts, the trend is shifting toward agents that use tools for 'just-in-time' retrieval. This allows the agent to iteratively probe sources, breaking down and reformulating queries to adapt its search as it learns. Engineering focus shifts from exhaustive data pre-processing to equipping the agent with the right tools and honing its ability to discover and steer towards the most relevant context on its own.
Context engineering isn't just about the instructions we give an agent; it's about curating and managing what enters the model's context window at any given moment. When an agent runs in a loop, it generates a constantly evolving universe of information - user messages, tool outputs, and its own internal thoughts. Context engineering is the art and science of deciding what from that universe makes it into the agent's limited "working memory."
This is where relevance in retrieval becomes the cornerstone of effective context engineering. For an agent to perform reliably - whether in financial research or customer support - it needs access to vast amounts of domain-specific knowledge. Relevance in retrieval acts as the essential bridge between an agent's powerful reasoning and the data that gives it purpose.
In this post, we'll break down the retrieval strategies that are key to mastering context engineering. We'll show how a deliberate focus on relevance helps build smarter, more efficient, and more reliable AI agents.
Relevance in context retrieval
To engineer for high relevance - improving both recall and precision - we recommend starting with a hybrid search strategy for retrieval. This approach combines the precision of lexical search for specific identifiers with the semantic recall of vector search, maximizing the signal-to-noise ratio of the retrieved context.
This hybrid approach acts as the first line of defense against context rot, ensuring that only the most potent information enters the agent's limited working memory.
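As a rough illustration, here is what a hybrid request can look like with the Elasticsearch Python client, combining a lexical match clause with an approximate kNN clause in a single search. The index name, field names, and the pre-computed query vector are assumptions for this sketch; our benchmark used ELSER (sparse) rather than dense vectors, but the overall shape of the request is the same.

```python
from elasticsearch import Elasticsearch

# Connection details are placeholders.
client = Elasticsearch("http://localhost:9200")

def hybrid_search(question: str, question_vector: list[float], k: int = 10):
    """One simple form of hybrid retrieval: Elasticsearch blends the BM25
    score from `query` with the vector similarity from `knn`."""
    return client.search(
        index="support-docs",               # assumed index name
        size=k,
        query={                             # lexical side: exact terms, IDs, error codes
            "match": {"body": question}
        },
        knn={                               # semantic side: dense-vector recall
            "field": "body_embedding",      # assumed dense_vector field
            "query_vector": question_vector,
            "k": k,
            "num_candidates": 100,
        },
    )
```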

Comparison of different strategies using the Recall@10 and MRR@10 metrics
In an internal benchmark of retrieval strategies, we found that hybrid approaches often offer the best balance between IR metrics like Recall (completeness) and MRR (ranking quality). The chart shows that a hybrid approach combining sparse ELSER with dense vector search successfully located 84.3% of all relevant documents within the top 10 results (Recall@10 of 0.843). Furthermore, the MRR of 0.53 indicates that the single most important document was, on average, found within the top 2 positions.
Semantic chunking and retrieval
The best results, however, also depend on how context is structured. LLMs perform better when they are provided with coherent context. By breaking content into chunks that group related concepts together, and then retrieving not just the target chunk but its immediate surroundings, we equip the agent with richer, more complete context. This helps mitigate the performance degradation and brittleness associated with fragmented or disjointed contexts, enabling more reliable and effective reasoning.
This strategy is adaptable across different types of agents:
- A knowledge agent processing legal documents might chunk by paragraphs or sections. When a specific clause is retrieved, providing the preceding and succeeding paragraphs & sections ensures the agent understands the full legal argument. The chunks can be linked with lightweight identifiers that point back to their source document and page number.
- A coding agent can apply this by chunking a codebase semantically - by functions, classes, or logical blocks. When a function is identified as relevant, retrieving the entire class it belongs to, along with its import statements and docstrings, gives the agent a complete, executable unit of logic to work with, which can then be made accessible to other tools.
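Both variants reduce to the same retrieval pattern: find the best-matching chunk, then pull in its neighbors. Below is a minimal sketch of that second step, assuming each chunk was indexed with a `doc_id` and a sequential `chunk_index` (the `legal-chunks` index and field names are illustrative):

```python
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")

def retrieve_with_neighbors(hit: dict, window: int = 1) -> list[dict]:
    """Given a relevant chunk, also fetch its neighboring chunks so the
    agent sees the surrounding argument, not an isolated fragment."""
    doc_id = hit["_source"]["doc_id"]
    idx = hit["_source"]["chunk_index"]

    resp = client.search(
        index="legal-chunks",                      # assumed chunk index
        size=2 * window + 1,
        query={
            "bool": {
                "filter": [
                    {"term": {"doc_id": doc_id}},  # stay within the source document
                    {"range": {"chunk_index": {
                        "gte": idx - window,       # preceding chunk(s)
                        "lte": idx + window,       # succeeding chunk(s)
                    }}},
                ]
            }
        },
        sort=[{"chunk_index": "asc"}],             # keep reading order intact
    )
    return [h["_source"] for h in resp["hits"]["hits"]]
```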

In another internal benchmark focused on long-context RAG, we evaluated how effective semantic highlighting is at retrieving the most relevant chunks from documents, reducing the need to pass entire documents to the agent. As shown in the chart, we experimented with different combinations of document counts and highlighted chunk counts, measuring their impact on retrieval accuracy (Hit Rate) and context efficiency. Hit Rate is a binary metric that indicates whether the agent received all the information necessary to complete its task.
For our dataset, the sweet spot was retrieving five semantic fragments from the top five documents (K=5). This setup allowed the agent to access all required information in 93.3% of cases, while reducing context size by over 40%, significantly cutting down the agent’s processing load.
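A sketch of what this kind of retrieval can look like, assuming an index with a `semantic_text` field named `content` and an Elasticsearch version whose highlighter can return scored semantic fragments; the exact parameters will vary with your setup:

```python
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")

def top_fragments(question: str, docs: int = 5, fragments: int = 5) -> list[dict]:
    """Return only the highest-scoring fragments of the top documents,
    rather than handing the agent each document in full."""
    resp = client.search(
        index="knowledge-base",                          # assumed index
        size=docs,                                       # top K documents
        query={"semantic": {"field": "content", "query": question}},
        highlight={
            "fields": {
                "content": {
                    "type": "semantic",                  # semantic highlighter on the semantic_text field
                    "number_of_fragments": fragments,    # fragments kept per document
                    "order": "score",                    # most relevant fragments first
                }
            }
        },
    )
    return [
        {"id": hit["_id"], "fragments": hit.get("highlight", {}).get("content", [])}
        for hit in resp["hits"]["hits"]
    ]
```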
Agentic search
For more complex questions that require extensive exploration, a simple relevance search is insufficient. Answering these queries demands a deeper level of understanding that can only be achieved through a process of discovery and filtering. Traditional retrieval tools, often constrained by rigid query templates, lack the intelligence for this dynamic source selection and exploration. Instead, the agent needs to reason about the information it finds, filter out noise, and pursue new paths as insights emerge in an iterative loop.

To achieve this, we can leverage agent-based retrieval tools that operate outside the main agent's context window. This can be conceptualized as a specialized "sub-agent" architecture, as described in some multi-agent systems. This sub-agent is tasked with the intensive work of exploration. A core function is its ability to rewrite the main agent's high-level natural language request into a precise, structured query tailored for the selected data source. For example, it can translate a user's question into a specific ES|QL query to retrieve relevant content from an Elasticsearch index or datastream. The entire back-and-forth of discovery, reasoning, and filtering occurs within this sub-agent's separate context. It then returns the results or aggregations, keeping the main agent's "working memory" clean and allowing it to focus on the high-level plan.
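A stripped-down sketch of that pattern is shown below. The `translate_to_esql` helper is a hypothetical stand-in for an LLM call, and the ES|QL endpoint assumes a cluster and Python client version that expose the ES|QL API; only the compact result, never the exploration itself, flows back into the main agent's context.

```python
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")

def translate_to_esql(request: str, index: str) -> str:
    """Hypothetical LLM call: given the index mapping and the main agent's
    natural-language request, emit a precise ES|QL query."""
    raise NotImplementedError("wire this up to your LLM of choice")

def search_subagent(request: str, index: str = "support-tickets") -> dict:
    """Run the discover/query/filter loop outside the main agent's context
    window and return only a small, structured result."""
    esql = translate_to_esql(request, index)
    resp = client.esql.query(query=esql)    # assumes the ES|QL query API is available
    # Columns and rows only; raw documents never enter the main agent's context.
    return {"query": esql, "columns": resp["columns"], "rows": resp["values"]}
```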
This strategy is adaptable across different types of agents:
- A coding agent tasked with a complex search can use a search sub-agent to explore a large codebase. This sub-agent would spend its cycles grepping through folders, searching code chunk embeddings, and tracing dependencies to find a particular file or function. Instead of returning thousands of lines of raw code, it delivers a concise summary of the relevant file paths and code snippets.
- A knowledge agent researching a topic can use a sub-agent to sift through a vast document store. It would autonomously explore the data, disregard documents that aren't useful, and synthesize the key findings from multiple sources. The main agent then receives a curated summary of highlights, rather than the full text of every retrieved document.

In an internal experiment, we compared two approaches for answering analytical questions like, “Today is Jul 8 2025. How many support tickets were created in the last 365 days and how many of those are still open?” To answer this, the LLM had to filter a small index of 360 records and 12 fields (including text, date, and integer types) by the date field and then count the tickets by status.
1. The Brute-Force Method: This involved feeding the entire CSV directly into a state-of-the-art LLM's context window.
2. The Agentic Method: This approach provided an agent with a tool capable of translating natural language into a precise ES|QL query, which was then run against an index containing the CSV data.
The findings, illustrated in the chart above, highlight that LLMs, when given the entire index as context, experienced catastrophic failure in most cases. In contrast, the agent, equipped with the ES|QL tool, achieved a 100% success rate.
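For reference, the query generated by the tool for this question looks roughly like the ES|QL in the sketch below (the index and field names are illustrative, not the exact ones from the benchmark):

```python
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")

# Roughly the shape of ES|QL the tool produces for the ticket question:
# filter to the last 365 days, then aggregate by status.
ESQL = """
FROM support-tickets
| WHERE created_at >= NOW() - 365 days
| STATS tickets = COUNT(*) BY status
"""

resp = client.esql.query(query=ESQL)
# A handful of aggregated rows replaces 360 raw records in the agent's
# context; the agent sums them for the total and reads off the "open" count.
print(resp["columns"], resp["values"])
```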
Domain-specific tools & workflows
While agentic search is powerful for complex, open-ended questions, it's not always the most efficient or reliable approach. For many business-critical operations, especially in regulated domains like finance, tasks follow a predictable, repeatable sequence. In these scenarios, allowing a generalist agent to reason and explore from scratch every time is not only wasteful in terms of token consumption and latency, but also introduces unnecessary variance.
This is where a more deterministic approach, using domain-specific tools & workflows, can provide a better solution. Instead of a single sub-agent, a workflow tool is a pre-defined chain of steps that are executed in sequence. The agent’s role shifts from being a planner to being a highly effective executor of these high-level workflows. The planning is done upfront by the domain expert who designs the workflow, ensuring that each step receives precisely the context it needs from the previous one.
Customer support tool example: Get similar resolved customer tickets
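The original tool definition isn't reproduced here, but a sketch of its underlying search might look like the following, with the index, fields, and status value as assumptions:

```python
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")

def similar_resolved_tickets(issue_description: str, k: int = 5) -> list[dict]:
    """Tool: return previously *resolved* tickets most similar to the
    customer's current issue, trimmed to the fields the agent needs."""
    resp = client.search(
        index="support-tickets",                       # assumed index
        size=k,
        query={
            "bool": {
                "filter": [                            # hard filter: only high-quality data
                    {"term": {"status": "resolved"}}
                ],
                "must": [                              # relevance search over the issue text
                    {"match": {"description": issue_description}}
                ],
            }
        },
        source=["ticket_id", "title", "resolution", "resolved_at"],  # curate context fields
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```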
This query optimizes the workflow by first filtering for high-quality data (like 'resolved' tickets), then using relevance search to pinpoint the best solutions, all while curating the exact fields returned to the model for context.
Finance workflow tool example: KYC checks
A "Know Your Customer" (KYC) process is a non-negotiable checklist.
- Verify Identity: The tool checks the customer's details against internal bank records.
- External Screening: It then queries external APIs for sanctions lists and adverse media mentions.
- Generate Report: Finally, it compiles all findings with an LLM into a standardized risk assessment for audit.
This deterministic flow also ensures every customer is vetted against the exact same criteria, essential for regulators.
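A minimal sketch of such a workflow tool, with each step as a plain function executed in a fixed order (the step bodies are placeholders to be wired to real systems):

```python
from dataclasses import dataclass

@dataclass
class KYCReport:
    customer_id: str
    identity_verified: bool
    sanctions_hits: list
    adverse_media: list
    summary: str

def verify_identity(customer_id: str) -> bool:
    # Placeholder: check customer details against internal bank records.
    return True

def external_screening(customer_id: str) -> tuple[list, list]:
    # Placeholder: query external sanctions-list and adverse-media APIs.
    return [], []

def summarize_findings(customer_id: str, verified: bool, sanctions: list, media: list) -> str:
    # Placeholder: have an LLM compile findings into a standardized risk assessment.
    return f"Customer {customer_id}: identity_verified={verified}, sanctions={len(sanctions)}, media={len(media)}"

def kyc_workflow(customer_id: str) -> KYCReport:
    """Deterministic chain: the order and inputs of each step are fixed by the
    workflow author, not chosen by the agent at run time."""
    verified = verify_identity(customer_id)
    sanctions, media = external_screening(customer_id)
    summary = summarize_findings(customer_id, verified, sanctions, media)
    return KYCReport(customer_id, verified, sanctions, media, summary)
```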
By enforcing a deterministic execution path, this workflow-based methodology significantly enhances operational reliability and mitigates unpredictable agentic behavior. Consequently, it achieves efficiency gains through reduced token consumption and lower latency, eliminating the unnecessary reasoning cycles common in fully agentic approaches.
Context compaction, long-term memory and semantic retrieval
One strategy for dealing with context limits is context compaction, or compression. This involves taking the existing conversation history, including multiple tool executions and their results, and condensing it into a concise summary. This compressed information can then be stored externally, creating a persistent memory that the agent can access later in the same conversation or even across different sessions.
However, this compression requires a delicate balance. The core challenge is that an agent must decide its next action based on all prior states, and it's impossible to know which seemingly minor observation might become critical later. Any irreversible compression therefore risks discarding subtle but crucial details. To navigate this trade-off, Anthropic offers this advice in Effective context engineering for agents: “Start by maximizing recall to ensure your compaction prompt captures every relevant piece of information from the trace, then iterate to improve precision by eliminating superfluous content.”
This very principle of first prioritizing recall and then refining for precision is where techniques like semantic search become critical. By enabling the agent to efficiently query its external memory, semantic search helps retrieve the most relevant context for the task at hand, ensuring efficient use of its limited context window.
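A rough sketch of that loop (compact, store externally, then retrieve semantically) is shown below; the `agent-memory` index, its semantic `summary` field, and the summarization helper are all assumptions:

```python
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")

def summarize_with_llm(history: list[dict]) -> str:
    """Hypothetical compaction call: maximize recall first, then iterate on
    the prompt to trim superfluous content."""
    raise NotImplementedError("wire this up to your LLM of choice")

def compact_and_store(session_id: str, history: list[dict]) -> None:
    """Compress the conversation so far and persist it as long-term memory."""
    summary = summarize_with_llm(history)
    client.index(
        index="agent-memory",                    # assumed index with a semantic_text 'summary' field
        document={"session_id": session_id, "summary": summary},
    )

def recall_memory(task: str, k: int = 3) -> list[str]:
    """Semantically retrieve only the stored memories relevant to the task,
    instead of reloading entire past conversations into context."""
    resp = client.search(
        index="agent-memory",
        size=k,
        query={"semantic": {"field": "summary", "query": task}},
    )
    return [hit["_source"]["summary"] for hit in resp["hits"]["hits"]]
```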
Dynamic tool discovery
As the number of external tools and APIs available to large language models (LLMs) explodes, a significant challenge arises in effective tool discovery and selection. When an LLM is presented with a vast library of functions, two primary problems emerge: prompt bloat and decision overhead. Prompt bloat occurs because describing every available tool within the model's limited context window is impractical and consumes an enormous number of tokens. This not only strains the model's capacity but also leads to decision overhead, where the LLM becomes confused by the sheer volume of choices, especially when many tools have similar or overlapping functionalities. This complexity increases the likelihood of errors, causing the model to select a suboptimal tool or even hallucinate one that doesn't exist, ultimately degrading its performance.
To address this scaling problem, many are exploring solutions in semantic search and retrieval to dynamically manage the toolset. Rather than presenting every tool to the model at once, this approach uses a retrieval system to first understand the meaning of a user's query. This system then performs a semantic search across a large external index of tools to find and retrieve only the most relevant options for the task at hand. By pre-filtering the choices and presenting the LLM with a small, highly relevant subset, this method significantly reduces prompt size and simplifies the model's decision-making process. This focus on relevance ensures the LLM is not burdened with finding a "needle in a haystack" but is instead given a curated toolkit, which boosts its accuracy in selecting the correct tool.
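A bare-bones sketch of the idea: embed every tool description once, then at query time expose only the handful of tools that are semantically close to the user's request. The `embed` function below is a toy stand-in; in practice the tool descriptions and embeddings would live in a vector index and be produced by a real embedding model.

```python
import math

def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashes character codes into a
    fixed-size vector. Replace with a real embedding endpoint."""
    vec = [0.0] * 64
    for i, ch in enumerate(text.lower()):
        vec[i % 64] += ord(ch)
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# A large registry of tools, each described in natural language and embedded once.
TOOL_INDEX = [
    {"name": "get_similar_resolved_tickets", "description": "Find resolved support tickets similar to an issue."},
    {"name": "run_esql_query", "description": "Translate a question into ES|QL and run it against an index."},
    # ...hundreds more in a real deployment
]
for tool in TOOL_INDEX:
    tool["embedding"] = embed(tool["description"])

def select_tools(user_query: str, k: int = 5) -> list[dict]:
    """Semantic pre-filter: present the LLM with only the k most relevant
    tools, instead of describing the entire registry in the prompt."""
    qv = embed(user_query)
    ranked = sorted(TOOL_INDEX, key=lambda t: cosine(qv, t["embedding"]), reverse=True)
    return [{"name": t["name"], "description": t["description"]} for t in ranked[:k]]
```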
There has been early research into this in RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation, as well as our own research using semantic search to select the right index to use with agentic search.
Conclusion
The evolution to context engineering marks a crucial maturation in how we use LLMs today. It signifies a move away from the simple art of crafting a perfect prompt towards the art of building and managing the environment for agents. The challenge is no longer just what we ask, but how we help the agent to discover its own answers efficiently and reliably within its finite context.
The strategies we've explored—from simple foundational hybrid retrieval and semantic chunking to the sophisticated trade-offs between exploratory agentic search and reliable, deterministic workflows—are all components of this new discipline. Each technique is designed to better manage the agent context. By engineering the right retrieval tools, memory systems, and dynamic function-calling capabilities, we can improve the agent’s effectiveness in its environment. The ability to manage context will determine whether AI agents remain experimental tools or evolve into dependable systems that can tackle sophisticated, production-level tasks.
In the next part of this series, we will explore this further, focusing on Elastic AI agents and the specific strategies we employ to manage context throughout an agent's lifecycle, from short-term memory to long-term knowledge.
Acknowledgements
- Abhimanyu Anand, for compiling the benchmarks that helped support these strategies.
- The Anthropic engineering team, for their foundational insights on context compaction strategies and the "recall then precision" principle.
- The research team at Chroma, for their work in identifying and popularizing the concept of "context rot."
- The authors of NOLIMA: Long-Context Evaluation Beyond Literal Matching, for their valuable benchmarks on long-context performance degradation.
- The authors of RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation, for their early research into dynamic tool retrieval and prompt bloat.




