
26/8/2025

Building an intelligent agentic RAG system with LangGraph

Traditional RAG (Retrieval-Augmented Generation) systems follow a simple pattern: they retrieve documents, then generate an answer. But what if your system could think about when to search, evaluate the quality of retrieved information, and even rephrase queries when results aren’t good enough?

Enter agentic RAG: a more intelligent approach that uses autonomous agents to make decisions throughout the retrieval and generation process. Think of it as upgrading from a simple tool to an intelligent assistant that can reason about its own performance and adapt accordingly.

In this tutorial, we’ll walk you through building a sophisticated agentic RAG system using LangGraph, step by step. This system supervises itself, grades its own outputs, and iteratively improves its responses instead of just blindly retrieving and generating.

What makes this RAG system “agentic”?

The key difference between traditional RAG and agentic RAG lies in the decision-making capabilities. Our system demonstrates true agency through several key capabilities:

Intelligent query routing: A supervisor agent analyzes each query and decides whether to use retrieval or respond directly from its existing knowledge. Simple questions like “What is Python?” don’t need document search, while complex technical queries benefit from comprehensive retrieval.

Self-correcting queries: When initial retrieval fails to find relevant documents, the system doesn’t give up. Instead, it rephrases the query using different terminology and approaches, trying up to three times to find better results.

Multi-stage validation: The system implements a two-tier quality control process. First, it checks if retrieved documents are actually relevant to the query. Then, it evaluates whether the generated answer is both grounded in the retrieved information and useful to the user.

Graceful degradation: When confidence is low or information is insufficient, the system explicitly communicates its uncertainty rather than providing potentially misleading answers.

Architecture overview: the intelligent decision flow

Unlike linear RAG pipelines, our agentic RAG follows a sophisticated decision-making process that can adapt based on quality assessments at each stage:

Supervise → Retrieve Documents → Grade Documents → Generate → Grade Answer → Wrap Up

Each node in this workflow has specific responsibilities and can redirect the flow based on quality assessments. For example, if document grading reveals poor relevance, the system automatically triggers query rephrasing rather than proceeding with inadequate information.
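To make the branching concrete, the flow can be written down as a plain transition table. This is only an illustrative sketch in standard-library Python: the node names mirror the tutorial's workflow, while the outcome labels (such as "tool_call" or "relevant") and the walk helper are hypothetical names chosen for this example, and the retry limit is left out for brevity.

```python
# Hypothetical outcome labels; node names mirror the workflow in this tutorial.
TRANSITIONS = {
    ("supervise", "tool_call"): "retrieve_documents",
    ("supervise", "direct_answer"): "END",
    ("retrieve_documents", "done"): "grade_documents",
    ("grade_documents", "relevant"): "generate",
    ("grade_documents", "not_relevant"): "rephrase_query",
    ("rephrase_query", "done"): "supervise",
    ("generate", "useful"): "wrap_up",
    ("generate", "not_useful"): "rephrase_query",
    ("generate", "not_supported"): "express_uncertainty",
    ("wrap_up", "done"): "END",
    ("express_uncertainty", "done"): "END",
}


def walk(outcomes):
    """Follow the table from 'supervise', consuming one outcome per node."""
    node = "supervise"
    path = [node]
    for outcome in outcomes:
        node = TRANSITIONS[(node, outcome)]
        path.append(node)
        if node == "END":
            break
    return path


# A query that needs retrieval and succeeds on the first attempt
happy_path = walk(["tool_call", "done", "relevant", "useful", "done"])

# A query whose first retrieval is graded irrelevant and gets rephrased
retry_path = walk(["tool_call", "done", "not_relevant", "done", "direct_answer"])
```

Reading the table row by row shows where the loops are: both grade_documents and generate can send the flow back through rephrase_query to supervise.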

Building the system step by step

Let’s dive into building this intelligent system. We’ll start with the foundation and gradually add the sophisticated reasoning capabilities.

Prerequisites

Before we begin, ensure you have:

Python 3.13+ installed
An AWS account with Bedrock access configured
Basic understanding of LangChain concepts
Familiarity with graph-based workflows (helpful but not required)

Step 1: Setting up the modern development environment

We’ll use uv, a fast Python package manager that’s become increasingly popular for its speed and reliability. Think of it as a more efficient alternative to pip and virtual environments combined.

First, install uv on your system:

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

Now create your project structure:

mkdir agentic-rag-tutorial
cd agentic-rag-tutorial
uv init

The uv init command creates a basic Python project structure with proper configuration files, similar to how modern JavaScript tools work.

Step 2: Installing dependencies

One of uv’s strengths is its ability to manage dependencies efficiently. We’ll install our packages in logical groups to understand what each component does:

# Core workflow and AI components
uv add langgraph langchain-core langchain-community langchain-text-splitters
# AWS integration for production-ready LLMs
uv add langchain-aws boto3
# Vector database and document processing
uv add langchain-chroma chromadb pypdf
# Development tools
uv add --dev "langgraph-cli[inmem]"  # Quoted so shells like zsh don't expand the brackets

What each dependency does:

LangGraph: Orchestrates our multi-agent workflow
LangChain components: Provide the AI and document processing infrastructure
AWS/Bedrock: Gives us access to production-grade language models
ChromaDB: Handles vector storage and similarity search
PyPDF: Processes PDF documents for knowledge ingestion

Step 3: Designing the project structure

A well-organized project structure is crucial for maintainability. Here’s how we’ll organize our code:

agentic-rag-tutorial/
├── models/          # Data schemas and state definitions
├── nodes/           # Individual agent nodes with specific responsibilities
├── utils/           # Shared utilities (AWS, vector store setup)
├── resources/       # Knowledge base documents
├── main.py          # Workflow orchestration
└── configuration files

This structure separates concerns clearly: each node handles one specific task, utilities are reusable, and the main file focuses purely on connecting components together.

Step 4: Defining the data flow schema

In any multi-agent system, defining how data flows between components is critical. Our state schema acts as the “memory” that agents share throughout the conversation.

Create models/state.py:

from pydantic import BaseModel
from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages
from typing import List, Optional, Annotated, Sequence

class InputAgentState(BaseModel):
    """What the system receives from users"""
    messages: Annotated[Sequence[BaseMessage], add_messages] = []

class OutputAgentState(BaseModel):
    """What the system returns to users.

    main.py imports this class; its fields are an assumption here
    (only the conversation messages are exposed).
    """
    messages: Annotated[Sequence[BaseMessage], add_messages] = []

class AgentState(InputAgentState):
    """Complete state that flows between all agents"""
    original_user_query: Optional[str] = None
    documents: Optional[str] = None
    generated_answer: Optional[str] = None
    rephrased_queries: List[str] = []  # Tracks iteration attempts

Why this structure matters: The AgentState acts like a shared notebook that every agent can read from and write to. The rephrased_queries list is particularly important: it prevents infinite loops by tracking how many times we've attempted to improve the query.
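The "shared notebook" idea can be sketched with a simplified stand-in. The example below assumes the usual LangGraph convention that a node returns only the keys it changed, that keys annotated with a reducer (here messages) are appended to, and that all other keys are overwritten. SketchState and apply_update are hypothetical names for this illustration; the real add_messages reducer does more (for example, it handles message IDs).

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SketchState:
    """Simplified stand-in for AgentState (not the real Pydantic model)."""
    messages: tuple = ()
    documents: Optional[str] = None
    rephrased_queries: tuple = ()


def apply_update(state, update):
    """Merge a node's partial update into the shared state."""
    changes = dict(update)
    if "messages" in changes:
        # Reducer-style key: new messages are appended, not overwritten
        changes["messages"] = state.messages + tuple(changes["messages"])
    # Every other key simply replaces the previous value
    return replace(state, **changes)


state = SketchState(messages=("user: How do I reset my password?",))
state = apply_update(state, {"messages": ["ai: calling the search tool"]})
state = apply_update(state, {"documents": "Reset instructions ..."})
state = apply_update(state, {"rephrased_queries": ("password reset procedure",)})
```

After these three updates the state holds both messages, the latest documents, and one recorded rephrasing attempt, which is the information later routing decisions rely on.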

Step 5: Configuring production-ready AI models

AWS Bedrock provides enterprise-grade AI models with built-in security and compliance. Let's set up our AI infrastructure in utils/aws_bedrock.py:

from langchain_aws import ChatBedrockConverse, BedrockEmbeddings, BedrockRerank
from langchain_core.rate_limiters import InMemoryRateLimiter

# Configuration for EU region (adjust based on your needs)
aws_region = "eu-central-1"
model_id_chat = "eu.anthropic.claude-sonnet-4-20250514-v1:0"

# Claude 4 Sonnet with rate limiting for production use
chat_claude_4_sonnet = ChatBedrockConverse(
    model=model_id_chat,
    region_name=aws_region,
    temperature=0,  # Deterministic outputs
    max_tokens=4096,
    rate_limiter=InMemoryRateLimiter(
        requests_per_second=5,
        max_bucket_size=2  # Caps the size of a request burst
    ),
)

# Embedding model for semantic search
embeddings = BedrockEmbeddings(
    model_id="cohere.embed-multilingual-v3",
    region_name=aws_region,
)

# Reranker improves retrieval quality
compressor = BedrockRerank(
    model_arn="arn:aws:bedrock:eu-central-1::foundation-model/cohere.rerank-v3-5:0",
    region_name=aws_region,
    top_n=10,
)

Key design decisions explained:

Temperature=0: Makes responses as consistent and repeatable as possible
Rate limiting: Throttles requests on the client side to help avoid API quota errors in production
Reranking: Reorders the 10 retrieved documents by relevance, which can noticeably improve answer quality (top_n=10 keeps all of them; lower it to also filter)
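To see what requests_per_second and max_bucket_size mean in practice, here is a minimal token-bucket sketch. It is a simplified model of the idea behind InMemoryRateLimiter, not its actual implementation: the TokenBucket class and try_acquire method are hypothetical names, the clock is passed in explicitly so the behaviour is easy to follow, and the real limiter waits for a token instead of returning False.

```python
class TokenBucket:
    """Simplified token bucket: tokens refill at a fixed rate up to a cap."""

    def __init__(self, requests_per_second, max_bucket_size):
        self.rate = requests_per_second
        self.capacity = max_bucket_size
        self.tokens = 0.0
        self.last = 0.0

    def try_acquire(self, now):
        """Return True if a request may be sent at time `now` (in seconds)."""
        elapsed = now - self.last
        self.last = now
        # Refill, but never beyond the bucket capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(requests_per_second=5, max_bucket_size=2)

# After a long idle period only 2 tokens are stored, so a burst of 3 is cut short
burst = [bucket.try_acquire(10.0) for _ in range(3)]

# A quarter of a second later, 1.25 tokens have been refilled
later = bucket.try_acquire(10.25)
```

With these settings the sustained rate is five requests per second, while the bucket size keeps a sudden burst from sending more than two requests at once.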

Step 6: Building an intelligent document processing pipeline

The vector store is where we'll store and retrieve our knowledge base. This implementation includes smart document chunking and compression for optimal retrieval.
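The effect of chunk size and chunk overlap can be illustrated with a fixed-window splitter. This is a deliberately simplified sketch: RecursiveCharacterTextSplitter splits on separators such as paragraphs and sentences first, whereas the hypothetical split_with_overlap function below only slides a window over the characters.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
```

Each chunk repeats the last character of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.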

Create utils/vector_store.py:

from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from utils.aws_bedrock import embeddings, compressor

# Optimized chunking strategy
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2048,      # Large enough for context
    chunk_overlap=200,    # Prevents information loss at boundaries
    length_function=len,
)

def load_pdf(file_path: str):
    """Load and intelligently chunk PDF documents"""
    loader = PyPDFLoader(file_path)
    pages = list(loader.lazy_load())

    # Convert pages to optimally-sized chunks
    documents = text_splitter.create_documents(
        [page.page_content for page in pages]
    )
    return documents

def load_vector_store(file_path: str):
    """Create or load existing vector database with compression"""
    vector_store = Chroma(
        embedding_function=embeddings,
        persist_directory=".chroma_db",  # Persistent storage
        collection_name="pdf_documents",
    )

    # Only process PDF if database is empty (efficiency optimization)
    if vector_store._collection.count() == 0:
        documents = load_pdf(file_path)
        vector_store.add_documents(documents)

    # Return compressed retriever for better results
    return ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=vector_store.as_retriever(search_kwargs={"k": 10}),
    )

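Why compression matters: The contextual compression retriever first fetches candidate documents with a vector similarity search (k=10), then passes them to the reranking model, which orders them by relevance and keeps the top_n best. With the settings above (k=10, top_n=10), all ten candidates are kept and only reordered; set top_n below k to also filter out the weakest matches.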
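The two-stage idea can be sketched without any external services. In the example below, both scoring functions are toy stand-ins chosen for illustration (word overlap for the vector search, exact-phrase matching for the reranking model), and retrieve_then_rerank is a hypothetical helper, not part of LangChain.

```python
def overlap_score(query, text):
    """Toy first-stage score: number of words shared with the query."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def phrase_score(query, text):
    """Toy second-stage score: rewards an exact phrase match."""
    return 1.0 if query.lower() in text.lower() else 0.0


def retrieve_then_rerank(query, corpus, k, top_n):
    """Fetch k candidates cheaply, then keep the top_n after reranking."""
    candidates = sorted(
        corpus, key=lambda doc: overlap_score(query, doc), reverse=True
    )[:k]
    reranked = sorted(
        candidates, key=lambda doc: phrase_score(query, doc), reverse=True
    )
    return reranked[:top_n]


corpus = [
    "timeout and lambda are words you configure in many unrelated tools",
    "to configure lambda timeout open the function configuration page",
    "billing alerts have their own settings",
    "how to bake bread",
]

results = retrieve_then_rerank("configure lambda timeout", corpus, k=3, top_n=2)
```

The first stage cannot tell the first two documents apart because both share three words with the query; the second stage moves the document that contains the exact phrase to the top, and top_n=2 drops the weakest candidate.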

Step 7: Creating the brain of the system: supervisor agent

The supervisor is the most critical component. It decides whether a query needs document retrieval or can be answered directly. This intelligence prevents unnecessary API calls and improves response times.

Create nodes/supervise.py:

from langchain_core.prompts import ChatPromptTemplate
from models.state import AgentState
from nodes.retrieve_documents import retriever_tool
from utils.aws_bedrock import chat_claude_4_sonnet

# The supervisor's decision-making prompt
prompt_template = ChatPromptTemplate([
    ("system", """
You are an intelligent routing agent with two options:

1. **Use the search tool** for queries requiring specific technical details:
   - Step-by-step procedures
   - Specific product information
   - Domain-specific knowledge

2. **Respond directly** for general questions:
   - Basic definitions
   - Simple explanations
   - Common knowledge topics
   """),
    ("human", "QUESTION: {user_query}")
])

def _supervise(state: AgentState) -> dict:
    """Make intelligent routing decisions"""
    # Extract current query (original or rephrased)
    query = state.rephrased_queries[-1] if state.rephrased_queries else state.messages[-1].content

    # Create tool-enabled model
    tool_model = chat_claude_4_sonnet.bind_tools([retriever_tool])

    response = (prompt_template | tool_model).invoke({"user_query": query})

    return {"original_user_query": query, "messages": [response]}

The intelligence behind routing: The supervisor uses function calling to decide whether to invoke the retrieval tool. LangGraph's tools_condition will automatically route to the appropriate next step based on whether tools were called.
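The routing rule itself is small. The sketch below mimics what tools_condition checks, using a hypothetical SketchAIMessage dataclass in place of LangChain's AIMessage: if the last message contains tool calls, the flow goes to the tool node, otherwise it ends. The return values "tools" and "__end__" match the keys used when wiring the conditional edge (END is the string "__end__").

```python
from dataclasses import dataclass, field


@dataclass
class SketchAIMessage:
    """Simplified stand-in for an AI message (not LangChain's AIMessage)."""
    content: str = ""
    tool_calls: list = field(default_factory=list)


def route_after_supervisor(messages):
    """Go to the tool node if the model requested a tool, otherwise finish."""
    last_message = messages[-1]
    return "tools" if last_message.tool_calls else "__end__"


# The model answered directly: no tool call, so the graph ends
direct = route_after_supervisor(
    [SketchAIMessage(content="Python is a programming language.")]
)

# The model asked for the search tool: continue to retrieval
search = route_after_supervisor(
    [SketchAIMessage(tool_calls=[{"name": "knowledge_base_search"}])]
)
```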

Step 8: Building the knowledge retrieval system

Our retrieval system goes beyond simple keyword matching. It uses semantic search combined with reranking to find the most relevant information.

Create nodes/retrieve_documents.py:

from langchain.tools.retriever import create_retriever_tool
from langgraph.prebuilt import ToolNode
from utils.vector_store import load_vector_store

# Load knowledge base (replace with your document path).
# load_vector_store returns a reranking retriever, not the raw vector store.
retriever = load_vector_store("resources/your_document.pdf")

# Create a tool that the supervisor can use.
# Describe the available knowledge in the last argument: the supervisor reads
# this description when deciding whether retrieval is necessary.
retriever_tool = create_retriever_tool(
    retriever,
    "knowledge_base_search",
    "Search comprehensive knowledge base for detailed technical information",
)

# ToolNode automatically handles tool execution
_retrieve_documents = ToolNode([retriever_tool])

Why this approach works: By wrapping retrieval as a tool, we let the supervisor model decide when to use it. The ToolNode handles all the complexity of tool execution and result formatting.

Step 9: Implementing quality control - document relevance assessment

Not all retrieved documents are useful. Our grading system ensures we only proceed with relevant information, which reduces the risk of hallucinations and improves answer quality.

Create nodes/grade_documents.py:

from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from models.state import AgentState
from utils.aws_bedrock import chat_claude_4_sonnet

class GradingResult(BaseModel):
    documents_relevant: bool = Field(
        description="Are the documents relevant to answering the question?"
    )

grade_prompt = ChatPromptTemplate([
    ("system", """
Evaluate document relevance using these criteria:

RELEVANT if documents contain:
- Direct answers or related information
- Key concepts mentioned in the question
- Useful background context

NOT RELEVANT only if:
- Completely unrelated topic
- No useful information for the question
"""),
    ("human", "QUESTION: {query}\nDOCUMENTS: {documents}")
])

def _grade_documents(state: AgentState):
    """Assess whether retrieved documents can help answer the question"""
    model = chat_claude_4_sonnet.with_structured_output(GradingResult)

    result = (grade_prompt | model).invoke({
        "query": state.original_user_query,
        "documents": state.messages[-1].content
    })

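    # Keep the documents only if they were graded relevant; clearing them
    # lets decide_to_generate route to query rephrasing instead
    return {"documents": state.messages[-1].content if result.documents_relevant else None}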

The grading philosophy: We err on the side of inclusion. It's better to have potentially useful documents than to discard information that might be helpful. The structured output ensures consistent evaluation.

Step 10: Intelligent answer generation

The generation node creates answers that are both accurate and helpful. It's designed to be conversational while staying grounded in the retrieved information.

Create nodes/generate.py:

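from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from models.state import AgentState
from utils.aws_bedrock import chat_claude_4_sonnet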

class GenerateResponse(BaseModel):
    generated_answer: str = Field(description="Generated answer based on retrieved information")

generate_prompt = ChatPromptTemplate([
    ("system", """
Create helpful, accurate answers using the retrieved information.

Guidelines:
- Be clear and actionable
- Don't mention "retrieved information" or system processes
- Say "I don't know" if information is insufficient
- Focus on being genuinely helpful
"""),
    ("human", "QUESTION: {query}\nINFORMATION: {documents}")
])

def _generate(state: AgentState):
    """Generate grounded, helpful answers"""
    model = chat_claude_4_sonnet.with_structured_output(GenerateResponse)

    response = (generate_prompt | model).invoke({
        "query": state.original_user_query,
        "documents": state.documents,
    })

    return {"generated_answer": response.generated_answer, "rephrased_queries": []}

Design principle: The generation focuses on being naturally helpful rather than sounding like a system output. This makes the interaction feel more conversational and trustworthy.

Step 11: Multi-stage answer validation - ensuring quality and truth

Our validation system implements two critical checks: ensuring answers are grounded in retrieved facts (preventing hallucinations) and confirming they actually address the user's question (ensuring usefulness).

Create nodes/grade_answer.py:

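from enum import Enum
from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from models.state import AgentState
from utils.aws_bedrock import chat_claude_4_sonnet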

class GradingOutcome(str, Enum):
    USEFUL = "useful"              # Good answer, proceed
    NOT_USEFUL = "not useful"      # Try rephrasing query
    NOT_SUPPORTED = "not supported" # Express uncertainty

class GradingResult(BaseModel):
    grading: bool = Field(description="Positive or negative assessment")

# First validation: Hallucination detection
hallucination_prompt = ChatPromptTemplate([
    ("system", """
Check if the answer is supported by the retrieved facts.

GROUNDED: Answer directly supported by documents OR appropriately says "I don't know"
NOT GROUNDED: Answer contains unsupported claims or fabricated details
"""),
    ("human", "FACTS: {documents}\nANSWER: {answer}")
])

# Second validation: Usefulness assessment
usefulness_prompt = ChatPromptTemplate([
    ("system", """
Evaluate if the answer actually helps the user.

ADDRESSES: Directly responds to the question with sufficient detail
DOES NOT ADDRESS: Off-topic, too vague, or irrelevant
"""),
    ("human", "QUESTION: {query}\nANSWER: {answer}")
])

def _grade_answer(state: AgentState):
    """Two-stage validation: truth and usefulness"""
    model = chat_claude_4_sonnet.with_structured_output(GradingResult)

    # Stage 1: Truth check
    hallucination_check = (hallucination_prompt | model).invoke({
        "documents": state.documents,
        "answer": state.generated_answer,
    })

    if hallucination_check.grading:
        # Stage 2: Usefulness check
        usefulness_check = (usefulness_prompt | model).invoke({
            "query": state.original_user_query,
            "answer": state.generated_answer,
        })
        return GradingOutcome.USEFUL if usefulness_check.grading else GradingOutcome.NOT_USEFUL
    else:
        return GradingOutcome.NOT_SUPPORTED

Why two-stage validation matters: An answer can be factually correct but completely unhelpful, or it can be helpful but contain fabricated details. Our system catches both problems and responds appropriately.
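Stripped of the LLM calls, the two-stage logic reduces to a small decision function. In this sketch the two checks are passed in as zero-argument callables that stand in for the model invocations; Outcome, grade, and make_check are hypothetical names used only for this illustration.

```python
from enum import Enum


class Outcome(str, Enum):
    USEFUL = "useful"
    NOT_USEFUL = "not useful"
    NOT_SUPPORTED = "not supported"


def grade(is_grounded, is_useful):
    """Run the grounding check first; only then run the usefulness check."""
    if not is_grounded():
        return Outcome.NOT_SUPPORTED
    return Outcome.USEFUL if is_useful() else Outcome.NOT_USEFUL


calls = []


def make_check(name, verdict):
    """Build a fake check that records when it runs and returns a fixed verdict."""
    def run():
        calls.append(name)
        return verdict
    return run


# Ungrounded answer: the usefulness check is never executed
ungrounded = grade(make_check("grounded", False), make_check("useful", True))
calls_after_ungrounded = list(calls)

# Grounded but unhelpful answer: both checks run
unhelpful = grade(make_check("grounded", True), make_check("useful", False))
```

Running the grounding check first means an unsupported answer costs one model call instead of two, and the three outcomes map directly onto the three routes out of the generate node.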

Step 12: Smart query rephrasing - learning from failure

When retrieval fails, the system doesn't give up. Instead, it analyzes why the query might not have worked and tries alternative approaches.

Create nodes/rephrase_query.py:

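from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from models.state import AgentState
from utils.aws_bedrock import chat_claude_4_sonnet

class RephraseResponse(BaseModel):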
    rephrased_user_query: str = Field(description="Improved query for better search results")

rephrase_prompt = ChatPromptTemplate([
    ("system", """
Analyze why the previous search failed and create a better query.

Improvement strategies:
- Use more specific technical terminology
- Include relevant synonyms and related concepts
- Try different angles on the same question
- Make the query more searchable while preserving intent
"""),
    ("human", """
ORIGINAL: {query}
PREVIOUS ATTEMPTS: {previous_attempts}
WHAT WE FOUND: {documents}
LAST RESPONSE: {answer}

Create a new search approach for this question.
""")
])

def _rephrase_query(state: AgentState):
    """Intelligently rephrase queries based on previous failures"""
    model = chat_claude_4_sonnet.with_structured_output(RephraseResponse)

    response = (rephrase_prompt | model).invoke({
        "query": state.original_user_query,
        "previous_attempts": state.rephrased_queries,
        "documents": state.documents,
        "answer": state.generated_answer,
    })

    # Add new attempt to history
    updated_attempts = state.rephrased_queries + [response.rephrased_user_query]
    return {"rephrased_queries": updated_attempts}

The learning mechanism: By providing context about previous attempts and results, the system can learn from its mistakes and try genuinely different approaches rather than minor variations.

Step 13: Graceful uncertainty handling

When the system lacks confidence in its answer, it's better to be honest than potentially misleading. Our uncertainty expression maintains user trust.

Create nodes/express_uncertainty.py and nodes/wrap_up.py:

# express_uncertainty.py
from langchain_core.messages import AIMessage
from models.state import AgentState

def _express_uncertainty(state: AgentState) -> dict:
    """Honestly communicate uncertainty when confidence is low"""
    uncertainty_message = """I'm not entirely confident about this answer.
The available information doesn't provide enough context for complete accuracy."""

    full_response = f"{uncertainty_message}\n\n{state.generated_answer}"
    return {"messages": [AIMessage(content=full_response)]}

# wrap_up.py
from langchain_core.messages import AIMessage
from models.state import AgentState

def _wrap_up(state: AgentState) -> dict:
    """Provide confident, final response"""
    return {"messages": [AIMessage(content=state.generated_answer)]}

Trust through transparency: Users appreciate honesty about limitations. This approach builds trust and allows users to make informed decisions about how to use the information.

Step 14: Orchestrating the complete intelligent workflow

Now we bring all components together into a coherent workflow that can make intelligent decisions at each step.

Create main.py:

from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import tools_condition

# Import all our intelligent agents
from models.state import AgentState, InputAgentState, OutputAgentState
from nodes.supervise import _supervise
# ... (other imports)

def decide_to_generate(state: AgentState) -> str:
    """Decide whether to generate or try rephrasing"""
    # Generate if we have good documents OR we've tried 3 times
    if (state.documents and len(state.documents) > 0) or len(state.rephrased_queries) >= 3:
        return "generate"
    return "rephrase_query"

def get_graph() -> StateGraph:
    """Create the intelligent workflow"""
    workflow = StateGraph(AgentState, input_schema=InputAgentState, output_schema=OutputAgentState)

    # Add all our intelligent agents
    workflow.add_node("supervise", _supervise)
    workflow.add_node("retrieve_documents", _retrieve_documents)
    workflow.add_node("grade_documents", _grade_documents)
    workflow.add_node("generate", _generate)
    workflow.add_node("rephrase_query", _rephrase_query)
    workflow.add_node("express_uncertainty", _express_uncertainty)
    workflow.add_node("wrap_up", _wrap_up)

    # Define the intelligent decision flow
    workflow.add_edge(START, "supervise")

    # Supervisor decides: search or respond directly
    workflow.add_conditional_edges(
        "supervise", tools_condition,
        {"tools": "retrieve_documents", END: END}
    )

    workflow.add_edge("retrieve_documents", "grade_documents")

    # Grade documents and decide next step
    workflow.add_conditional_edges(
        "grade_documents", decide_to_generate,
        ["generate", "rephrase_query"]
    )

    workflow.add_edge("rephrase_query", "supervise")  # Try again

    # Grade answer quality and route accordingly
    workflow.add_conditional_edges(
        "generate", _grade_answer,
        {
            GradingOutcome.NOT_SUPPORTED.value: "express_uncertainty",
            GradingOutcome.NOT_USEFUL.value: "rephrase_query",
            GradingOutcome.USEFUL.value: "wrap_up",
        }
    )

    workflow.add_edge("express_uncertainty", END)
    workflow.add_edge("wrap_up", END)

    return workflow.compile()

# Module-level graph object referenced by langgraph.json ("./main.py:_graph")
_graph = get_graph()

The intelligence in the flow: Each conditional edge represents a decision point where the system evaluates quality and chooses the best next action. This creates a self-improving loop that gets better results through iteration.
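The retry limit in decide_to_generate can be checked in isolation. The sketch below reuses the same condition with a simplified stand-in state (LoopState and MAX_REPHRASES are hypothetical names) and simulates a knowledge base that never returns relevant documents.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_REPHRASES = 3  # Same limit as in decide_to_generate


@dataclass
class LoopState:
    """Simplified stand-in for AgentState with only the fields the router reads."""
    documents: Optional[str] = None
    rephrased_queries: List[str] = field(default_factory=list)


def decide_to_generate(state):
    """Generate if documents survived grading or the retry budget is spent."""
    if state.documents or len(state.rephrased_queries) >= MAX_REPHRASES:
        return "generate"
    return "rephrase_query"


# Simulate retrieval that never finds anything relevant
state = LoopState()
trace = []
while True:
    step = decide_to_generate(state)
    trace.append(step)
    if step == "generate":
        break
    state.rephrased_queries.append(f"attempt {len(state.rephrased_queries) + 1}")
```

The trace shows three rephrasing rounds followed by generation, so the document-grading loop always terminates even when nothing relevant exists.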

Step 15: Configuration and deployment

Create your configuration files for a production-ready deployment:

.env.example:

AWS_DEFAULT_REGION="eu-central-1"
LANGSMITH_API_KEY="your_key_here"  # Optional: for monitoring
LANGSMITH_TRACING="true"
LANGSMITH_PROJECT="agentic-rag"

langgraph.json:

{
  "dependencies": ["."],
  "graphs": {"agentic-rag": "./main.py:_graph"},
  "env": ".env"
}

Step 16: Running and testing your intelligent system

Set up AWS credentials (one-time setup):

aws configure

Configure your environment:

cp .env.example .env # Edit .env with your actual values

Add your knowledge base: Place your PDF in resources/ and update the path in retrieve_documents.py

Launch with LangGraph Studio for interactive testing:

uv run langgraph dev

Understanding the intelligence: how decisions flow

Let's trace through how our system handles different types of queries to understand its intelligence:

Simple query: “What is machine learning?”

Supervisor analyzes → recognizes as general knowledge
Responds directly without retrieval → efficient and fast

Complex query: “How do I configure the AWS Lambda timeout settings?”

Supervisor analyzes → identifies need for specific technical details
Retrieves documents → searches knowledge base
Grades documents → ensures relevance before proceeding
Generates answer → creates detailed, actionable response
Grades answer → validates quality and usefulness

Challenging query: “How do I fix the quantum flux capacitor?”

Supervisor → decides to search (technical-sounding query)
Retrieval → finds no relevant documents
Document grading → recognizes irrelevance
Query rephrasing → tries alternative terms (up to 3 attempts)
Still no good results → expresses appropriate uncertainty

Key features that make this system production-ready

Intelligent resource management: The supervisor prevents unnecessary API calls by routing simple queries directly, reducing costs and improving response times.

Quality assurance pipeline: Multi-stage validation ensures answers are both factually grounded and genuinely helpful to users.

Self-improvement loop: Failed queries trigger intelligent rephrasing rather than generic “I don't know” responses.

Honest uncertainty: When confidence is low, the system explicitly communicates limitations, building user trust.

Modular architecture: Each component has a single responsibility, making the system easy to maintain, test, and extend.

Production monitoring: Integration with LangSmith provides visibility into decision-making and performance metrics.

Extending your agentic RAG system

This foundation provides numerous opportunities for enhancement:

Multi-domain intelligence: Add routing logic to handle different knowledge domains (technical docs, policy documents, etc.) with specialized processing strategies.

Advanced quality metrics: Implement more sophisticated evaluation criteria, including user feedback loops and answer confidence scoring.

Personalization layer: Track user preferences and adapt responses based on expertise level and interaction history.

Tool integration: Add computational tools, API integrations, or real-time data sources to handle broader query types.

Collaborative intelligence: Implement multi-agent collaboration where different agents specialize in different aspects of query processing.

Conclusion: the future of intelligent RAG

Building an agentic RAG system with LangGraph represents a fundamental shift from reactive to intelligent information systems. Instead of simply retrieving and generating, we've created a system that:

Thinks about the best approach for each query
Evaluates the quality of its own work
Adapts when initial approaches fail
Communicates honestly about its limitations

The combination of LangGraph's workflow orchestration, AWS Bedrock's production-ready LLMs, and thoughtful agent design creates a robust foundation for next-generation RAG applications. These systems go beyond just answering questions. They evaluate context, retry after failures, and iteratively improve their responses.

As AI systems become more prevalent in business applications, this agentic approach will become essential for creating reliable, trustworthy, and genuinely helpful AI assistants that users can depend on for critical decisions.

The future of RAG is in intelligent systems that can reason about their own processes and adapt to provide the best possible outcomes for users. This tutorial provides the foundation for building such systems.

Full implementation and resources

You can find the complete implementation of this agentic RAG system, including all the code, configuration files, and additional production features, on GitHub: https://github.com/UnikooBelgium/ws-agentic-rag
