In today’s world, we’re drowning in information. From customer support queries to academic research, the sheer volume of data makes it challenging to find relevant answers quickly. Traditional language models, while powerful, often struggle to provide accurate and contextually relevant responses, especially when the required information isn’t part of their training data.
Enter Retrieval-Augmented Generation (RAG), a cutting-edge approach that combines the strengths of language models with document retrieval techniques. RAG systems enable AI to fetch relevant information from external sources and generate precise, context-aware responses. In this blog, we’ll walk you through building a RAG system using the LangChain library, integrating OpenAI’s language models with a vector database for efficient retrieval and generation.
Imagine you’re building a customer support chatbot for an e-commerce platform. Customers frequently ask questions like:
What’s the return policy for electronics?
How long does shipping take for international orders?
While a traditional language model might generate a response based on its training data, it could miss critical details or provide outdated information. For instance, if the return policy was recently updated, the model might not know about it.
This is where RAG shines. By retrieving up-to-date information from a knowledge base (e.g., policy documents or product databases) and generating responses based on that data, RAG ensures that the chatbot provides accurate and relevant answers.
RAG is a hybrid AI framework that enhances language models by allowing them to access external documents. It combines two key components:
Retriever: Fetches relevant documents or data from a knowledge base.
Generator: Uses the retrieved information to generate coherent and contextually relevant responses.
This approach is particularly useful for tasks like question answering, personalized recommendations, and content generation, where access to specific, up-to-date information is crucial.
1. Retriever
The retriever identifies and fetches the most relevant documents or data based on the user’s query. It can use:
Sparse Retrieval Models: TF-IDF, BM25.
Dense Retrieval Models: DPR (Dense Passage Retrieval), embeddings from models like Sentence-BERT or OpenAI Embeddings (a short dense-retrieval sketch follows this list).
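To make the dense option concrete, here is a minimal sketch: embed a query and a few passages, then rank the passages by cosine similarity. The two passages and the query are invented for illustration, and the OpenAIEmbeddings usage mirrors what we configure later in this post; this snippet is not part of the pipeline we build below.

import numpy as np
from langchain_openai import OpenAIEmbeddings

# Toy passages, invented purely for this illustration
passages = [
    "Electronics can be returned within 30 days of delivery.",
    "International shipping usually takes 7 to 14 business days.",
]
query = "What's the return policy for electronics?"

embedder = OpenAIEmbeddings()
passage_vectors = np.array(embedder.embed_documents(passages))
query_vector = np.array(embedder.embed_query(query))

# Rank passages by cosine similarity to the query
scores = passage_vectors @ query_vector / (
    np.linalg.norm(passage_vectors, axis=1) * np.linalg.norm(query_vector)
)
print(passages[int(np.argmax(scores))])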
2. Knowledge Base
The knowledge base is the source of truth, containing structured or unstructured data for retrieval. Examples include:
Structured Data: Databases, knowledge graphs.
Unstructured Data: Text files, PDFs, websites, or any textual content.
3. Generator
The generator is a generative AI model (e.g., GPT-4) that uses the retrieved information to create coherent and contextually relevant responses.
Let’s dive into the practical implementation of a RAG system using LangChain.
To start, we need to set up our environment. The code uses dotenv to load environment variables, which is essential for managing sensitive information like API keys. In our case, we use the OpenAI API key.
from dotenv import load_dotenv
load_dotenv()
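Assuming your .env file defines OPENAI_API_KEY (the variable name LangChain's OpenAI integrations look for), a quick check confirms the key is available before any chain is built:

import os
from dotenv import load_dotenv

load_dotenv()
# Fail fast if the key is missing rather than hitting an authentication error deep inside the chain
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set in the environment or .env file"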
Document loader
Docling supports parsing various document formats, including PDF, DOCX, PPTX, HTML, and more, into a unified, rich representation that captures essential elements like layout, tables, and content relationships. This makes it ideal for generative AI workflows such as Retrieval-Augmented Generation (RAG).
This ensures that the retriever can effectively find the most relevant pieces of information, which the generator then uses to produce high-quality, contextually informed outputs.
from docling.document_converter import DocumentConverter
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

class DoclingPDFLoader(BaseLoader):
    def __init__(self, files: str | list[str]):
        # Accept either a single file path or a list of paths
        self.file_paths = files if isinstance(files, list) else [files]
        self.converter = DocumentConverter()

    def load(self) -> list[Document]:
        documents = []
        for source in self.file_paths:
            # Convert the file with Docling and export it as Markdown text
            doc = self.converter.convert(source).document
            text = doc.export_to_markdown()
            documents.append(Document(page_content=text))
        return documents
Chunking
Chunking is the process of dividing prompts or documents into smaller, manageable segments. These chunks can be defined based on fixed criteria, such as a specific number of characters, sentences, or paragraphs.
In RAG, each chunk is converted into an embedding vector for retrieval. Smaller and more precise chunks improve the match between the user’s query and the content, enhancing the accuracy and relevance of the retrieved information.
Recursive chunking is an advanced method that hierarchically splits text using various separators. It adapts dynamically to create chunks of consistent size or structure, ensuring optimal retrieval performance.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split each loaded document into overlapping 2000-character chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
)
chunks = text_splitter.split_documents(documents)  # `documents` is the list returned by DoclingPDFLoader.load()
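Assuming documents is the list returned by the loader above, it is worth a quick look at how many chunks the splitter produced, since chunk count and size directly affect index size and retrieval granularity:

print(len(chunks))                    # number of chunks that will be embedded and indexed
print(chunks[0].page_content[:300])   # preview the start of the first chunk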
Embeddings
After chunking the documents, the next step is embedding. In a RAG system, embedding involves converting both the user's query and the documents in the knowledge base into vectors that can be effectively compared for relevance. This process is essential for retrieving the most pertinent information in response to a user query.
Selecting the best embedding model is critical for optimal RAG performance. Regularly checking resources like the Hugging Face leaderboard for top-performing embedding models can help ensure the best results.
from langchain_openai.embeddings.base import OpenAIEmbeddings
embedding = OpenAIEmbeddings()
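As a quick illustration (the query string here is just an example), embedding a piece of text returns a plain list of floats whose length depends on the model, typically 1536 for OpenAI's default text-embedding model:

vector = embedding.embed_query("What's the return policy for electronics?")
print(len(vector))   # dimensionality of the embedding, e.g. 1536 for the default OpenAI model
print(vector[:5])    # first few components of the vector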
Index
In a RAG application, using an index like ChromaDB is crucial for efficiently retrieving relevant information from large datasets or knowledge bases. The index acts as a structured representation of the documents or data, allowing the system to quickly search and retrieve the most relevant pieces of information based on a query.
import os
from langchain_chroma.vectorstores import Chroma

vectordb = "./store"
path = ".<FILE>"

if not os.path.exists(vectordb):
    # First run: load the source file, split it, and build a persistent index
    loader = DoclingPDFLoader(files=path)
    documents = loader.load()
    chunks = text_splitter.split_documents(documents)
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embedding,
        persist_directory=vectordb,
    )
else:
    # Subsequent runs: reopen the index persisted in ./store
    vectorstore = Chroma(persist_directory=vectordb, embedding_function=embedding)
Notice that we are using a persistent client to store the index in a folder called ./store. This is useful when you want to keep the index between runs, and we will use it in the next section to query the documents.
The vector store enables vector search, which allows the system to perform similarity searches based on the embeddings. When a query is made, its embedding can be compared against those stored in the vector store to find the most relevant chunks of text.
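Before wiring up the full chain, a quick sanity check (the query below is invented) is to run a similarity search directly against the store and skim the chunks it returns:

docs = vectorstore.similarity_search("What's the return policy for electronics?", k=4)
for doc in docs:
    print(doc.page_content[:200])  # preview the first 200 characters of each retrieved chunk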
The Retrieval and Synthesis components are the heart of a RAG system. While the retriever fetches relevant documents, the synthesis system integrates this information with the user’s query to generate coherent, context-aware responses. Let’s break this down further.
Retrieval: Fetching Relevant Information
The retriever is responsible for identifying and fetching the most relevant documents or data from the knowledge base. This step is crucial because the quality of the retrieved information directly impacts the accuracy of the generated response.
In our implementation, we use a vector store (ChromaDB) to perform similarity searches. When a user submits a query, its embedding is compared against the embeddings of the documents in the vector store to find the most relevant chunks of text.
retriever = vectorstore.as_retriever()
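By default this wraps the vector store's similarity search. as_retriever also accepts search parameters, so you can, for example, cap the number of chunks returned per query (the value 3 here is an arbitrary choice):

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})  # return only the top 3 chunks per query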
Synthesis: Generating Context-Aware Responses
Once the retriever fetches the relevant documents, the synthesis system takes over. This component combines the retrieved information with the user’s query to generate a coherent and contextually relevant response.
Why Synthesis Matters
Context Integration: The synthesis system ensures that the generated response is grounded in the retrieved information, reducing the risk of hallucinations (i.e., generating fictional or unsupported facts).
Conversational Flow: It maintains the context of the conversation, especially in multi-turn dialogues, by considering the chat history.
Precision and Relevance: By refining raw information into concise and user-friendly responses, synthesis enhances the overall quality of the output.
Key Functions in Synthesis
1. create_history_aware_retriever
from langchain import hub
from langchain_openai import ChatOpenAI
from langchain.chains.history_aware_retriever import create_history_aware_retriever
llm = ChatOpenAI()  # chat model shared by the rephrasing and answer-generation steps; pick the OpenAI model you prefer
prompt = hub.pull("langchain-ai/chat-langchain-rephrase")
retriever_chain = create_history_aware_retriever(llm, retriever, prompt)
This function creates a chain that processes the conversation history to improve retrieval. If there’s no chat history, the input query is passed directly to the retriever. If chat history exists, the function uses a language model to generate a refined search query based on the context of the conversation.
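To see the rephrasing in action, the chain can be invoked on its own. With the invented exchange below, the vague follow-up "How long do I have?" is rewritten into a standalone query before documents are fetched:

from langchain_core.messages import HumanMessage, AIMessage

docs = retriever_chain.invoke({
    "chat_history": [
        HumanMessage(content="What's the return policy for electronics?"),
        AIMessage(content="Electronics can be returned within 30 days of delivery."),
    ],
    "input": "How long do I have?",
})
# docs is the list of Document chunks retrieved for the rephrased query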
2. create_stuff_documents_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
prompt1 = hub.pull("langchain-ai/retrieval-qa-chat")
document_chain = create_stuff_documents_chain(llm, prompt1)
This function constructs a chain that processes a list of retrieved documents. It formats the documents into a single prompt and passes it to the language model for response generation. This method is ideal when the combined content of the documents fits within the model’s context window.
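Invoked directly, the chain simply "stuffs" whatever documents you hand it into the prompt; the single made-up policy snippet below stands in for real retrieved chunks:

from langchain_core.documents import Document

answer = document_chain.invoke({
    "input": "What's the return policy for electronics?",
    "chat_history": [],
    "context": [Document(page_content="Electronics can be returned within 30 days of delivery.")],
})
print(answer)  # a plain string generated from the supplied context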
3. create_retrieval_chain
from langchain.chains.retrieval import create_retrieval_chain
retrieval_chain = create_retrieval_chain(retriever_chain, document_chain)
This function integrates the retriever and the document chain to create a seamless pipeline. It retrieves relevant documents and processes them with the language model to generate a final response.
Here’s how the synthesis system works in practice:
User Query: The user submits a query (e.g., “What’s the return policy for electronics?”).
Chat History: If this is part of a multi-turn conversation, the chat history is passed to the retriever chain to refine the search query.
Document Retrieval: The retriever fetches relevant documents from the knowledge base.
Response Generation: The synthesis system formats the retrieved documents into a prompt and generates a coherent response using the language model.
Output: The final response is returned to the user.
from langchain_core.messages import HumanMessage, AIMessage

CHAT_HISTORY = []

def answer_query(query: str) -> str:
    # Record the user's turn so follow-up questions can be rephrased with context
    CHAT_HISTORY.append(HumanMessage(content=query))
    ans = retrieval_chain.invoke(input={
        "chat_history": CHAT_HISTORY,
        "input": query,
    })["answer"]
    # Record the assistant's turn for the next round of the conversation
    CHAT_HISTORY.append(AIMessage(content=ans))
    return ans
The synthesis system ensures that the RAG pipeline delivers accurate, relevant, and context-aware responses. Without proper synthesis, the system might:
Generate responses that are irrelevant or incomplete.
Fail to maintain the context of multi-turn conversations.
Produce hallucinations or unsupported claims.
By integrating retrieval and generation, the synthesis system bridges the gap between raw information and user-friendly responses, making RAG systems highly effective for real-world applications.
RAG is a natural fit for use cases such as:
Question Answering Systems: Provide accurate, context-aware answers to user queries.
Personalized Recommendations: Fetch and generate tailored suggestions based on user preferences.
RAG systems are revolutionizing how we interact with AI by combining the power of language models with external knowledge retrieval. By building a RAG pipeline with LangChain, you can create applications that deliver accurate, relevant, and context-aware responses, even in information-heavy domains.
Whether you’re building a customer support chatbot, a research assistant, or a personalized recommendation engine, RAG technology can enhance the capabilities of your AI systems. With tools like LangChain and ChromaDB, the process becomes more accessible, enabling you to tackle real-world challenges effectively.