Retrieval-augmented generation (RAG) is a significant advancement in the ability of large language models (LLMs) to utilize external knowledge for more accurate and contextually rich responses.
However, in standard RAG systems, small text chunks are used to isolate precise information for retrieval.
This approach works well for focused facts but lacks the depth and cohesion needed for broader topics. The result can be fragmented or incomplete responses, which may not satisfy complex queries.
A Parent Document Retriever (PDR) addresses this limitation by structuring documents into manageable, smaller segments, known as "child chunks." These chunks are stored in a way that allows the system to compare specific parts of a document with a user’s query.
The original, larger document—referred to as the "parent"—is only retrieved when a relevant child chunk matches the query. This ensures the LLM can access the full context while maintaining efficient retrieval.
This guide will take you through a step-by-step process for implementing a PDR using Python and LangChain. We'll also cover the fundamental concepts and the benefits of PDR. Let’s get started!
Understanding Parent Document Retrieval
Retrieving smaller text chunks is useful in RAG systems, but it often risks missing the document’s broader context—especially when a user query touches on overarching themes or the interconnectedness of ideas.
For instance, consider an article discussing investment strategies. If a query about investment philosophy returns only a specific passage about bonds, the response might omit the larger context of the overall investment strategy.
In that case, the response may lack the depth required to answer broader investment-related queries.
Parent Document Retrieval (PDR) solves this by implementing a two-tier retrieval approach.
- First Tier (Retrieve Relevant Chunks): As in standard RAG, the system retrieves the chunks or passages most relevant to the user's query.
- Second Tier (Retrieve Parent Documents): Once relevant chunks are identified, the system retrieves their corresponding parent documents, providing the LLM with broader context.
This dual-layered process lets the system respond with both precision and depth, focusing on specific information while still capturing the document's broader message.
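To make the two tiers concrete, here is a minimal conceptual sketch in Python. The names embed, child_index, and parent_store are hypothetical placeholders for illustration only; the real LangChain implementation follows later in this guide.
```
# Conceptual sketch only: embed, child_index, and parent_store are
# hypothetical placeholders, not real LangChain APIs.
def retrieve_with_parents(query, embed, child_index, parent_store, k=4):
    # Tier 1: search the small child chunks for precise matches
    matches = child_index.search(embed(query), k=k)  # [(chunk_text, parent_id), ...]
    # Tier 2: swap each matched chunk for its full parent document
    parent_ids = {parent_id for _, parent_id in matches}  # dedupe shared parents
    return [parent_store[pid] for pid in parent_ids]
```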
PDR offers several advantages, including:
- Improved Coherence: By considering the full document, the LLM can generate more coherent and well-structured responses.
- Enhanced Understanding: The broader context allows the LLM to understand the information better and generate more insightful responses.
- Reduced Ambiguity: PDR helps resolve ambiguities in shorter passages by considering the full document.
- Better for Long-Form Content: PDR excels at summarizing or generating long-form content, where context continuity is crucial.
Implementing Parent Document Retrieval using LangChain
Let’s walk through the implementation of a Parent Document Retriever (PDR) step-by-step.
Step 1: Set Up the Environment
→ Install the Necessary Libraries
To set up a PDR with LangChain, we’ll need to install a few libraries. Alongside the retriever stack, the langchain and openai packages back the imports below, and beautifulsoup4 supports the web loader used later:
```
pip install langchain langchain-community chromadb tiktoken openai beautifulsoup4
```
→ Import Required Modules
Now, import the necessary modules for loading documents, generating text embeddings using OpenAI's API, splitting the documents, and creating a vector store to index and search documents.
```
from langchain.document_loaders import DirectoryLoader, WebBaseLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI
```
→ Set Up OpenAI API Key
Since we’re using OpenAI’s embeddings, set up the API key as follows:
- Obtain an API key from the OpenAI website.
- Set the OPENAI_API_KEY environment variable:
```
import os
from getpass import getpass
# Initialize OpenAI API
os.environ["OPENAI_API_KEY"] = getpass('Enter your OpenAI API Key: ')
```
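If you re-run the notebook often, a small variation of the same setup avoids re-prompting when the key is already present in the environment:
```
import os
from getpass import getpass

# Prompt for the key only if it isn't already set
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass('Enter your OpenAI API Key: ')
```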
Step 2: Initialize the ChatOpenAI Model
Let's initialize an instance of the ChatOpenAI class, which will be used to interact with OpenAI's GPT-4 model.
```
from langchain.chat_models import ChatOpenAI
chat = ChatOpenAI(temperature=0, model='gpt-4')
```
The temperature parameter is set to 0 to ensure deterministic outputs.
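Optionally, a quick one-line call confirms the model and API key are wired up before moving on:
```
# Sanity check: should print a short completion if the key and model work
print(chat.predict("Reply with one word: ready"))
```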
Step 3: Prepare the Documents
Next, we’ll load our documents for retrieval.
```
# Loading a single website
loader = WebBaseLoader("http://www.paulgraham.com/superlinear.html")
paul_graham_essay = loader.load()
print (f"You have {len(paul_graham_essay)} document with length {len(paul_graham_essay[0].page_content)} characters or roughly {len(paul_graham_essay[0].page_content) / 4} tokens")
```
WebBaseLoader loads the content of a web page into a Document object.
Output
The code then prints the number of documents loaded (which is 1 in this case) and the length of the document's content in characters and tokens.
```
You have 1 document with length 24854 characters or roughly 6213.5 tokens
```
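If your source material lives on disk rather than on the web, the DirectoryLoader imported earlier is a drop-in alternative. Here's a minimal sketch assuming a hypothetical ./docs folder of plain-text files:
```
from langchain.document_loaders import DirectoryLoader, TextLoader

# Load every .txt file under ./docs (a hypothetical local folder)
loader = DirectoryLoader("./docs", glob="**/*.txt", loader_cls=TextLoader)
local_docs = loader.load()
print(f"Loaded {len(local_docs)} local documents")
```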
Step 4: Define the Splitters
To create both child and parent documents, define the splitters:
```
# Define splitter for large (parent) chunks: ~1000 tokens at ~4 chars/token
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000 * 4, chunk_overlap=0)
# Define splitter for smaller (child) chunks: ~125 tokens
child_splitter = RecursiveCharacterTextSplitter(chunk_size=125 * 4)
```
Here we define two instances of RecursiveCharacterTextSplitter:
- parent_splitter: Splits the document into larger chunks (parent documents) of 4,000 characters (roughly 1,000 tokens at ~4 characters per token) with no overlap.
- child_splitter: Splits the document into smaller chunks (child documents) of 500 characters (roughly 125 tokens).
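As an optional sanity check before wiring up the retriever, you can split the essay directly and preview the parent count; with a ~25,000-character document and 4,000-character chunks, expect roughly seven or eight:
```
# Preview how many parent chunks the splitter will produce
parent_docs = parent_splitter.split_documents(paul_graham_essay)
print(f"The essay splits into {len(parent_docs)} parent chunks")
```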
Step 5: Create the Vector Store and Retriever
Now, set up a vector store to retrieve smaller chunks and store the parent documents separately for later retrieval.
```
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="parent_document_splits",
    embedding_function=OpenAIEmbeddings()
)

# The storage layer for the parent documents
docstore = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)

# Add the document to the retriever
retriever.add_documents(paul_graham_essay)
```
Here’s a breakdown of each component:
- vectorstore: A Chroma vector store is created to store and index the child document chunks along with their embeddings.
- docstore: An InMemoryStore is created to store the parent documents.
- retriever: A ParentDocumentRetriever instance is initialized with the vectorstore, docstore, child_splitter, and parent_splitter.
- retriever.add_documents(): Adds the loaded document to the retriever, which splits it into parent and child chunks, indexes the children, and links each child to its parent (verified in the quick check below).
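Under the hood, each child chunk's metadata carries a doc_id that points back to its parent's key in the docstore (you can see this doc_id in the retrieval output in the next step). A quick way to verify the linkage, using Chroma's get and the store's yield_keys:
```
# Fetch one child chunk's metadata and confirm its doc_id is a valid parent key
sample = retriever.vectorstore.get(limit=1, include=["metadatas"])
parent_key = sample["metadatas"][0]["doc_id"]
print(parent_key in set(retriever.docstore.yield_keys()))  # expect: True
```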
Step 6: Test the Retriever
→ Count Parent and Child Documents
Check the number of parent and child documents created:
```
num_parent_docs = len(retriever.docstore.store)
num_child_docs = len(set(retriever.vectorstore.get()['documents']))
print(f"You have {num_parent_docs} parent docs and {num_child_docs} child docs")
```
Output
It prints the number of parent and child documents created.
```
You have 8 parent docs and 82 child docs
```
→ Retrieve Child Documents
Retrieve child documents based on a query:
```
child_docs = vectorstore.similarity_search("what is some investing advice?")
print (f"{len(child_docs)} child docs were found")
child_docs[0].page_content
```
Output
By default, similarity_search returns the four most relevant child documents (k=4); the first match for the query "what is some investing advice?" is displayed below.
```
4 child docs were found
Document(metadata={'doc_id': '251f7a70-04d9-4e76-b003-e33dee0377f6', 'language': 'No language found.', 'source': 'http://www.paulgraham.com/superlinear.html', 'title': 'Superlinear Returns'}, page_content="as true in investing, for example. It's only useful to believe that\na company will do well if most other investors don't; if everyone\nelse thinks the company will do well, then its stock price will\nalready reflect that, and there's no room to make money.What else can we learn from these fields? In all of them you have\nto put in the initial effort. Superlinear returns seem small at\nfirst. At this rate, you find yourself thinking, I'll never get")
```
→ Retrieve Parent Documents
Retrieve parent documents to provide a broader context.
```
retrieved_docs = retriever.invoke("what is some investing advice?")
print(retrieved_docs[0].page_content)
```
Output
This prints the content of the first parent document relevant to the investing-advice query.
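To see the context expansion concretely, compare the size of a matched child chunk with that of the parent document retrieved for the same query:
```
# The parent (~4,000 chars) is several times larger than the child chunk (~500 chars)
print(f"Child chunk: {len(child_docs[0].page_content)} characters")
print(f"Parent doc: {len(retrieved_docs[0].page_content)} characters")
```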
Step 7: Define the Prompt and Generate a Response
→ Define Prompt Template
Create a prompt template to use the retrieved context for answering questions:
```
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
```
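Before calling the model, you can preview the assembled prompt with placeholder values to confirm the template renders as intended:
```
# Render the template with placeholder values
print(PROMPT.format(context="<retrieved documents go here>", question="<user question goes here>"))
```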
→ Generate Response
Use the prompt template and retrieved documents to generate a response:
```
question = "what is some investing advice?"
# Join the retrieved parent documents into a single context string
context = "\n\n".join(doc.page_content for doc in retrieved_docs)
response = chat.predict(text=PROMPT.format_prompt(
    context=context,
    question=question
).text)
print(response)
```
Output
The language model uses the retrieved context (retrieved_docs) to answer the question about investing advice, generating a response grounded in the relevant information.
```
The document suggests that in investing, it's only useful to believe that a company will do well if most other investors don't; if everyone else thinks the company will do well, then its stock price will already reflect that, and there's no room to make money. It also suggests that it's important to put in the initial effort, as superlinear returns seem small at first. It's worth taking extraordinary measures to get there.
```
Where is Parent Document Retrieval Applied?
PDR is applicable in various domains where context-rich responses are important. Some common applications include:
- Customer Support: Enhancing automated systems to deliver comprehensive responses based on in-depth product documentation.
- Legal and Compliance: Assisting with the retrieval of relevant legal documents or regulations that require thorough understanding.
- Research and Academia: Facilitating access to complete research papers or articles when specific sections are referenced.
- Content Generation: Improving the quality of content produced by language models by providing them with extensive background information.
Conclusion
The Parent Document Retriever (PDR) considerably enhances RAG systems by enabling responses that are both accurate and rich in context.
Using a two-tiered retrieval process incorporating both “child” and “parent” document layers, PDR equips systems to capture detailed and broader document contexts, providing more comprehensive answers to complex queries.
This guide outlined a step-by-step approach to implementing PDR using Python and LangChain.
By following these steps and understanding the core concepts, you can leverage PDR to elevate the performance of your RAG applications, ultimately improving response quality and relevance to meet the needs of sophisticated AI systems.