Original Paper: https://arxiv.org/abs/2402.09760
Semantic Chunking for Document Processing
Overview
Semantic chunking is an advanced technique for dividing text into coherent segments based on the underlying meaning and context, rather than relying on fixed character or word counts. Unlike traditional chunking methods, semantic chunking creates more meaningful and context-aware text segments, which is particularly important when using Large Language Models (LLMs) for tasks like Retrieval-Augmented Generation (RAG).
The key steps in semantic chunking are:
- Break the document into sentences
- Create sentence groups by including a set number of sentences before and after each sentence. These groups are "anchored" by the sentence used to create them.
- Generate embeddings for each sentence group and associate them with the anchor sentence.
- Compare the embedding distances between consecutive groups. A low distance indicates the topic or theme is the same, while a higher distance suggests a change in topic, effectively delineating one chunk from the next.
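The first two steps above can be sketched in a few lines of Python. The naive sentence splitter and the window size are illustrative assumptions, not part of the original method; a real pipeline would use a proper sentence tokenizer (e.g. NLTK or spaCy).

```python
import re

def split_sentences(text):
    # Naive sentence splitter for illustration only: split on
    # whitespace that follows sentence-ending punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def build_sentence_groups(sentences, buffer_size=1):
    # Each group is "anchored" by one sentence and padded with
    # `buffer_size` neighbouring sentences on either side.
    groups = []
    for i, anchor in enumerate(sentences):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        groups.append({"anchor": anchor, "text": " ".join(sentences[start:end])})
    return groups

sentences = split_sentences("Cats purr. Dogs bark. Stocks fell today.")
groups = build_sentence_groups(sentences, buffer_size=1)
# groups[0]["text"] == "Cats purr. Dogs bark."
```

Each group's combined text (not just the anchor sentence) is what gets embedded in the next step, which smooths out sentence-level noise in the embeddings.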
Advantages
- Maintains semantic consistency within each chunk, preserving the inherent meaning of the text.
- Useful for tasks like text summarization, sentiment analysis, and document classification that require understanding the context and flow of information.
- Avoids issues with fixed-sized chunking, such as severing words, sentences, or paragraphs and disrupting the flow of information[1][2].
Limitations
- Requires more computational resources compared to fixed-sized chunking, as it relies on complex semantic algorithms to determine chunk boundaries[2].
- May be less effective for text without a strong semantic structure, such as logs or raw data[2].
- Determining the optimal number of sentences to include in each group requires experimentation and may vary based on the specific use case and text characteristics.
Implementation
LangChain provides an implementation of semantic chunking based on the work of Greg Kamradt[3]. The key steps are:
- Break the document into sentences using a sentence tokenizer.
- Create sentence groups by including a specified number of sentences before and after each sentence.
- Generate embeddings for each sentence group using a text embedding model.
- Compare the cosine similarity between each group's embedding and the previous group's embedding. If the similarity falls below a certain threshold, mark that point as a chunk boundary.
This approach leverages text embeddings to capture the semantic meaning of each sentence group and uses the change in meaning between groups to determine chunk boundaries.
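The boundary-detection step can be sketched with plain cosine similarity. The toy embeddings and the fixed `threshold` value below are assumptions for illustration; in practice the embeddings would come from a text embedding model, and the breakpoint would typically be tuned or derived from the distribution of observed distances.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def find_chunk_boundaries(embeddings, threshold=0.5):
    # A chunk boundary falls between groups i-1 and i whenever the
    # similarity between their embeddings drops below the threshold.
    boundaries = []
    for i in range(1, len(embeddings)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold:
            boundaries.append(i)
    return boundaries

# Toy 2-D embeddings: two similar "pet" groups, then a "finance" group.
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
find_chunk_boundaries(embeddings, threshold=0.5)  # boundary at index 2
```

LangChain's implementation follows this same idea but, rather than a single fixed threshold, supports breakpoint types such as a percentile of the observed distances, which adapts the cut-off to the document at hand.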
Use Cases
Semantic chunking is particularly useful in applications that require understanding the context and flow of information in text, such as:
- Document summarization: Chunking the document into semantically coherent segments can help identify the most important information and generate concise summaries.
- Sentiment analysis: By maintaining the context within each chunk, semantic chunking can provide more accurate sentiment scores for different parts of the document.
- Question answering: Retrieving relevant chunks of text based on the user's query and using them to generate answers can lead to more accurate and contextual responses.
In the context of RAG, semantic chunking can help optimize the relevance and quality of the retrieved text chunks, leading to better performance of the LLM-based system[4].
In summary, semantic chunking is a powerful technique for dividing text into meaningful segments based on the underlying context and meaning. While it requires more computational resources compared to fixed-sized chunking, it offers significant advantages in tasks that rely on understanding the flow and context of information in text.
This is an AI-generated summary by Athina AI.
1. https://blog.gopenai.com/mastering-rag-chunking-techniques-for-enhanced-document-processing-8d5fd88f6b72
2. https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-retrieval-augmented-generation?view=doc-intel-4.0.0
3. https://www.pinecone.io/learn/chunking-strategies/
4. https://kdb.ai/learning-hub/articles/in-depth-review-of-chunking-methods/
5. https://bitpeak.com/chunking-methods-in-rag-overview-of-available-solutions/
6. https://www.researchgate.net/publication/234810279_Semantic-Chunks_a_middleware_for_ubiquitous_cooperative_work
7. https://www.linkedin.com/pulse/chunking-strategies-ai-data-kash-kashyap-0lghe