Chunk Strategies
Chunking is a crucial step in preparing data for vector indexes and large language models (LLMs). It involves breaking down large texts into smaller, more manageable pieces. The primary purpose of chunking is to improve the retrieval of relevant information when querying the data, which in turn enhances the quality of input provided to LLMs.
Simple Chunking
By Section
This method attempts to split the document into chunks based on its inherent structure and format. For example:
- In markdown files, it identifies headings and paragraphs.
- In HTML documents, it can use tags like <div>, <p>, or <section>.
- For PDFs, it might use page breaks or formatting cues.
This approach preserves the logical structure of the document, maintaining context and readability. While it may result in chunks of varying sizes, it's particularly useful for documents with clear hierarchical organization.
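As a rough illustration of section-based splitting, the sketch below starts a new chunk at every markdown heading. It is not BotDojo's implementation: the function name `split_markdown_by_heading` is hypothetical, and it assumes ATX-style headings (`#` through `######`) and does not special-case `#` lines inside fenced code blocks.

```python
import re


def split_markdown_by_heading(text: str) -> list[str]:
    """Split markdown text into chunks, starting a new chunk at each heading."""
    chunks: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        # An ATX heading is 1-6 '#' characters followed by whitespace.
        if re.match(r"#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that are empty after stripping.
    return [c for c in chunks if c]


doc = "# Intro\nHello.\n\n## Setup\nInstall it.\n\n## Usage\nRun it."
sections = split_markdown_by_heading(doc)
for section in sections:
    print(repr(section))
```

Each heading stays attached to the text beneath it, which is why the resulting chunks vary in size with the length of each section.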
By Token
This strategy divides the text into chunks containing a specified number of tokens. Tokens are the basic units that language models process; a token can be a word, part of a word, or a punctuation mark. For example, the sentence "I love NLP!" might be tokenized as ["I", "love", "NLP", "!"].
Token-based chunking is generally faster than custom chunking and can maintain better semantic coherence than character-based chunking.
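The idea can be sketched with a deliberately simple tokenizer. The regex below is only a stand-in that separates words from punctuation; tokenizers used by real language models typically split text into subwords and will produce different boundaries. The names `tokenize` and `chunk_by_token` are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    # Simplified stand-in tokenizer: runs of word characters and single
    # punctuation marks each become one token.
    return re.findall(r"\w+|[^\w\s]", text)


def chunk_by_token(text: str, chunk_size: int) -> list[list[str]]:
    """Group the tokens of `text` into consecutive chunks of `chunk_size`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = tokenize(text)
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


example_tokens = tokenize("I love NLP!")
example_chunks = chunk_by_token("I love NLP! It is fun.", chunk_size=4)
print(example_tokens)
print(example_chunks)
```

The last chunk may be shorter than `chunk_size` when the token count is not an exact multiple of it.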
Simple Chunking Options
- Chunk Size: Determine the size of the text chunks that will be embedded. Larger chunk sizes (e.g., 512 tokens) provide more context but may result in less granular embeddings. Smaller chunk sizes (e.g., 128 tokens) allow for more precise matching but may lose some contextual information. Experiment with different chunk sizes to find the optimal balance for your specific application.
- Chunk Overlap: Specify the amount of overlap between adjacent text chunks. Overlapping chunks help maintain continuity and prevent important information from being split across chunk boundaries. A common overlap setting is around 5-15% of the chunk size. For example, with a chunk size of 512 tokens, you might use an overlap of 50-75 tokens.
- Chunk Language: (Optional) Forces language-specific preprocessing steps on all text regardless of the document type. By default, the chunking language is inferred from the file extension or content type.
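How chunk size and overlap interact can be sketched as a sliding window over a token list. This is a minimal illustration rather than the product's code: the function name `chunk_with_overlap` and its parameters are hypothetical, and the tokens here are produced by a plain whitespace split.

```python
def chunk_with_overlap(
    tokens: list[str], chunk_size: int, overlap: int
) -> list[list[str]]:
    """Slide a window of `chunk_size` tokens, repeating `overlap` tokens
    from the end of each chunk at the start of the next one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the window has reached the end of the token list.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = "t0 t1 t2 t3 t4 t5 t6 t7 t8 t9".split()
windows = chunk_with_overlap(tokens, chunk_size=4, overlap=1)
for window in windows:
    print(window)
```

With a chunk size of 4 and an overlap of 1, the window advances 3 tokens at a time, so the last token of each chunk reappears as the first token of the next. A larger overlap reduces the chance of splitting a fact across a boundary, at the cost of more chunks to embed and store.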
Custom Chunking
This option allows you to send the extracted text of each document to a flow for custom chunking. While it may be slower to index than the other methods, it offers the most flexibility and control over the chunking process.
BotDojo offers a selection of prebuilt chunkers that you can browse and use directly. These chunkers are designed to handle various common scenarios and document types, providing a quick and efficient way to implement chunking without having to create a custom solution from scratch. You can explore these prebuilt chunkers at BotDojo Custom Chunkers.
Summary Chunker Example
The Summary Chunker is an enhanced chunker that improves the quality and context of chunks within an index through a two-step process:
- It summarizes each document, capturing the main ideas and key points.
- It then rewrites each chunk within the document, incorporating relevant context from the summary.
This approach creates more informative and context-aware chunks by:
- Improving coherence across chunks
- Maintaining original information while adding contextual details
- Enhancing search results and question-answering capabilities
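The two-step process described above can be sketched as follows. This is an assumption-laden outline, not the actual Summary Chunker: `summary_chunker`, `first_sentence_summary`, and `prepend_context` are hypothetical names, and the two placeholder functions stand in for what would normally be LLM calls.

```python
from typing import Callable


def summary_chunker(
    chunks: list[str],
    summarize: Callable[[str], str],
    rewrite: Callable[[str, str], str],
) -> list[str]:
    """Summarize the document once, then rewrite each chunk with that summary."""
    # Step 1: summarize the whole document.
    summary = summarize("\n".join(chunks))
    # Step 2: rewrite every chunk, passing the summary in as context.
    return [rewrite(chunk, summary) for chunk in chunks]


# Placeholder implementations for illustration only; in practice both
# steps would be delegated to a language model.
def first_sentence_summary(document: str) -> str:
    return document.split(".")[0].strip() + "."


def prepend_context(chunk: str, summary: str) -> str:
    return f"[Document summary: {summary}]\n{chunk}"


# Made-up sample text.
chunks = [
    "Acme Router X1 setup guide. Plug in the power cable.",
    "Hold the reset button for ten seconds.",
]
enriched = summary_chunker(chunks, first_sentence_summary, prepend_context)
print(enriched[1])
```

In this sample, the second chunk on its own does not say which device it refers to; after enrichment it carries the document-level context, which is the effect the Summary Chunker aims for.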
Choosing the Right Chunking Strategy
When deciding on a chunking strategy, consider the following factors:
- Document Structure: For well-structured documents, section-based chunking may be more appropriate.
- Document Length: Longer documents may benefit from custom chunking to maintain context.
- Query Complexity: If your application needs to answer complex queries, consider strategies that preserve more context, like summary chunking.
- Processing Speed: Simple chunking methods are faster and may be sufficient for many use cases.
- Index Size: For large indexes, invest time in custom chunking to improve retrieval quality.
Evaluating Chunking Performance
To optimize your chunking strategy, consider the following evaluation metrics:
- Context Recall: Measure how often the system retrieves relevant chunks for a given query.
- Context Preservation: Assess whether important context is maintained within chunks.
- Query Response Time: Evaluate the speed of information retrieval and LLM response generation.
- Chunk Coherence: Analyze how well each chunk stands alone as a meaningful unit of information.
To compare different chunking strategies effectively, combine Batches and Evaluations.
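As one possible way to quantify Context Recall, the sketch below averages, over a set of test queries, the fraction of known-relevant chunks that the retriever returned. This is only one reasonable definition, and it assumes you have a labeled set of relevant chunk IDs per query; the function name and data layout are hypothetical and independent of any BotDojo feature.

```python
def context_recall(
    retrieved: dict[str, list[str]], relevant: dict[str, set[str]]
) -> float:
    """Mean, over queries, of the share of relevant chunk IDs that were retrieved."""
    scores: list[float] = []
    for query, expected in relevant.items():
        if not expected:
            # Skip queries with no labeled relevant chunks.
            continue
        found = expected & set(retrieved.get(query, []))
        scores.append(len(found) / len(expected))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labels and retrieval results for two test queries.
relevant = {"q1": {"a", "b"}, "q2": {"c"}}
retrieved = {"q1": ["a", "x"], "q2": ["c", "y"]}
score = context_recall(retrieved, relevant)
print(score)
```

Running the same labeled queries against indexes built with different chunk sizes, overlaps, or chunkers gives a like-for-like number to compare.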