Tutorial: Chunking
Dive deep into each of the chunking strategies supported by Aryn DocParse
In this tutorial we will walk through each of the chunking strategies supported by DocParse and provide examples of how to use them.
Context-Rich
The context-rich chunking strategy combines adjacent elements and adds the last seen section header or title to each output chunk. For example, take the fifth page of the following document:
The bounding boxes shown above are the result of calling DocParse without any chunking options specified. Suppose that in your question-answering RAG application you want to retrieve specific formulas and ask questions such as “What linear transforms were used for Position-wise Feed-Forward Networks?” This requires the Section Header “Position-wise Feed-Forward Networks” to be in the same chunk as the Formula (2) “FFN(x)=…”. Calling DocParse with the following chunking options will group the two elements together:
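A minimal sketch of such a call follows. It assumes the Python aryn_sdk package, that its partition_file function accepts a chunking_options dictionary, and that the strategy is spelled "context_rich". The helper name is hypothetical, so check the SDK reference for the exact signature and option names.

```python
# Chunking options for the context-rich strategy.
# Assumption: the strategy name is spelled "context_rich".
CONTEXT_RICH_OPTIONS = {"strategy": "context_rich"}


def parse_with_context_rich(path):
    """Send one document to DocParse with context-rich chunking enabled."""
    # Imported lazily so the options above can be inspected without the SDK.
    # Assumption: aryn_sdk exposes partition_file with a chunking_options
    # keyword and reads the API key from its usual configuration.
    from aryn_sdk.partition import partition_file

    with open(path, "rb") as f:
        return partition_file(f, chunking_options=CONTEXT_RICH_OPTIONS)
```

Calling parse_with_context_rich("paper.pdf"), where the file name is a placeholder, would return the parsed result for that document.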
If you inspect the elements in the return value, you’ll notice that the Section Header and the Formula are chunked together into one element:
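To make the merging behavior concrete, here is a toy illustration written for this tutorial. It is not DocParse's implementation: the function names, the element dictionaries with "type" and "text" keys, the "Section-header" and "Title" type labels, and the whitespace token count are all assumptions chosen to keep the sketch self-contained.

```python
HEADER_TYPES = {"Section-header", "Title"}  # assumed type labels


def count_tokens(text):
    # Whitespace-separated words stand in for a real tokenizer.
    return len(text.split())


def context_rich_chunks(elements, max_tokens=512):
    """Merge adjacent elements and prefix each chunk with the last seen header."""
    chunks = []
    header = None  # text of the last seen section header or title
    body = []      # texts of the non-header elements in the current chunk

    def emit():
        # Close the current chunk, prefixing it with the last seen header.
        if body:
            prefix = [header] if header else []
            chunks.append("\n".join(prefix + body))
            body.clear()

    for element in elements:
        text = element["text"]
        if element["type"] in HEADER_TYPES:
            emit()  # finish the chunk that belongs to the previous header
            header = text
            continue
        used = count_tokens(header or "") + sum(count_tokens(t) for t in body)
        if body and used + count_tokens(text) > max_tokens:
            emit()  # start a new chunk; it repeats the same header
        body.append(text)
    emit()
    return chunks


page = [
    {"type": "Section-header", "text": "Position-wise Feed-Forward Networks"},
    {"type": "Formula", "text": "FFN(x)=…"},
]
chunks = context_rich_chunks(page)
print(chunks)
```

With the two elements above, the header and the formula end up in a single chunk. When a section is too long for the limit, the toy version splits it and repeats the header at the start of each piece.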
Maximize Within Limit
The maximize_within_limit strategy is meant for merging several consecutive elements into one large chunk. Take the following example:
The bounding boxes shown above are the result of calling DocParse without any chunking options specified. Suppose that in your question-answering RAG application you want all the list items grouped into one chunk to improve the quality of your embeddings. Calling DocParse with the following chunking options will group the entire list into one chunk:
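A minimal sketch of such a call follows, together with a small helper for inspecting the result. It assumes the Python aryn_sdk package, that partition_file accepts a chunking_options dictionary, and that the response is a dictionary whose "elements" entries carry "type" and "text_representation" fields. The helper names are hypothetical.

```python
# Chunking options for the maximize_within_limit strategy.
MAXIMIZE_OPTIONS = {"strategy": "maximize_within_limit"}


def parse_with_maximize(path):
    """Send one document to DocParse with maximize_within_limit chunking."""
    # Imported lazily so the helper below can be used without the SDK.
    # Assumption: aryn_sdk exposes partition_file with a chunking_options
    # keyword and reads the API key from its usual configuration.
    from aryn_sdk.partition import partition_file

    with open(path, "rb") as f:
        return partition_file(f, chunking_options=MAXIMIZE_OPTIONS)


def summarize_elements(result, width=60):
    """Return one (type, text preview) pair per returned element."""
    # Assumption: each element has "type" and "text_representation" keys.
    return [
        (el.get("type"), (el.get("text_representation") or "")[:width])
        for el in result.get("elements", [])
    ]
```

Passing the return value of parse_with_maximize to summarize_elements gives a quick overview of how many chunks came back and how each one starts.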
If you inspect the elements list in the return value, you’ll notice that the entire bulleted list in the document is grouped into one chunk:
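The idea of merging consecutive elements up to a size limit can be sketched with a toy greedy merge. This is an illustration written for this tutorial, not DocParse's implementation: the function name, the element dictionaries with a "text" key, the sample list items, and the whitespace token count are all assumptions.

```python
def merge_within_limit(elements, max_tokens=512):
    """Greedily pack consecutive elements into chunks of at most max_tokens."""
    chunks = []
    current = []  # texts collected for the chunk being built
    used = 0      # tokens already in the current chunk

    for element in elements:
        text = element["text"]
        # Whitespace-separated words stand in for a real tokenizer.
        n = len(text.split())
        if current and used + n > max_tokens:
            # The next element would exceed the limit, so close this chunk.
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(text)
        used += n

    if current:
        chunks.append("\n".join(current))
    return chunks


items = [
    {"text": "item one"},
    {"text": "item two"},
    {"text": "item three"},
]
one_chunk = merge_within_limit(items)
two_chunks = merge_within_limit(items, max_tokens=4)
```

With the default limit all three items fit into one chunk. With a limit of four tokens the first two items are merged and the third starts a new chunk.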