Chunking Strategies
The chunking strategies supported by Aryn DocParse
When calling DocParse, you can pass a chunking strategy to the partition_file call. You can enable the default chunking options by specifying an empty dict:
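For example, here is a minimal sketch using the Python aryn-sdk (this assumes partition_file accepts the options through a chunking_options argument and that you have an Aryn API key; consult the SDK reference for your version):

```python
from aryn_sdk.partition import partition_file

# Hypothetical input file and placeholder API key.
with open("mydocument.pdf", "rb") as f:
    data = partition_file(
        f,
        aryn_api_key="YOUR_ARYN_API_KEY",
        chunking_options={},  # an empty dict enables the default chunking options
    )
```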
Here is an example specifying a particular chunking option:
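Continuing the sketch above, you might select a specific strategy and token limit (the option names follow the list in the next section; the values are illustrative):

```python
with open("mydocument.pdf", "rb") as f:
    data = partition_file(
        f,
        aryn_api_key="YOUR_ARYN_API_KEY",
        chunking_options={
            "strategy": "maximize_within_limit",  # override the default context_rich strategy
            "max_tokens": 256,                    # use a smaller chunk size than the 512 default
        },
    )
```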
Options
The options that you can specify in the dict include the following (a combined example appears after the list):
- `strategy`: A string specifying the strategy to use for chunking. Valid values are `context_rich` and `maximize_within_limit`. The default and recommended chunker is `context_rich`, specified as `{'strategy': 'context_rich'}`.
  - Behavior of the `context_rich` chunker: the goal of this strategy is to add context to chunks. It creates chunks by combining adjacent elements until the chunk reaches the specified max-token limit. Each chunk contains a copy of the most recently seen section header or title, and back-to-back section headers or titles are grouped together as one large section header during chunking.
  - Behavior of the `maximize_within_limit` chunker: the goal of this strategy is to make chunks as large as possible. It merges elements into the most recently merged set of elements unless doing so would push the token count past `max_tokens`; in that case, it keeps the new element separate and starts merging subsequent elements into it, following the same rule. All elements that result from mergers are assigned the type `Section`. It merges elements across pages unless `merge_across_pages` is set to `False`.
- `max_tokens`: An integer specifying the maximum token limit for a chunk. The default value is 512.
- `tokenizer`: A string specifying the tokenizer to use when converting text into tokens. Valid values are `openai_tokenizer`, `character_tokenizer`, and `huggingface_tokenizer`. Defaults to `openai_tokenizer`.
- `tokenizer_options`: A tree with string keys specifying the options for the chosen tokenizer. Defaults to `{'model_name': 'text-embedding-3-small'}`, which works with the OpenAI tokenizer.
  - Available options for `openai_tokenizer`: `model_name` accepts all models supported by OpenAI's tiktoken tokenizer. The default is `text-embedding-3-small`.
  - Available options for `huggingface_tokenizer`: `model_name` accepts all tokenizers from the huggingface/tokenizers repo.
  - `character_tokenizer` does not take any options.
- `merge_across_pages`: A `boolean` that, when `True`, directs the selected chunker to attempt to merge chunks across page boundaries. Defaults to `True`.
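Putting the options together, here is an illustrative dict that sets every option explicitly (the values shown are the documented defaults, except where commented):

```python
chunking_options = {
    "strategy": "context_rich",        # or "maximize_within_limit"
    "max_tokens": 512,                 # maximum tokens per chunk
    "tokenizer": "openai_tokenizer",   # or "character_tokenizer" / "huggingface_tokenizer"
    "tokenizer_options": {
        "model_name": "text-embedding-3-small",  # a tiktoken model name, for openai_tokenizer
    },
    "merge_across_pages": True,        # allow chunks to span page boundaries
}
```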
Output
The output of DocParse when you specify a chunking strategy is a JSON list of objects. Each entry in the list always has a `type`, `bbox`, `properties`, and `text_representation` field. The `type` field indicates the type of the element (e.g., text, image, table), the `properties` field contains additional information about the element (e.g., confidence score, page number), and the `text_representation` field contains the text content of the element. In the context of chunking, the `properties.score` field and the `bbox` field should be ignored.
An example element is given below:
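The values in this sketch are illustrative rather than real DocParse output, and the exact set of properties may vary:

```json
{
  "type": "Section",
  "bbox": [0.12, 0.08, 0.88, 0.33],
  "properties": {
    "score": 0.93,
    "page_number": 1
  },
  "text_representation": "Chunking Strategies\nWhen calling DocParse, you can specify a chunking strategy..."
}
```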
To see examples of how to use these chunking strategies, please read here.