The chunking strategies supported by Aryn DocParse
Chunking is configured in the partition_file call. You can enable the default chunking options by specifying an empty dict, as in the sketch below.
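For example, a minimal call through the aryn-sdk Python client might look like the following. The partition_file import and the chunking_options parameter name follow the aryn-sdk documentation, but treat them as assumptions if your client version differs; the API key is a placeholder:

```python
from aryn_sdk.partition import partition_file

# Passing an empty dict enables chunking with all default options
# (context_rich strategy, 512 max tokens, openai_tokenizer).
with open("mydocument.pdf", "rb") as f:
    data = partition_file(
        f,
        aryn_api_key="YOUR_ARYN_API_KEY",  # placeholder key
        chunking_options={},
    )
```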
Options you can specify in the dict include the following (a combined example follows this list):
- strategy: A string specifying the strategy to use for chunking. Valid values are context_rich and maximize_within_limit. The default and recommended chunker is context_rich, specified as {'strategy': 'context_rich'}.
  - context_rich chunker: The goal of this strategy is to add context to chunks. It creates chunks by combining adjacent elements until the chunk reaches the specified max-token limit. Each chunk contains a copy of the most recently seen section header or title; back-to-back section headers or titles are grouped together as one large section header during chunking.
  - maximize_within_limit chunker: The goal of this strategy is to make chunks as large as possible. It merges each element into the most recently merged set of elements unless doing so would push the token count past max_tokens; in that case, it keeps the new element separate and starts merging subsequent elements into it, following the same rule. All elements that result from mergers are assigned the type 'Section'. It merges elements across page boundaries unless merge_across_pages is set to False. A simplified sketch of this greedy merge rule appears after this list.
- max_tokens: An integer specifying the maximum token limit for a chunk. Defaults to 512.
- tokenizer: A string specifying the tokenizer to use when converting text into tokens. Valid values are openai_tokenizer, character_tokenizer, and huggingface_tokenizer. Defaults to openai_tokenizer.
- tokenizer_options: A tree with string keys specifying the options for the chosen tokenizer. Defaults to {'model_name': 'text-embedding-3-small'}, which works with the OpenAI tokenizer.
  - openai_tokenizer: model_name accepts all models supported by OpenAI's tiktoken tokenizer. Default is 'text-embedding-3-small'.
  - huggingface_tokenizer: model_name accepts all Hugging Face tokenizers from the huggingface/tokenizers repo.
  - character_tokenizer: Does not take any options.
- merge_across_pages: A boolean that, when True, allows the selected chunker to merge chunks across page boundaries. Defaults to True.
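Putting these options together, here is a hedged sketch of a fully specified configuration. The specific values chosen (token budget, Hugging Face model name) are illustrative, and the same aryn-sdk assumptions as the earlier example apply:

```python
from aryn_sdk.partition import partition_file

chunking_options = {
    "strategy": "maximize_within_limit",  # or "context_rich" (the default)
    "max_tokens": 1024,                   # per-chunk token budget (default 512)
    "tokenizer": "huggingface_tokenizer",
    "tokenizer_options": {"model_name": "bert-base-uncased"},  # example model
    "merge_across_pages": False,          # keep chunks within page boundaries
}

with open("mydocument.pdf", "rb") as f:
    data = partition_file(
        f,
        aryn_api_key="YOUR_ARYN_API_KEY",  # placeholder key
        chunking_options=chunking_options,
    )
```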
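To make the maximize_within_limit merge rule concrete, the following is a simplified, self-contained illustration of the greedy logic described above. It is not DocParse's implementation; in particular, the character-count tokenizer stand-in (count_tokens=len) is an assumption made for brevity:

```python
def greedy_merge(elements, max_tokens, count_tokens=len):
    """Merge consecutive elements into 'Section' chunks without
    letting any chunk exceed max_tokens (counted by count_tokens)."""
    chunks = []
    current, budget = [], 0
    for text in elements:
        cost = count_tokens(text)
        # If adding this element would exceed the budget, close the
        # current chunk and start merging subsequent elements into
        # a new one, following the same rule.
        if current and budget + cost > max_tokens:
            chunks.append({"type": "Section", "text": " ".join(current)})
            current, budget = [], 0
        current.append(text)
        budget += cost
    if current:
        chunks.append({"type": "Section", "text": " ".join(current)})
    return chunks

# With a 20-"token" (character) budget, the first two elements merge
# into one Section chunk and the third starts a new chunk.
print(greedy_merge(["short", "another piece", "a third element"], 20))
```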
The output is a JSON list of objects, each consisting of the following fields: type, bbox, properties, and text_representation. The type field indicates the type of the element (e.g., text, image, table), the properties field contains additional information about the element (e.g., confidence score, page number), and the text_representation field contains the text content of the element. In the context of chunking, the properties.score field and the bbox field should be ignored.
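Once chunking is enabled, the returned elements can be consumed directly. A brief sketch, assuming the parsed response stores the list under an "elements" key as in the examples above:

```python
# 'data' is the parsed JSON response from partition_file above.
for element in data["elements"]:
    # Skip elements without text content (e.g., pure images).
    if element.get("text_representation"):
        print(element["type"], "->", element["text_representation"][:80])
```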
An example element is given below: