`partition_file` call. You can enable the default chunking options by specifying an empty dict:
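For example, a minimal call might look like the following sketch. It assumes the Aryn SDK's `partition_file`, with the dict passed through a `chunking_options` parameter; the exact parameter name and import path may differ in your client:

```python
from aryn_sdk.partition import partition_file

with open("document.pdf", "rb") as f:
    # An empty dict turns chunking on with all default options.
    data = partition_file(f, aryn_api_key="YOUR_API_KEY", chunking_options={})
```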
Options
The options that you can specify in the dict include the following:

- `strategy`: A string specifying the strategy to use for chunking. Valid values are `context_rich` and `maximize_within_limit`. The default and recommended chunker is `context_rich`, specified as `{'strategy': 'context_rich'}`.
  - Behavior of the `context_rich` chunker: The goal of this strategy is to add context to chunks. It creates chunks by combining adjacent elements until the chunk reaches the specified max-token limit. Each chunk contains a copy of the most recently seen section header or title. Section headers or titles that appear back to back are grouped together as one large section header during chunking.
  - Behavior of the `maximize_within_limit` chunker: The goal of this strategy is to make the chunks as large as possible. It merges elements into the most recently merged set of elements unless doing so would make the token count exceed `max_tokens`; in that case, it keeps the new element separate and starts merging subsequent elements into it, following the same rule. All elements that result from mergers are assigned the type `Section`. It merges elements on different pages unless `merge_across_pages` is set to `False`. (A sketch of this merging rule appears after this list.)
- `max_tokens`: An integer specifying the maximum token limit for a chunk. The default value is 512.
- `tokenizer`: A string specifying the tokenizer to use when converting text into tokens. Valid values are `openai_tokenizer`, `character_tokenizer`, and `huggingface_tokenizer`. Defaults to `openai_tokenizer`.
- `tokenizer_options`: A tree with string keys specifying the options for the chosen tokenizer. Defaults to `{'model_name': 'text-embedding-3-small'}`, which works with the OpenAI tokenizer.
  - Available options for `openai_tokenizer`: `model_name` accepts all models supported by OpenAI's tiktoken tokenizer. The default is `text-embedding-3-small`.
  - Available options for `HuggingFaceTokenizer`: `model_name` accepts all Hugging Face tokenizers from the huggingface/tokenizers repo.
  - `character_tokenizer` does not take any options.
- `merge_across_pages`: A boolean that, when `True`, lets the selected chunker attempt to merge chunks across page boundaries. Defaults to `True`. (A fully specified options dict is shown after this list.)
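Putting the options together, a fully specified dict might look like the following sketch (same `chunking_options` assumption as above; the values shown are the documented defaults, except for `strategy` and `merge_across_pages`, which are changed for illustration):

```python
from aryn_sdk.partition import partition_file

chunking_options = {
    "strategy": "maximize_within_limit",  # default is "context_rich"
    "max_tokens": 512,                    # maximum tokens per chunk
    "tokenizer": "openai_tokenizer",      # or "character_tokenizer" / "huggingface_tokenizer"
    "tokenizer_options": {"model_name": "text-embedding-3-small"},
    "merge_across_pages": False,          # keep chunks from crossing page boundaries
}

with open("document.pdf", "rb") as f:
    data = partition_file(f, aryn_api_key="YOUR_API_KEY", chunking_options=chunking_options)
```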
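The merging rule described for `maximize_within_limit` is essentially a greedy pass over the element list. The sketch below illustrates that rule only and is not DocParse's implementation; `count_tokens` is a hypothetical stand-in for whichever tokenizer is configured:

```python
def maximize_within_limit(elements, max_tokens, count_tokens):
    """Greedily merge adjacent elements until adding the next one
    would push the running token count past max_tokens."""
    chunks, current, tokens = [], [], 0
    for element in elements:
        n = count_tokens(element.text)
        if current and tokens + n > max_tokens:
            chunks.append(current)   # close out the full chunk (becomes a "Section")
            current, tokens = [], 0
        current.append(element)      # merge into the open chunk
        tokens += n
    if current:
        chunks.append(current)
    return chunks
```

With the default `openai_tokenizer`, `count_tokens` would correspond to something like `len(tiktoken.encoding_for_model("text-embedding-3-small").encode(text))`.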
Output
The output of DocParse when you specify a chunking strategy is a JSON list of objects, each with the following fields: `type`, `bbox`, `properties`, and `text_representation`. The `type` field indicates the type of the element (e.g., text, image, table), the `properties` field contains additional information about the element (e.g., confidence score, page number), and the `text_representation` field contains the text content of the element. In the context of chunking, the `properties.score` field and the `bbox` field should be ignored.
An illustrative example element is given below (the field values here are representative, not output from a real document):
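```json
{
  "type": "Section",
  "bbox": [0.1, 0.18, 0.9, 0.42],
  "properties": {
    "score": 0.87,
    "page_number": 1
  },
  "text_representation": "Introduction\nDocParse converts documents into structured JSON..."
}
```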
