Processing Options
Learn about the parameters you can use with Aryn DocParse
There are several options you can specify when calling DocParse. For example, setting `extract_table_structure` to `True` extracts the table structure from a document.
All of the available options are listed below, and are optional unless specified otherwise.
- `use_ocr`: A boolean that, when set to `True`, causes DocParse to extract text using an OCR model. This is useful when the text is not directly extractable from the PDF, such as when the text is part of an image or is rotated. When set to `False`, DocParse extracts embedded text from the input document. Default is `False`.
- `text_extraction_options`: A map with string keys specifying options for text extraction.
  - `ocr_text_mode`: A string that specifies the mode to use for OCR text extraction. The default is `standard`, which uses the classical OCR pipeline to process documents. The other option is `vision`, which uses a vision model for OCR. Note that `vision` is available only to PAYG users and only for non-table elements (`standard` is used for table elements).
- `extract_table_structure`: A boolean that, when `True`, enables DocParse to extract tables and their structural content using a purpose-built table extraction model. If set to `False`, tables are still identified but not analyzed for their structure; as a result, table cells and their bounding boxes are not included in the response. Default is `False`.
- `add_to_docset_id`: A string that specifies the ID of the DocSet in which to store your parsed document. By default, DocParse uses the DocSet named `docparse_storage` unless you have disabled data retention.
- `table_extraction_options`: A map with string keys specifying options for table extraction. Only applied when `extract_table_structure` is `True`. Default is empty (`{}`).
  - `include_additional_text`: Boolean. When `True`, DocParse attempts to enhance the table structure by merging in tokens from text extraction. This can be useful for tables that have missing or misaligned text. Default is `False`.
  - `model_selection`: String. An expression that tells DocParse how to select the table model to use for extraction. Default is `"pixels > 500 -> deformable_detr; table_transformer"`, which means "if the largest dimension of the table is more than 500 pixels, use deformable_detr; otherwise use table_transformer." To use only one model, set `model_selection="deformable_detr"` or `model_selection="table_transformer"`.
    Selection expressions are a series of statements separated by `;`. Each conditional statement has the form `metric comparison threshold -> model` and should be read as "if metric compares to threshold, then use model". Statements are processed from left to right.
    - Supported models are `table_transformer`, which tends to do well with smaller tables, and `deformable_detr`, which tends to do better with larger tables.
    - Supported metrics are `pixels`, which corresponds to the maximum dimension of the bounding box containing the table (we find this easier to reason about than the total number of pixels, which depends on two numbers), and `chars`, which corresponds to the total number of characters within the table as determined by the OCR/text extraction step.
    - Thresholds must be numeric.
    - Supported comparison operators are `<`, `>`, `<=`, `>=`, `==`, `!=`.

    A statement with no metric, comparison, and threshold acts as a default: anything after this unconditional statement is not processed. If no unconditional statement is included and no conditions match, DocParse defaults to table_transformer.
    Examples:
    - `table_transformer` => always use table_transformer.
    - `pixels > 500 -> deformable_detr; table_transformer` => if the biggest dimension of the table is greater than 500 pixels, use deformable_detr. Otherwise use table_transformer.
    - `pixels>50->table_transformer; chars<30->deformable_detr;chars>35->table_transformer;pixels>2->deformable_detr;table_transformer;comment` => if the biggest dimension is more than 50 pixels, use table_transformer. Else if the total number of chars in the table is less than 30, use deformable_detr. Else if there are more than 35 chars, use table_transformer. Else if there are more than 2 pixels in the biggest dimension, use deformable_detr. Otherwise use table_transformer. `comment` is not processed.
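To make the left-to-right evaluation of `model_selection` concrete, here is a minimal Python sketch of how such an expression could be interpreted. It is not DocParse's implementation: the function name `select_table_model` and the parsing details are assumptions made for illustration, based only on the rules described above.

```python
import operator
import re

# Comparison operators listed in the documentation above.
_OPS = {
    "<": operator.lt, ">": operator.gt, "<=": operator.le,
    ">=": operator.ge, "==": operator.eq, "!=": operator.ne,
}
_MODELS = {"table_transformer", "deformable_detr"}
# A condition looks like "pixels > 500" or "chars<30" (numeric threshold).
_CONDITION = re.compile(r"^(pixels|chars)\s*(<=|>=|==|!=|<|>)\s*(\d+(?:\.\d+)?)$")


def select_table_model(expression: str, pixels: float, chars: float) -> str:
    """Hypothetical helper: return the model an expression would choose."""
    metrics = {"pixels": pixels, "chars": chars}
    for statement in expression.split(";"):
        statement = statement.strip()
        if not statement:
            continue  # tolerate a trailing ';'
        if "->" not in statement:
            # Unconditional statement: acts as the default, the rest is ignored.
            if statement not in _MODELS:
                raise ValueError(f"unknown model: {statement!r}")
            return statement
        condition, model = (part.strip() for part in statement.split("->", 1))
        match = _CONDITION.match(condition)
        if match is None or model not in _MODELS:
            raise ValueError(f"cannot parse statement: {statement!r}")
        metric, op, threshold = match.groups()
        if _OPS[op](metrics[metric], float(threshold)):
            return model
    # No condition matched and no unconditional statement was given.
    return "table_transformer"
```

For example, under these assumptions `select_table_model("pixels > 500 -> deformable_detr; table_transformer", pixels=800, chars=120)` picks `deformable_detr`, while the same expression with `pixels=300` falls through to `table_transformer`.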
- `extract_images`: A boolean that determines whether to extract images from the document. Default is `False`.
- `summarize_images`: (PAYG only) A boolean that, when `True`, generates a summary of each image in the document and returns it as the `text_representation`. When `False`, images are not summarized. Default is `False`.
- `ocr_language`: A string that specifies the language to use for OCR. The default is `english` (English). The full list of supported languages can be found here.
- `selected_pages`: A list specifying individual pages (1-indexed) and page ranges from the document to partition. Single pages are specified as integers, and ranges are specified as lists with two integer entries in ascending order. A valid example value for `selected_pages` is `[1, 10, [15, 20]]`, which would include pages 1, 10, and 15 through 20. `selected_pages` is `None` by default, which results in all pages of the document being parsed.
- `chunking_options`: A dictionary of options for specifying chunking behavior. Chunking is only performed when this option is present, and default options are chosen when `chunking_options` is specified as `{}`.
  - `strategy`: A string specifying the strategy to use to combine and split chunks. Valid values are `context_rich` and `maximize_within_limit`. The default and recommended chunker is `context_rich`, specified as `{'strategy': 'context_rich'}`.
    - Behavior of the `context_rich` chunker: The goal of this strategy is to add context to evenly sized chunks. This is most useful for retrieval-based GenAI applications. The `context_rich` chunker combines adjacent `Section-header` and `Title` elements into a new `Section-header` element. It merges each element into a chunk with its most recent `Section-header`. If the chunk would contain too many tokens, it starts a new chunk by copying the `Section-header` to the start of this new chunk and continues. The chunker merges elements on different pages unless `merge_across_pages` is set to `False`.
    - Behavior of the `maximize_within_limit` chunker: The goal of the `maximize_within_limit` chunker is to make the chunks as large as possible. It merges each element into the most recently merged set of elements unless doing so would make its token count exceed `max_tokens`. In that case, it keeps the new element separate and starts merging subsequent elements into that one, following the same rule. It merges elements on different pages unless `merge_across_pages` is set to `False`.
  - `max_tokens`: An integer specifying the cutoff for splitting chunks that are too large. Default value is 512.
  - `tokenizer`: A string specifying the tokenizer to use when determining how characters in a chunk are grouped. Valid values are `openai_tokenizer`, `character_tokenizer`, and `huggingface_tokenizer`. Defaults to `openai_tokenizer`.
  - `tokenizer_options`: A map with string keys specifying the options for the chosen tokenizer. Defaults to `{'model_name': 'text-embedding-3-small'}`, which works with the OpenAI tokenizer.
    - Available options for `openai_tokenizer`:
      - `model_name`: Accepts all models supported by OpenAI's tiktoken tokenizer. Default is `text-embedding-3-small`.
    - Available options for `huggingface_tokenizer`:
      - `model_name`: Accepts all Hugging Face tokenizers from the huggingface/tokenizers repo.
    - `character_tokenizer` does not take any options.
  - `merge_across_pages`: A boolean that, when `True`, causes the selected chunker to attempt to merge chunks across page boundaries. Defaults to `True`.
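The merging rule of the `maximize_within_limit` chunker can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DocParse's code: elements are represented as hypothetical dictionaries with `text` and `page` keys, tokens are counted one per character (similar in spirit to a character tokenizer), and an element that alone exceeds `max_tokens` is kept whole rather than split.

```python
from typing import Dict, List, Optional


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per character."""
    return len(text)


def maximize_within_limit(
    elements: List[Dict],
    max_tokens: int = 512,
    merge_across_pages: bool = True,
) -> List[List[str]]:
    """Greedily merge consecutive elements into chunks of at most max_tokens."""
    chunks: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    last_page: Optional[int] = None
    for element in elements:
        tokens = count_tokens(element["text"])
        too_big = current_tokens + tokens > max_tokens
        crosses_page = last_page is not None and element["page"] != last_page
        if current and (too_big or (crosses_page and not merge_across_pages)):
            # Close the current chunk; the new element starts the next one.
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(element["text"])
        current_tokens += tokens
        last_page = element["page"]
    if current:
        chunks.append(current)
    return chunks
```

With three four-character elements and `max_tokens=8`, the first two are merged and the third starts a new chunk. Setting `merge_across_pages=False` additionally forces a new chunk whenever the page number changes.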
- `output_format`: A string controlling the output representation. Defaults to `json`, which yields an array called `elements` that contains the partitioned elements, represented in JSON. If set to `markdown`, the service response will instead include a field called `markdown` that contains a string representing the entire document in Markdown format.
- `threshold`: Either the specific string `auto` or a float between `0.0` and `1.0`, inclusive. This value specifies the cutoff for accepting the model's predicted bounding boxes. It defaults to `auto`, where the service uses a processing method to find the best prediction for each possible bounding box. This is the recommended setting. It can be overridden by specifying a numerical threshold, in which case only bounding boxes with confidence scores higher than the threshold are returned (instead of using the processing method described above). A lower value will include more objects, but may have overlaps, while a higher value will reduce the number of overlaps, but may miss legitimate objects. If you do set the threshold manually, we recommend starting with a value of `0.32`.
- `pages_per_call`: This is only available when using the Partition function in Sycamore. This option divides the processing of your document into batches of pages, and you specify the size of each batch (number of pages). This is useful when running OCR on large documents.
- `output_label_options`: A dictionary of options to specify which heuristic to apply to enforce certain label outputs. If this option is not specified, no heuristic is applied. The options the dictionary supports are listed below.
  - `promote_title`: A boolean that specifies whether to promote an element to title if there is no title in the output.
  - `title_candidate_elements`: A list of strings that are candidate elements to be promoted to title.
  - `orientation_correction`: A boolean that specifies whether to correct the orientation of rotated pages during the preprocessing step.
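For the numeric form of `threshold` described above, the effect can be pictured as a simple confidence filter. The Python sketch below is illustrative only: the helper names and the `confidence` key are assumptions, and the `auto` mode's processing method is not modeled.

```python
from typing import Dict, List, Union


def validate_threshold(threshold: Union[str, float]) -> Union[str, float]:
    """Accept the string 'auto' or a number in [0.0, 1.0] (hypothetical check)."""
    if threshold == "auto":
        return "auto"
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        raise TypeError("threshold must be 'auto' or a float")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0, inclusive")
    return float(threshold)


def filter_boxes(boxes: List[Dict], threshold: float) -> List[Dict]:
    """Keep only boxes whose confidence is strictly higher than the threshold."""
    return [box for box in boxes if box["confidence"] > threshold]
```

A lower threshold keeps more boxes (and more potential overlaps), while a higher one keeps fewer and may drop legitimate objects, which is the trade-off described for this option.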
- `markdown_options`: A dictionary of options to specify what to include in the markdown output.
  - `include_pagenum`: A boolean that specifies whether to include page numbers in the markdown output. Default is `False`.
  - `include_headers`: A boolean that specifies whether to include headers in the markdown output. Default is `False`.
  - `include_footers`: A boolean that specifies whether to include footers in the markdown output. Default is `False`.
Here is an example of how you can use some of these options in a curl command or in Python code with the Aryn SDK.
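As a minimal Python sketch, a set of options could be assembled as a plain dictionary using the option names documented above. The helper `expand_selected_pages` is hypothetical and only shows which pages a `selected_pages` value covers. Passing the dictionary to the Aryn SDK (for instance as keyword arguments to its `partition_file` function) is an assumption about the SDK's interface and is left as a comment rather than executed.

```python
from typing import Any, Dict, List, Union

PageSpec = Union[int, List[int]]


def expand_selected_pages(selected_pages: List[PageSpec]) -> List[int]:
    """Hypothetical helper: list the 1-indexed pages a selected_pages value covers."""
    pages: List[int] = []
    for entry in selected_pages:
        if isinstance(entry, int):
            pages.append(entry)
        else:
            start, end = entry
            if start > end:
                raise ValueError("page ranges must be in ascending order")
            pages.extend(range(start, end + 1))
    return pages


# Option names come from the list above; the chosen values are examples.
options: Dict[str, Any] = {
    "use_ocr": True,
    "extract_table_structure": True,
    "table_extraction_options": {"include_additional_text": True},
    "selected_pages": [1, 10, [15, 20]],
    "chunking_options": {"strategy": "context_rich", "max_tokens": 512},
    "output_format": "json",
    "threshold": "auto",
}

# Assumption, not run here: with the Aryn SDK installed, these could be
# passed as keyword arguments, e.g. partition_file(file, **options).
print(expand_selected_pages(options["selected_pages"]))
```

Keeping the options in one dictionary makes it easy to reuse the same settings across several documents.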