General
text_mode
A string that specifies the mode to use for text extraction. Default: "auto". Valid options are auto, inline_fallback_to_ocr, ocr_standard, and ocr_vision.
- auto intelligently uses the best combination of OCR and inline text.
- inline_fallback_to_ocr extracts the embedded text element-wise when present and falls back to performing OCR otherwise.
- ocr_standard uses the classical OCR pipeline.
- ocr_vision uses a vision model for OCR. Note that ocr_vision is only available for PAYG users.
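For instance, a minimal options fragment requesting vision-model OCR might look like the following sketch (the surrounding request structure is shown in the Examples section at the end of this page):

```python
# Illustrative DocParse options fragment (a Python dict mirroring the JSON options payload).
# ocr_vision requires a PAYG account; use "inline_fallback_to_ocr" to prefer embedded text.
options = {
    "text_mode": "ocr_vision",
}
```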
table_mode
A string specifying the table structure extraction mode. Default: "standard". Valid options are standard, vision, none, and custom.
- none will not extract table structure (equivalent to extract_table_structure = False).
- standard will use the standard hybrid table structure extraction pipeline.
- vision will use a vision model to extract table structure. Note that vision is only available for PAYG users.
- custom will use the custom expression described by the model_selection parameter in the table_extraction_options.
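For example, here are two illustrative options fragments: one selecting the vision table pipeline, and one deferring to a custom model-selection expression (described under table_extraction_options below):

```python
# Vision-based table structure extraction (PAYG only).
options = {"table_mode": "vision"}

# Custom selection: table_mode defers to the model_selection expression.
options = {
    "table_mode": "custom",
    "table_extraction_options": {
        "model_selection": "pixels > 500 -> deformable_detr; table_transformer",
    },
}
```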
property_extraction_options
This is a dictionary of options for extracting properties (key-value pairs) from documents such as invoices, purchase orders, contracts, etc. You can either provide a schema object that describes the properties you want DocParse to extract from the document being processed (e.g., the total dollar amount in an invoice), or ask DocParse to suggest a schema, which you can then use to extract properties.
- schema is a list of properties, each of which describes a specific occurrence of information appearing in the document.
- suggest_properties is a boolean option that tells DocParse to analyze the document and suggest an appropriate schema for it.
- suggest_properties_instructions is a string option that you can use to provide additional instructions to ensure that the schema DocParse infers from your document contains properties that you are interested in, e.g. "Make sure to include the full address of the vendor on the invoice".
The property_extraction_options dictionary must contain exactly one of schema or suggest_properties.
Note: suggest_properties only uses the first 20 pages of the processed document to infer the schema for the document.
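For illustration, and following the property format described under schema below, a property_extraction_options value could take either of these two shapes (field names should be checked against the schema reference):

```python
# Option A: provide your own schema (a list of properties).
options = {
    "property_extraction_options": {
        "schema": [
            {
                "name": "total_amount",
                "type": {
                    "type": "float",
                    "description": "Total dollar amount due on the invoice",
                    "examples": [1234.56],
                },
            },
        ],
    },
}

# Option B: ask DocParse to suggest a schema.
# Exactly one of schema / suggest_properties may be set.
options = {
    "property_extraction_options": {
        "suggest_properties": True,
        "suggest_properties_instructions": "Make sure to include the full address of the vendor on the invoice",
    },
}
```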
chunking_options
chunking_options is a dictionary of options for specifying chunking behavior. Chunking is only performed when this
option is present, and default options are chosen when chunking_options is specified as {}. Specify the chunking
strategy using the strategy key.
strategy: A string specifying the strategy to use to combine and split chunks. Valid values are context_rich
and maximize_within_limit. The default and recommended chunker is context_rich.
- Behavior of the context_rich chunker: The goal of this strategy is to add context to evenly-sized chunks. This is most useful for retrieval-based GenAI applications. The context_rich chunker combines adjacent Section-header and Title elements into a new Section-header element. It merges elements into a chunk with its most recent Section-header. If the chunk would contain too many tokens, it starts a new chunk by copying the Section-header to the start of this new chunk and continues. The chunker merges elements on different pages, unless merge_across_pages is set to False.
- Behavior of the maximize_within_limit chunker: The goal of the maximize_within_limit chunker is to make the chunks as large as possible. It merges elements into the most recently merged set of elements unless doing so would make its token count exceed max_tokens. In that case, it keeps the new element separate and starts merging subsequent elements into it, following the same rule. It merges elements on different pages, unless the chunking option merge_across_pages is set to False.
output_format
A string controlling the output representation. Options are:
- json (default): yields an array called elements containing the partitioned elements, represented in JSON.
- markdown: the service response will include a field called markdown containing a string representing the entire document in Markdown format.
- html: the service response will include a field called html containing a string representing the entire document in HTML format.
summarize_images (PAYG Only)
A boolean that, when True, generates a summary of the images in the document and returns it as the
text_representation. When False, images are not summarized. Default: False. summarize_images is only
available for Pay-As-You-Go (PAYG) users.
use_ocr (Deprecated)
Use text_mode instead. A boolean value that, when set to
True, causes DocParse to extract text using an OCR model.
This is useful when the text is not directly extractable from the PDF, such as when the text is part of an image or
when the text is rotated. When set to False, DocParse extracts embedded text from the input document. Default: False.
extract_table_structure (Deprecated)
Use table_mode instead. A boolean that, when
True, enables DocParse to extract tables and their structural
content using a purpose built table extraction model. If set to False, tables are still identified but not analyzed
for their structure; as a result, table cells and their bounding boxes are not included in the response. Default: True.
Tables
table_extraction_options
A map with string keys specifying options for table extraction. Only applied when extract_table_structure is True.
Default: {}.
table_extraction_options.include_additional_text
Boolean. When True, DocParse will attempt to enhance the table structure by merging in tokens from text extraction.
This can be useful for working with tables that have missing or misaligned text. Default: True.
table_extraction_options.model_selection
String. An expression to instruct DocParse how to select the table model to use for extraction. Default: "pixels > 500 -> deformable_detr; table_transformer", which means "if the largest dimension of the table is more than 500 pixels, use deformable_detr; otherwise use table_transformer." To use only deformable_detr or table_transformer, set model_selection="deformable_detr" or model_selection="table_transformer". Selection expressions are composed of "if metric compares to threshold, then use model" statements. Statements are processed from left to right.
- Supported models are table_transformer, which tends to do well with smaller tables, and deformable_detr, which tends to do better with larger tables.
- Supported metrics are pixels, which corresponds to the maximum dimension of the bounding box containing the table (we find this to be easier to reason about than the total number of pixels, which depends on two numbers), and chars, which corresponds to the total number of characters within the table as determined by the OCR/text extraction step.
- Thresholds must be numeric.
- Supported comparison operators are <, >, <=, >=, ==, !=.
Examples:
- table_transformer => always use table_transformer.
- pixels > 500 -> deformable_detr; table_transformer => if the biggest dimension of the table is greater than 500 pixels, use deformable_detr; otherwise use table_transformer.
- pixels>50->table_transformer; chars<30->deformable_detr; chars>35->table_transformer; pixels>2->deformable_detr; table_transformer; comment => if the biggest dimension is more than 50 pixels, use table_transformer. Else if the total number of chars in the table is less than 30, use deformable_detr. Else if there are more than 35 chars, use table_transformer. Else if there are more than 2 pixels in the biggest dimension, use deformable_detr. Otherwise use table_transformer. comment is not processed.
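For example, an illustrative table_extraction_options fragment that mirrors the default selection expression and also merges in additional text:

```python
# Illustrative options fragment for table extraction.
options = {
    "extract_table_structure": True,
    "table_extraction_options": {
        "include_additional_text": True,
        # Use deformable_detr for large tables, table_transformer otherwise.
        "model_selection": "pixels > 500 -> deformable_detr; table_transformer",
    },
}
```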
Text
ocr_language
A string that specifies the language model to use for OCR. Default: "english" (English). The full list of
supported languages can be found here.
text_extraction_options
A map with string keys specifying options for text extraction.
text_extraction_options.remove_line_breaks
A boolean that specifies whether to remove line breaks from the text. Default: True.
text_extraction_options.ocr_text_mode (Deprecated)
Use text_mode instead. A string that specifies the mode to use for OCR text extraction.
Default: "standard", which uses the conventional classical OCR pipeline to process documents. The other option is vision, which uses a vision model for OCR. Note that vision is only available for PAYG users.
Property extraction
schema
A schema consists of properties that map to specific parts of interest in a document such as names, places, quantities, etc. Often, documents of the same type would have a common schema which can be shared when processing those documents for property extraction. A property is a dictionary of name and type.
- name is the name of a property, e.g. "address", "first_name".
- type is a dictionary of key-value pairs describing the property, which consists of:
  - type: the type of the property. A simple type can be "int", "float", "date", "string", "bool", or "choice". A nested type of "array" can consist of properties of simple types.
  - description: a description of the property.
  - examples: an array of examples.
  - is_derived: ???
  - validators: a validator is a way to perform a data quality check on the extracted value. You can use a regex to ensure a string property conforms to a specific format, e.g. "(123) 456 - 7890".
Example 1: string with a regex validator
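A sketch of what such a property definition might look like, following the name/type structure described above; the exact validator syntax is an assumption and may differ in the actual API:

```python
# Hypothetical property: a phone number that must match "(123) 456 - 7890" style formatting.
phone_property = {
    "name": "vendor_phone",
    "type": {
        "type": "string",
        "description": "Phone number of the vendor on the invoice",
        "examples": ["(123) 456 - 7890"],
        # Assumed validator form: a regex the extracted value must match.
        "validators": [r"\(\d{3}\) \d{3} - \d{4}"],
    },
}
```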
Example 2: int
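A sketch of a simple integer property, using the fields described above:

```python
# Hypothetical integer property.
quantity_property = {
    "name": "line_item_count",
    "type": {
        "type": "int",
        "description": "Number of line items on the purchase order",
        "examples": [3, 12],
    },
}
```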
Example 3: choice
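A sketch of a choice property; the field used to enumerate the allowed values is an assumption:

```python
# Hypothetical choice property.
status_property = {
    "name": "invoice_status",
    "type": {
        "type": "choice",
        "description": "Payment status of the invoice",
        # Assumed field for enumerating the allowed choices.
        "choices": ["paid", "unpaid", "overdue"],
    },
}
```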
Example 4: array
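A sketch of an array property whose items are simple-typed; the field describing the item type is an assumption:

```python
# Hypothetical array property.
line_items_property = {
    "name": "line_items",
    "type": {
        "type": "array",
        "description": "Individual line items on the invoice",
        # Assumed field describing the item type of the array.
        "items": {
            "type": "string",
            "description": "A single line item, e.g. '2 x widget @ $5.00'",
        },
    },
}
```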
Chunking
chunking_options
A dictionary of options for specifying chunking behavior. Chunking is only performed when this option is present, and default options are chosen when chunking_options is specified as {}.
chunking_options.strategy
A string specifying the strategy to use to combine and split chunks. Valid values are context_rich and maximize_within_limit. Default: "context_rich". The recommended chunker is context_rich, denoted as {'strategy': 'context_rich'}.
- Behavior of the context_rich chunker: The goal of this strategy is to add context to evenly-sized chunks. This is most useful for retrieval-based GenAI applications. The context_rich chunker combines adjacent Section-header and Title elements into a new Section-header element. It merges elements into a chunk with its most recent Section-header. If the chunk would contain too many tokens, it starts a new chunk by copying the Section-header to the start of this new chunk and continues. The chunker merges elements on different pages, unless merge_across_pages is set to False.
- Behavior of the maximize_within_limit chunker: The goal of the maximize_within_limit chunker is to make the chunks as large as possible. It merges elements into the most recently merged set of elements unless doing so would make its token count exceed max_tokens. In that case, it keeps the new element separate and starts merging subsequent elements into it, following the same rule. It merges elements on different pages, unless merge_across_pages is set to False.
chunking_options.max_tokens
An integer specifying the cutoff for splitting chunks that are too large. Default: 512.
chunking_options.tokenizer
A string specifying the tokenizer to use when determining how characters in a chunk are grouped. Valid values are openai_tokenizer, character_tokenizer, and huggingface_tokenizer. Defaults to openai_tokenizer.
chunking_options.tokenizer_options
A tree with string keys specifying the options for the chosen tokenizer. Defaults to {'model_name': 'text-embedding-3-small'}, which works with the OpenAI tokenizer.
- Available options for openai_tokenizer:
  - model_name: Accepts all models supported by OpenAI's tiktoken tokenizer. Default is "text-embedding-3-small".
- Available options for huggingface_tokenizer:
  - model_name: Accepts all HuggingFace tokenizers from the huggingface/tokenizers repo.
- character_tokenizer does not take any options.
chunking_options.merge_across_pages
A boolean that, when True, causes the selected chunker to attempt to merge chunks across page boundaries. Default: True.
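Putting the chunking options together, an illustrative configuration using the keys documented above might look like this:

```python
# Illustrative chunking configuration: context_rich chunks of at most 512 tokens,
# measured with the OpenAI tokenizer, without merging across page boundaries.
options = {
    "chunking_options": {
        "strategy": "context_rich",
        "max_tokens": 512,
        "tokenizer": "openai_tokenizer",
        "tokenizer_options": {"model_name": "text-embedding-3-small"},
        "merge_across_pages": False,
    },
}
```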
Images
extract_images
A boolean that determines whether to extract images from the document. The format is determined by the value of extract_image_format. Default: False.
image_extraction_options
A dictionary of options for specifying image extraction behavior.
image_extraction_options.associate_captions
A boolean that specifies whether to associate captions with the images. Default: False.
image_extraction_options.extract_image_format
A string indicating in which format extracted images should be returned. Must be one of ppm, png, or jpeg. In all cases, the result will be base64 encoded before being returned. Default: "ppm". (The standalone extract_image_format option outside of image_extraction_options is deprecated.)
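For example, an illustrative fragment that extracts images as PNG with captions attached:

```python
# Illustrative image-extraction configuration.
options = {
    "extract_images": True,
    "image_extraction_options": {
        "associate_captions": True,
        "extract_image_format": "png",
    },
}
```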
Advanced
add_to_docset_id
A string that specifies the DocSet ID to store your parsed document in. By default, DocParse will use the DocSet named docparse_storage unless you have disabled data retention.
threshold
The threshold for accepting the model's predicted bounding boxes: either the string auto or a float between 0.0 and 1.0, inclusive. Default: "auto", where the service uses a processing method to find the best prediction for each possible bounding box. This is the recommended setting. If you specify a numerical threshold instead, only bounding boxes with confidence scores higher than the threshold will be returned (the processing method described above is not used). A lower value will include more objects but may produce overlaps, while a higher value will reduce the number of overlaps but may miss legitimate objects. If you do set the threshold manually, we recommend starting with a value of 0.32.
pages_per_call
This is only available when using the Partition function in Sycamore. This option divides the processing of your document into batches of pages, and you specify the size of each batch (number of pages). This is useful when running OCR on large documents. Default: -1. Use -1 to process all pages at once.
output_label_options
A dictionary of options to specify which heuristic to apply to enforce certain label outputs. If this option is not specified, no heuristic is applied. The options the dictionary supports are listed below.
output_label_options.promote_title
A boolean that specifies whether to promote an element to title if there's no title in the output. Default: False.
output_label_options.title_candidate_elements
A list of strings that are candidate elements to be promoted to title. Default: ["Section-header", "Caption"].
output_label_options.orientation_correction
A boolean value that specifies whether to correct the orientation of rotated pages during the preprocessing step. Default: False.
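An illustrative output_label_options fragment using the keys above:

```python
# Promote a Section-header or Caption to Title when no Title is detected,
# and correct the orientation of rotated pages.
options = {
    "output_label_options": {
        "promote_title": True,
        "title_candidate_elements": ["Section-header", "Caption"],
        "orientation_correction": True,
    },
}
```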
markdown_options
A dictionary of options to specify what to include in the markdown output. Default: {}.
markdown_options.include_pagenum
A boolean that specifies whether to include page numbers in the markdown output. Default: False.
markdown_options.include_headers
A boolean that specifies whether to include headers in the markdown output. Default: False.
markdown_options.include_footers
A boolean that specifies whether to include footers in the markdown output. Default: False.
markdown_options.group_by_page
A boolean that specifies whether to group markdown output by page. If set to True, markdown is returned as an array of pages of markdown. Default: False.
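An illustrative fragment requesting Markdown output with all of these markdown_options enabled:

```python
# Request Markdown output with page numbers, headers, and footers included, grouped by page.
options = {
    "output_format": "markdown",
    "markdown_options": {
        "include_pagenum": True,
        "include_headers": True,
        "include_footers": True,
        "group_by_page": True,
    },
}
```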
Examples
Here are examples of how you can use multiple of these options in a curl command or in Python code with the Aryn SDK.
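For example, here is a sketch of combining several options with the Aryn SDK; the partition_file import and keyword-argument names reflect common aryn_sdk usage and should be verified against your installed SDK version:

```python
from aryn_sdk.partition import partition_file

# Sketch: combine text, table, chunking, and output options in one call.
# Keyword-argument names mirror the option names on this page; verify the exact
# partition_file signature against your aryn_sdk version.
with open("my_document.pdf", "rb") as f:
    result = partition_file(
        f,
        aryn_api_key="YOUR_ARYN_API_KEY",
        text_mode="ocr_standard",
        table_mode="standard",
        table_extraction_options={"include_additional_text": True},
        chunking_options={"strategy": "context_rich", "max_tokens": 512},
        output_format="markdown",
    )

# With output_format="markdown", the response includes a "markdown" field.
print(result.get("markdown", ""))
```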
