Learn about the parameters you can use with Aryn DocParse
Default: "inline_fallback_to_ocr"
. Valid options are
inline
, inline_fallback_to_ocr
, ocr_standard
, and ocr_vision
.
inline
extracts the embedded textinline_fallback_to_ocr
extract the embedded text elementwise when present and falls back to performing OCR otherwiseocr_standard
uses the classical OCR pipelineocr_vision
uses a vision model for OCR. Note that ocr_vision
is only available for PAYG users.Default: "standard"
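For instance, the extraction mode can be passed when submitting a document. The sketch below assumes the aryn-sdk Python client, that its partition_file function forwards text_mode as a keyword argument, and that an Aryn API key is available in the environment:

```python
from aryn_sdk.partition import partition_file

# Force the classical OCR pipeline instead of relying on embedded text.
# Assumes ARYN_API_KEY is set in the environment (or pass aryn_api_key=...).
with open("document.pdf", "rb") as f:
    result = partition_file(f, text_mode="ocr_standard")

# Partitioned elements come back under the "elements" key.
for element in result["elements"][:5]:
    print(element["type"], element.get("text_representation", "")[:60])
```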
Default: "standard". Valid options are standard, vision, none, and custom.

none will not extract table structure (equivalent to extract_table_structure = False). standard will use the standard hybrid table structure extraction pipeline. vision will use a vision model to extract table structure. custom will use the custom expression described by the model_selection parameter in the table_extraction_options.
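Similarly, the table extraction mode can be selected per request. A minimal sketch under the same aryn-sdk assumptions (the file name and the lowercase "table" type label are placeholders/assumptions):

```python
from aryn_sdk.partition import partition_file

# Use the vision model for table structure extraction (PAYG only);
# "standard", "none", or "custom" could be passed here instead.
with open("report.pdf", "rb") as f:
    result = partition_file(f, table_mode="vision")

# Count the table elements that were identified.
tables = [e for e in result["elements"] if e["type"].lower() == "table"]
print(f"found {len(tables)} tables")
```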
chunking_options is a dictionary of options for specifying chunking behavior. Chunking is only performed when this option is present, and default options are chosen when chunking_options is specified as {}. Specify the chunking strategy using the strategy key.
strategy: A string specifying the strategy to use to combine and split chunks. Valid values are context_rich and maximize_within_limit. The default and recommended chunker is context_rich.
context_rich chunker: The goal of this strategy is to add context to evenly-sized chunks. This is most useful for retrieval-based GenAI applications. The context_rich chunker combines adjacent Section-header and Title elements into a new Section-header element. It merges elements into a chunk with its most recent Section-header. If the chunk would contain too many tokens, it starts a new chunk by copying the Section-header to the start of the new chunk and continues. The chunker merges elements on different pages, unless merge_across_pages is set to False.
maximize_within_limit chunker: The goal of the maximize_within_limit chunker is to make the chunks as large as possible. It merges elements into the most recently merged set of elements unless doing so would make its token count exceed max_tokens. In that case, it keeps the new element separate and starts merging subsequent elements into that one, following the same rule. It merges elements on different pages, unless the chunking option merge_across_pages is set to False.
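To make the chunking options concrete, here is a sketch of a context_rich configuration, again assuming the aryn-sdk client forwards chunking_options as a keyword argument:

```python
from aryn_sdk.partition import partition_file

# Chunk with the default, recommended strategy and a 512-token budget.
chunking = {
    "strategy": "context_rich",
    "max_tokens": 512,           # documented default
    "merge_across_pages": True,  # allow chunks to span page boundaries
}

with open("whitepaper.pdf", "rb") as f:
    result = partition_file(f, chunking_options=chunking)
```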
Default: "json"
, which yields an array called elements
which
contains the partitioned elements, represented in JSON. If set to markdown
the service response will instead include
a field called markdown
that contains a string representing the entire document in Markdown format.
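As an example of the markdown output shape, a sketch under the same aryn-sdk assumption:

```python
from aryn_sdk.partition import partition_file

with open("manual.pdf", "rb") as f:
    result = partition_file(f, output_format="markdown")

# With output_format="markdown" the response carries a single markdown string
# instead of the usual "elements" array.
print(result["markdown"][:500])
```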
When True, generates a summary of the images in the document and returns it as the text_representation. When False, images are not summarized. Default: False. summarize_images is only available for Pay-As-You-Go (PAYG) users.
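A sketch of enabling image summarization; the lowercase "image" element type label used in the filter below is an assumption, not something this page specifies:

```python
from aryn_sdk.partition import partition_file

with open("slides.pdf", "rb") as f:
    result = partition_file(f, summarize_images=True)

# Image elements now carry a generated summary as their text_representation.
for element in result["elements"]:
    if element["type"].lower() == "image":
        print(element.get("text_representation"))
```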
Deprecated; use text_mode instead. When set to True, causes DocParse to extract text using an OCR model. This is useful when the text is not directly extractable from the PDF, such as when the text is part of an image or when the text is rotated. When set to False, DocParse extracts embedded text from the input document. Default: False.
Deprecated; use table_mode instead. When set to True, enables DocParse to extract tables and their structural content using a purpose-built table extraction model. If set to False, tables are still identified but not analyzed for their structure; as a result, table cells and their bounding boxes are not included in the response. Default: True.
Only applies when extract_table_structure is True. Default: {}.
When True, DocParse will attempt to enhance the table structure by merging in tokens from text extraction. This can be useful for working with tables that have missing or misaligned text. Default: True.
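A sketch of passing table extraction options alongside structure extraction (keyword-argument forwarding by the aryn-sdk client is assumed):

```python
from aryn_sdk.partition import partition_file

# Merge tokens from text extraction into detected table structure.
table_opts = {"include_additional_text": True}

with open("financials.pdf", "rb") as f:
    result = partition_file(
        f,
        extract_table_structure=True,
        table_extraction_options=table_opts,
    )
```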
Default: "pixels > 500 -> deformable_detr; table_transformer"
, which means “if the largest dimension of the table
is more than 500 pixels, use deformable_detr; otherwise use table_transformer.” To use only deformable_detr or
table_transformer, set model_selection="deformable_detr"
or model_selection="table_transformer"
. Selection
expressions are of the form
if metric compares to threshold, then use model
statements. Statements are processed from left to right.
table_transformer
, which tends to do well with smaller tables, and deformable_detr
, which tends to do better with larger tables.pixels
, which corresponds to the maximum dimension of the bounding box containing the table (we find this to be easier to reason about than the total number of pixels which depends on two numbers), and chars
, which corresponds to the total number of characters within the table as determined by the OCR/text extraction step.<, >, <=, >=, ==, !=
.table_transformer
=> always use table transformerpixels > 500 -> deformable_detr; table_transformer
=> if the biggest dimension of the table is greater than 500 pixels use deformable detr. Otherwise use table_transformer.pixels>50->table_transformer; chars<30->deformable_detr;chars>35->table_transformer;pixels>2->deformable_detr;table_transformer;comment
=> if the biggest dimension is more than 50 pixels use table transformer. Else if the total number of chars in the table is less than 30 use deformable_detr. Else if there are mode than 35 chars use table transformer. Else if there are more than 2 pixels in the biggest dimension use deformable detr. Otherwise use table transformer. comment is not processed.Default: "english"
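A sketch of supplying a custom selection expression; per the table_mode description above, the custom mode is what makes DocParse honor model_selection. The expression shown is just the documented default written out explicitly, and the aryn-sdk keyword-argument path is assumed:

```python
from aryn_sdk.partition import partition_file

# Route large tables to deformable_detr and everything else to table_transformer.
table_opts = {
    "model_selection": "pixels > 500 -> deformable_detr; table_transformer",
}

with open("tables.pdf", "rb") as f:
    result = partition_file(
        f,
        table_mode="custom",  # "custom" tells DocParse to honor model_selection
        table_extraction_options=table_opts,
    )
```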
Default: "english" (English). The full list of supported languages can be found here.
Default: True.
Deprecated; use text_mode instead. Default: "standard", which uses the conventional classical OCR pipeline to process documents. The other option is vision, which uses a vision model for OCR. Note that vision is only available for PAYG users.
Default options are chosen when chunking_options is specified as {}.
Valid values are context_rich and maximize_within_limit. The recommended chunker is context_rich. Default: "context_rich", denoted as {'strategy': 'context_rich'}.
context_rich chunker: The goal of this strategy is to add context to evenly-sized chunks. This is most useful for retrieval-based GenAI applications. The context_rich chunker combines adjacent Section-header and Title elements into a new Section-header element. It merges elements into a chunk with its most recent Section-header. If the chunk would contain too many tokens, it starts a new chunk by copying the Section-header to the start of the new chunk and continues. The chunker merges elements on different pages, unless merge_across_pages is set to False.

maximize_within_limit chunker: The goal of the maximize_within_limit chunker is to make the chunks as large as possible. It merges elements into the most recently merged set of elements unless doing so would make its token count exceed max_tokens. In that case, it keeps the new element separate and starts merging subsequent elements into that one, following the same rule. It merges elements on different pages, unless merge_across_pages is set to False.

Default: 512.
Valid options are openai_tokenizer, character_tokenizer, and huggingface_tokenizer. Defaults to openai_tokenizer.
Default: {'model_name': 'text-embedding-3-small'}, which works with the OpenAI tokenizer.
openai_tokenizer: model_name accepts all models supported by OpenAI's tiktoken tokenizer. Default is "text-embedding-3-small".

HuggingFaceTokenizer: model_name accepts all huggingface tokenizers from the huggingface/tokenizers repo.

character_tokenizer does not take any options.

A boolean that, when True, causes the selected chunker to attempt to merge chunks across page boundaries. Default: True.
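As an illustrative sketch, a chunking configuration that swaps in a HuggingFace tokenizer; the specific model name is only an example, not a recommendation from this page, and the aryn-sdk keyword-argument path is assumed:

```python
from aryn_sdk.partition import partition_file

# Pack chunks as large as possible within the token budget, counting tokens
# with a HuggingFace tokenizer. The model name below is only an illustrative choice.
chunking = {
    "strategy": "maximize_within_limit",
    "max_tokens": 512,
    "tokenizer": "huggingface_tokenizer",
    "tokenizer_options": {"model_name": "sentence-transformers/all-MiniLM-L6-v2"},
    "merge_across_pages": False,  # keep chunks from spanning page breaks
}

with open("contract.pdf", "rb") as f:
    result = partition_file(f, chunking_options=chunking)
```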
When set to True, DocParse extracts images from the document and returns them in the format specified by extract_image_format. Default: False.
Valid options are ppm, png, or jpeg. In all cases, the result will be base64 encoded before being returned. Default: "ppm".
docparse_storage unless you have disabled data retention.
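A sketch of requesting extracted images and decoding them from base64; the lowercase "image" type label and the "binary_representation" field name are assumptions about the response schema, not something stated on this page:

```python
import base64
from aryn_sdk.partition import partition_file

with open("catalog.pdf", "rb") as f:
    result = partition_file(
        f,
        extract_images=True,
        extract_image_format="png",  # ppm (default), png, or jpeg
    )

# Image bytes come back base64 encoded.
for i, element in enumerate(result["elements"]):
    if element["type"].lower() == "image" and element.get("binary_representation"):
        with open(f"image_{i}.png", "wb") as out:
            out.write(base64.b64decode(element["binary_representation"]))
```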
Default: "auto"
, where the service uses a processing method to find the best prediction for each possible bounding box. This is the recommended setting. However, this can be overridden by specifying a numerical threshold between 0 and 1. If you specify a numerical threshold, only bounding boxes with confidence scores higher than the threshold will be returned (instead of using the processing method described above). A lower value will include more objects, but may have overlaps, while a higher value will reduce the number of overlaps, but may miss legitimate objects. If you do set the threshold manually, we recommend starting with a value of 0.32
.
Either the specific string auto or a float between 0.0 and 1.0, inclusive. This value specifies the cutoff for detecting bounding boxes. A lower value will include more objects, but may have overlaps, while a higher value will reduce the number of overlaps, but may miss legitimate objects. Default is auto (DocParse will choose optimal bounding boxes).
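For a manual cutoff, one might start at the recommended 0.32. The keyword name threshold is an assumption here, since this page does not name the parameter; the aryn-sdk keyword-argument path is assumed as before:

```python
from aryn_sdk.partition import partition_file

# A numeric value keeps only bounding boxes whose confidence exceeds it;
# 0.32 is the documented starting point for manual tuning.
with open("scan.pdf", "rb") as f:
    result = partition_file(f, threshold=0.32)

print(len(result["elements"]), "elements detected")
```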
Default: -1. Use -1 to process all pages at once.
Default: False.

Default: ["Section-header", "Caption"].

Default: False.

Default: {}.

Default: False.

Default: False.

Default: False.