
# Processing Options

> Learn about the parameters you can use with Aryn DocParse

There are several options you can specify when calling DocParse. For example, we can extract the table structure from
our document with the following curl command.

```bash theme={null}
export ARYN_API_KEY="PUT API KEY HERE"
curl -s -N -D headers "https://api.aryn.cloud/v1/document/partition" \
    -H "Authorization: Bearer $ARYN_API_KEY" -F "file=@document.pdf" \
    -F 'options={"table_mode": "standard"}' | tee document.json
```

All of the available options are listed below, and are optional unless specified otherwise.

## General

### pipeline

The pipeline is a string that overrides the parsing pipeline to allow the use of alternative parsing models. Currently the only supported values are `standard` and `vision`. `Default: "standard"`.

* `standard` will use the existing DocParse pipeline and honor all of the existing configuration parameters.
* `vision` will use a VLM, currently PaddleOCR-VL-1.5, to perform parsing and table extraction. When `vision` is selected, values for `threshold`, `text_mode`, `table_mode`, `text_extraction_options`, `table_extraction_options`, and `extract_images` are ignored.
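
As a minimal sketch, a client could assemble the `options` form field and flag any settings that the `vision` pipeline ignores. The helper name `build_options` is hypothetical and not part of the Aryn SDK; the list of ignored keys is taken from the description above.

```python
import json

# Keys that the description above says are ignored when pipeline == "vision".
IGNORED_BY_VISION = {
    "threshold", "text_mode", "table_mode",
    "text_extraction_options", "table_extraction_options", "extract_images",
}

def build_options(**options):
    """Return (json_payload, ignored_keys) for the `options` form field.

    Hypothetical helper for illustration only.
    """
    pipeline = options.get("pipeline", "standard")
    if pipeline not in ("standard", "vision"):
        raise ValueError(f"unsupported pipeline: {pipeline!r}")
    ignored = []
    if pipeline == "vision":
        ignored = sorted(IGNORED_BY_VISION.intersection(options))
    return json.dumps(options), ignored

payload, ignored = build_options(pipeline="vision", table_mode="standard", threshold=0.3)
print(payload)
print("ignored under vision:", ignored)
```

The returned payload is the string you would pass as the `options` form field; the second value simply lists settings that have no effect under `vision`.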

### text\_mode

A string that specifies the mode to use for text extraction. `Default: "auto"`. Valid options are
`auto`, `inline_fallback_to_ocr`, `ocr_standard`, and `ocr_vision`.

* `auto` intelligently uses the best combination of OCR and inline text.
* `inline_fallback_to_ocr` extracts the embedded text element by element when present, and falls back to performing OCR otherwise.
* `ocr_standard` uses the classical OCR pipeline.
* `ocr_vision` uses a vision model for OCR. Note that `ocr_vision` is only available for PAYG users.

### table\_mode

A string specifying the table structure extraction mode. `Default: "standard"`. Valid options are
`standard`, `vision`, `none`, and `custom`.

* `none` will not extract table structure (equivalent to `extract_table_structure = False`).
* `standard` will use the standard hybrid table structure extraction pipeline.
* `vision` will use a vision model to extract table structure. Note that `vision` is only available for PAYG users.
* `custom` will use the custom expression described by the `model_selection` parameter in `table_extraction_options`.

### property\_extraction\_options

This is a dictionary of options for extracting properties (key-value pairs) from documents such as invoices, purchase
orders, and contracts. You can either provide a `schema` object describing the properties you want DocParse to extract
from the document being processed (e.g. the total dollar amount in an invoice), or ask DocParse to suggest a schema
that you can then use to extract properties.

* `schema` is a list of `properties` each of which describes a specific occurrence of information appearing in the
  document.
* `suggest_properties` is a boolean option that tells DocParse to analyze the document and suggest an appropriate schema for it.
* `suggest_properties_instructions` is a `string` option that you can use to provide additional instructions to ensure
  that the schema DocParse infers from your document contains properties that you are interested in,
  e.g. "Make sure to include the full address of the vendor on the invoice".
* `voting` is a boolean option. When `True`, DocParse performs property extraction with three LLMs and returns a voted response.
  Additional metadata about all of the responses will be present in the `properties_metadata` field in the response. `Default: False`.

The `property_extraction_options` dictionary must contain exactly one of `schema` or `suggest_properties`.

**Note: `suggest_properties` only uses the first 20 pages of the processed document to infer the schema for the document.**
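
The "exactly one of" rule can be checked on the client before a request is sent. The sketch below is an assumption-laden illustration: `check_property_extraction_options` is a hypothetical helper, and it treats a key as "provided" whenever it is present in the dictionary, which may not match how the service interprets `suggest_properties: False`.

```python
import json

def check_property_extraction_options(opts: dict) -> dict:
    """Raise unless exactly one of `schema` / `suggest_properties` is present.

    Hypothetical client-side check; presence of the key counts as "provided".
    """
    present = [key for key in ("schema", "suggest_properties") if key in opts]
    if len(present) != 1:
        raise ValueError(
            "property_extraction_options must contain exactly one of "
            f"'schema' or 'suggest_properties', got: {present}"
        )
    return opts

options = {
    "property_extraction_options": check_property_extraction_options({
        "suggest_properties": True,
        "suggest_properties_instructions":
            "Make sure to include the full address of the vendor on the invoice",
    })
}
print(json.dumps(options))
```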

### chunking\_options

`chunking_options` is a dictionary of options for specifying chunking behavior. Chunking is only performed when this
option is present, and default options are chosen when `chunking_options` is specified as `{}`. Specify the chunking
strategy using the `strategy` key.

`strategy`: A string specifying the strategy to use to combine and split chunks. Valid values are `context_rich`
and `maximize_within_limit`. The default and recommended chunker is `context_rich`.

* Behavior of `context_rich` chunker: The goal of this strategy is to add context to evenly-sized chunks. This
  is most useful for retrieval based GenAI applications. The context\_rich chunking combines adjacent
  `Section-header` and `Title` elements into a new `Section-header` element. It merges elements into a chunk with
  its most recent `Section-header`. If the chunk would contain too many tokens, it starts a new chunk by copying
  the Section-header to the start of this new chunk and continues. The chunker merges elements on different
  pages, unless `merge_across_pages` is set to `False`.

* Behavior of `maximize_within_limit` chunker: The goal of the `maximize_within_limit` chunker is to make the
  chunks as large as possible. It merges each element into the most recently merged set of elements unless doing so
  would make that set's token count exceed `max_tokens`. In that case, it keeps the new element separate and starts
  merging subsequent elements into that one, following the same rule. It merges elements on different pages, unless
  the chunking option `merge_across_pages` is set to `False`.

### output\_format

A string controlling the output representation. Options are:

* `json` (default): yields an array called `elements` containing the partitioned elements, represented in JSON.
* `markdown`: the service response will include a field called `markdown` containing a string representing the entire document in Markdown format.
* `html`: the service response will include a field called `html` containing a string representing the entire document in HTML format.
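
A small sketch of reading the document body out of a parsed response, based only on the field names listed above (`elements`, `markdown`, `html`). The helper `document_body` and the sample response dictionaries are hypothetical; real responses contain additional fields.

```python
# Response field that carries the document for each output_format value.
FIELD_BY_FORMAT = {"json": "elements", "markdown": "markdown", "html": "html"}

def document_body(response: dict, output_format: str = "json"):
    """Return the part of the response that holds the document content."""
    if output_format not in FIELD_BY_FORMAT:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    return response[FIELD_BY_FORMAT[output_format]]

# Hypothetical, heavily trimmed responses for illustration only.
json_response = {"elements": [{"type": "Title", "text_representation": "Invoice"}]}
markdown_response = {"markdown": "# Invoice\n"}

print(document_body(json_response))
print(document_body(markdown_response, "markdown"))
```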

### summarize\_images (PAYG Only)

A boolean that, when `True`, generates a summary of the images in the document and returns it as the
`text_representation`. When `False`, images are not summarized. `Default: False`. `summarize_images` is only
available for Pay-As-You-Go (PAYG) users.

### use\_ocr (Deprecated)

Use `text_mode` instead.<br />A boolean value that, when set to `True`, causes DocParse to extract text using an OCR model.
This is useful when the text is not directly extractable from the PDF, such as when the text is part of an image or
when the text is rotated. When set to `False`, DocParse extracts embedded text from the input document. `Default: False`.

### extract\_table\_structure (Deprecated)

Use `table_mode` instead.<br />A boolean that, when `True`, enables DocParse to extract tables and their structural
content using a purpose built table extraction model. If set to `False`, tables are still identified but not analyzed
for their structure; as a result, table cells and their bounding boxes are not included in the response. `Default: True`.

## Tables

### table\_extraction\_options

A map with string keys specifying options for table extraction. Only applied when `extract_table_structure` is `True`.
`Default: {}`.

#### table\_extraction\_options.include\_additional\_text

Boolean. When `True`, DocParse will attempt to enhance the table structure by merging in tokens from text extraction.
This can be useful for working with tables that have missing or misaligned text. `Default: True`.

#### table\_extraction\_options.model\_selection

String. An expression to instruct DocParse how to select the table model to use for extraction.

`Default: "pixels > 500 -> deformable_detr; table_transformer"`, which means "if the largest dimension of the table
is more than 500 pixels, use deformable\_detr; otherwise use table\_transformer." To use only deformable\_detr or
table\_transformer, set `model_selection="deformable_detr"` or `model_selection="table_transformer"`. Selection
expressions are of the form

```
metric cmp threshold -> model; metric cmp threshold -> model; model
```

And should be read as a series of `if metric compares to threshold, then use model` statements. Statements are processed from left to right.

* Supported models are `table_transformer`, which tends to do well with smaller tables, and `deformable_detr`, which tends to do better with larger tables.
* Supported metrics are `pixels`, which corresponds to the maximum dimension of the bounding box containing the table (we find this to be easier to reason about than the total number of pixels which depends on two numbers), and `chars`, which corresponds to the total number of characters within the table as determined by the OCR/text extraction step.
* Thresholds must be numeric.
* Supported comparison operators are `<, >, <=, >=, ==, !=`.

A statement with no metric, comparison, and threshold acts as a default: any statements after it are not processed. If no such 'unconditional' statement is included and no conditions match, DocParse will default to table\_transformer.

Examples:

* `table_transformer` => always use table transformer
* `pixels > 500 -> deformable_detr; table_transformer` => if the biggest dimension of the table is greater than 500 pixels use deformable detr. Otherwise use table\_transformer.
* `pixels>50->table_transformer; chars<30->deformable_detr;chars>35->table_transformer;pixels>2->deformable_detr;table_transformer;comment` => if the biggest dimension is more than 50 pixels use table\_transformer. Else if the total number of chars in the table is less than 30 use deformable\_detr. Else if there are more than 35 chars use table\_transformer. Else if there are more than 2 pixels in the biggest dimension use deformable\_detr. Otherwise use table\_transformer. `comment` is not processed.
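
To make the left-to-right evaluation concrete, here is a small interpreter for selection expressions written from the rules above. It is a sketch for reasoning about expressions locally, not DocParse's implementation; the function name `select_model` and its error handling are assumptions.

```python
import operator

# Two-character operators are listed first so ">=" is not mistaken for ">".
OPERATORS = {
    "<=": operator.le, ">=": operator.ge, "==": operator.eq,
    "!=": operator.ne, "<": operator.lt, ">": operator.gt,
}
MODELS = {"table_transformer", "deformable_detr"}

def select_model(expression: str, pixels: float, chars: float) -> str:
    """Evaluate a model_selection expression left to right."""
    metrics = {"pixels": pixels, "chars": chars}
    for statement in expression.split(";"):
        statement = statement.strip()
        if not statement:
            continue
        if "->" not in statement:
            # Unconditional statement: acts as the default, so nothing
            # after it is examined.
            if statement not in MODELS:
                raise ValueError(f"unknown model: {statement!r}")
            return statement
        condition, model = (part.strip() for part in statement.rsplit("->", 1))
        if model not in MODELS:
            raise ValueError(f"unknown model: {model!r}")
        symbol = next((s for s in OPERATORS if s in condition), None)
        if symbol is None:
            raise ValueError(f"no comparison operator in: {condition!r}")
        metric, threshold = (part.strip() for part in condition.split(symbol, 1))
        if metric not in metrics:
            raise ValueError(f"unknown metric: {metric!r}")
        if OPERATORS[symbol](metrics[metric], float(threshold)):
            return model
    # No condition matched and there was no unconditional statement.
    return "table_transformer"

default = "pixels > 500 -> deformable_detr; table_transformer"
print(select_model(default, pixels=800, chars=120))
print(select_model(default, pixels=300, chars=120))
```

With the default expression, a table whose largest dimension is 800 pixels selects `deformable_detr`, while a 300-pixel table falls through to `table_transformer`.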

## Text

### ocr\_language

A string that specifies the language model to use for OCR. `Default: "english"` (English). The full list of
supported languages can be found [here](./formats_supported).

### text\_extraction\_options

A map with string keys specifying options for text extraction.

#### text\_extraction\_options.remove\_line\_breaks

A boolean that specifies whether to remove line breaks from the text. `Default: True`.

#### text\_extraction\_options.ocr\_text\_mode (Deprecated)

Use `text_mode` instead.<br />A string that specifies the mode to use for OCR text extraction. `Default: "standard"`, which uses the classical OCR pipeline to process documents. The other option is `vision`, which uses a vision model for OCR. Note that `vision` is only available for PAYG users.

## Property extraction

### schema

A schema consists of properties that map to specific parts of interest in a document such as names, places, quantities, etc.
Often, documents of the same type would have a common schema which can be shared when processing those documents for property extraction.
A property is a dictionary of `name` and `type`.

* `name` is the name of a property, e.g. "address", "first\_name".
* `type` is a dictionary of key value pairs describing the property which consists of:
  * `type`: the type of the property.  A simple type can be "int", "float", "date", "string", "bool", "choice".  A nested type of "array" can consist of properties of simple types.
  * `description`: a description of the property.
  * `examples`: an array of examples.
  * `is_derived`: ???
  * `validators`: a validator is a way to perform a data quality check on the extracted value.  You can use a regex to ensure a string property conforms to a specific format, e.g. "(123) 456 - 7890".

#### Example 1: string with a regex validator

```json theme={null}
{
    "name": "state",
    "type": {
        "type": "string",
        "description": "State where the property is located (two-letter code)",
        "examples": ["IL", "TX", "CA"],
        "validators": [
          {
            "type": "regex",
            "regex": "^[A-Z]{2}$"
          }
        ]
    }
}
```
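
Before sending a schema, it can help to confirm locally that a property's own `examples` satisfy its regex validators. The sketch below uses only Python's `re` module; `failing_examples` is a hypothetical helper, and how DocParse applies validators on the server side is not described here.

```python
import re

state_property = {
    "name": "state",
    "type": {
        "type": "string",
        "description": "State where the property is located (two-letter code)",
        "examples": ["IL", "TX", "CA"],
        "validators": [{"type": "regex", "regex": "^[A-Z]{2}$"}],
    },
}

def failing_examples(prop: dict) -> list:
    """Return the examples that do not match every regex validator."""
    spec = prop["type"]
    patterns = [
        re.compile(validator["regex"])
        for validator in spec.get("validators", [])
        if validator.get("type") == "regex"
    ]
    return [
        example for example in spec.get("examples", [])
        if not all(pattern.search(str(example)) for pattern in patterns)
    ]

print(failing_examples(state_property))  # → []
```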

#### Example 2: int

```json theme={null}
{
      "name": "number_of_units",
      "type": {
        "type": "int",
        "description": "Total number of rental units (for multifamily properties), if not applicable, return null",
        "examples": [150, 85, 300]
      }
}
```

#### Example 3: choice

```json theme={null}
{
      "name": "offering_type",
      "type": {
        "type": "choice",
        "description": "Type of securities offering",
        "examples": ["Regulation D", "Private Placement", "506(b)", "506(c)"],
        "choices": ["Regulation D", "Private Placement", "506(b)", "506(c)", "Regulation A", "Other"]
      }
}
```

#### Example 4: array

```json theme={null}
{
      "name": "line_items",
      "type": {
        "type": "array",
        "description": "Array of individual line items on the invoice",
        "item_type": {
          "type": "object",
          "properties": [
            {
              "name": "description",
              "type": {
                "type": "string",
                "description": "Description of the product or service",
                "examples": ["Software Development - 40 hours", "Office Supplies", "Consulting Services"]
              }
            },
            {
              "name": "quantity",
              "type": {
                "type": "float",
                "description": "Quantity of items or hours",
                "examples": [1.0, 40.0, 2.5]
              }
            }
          ]
        }
      }
}
```

## Chunking

### chunking\_options

A dictionary of options for specifying chunking behavior. Chunking is only performed when this option is present, and default options are chosen when `chunking_options` is specified as `{}`.

#### chunking\_options.strategy

A string specifying the strategy to use to combine and split chunks. Valid values are `context_rich` and `maximize_within_limit`. `Default: "context_rich"`, which is also the recommended chunker; specify it as `{'strategy': 'context_rich'}`.

* Behavior of `context_rich` chunker: The goal of this strategy is to add context to evenly-sized chunks. This is most useful for retrieval based GenAI applications. The context\_rich chunking combines adjacent `Section-header` and `Title` elements into a new `Section-header` element. It merges elements into a chunk with its most recent `Section-header`. If the chunk would contain too many tokens, it starts a new chunk by copying the Section-header to the start of this new chunk and continues. The chunker merges elements on different pages, unless `merge_across_pages` is set to `False`.
* Behavior of `maximize_within_limit` chunker: The goal of the `maximize_within_limit` chunker is to make the chunks as large as possible. It merges each element into the most recently merged set of elements unless doing so would make that set's token count exceed `max_tokens`. In that case, it keeps the new element separate and starts merging subsequent elements into that one, following the same rule. It merges elements on different pages, unless `merge_across_pages` is set to `False`.
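
The greedy rule behind `maximize_within_limit` can be sketched in a few lines. This is an illustration of the described behavior, not the service's code: it counts characters as tokens by default (an assumption standing in for a real tokenizer) and ignores page boundaries and elements that are themselves larger than `max_tokens`.

```python
def maximize_within_limit(elements, max_tokens=512, count_tokens=len):
    """Greedily merge element texts into chunks of at most max_tokens.

    Returns a list of chunks, each a list of the element texts it contains.
    """
    chunks = []
    sizes = []
    for text in elements:
        tokens = count_tokens(text)
        if chunks and sizes[-1] + tokens <= max_tokens:
            # Still fits: merge into the most recently merged set.
            chunks[-1].append(text)
            sizes[-1] += tokens
        else:
            # Would exceed the limit: keep the element separate and
            # merge subsequent elements into this new set.
            chunks.append([text])
            sizes.append(tokens)
    return chunks

print(maximize_within_limit(["aaaa", "bbb", "cccccc", "dd"], max_tokens=8))
# → [['aaaa', 'bbb'], ['cccccc', 'dd']]
```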

#### chunking\_options.max\_tokens

An integer specifying the cutoff for splitting chunks that are too large. `Default: 512`.

#### chunking\_options.tokenizer

A string specifying the tokenizer to use when determining how characters in a chunk are grouped. Valid values are `openai_tokenizer`, `character_tokenizer`, and `huggingface_tokenizer`. Defaults to `openai_tokenizer`.

#### chunking\_options.tokenizer\_options

A tree with string keys specifying the options for the chosen tokenizer. Defaults to `{'model_name': 'text-embedding-3-small'}`, which works with the OpenAI tokenizer.

* Available options for `openai_tokenizer`:
  * `model_name`: Accepts all models supported by OpenAI's [tiktoken tokenizer](https://github.com/openai/tiktoken). Default is "text-embedding-3-small".
* Available options for `huggingface_tokenizer`:
  * `model_name`: Accepts all huggingface tokenizers from the [huggingface/tokenizers repo](https://github.com/huggingface/tokenizers).
* `character_tokenizer` does not take any options.

#### chunking\_options.merge\_across\_pages

A boolean that, when `True`, causes the selected chunker to attempt to merge chunks across page boundaries. `Default: True`.

## Images

### extract\_images

A boolean that determines whether to extract images from the document. The format is determined by the value of `image_extraction_options.extract_image_format`. `Default: False`.

### image\_extraction\_options

A dictionary of options for specifying image extraction behavior.

#### image\_extraction\_options.associate\_captions

A boolean that specifies whether to associate captions with the images. `Default: False`.

#### image\_extraction\_options.extract\_image\_format

A string indicating the format in which extracted images should be returned. Must be one of `ppm`, `png`, or `jpeg`. In all cases, the result will be base64 encoded before being returned. `Default: "ppm"`. Specifying `extract_image_format` outside of `image_extraction_options` is deprecated.

## Advanced

### add\_to\_docset\_id

A string that specifies the DocSet ID to store your parsed document in. By default, DocParse will use the DocSet named `docparse_storage` unless you have disabled data retention.

### threshold

Either the string `auto` or a `float` between `0.0` and `1.0`, inclusive. This value specifies the cutoff for accepting the model's predicted bounding boxes. `Default: "auto"`, where the service uses a processing method to find the best prediction for each possible bounding box. This is the recommended setting. If you specify a numerical threshold instead, only bounding boxes with confidence scores higher than the threshold will be returned, and the processing method used for `auto` is not applied. A lower value will include more objects, but may have overlaps, while a higher value will reduce the number of overlaps, but may miss legitimate objects. If you do set the threshold manually, we recommend starting with a value of `0.32`.
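
The effect of a numeric threshold can be pictured as a simple confidence filter. The sketch below uses made-up detections and a hypothetical `confidence` field name purely for illustration; it does not model the `auto` setting.

```python
def filter_boxes(boxes, threshold):
    """Keep boxes whose confidence is strictly higher than the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return [box for box in boxes if box["confidence"] > threshold]

# Hypothetical detections, for illustration only.
boxes = [
    {"type": "Text", "confidence": 0.91},
    {"type": "Table", "confidence": 0.45},
    {"type": "Image", "confidence": 0.25},
]

print(len(filter_boxes(boxes, 0.2)))   # → 3
print(len(filter_boxes(boxes, 0.32)))  # → 2
print(len(filter_boxes(boxes, 0.9)))   # → 1
```

Lowering the threshold keeps more candidate boxes, which is where overlaps can appear; raising it drops lower-confidence boxes, some of which may be legitimate.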

### pages\_per\_call

This is only available when using the Partition function in Sycamore. This option divides the processing of your document into batches of pages, and you specify the size of each batch (number of pages). This is useful when running OCR on large documents. `Default: -1`. Use `-1` to process all pages at once.
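
As a rough picture of what batching by `pages_per_call` means, the sketch below splits a page count into inclusive, 1-based page ranges. The helper `page_batches` is hypothetical; Sycamore's actual batching logic may differ in its details.

```python
def page_batches(num_pages: int, pages_per_call: int = -1):
    """Return inclusive (first_page, last_page) ranges, 1-based."""
    if pages_per_call == -1:
        # -1 means all pages are processed in a single call.
        return [(1, num_pages)]
    if pages_per_call < 1:
        raise ValueError("pages_per_call must be -1 or a positive integer")
    return [
        (start, min(start + pages_per_call - 1, num_pages))
        for start in range(1, num_pages + 1, pages_per_call)
    ]

print(page_batches(10))     # → [(1, 10)]
print(page_batches(10, 4))  # → [(1, 4), (5, 8), (9, 10)]
```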

### output\_label\_options

A dictionary of options to specify which heuristic to apply to enforce certain label outputs. If this option is not specified, no heuristic is applied. The options the dictionary supports are listed below.

#### output\_label\_options.promote\_title

A boolean that specifies whether to promote an element to title if there's no title in the output. `Default: False`.

#### output\_label\_options.title\_candidate\_elements

A list of strings that are candidate elements to be promoted to title. `Default: ["Section-header", "Caption"]`.

#### output\_label\_options.orientation\_correction

A boolean value that specifies whether to correct the orientation of rotated pages during the preprocessing step. `Default: False`.

### markdown\_options

A dictionary of options to specify what to include in the markdown output. `Default: {}`.

#### markdown\_options.include\_pagenum

A boolean that specifies whether to include page numbers in the markdown output. `Default: False`.

#### markdown\_options.include\_headers

A boolean that specifies whether to include headers in the markdown output. `Default: False`.

#### markdown\_options.include\_footers

A boolean that specifies whether to include footers in the markdown output. `Default: False`.

#### markdown\_options.group\_by\_page

A boolean that specifies whether to group markdown output by page. If set to `True`, markdown is returned as an array of pages of markdown. `Default: False`.

# Examples

Here are examples of how you can combine several of these options in a curl command or in Python code with the [Aryn SDK](./aryn_sdk).

```bash curl theme={null}
export ARYN_API_KEY="PUT API KEY HERE"
curl -s -N -D headers "https://api.aryn.cloud/v1/document/partition" -H "Authorization: Bearer $ARYN_API_KEY" -F "file=@document.pdf" -F 'options={"text_mode": "ocr_standard", "table_mode": "standard", "threshold": 0.2}' | tee document.json
```

```python aryn_sdk.py theme={null}
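import os
import json
from aryn_sdk.partition import partition_file

# Make the API key available through the environment.
os.environ["ARYN_API_KEY"] = "PUT API KEY HERE"

# Partition the document with classical OCR, standard table extraction,
# and a manually chosen bounding-box threshold.
with open("document.pdf", "rb") as f:
    ans = partition_file(
        file=f,
        text_mode="ocr_standard",
        table_mode="standard",
        threshold=0.2
    )

# Save the response as JSON.
with open("document.json", "w") as f:
    json.dump(ans, f)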
```
