# Partition

> Documentation for Aryn SDK Partition

Please find the documentation for the Aryn SDK Partition module below. All parameters are optional unless specified otherwise.

## Synchronous Partitioning Functions

### partition\_file

Sends a file to Aryn DocParse and returns a Python dictionary with elements containing its document structure and text.

<AccordionGroup>
  <Accordion title="Parameters">
    * `file`: Required. A `file` opened in binary mode to parse or a path expressed as a `str` or a `PathLike` object
      specifying where the file to parse is.

    * `pipeline`: A string that specifies the parsing pipeline to use. Default: `standard`
      * `standard` will use the existing DocParse pipeline and honor all of the existing configuration parameters.
      * `vision` will use a VLM (currently PaddleOCR-VL-1.5) to perform parsing and table extraction. When `vision` is selected, values for `threshold`, `text_mode`, `table_mode`, `text_extraction_options`, `table_extraction_options`, and `extract_images` will be ignored.

    * `threshold`: Either a `float` between `0.0` and `1.0`, inclusive, that serves as a cutoff for which bounding boxes are returned, or the string `auto` (the default), in which case the service uses a processing method to find the best prediction for each possible bounding box.
      When a number is given, only bounding boxes that the model predicts with a confidence score higher than the threshold will be returned. A lower value will
      include more objects but may have overlaps, while a higher value will reduce the number of overlaps but may miss legitimate objects. If you do
      set the threshold manually, we recommend starting with a value of `0.32`.

    * `text_mode`: A string that specifies the mode to use for text extraction. The default is `auto`.
      * `auto` intelligently uses the best combination of OCR and embedded text.
      * `inline_fallback_to_ocr` tries to extract the embedded text elementwise and falls back to performing OCR otherwise.
      * `ocr_standard` uses the classical OCR pipeline.
      * `ocr_vision` uses a vision model for OCR. Note that `ocr_vision` is available for PAYG users only.

    * `table_mode`: A string that specifies the mode to use for table structure extraction. Default: `standard`.
      * `none` will not extract table structure (equivalent to `extract_table_structure = False`)
      * `standard` will use the standard hybrid table structure extraction pipeline
      * `vision` will use a vision model to extract table structure
      * `custom` will use the custom expression described by the model\_selection parameter in the table\_extraction\_options

    * `text_extraction_options`: A map with string keys specifying options for text extraction.
      * `ocr_text_mode` (deprecated): A string that specifies the mode to use for OCR text extraction. The default is `standard`, which uses the conventional classical OCR pipeline to process documents. The other option is `vision`, which uses a vision model for OCR. Note that `vision` is available for PAYG users only.
      * `remove_line_breaks`: A boolean that specifies whether to remove line breaks from the text. Default is `True`.

    * `property_extraction_options`: A dictionary of options for extracting properties (key-value pairs) from documents such as invoices, purchase orders, contracts, etc. Currently, the only option allowed is `schema` which describes properties in the document being processed that you want DocParse to extract, e.g. the total dollar amount in an invoice.
      * `schema`: A list of `properties` each of which describes a specific occurrence of information appearing in the document. Each property is a dictionary with `name` and `type` keys.
        * `name`: The name of a property, e.g. "address", "first\_name".
        * `type`: A dictionary describing the property which consists of:
          * `type`: The type of the property. Simple types can be "int", "float", "date", "object", "string", "bool", "choice". A nested type of "array" can consist of properties of simple types.
          * `description`: A description of the property.
          * `examples`: An array of examples.
      * `voting`: Boolean. When `True`, DocParse will use three different LLMs from three different providers to improve the quality and reliability of property extraction.

    * `table_extraction_options`: A map with string keys specifying options for table extraction. Only applied when `extract_table_structure` is `True`. Default is empty (`{}`).
      * `include_additional_text`: Boolean. When `True`, DocParse will attempt to enhance the table structure by merging in tokens from text extraction. This can be useful for working with tables that have missing or misaligned text. Default is `True`.
      * `model_selection`: String. An expression to instruct DocParse how to select the table model to use for extraction. Default is `"pixels > 500 -> deformable_detr; table_transformer"`, which means "if the largest dimension of the table is more than 500 pixels, use deformable\_detr; otherwise use table\_transformer." To use only deformable\_detr or table\_transformer, set `model_selection="deformable_detr"` or `model_selection="table_transformer"`. Selection expressions are of the form

        ```
        metric cmp threshold -> model; metric cmp threshold -> model; model
        ```

        And should be read as a series of `if metric compares to threshold, then use model` statements. Statements are processed from left to right.

        * Supported models are `table_transformer`, which tends to do well with smaller tables, and `deformable_detr`, which tends to do better with larger tables.
        * Supported metrics are `pixels`, which corresponds to the maximum dimension of the bounding box containing the table (we find this to be easier to reason about than the total number of pixels which depends on two numbers), and `chars`, which corresponds to the total number of characters within the table as determined by the OCR/text extraction step.
        * Thresholds must be numeric.
        * Supported comparison operators are `<, >, <=, >=, ==, !=`.

        A statement with no metric, comparison, and threshold acts as a default: any statements after it are not processed. If no such 'unconditional' statement is included and no conditions match, DocParse will default to table\_transformer.

        Examples:

        * `table_transformer` => always use table transformer
        * `pixels > 500 -> deformable_detr; table_transformer` => if the biggest dimension of the table is greater than 500 pixels use deformable detr. Otherwise use table\_transformer.
        * `pixels>50->table_transformer; chars<30->deformable_detr;chars>35->table_transformer;pixels>2->deformable_detr;table_transformer;comment` => if the biggest dimension is more than 50 pixels use table transformer. Else if the total number of chars in the table is less than 30 use deformable\_detr. Else if there are more than 35 chars use table transformer. Else if there are more than 2 pixels in the biggest dimension use deformable detr. Otherwise use table transformer. `comment` is not processed.

    * `extract_images`: A boolean that determines whether to extract images from the document. The format is determined by the value of `extract_image_format`. Default: `False`.

    * `image_extraction_options`: A map with string keys specifying options for image extraction.
      * `associate_captions`: A boolean that specifies whether to associate captions with the images. Default is `False`.
      * `extract_image_format`: A string indicating in which format extracted images should be returned. Must be one of `ppm`, `png`, or `jpeg`. In all cases, the result will be base64 encoded before being returned. Default: `ppm`.

    * `summarize_images`: (PAYG Only) A boolean that, when `True`, generates a summary of the images in the document and returns it as the `text_representation`. When `False`, images are not summarized. Default is `False`.

    * `selected_pages`: A list specifying individual pages (1-indexed) and page ranges from the document to partition.
      Single pages are specified as integers and ranges are specified as lists with two integer entries in ascending order. A
      valid example value for selected\_pages is `[1, 10, [15, 20]]` which would include pages 1, 10, 15, 16, 17 ..., 20.
      `selected_pages` is `None` by default, which results in all pages of the document being parsed.

    * `chunking_options`: A dictionary of options for specifying chunking behavior. Chunking is only performed when this
      option is present, and default options are chosen when `chunking_options` is specified as `{}`.
      * `strategy`: A string specifying the strategy to use to combine and split chunks. Valid values are `context_rich`
        and `maximize_within_limit`. The default and recommended chunker is `context_rich` as
        `{'strategy': 'context_rich'}`.
        * Behavior of `context_rich` chunker: The goal of this strategy is to add context to evenly-sized chunks. This
          is most useful for retrieval based GenAI applications. Context\_rich chunking combines adjacent `section-header`
          and `title` elements into a new `section-header` element. Merges elements into a chunk with its most recent
          `section-header`. If the chunk would contain too many tokens, then it starts a new chunk copying the
          section-header to the start of this new chunk and continues. Merges elements on different pages, unless
          `merge_across_pages` is set to `False`.
        * Behavior of `maximize_within_limit` chunker: The goal of the `maximize_within_limit` chunker is to make the
          chunks as large as possible. Merges elements into the most recently merged set of elements unless doing so
          would make its token count exceed `max_tokens`. In that case, it would keep the new element separate and start
          merging subsequent elements into that one, following the same rule. Merges elements on different pages, unless
          `merge_across_pages` is set to `False`.
      * `max_tokens`: An integer specifying the cutoff for splitting chunks that are too large. Default value is 512.
      * `tokenizer`: A string specifying the tokenizer to use when determining how characters in a chunk are grouped.
        Valid values are `openai_tokenizer`, `character_tokenizer`, and `huggingface_tokenizer`. Defaults to `openai_tokenizer`.
      * `tokenizer_options`: A nested dictionary with string keys specifying the options for the chosen tokenizer. Defaults to
        `{'model_name': 'text-embedding-3-small'}`, which works with the OpenAI tokenizer.
        * Available options for `openai_tokenizer`:
          * `model_name`: Accepts all models supported by OpenAI's
            [tiktoken tokenizer](https://github.com/openai/tiktoken). Default is "text-embedding-3-small"
        * Available options for `huggingface_tokenizer`:
          * `model_name`: Accepts all huggingface tokenizers from the
            [huggingface/tokenizers repo](https://github.com/huggingface/tokenizers).
        * `character_tokenizer` does not take any options.
      * `merge_across_pages`: A `boolean` that, when `True`, makes the selected chunker attempt to merge chunks across page
        boundaries. Does not apply to the `mixed_multi_column` merger, which never merges across pages. Defaults to
        `True`.

    * `output_format`: A string controlling the output representation. Options are:
      * `json` (default): yields an array called `elements` containing the partitioned elements, represented in JSON.
      * `markdown`: the service response will include a field called `markdown` containing a string representing the entire document in Markdown format.
      * `html`: the service response will include a field called `html` containing a string representing the entire document in HTML format.

    * `output_label_options`: A dictionary of options to specify which heuristic to apply to enforce certain label outputs. If this option is not specified, no heuristic is applied. The options the dictionary supports are listed below.
      * `promote_title`: A boolean that specifies whether to promote an element to title if there's no title in the output.
      * `title_candidate_elements`: A list of strings that are candidate elements to be promoted to title.
      * `orientation_correction`: A boolean value that specifies whether to correct the orientation of rotated pages during the preprocessing step.

    * `markdown_options`: A dictionary of options to specify what to include in the markdown output.
      * `include_pagenum`: A boolean that specifies whether to include page numbers in the markdown output. Default is `False`.
      * `include_headers`: A boolean that specifies whether to include headers in the markdown output. Default is `False`.
      * `include_footers`: A boolean that specifies whether to include footers in the markdown output. Default is `False`.

    * `ssl_verify`: A `boolean` that controls whether the client verifies the SSL certificate of the chosen DocParse
      server. ssl\_verify is `True` by default, enforcing SSL verification.

    * `aryn_config`: An ArynConfig object (defined in
      [aryn\_sdk/config.py](https://github.com/aryn-ai/sycamore/blob/main/lib/aryn-sdk/aryn_sdk/config.py)), used for finding
      an api key. If aryn\_api\_key is set it will override this. The default ArynConfig looks in the env var `ARYN_API_KEY`
      and then in the file \~/.aryn/config.yaml. Default is None (aryn-sdk will look in the aryn\_api\_key parameter, in your
      environment variables, and then in \~/.aryn/config.yaml).

    * `aryn_api_key`: An Aryn API key, provided as a string. You can get one for free at [aryn.ai/get-started](https://www.aryn.ai/get-started). Default is `None` (If not provided, the sdk will check for it in the environment variable `ARYN_API_KEY` or will look in aryn\_config as specified above).

    * `region`: A string that specifies the region to use for the DocParse server. Valid values are `US` and `None`. Default is `None`, which uses the US region.

    * `extract_image_format` (deprecated as a top-level parameter; set it in `image_extraction_options` instead): A string indicating in which format extracted images should be returned. Must be one of `ppm`, `png`, or `jpeg`. In all cases, the result will be base64 encoded before being returned. Default: `ppm`.

    * `use_ocr` (deprecated): Use `text_mode` instead. A boolean value that, when set to `True`, causes DocParse to extract text using an OCR model. This is
      useful when the text is not directly extractable from the PDF, such as when the text is part of an image or when the
      text is rotated. When set to `False`, DocParse extracts embedded text from the input document. Default is `False`.

    * `extract_table_structure` (deprecated): Use `table_mode` instead. A boolean that, when `True`, enables DocParse to extract tables and their structural
      content using a purpose built table extraction model. If set to `False`, tables are still identified but
      not analyzed for their structure; as a result, table cells and their bounding boxes are not included in the response.
      Default is `True`.
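
    The `model_selection` grammar described under `table_extraction_options` may be easier to follow as code. The sketch below is a hypothetical re-implementation written only to illustrate the left-to-right evaluation rules; `choose_table_model` is not part of the Aryn SDK, and DocParse's own parser may differ in details such as whitespace handling or error reporting.

    ```python
    import operator
    import re

    # Comparison operators listed in the documentation above.
    _OPS = {
        "<": operator.lt,
        ">": operator.gt,
        "<=": operator.le,
        ">=": operator.ge,
        "==": operator.eq,
        "!=": operator.ne,
    }

    # One conditional statement: "metric cmp threshold -> model".
    _STATEMENT = re.compile(
        r"^(pixels|chars)\s*(<=|>=|==|!=|<|>)\s*(\d+(?:\.\d+)?)\s*->\s*(\w+)$"
    )


    def choose_table_model(expression: str, pixels: float, chars: float) -> str:
        """Hypothetical helper: pick a model name from a selection expression."""
        metrics = {"pixels": pixels, "chars": chars}
        for statement in expression.split(";"):
            statement = statement.strip()
            if not statement:
                continue  # tolerate a trailing semicolon
            match = _STATEMENT.match(statement)
            if match is None:
                # No metric, comparison, or threshold: an unconditional default.
                # Anything after it is never reached.
                return statement
            metric, op, threshold, model = match.groups()
            if _OPS[op](metrics[metric], float(threshold)):
                return model
        # No condition matched and no default was given.
        return "table_transformer"


    default_expr = "pixels > 500 -> deformable_detr; table_transformer"
    print(choose_table_model(default_expr, pixels=640, chars=120))  # prints deformable_detr
    ```

    Walking an expression through a helper like this before sending it can catch ordering mistakes, since the first matching statement wins.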
  </Accordion>

  <Accordion title="Example">
    ```python theme={null}
    from aryn_sdk.partition import partition_file

    with open("my-favorite-pdf.pdf", "rb") as f:
        data = partition_file(
            f,
            aryn_api_key="MY-ARYN-API-KEY",
            text_mode="ocr_standard",
            table_mode="standard",
            extract_images=True
        )
    elements = data['elements']
    ```
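
    Several parameters take nested values. The sketch below makes no network calls and shows one way to assemble them before calling `partition_file`. `expand_selected_pages` and `build_partition_kwargs` are hypothetical helpers, not part of the Aryn SDK, and the `invoice_total` property is invented for illustration.

    ```python
    from typing import Any, Dict, List, Union

    PageSpec = Union[int, List[int]]


    def expand_selected_pages(selected_pages: List[PageSpec]) -> List[int]:
        """Hypothetical helper: list the 1-indexed pages a selection covers."""
        pages: List[int] = []
        for entry in selected_pages:
            if isinstance(entry, int):
                pages.append(entry)
            else:
                start, end = entry  # a range is [start, end], both included
                if start > end:
                    raise ValueError(f"range {entry} must be in ascending order")
                pages.extend(range(start, end + 1))
        return pages


    def build_partition_kwargs() -> Dict[str, Any]:
        """Keyword arguments that could be passed as partition_file(f, **kwargs)."""
        return {
            "text_mode": "auto",
            "table_mode": "standard",
            "selected_pages": [1, 10, [15, 20]],
            # Passing {} instead would enable chunking with default options.
            "chunking_options": {"strategy": "context_rich", "max_tokens": 512},
            "property_extraction_options": {
                "schema": [
                    {
                        "name": "invoice_total",  # invented property name
                        "type": {
                            "type": "float",
                            "description": "Total dollar amount of the invoice",
                            "examples": [1250.0],
                        },
                    }
                ]
            },
        }


    kwargs = build_partition_kwargs()
    print(expand_selected_pages(kwargs["selected_pages"]))
    # prints [1, 10, 15, 16, 17, 18, 19, 20]
    ```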
  </Accordion>

  <Accordion title="Return Value">
    A dictionary containing keys `status` and `elements`. If output\_format is `markdown`, returns a dictionary of `status` and `markdown`. If output\_format is `html`, returns a dictionary of `status` and `html`.
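
    Because the content key depends on `output_format`, code that handles more than one format may want a small accessor. This is a minimal sketch using a stand-in dictionary rather than a real service response; `extract_payload` is a hypothetical helper, and the `status` value shown is only a placeholder.

    ```python
    from typing import Any, Dict, List, Union


    def extract_payload(data: Dict[str, Any]) -> Union[List[Dict[str, Any]], str]:
        """Hypothetical helper: return whichever content key the response holds."""
        for key in ("elements", "markdown", "html"):
            if key in data:
                return data[key]
        raise KeyError("response has none of: elements, markdown, html")


    # Stand-in for a response requested with output_format="markdown".
    sample = {"status": None, "markdown": "# Title\n\nBody text"}
    print(extract_payload(sample))
    ```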
  </Accordion>

  <Accordion title="Exceptions">
    User errors:

    * `HTTPError: Error:status_code 403`. Reason: `"This async action requires you to upgrade your account plan"`
      * Fix: Please upgrade your account [here](https://console.aryn.ai/billing)
    * `HTTPError: Error:status_code 403`. Reason: `"No Aryn API Key provided"`
      * Fix: Please provide an API key either as a parameter or specify it in the environment variable `ARYN_API_KEY`.
    * `HTTPError: Error:status_code 403`. Reason: `"Invalid Aryn API key"`
      * Fix: Please provide a valid API key either as a parameter or specify it in the environment variable `ARYN_API_KEY`.
    * `HTTPError: Error:status_code 403`. Reason: `"Expired Aryn API key"`
      * Fix: Please get a new API key [here](https://console.aryn.ai/api-keys).

    Other errors:

    * `HTTPError: Error:status_code 5xx`. Reason: `Internal Server Error`
  </Accordion>
</AccordionGroup>
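
The `maximize_within_limit` strategy described under `chunking_options` amounts to a greedy merge. The following is a minimal sketch of that idea, not DocParse's implementation. It assumes elements are plain strings, counts one token per character (in the spirit of `character_tokenizer`), ignores page boundaries, and leaves an element that is already over the limit as its own chunk. `greedy_merge` is a hypothetical name.

```python
from typing import List


def greedy_merge(elements: List[str], max_tokens: int = 512) -> List[str]:
    """Merge each element into the previous chunk unless that would exceed max_tokens."""
    chunks: List[str] = []
    for text in elements:
        # The +1 accounts for the newline used to join the two pieces.
        if chunks and len(chunks[-1]) + 1 + len(text) <= max_tokens:
            chunks[-1] = chunks[-1] + "\n" + text
        else:
            # Keep the new element separate; later elements merge into it.
            chunks.append(text)
    return chunks


print(greedy_merge(["aaaa", "bbbb", "cccc"], max_tokens=9))
# prints ['aaaa\nbbbb', 'cccc']
```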

## Asynchronous Partitioning Functions

### partition\_file\_async\_submit

Submit a document for asynchronous partitioning and get its `task_id`. The results of the task will remain available in the system for 48 hours. Meant to be used with `partition_file_async_result`.
Note: sending multiple asynchronous partitioning tasks at the same time does not guarantee that they will run simultaneously.
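
A submit-then-poll loop might look like the sketch below. To keep it runnable without network access, the polling logic takes the fetch function as an argument; in real use you would pass `partition_file_async_result`. The `task_status` key and its `"pending"` value are assumptions about the result format that this section does not confirm, as are the helper name `wait_for_result` and its `interval` and `timeout` parameters.

```python
import time
from typing import Any, Callable, Dict


def wait_for_result(
    fetch: Callable[[str], Dict[str, Any]],
    task_id: str,
    interval: float = 5.0,
    timeout: float = 600.0,
) -> Dict[str, Any]:
    """Hypothetical helper: call fetch(task_id) until the task is no longer pending."""
    deadline = time.monotonic() + timeout
    while True:
        response = fetch(task_id)
        # Assumed response shape: {"task_status": "pending"} while still running.
        if response.get("task_status") != "pending":
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still pending after {timeout}s")
        time.sleep(interval)


# With the SDK installed, usage could look like this (assuming the result
# function accepts the task id as its first argument):
#     from aryn_sdk.partition import partition_file_async_result
#     result = wait_for_result(partition_file_async_result, task_id)
```

Passing the fetch function in keeps the loop easy to test with a fake, and the timeout guards against polling past the point where results are still retained.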

<AccordionGroup>
  <Accordion title="Parameters">
    * `file`: Required. A `file` opened in binary mode to parse or a path expressed as a `str` or a `PathLike` object
      specifying where the file to parse is.

    * `pipeline`: A string that specifies the parsing pipeline to use. Default: `standard`
      * `standard` will use the existing DocParse pipeline and honor all of the existing configuration parameters.
      * `vision` will use a VLM (currently PaddleOCR-VL-1.5) to perform parsing and table extraction. When `vision` is selected, values for `threshold`, `text_mode`, `table_mode`, `text_extraction_options`, `table_extraction_options`, and `extract_images` will be ignored.

    * `threshold`: Either a `float` between `0.0` and `1.0`, inclusive, that serves as a cutoff for which bounding boxes are returned, or the string `auto` (the default), in which case the service uses a processing method to find the best prediction for each possible bounding box.
      When a number is given, only bounding boxes that the model predicts with a confidence score higher than the threshold will be returned. A lower value will
      include more objects but may have overlaps, while a higher value will reduce the number of overlaps but may miss legitimate objects. If you do
      set the threshold manually, we recommend starting with a value of `0.32`.

    * `text_mode`: A string that specifies the mode to use for text extraction. The default is `auto`.
      * `auto` intelligently uses the best combination of OCR and embedded text.
      * `inline_fallback_to_ocr` tries to extract the embedded text elementwise and falls back to performing OCR otherwise.
      * `ocr_standard` uses the classical OCR pipeline.
      * `ocr_vision` uses a vision model for OCR. Note that `ocr_vision` is available for PAYG users only.

    * `table_mode`: A string that specifies the mode to use for table structure extraction. Default: `standard`.
      * `none` will not extract table structure (equivalent to `extract_table_structure = False`)
      * `standard` will use the standard hybrid table structure extraction pipeline
      * `vision` will use a vision model to extract table structure
      * `custom` will use the custom expression described by the model\_selection parameter in the table\_extraction\_options

    * `text_extraction_options`: A map with string keys specifying options for text extraction.
      * `ocr_text_mode` (deprecated): A string that specifies the mode to use for OCR text extraction. The default is `standard`, which uses the conventional classical OCR pipeline to process documents. The other option is `vision`, which uses a vision model for OCR. Note that `vision` is available for PAYG users only.
      * `remove_line_breaks`: A boolean that specifies whether to remove line breaks from the text. Default is `True`.

    * `property_extraction_options`: A dictionary of options for extracting properties (key-value pairs) from documents such as invoices, purchase orders, contracts, etc. Currently, the only option allowed is `schema` which describes properties in the document being processed that you want DocParse to extract, e.g. the total dollar amount in an invoice.
      * `schema`: A list of `properties` each of which describes a specific occurrence of information appearing in the document. Each property is a dictionary with `name` and `type` keys.
        * `name`: The name of a property, e.g. "address", "first\_name".
        * `type`: A dictionary describing the property which consists of:
          * `type`: The type of the property. Simple types can be "int", "float", "date", "object", "string", "bool", "choice". A nested type of "array" can consist of properties of simple types.
          * `description`: A description of the property.
          * `examples`: An array of examples.
      * `voting`: Boolean. When `True`, DocParse will use three different LLMs from three different providers to improve the quality and reliability of property extraction.

    * `table_extraction_options`: A map with string keys specifying options for table extraction. Only applied when `extract_table_structure` is `True`. Default is empty (`{}`).
      * `include_additional_text`: Boolean. When `True`, DocParse will attempt to enhance the table structure by merging in tokens from text extraction. This can be useful for working with tables that have missing or misaligned text. Default is `True`.
      * `model_selection`: String. An expression to instruct DocParse how to select the table model to use for extraction. Default is `"pixels > 500 -> deformable_detr; table_transformer"`, which means "if the largest dimension of the table is more than 500 pixels, use deformable\_detr; otherwise use table\_transformer." To use only deformable\_detr or table\_transformer, set `model_selection="deformable_detr"` or `model_selection="table_transformer"`. Selection expressions are of the form

        ```
        metric cmp threshold -> model; metric cmp threshold -> model; model
        ```

        And should be read as a series of `if metric compares to threshold, then use model` statements. Statements are processed from left to right.

        * Supported models are `table_transformer`, which tends to do well with smaller tables, and `deformable_detr`, which tends to do better with larger tables.
        * Supported metrics are `pixels`, which corresponds to the maximum dimension of the bounding box containing the table (we find this to be easier to reason about than the total number of pixels which depends on two numbers), and `chars`, which corresponds to the total number of characters within the table as determined by the OCR/text extraction step.
        * Thresholds must be numeric.
        * Supported comparison operators are `<, >, <=, >=, ==, !=`.

        A statement with no metric, comparison, and threshold acts as a default: any statements after it are not processed. If no such 'unconditional' statement is included and no conditions match, DocParse will default to table\_transformer.

        Examples:

        * `table_transformer` => always use table transformer
        * `pixels > 500 -> deformable_detr; table_transformer` => if the biggest dimension of the table is greater than 500 pixels use deformable detr. Otherwise use table\_transformer.
        * `pixels>50->table_transformer; chars<30->deformable_detr;chars>35->table_transformer;pixels>2->deformable_detr;table_transformer;comment` => if the biggest dimension is more than 50 pixels use table transformer. Else if the total number of chars in the table is less than 30 use deformable\_detr. Else if there are more than 35 chars use table transformer. Else if there are more than 2 pixels in the biggest dimension use deformable detr. Otherwise use table transformer. `comment` is not processed.

    * `extract_images`: A boolean that determines whether to extract images from the document. The format is determined by the value of `extract_image_format`. Default: `False`.

    * `image_extraction_options`: A map with string keys specifying options for image extraction.
      * `associate_captions`: A boolean that specifies whether to associate captions with the images. Default is `False`.
      * `extract_image_format`: A string indicating in which format extracted images should be returned. Must be one of `ppm`, `png`, or `jpeg`. In all cases, the result will be base64 encoded before being returned. Default: `ppm`.

    * `summarize_images`: (PAYG Only) A boolean that, when `True`, generates a summary of the images in the document and returns it as the `text_representation`. When `False`, images are not summarized. Default is `False`.

    * `selected_pages`: A list specifying individual pages (1-indexed) and page ranges from the document to partition.
      Single pages are specified as integers and ranges are specified as lists with two integer entries in ascending order. A
      valid example value for selected\_pages is `[1, 10, [15, 20]]` which would include pages 1, 10, 15, 16, 17 ..., 20.
      `selected_pages` is `None` by default, which results in all pages of the document being parsed.

    * `chunking_options`: A dictionary of options for specifying chunking behavior. Chunking is only performed when this
      option is present, and default options are chosen when `chunking_options` is specified as `{}`.
      * `strategy`: A string specifying the strategy to use to combine and split chunks. Valid values are `context_rich`
        and `maximize_within_limit`. The default and recommended chunker is `context_rich` as
        `{'strategy': 'context_rich'}`.
        * Behavior of `context_rich` chunker: The goal of this strategy is to add context to evenly-sized chunks. This
          is most useful for retrieval based GenAI applications. Context\_rich chunking combines adjacent `section-header`
          and `title` elements into a new `section-header` element. Merges elements into a chunk with its most recent
          `section-header`. If the chunk would contain too many tokens, then it starts a new chunk copying the
          section-header to the start of this new chunk and continues. Merges elements on different pages, unless
          `merge_across_pages` is set to `False`.
        * Behavior of `maximize_within_limit` chunker: The goal of the `maximize_within_limit` chunker is to make the
          chunks as large as possible. Merges elements into the most recently merged set of elements unless doing so
          would make its token count exceed `max_tokens`. In that case, it would keep the new element separate and start
          merging subsequent elements into that one, following the same rule. Merges elements on different pages, unless
          `merge_across_pages` is set to `False`.
      * `max_tokens`: An integer specifying the cutoff for splitting chunks that are too large. Default value is 512.
      * `tokenizer`: A string specifying the tokenizer to use when determining how characters in a chunk are grouped.
        Valid values are `openai_tokenizer`, `character_tokenizer`, and `huggingface_tokenizer`. Defaults to `openai_tokenizer`.
      * `tokenizer_options`: A nested dictionary with string keys specifying the options for the chosen tokenizer. Defaults to
        `{'model_name': 'text-embedding-3-small'}`, which works with the OpenAI tokenizer.
        * Available options for `openai_tokenizer`:
          * `model_name`: Accepts all models supported by OpenAI's
            [tiktoken tokenizer](https://github.com/openai/tiktoken). Default is "text-embedding-3-small"
        * Available options for `huggingface_tokenizer`:
          * `model_name`: Accepts all huggingface tokenizers from the
            [huggingface/tokenizers repo](https://github.com/huggingface/tokenizers).
        * `character_tokenizer` does not take any options.
      * `merge_across_pages`: A `boolean` that, when `True`, makes the selected chunker attempt to merge chunks across page
        boundaries. Does not apply to the `mixed_multi_column` merger, which never merges across pages. Defaults to
        `True`.

    * `output_format`: A string controlling the output representation. Options are:
      * `json` (default): yields an array called `elements` containing the partitioned elements, represented in JSON.
      * `markdown`: the service response will include a field called `markdown` containing a string representing the entire document in Markdown format.
      * `html`: the service response will include a field called `html` containing a string representing the entire document in HTML format.

    * `output_label_options`: A dictionary of options to specify which heuristic to apply to enforce certain label outputs. If this option is not specified, no heuristic is applied. The options the dictionary supports are listed below.
      * `promote_title`: A boolean that specifies whether to promote an element to title if there's no title in the output.
      * `title_candidate_elements`: A list of strings that are candidate elements to be promoted to title.
      * `orientation_correction`: A boolean value that specifies whether to correct the orientation of rotated pages during the preprocessing step.

    * `markdown_options`: A dictionary of options to specify what to include in the markdown output.
      * `include_pagenum`: A boolean that specifies whether to include page numbers in the markdown output. Default is `False`.
      * `include_headers`: A boolean that specifies whether to include headers in the markdown output. Default is `False`.
      * `include_footers`: A boolean that specifies whether to include footers in the markdown output. Default is `False`.

    * `ssl_verify`: A `boolean` that controls whether the client verifies the SSL certificate of the chosen DocParse
      server. ssl\_verify is `True` by default, enforcing SSL verification.

    * `aryn_config`: An ArynConfig object (defined in
      [aryn\_sdk/config.py](https://github.com/aryn-ai/sycamore/blob/main/lib/aryn-sdk/aryn_sdk/config.py)), used for finding
      an api key. If aryn\_api\_key is set it will override this. The default ArynConfig looks in the env var `ARYN_API_KEY`
      and then in the file \~/.aryn/config.yaml. Default is None (aryn-sdk will look in the aryn\_api\_key parameter, in your
      environment variables, and then in \~/.aryn/config.yaml).

    * `aryn_api_key`: An Aryn API key, provided as a string. You can get one for free at [aryn.ai/get-started](https://www.aryn.ai/get-started). Default is `None` (If not provided, the sdk will check for it in the environment variable `ARYN_API_KEY` or will look in aryn\_config as specified above).

    * `region`: A string that specifies the region to use for the DocParse server. Valid values are `US` and `None`. Default is `None`, which uses the US region.

    * `extract_image_format` (deprecated as a top-level parameter; set it in `image_extraction_options` instead): A string indicating in which format extracted images should be returned. Must be one of `ppm`, `png`, or `jpeg`. In all cases, the result will be base64 encoded before being returned. Default: `ppm`.

    * `use_ocr` (deprecated): Use `text_mode` instead. A boolean value that, when set to `True`, causes DocParse to extract text using an OCR model. This is
      useful when the text is not directly extractable from the PDF, such as when the text is part of an image or when the
      text is rotated. When set to `False`, DocParse extracts embedded text from the input document. Default is `False`.

    * `extract_table_structure` (deprecated): Use `table_mode` instead. A boolean that, when `True`, enables DocParse to extract tables and their structural
      content using a purpose-built table extraction model. If set to `False`, tables are still identified but
      not analyzed for their structure; as a result, table cells and their bounding boxes are not included in the response.
      Default is `True`.

    * `webhook_url`: A string URL that Aryn calls when the async task has stopped. Aryn sends a POST request with a body like this: `{"done": [{"task_id": "aryn:t-47gpd3604e5tz79z1jro5fc"}]}`
  </Accordion>

  <Accordion title="Example">
    ```python theme={null}
    from aryn_sdk.partition import partition_file_async_submit

    with open("my-favorite-pdf.pdf", "rb") as f:
        response = partition_file_async_submit(
            f,
            text_mode="ocr_standard",
            extract_table_structure=True,
        )

    # get task id
    task_id = response["task_id"]
    ```
  </Accordion>

  <Accordion title="Return Value">
    A dict containing the `task_id` of the submitted request.

    ```json theme={null}
    {
      "task_id": "aryn:t-47gpd3604e5tz79z1jro5fc"
    }
    ```
  </Accordion>

  <Accordion title="Exceptions">
    User errors:

    * `HTTPError: Error:status_code 403`. Reason: `"This async action requires you to upgrade your account plan"`
      * Fix: Please upgrade your account [here](https://console.aryn.ai/billing)
    * `HTTPError: Error:status_code 403`. Reason: `"No Aryn API Key provided"`
      * Fix: Please provide an API key either as a parameter or specify it in the environment variable `ARYN_API_KEY`.
    * `HTTPError: Error:status_code 403`. Reason: `"Invalid Aryn API key"`
      * Fix: Please provide a valid API key either as a parameter or specify it in the environment variable `ARYN_API_KEY`.
    * `HTTPError: Error:status_code 403`. Reason: `"Expired Aryn API key"`
      * Fix: Please get a new API key [here](https://console.aryn.ai/api-keys).
    * `HTTPError: Error:status_code 429`. Reason: `"Too many requests"`
      * Fix: Please try again after some time. Each account is allowed 1000 tasks to run at a time.

    Other errors:

    * `HTTPError: Error:status_code 5xx`. Reason: `Internal Server Error`
  </Accordion>
</AccordionGroup>
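
The body that Aryn POSTs to `webhook_url` can be unpacked with the standard library. The following is a minimal sketch that assumes the payload always has the shape shown in the `webhook_url` parameter description. `finished_task_ids` is a hypothetical helper, not part of `aryn_sdk`, and the web server that receives the POST is left out.

```python
import json


def finished_task_ids(body: str) -> list[str]:
    """Return the task ids listed under "done" in a webhook POST body.

    Hypothetical helper: assumes the body is JSON shaped like
    {"done": [{"task_id": "..."}]}.
    """
    payload = json.loads(body)
    return [entry["task_id"] for entry in payload.get("done", [])]


# Example body copied from the webhook_url parameter description
example_body = '{"done": [{"task_id": "aryn:t-47gpd3604e5tz79z1jro5fc"}]}'
print(finished_task_ids(example_body))
```

Each returned id can then be passed to `partition_file_async_result` to fetch the finished result.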

### partition\_file\_async\_result

Gets the results of an asynchronous partitioning task by `task_id`. Meant to be used with `partition_file_async_submit`.

<AccordionGroup>
  <Accordion title="Parameters">
    * `task_id`: Required. A string of the task id to poll and attempt to get the result for.
    * `aryn_api_key`: An Aryn API key, provided as a string. You can get one for free at [aryn.ai/get-started](https://www.aryn.ai/get-started). Default is `None` (If not provided, the sdk will check for it in the environment variable `ARYN_API_KEY` or will look in aryn\_config as specified below).
    * `region`: A string that specifies the region to use for the DocParse server. Valid values are `US` and `None`. Default is `None`, which uses the US region. Via the API, you can specify the region by modifying the base URL of the DocParse server.
    * `aryn_config`: An ArynConfig object (defined in
      [aryn\_sdk/config.py](https://github.com/aryn-ai/sycamore/blob/main/lib/aryn-sdk/aryn_sdk/config.py)), used for finding
      an api key. If `aryn_api_key` is set it will override this. The default ArynConfig looks in the env var `ARYN_API_KEY`
      and then in the file `~/.aryn/config.yaml`. Default is None (aryn-sdk will look in the aryn\_api\_key parameter, in your
      environment variables, and then in `~/.aryn/config.yaml`).
    * `ssl_verify`: A `bool` that controls whether the client verifies the SSL certificate of the chosen DocParse server.
      ssl\_verify is `True` by default, enforcing SSL verification.
  </Accordion>

  <Accordion title="Example">
    ```python theme={null}
    import time
    from aryn_sdk.partition import partition_file_async_result

    # Poll for the results; task_id is the id returned by partition_file_async_submit
    while True:
        result = partition_file_async_result(task_id)
        if result["task_status"] != "pending":
            break
        time.sleep(1)
    ```
  </Accordion>

  <Accordion title="Return Value">
    A dict containing `task_status`, like the one in the example below. When `task_status` is `"done"`, the returned
    dict also contains `result`, which holds what `partition_file` would have returned had it been called directly. If partitioning the file itself fails, `task_status` is still `"done"`, but
    `result` contains an `error` field indicating what went wrong.

    `task_status` is either `"done"` or `"pending"`.

    ```json theme={null}
    {
      "task_status":"done",
      "result": ...
    }
    ```
  </Accordion>

  <Accordion title="Exceptions">
    User errors:

    * `HTTPError: Error:status_code 403`. Reason: `"This async action requires you to upgrade your account plan"`
      * Fix: Please upgrade your account [here](https://console.aryn.ai/billing)
    * `HTTPError: Error:status_code 403`. Reason: `"No Aryn API Key provided"`
      * Fix: Please provide an API key either as a parameter or specify it in the environment variable `ARYN_API_KEY`.
    * `HTTPError: Error:status_code 403`. Reason: `"Invalid Aryn API key"`
      * Fix: Please provide a valid API key either as a parameter or specify it in the environment variable `ARYN_API_KEY`.
    * `HTTPError: Error:status_code 403`. Reason: `"Expired Aryn API key"`
      * Fix: Please get a new API key [here](https://console.aryn.ai/api-keys).
    * `aryn_sdk.partition.partition.PartitionTaskNotFoundError`. Reason: `"No such task"`
      * Fix: Check to make sure the task\_id specified is correct.

    Other errors:

    * `HTTPError: Error:status_code 5xx`. Reason: `Internal Server Error`
  </Accordion>
</AccordionGroup>
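
The polling loop in the example can be extended with a timeout, an increasing delay, and a check for the `error` field described under Return Value. This is a sketch under stated assumptions: `wait_for_result` and its parameters are hypothetical and not part of `aryn_sdk`, and the function that fetches the status is passed in as `fetch` so the sketch runs without network access.

```python
import time
from typing import Callable


def wait_for_result(
    fetch: Callable[[str], dict],
    task_id: str,
    timeout_s: float = 300.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
) -> dict:
    """Poll fetch(task_id) until task_status is no longer "pending".

    Returns the "result" payload. Raises RuntimeError if the result carries
    an "error" field, or TimeoutError if timeout_s would be exceeded.
    """
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        response = fetch(task_id)
        if response["task_status"] != "pending":
            result = response.get("result", {})
            # A failed partition is still reported as "done", with an
            # "error" field inside the result.
            if isinstance(result, dict) and "error" in result:
                raise RuntimeError(f"Partitioning failed: {result['error']}")
            return result
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"Task {task_id} still pending after {timeout_s}s")
        time.sleep(delay)
        # Double the delay after each attempt, up to max_delay_s
        delay = min(delay * 2, max_delay_s)
```

With the SDK installed, the call would look like `wait_for_result(partition_file_async_result, task_id)`.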

### partition\_file\_async\_cancel

Cancels the task associated with the task\_id specified.

<AccordionGroup>
  <Accordion title="Parameters">
    * `task_id`: Required. A string of the task id to cancel.
    * `aryn_api_key`: An Aryn API key, provided as a string. You can get one for free at [aryn.ai/get-started](https://www.aryn.ai/get-started). Default is `None` (If not provided, the sdk will check for it in the environment variable `ARYN_API_KEY` or will look in aryn\_config as specified below).
    * `region`: A string that specifies the region to use for the DocParse server. Valid values are `US` and `None`. Default is `None`, which uses the US region. Via the API, you can specify the region by modifying the base URL of the DocParse server.
    * `aryn_config`: An ArynConfig object (defined in
      [aryn\_sdk/config.py](https://github.com/aryn-ai/sycamore/blob/main/lib/aryn-sdk/aryn_sdk/config.py)), used for finding
      an api key. If `aryn_api_key` is set it will override this. The default ArynConfig looks in the env var `ARYN_API_KEY`
      and then in the file `~/.aryn/config.yaml`. Default is None (aryn-sdk will look in the aryn\_api\_key parameter, in your
      environment variables, and then in `~/.aryn/config.yaml`).
    * `ssl_verify`: A `bool` that controls whether the client verifies the SSL certificate of the chosen DocParse server.
      ssl\_verify is `True` by default, enforcing SSL verification.
  </Accordion>

  <Accordion title="Example">
    ```python theme={null}
    from aryn_sdk.partition import partition_file_async_submit, partition_file_async_cancel

    # Submit a task and keep the returned task id
    task_id = partition_file_async_submit(
        "path/to/file.pdf",
        text_mode="ocr_standard",
        extract_table_structure=True,
        extract_images=True,
    )["task_id"]

    partition_file_async_cancel(task_id)
    ```
  </Accordion>

  <Accordion title="Return Value">
    No return value. An asynchronous task can be successfully cancelled only once. Once a task has been
    cancelled, any `partition_file_async_result` call using that task's id raises an exception.
  </Accordion>

  <Accordion title="Exceptions">
    User errors:

    * `HTTPError: Error:status_code 403`. Reason: `"This async action requires you to upgrade your account plan"`
      * Fix: Please upgrade your account [here](https://console.aryn.ai/billing)
    * `HTTPError: Error:status_code 403`. Reason: `"No Aryn API Key provided"`
      * Fix: Please provide an API key either as a parameter or specify it in the environment variable `ARYN_API_KEY`.
    * `HTTPError: Error:status_code 403`. Reason: `"Invalid Aryn API key"`
      * Fix: Please provide a valid API key either as a parameter or specify it in the environment variable `ARYN_API_KEY`.
    * `HTTPError: Error:status_code 403`. Reason: `"Expired Aryn API key"`
      * Fix: Please get a new API key [here](https://console.aryn.ai/api-keys).
    * `aryn_sdk.partition.partition.PartitionTaskNotFoundError`. Reason: `"No such task"`
      * Fix: Check to make sure the task\_id specified is correct.

    Other errors:

    * `HTTPError: Error:status_code 5xx`. Reason: `Internal Server Error`
  </Accordion>
</AccordionGroup>

### partition\_file\_async\_list

Lists all the partition\_file tasks still running in your account.

<AccordionGroup>
  <Accordion title="Parameters">
    * `aryn_api_key`: An Aryn API key, provided as a string. You can get one for free at [aryn.ai/get-started](https://www.aryn.ai/get-started). Default is `None` (If not provided, the sdk will check for it in the environment variable `ARYN_API_KEY` or will look in aryn\_config as specified below).
    * `region`: A string that specifies the region to use for the DocParse server. Valid values are `US` and `None`. Default is `None`, which uses the US region. Via the API, you can specify the region by modifying the base URL of the DocParse server.
    * `aryn_config`: An ArynConfig object (defined in
      [aryn\_sdk/config.py](https://github.com/aryn-ai/sycamore/blob/main/lib/aryn-sdk/aryn_sdk/config.py)), used for finding
      an api key. If `aryn_api_key` is set it will override this. The default ArynConfig looks in the env var `ARYN_API_KEY`
      and then in the file `~/.aryn/config.yaml`. Default is None (aryn-sdk will look in the aryn\_api\_key parameter, in your
      environment variables, and then in `~/.aryn/config.yaml`).
    * `ssl_verify`: A `bool` that controls whether the client verifies the SSL certificate of the chosen DocParse server.
      ssl\_verify is `True` by default, enforcing SSL verification.
  </Accordion>

  <Accordion title="Example">
    ```python theme={null}
    from aryn_sdk.partition import partition_file_async_list

    tasks = partition_file_async_list()
    ```
  </Accordion>

  <Accordion title="Return Value">
    A dict like the one below which maps task\_ids to a dict containing details of the respective task.

    ```json theme={null}
    {
        "aryn:t-sc0v0lglkauo774pioflp4l": {
            "task_status": "pending"
        },
        "aryn:t-b9xp7ny0eejvqvbazjhg8rn": {
            "task_status": "pending"
        }
    }
    ```
  </Accordion>

  <Accordion title="Exceptions">
    User errors:

    * `HTTPError: Error:status_code 403`. Reason: `"This async action requires you to upgrade your account plan"`
      * Fix: Please upgrade your account [here](https://console.aryn.ai/billing)
    * `HTTPError: Error:status_code 403`. Reason: `"No Aryn API Key provided"`
      * Fix: Please provide an API key either as a parameter or specify it in the environment variable `ARYN_API_KEY`.
    * `HTTPError: Error:status_code 403`. Reason: `"Invalid Aryn API key"`
      * Fix: Please provide a valid API key either as a parameter or specify it in the environment variable `ARYN_API_KEY`.
    * `HTTPError: Error:status_code 403`. Reason: `"Expired Aryn API key"`
      * Fix: Please get a new API key [here](https://console.aryn.ai/api-keys).

    Other errors:

    * `HTTPError: Error:status_code 5xx`. Reason: `Internal Server Error`
  </Accordion>
</AccordionGroup>
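
Because the return value maps task ids to their details, picking out the ids that are still pending is a short comprehension. This is a minimal sketch that uses the sample response above as stand-in data. `pending_task_ids` is a hypothetical helper, not part of `aryn_sdk`.

```python
def pending_task_ids(tasks: dict) -> list[str]:
    """Return the ids of tasks whose task_status is "pending"."""
    return [
        task_id
        for task_id, details in tasks.items()
        if details.get("task_status") == "pending"
    ]


# Stand-in for the dict returned by partition_file_async_list()
tasks = {
    "aryn:t-sc0v0lglkauo774pioflp4l": {"task_status": "pending"},
    "aryn:t-b9xp7ny0eejvqvbazjhg8rn": {"task_status": "pending"},
}
print(pending_task_ids(tasks))
```

Each id in the list could then be passed to `partition_file_async_cancel` or `partition_file_async_result`.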

## Helper Functions

### convert\_image\_element

Convert an image element to a more usable format. If no format is specified, create a PIL Image object. If a format is specified, output the bytes of the image in that format. If `b64encode` is set to `True`, base64-encode the bytes and return them as a `string`.

<AccordionGroup>
  <Accordion title="Parameters">
    * `elem`: Required. An image element from the `elements` field of a `partition_file` response.
    * `format`: A `string` specifying the format to output bytes to. Default is `PIL`.
    * `b64encode`: A `boolean` that when set to True enables base64-encoding of the output bytes of this function. Format cannot be `PIL` when this option is `True`. Default is `False`.
  </Accordion>

  <Accordion title="Example">
    ```python theme={null}
    from aryn_sdk.partition import partition_file, convert_image_element

    with open("my-favorite-pdf.pdf", "rb") as f:
        data = partition_file(
            f,
            extract_images=True
        )
    image_elts = [e for e in data['elements'] if e['type'] == 'Image']

    # The indices below assume the document contains at least three images
    pil_img = convert_image_element(image_elts[0])
    jpg_bytes = convert_image_element(image_elts[1], format='JPEG')
    png_str = convert_image_element(image_elts[2], format="PNG", b64encode=True)
    ```
  </Accordion>

  <Accordion title="Return Value">
    Either a PIL `Image` object, bytes of an image, or a base64-encoded image as a `str`.
  </Accordion>

  <Accordion title="Exceptions">
    * `ValueError` - "b64encode was True but format was PIL. Cannot b64-encode a PIL Image".
      * Fix: Set `b64encode` to `False` when `format` is `PIL`, or choose a byte format such as `PNG` or `JPEG` when you need base64 output.
  </Accordion>
</AccordionGroup>
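
A base64 string such as the one returned with `b64encode=True` can be decoded back to bytes or embedded in HTML as a data URI. The sketch below uses only the standard library. `to_bytes` and `to_data_uri` are hypothetical helpers, not part of `aryn_sdk`, and the sample string stands in for real output from `convert_image_element`.

```python
import base64


def to_bytes(b64_image: str) -> bytes:
    """Decode a base64-encoded image string back into raw bytes."""
    return base64.b64decode(b64_image)


def to_data_uri(b64_image: str, mime_type: str = "image/png") -> str:
    """Wrap a base64-encoded image string in a data URI for use in HTML."""
    return f"data:{mime_type};base64,{b64_image}"


# Stand-in for convert_image_element(elem, format="PNG", b64encode=True):
# here, just the 8-byte PNG file signature, base64-encoded.
png_str = base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii")

raw = to_bytes(png_str)
uri = to_data_uri(png_str)
```

The `mime_type` argument should match the `format` passed to `convert_image_element`, for example `image/jpeg` for `JPEG`.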

### draw\_with\_boxes

Create a list of images from the provided PDF, one for each page, with bounding boxes detected by the partitioner drawn on.

<AccordionGroup>
  <Accordion title="Parameters">
    * `pdf_file`: Required. A PDF file opened in binary mode or a path to a PDF file expressed as a `string` or a `PathLike` object upon which to draw.
    * `partitioning_data`: Required. The output from `partition_file`.
    * `draw_table_cells`: A boolean that, when `True`, makes the function draw the individually detected cells of tables. When `False`, the bounding boxes of table cells are not drawn but the outer bounding boxes of tables and the bounding boxes of all other elements are still drawn. Default is `False`.
  </Accordion>

  <Accordion title="Example">
    ```python theme={null}
    from aryn_sdk.partition import partition_file, draw_with_boxes

    with open("my-favorite-pdf.pdf", "rb") as f:
        data = partition_file(
            f,
            aryn_api_key="MY-ARYN-API-KEY",
            text_mode="ocr_standard",
            extract_table_structure=True,
            extract_images=True
        )
    pages = draw_with_boxes("my-favorite-pdf.pdf", data, draw_table_cells=True)
    ```
  </Accordion>

  <Accordion title="Return Value">
    A list of images of pages of the PDF, each with bounding boxes drawn on.
  </Accordion>

  <Accordion title="Exceptions">
    * Raises an exception if the function is not called with a PDF file.
  </Accordion>
</AccordionGroup>

### table\_elem\_to\_dataframe

Create a `pandas` DataFrame representing the tabular data inside the provided table element. If the element is not of type `table` or doesn't contain any table data, return `None` instead.

<AccordionGroup>
  <Accordion title="Parameters">
    * `elem`: Required. An element from the 'elements' field of a `partition_file` response.
  </Accordion>

  <Accordion title="Example">
    ```python theme={null}
    from aryn_sdk.partition import partition_file, table_elem_to_dataframe

    with open("partition-me.pdf", "rb") as f:
        data = partition_file(
            f,
            text_mode="ocr_standard",
            extract_table_structure=True,
            extract_images=True
        )

    # Find the first table and convert it to a dataframe
    df = None
    for element in data['elements']:
        if element['type'] == 'table':
            df = table_elem_to_dataframe(element)
            break
    ```
  </Accordion>

  <Accordion title="Return Value">
    A Pandas DataFrame representing the tabular data inside the provided table element. If the element is not of type 'table' or doesn't contain any table data, returns None instead.
  </Accordion>
</AccordionGroup>

### table\_elem\_to\_html

Convert the tabular data inside the provided table element into an HTML string. If the element is not of type `table` or doesn't contain any table data, return `None` instead.

<AccordionGroup>
  <Accordion title="Parameters">
    * `elem`: Required. An element from the 'elements' field of a `partition_file` response.
    * `pretty`: A boolean indicating whether to pretty-print the returned HTML. Default is `False`.
  </Accordion>

  <Accordion title="Example">
    ```python theme={null}
    from aryn_sdk.partition import partition_file, table_elem_to_html

    with open("partition-me.pdf", "rb") as f:
        data = partition_file(
            f,
            text_mode="ocr_standard",
            extract_table_structure=True,
            extract_images=True
        )

    # Find the first table and convert it to an HTML string
    html_str = None
    for element in data['elements']:
        if element['type'] == 'table':
            html_str = table_elem_to_html(element)
            break
    ```
  </Accordion>

  <Accordion title="Return Value">
    An HTML string containing the contents of the table element or `None` if the provided element is not a table. The HTML string will contain just the table tag, starting with `<table>` and ending with `</table>`.
  </Accordion>
</AccordionGroup>

### tables\_to\_pandas

For every table element in the provided partitioning response, create a pandas DataFrame representing the tabular data. Return a list containing all the elements, with tables paired with their corresponding DataFrames.

<AccordionGroup>
  <Accordion title="Parameters">
    * `data`: A response from `partition_file`
  </Accordion>

  <Accordion title="Example">
    ```python theme={null}
    from aryn_sdk.partition import partition_file, tables_to_pandas

    with open("my-favorite-pdf.pdf", "rb") as f:
        data = partition_file(
            f,
            aryn_api_key="MY-ARYN-API-KEY",
            text_mode="ocr_standard",
            extract_table_structure=True,
            extract_images=True
        )
    elts_and_dataframes = tables_to_pandas(data)
    ```
  </Accordion>

  <Accordion title="Return Value">
    A list of tuples, where each tuple contains an element from the 'elements' field of a `partition_file` response and a Pandas DataFrame representing the tabular data inside the provided table element. If the element is not of type 'table' or doesn't contain any table data, the DataFrame will be `None`.
  </Accordion>
</AccordionGroup>
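
Since non-table elements are paired with `None`, a common follow-up step is to keep only the tuples that carry a converted table. This is a minimal sketch with stand-in data, so it runs without `pandas` or the SDK. `only_tables` is a hypothetical helper, not part of `aryn_sdk`.

```python
def only_tables(pairs):
    """Keep the (element, converted) tuples whose converted value is not None."""
    return [(elem, converted) for elem, converted in pairs if converted is not None]


# Stand-in for the list returned by tables_to_pandas(data); the string
# takes the place of a real DataFrame.
pairs = [
    ({"type": "Text"}, None),
    ({"type": "table"}, "placeholder for a DataFrame"),
]

tables = only_tables(pairs)
print(len(tables))
```

The comparison uses `is not None` rather than truthiness because evaluating a DataFrame in a boolean context raises a `ValueError`.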

### tables\_to\_html

For every table element in the provided partitioning response, create an HTML string representing the tabular data. Return a list containing all the elements, with tables paired with their corresponding HTML.

<AccordionGroup>
  <Accordion title="Parameters">
    * `data`: A response from `partition_file`
  </Accordion>

  <Accordion title="Example">
    ```python theme={null}
    from aryn_sdk.partition import partition_file, tables_to_html

    with open("my-favorite-pdf.pdf", "rb") as f:
        data = partition_file(
            f,
            aryn_api_key="MY-ARYN-API-KEY",
            text_mode="ocr_standard",
            extract_table_structure=True,
            extract_images=True
        )
    elts_and_html = tables_to_html(data)
    ```
  </Accordion>

  <Accordion title="Return Value">
    A list of tuples, where each tuple contains an element from the 'elements' field of a `partition_file` response and an HTML string representing the tabular data inside the provided table element. If the element is not of type 'table' or doesn't contain any table data, the HTML will be `None`.
  </Accordion>
</AccordionGroup>
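
The HTML fragments can be joined into a single page for quick inspection in a browser. The sketch below assumes each fragment is a complete `<table>...</table>` string, as described for `table_elem_to_html`. `build_tables_page` is a hypothetical helper, not part of `aryn_sdk`, and the sample list stands in for real output from `tables_to_html`.

```python
import html


def build_tables_page(pairs, title="Extracted tables"):
    """Join the non-None HTML fragments from (element, html) tuples into one page."""
    fragments = [fragment for _elem, fragment in pairs if fragment is not None]
    body = "\n".join(fragments)
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body>\n{body}\n</body></html>\n"
    )


# Stand-in for the list returned by tables_to_html(data)
pairs = [
    ({"type": "Text"}, None),
    ({"type": "table"}, "<table><tr><td>1</td></tr></table>"),
]

page = build_tables_page(pairs)
```

The resulting string can be written to a file with `open("tables.html", "w", encoding="utf-8")`.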