
# Using the Aryn SDK

> Using DocParse with the Aryn SDK

## Installation

We recommend installing the Aryn SDK library using `pip`:

```bash theme={null}
pip install aryn-sdk
```

## Partitioning a Document

Partition a document like so:

```python theme={null}
from aryn_sdk.partition import partition_file

with open("mydocument.pdf", "rb") as f:
    data = partition_file(f)
```

`partition_file` accepts the same options as the curl-based API, passed as keyword arguments. See the [list of options](./processing_options).
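Once a call returns, a quick way to see what came back is to tally the element types. The sketch below runs without the SDK: `sample_data` is a hand-written stand-in for the returned value, assuming only the `elements` list and per-element `type` key used elsewhere on this page. The helper name `count_element_types` and the sample contents are made up for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the value returned by partition_file.
# Only the "elements" list and the "type" key are taken from this page.
sample_data = {
    "elements": [
        {"type": "Text"},
        {"type": "Text"},
        {"type": "table"},
    ]
}


def count_element_types(data):
    """Return a Counter mapping each element type to its number of occurrences."""
    return Counter(elt["type"] for elt in data.get("elements", []))


type_counts = count_element_types(sample_data)
print(type_counts.most_common())
```

With a real response, pass the value returned by `partition_file` in place of `sample_data`.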

## Extract properties from a document

Create a schema by hand, or get help from DocParse [Suggest Properties](docparse/tutorials/suggestion_tutorial). Provide the `schema` in `property_extraction_options` like so:

```python theme={null}
from aryn_sdk.partition import partition_file

property_email = {
    "name": "vendor_email",
    "type": {
        "type": "string",
        "description": "Vendor's email address, if not provided, return null",
        "examples": ["billing@abcservices.com", "john@consulting.com"],
        "validators": [
            {
                "type": "regex",
                "regex": "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
            }
        ]
    }
}
schema = {"properties": [property_email]}

with open("my_invoice.pdf", "rb") as f:
    data = partition_file(f, property_extraction_options={"schema": schema})
```

You can add `"voting": True` to the `property_extraction_options` dictionary to enable voting across multiple LLMs.
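Before sending a schema, it can help to check locally that a regex validator accepts the schema's own `examples`. The sketch below uses only Python's standard `re` module with the pattern and examples from the schema above. The helper name `matches_validator` is hypothetical, and how the service applies the pattern is not specified here, so treat this as a sanity check rather than a reproduction of server-side behavior.

```python
import re

# Pattern and examples copied from the vendor_email property above.
EMAIL_REGEX = "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
EXAMPLES = ["billing@abcservices.com", "john@consulting.com"]


def matches_validator(value, pattern=EMAIL_REGEX):
    """Return True if the value satisfies the regex (local check only)."""
    return re.search(pattern, value) is not None


all_examples_match = all(matches_validator(example) for example in EXAMPLES)
print(all_examples_match)
```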

## Key management

By default, `aryn-sdk` looks for Aryn API keys first in the environment variable `ARYN_API_KEY`, and then in `~/.aryn/config.yaml`. You can override this behavior by specifying a key directly or a different path to the Aryn config file:

```python theme={null}
from aryn_sdk.partition import partition_file
from aryn_sdk.config import ArynConfig

# Option 1: pass the API key directly.
with open("mydocument.pdf", "rb") as f:
    data = partition_file(f, aryn_api_key="YOUR-API-KEY")

# Option 2: point the SDK at a config file in a non-default location.
with open("mydocument.pdf", "rb") as f:
    data = partition_file(f, aryn_config=ArynConfig(aryn_config_path="~/dotfiles/.aryn/config.yaml"))
```
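To see which of the default sources would apply on a given machine, you can mirror the documented lookup order in a few lines. This is an illustrative sketch, not the SDK's implementation: the function name `key_source` is hypothetical, and it only checks that the config file exists, not that it contains a valid key.

```python
import os
from pathlib import Path

DEFAULT_CONFIG_PATH = "~/.aryn/config.yaml"


def key_source(environ=None, config_path=DEFAULT_CONFIG_PATH):
    """Report where a key would be looked up first, following the documented order."""
    environ = os.environ if environ is None else environ
    # The environment variable is checked first.
    if environ.get("ARYN_API_KEY"):
        return "environment"
    # Then the config file (existence check only; contents are not parsed).
    if Path(config_path).expanduser().is_file():
        return "config file"
    return None


print(key_source())
```

A return value of `None` means neither default source is present, so a key or config path would need to be passed explicitly.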

## Helper Functions

`aryn_sdk` provides some helper functions to make working with and visualizing the output of `partition_file` easier.

```python theme={null}
from aryn_sdk.partition import partition_file, table_elem_to_dataframe, table_elem_to_html, draw_with_boxes

with open("mydocument.pdf", "rb") as f:
    data = partition_file(f, extract_table_structure=True, text_mode="standard_ocr", extract_images=True, threshold=0.35)

# Produce a pandas DataFrame representing the first extracted table.
table_elements = [elt for elt in data['elements'] if elt['type'] == 'table']
dataframe = table_elem_to_dataframe(table_elements[0])

# Convert the first table into HTML.
html = table_elem_to_html(table_elements[0])

# Draw the detected bounding boxes on the pages (requires poppler).
images = draw_with_boxes("mydocument.pdf", data)
```
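Indexing `table_elements[0]` raises an `IndexError` when a document contains no tables. A minimal sketch of a safer pattern is to convert every table that is present and accept an empty result. The helper names and the `demo_id` field are hypothetical, and the converter is passed in as an argument so the example runs without the SDK; in real use you would pass `table_elem_to_dataframe` or `table_elem_to_html`.

```python
def find_tables(data):
    """Return every element whose type is 'table' (possibly an empty list)."""
    return [elt for elt in data.get("elements", []) if elt.get("type") == "table"]


def convert_tables(data, convert):
    """Apply `convert` to each table element and return the results as a list."""
    return [convert(elt) for elt in find_tables(data)]


# Hypothetical stand-in data; "demo_id" is not a real response field.
sample = {
    "elements": [
        {"type": "Text"},
        {"type": "table", "demo_id": 1},
        {"type": "table", "demo_id": 2},
    ]
}

# A trivial stand-in converter that returns the made-up demo_id.
converted = convert_tables(sample, convert=lambda elt: elt["demo_id"])
print(converted)
```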

## Different File Formats

The same call works for other file formats:

```python theme={null}
from aryn_sdk.partition import partition_file

with open("mydocument.pdf", "rb") as f:
    data = partition_file(f)
with open("mydocument.docx", "rb") as f:
    data = partition_file(f)
with open("mydocument.doc", "rb") as f:
    data = partition_file(f)
with open("mypresentation.pptx", "rb") as f:
    data = partition_file(f)
with open("mypresentation.ppt", "rb") as f:
    data = partition_file(f)
```
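When working through a mixed folder, it can be convenient to select files by extension before partitioning them. The sketch below filters on the five extensions shown in the example above; that set is not a complete list of supported formats, and the helper name `select_documents` is hypothetical.

```python
from pathlib import Path

# Only the extensions that appear in the example above, not an exhaustive list.
SHOWN_EXTENSIONS = {".pdf", ".docx", ".doc", ".pptx", ".ppt"}


def select_documents(paths, extensions=SHOWN_EXTENSIONS):
    """Keep the paths whose extension (case-insensitive) is in `extensions`."""
    return [Path(p) for p in paths if Path(p).suffix.lower() in extensions]


candidates = ["report.PDF", "notes.txt", "slides.pptx", "archive.zip"]
selected = select_documents(candidates)
print([p.name for p in selected])
```

Each selected path can then be opened in binary mode and passed to `partition_file` as shown above.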

## URLs as input

Instead of an open file object, you can pass `partition_file` a local file path, a `file://` URL, or the URL of a document hosted on a remote server:

```python theme={null}
from aryn_sdk.partition import partition_file
data = partition_file("/home/jdoe/Documents/Memo.pdf")
data = partition_file("file:///home/jdoe/Documents/Memo.pdf")
data = partition_file("https://www.example.com/assets/proposal.pdf")
data = partition_file("https://aryn-public.s3.amazonaws.com/partitioner-blog-data/crispr.pdf")
```
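If you have a relative path and want the `file://` form, the standard library's `pathlib` can build it. This sketch uses only `pathlib`; the helper name `to_file_url` is hypothetical, and the file does not need to exist for the URL to be constructed.

```python
from pathlib import Path


def to_file_url(path):
    """Convert a local path (relative or using ~) to an absolute file:// URL."""
    return Path(path).expanduser().resolve().as_uri()


file_url = to_file_url("Memo.pdf")
print(file_url)
```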

## Chunking a document

Chunking support was added in v0.1.9. To enable chunking with the default options, pass an empty dict as `chunking_options`:

```python theme={null}
from aryn_sdk.partition import partition_file

with open("mydocument.pdf", "rb") as f:
    data = partition_file(f, chunking_options={})
```

To override specific chunking options, set them in the dict:

```python theme={null}
from aryn_sdk.partition import partition_file

with open("mydocument.pdf", "rb") as f:
    data = partition_file(
        f,
        chunking_options={
            "strategy": "context_rich",
            "tokenizer": "openai_tokenizer",
            "tokenizer_options": {
                "model_name": "text-embedding-3-small"
            },
            "merge_across_pages": True,
            "max_tokens": 512,
        },
    )
```

The full set of chunking options is documented [here](/docparse/chunking_strategies).

## Asynchronous requests

If you need to submit a large number of partitioning requests at once, we recommend using the asynchronous API. `partition_file_async_submit` submits a file partitioning task to Aryn and returns its `task_id`. Pass that `task_id` to `partition_file_async_result` to keep track of the request and retrieve its result. To learn more, see the [documentation](/sdk-reference/partition#partition-file-async-submit) and the [tutorial](/docparse/tutorials/async_requests_tutorial).
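Tracking a task usually means polling until it is no longer pending. The sketch below shows a generic polling loop that runs without the SDK by taking the fetch function as an argument. It assumes, without confirmation from this page, that the result is a dict with a `task_status` key whose value is `"pending"` while work is in progress; the helper name `wait_for_result` and the stub responses are hypothetical. Check the linked reference for the actual response shape.

```python
import time


def wait_for_result(fetch_result, task_id, poll_seconds=5.0, max_polls=60):
    """Poll `fetch_result(task_id)` until the task is no longer pending.

    Assumption: the returned dict has a "task_status" key that equals
    "pending" while the task is still running.
    """
    for _ in range(max_polls):
        result = fetch_result(task_id)
        if result.get("task_status") != "pending":
            return result
        time.sleep(poll_seconds)
    raise TimeoutError(f"task {task_id} still pending after {max_polls} polls")


# Stub that reports "pending" once and then a finished task (made-up data).
_responses = iter([
    {"task_status": "pending"},
    {"task_status": "done", "result": {"elements": []}},
])
final = wait_for_result(lambda task_id: next(_responses), "demo-task", poll_seconds=0)
print(final["task_status"])
```

In real use, `fetch_result` would be `partition_file_async_result` and `task_id` the value returned by `partition_file_async_submit`.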
