Installation

We recommend installing the Aryn SDK library using pip:

pip install aryn-sdk

Partitioning a Document

Partition a document like so:

from aryn_sdk.partition import partition_file
with open("mydocument.pdf", "rb") as f:
   data = partition_file(f)

partition_file takes the same options as curl, except as keyword arguments. You can find a list of options here.

Key management

By default, aryn-sdk looks for Aryn API keys first in the environment variable ARYN_API_KEY, and then in ~/.aryn/config.yaml. You can override this behavior by specifying a key directly or a different path to the Aryn config file:

from aryn_sdk.partition import partition_file
from aryn_sdk.config import ArynConfig
with open("mydocument.pdf", "rb") as f:
   data = partition_file(f, aryn_api_key="YOUR-API-KEY")
with open("mydocument.pdf", "rb") as f:
   data = partition_file(f, aryn_config=ArynConfig(aryn_config_path="~/dotfiles/.aryn/config.yaml"))

Helper Functions

aryn_sdk provides some helper functions to make working with and visualizing the output of partition_file easier.

from aryn_sdk.partition import partition_file, table_elem_to_dataframe, draw_with_boxes
with open("mydocument.pdf", "rb") as f:
   data = partition_file(f, extract_table_structure=True, use_ocr=True, extract_images=True, threshold=0.35)

# Produce a pandas DataFrame representing one of the extracted tables
table_elements = [elt for elt in data['elements'] if elt['type'] == 'table']
dataframe = table_elem_to_dataframe(table_elements[0])

# Draw the detected bounding boxes on the pages. requires poppler
images = draw_with_boxes("mydocument.pdf", data)

Different File Formats

It is easy to process files with different formats using the aryn-sdk:

from aryn_sdk.partition import partition_file
with open("mydocument.pdf", "rb") as f:
   data = partition_file(f)
with open("mydocument.docx", "rb") as f:
   data = partition_file(f)
with open("mypresentation.doc", "rb") as f:
   data = partition_file(f)
with open("mypresentation.pptx", "rb") as f:
   data = partition_file(f)
with open("mypresentation.ppt", "rb") as f:
   data = partition_file(f)

Chunking a document

Chunking support has been added in v0.1.9. You can enable the default chunking options by specifying an empty dict:

from aryn_sdk.partition import partition_file
with open("mydocument.pdf", "rb") as f:
   data = partition_file(f, chunking_options={})

Here is an example specifying certain chunking options:

from aryn_sdk.partition import partition_file
with open("mydocument.pdf", "rb") as f:
   data = partition_file(f, 
      chunking_options={
         "strategy": "context_rich",
         "tokenizer": "openai_tokenizer",
         "tokenizer_options": {
            "model_name": "text-embedding-3-small"
         },
         "merge_across_pages": True,
         "max_tokens": 512,
      }
   )

The full chunking options are documented here.

Partitioning asynchronously

partition_file_async_submit is a function that submits a file partitioning task to Aryn and returns with its task_id. You can use the returned task_id to keep track of your request to partition the file. Poll for the result using another function, partition_file_async_result, (it returns a dict with a “status” key that indicates whether the task is “done”, “pending” or in “error” state). Once the “status” is “done” the dict returned by partition_file_async_result will have a “result” key with the actual results. The following code snippet is an example of how you can use these functions:

import time
from aryn_sdk.partition import partition_file_async_submit, partition_file_async_result

with open("my-favorite-pdf.pdf", "rb") as f:
    response = partition_file_async_submit(
        f,
        use_ocr=True,
        extract_table_structure=True,
    )

task_id = response["task_id"]

# Poll for the results
while True:
    result = partition_file_async_result(task_id)
    if result["status"] != "pending":
        break
    time.sleep(5)

One way to do this with multiple requests at a time is shown below:

import logging
import time
from aryn_sdk.partition import partition_file_async_submit, partition_file_async_result

files = [open("file1.pdf", "rb"), open("file2.docx", "rb")]
task_ids = [None] * len(files)
for i, f in enumerate(files):
    try:
        task_ids[i] = partition_file_async_submit(f)["task_id"]
    except Exception as e:
        logging.warning(f"Failed to submit {f}: {e}")

results = [None] * len(files)
for i, task_id in enumerate(task_ids):
    while True:
        result = partition_file_async_result(task_id)

        if result["status"] != "pending":
            break

        time.sleep(5)

    if result["status"] == "done":
        results[i] = result

## print the results
## will be None if task failed
for result in results:
    print(result)

Optionally, you can also set a webhook for Aryn’s services to call when your task is completed:

partition_file_async_submit("path/to/my/file.docx", webhook_url="https://example.com/alert")

Aryn will POST a request containing a body like the below:

{"done": [{"task_id": "aryn:t-47gpd3604e5tz79z1jro5fc"}]}

For more information, see the Aryn SDK documentation.