Installation
We recommend installing the Aryn SDK library using pip:
pip install aryn-sdk
Partitioning a Document
Partition a document like so:
from aryn_sdk.partition import partition_file
with open("mydocument.pdf", "rb") as f:
    data = partition_file(f)
partition_file accepts the same options as the HTTP API used with curl, passed as keyword arguments. The API documentation lists the available options.
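Because options are plain keyword arguments, a set of defaults can be kept in a dict and unpacked with `**`. The sketch below assumes nothing about the SDK beyond that calling convention: `partition_with_defaults` and `fake_partition` are hypothetical names, and `partition_fn` stands in for `partition_file` so the example runs offline. The option names are the ones used elsewhere on this page.

```python
def partition_with_defaults(partition_fn, source, defaults, **overrides):
    """Call partition_fn(source, **options), letting overrides win over defaults.

    In real use partition_fn would be aryn_sdk.partition.partition_file; it is
    a parameter here so the sketch runs without network access or an API key.
    """
    options = {**defaults, **overrides}
    return partition_fn(source, **options)


# Hypothetical defaults; the option names appear elsewhere on this page.
DEFAULTS = {"extract_table_structure": True, "threshold": 0.35}


def fake_partition(source, **kwargs):
    """Stand-in that echoes the options it received (not part of aryn_sdk)."""
    return {"source": source, "options": kwargs}


result = partition_with_defaults(
    fake_partition, "mydocument.pdf", DEFAULTS, threshold=0.5
)
print(result["options"])
```

Merging into a new dict leaves `DEFAULTS` untouched, so the same defaults can be reused across many calls.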
To extract properties from a document, create a schema by hand or get help from DocParse Suggest Properties, then provide the schema in property_extraction_options like so:
from aryn_sdk.partition import partition_file
property_email = {
    "name": "vendor_email",
    "type": {
        "type": "string",
        "description": "Vendor's email address, if not provided, return null",
        "examples": ["billing@abcservices.com", "john@consulting.com"],
        "validators": [
            {
                "type": "regex",
                "regex": "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
            }
        ]
    }
}
schema = {"properties": [property_email]}
with open("my_invoice.pdf", "rb") as f:
    data = partition_file(f, property_extraction_options={"schema": schema})
You can add "voting": True to the property_extraction_options dictionary to enable voting across multiple LLMs.
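Since the validator is an ordinary regular expression, it can be worth checking it against the schema's own examples before sending a request. The following is a minimal local check using Python's `re` module. `check_examples` is a hypothetical helper, not part of aryn_sdk, and the sketch assumes the service applies the pattern with semantics close to Python's.

```python
import re

property_email = {
    "name": "vendor_email",
    "type": {
        "type": "string",
        "description": "Vendor's email address, if not provided, return null",
        "examples": ["billing@abcservices.com", "john@consulting.com"],
        "validators": [
            {
                "type": "regex",
                "regex": "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$",
            }
        ],
    },
}


def check_examples(prop):
    """Return the examples that fail at least one regex validator on the property."""
    spec = prop["type"]
    patterns = [
        re.compile(validator["regex"])
        for validator in spec.get("validators", [])
        if validator.get("type") == "regex"
    ]
    return [
        example
        for example in spec.get("examples", [])
        if not all(pattern.search(example) for pattern in patterns)
    ]


failing = check_examples(property_email)
print(failing)
```

An empty list means every example satisfies the validators. A non-empty list points at either a bad example or an overly strict pattern.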
Key management
By default, aryn-sdk looks for Aryn API keys first in the environment variable ARYN_API_KEY, and then in ~/.aryn/config.yaml. You can override this behavior by specifying a key directly or a different path to the Aryn config file:
from aryn_sdk.partition import partition_file
from aryn_sdk.config import ArynConfig
with open("mydocument.pdf", "rb") as f:
    data = partition_file(f, aryn_api_key="YOUR-API-KEY")
with open("mydocument.pdf", "rb") as f:
    data = partition_file(f, aryn_config=ArynConfig(aryn_config_path="~/dotfiles/.aryn/config.yaml"))
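The lookup order described above can be pictured with a small sketch. This is an illustration of the precedence only, not the SDK's implementation: `resolve_api_key` and `read_config_key` are hypothetical names, the config file is not actually parsed, and the sketch assumes that an explicitly passed key takes priority over both the environment variable and the config file.

```python
import os


def resolve_api_key(explicit_key=None, environ=None, read_config_key=None):
    """Illustrate the order: explicit key, then ARYN_API_KEY, then the config file.

    read_config_key is a hypothetical callable that returns the key stored in
    the config file, or None. The SDK's real config parsing is not reproduced.
    """
    environ = os.environ if environ is None else environ
    if explicit_key:
        return explicit_key
    env_key = environ.get("ARYN_API_KEY")
    if env_key:
        return env_key
    if read_config_key is not None:
        return read_config_key()
    return None


# Made-up values, used only to show which source wins.
key = resolve_api_key(
    environ={"ARYN_API_KEY": "key-from-env"},
    read_config_key=lambda: "key-from-config",
)
print(key)
```

Passing `environ` as a plain dict keeps the example independent of the machine it runs on.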
Helper Functions
aryn_sdk provides some helper functions to make working with and visualizing the output of partition_file easier.
from aryn_sdk.partition import partition_file, table_elem_to_dataframe, table_elem_to_html, draw_with_boxes
with open("mydocument.pdf", "rb") as f:
    data = partition_file(f, extract_table_structure=True, text_mode="standard_ocr", extract_images=True, threshold=0.35)
# Produce a pandas DataFrame representing one of the extracted tables
table_elements = [elt for elt in data['elements'] if elt['type'] == 'table']
dataframe = table_elem_to_dataframe(table_elements[0])
# Convert the first table into HTML.
html = table_elem_to_html(table_elements[0])
# Draw the detected bounding boxes on the pages. Requires poppler.
images = draw_with_boxes("mydocument.pdf", data)
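The example above indexes `table_elements[0]`, which raises an `IndexError` when no table is detected. A minimal guard might look like the sketch below. `tables_in` is a hypothetical helper, and the sample response is made up: it relies only on the `elements` and `type` keys shown on this page, while real elements carry more fields.

```python
def tables_in(data):
    """Return all elements whose type is 'table' (possibly an empty list)."""
    return [elt for elt in data.get("elements", []) if elt.get("type") == "table"]


# Made-up response fragment; the non-table type name is a placeholder.
sample = {"elements": [{"type": "Text"}, {"type": "table"}, {"type": "table"}]}

tables = tables_in(sample)
if tables:
    # In real use, pass tables[0] to table_elem_to_dataframe or table_elem_to_html.
    print(f"found {len(tables)} table(s)")
else:
    print("no tables detected")
```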
aryn-sdk processes files in different formats through the same call:
from aryn_sdk.partition import partition_file
with open("mydocument.pdf", "rb") as f:
    data = partition_file(f)
with open("mydocument.docx", "rb") as f:
    data = partition_file(f)
with open("mydocument.doc", "rb") as f:
    data = partition_file(f)
with open("mypresentation.pptx", "rb") as f:
    data = partition_file(f)
with open("mypresentation.ppt", "rb") as f:
    data = partition_file(f)
The Aryn SDK also supports document input via a file path or URL to a document hosted on a remote server:
from aryn_sdk.partition import partition_file
data = partition_file("/home/jdoe/Documents/Memo.pdf")
data = partition_file("file:///home/jdoe/Documents/Memo.pdf")
data = partition_file("https://www.example.com/assets/proposal.pdf")
data = partition_file("https://aryn-public.s3.amazonaws.com/partitioner-blog-data/crispr.pdf")
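When inputs arrive as strings from mixed sources, it can help to log or validate which of the three forms above each one takes before submitting it. The sketch below uses the standard library's `urllib.parse` for that. `classify_source` is a hypothetical helper, and how the SDK itself tells these forms apart is not covered here.

```python
from urllib.parse import urlparse


def classify_source(source):
    """Classify a string as 'url', 'file_url', or 'path' (illustrative only)."""
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        return "url"
    if scheme == "file":
        return "file_url"
    # Anything else, including strings with no scheme, is treated as a local path.
    return "path"


for source in (
    "/home/jdoe/Documents/Memo.pdf",
    "file:///home/jdoe/Documents/Memo.pdf",
    "https://www.example.com/assets/proposal.pdf",
):
    print(classify_source(source), source)
```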
Chunking a document
Chunking support was added in v0.1.9. You can enable the default chunking options by specifying an empty dict:
from aryn_sdk.partition import partition_file
with open("mydocument.pdf", "rb") as f:
    data = partition_file(f, chunking_options={})
Here is an example specifying certain chunking options:
from aryn_sdk.partition import partition_file
with open("mydocument.pdf", "rb") as f:
    data = partition_file(
        f,
        chunking_options={
            "strategy": "context_rich",
            "tokenizer": "openai_tokenizer",
            "tokenizer_options": {
                "model_name": "text-embedding-3-small"
            },
            "merge_across_pages": True,
            "max_tokens": 512,
        }
    )
The full set of chunking options is described in the chunking documentation.
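To build intuition for what a token budget such as max_tokens does, here is a deliberately simplified greedy merge that runs locally. It is not the context_rich strategy and not Aryn's algorithm: `greedy_chunks` is a hypothetical function, and whitespace-separated words stand in for the tokens a real tokenizer would produce.

```python
def greedy_chunks(texts, max_tokens):
    """Merge consecutive texts into chunks of at most max_tokens whitespace tokens.

    A single text longer than max_tokens is kept whole as its own oversized
    chunk; this sketch never splits an individual text.
    """
    chunks, current, current_len = [], [], 0
    for text in texts:
        n_tokens = len(text.split())
        # Close the current chunk if adding this text would exceed the budget.
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(text)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = greedy_chunks(["a b", "c d e", "f"], max_tokens=4)
print(chunks)
```

A smaller budget produces more, shorter chunks, and a larger one merges more neighbouring elements together.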
Asynchronous requests
If you need to submit a large number of partitioning requests at once, we recommend using the asynchronous version of the API, partition_file_async_submit. It submits a file partitioning task to Aryn and returns the task's task_id.
Pass the returned task_id to partition_file_async_result to keep track of the request and retrieve its result. To learn more, check out the documentation and the tutorial.
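A submit-then-poll loop is one common way to consume such an API. The sketch below shows only the polling half and takes the result function as a parameter so that it runs offline. In real use `get_result` would be `partition_file_async_result`. The `task_status` key and the `pending` value are assumptions about the response shape, so check them against the documentation. `wait_for_result` and `fake_get_result` are hypothetical names.

```python
import time


def wait_for_result(get_result, task_id, poll_interval=5.0, timeout=600.0,
                    status_key="task_status", pending_value="pending",
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_result(task_id) until the task is no longer pending.

    status_key and pending_value are assumptions about the response shape.
    Raises TimeoutError if the task is still pending once timeout has elapsed.
    """
    deadline = clock() + timeout
    while True:
        response = get_result(task_id)
        if response.get(status_key) != pending_value:
            return response
        if clock() >= deadline:
            raise TimeoutError(f"task {task_id} still pending after {timeout} seconds")
        sleep(poll_interval)


# Made-up responses standing in for the service: pending twice, then finished.
_responses = iter([
    {"task_status": "pending"},
    {"task_status": "pending"},
    {"task_status": "done", "result": {"elements": []}},
])


def fake_get_result(task_id):
    """Stand-in for partition_file_async_result (not part of aryn_sdk)."""
    return next(_responses)


final = wait_for_result(fake_get_result, "hypothetical-task-id", sleep=lambda s: None)
print(final["task_status"])
```

Injecting `sleep` and `clock` keeps the loop easy to test. With the defaults it waits five seconds between polls and gives up after ten minutes.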