You can use Aryn DocParse to chunk and extract data from complex documents and return structured output in JSON or Markdown. DocParse can process 30+ document formats, including PDF, Microsoft Word (.docx and .doc), Microsoft PowerPoint (.pptx and .ppt), and more.

This guide shows you how to get started with DocParse through the Aryn Playground, the Python aryn-sdk client, or curl. For building document ETL pipelines with DocParse, visit the DocPrep Quickstart. You will need an Aryn Cloud API key, which is free to get and use; sign up at aryn.ai/get-started. The API key is sent to you by email after you sign up.

Using the Aryn Playground

After you sign up and get your Aryn Cloud API key, go to the Aryn Playground. Click DocParse to open the DocParse Playground UI.

Next, select a document to parse and choose the DocParse options (e.g., OCR). Click “Chunk document,” and DocParse will process the first 25 pages of your PDF. The UI is limited to 25 pages per document, so use the aryn-sdk for larger documents.

Once the document is processed, you will see a visualization of the document segmentation with labeled bounding boxes. You can download the structured JSON output from DocParse, which can be used in document processing and ETL workflows. You can also download the visualization of the segmented PDF.

Now that you have seen how DocParse can segment complex documents, extract tables, and more, you can use the aryn-sdk in your application, or use DocPrep to generate document ETL code that loads the output into vector databases.

For additional questions about getting started, join the Slack community or email us.

Using the aryn-sdk

The aryn-sdk client is a thin Python library that calls Aryn DocParse and provides a few utility methods around it. It is the easiest way to add Aryn DocParse to your applications or custom data processing pipelines. You can view an example in this notebook.

For more information, see the Aryn SDK documentation or API reference.
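As a rough illustration of what a call through the SDK might look like, here is a minimal sketch. It assumes the package is installed (for example with pip install aryn-sdk), that aryn_sdk.partition exposes partition_file accepting an open binary file and an aryn_api_key argument, and that the key is stored in the ARYN_API_KEY environment variable. The helper names resolve_api_key and parse_document are hypothetical; check the SDK documentation for the exact signature before relying on this.

```python
import os


def resolve_api_key(env=os.environ):
    """Return the Aryn Cloud API key from ARYN_API_KEY, or raise if it is unset."""
    key = env.get("ARYN_API_KEY", "").strip()
    if not key or key == "PUT API KEY HERE":
        raise RuntimeError("Set ARYN_API_KEY to your Aryn Cloud API key")
    return key


def parse_document(path):
    """Send one local file to DocParse and return whatever the SDK returns."""
    # Imported lazily so this module still loads when aryn-sdk is not installed.
    # Assumption: partition_file(file, aryn_api_key=...) is the SDK entry point.
    from aryn_sdk.partition import partition_file

    with open(path, "rb") as f:
        return partition_file(f, aryn_api_key=resolve_api_key())


# Example (requires network access and a valid key):
#   result = parse_document("document.pdf")
```

Reading the key from the environment keeps it out of source files and matches the ARYN_API_KEY variable used by the curl instructions in this guide.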

Using curl

We recommend using the aryn-sdk, but you can also use curl to access Aryn DocParse directly.

If you do not already have a document, download an example with curl. The -L flag follows any redirects so that the PDF itself is saved.

curl -L https://arxiv.org/pdf/1706.03762 -o document.pdf

Replace PUT API KEY HERE below with your Aryn Cloud API key. If you are using a different document, change @document.pdf to @/path/to/your/document.pdf.

export ARYN_API_KEY="PUT API KEY HERE"
curl -s -N -D headers "https://api.aryn.cloud/v1/document/partition" -H "Authorization: Bearer $ARYN_API_KEY" -F "file=@document.pdf" | tee document.json

Your results have been saved to document.json.

cat document.json
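Reading the raw JSON with cat can be unwieldy for a long document. The sketch below shows one way to check the two files the curl command writes: the response headers saved by -D headers, and document.json. It assumes document.json holds a single JSON object with a top-level "elements" list whose entries carry a "type" field; the helper names http_status and count_element_types are hypothetical.

```python
import json
import re
from collections import Counter


def http_status(header_text):
    """Return the status code from the last HTTP status line, or None."""
    # Matches lines such as "HTTP/1.1 200 OK" and "HTTP/2 200". The last match
    # is used because curl may record an interim "100 Continue" block first.
    codes = re.findall(r"^HTTP/\d+(?:\.\d+)?\s+(\d{3})", header_text, flags=re.MULTILINE)
    return int(codes[-1]) if codes else None


def count_element_types(doc):
    """Count parsed elements by their "type" field."""
    # Assumption: the response has a top-level "elements" list of objects,
    # each labeled with a "type".
    elements = doc.get("elements", [])
    return Counter(el.get("type", "unknown") for el in elements)


# Usage after running the curl command:
#   with open("headers") as f:
#       print(http_status(f.read()))
#   with open("document.json") as f:
#       print(count_element_types(json.load(f)))
```

Checking the status code first helps distinguish an authentication or request error from a successful parse before you inspect the elements.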

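Different file formats

The request is the same for other supported formats; only the uploaded file changes. Run the command that matches your document. Each command writes its result to document.json.

export ARYN_API_KEY="PUT API KEY HERE"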
curl -s -N -D headers "https://api.aryn.cloud/v1/document/partition" -H "Authorization: Bearer $ARYN_API_KEY" -F "file=@document.pdf" | tee document.json
curl -s -N -D headers "https://api.aryn.cloud/v1/document/partition" -H "Authorization: Bearer $ARYN_API_KEY" -F "file=@document.docx" | tee document.json
curl -s -N -D headers "https://api.aryn.cloud/v1/document/partition" -H "Authorization: Bearer $ARYN_API_KEY" -F "file=@document.doc" | tee document.json
curl -s -N -D headers "https://api.aryn.cloud/v1/document/partition" -H "Authorization: Bearer $ARYN_API_KEY" -F "file=@document.pptx" | tee document.json
curl -s -N -D headers "https://api.aryn.cloud/v1/document/partition" -H "Authorization: Bearer $ARYN_API_KEY" -F "file=@document.ppt" | tee document.json
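Each request above differs only in the uploaded file, so when you parse several documents it can help to generate the commands and give every result its own output file, so that one response does not overwrite another. The following is a minimal sketch of that idea. The endpoint and headers are taken from the curl commands above; the helper name build_curl_command and the "<input name>.json" output naming are assumptions made for illustration.

```python
from pathlib import Path

# Endpoint as used in the curl commands in this guide.
ENDPOINT = "https://api.aryn.cloud/v1/document/partition"


def build_curl_command(input_path, api_key):
    """Return (argv, output_path) for one DocParse request via curl."""
    src = Path(input_path)
    # Hypothetical naming scheme: report.docx -> report.docx.json
    out = src.with_name(src.name + ".json")
    argv = [
        "curl", "-s", "-N", ENDPOINT,
        "-H", f"Authorization: Bearer {api_key}",
        "-F", f"file=@{src}",
        "-o", str(out),
    ]
    return argv, out


# Usage (requires curl, network access, and a valid key):
#   import os, subprocess
#   for name in ["document.pdf", "document.docx", "document.pptx"]:
#       argv, out = build_curl_command(name, os.environ["ARYN_API_KEY"])
#       subprocess.run(argv, check=True)
```

Passing the command as an argument list rather than a shell string avoids quoting problems with file names that contain spaces.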

Next steps

  • To load your parsed documents into a vector database, use DocPrep to create a document ETL pipeline in Python for additional processing and loading. It leverages Sycamore, a scalable, open source document ETL library.

  • To use DocParse with LangChain, check out this example notebook.

  • To extract tables from your documents and run analytics on them, visit here.

  • To extract images from your documents and process them directly, visit here.

  • To use the Sycamore document ETL library directly with DocParse, check out this example notebook. It walks through using Sycamore to transform your data and load it into a vector database.
