You can use Aryn DocParse to chunk and extract data from complex documents and return structured output in JSON or Markdown. DocParse can process more than 30 document formats, including PDF, Microsoft Word (.docx and .doc), and Microsoft PowerPoint (.pptx and .ppt).

We show you how to get started with DocParse through the DocParse UI, the Python aryn-sdk client, or curl. For building multi-stage document ETL pipelines that use DocParse for parsing, visit the Sycamore documentation.

You will need an Aryn DocParse API Key, which you can get and use for free at aryn.ai/get-started.

Using the DocParse UI

After you sign up and get your Aryn DocParse API key, go to the DocParse UI.

Next, select a document to parse and choose the DocParse options (e.g., OCR). Click “Chunk document,” and DocParse will process your document. The UI is limited to 25 pages per document, so only the first 25 pages are processed; for larger documents, use the aryn-sdk.

Once the document is processed, you will see a visualization of the document segmentation with labeled bounding boxes. You can download the structured JSON output that DocParse produces, as well as the visual of the segmented PDF. If you prefer Markdown output, use the aryn-sdk.

Now that you have seen how DocParse can segment complex documents, extract tables, and more, you can use the aryn-sdk to leverage DocParse in your application or the Sycamore document ETL library to load the output into vector databases.

For additional questions on getting started, please join the Slack community here or email us.

Using the DocParse aryn-sdk

The DocParse aryn-sdk client is a thin Python library that calls Aryn DocParse and provides a few utility methods around it. It is the easiest way to add Aryn DocParse to your applications or custom data processing pipelines. You can view an example in this notebook.

For more information, see the Aryn SDK documentation or API reference.
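As a rough illustration of the SDK route, the sketch below wraps a single DocParse call in a small helper. It rests on several assumptions that are not stated on this page: that the package is installed with `pip install aryn-sdk`, that `partition_file` can be imported from `aryn_sdk.partition` and accepts an `aryn_api_key` keyword, and that your key is stored in the `ARYN_API_KEY` environment variable. The helper name `parse_document` is hypothetical. Check the SDK documentation for the exact signature and available options.

```python
import os
from pathlib import Path


def parse_document(path, api_key=None):
    """Send one local file to DocParse and return the parsed result.

    Hypothetical helper; the aryn-sdk call inside reflects an assumption
    about the SDK's interface, not a verified signature.
    """
    file_path = Path(path)
    # Fail early, before any network call, if the file is missing.
    if not file_path.is_file():
        raise FileNotFoundError(f"No such document: {file_path}")

    api_key = api_key or os.environ.get("ARYN_API_KEY")
    if not api_key:
        raise ValueError("Set ARYN_API_KEY or pass api_key explicitly")

    # Imported lazily so the checks above work without the SDK installed.
    from aryn_sdk.partition import partition_file  # assumed import path

    with file_path.open("rb") as f:
        return partition_file(f, aryn_api_key=api_key)
```

With the SDK installed and the key exported, `parse_document("document.pdf")` would return the parsed result, which is assumed here to be a dictionary mirroring the JSON that the API returns.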

Using curl

We recommend using the aryn-sdk, but you can also use curl to access Aryn DocParse directly.

If you do not already have a document, download an example to use with DocParse:

curl -L https://arxiv.org/pdf/1706.03762 -o document.pdf

Replace PUT API KEY HERE below with your Aryn DocParse API key. To use a different document, change @document.pdf to @/path/to/your/document.pdf.

export ARYN_API_KEY="PUT API KEY HERE"
curl -s -N -D headers "https://api.aryn.cloud/v1/document/partition" -H "Authorization: Bearer $ARYN_API_KEY" -F "file=@document.pdf" | tee document.json

The results are saved to document.json, and the response headers are written to a file named headers. To view the results:

cat document.json
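To get a quick overview of the saved results instead of reading the raw JSON, a short script can tally the parsed elements by type. This is a minimal sketch that assumes document.json holds an object with a top-level `elements` list whose entries each carry a `type` field (for example `Title` or `Table`); adjust the keys if your output differs. The function names `summarize_elements` and `summarize_file` are hypothetical.

```python
import json
from collections import Counter


def summarize_elements(result):
    """Count parsed elements by their type label.

    Assumes `result` is a dict with an "elements" list whose items
    have a "type" key; items without one are counted as "unknown".
    """
    elements = result.get("elements", [])
    return Counter(el.get("type", "unknown") for el in elements)


def summarize_file(path="document.json"):
    """Load a saved DocParse response and summarize it."""
    with open(path, encoding="utf-8") as f:
        return summarize_elements(json.load(f))
```

After running the curl command above, `summarize_file().most_common()` lists each element type with its count, most frequent first.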

Different file formats

The same endpoint accepts other file formats; only the file passed with -F changes. Run the command that matches your document. Each command writes its output to document.json, so a later run overwrites the results of an earlier one.

export ARYN_API_KEY="PUT API KEY HERE"
# PDF
curl -s -N -D headers "https://api.aryn.cloud/v1/document/partition" -H "Authorization: Bearer $ARYN_API_KEY" -F "file=@document.pdf" | tee document.json
# Microsoft Word (.docx)
curl -s -N -D headers "https://api.aryn.cloud/v1/document/partition" -H "Authorization: Bearer $ARYN_API_KEY" -F "file=@document.docx" | tee document.json
# Microsoft Word (.doc)
curl -s -N -D headers "https://api.aryn.cloud/v1/document/partition" -H "Authorization: Bearer $ARYN_API_KEY" -F "file=@document.doc" | tee document.json
# Microsoft PowerPoint (.pptx)
curl -s -N -D headers "https://api.aryn.cloud/v1/document/partition" -H "Authorization: Bearer $ARYN_API_KEY" -F "file=@document.pptx" | tee document.json
# Microsoft PowerPoint (.ppt)
curl -s -N -D headers "https://api.aryn.cloud/v1/document/partition" -H "Authorization: Bearer $ARYN_API_KEY" -F "file=@document.ppt" | tee document.json
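When several files in different formats need parsing, a small wrapper can run the curl command once per file and keep the outputs separate. The sketch below builds the same curl invocation used on this page for each input and writes each response to its own JSON file. It is an illustration only: the helper names and the output naming scheme are hypothetical, the extension list covers just the formats named on this page, and the `-D headers` option is left out for brevity.

```python
import os
import subprocess
from pathlib import Path

API_URL = "https://api.aryn.cloud/v1/document/partition"
# Only the formats named on this page; DocParse supports more.
KNOWN_EXTENSIONS = {".pdf", ".docx", ".doc", ".pptx", ".ppt"}


def build_curl_command(path, api_key):
    """Return the curl argument list for one document."""
    file_path = Path(path)
    if file_path.suffix.lower() not in KNOWN_EXTENSIONS:
        raise ValueError(f"Extension not in this sketch's list: {file_path.suffix}")
    return [
        "curl", "-s", "-N",
        API_URL,
        "-H", f"Authorization: Bearer {api_key}",
        "-F", f"file=@{file_path}",
    ]


def output_path_for(path):
    """Name the output after the input, e.g. report.docx -> report.docx.json."""
    file_path = Path(path)
    return file_path.with_name(file_path.name + ".json")


def partition_all(paths):
    """Run curl once per file, saving each response to its own file."""
    api_key = os.environ["ARYN_API_KEY"]
    for path in paths:
        cmd = build_curl_command(path, api_key)
        with open(output_path_for(path), "wb") as out:
            subprocess.run(cmd, stdout=out, check=True)
```

Keeping the full input name in the output name means document.doc and document.docx do not collide. For example, `partition_all(["document.pdf", "slides.pptx"])` would write document.pdf.json and slides.pptx.json.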

Next steps

  • To load your parsed documents into a vector database, use Sycamore to create a document ETL pipeline in Python for additional processing and loading. Sycamore is a scalable, open-source document ETL library that integrates with DocParse. You can check out an example notebook here.

  • To use DocParse with LangChain, check out the example notebook here.

  • To extract tables from your documents and run analytics on them, visit here.

  • To extract images from your documents and process them directly, visit here.