Quickstart
Getting Started with Aryn DocParse
Aryn DocParse chunks and extracts data from complex documents and returns structured output in JSON or Markdown. DocParse can process 30+ document formats, including PDF, Microsoft Word (.docx and .doc), Microsoft PowerPoint (.pptx and .ppt), and more.
This guide shows how to get started with DocParse through the Aryn Playground, the Python aryn-sdk client, or curl. To build document ETL pipelines with DocParse, visit the DocPrep Quickstart. You will need an Aryn Cloud API key, which you can get and use for free at aryn.ai/get-started. You will receive the API key via email after you sign up.
Using the Aryn Playground
After you sign up and get your Aryn Cloud API key, go to the Aryn Playground. Click on DocParse to go to the DocParse Playground UI.
Next, select a document to parse and choose the options for DocParse (e.g. OCR). Click on “Chunk document,” and DocParse will process the first 25 pages of your PDF. The UI is limited to 25 pages per document, so use the aryn-sdk if you have a larger document.
Once the document is processed, you will see a visualization of the document segmentation with labeled bounding boxes. You can download the structured JSON output, which is the output of DocParse and can be used in various document processing and ETL workflows. You can also download the visualization of the segmented PDF.
Now that you have seen how DocParse can segment complex documents, extract tables, and more, you can use the aryn-sdk with your application, or use DocPrep to generate document ETL code that loads the output into vector databases.
For additional questions on getting started, please join the Slack community or email us.
Using the aryn-sdk
The aryn-sdk client is a thin Python library that calls Aryn DocParse and provides a few utility methods around it. It is the easiest way to add Aryn DocParse to your applications or custom data processing pipelines. You can view an example in this notebook.
For more information, see the Aryn SDK documentation or API reference.
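As a rough illustration of the call pattern, the sketch below wraps a single DocParse request in a small helper. It assumes the SDK exposes partition_file in aryn_sdk.partition and accepts an aryn_api_key keyword. The helper name parse_document, its partition parameter, and the ARYN_API_KEY environment variable are choices made for this example only, so check the Aryn SDK documentation for the current signature and options.

```python
import os


def parse_document(path, api_key=None, partition=None):
    """Send one local file to DocParse and return the parsed result.

    `parse_document` and its `partition` parameter are hypothetical names
    used for this sketch; they are not part of the aryn-sdk itself.
    """
    if partition is None:
        # Assumption: aryn-sdk provides partition_file in aryn_sdk.partition.
        # Imported lazily so this module still loads where the SDK is absent.
        from aryn_sdk.partition import partition_file as partition

    # Assumption: the key is stored in an ARYN_API_KEY environment variable.
    key = api_key or os.environ.get("ARYN_API_KEY")
    if not key:
        raise ValueError("Set ARYN_API_KEY or pass api_key explicitly")

    # The document is opened in binary mode and handed to the SDK as a file object.
    with open(path, "rb") as f:
        return partition(f, aryn_api_key=key)
```

Calling parse_document("document.pdf") with the key set in the environment should return the parsed result. The partition parameter exists only so the helper can be exercised with a stand-in function when no network access or API key is available.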
Using curl
We recommend using the aryn-sdk, but you can also use curl to access Aryn DocParse directly.
Use curl to download an example document to use with DocParse, if you do not have one already.
Change PUT API KEY HERE below to your Aryn Cloud API key. If you have a different document, change @document.pdf to @/path/to/your/document.pdf below.
Your results have been saved to document.json.
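To get a quick feel for what was saved, a short Python sketch like the one below can load document.json and tally the elements it contains. It assumes the file holds a top-level "elements" list whose entries carry a "type" field. That layout is an assumption for this example, as are the helper names load_result and summarize_elements, so compare against your own output before relying on it.

```python
import json
from collections import Counter


def load_result(path="document.json"):
    """Read a saved DocParse result from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_elements(result):
    """Return (count of elements per type, list of table elements).

    Assumption: `result` has an "elements" list and each element has a
    "type" string; elements without one are counted as "unknown".
    """
    elements = result.get("elements", [])
    by_type = Counter(e.get("type", "unknown") for e in elements)
    # Compare case-insensitively, since the exact label casing is not assumed.
    tables = [e for e in elements if str(e.get("type", "")).lower() == "table"]
    return by_type, tables
```

For example, by_type, tables = summarize_elements(load_result()) gives a per-type count and the table elements, which is a convenient sanity check before passing the output to a downstream pipeline.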
Different File Formats
Next steps
- To load your parsed documents into a vector database, use DocPrep to create a document ETL pipeline in Python for additional processing and loading. It leverages Sycamore, a scalable, open source document ETL library.
- To use DocParse with Langchain, check out this example notebook.
- To extract tables from your documents and run analytics on them, visit here.
- To extract images from your documents and process them directly, visit here.
- To use the Sycamore document ETL library directly with DocParse, check out this example notebook. It walks through an example where you use Sycamore to transform your data and load it into a vector database.