Introduction
Welcome to Aryn DocParse!
Aryn DocParse is a composite AI system for parsing, chunking, enriching, and storing unstructured documents at scale. It uses a set of purpose-built AI models for document segmentation, optical character recognition (OCR), and extraction of tables, images, metadata, and more.
Key Features
- Return the structured output of each document in JSON or Markdown, with labeled bounding boxes for titles, tables, table rows and columns, images, and regular text.
- Use high-quality AI models for complex table extraction, OCR, image summarization, and more.
- Process over 30 document formats, including PDF, Microsoft Word, Microsoft PowerPoint, and plain text.
- Store and index processed documents, extract metadata using GenAI, and search your documents at scale with vector (semantic) or keyword search.
- Optionally integrate with Python document ETL pipelines using the open source Sycamore document ETL library. Customize your pipeline with additional data transforms, LLM-based entity extraction, data enrichment, data cleaning, and loading into vector databases and search engines.
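As a rough illustration of working with the labeled JSON output described above, the sketch below groups parsed elements by their label and computes the area of each bounding box. The sample data is made up, and the field names (`elements`, `type`, `bbox`, `text_representation`, `properties.page_number`) as well as the `[x1, y1, x2, y2]` box layout are assumptions for this example; check the API Reference for the actual response schema.

```python
from collections import defaultdict

# Hypothetical sample shaped like a DocParse JSON response.
# Field names and the [x1, y1, x2, y2] bbox layout are assumptions.
sample_output = {
    "elements": [
        {"type": "Title", "bbox": [0.0, 0.0, 0.5, 0.5],
         "properties": {"page_number": 1},
         "text_representation": "Quarterly Report"},
        {"type": "Text", "bbox": [0.1, 0.2, 0.9, 0.4],
         "properties": {"page_number": 1},
         "text_representation": "Revenue grew during the quarter."},
        {"type": "table", "bbox": [0.1, 0.5, 0.9, 0.8],
         "properties": {"page_number": 1},
         "text_representation": "Region | Revenue"},
        {"type": "Text", "bbox": [0.1, 0.1, 0.9, 0.3],
         "properties": {"page_number": 2},
         "text_representation": "Outlook for next quarter."},
    ]
}


def group_by_label(output):
    """Group elements by their label (the assumed `type` field)."""
    groups = defaultdict(list)
    for element in output.get("elements", []):
        groups[element.get("type", "unknown")].append(element)
    return dict(groups)


def bbox_area(bbox):
    """Area of a box given as [x1, y1, x2, y2]; degenerate boxes give 0."""
    x1, y1, x2, y2 = bbox
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


groups = group_by_label(sample_output)
for label, elements in groups.items():
    # Report how many elements carry each label and their total box area.
    total_area = sum(bbox_area(e["bbox"]) for e in elements)
    print(f"{label}: {len(elements)} element(s), total bbox area {total_area:.3f}")
```

Grouping by label first makes it easy to route tables, images, and plain text to different downstream steps, such as table extraction or chunking for RAG.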
You can use DocParse to prepare complex, unstructured data for retrieval-augmented generation (RAG) applications, document processing workflows, content extraction (such as tables), and semantic search systems.
Sign up here for free to get an API key and use the DocParse Playground UI to visualize how your document is processed.
You can learn more from our introduction video or get started with the Quickstart.
Getting started
Sign up here for free for an API key to get started with DocParse.
Quickstart
Get Started with Aryn DocParse
Use the Aryn-SDK
Using the Aryn-SDK to call DocParse
Use DocParse UI
Access the DocParse UI to visualize how your documents will be partitioned
Slack Community
Join the Slack community for any questions
API Reference
Aryn DocParse API Reference
Aryn DocParse SDK Reference
Aryn DocParse Python SDK Reference