Introduction - Aryn Documentation

Aryn DocParse is a compound AI system for parsing, chunking, enriching, and storing unstructured documents at scale. It uses a set of purpose-built AI models for document segmentation, optical character recognition (OCR), and extracting tables, images, metadata, and more.

Key Features

Return the structured output of each document in JSON or Markdown, and provide labeled bounding boxes for titles, tables, table rows and columns, images, and regular text.
High-quality AI models for complex table extraction, optical character recognition (OCR), image summarization, and more.
Process over 30 document formats, including PDF, Microsoft Word, Microsoft PowerPoint, plain text, and more.
Store and index processed documents, extract metadata using GenAI, and search your documents at scale with vector (semantic) or keyword search.
Optional integration with Python document ETL pipelines using the open-source Sycamore document ETL library. Customize your pipeline with additional data transforms, LLM-based entity extraction, data enrichment, data cleaning, and loading into vector databases and search engines.
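
To make the structured output concrete, here is a minimal sketch of how labeled elements with bounding boxes might be consumed once a document has been parsed to JSON. The sample payload is hypothetical: the field names (`elements`, `type`, `bbox`, `text_representation`) and the normalized coordinate convention are assumptions for illustration, not a statement of the DocParse schema, so check the API reference for the actual response format.

```python
import json
from collections import defaultdict

# Hypothetical sample shaped like a parsed-document response.
# Field names and the [x1, y1, x2, y2] normalized bbox layout are assumptions.
SAMPLE_OUTPUT = """
{
  "elements": [
    {"type": "Title", "bbox": [0.10, 0.05, 0.90, 0.10], "text_representation": "Quarterly Report"},
    {"type": "Text", "bbox": [0.10, 0.12, 0.90, 0.30], "text_representation": "Revenue grew in Q3."},
    {"type": "Table", "bbox": [0.10, 0.35, 0.90, 0.60], "text_representation": "Region Revenue"},
    {"type": "Text", "bbox": [0.10, 0.65, 0.90, 0.80], "text_representation": "Outlook remains stable."}
  ]
}
"""


def group_by_label(parsed):
    """Group elements by their label so tables, titles, and text can be handled separately."""
    groups = defaultdict(list)
    for element in parsed.get("elements", []):
        groups[element["type"]].append(element)
    return dict(groups)


def bbox_area(bbox):
    """Area of an [x1, y1, x2, y2] box; degenerate boxes yield 0."""
    x1, y1, x2, y2 = bbox
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


parsed = json.loads(SAMPLE_OUTPUT)
groups = group_by_label(parsed)

# Print how many elements carry each label.
for label, elements in groups.items():
    print(label, len(elements))
```

Grouping by label first keeps downstream steps simple: for example, table elements can be routed to a table-specific handler while text elements go to a chunker.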

You can use DocParse to prepare complex, unstructured data for retrieval-augmented generation (RAG) applications, document processing workflows, content extraction (such as tables), and semantic search systems. Sign up for free to use DocParse. You can use the DocParse UI to visualize your parsing and extraction, or get an API key and use the Aryn SDK. You can learn more from our introduction video or get started with a Quickstart.

If you are interested in the Aryn Platform (an agentic unstructured data warehouse), visit the Aryn Platform documentation. The Aryn Platform uses DocParse under the hood to parse and process documents when ingesting them.