Aryn DocParse is a composite AI system for parsing, chunking, enriching, and storing unstructured documents at scale. It uses a set of purpose-built AI models for document segmentation, optical character recognition (OCR), and extracting tables, images, metadata, and more.

Key Features

  • Return the structured output of each document in JSON or Markdown, and provide labeled bounding boxes for titles, tables, table rows and columns, images, and regular text.

  • Use high-quality AI models for complex table extraction, optical character recognition (OCR), image summarization, and more.

  • Process over 30 document formats, including PDF, Microsoft Word, Microsoft PowerPoint, text, and more.

  • Store and index processed documents, extract metadata using GenAI, and search your documents at scale with vector (semantic) or keyword search.

  • Optional integration with Python document ETL pipelines using the open source Sycamore document ETL library. Customize your pipeline with additional data transforms, LLM-based entity extraction, data enrichment, data cleaning, and loading vector databases and search engines.
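
As an illustration of the labeled, bounding-boxed output mentioned in the first feature, here is a minimal Python sketch that filters a parsed document for tables. The sample data is made up, and the field names (`elements`, `type`, `bbox`, `properties`, `text_representation`) and the relative `[x1, y1, x2, y2]` coordinate convention are assumptions made for this example; check the DocParse output reference for the actual schema.

```python
import json

# Hypothetical sample shaped like a parsed-document response. The field names
# and the relative [x1, y1, x2, y2] bbox convention are assumptions.
SAMPLE_OUTPUT = json.loads("""
{
  "elements": [
    {"type": "Title", "bbox": [0.10, 0.05, 0.90, 0.10],
     "properties": {"page_number": 1},
     "text_representation": "Quarterly Report"},
    {"type": "Text", "bbox": [0.10, 0.12, 0.90, 0.30],
     "properties": {"page_number": 1},
     "text_representation": "Revenue grew in Q3."},
    {"type": "Table", "bbox": [0.10, 0.35, 0.90, 0.60],
     "properties": {"page_number": 2},
     "text_representation": "Region | Revenue"}
  ]
}
""")


def elements_by_type(output, wanted):
    """Return the elements whose label equals `wanted` (for example "Table")."""
    return [el for el in output.get("elements", []) if el.get("type") == wanted]


def bbox_area(bbox):
    """Area of an [x1, y1, x2, y2] box; degenerate boxes yield 0."""
    x1, y1, x2, y2 = bbox
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


tables = elements_by_type(SAMPLE_OUTPUT, "Table")
for el in tables:
    page = el["properties"]["page_number"]
    print(f"Table on page {page}, bbox={el['bbox']}, area={bbox_area(el['bbox']):.2f}")
```

The same filter works for any other label in the output, such as titles or images, by changing the `wanted` argument.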

You can use DocParse to prepare complex, unstructured data for retrieval-augmented generation (RAG) applications, document processing workflows, content extraction (such as pulling tables out of documents), and semantic search systems.
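
To make the RAG preparation step concrete, the sketch below packs parsed elements into text chunks suitable for embedding. It is a generic greedy strategy written for illustration, not DocParse's own chunking; the function name `chunk_elements`, the `max_chars` parameter, and the element fields `type` and `text_representation` are all assumptions.

```python
def chunk_elements(elements, max_chars=200):
    """Greedily pack element texts into chunks, aiming for max_chars each.

    Tables are emitted as their own chunk so rows are not mixed with prose.
    A single element longer than max_chars is kept whole rather than split.
    """
    chunks, current = [], ""
    for el in elements:
        text = el.get("text_representation", "").strip()
        if not text:
            continue  # skip elements with no text, such as unsummarized images
        if el.get("type") == "Table":
            if current:
                chunks.append(current)
                current = ""
            chunks.append(text)
            continue
        candidate = f"{current}\n{text}" if current else text
        if len(candidate) > max_chars and current:
            # Adding this element would overflow: close the chunk, start anew.
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hypothetical elements, shaped like the assumed output schema.
elements = [
    {"type": "Title", "text_representation": "Quarterly Report"},
    {"type": "Text", "text_representation": "Revenue grew in Q3."},
    {"type": "Table", "text_representation": "Region | Revenue"},
    {"type": "Text", "text_representation": "Outlook remains stable."},
]

for i, chunk in enumerate(chunk_elements(elements, max_chars=40)):
    print(i, repr(chunk))
```

Keeping each table in its own chunk is a design choice for this sketch: it lets a retriever return the table intact, at the cost of some chunks being shorter than `max_chars`.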

Sign up here for free to get an API key, and use the DocParse Playground UI to visualize how your document is processed.

You can learn more from our introduction video or get started with a Quickstart.

Getting started

Sign up here for free for an API key to get started with DocParse.