Partition Document
curl --request POST \
  --url https://api.aryn.cloud/v1/document/partition \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: multipart/form-data' \
  --form 'options={
    "selected_pages": [123],
    "extract_images": true,
    "extract_image_format": "ppm",
    "extract_table_structure": true,
    "table_extraction_options": {
      "include_additional_text": true,
      "model_selection": "<string>"
    },
    "summarize_images": true,
    "use_ocr": true,
    "text_extraction_options": {
      "ocr_text_mode": "vision"
    },
    "ocr_language": "abaza",
    "threshold": "auto",
    "chunking_options": {
      "strategy": "context_rich",
      "tokenizer": "openai_tokenizer",
      "tokenizer_options": {},
      "max_tokens": 123,
      "merge_across_pages": true
    },
    "output_format": "json",
    "output_label_options": {
      "title_candidate_elements": ["<string>"],
      "promote_title": true,
      "orientation_correction": true
    },
    "markdown_options": {
      "include_pagenum": true,
      "include_headers": true,
      "include_footers": true
    }
  }'
{
  "status": ["<string>"],
  "status_code": 123,
  "error": "<string>",
  "elements": [
    {
      "type": "<string>",
      "bbox": [123],
      "properties": {},
      "text_representation": "<string>"
    }
  ],
  "markdown": "<string>"
}
This is the Aryn DocParse API for partitioning (and optionally chunking) a document synchronously.
Authorizations
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
Headers
Body
selected_pages: An array containing single integers (e.g., 1) and/or arrays of exactly two integers representing a page range (e.g., [1, 10]).
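To make the accepted shape concrete, here is a small hypothetical helper (not part of the API) that expands a selected_pages value into the individual page numbers it covers:

```python
def expand_selected_pages(selected_pages):
    """Expand a selected_pages value, a mix of single integers and
    [start, end] ranges, into a sorted list of page numbers."""
    pages = set()
    for entry in selected_pages:
        if isinstance(entry, int):
            pages.add(entry)
        elif isinstance(entry, list) and len(entry) == 2:
            start, end = entry
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            raise ValueError(f"Invalid selected_pages entry: {entry!r}")
    return sorted(pages)

# page 1 plus the range 3 through 5
print(expand_selected_pages([1, [3, 5]]))  # [1, 3, 4, 5]
```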
extract_images: A boolean value indicating whether to crop images detected in the document and return them, converted to base64 in the specified format, within the binary_representation of the returned image elements.
extract_image_format: The format to use for extracted images. Defaults to ppm. Available options: ppm, png, jpeg.
extract_table_structure: A boolean value indicating whether to extract table structure from the document, i.e., detect the cells of each table broken into rows and columns.
table_extraction_options: Options for table extraction.
table_extraction_options.include_additional_text: Attempts to merge text that falls within the table bounding box but was missed by table extraction due to misalignment issues.
table_extraction_options.model_selection: An expression instructing DocParse how to select the table model to use for extraction. Defaults to "pixels > 500 -> deformable_detr; table_transformer", which means "if the largest dimension of the table is more than 500 pixels, use deformable_detr; otherwise use table_transformer." To use only one model, set model_selection="deformable_detr" or model_selection="table_transformer". Refer to the full documentation for more details.
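To illustrate how such an expression reads, here is a hedged Python sketch of evaluating the default expression. The parsing rules shown are an assumption based only on the description above, not the actual DocParse implementation, and only the "pixels > N" condition form is handled:

```python
def choose_table_model(expression, largest_dim_pixels):
    """Evaluate a model-selection expression such as
    'pixels > 500 -> deformable_detr; table_transformer':
    clauses are separated by ';', and each clause is either
    'condition -> model' or a bare fallback model name."""
    for clause in expression.split(";"):
        clause = clause.strip()
        if "->" in clause:
            condition, model = (part.strip() for part in clause.split("->"))
            # assumption: condition is of the 'pixels > N' form
            threshold = int(condition.split(">")[1])
            if largest_dim_pixels > threshold:
                return model
        else:
            return clause  # unconditional fallback
    return None

default = "pixels > 500 -> deformable_detr; table_transformer"
print(choose_table_model(default, 800))  # deformable_detr
print(choose_table_model(default, 300))  # table_transformer
```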
summarize_images: (PAYG only) A boolean value indicating whether to summarize images detected in the document and return the summaries as the text representation of the image elements.
use_ocr: A boolean value indicating whether to use OCR on the document.
text_extraction_options: Options for text extraction.
text_extraction_options.ocr_text_mode: The mode to use for OCR text extraction on non-table elements. Defaults to standard. Note that vision mode is only available for PAYG users. Available options: vision, standard.
ocr_language: The language to use for OCR. Defaults to english. Available options: abaza, adyghe, afrikaans, albanian, angika, arabic, avar, azerbaijani, belarusian, bhojpuri, bihari, bosnian, bulgarian, chinese, chinese_traditional, croatian, czech, danish, dargwa, dutch, english, estonian, french, german, hindi, hungarian, icelandic, indonesian, ingush, irish, italian, japanese, kabardian, korean, konkani, kurdish, lak, latvian, lezghian, lithuanian, magahi, maithili, malay, maltese, maori, marathi, mongolian, nagpuri, nepali, newari, norwegian, occitan, persian, polish, portuguese, romanian, russian, serbian_cyrillic, serbian_latin, slovak, slovenian, spanish, swahili, swedish, tabassaran, tagalog, tamil, telugu, turkish, ukrainian, urdu, uyghur, uzbek, vietnamese, welsh.
threshold: A number between 0 and 1, or the string auto, indicating the threshold for document segmentation. Defaults to auto, which uses an automatic threshold.
chunking_options: The options for chunking the document. If not specified, chunking is not performed.
chunking_options.strategy: The strategy to use for merging chunks. Defaults to context_rich. Available options: context_rich, mixed_multi_column, maximize_within_limit.
chunking_options.tokenizer: The tokenizer to use for chunking. Defaults to openai_tokenizer. Available options: openai_tokenizer, character_tokenizer, huggingface_tokenizer.
chunking_options.tokenizer_options: The options for the tokenizer. See the full documentation for details.
chunking_options.max_tokens: The maximum number of tokens per chunk. Defaults to 512.
chunking_options.merge_across_pages: A boolean value indicating whether to merge chunks across pages. Defaults to true. Not supported for the mixed_multi_column strategy.
output_format: The format of the output. Defaults to json. Available options: json, markdown.
output_label_options: A dictionary of options specifying which heuristics to apply to enforce certain label outputs.
output_label_options.title_candidate_elements: An array of strings representing the element types that should be considered title candidates. Defaults to ["Section-header", "Caption"].
output_label_options.promote_title: A boolean specifying whether to promote an element to title. Defaults to false.
output_label_options.orientation_correction: A boolean value indicating whether to correct the orientation of the pages. Defaults to false.
markdown_options: A dictionary of options specifying what to include in the markdown output.
markdown_options.include_pagenum: A boolean value indicating whether to include page numbers in the markdown output. Defaults to false.
markdown_options.include_headers: A boolean value indicating whether to include page headers in the markdown output. Defaults to false.
markdown_options.include_footers: A boolean value indicating whether to include page footers in the markdown output. Defaults to false.
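Putting the body parameters together, here is a minimal Python sketch of the same request as the curl example above. The option values are illustrative placeholders, the third-party requests package is assumed to be installed, and the name of the file form field ("file") is an assumption that is not shown in the curl example:

```python
import json

ARYN_URL = "https://api.aryn.cloud/v1/document/partition"

# Body parameters assembled as a dict; json.dumps produces the string
# passed as the 'options' form field in the curl example above.
options = {
    "selected_pages": [1, [3, 5]],
    "extract_table_structure": True,
    "use_ocr": True,
    "chunking_options": {"strategy": "context_rich", "max_tokens": 512},
    "output_format": "json",
}

def partition_document(pdf_path, token):
    """POST a document plus options to DocParse synchronously.
    Hypothetical wrapper: requires 'requests', and the 'file'
    form-field name is an assumption."""
    import requests
    with open(pdf_path, "rb") as f:
        resp = requests.post(
            ARYN_URL,
            headers={"Authorization": f"Bearer {token}"},
            files={"file": f},
            data={"options": json.dumps(options)},
        )
    resp.raise_for_status()
    return resp.json()

# The serialized options string is what curl passes inline:
print(json.dumps(options))
```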
Response
elements[].type: The type of the element.
elements[].bbox: The bounding box of the element.
elements[].properties: The properties of the element.
elements[].text_representation: The text representation of the element.
elements[].binary_representation: The binary representation of the element.
error: The error message if the partitioning is not successful.
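As a usage sketch, the element list in a successful response can be post-processed like this. The sample response below is a made-up illustration of the schema above, not real API output:

```python
# Hypothetical sample following the response schema above.
sample_response = {
    "status": [],
    "status_code": 200,
    "elements": [
        {"type": "Title", "bbox": [0.1, 0.05, 0.9, 0.1],
         "properties": {"page_number": 1},
         "text_representation": "Quarterly Report"},
        {"type": "Text", "bbox": [0.1, 0.2, 0.9, 0.6],
         "properties": {"page_number": 1},
         "text_representation": "Revenue grew."},
    ],
}

def full_text(response):
    """Concatenate the text_representation of every element that has one."""
    return "\n".join(
        el["text_representation"]
        for el in response.get("elements", [])
        if el.get("text_representation")
    )

print(full_text(sample_response))
```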