curl --request POST \
--url https://api.aryn.cloud/v1/async/submit/document/partition \
--header 'Authorization: Bearer <token>' \
--header 'Content-Type: multipart/form-data' \
--form 'options={
"selected_pages": [
123
],
"extract_images": true,
"extract_table_structure": true,
"use_ocr": true,
"ocr_images": true,
"ocr_language": "abaza",
"threshold": "auto",
"chunking_options": {
"strategy": "context_rich",
"tokenizer": "openai_tokenizer",
"tokenizer_options": {},
"max_tokens": 123,
"merge_across_pages": true
},
"output_label_options": {
"title_candidate_elements": [
"<string>"
],
"promote_title": true,
"orientation_correction": true
},
"output_format": "json"
}'
{
"task_id" : "aryn:t-47gpd3604e5tz79z1jro5fc"
}
This is the Aryn DocParse API for submitting a document to be partitioned (and optionally chunked) asynchronously. Use the async partition result API to poll the asynchronous partitioning job and retrieve its result once it is done.
This endpoint takes all the same parameters as the synchronous partitioning endpoint and additionally accepts a webhook URL in the optional X-Aryn-Webhook header. When the task stops running, Aryn will POST to the provided webhook URL with a body like the one below:
Authorization
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
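As a rough illustration of the two headers described above, the following Python sketch assembles them without sending a request. The helper name build_submit_headers and the example webhook URL are hypothetical; only the Authorization and X-Aryn-Webhook header names come from this page.

```python
from typing import Dict, Optional


def build_submit_headers(token: str, webhook_url: Optional[str] = None) -> Dict[str, str]:
    """Assemble request headers for the async submit call (hypothetical helper)."""
    if not token:
        raise ValueError("an auth token is required")
    headers = {"Authorization": f"Bearer {token}"}
    # X-Aryn-Webhook is optional, so it is only added when a URL is supplied.
    if webhook_url:
        headers["X-Aryn-Webhook"] = webhook_url
    return headers


print(build_submit_headers("my-token", "https://example.com/aryn-hook"))
```

The resulting dict can be passed to whichever HTTP client you use, alongside the multipart form body.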
options.selected_pages
An array containing single integers (e.g., 1) and/or arrays with exactly two integers representing a range (e.g., [1, 10]).
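The shape of this array can be checked on the client before submitting. The sketch below is one possible validator, not part of any Aryn library: the function name validate_selected_pages is hypothetical, and it checks only the structure described above (single integers, or two-integer arrays), making no assumption about whether ranges are inclusive.

```python
from typing import List, Union

PageSpec = Union[int, List[int]]


def _is_int(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_selected_pages(pages: List[PageSpec]) -> List[PageSpec]:
    """Return pages unchanged if every entry is an int or a two-int list."""
    for entry in pages:
        if _is_int(entry):
            continue
        if isinstance(entry, list) and len(entry) == 2 and all(_is_int(p) for p in entry):
            continue
        raise ValueError(f"invalid selected_pages entry: {entry!r}")
    return pages


print(validate_selected_pages([1, [3, 10]]))
```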
options.extract_images
A boolean value indicating whether to crop images detected in the document and return them in ppm format converted to base64 within the binary_representation of returned image elements.
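To make this concrete, here is a minimal Python sketch that decodes the base64 binary_representation of an image element back into PPM bytes. It assumes the element is available as a dict whose binary_representation key holds a base64 string; the helper names and the synthetic 1x1 image are illustrative only.

```python
import base64


def decode_image_element(element: dict) -> bytes:
    """Decode the base64 payload of an image element into raw bytes."""
    return base64.b64decode(element["binary_representation"])


def looks_like_ppm(data: bytes) -> bool:
    # PPM files begin with the magic number P3 (plain) or P6 (raw).
    return data.startswith((b"P3", b"P6"))


# Synthetic stand-in for a returned image element: a 1x1 red raw PPM.
ppm_bytes = b"P6\n1 1\n255\n" + bytes([255, 0, 0])
element = {"binary_representation": base64.b64encode(ppm_bytes).decode("ascii")}

decoded = decode_image_element(element)
print(looks_like_ppm(decoded), len(decoded))
```

The decoded bytes can be written to a .ppm file or handed to an imaging library that reads the format.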
options.extract_table_structure
A boolean value indicating whether to extract table structure from the document. This means detecting the cells of each table, organized into rows and columns.
options.use_ocr
A boolean value indicating whether to run OCR on the document.
options.ocr_images
A boolean value indicating whether to OCR images in the document.
options.ocr_language
The language to use for OCR. Defaults to english.
Available options: abaza, adyghe, afrikaans, albanian, angika, arabic, avar, azerbaijani, belarusian, bhojpuri, bihari, bosnian, bulgarian, chinese, chinese_traditional, croatian, czech, danish, dargwa, dutch, english, estonian, french, german, hindi, hungarian, icelandic, indonesian, ingush, irish, italian, japanese, kabardian, korean, konkani, kurdish, lak, latvian, lezghian, lithuanian, magahi, maithili, malay, maltese, maori, marathi, mongolian, nagpuri, nepali, newari, norwegian, occitan, persian, polish, portuguese, romanian, russian, serbian_cyrillic, serbian_latin, slovak, slovenian, spanish, swahili, swedish, tabassaran, tagalog, tamil, telugu, turkish, ukrainian, urdu, uyghur, uzbek, vietnamese
options.threshold
Either a number between 0 and 1 indicating the threshold for document segmentation, or auto to use an automatic threshold. Defaults to auto.
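Because this field accepts either a number or the string auto, a small client-side check can catch mistakes before submission. The sketch below is an assumption-laden illustration: validate_threshold is a hypothetical name, and treating the bounds 0 and 1 as inclusive is a guess, since this page does not say.

```python
from typing import Union

Threshold = Union[str, int, float]


def validate_threshold(value: Threshold = "auto") -> Threshold:
    """Accept "auto" or a number in [0, 1]; bounds assumed inclusive."""
    if value == "auto":
        return value
    # Reject bools explicitly, since bool is a subclass of int in Python.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"threshold must be 'auto' or a number, got {value!r}")
    if not 0 <= value <= 1:
        raise ValueError(f"threshold must be between 0 and 1, got {value!r}")
    return value


print(validate_threshold(), validate_threshold(0.35))
```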
options.chunking_options
The options for chunking the document. If not specified, chunking is not performed.
options.chunking_options.strategy
The strategy to use for merging chunks. Defaults to context_rich.
Available options: context_rich, mixed_multi_column, maximize_within_limit
options.chunking_options.tokenizer
The tokenizer to use for chunking. Defaults to openai_tokenizer.
Available options: openai_tokenizer, character_tokenizer, huggingface_tokenizer
options.chunking_options.tokenizer_options
options.chunking_options.max_tokens
The maximum number of tokens per chunk. Defaults to 512.
options.chunking_options.merge_across_pages
A boolean value indicating whether to merge chunks across pages. Defaults to false. Not supported for the mixed_multi_column strategy.
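The chunking fields above, including the restriction on merge_across_pages, can be collected into one options object. The following Python sketch shows one way to do that. The helper build_chunking_options is hypothetical, the defaults mirror the ones listed on this page, and the requirement that max_tokens be positive is an assumption.

```python
import json
from typing import Any, Dict, Optional

STRATEGIES = ("context_rich", "mixed_multi_column", "maximize_within_limit")
TOKENIZERS = ("openai_tokenizer", "character_tokenizer", "huggingface_tokenizer")


def build_chunking_options(
    strategy: str = "context_rich",
    tokenizer: str = "openai_tokenizer",
    max_tokens: int = 512,
    merge_across_pages: bool = False,
    tokenizer_options: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a chunking_options dict, rejecting combinations the page rules out."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if tokenizer not in TOKENIZERS:
        raise ValueError(f"unknown tokenizer: {tokenizer!r}")
    if max_tokens <= 0:  # assumption: a chunk needs at least one token
        raise ValueError("max_tokens must be positive")
    # Documented restriction: no cross-page merging with mixed_multi_column.
    if merge_across_pages and strategy == "mixed_multi_column":
        raise ValueError("merge_across_pages is not supported for mixed_multi_column")
    return {
        "strategy": strategy,
        "tokenizer": tokenizer,
        "tokenizer_options": tokenizer_options or {},
        "max_tokens": max_tokens,
        "merge_across_pages": merge_across_pages,
    }


options = {"chunking_options": build_chunking_options(merge_across_pages=True)}
print(json.dumps(options, indent=2))
```

The printed JSON has the same shape as the chunking_options portion of the options form field in the request.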
options.output_label_options
A dictionary of options that specify which heuristics to apply to enforce certain label outputs.
options.output_label_options.title_candidate_elements
An array of strings representing the elements that should be considered as title candidates. Defaults to ["Section-header", "Caption"].
options.output_label_options.promote_title
A boolean that specifies whether to promote an element to title. Defaults to false.
options.output_label_options.orientation_correction
A boolean value indicating whether to correct the orientation of the pages. Defaults to false.
options.output_format
The format of the output. Defaults to json.
Available options: json, markdown
task_id
The ID of the async task.
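A client typically pulls this ID out of the response body so it can be used later with the async partition result API. The sketch below is a minimal, assumed approach using only the standard library. The helper name parse_task_id is hypothetical, and the sample body is the example response shown on this page.

```python
import json


def parse_task_id(response_body: str) -> str:
    """Extract task_id from the JSON body returned by the submit call."""
    payload = json.loads(response_body)
    task_id = payload.get("task_id") if isinstance(payload, dict) else None
    if not isinstance(task_id, str) or not task_id:
        raise ValueError("response does not contain a task_id")
    return task_id


sample_body = '{"task_id": "aryn:t-47gpd3604e5tz79z1jro5fc"}'
print(parse_task_id(sample_body))
```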