Partition Document
curl --request POST \
  --url https://api.aryn.cloud/v1/document/partition \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: multipart/form-data' \
  --form 'options={
    "selected_pages": [
      123
    ],
    "extract_images": true,
    "extract_table_structure": true,
    "use_ocr": true,
    "ocr_images": true,
    "ocr_language": "abaza",
    "threshold": "auto",
    "chunking_options": {
      "strategy": "context_rich",
      "tokenizer": "openai_tokenizer",
      "tokenizer_options": {},
      "max_tokens": 123,
      "merge_across_pages": true
    },
    "output_label_options": {
      "title_candidate_elements": [
        "<string>"
      ],
      "promote_title": true,
      "orientation_correction": true
    },
    "output_format": "json"
  }'
{
  "status": [
    "<string>"
  ],
  "status_code": 123,
  "error": "<string>",
  "elements": [
    {
      "type": "<string>",
      "bbox": [
        123
      ],
      "properties": {},
      "text_representation": "<string>"
    }
  ],
  "markdown": "<string>"
}
This is the Aryn DocParse API for partitioning (and optionally chunking) a document synchronously.
Authorizations
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
Headers
Body
selected_pages: An array of page selections, each either a single integer (e.g., 1) or an array of exactly two integers representing a range (e.g., [1, 10]).
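To make the accepted shape of selected_pages concrete, here is a minimal Python sketch that validates such a value and expands it into individual page numbers. The helper name expand_selected_pages is hypothetical, and treating a range as inclusive on both ends is an assumption; the description does not state it.

```python
from typing import List, Sequence, Union

# A page entry is either one integer or a two-integer [start, end] range.
PageSpec = Union[int, Sequence[int]]


def _is_plain_int(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def expand_selected_pages(spec: Sequence[PageSpec]) -> List[int]:
    """Expand single pages and [start, end] ranges into a flat list.

    Assumption: ranges include both endpoints.
    """
    pages: List[int] = []
    for item in spec:
        if _is_plain_int(item):
            pages.append(item)
        elif (
            isinstance(item, (list, tuple))
            and len(item) == 2
            and all(_is_plain_int(n) for n in item)
        ):
            start, end = item
            if start > end:
                raise ValueError(f"range start exceeds end: {item!r}")
            pages.extend(range(start, end + 1))
        else:
            raise ValueError(f"invalid page entry: {item!r}")
    return pages
```

Under these assumptions, expand_selected_pages([1, [3, 5]]) returns [1, 3, 4, 5], while an entry such as [1, 2, 3] is rejected because a range must contain exactly two integers.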
extract_images: A boolean value indicating whether to crop images detected in the document and return them in PPM format, base64-encoded, in the binary_representation of the returned image elements.
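As an illustration of consuming such an image, the following Python sketch decodes the base64 string back into bytes. It assumes the element arrives as a dict with a binary_representation key; the helper names decode_image_bytes and looks_like_ppm are hypothetical, and whether the decoded payload begins with a standard PPM header is an assumption this page does not confirm.

```python
import base64


def decode_image_bytes(element: dict) -> bytes:
    """Return the raw bytes stored in an element's binary_representation."""
    encoded = element.get("binary_representation")
    if not encoded:
        raise ValueError("element has no binary_representation")
    return base64.b64decode(encoded)


def looks_like_ppm(data: bytes) -> bool:
    """Check for the Netpbm magic numbers of PPM files (P3 ASCII, P6 binary).

    Assumption: the payload carries a PPM header; if it does not,
    this check simply returns False.
    """
    return data[:2] in (b"P3", b"P6")
```

If looks_like_ppm returns True for the decoded bytes, they can be written to a .ppm file or handed to an image library that reads that format.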
extract_table_structure: A boolean value indicating whether to extract table structure from the document, that is, to detect the cells of each table and organize them into rows and columns.
use_ocr: A boolean value indicating whether to use OCR on the document.
ocr_images: A boolean value indicating whether to run OCR on images in the document.
ocr_language: The language to use for OCR. Defaults to english. Available options: abaza, adyghe, afrikaans, albanian, angika, arabic, avar, azerbaijani, belarusian, bhojpuri, bihari, bosnian, bulgarian, chinese, chinese_traditional, croatian, czech, danish, dargwa, dutch, english, estonian, french, german, hindi, hungarian, icelandic, indonesian, ingush, irish, italian, japanese, kabardian, korean, konkani, kurdish, lak, latvian, lezghian, lithuanian, magahi, maithili, malay, maltese, maori, marathi, mongolian, nagpuri, nepali, newari, norwegian, occitan, persian, polish, portuguese, romanian, russian, serbian_cyrillic, serbian_latin, slovak, slovenian, spanish, swahili, swedish, tabassaran, tagalog, tamil, telugu, turkish, ukrainian, urdu, uyghur, uzbek, vietnamese
threshold: Either a number between 0 and 1 giving the threshold for document segmentation, or the string auto, which selects the threshold automatically. Defaults to auto.
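Because threshold accepts two different kinds of value, a client may want to check it before sending a request. The sketch below shows one way to do that in Python; validate_threshold is a hypothetical helper, and accepting the endpoints 0 and 1 themselves is an assumption.

```python
from typing import Union


def validate_threshold(value: Union[str, int, float]) -> Union[str, float]:
    """Accept the string "auto" or a number in [0, 1]; reject anything else.

    Assumption: the bounds 0 and 1 are themselves allowed.
    """
    if isinstance(value, str):
        if value != "auto":
            raise ValueError(f"threshold string must be 'auto', got {value!r}")
        return value
    # bool is a subclass of int, so rule it out before the numeric check.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"threshold must be 'auto' or a number, got {value!r}")
    if not 0 <= value <= 1:
        raise ValueError(f"threshold must lie between 0 and 1, got {value!r}")
    return float(value)
```

With this check, "auto" and 0.35 pass through unchanged, while 1.5 or "high" raise a ValueError before any request is made.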
chunking_options: The options for chunking the document. If not specified, chunking is not performed.
chunking_options.strategy: The strategy to use for merging chunks. Defaults to context_rich. Available options: context_rich, mixed_multi_column, maximize_within_limit
chunking_options.tokenizer: The tokenizer to use for chunking. Defaults to openai_tokenizer. Available options: openai_tokenizer, character_tokenizer, huggingface_tokenizer
chunking_options.tokenizer_options: The options for the tokenizer. See the full documentation for details.
chunking_options.max_tokens: The maximum number of tokens per chunk. Defaults to 512.
chunking_options.merge_across_pages: A boolean value indicating whether to merge chunks across pages. Defaults to false. Not supported for the mixed_multi_column strategy.
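The chunking fields and their documented defaults can be gathered in a small builder. This is a sketch under stated assumptions: build_chunking_options is a hypothetical helper, the checks run client-side only, and requiring max_tokens to be positive is an assumption. The one documented constraint it enforces is that merge_across_pages is not supported with the mixed_multi_column strategy.

```python
from typing import Any, Dict, Optional

STRATEGIES = {"context_rich", "mixed_multi_column", "maximize_within_limit"}
TOKENIZERS = {"openai_tokenizer", "character_tokenizer", "huggingface_tokenizer"}


def build_chunking_options(
    strategy: str = "context_rich",
    tokenizer: str = "openai_tokenizer",
    tokenizer_options: Optional[Dict[str, Any]] = None,
    max_tokens: int = 512,
    merge_across_pages: bool = False,
) -> Dict[str, Any]:
    """Build a chunking_options dict using the documented defaults."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if tokenizer not in TOKENIZERS:
        raise ValueError(f"unknown tokenizer: {tokenizer!r}")
    if max_tokens <= 0:
        # Assumption: a chunk must be allowed at least one token.
        raise ValueError("max_tokens must be positive")
    if merge_across_pages and strategy == "mixed_multi_column":
        raise ValueError(
            "merge_across_pages is not supported for mixed_multi_column"
        )
    return {
        "strategy": strategy,
        "tokenizer": tokenizer,
        "tokenizer_options": tokenizer_options or {},
        "max_tokens": max_tokens,
        "merge_across_pages": merge_across_pages,
    }
```

Calling build_chunking_options() with no arguments yields the defaults listed above; passing strategy="mixed_multi_column" together with merge_across_pages=True raises a ValueError instead of producing an unsupported combination.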
output_label_options: A dictionary of options specifying which heuristics to apply to enforce certain label outputs.
output_label_options.title_candidate_elements: An array of strings naming the elements that should be considered as title candidates. Defaults to ["Section-header", "Caption"].
output_label_options.promote_title: A boolean value indicating whether to promote an element to title. Defaults to false.
output_label_options.orientation_correction: A boolean value indicating whether to correct the orientation of the pages. Defaults to false.
output_format: The format of the output. Defaults to json. Available options: json, markdown
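In the curl example, all of these body fields travel as one JSON string in the options form field. The Python sketch below serializes a dict into that string and rejects keys that do not appear in the example. The names build_options_field and KNOWN_OPTIONS are hypothetical, and the key list is copied from the curl example, so it may not be exhaustive.

```python
import json
from typing import Any, Dict

# Top-level keys taken from the curl example; the API may accept others.
KNOWN_OPTIONS = {
    "selected_pages",
    "extract_images",
    "extract_table_structure",
    "use_ocr",
    "ocr_images",
    "ocr_language",
    "threshold",
    "chunking_options",
    "output_label_options",
    "output_format",
}
OUTPUT_FORMATS = {"json", "markdown"}


def build_options_field(options: Dict[str, Any]) -> str:
    """Serialize partitioning options into the JSON string for the form field."""
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    # output_format defaults to json when omitted.
    output_format = options.get("output_format", "json")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    return json.dumps(options)
```

The returned string is what would be placed after options= in the --form argument; catching a misspelled key locally avoids a round trip to the server.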
Response
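elements[].type: The type of the element.
elements[].bbox: The bounding box of the element.
elements[].properties: The properties of the element.
elements[].text_representation: The text representation of the element.
elements[].binary_representation: The binary representation of the element.
error: The error message if partitioning is not successful.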
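A client typically checks error first and then walks elements. The following Python sketch does that for a response already parsed into a dict shaped like the example response. PartitionError and texts_by_type are hypothetical names, and treating any non-empty error value as a failure is an assumption based on the field description.

```python
from typing import Any, Dict, List


class PartitionError(RuntimeError):
    """Raised when the response carries an error message."""


def texts_by_type(response: Dict[str, Any], element_type: str) -> List[str]:
    """Collect text_representation values of elements with the given type."""
    error = response.get("error")
    if error:
        # Assumption: a non-empty error field means partitioning failed.
        raise PartitionError(error)
    texts: List[str] = []
    for element in response.get("elements", []):
        text = element.get("text_representation")
        if element.get("type") == element_type and text:
            texts.append(text)
    return texts
```

For example, texts_by_type(response, "Section-header") gathers the text of every element labeled Section-header, one of the type names mentioned in the title_candidate_elements default.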