curl --request POST \
--url https://api.aryn.cloud/v1/document/partition \
--header 'Authorization: Bearer <token>' \
--header 'Content-Type: multipart/form-data' \
--form file='@example-file' \
--form 'file_url=<string>' \
--form 'options={
"selected_pages": [
123
],
"extract_images": false,
"image_extraction_options": {
"associate_captions": false,
"extract_image_format": "ppm"
},
"property_extraction_options": {
"schema": [
{
"name": "<string>",
"type": {
"type": "int",
"description": "<string>",
"examples": "<array>",
"choices": [
"<string>"
],
"item_type": {}
}
}
],
"suggest_properties": false,
"suggest_properties_instructions": "<string>"
},
"table_extraction_options": {
"include_additional_text": true,
"model_selection": "pixels > 500 -> deformable_detr; table_transformer"
},
"summarize_images": false,
"text_mode": "auto",
"table_mode": "standard",
"text_extraction_options": {
"ocr_text_mode": "vision",
"remove_line_breaks": true
},
"ocr_language": "english",
"threshold": "auto",
"chunking_options": {
"strategy": "context_rich",
"tokenizer": "openai_tokenizer",
"tokenizer_options": {
"model_name": "text-embedding-3-small"
},
"max_tokens": 123,
"merge_across_pages": true
},
"output_format": "json",
"output_label_options": {
"title_candidate_elements": [
"<string>"
],
"promote_title": false,
"orientation_correction": false
},
"markdown_options": {
"include_pagenum": false,
"include_headers": false,
"include_footers": false
},
"extract_table_structure": true,
"use_ocr": true,
"extract_image_format": "ppm"
}'{
"status": [
"<string>"
],
"status_code": 123,
"elements": [
{
"type": "<string>",
"bbox": [
123
],
"properties": {},
"text_representation": "<string>",
"binary_representation": "<string>"
}
],
"markdown": "<string>",
"error": "<string>"
}curl --request POST \
--url https://api.aryn.cloud/v1/document/partition \
--header 'Authorization: Bearer <token>' \
--header 'Content-Type: multipart/form-data' \
--form file='@example-file' \
--form 'file_url=<string>' \
--form 'options={
"selected_pages": [
123
],
"extract_images": false,
"image_extraction_options": {
"associate_captions": false,
"extract_image_format": "ppm"
},
"property_extraction_options": {
"schema": [
{
"name": "<string>",
"type": {
"type": "int",
"description": "<string>",
"examples": "<array>",
"choices": [
"<string>"
],
"item_type": {}
}
}
],
"suggest_properties": false,
"suggest_properties_instructions": "<string>"
},
"table_extraction_options": {
"include_additional_text": true,
"model_selection": "pixels > 500 -> deformable_detr; table_transformer"
},
"summarize_images": false,
"text_mode": "auto",
"table_mode": "standard",
"text_extraction_options": {
"ocr_text_mode": "vision",
"remove_line_breaks": true
},
"ocr_language": "english",
"threshold": "auto",
"chunking_options": {
"strategy": "context_rich",
"tokenizer": "openai_tokenizer",
"tokenizer_options": {
"model_name": "text-embedding-3-small"
},
"max_tokens": 123,
"merge_across_pages": true
},
"output_format": "json",
"output_label_options": {
"title_candidate_elements": [
"<string>"
],
"promote_title": false,
"orientation_correction": false
},
"markdown_options": {
"include_pagenum": false,
"include_headers": false,
"include_footers": false
},
"extract_table_structure": true,
"use_ocr": true,
"extract_image_format": "ppm"
}'{
"status": [
"<string>"
],
"status_code": 123,
"elements": [
{
"type": "<string>",
"bbox": [
123
],
"properties": {},
"text_representation": "<string>",
"binary_representation": "<string>"
}
],
"markdown": "<string>",
"error": "<string>"
}Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
Show child attributes
An array containing single integers (e.g., 1) and/or arrays with exactly two integers representing a range (e.g., [1, 10]).
A boolean value indicating whether to crop images detected in the document and return them in the specified format converted to base64 within the binary_representation of returned image elements.
Show child attributes
A boolean value indicating whether to associate captions with the images.
The format to use for extracted images. Defaults to ppm.
ppm, png, jpeg (PAYG Only) Options for extracting properties (key-value pairs) from documents such as invoices, purchase orders, contracts, etc.
Show child attributes
A list of properties each of which describes a specific occurrence of information appearing in the document.
Show child attributes
The name of a property, e.g. 'address', 'first_name'.
A dictionary describing the property.
Show child attributes
The type of the property. Simple types can be 'int', 'float', 'date', 'string', 'bool', 'choice'. A nested type of 'array' can consist of properties of simple types.
int, float, date, string, bool, object, choice, array A description of the property.
An array of examples.
Valid choices for a 'choice' type property.
The type of items in an 'array' type property.
Infer the schema of the submitted document.
Additional instructions to DocParse for recognizing important properties for the schema.
Options for table extraction
Show child attributes
Attempts to merge text within the table bounding box but missed by the table due to misalignment issues.
An expression to instruct DocParse how to select the table model to use for extraction. Default is "pixels > 500 -> deformable_detr; table_transformer", which means "if the largest dimension of the table is more than 500 pixels, use deformable_detr; otherwise use table_transformer." To use only deformable_detr or table_transformer, set model_selection="deformable_detr" or model_selection="table_transformer". Refer to the full documentation for more details.
(PAYG Only) A boolean value indicating whether to summarize images detected in the document and return them as the text representation of the image elements.
The mode to use for text extraction. Defaults to auto, which intelligently uses the best combination of OCR and inline text. Note that the vision_ocr mode is only available for PAYG users.
inline_fallback_to_ocr, ocr_standard, ocr_vision, auto The mode to use for table structure extraction. Defaults to none, which will not extract table structure. Note that the vision mode is only available for PAYG users.
none, standard, vision, custom Options for text extraction
Show child attributes
The mode to use for OCR text extraction on non-table elements. Defaults to standard. Note that the vision mode is only available for PAYG users.
vision, standard A boolean value indicating whether to remove line breaks from the extracted text.
The language to use for OCR. Defaults to english.
abaza, adyghe, afrikaans, albanian, angika, arabic, avar, azerbaijani, belarusian, bhojpuri, bihari, bosnian, bulgarian, chinese, chinese_traditional, croatian, czech, danish, dargwa, dutch, english, estonian, french, german, hindi, hungarian, icelandic, indonesian, ingush, irish, italian, japanese, kabardian, korean, konkani, kurdish, lak, latvian, lezghian, lithuanian, magahi, maithili, malay, maltese, maori, marathi, mongolian, nagpuri, nepali, newari, norwegian, occitan, persian, polish, portuguese, romanian, russian, serbian_cyrillic, serbian_latin, slovak, slovenian, spanish, swahili, swedish, tabassaran, tagalog, tamil, telugu, turkish, ukrainian, urdu, uyghur, uzbek, vietnamese, welsh A number between 0 and 1 indicating the threshold for document segmentation. Defaults to auto, which uses an automatic threshold.
auto The options for chunking the document. If not specified, then chunking will not be performed.
Show child attributes
The strategy to use for merging chunks. Defaults to context_rich.
context_rich, mixed_multi_column, maximize_within_limit The tokenizer to use for chunking. Defaults to openai_tokenizer.
openai_tokenizer, character_tokenizer, huggingface_tokenizer The options for the tokenizer.
Show child attributes
The model to use for the tokenizer. Supports all tokenizers tiktoken and huggingface's transformers support so long as they do not run remote code
The maximum number of tokens per chunk. Defaults to 512.
A boolean value indicating whether to merge chunks across pages. Defaults to true. Not supported for the 'mixed_multi_column' strategy.
The format of the output. Defaults to json.
json, markdown, html A dictionary of options to specify which heuristic to apply to enforce certain label outputs.
Show child attributes
An array of strings representing the elements that should be considered as title candidates. Defaults to ["Section-header", "Caption"]
A boolean that specifies whether to promote an element to title. Defaults to false.
A boolean value indicating whether to correct the orientation of the pages. Defaults to false.
A dictionary of options to specify what to include in the markdown output.
Show child attributes
A boolean value indicating whether to include page numbers in the markdown output. Defaults to false.
A boolean value indicating whether to include page headers in the markdown output. Defaults to false.
A boolean value indicating whether to include page footers in the markdown output. Defaults to false.
Use table_mode instead. A boolean value indicating whether to extract table structure from the document. This means detecting cells of a table broken into rows and columns.
Use text_mode instead. A boolean value indicating whether to use OCR or not on the document.
The format to use for extracted images. Defaults to ppm.
ppm, png, jpeg Successful Response
Show child attributes
The type of the element.
The bounding box of the element.
The properties of the element.
The text representation of the element.
The binary representation of the element.
The error message if the partitioning is not successful.
Was this page helpful?