Output Structure
The output format of Aryn DocParse
The default output format of Aryn DocParse is JSON.
Element Format
It is often useful to process different parts of a document separately. For example, you might want to process tables differently than text paragraphs, and typically small chunks of text are embedded separately for vector search. In Aryn DocParse, these chunks are called elements.
Each element has the following format:
An example element is given below:
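As a rough illustration only, the sketch below builds a hypothetical element as a Python dictionary and round-trips it through JSON. The key names `type`, `bbox`, `properties`, and `text_representation` are assumptions inferred from the section headings on this page, and every value is invented; only `score` and `binary_representation` are named explicitly in the text.

```python
import json

# Hypothetical element: key names are inferred from this page's section
# headings, and all values are invented for illustration.
element = {
    "type": "Text",
    "bbox": [0.25, 0.125, 0.75, 0.5],  # [x1, y1, x2, y2] as page proportions
    "properties": {"score": 0.92},
    "text_representation": "First line\nSecond line",
}

# The default output format is JSON, so an element survives a JSON round trip.
decoded = json.loads(json.dumps(element))
print(decoded["type"], decoded["properties"]["score"])
```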
Element Type
Type | Description |
---|---|
Title | Large Text |
Text | Regular Text |
Caption | Description of an image or table |
Footnote | Small text found near the bottom of the page |
Formula | LaTeX or similar mathematical expression |
List-item | Part of a list |
Page-footer | Small text at bottom of page |
Page-header | Small text at top of page |
Image | A picture or diagram. When extract_images is set to true, this element includes a binary_representation tag which contains a base64-encoded PPM image file. When extract_images is false, the bounding box of the Image is still returned. |
Section-header | Medium-sized text marking a section. |
table | A grid of text. See the extract_table_structure option to extract information from the table rather than just detecting its presence. |
Bounding Box
Takes the format [x1, y1, x2, y2], where each coordinate is given as a proportion of the page: x values measure how far across the page a point is, and y values measure how far down. For instance, an element that is 100 pixels from the left border of a document 400 pixels wide would have an x1 coordinate of 0.25.
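To draw or crop a box on a rendered page, the proportional coordinates have to be scaled back to pixels. The helper below is a minimal sketch of that conversion; `bbox_to_pixels` and the page dimensions are hypothetical names and values, not part of DocParse.

```python
def bbox_to_pixels(bbox, page_width, page_height):
    """Scale a proportional [x1, y1, x2, y2] box to pixel coordinates."""
    x1, y1, x2, y2 = bbox
    # x values scale with the page width, y values with the page height.
    return (x1 * page_width, y1 * page_height, x2 * page_width, y2 * page_height)

# Illustrative values: a page rendered at 400 x 600 pixels.
print(bbox_to_pixels([0.25, 0.125, 0.75, 0.5], 400, 600))
```

With a page 400 pixels wide, an x1 of 0.25 maps back to 100 pixels, matching the example above.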
Properties
The score is the model’s “confidence” in its prediction for that particular bounding box. By default, bounding boxes are selected automatically to achieve good coverage with high prediction accuracy, but you can control this with the threshold parameter (defaults to “auto”). If you specify a numeric value between 0 and 1, only elements with a confidence score higher than that value are kept.
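The effect of a numeric threshold can be imitated on the client side after parsing. The sketch below is not the service's implementation; it assumes each element stores its confidence under `properties["score"]` (an inferred layout), and `filter_by_score` is a hypothetical helper.

```python
def filter_by_score(elements, threshold):
    """Keep elements whose confidence score is strictly above threshold."""
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    return [e for e in elements if e["properties"]["score"] > threshold]

# Invented sample elements; real scores come from the DocParse response.
elements = [
    {"type": "Text", "properties": {"score": 0.91}},
    {"type": "Title", "properties": {"score": 0.42}},
    {"type": "table", "properties": {"score": 0.70}},
]
kept = filter_by_score(elements, 0.7)
print([e["type"] for e in kept])
```

The comparison is strict because the text says only elements scoring higher than the threshold are kept, so an element scoring exactly 0.7 is dropped here.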
Text Representation
Text elements contain \n wherever the text includes a line break.
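If the line breaks are not wanted downstream, for example before embedding the text, they can be collapsed. This is a small sketch with an invented sample string; `to_single_line` is a hypothetical helper.

```python
def to_single_line(text):
    """Collapse line breaks so the text can be handled as one line."""
    # splitlines() splits on \n; blank lines are dropped before joining.
    return " ".join(line.strip() for line in text.splitlines() if line.strip())

text_representation = "Revenue grew in Q3.\nCosts were flat."  # invented sample
print(to_single_line(text_representation))
```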
Binary Representation
When extract_images is set to true, Image elements include a binary_representation tag which contains a base64-encoded PPM image file of the PDF cropped to the bounds of the detected image. When extract_images is false, the bounding box of the Image is still returned.
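Decoding the image data needs only the standard library. The sketch below assumes the base64 string sits under a `binary_representation` key of the element; `decode_image` is a hypothetical helper, and the sample element is built locally from a 1x1 PPM rather than taken from real DocParse output.

```python
import base64

def decode_image(element):
    """Return the decoded image bytes, or None if no image data is present."""
    encoded = element.get("binary_representation")
    if encoded is None:
        return None  # e.g. extract_images was false: only the bounding box is returned
    return base64.b64decode(encoded)

# Stand-in data: a 1x1 binary PPM (one red pixel) built locally.
ppm = b"P6\n1 1\n255\n" + bytes([255, 0, 0])
element = {
    "type": "Image",
    "bbox": [0.1, 0.1, 0.5, 0.5],
    "binary_representation": base64.b64encode(ppm).decode("ascii"),
}
data = decode_image(element)
print(data[:2])  # binary PPM files start with the magic number P6
```

The decoded bytes can then be written to a `.ppm` file or handed to an image library that reads PPM.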
Markdown Format
If the request to Aryn DocParse has the output_format option set to markdown, a successful response will look like this: