Tutorial: DocParse Intro
Walking through a document using Aryn DocParse
Introduction
In this example, we’ll use DocParse to extract data from an NTSB report. We’ll go through the important code snippets below to see what’s going on. For the purposes of this tutorial, we focus on page 2 of the document, given below.
Getting Parsed Document
Making a Call to DocParse
We use the partition_file function from aryn_sdk.partition to extract text and images from the document. The aryn_api_key is your API key from Aryn. In this example, we set use_ocr to False to avoid OCR and use embedded text, and set extract_images and extract_table_structure to True to extract images and tables. We use selected_pages=[2] to focus on page 2.
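As a minimal sketch of the call described above: the keyword names are the ones named in this tutorial, and they may differ between SDK versions. The helper names build_partition_options and parse_page, and the pdf_path argument, are hypothetical and exist only for illustration.

```python
def build_partition_options(api_key, page=2):
    """Collect the DocParse options described in this tutorial into one dict."""
    return {
        "aryn_api_key": api_key,          # your API key from Aryn
        "use_ocr": False,                 # use embedded text instead of OCR
        "extract_images": True,           # return image elements
        "extract_table_structure": True,  # return table cells
        "selected_pages": [page],         # only parse this page
    }


def parse_page(pdf_path, api_key, page=2):
    """Hypothetical wrapper: send one page of a local PDF to DocParse."""
    # Imported lazily so the option helper above works without the SDK installed.
    from aryn_sdk.partition import partition_file

    with open(pdf_path, "rb") as f:
        return partition_file(f, **build_partition_options(api_key, page))
```

Keeping the options in a plain dict makes it easy to log or reuse the exact settings for another page.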
Viewing the JSON Output
Above, you can see the JSON output from the call to Aryn DocParse. The output is a JSON object with a status field that shows the status of the call and an elements field that contains a list of elements extracted from the document. The call ID has been redacted for privacy reasons.
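One way to get a quick overview of such a response is to read the status field and count the elements by kind. The sketch below assumes each element is a dict with a type key; that key name and the sample values are illustrative assumptions, not output copied from DocParse.

```python
from collections import Counter


def summarize_response(response):
    """Return the status and a count of elements per (assumed) 'type' key."""
    status = response.get("status")
    elements = response.get("elements", [])
    counts = Counter(el.get("type", "unknown") for el in elements)
    return status, dict(counts)


# Made-up response with the same top-level shape as described above.
sample_response = {
    "status": ["placeholder status message"],
    "elements": [
        {"type": "Image"},
        {"type": "Caption"},
        {"type": "Text"},
        {"type": "table"},
    ],
}

status, counts = summarize_response(sample_response)
print(status)
print(counts)
```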
Examining Individual Elements
Extracting an Image from the Document
The first element we see is an image. We get back a bounding box and a binary representation of the image, which we can use to display the image. The bounding boxes are given in the format [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner of the bounding box.
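The helpers below sketch how a bounding box in [x1, y1, x2, y2] form might be handled. Two assumptions are not stated in this tutorial: that coordinates are fractions of the page size (needed only for bbox_to_pixels), and that the binary representation arrives as a base64 string. All function names are hypothetical.

```python
import base64


def bbox_size(bbox):
    """Width and height of a box given as [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = bbox
    return x2 - x1, y2 - y1


def bbox_to_pixels(bbox, page_width, page_height):
    """Scale a box to pixels, ASSUMING coordinates are fractions in [0, 1]."""
    x1, y1, x2, y2 = bbox
    return (
        round(x1 * page_width),
        round(y1 * page_height),
        round(x2 * page_width),
        round(y2 * page_height),
    )


def decode_image_bytes(encoded):
    """Decode image data, ASSUMING it is delivered as a base64 string."""
    return base64.b64decode(encoded)
```

If the coordinates turn out to be absolute pixel values, bbox_size still applies and bbox_to_pixels is unnecessary.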
Extracting Captions from the Document
The second element we see is a caption. We get back a bounding box and the text content of the caption.
Extracting Text from the Document
The third element we see is a text element. We get back a bounding box and the text content of the element.
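Captions and text elements share the same shape, so their content can be gathered with one small filter. This sketch assumes elements carry a type key and a text_representation key; treat both names, and the sample strings, as assumptions to check against your own JSON output.

```python
def texts_by_type(elements, wanted=("Caption", "Text")):
    """Group the text content of elements by their (assumed) 'type' key."""
    result = {kind: [] for kind in wanted}
    for el in elements:
        kind = el.get("type")
        if kind in result:
            # 'text_representation' is an assumed key name for the text content.
            result[kind].append(el.get("text_representation", "").strip())
    return result


# Made-up elements for illustration only.
sample_elements = [
    {"type": "Image"},
    {"type": "Caption", "text_representation": "Figure caption text\n"},
    {"type": "Text", "text_representation": "Body paragraph text\n"},
]

print(texts_by_type(sample_elements))
```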
Extracting a Table from the Document
The fourth element we see is a table. We get back a bounding box and the table structure, which includes the cells of the table and their properties.
Walking through the first cell above, we see that there are six attributes: content, rows, cols, is_header, bbox, and properties. The content attribute contains the text content of the cell, the rows attribute contains the row index of the cell, and the cols attribute contains the column index of the cell. The is_header attribute is optional and indicates whether the cell is a header cell. The bbox attribute contains the bounding box of the cell, and the properties attribute contains additional properties of the cell.
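Given cells with content, rows, and cols attributes, a two-dimensional grid can be rebuilt by placing each cell at its indices. The sketch below accepts either a single index or a list of indices for rows and cols, since a cell may span more than one position; that flexibility is an assumption, and cells_to_grid is a hypothetical helper name.

```python
def _as_list(value):
    """Normalize a single index or a list of indices to a list."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


def cells_to_grid(cells):
    """Place each cell's content at every (row, col) position it covers."""
    if not cells:
        return []
    placed = [
        (_as_list(c["rows"]), _as_list(c["cols"]), c.get("content", ""))
        for c in cells
    ]
    n_rows = max(max(rows) for rows, _, _ in placed) + 1
    n_cols = max(max(cols) for _, cols, _ in placed) + 1
    grid = [[""] * n_cols for _ in range(n_rows)]
    for rows, cols, content in placed:
        for r in rows:
            for c in cols:
                grid[r][c] = content
    return grid


# First table row from this tutorial, written out as cells by hand.
cells = [
    {"content": "Aircraft Make:", "rows": [0], "cols": [0]},
    {"content": "MARC JONES", "rows": [0], "cols": [1]},
    {"content": "Registration:", "rows": [0], "cols": [2]},
    {"content": "N512P", "rows": [0], "cols": [3]},
]
print(cells_to_grid(cells))
```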
Displaying the Table
Here we display the table in clean Markdown format. We clean the column headers so that they form a separate row in the table.
The output is given below:
0 | Aircraft Make: | MARC JONES | Registration: | N512P |
1 | Model/Series: | PITTS MODEL 12 | Aircraft Category: | Airplane |
2 | Amateur Built: | | | |
3 | Operator: | M12 AVIATION LLC | Operating Certificate(s) Held: | None |
4 | Operator Designator Code: |
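For readers who want to produce Markdown from a grid of cell strings themselves, the following is a minimal sketch with the header kept as its own row. It is not the code that produced the output above; grid_to_markdown is a hypothetical helper, and the header labels in the usage example are placeholders.

```python
def grid_to_markdown(header, rows):
    """Render a header row and data rows as a Markdown pipe table."""

    def line(cells):
        # Escape literal pipes so they do not break the table layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    out = [line(header), line(["---"] * len(header))]
    out.extend(line(row) for row in rows)
    return "\n".join(out)


# Placeholder header labels; the data rows are taken from the table above.
print(
    grid_to_markdown(
        ["Field", "Value", "Field", "Value"],
        [
            ["Aircraft Make:", "MARC JONES", "Registration:", "N512P"],
            ["Model/Series:", "PITTS MODEL 12", "Aircraft Category:", "Airplane"],
        ],
    )
)
```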