Parsing and extracting elements from a document
aryn-sdk
. We’ll go through the important code snippets below to see what’s going on. For the purposes of this tutorial, we focus on page 2 of the document, given below.
partition_file
function from aryn_sdk.partition
to extract text and images from the document. The
aryn_api_key
is your API key from Aryn. In this example, we set text_mode
to inline_fallback_to_ocr
to use
inline, embedded text when available and to use OCR otherwise, and set extract_images
and extract_table_structure
to True
to extract images and tables. We use selected_pages=[2]
to focus on page 2. This document is also be added
to your storage, where you can visualize the labeled bounding boxes and extracted
elements.
status
field that shows the status of the call and an elements
field that contains a list of elements extracted from the document. The call ID has been redacted for privacy reasons.
[x1, y1, x2, y2]
, where (x1, y1)
is the top-left corner and (x2, y2)
is the bottom-right corner of the bounding box.
content
, rows
, cols
, is_header
, bbox
, and properties
. The content
attribute contains the text content of the cell, the rows
attribute contains the row index of the cell, the cols
attribute contains the column index of the cell, the is_header
attribute indicates whether the cell is a header cell and is optional, the bbox
attribute contains the bounding box of the cell, and the properties
attribute contains additional properties of the cell.
0 | Aircraft Make: | MARC JONES | Registration: | N512P |
1 | Model/Series: | PITTS MODEL 12 | Aircraft Category: | Airplane |
2 | Amateur Built: | |||
3 | Operator: | M12 AVIATION LLC | Operating Certificate(s) | None |
Held: | ||||
4 | Operator Designator Code: |