Partition
Documentation for Aryn SDK Partition
convert_image_element
Convert an image element to a more usable format. If no format is specified, create a PIL Image object. If a format is specified, output the bytes of the image in that format. If b64encode is set to True, base64-encode the bytes and return them as a string.
Parameters:
- elem: An image element from the elements field of a partition_file response.
- format: An optional format for the output bytes. Default is ‘PIL’.
- b64encode: Base64-encode the output bytes. A format must be set to use this.
Example:
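A minimal sketch of one possible use, assuming convert_image_element is importable from aryn_sdk.partition (the module path this page gives for partition_file) and accepts format and b64encode as keyword arguments. The helpers build_data_uri and image_as_data_uri are hypothetical names introduced for illustration and are not part of the SDK; "PNG" is likewise an assumed format value.

```python
def build_data_uri(b64_text, fmt):
    """Wrap an already base64-encoded image string in a data URI."""
    return f"data:image/{fmt.lower()};base64,{b64_text}"


def image_as_data_uri(elem, fmt="PNG"):
    """Convert an image element to a data URI (hypothetical helper).

    Assumes `elem` came from a partition_file call made with
    extract_images=True, so that the element carries image contents.
    """
    # Imported lazily so this sketch can be loaded without the SDK installed.
    from aryn_sdk.partition import convert_image_element

    # With a format set and b64encode=True, the documented result is a
    # base64-encoded string of the image bytes in that format.
    encoded = convert_image_element(elem, format=fmt, b64encode=True)
    return build_data_uri(encoded, fmt)
```

Calling convert_image_element(elem) with no format should, per the description above, return a PIL Image object instead of bytes or a string.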
draw_with_boxes
Create a list of images from the provided PDF, one for each page, with bounding boxes detected by the partitioner drawn on.
Parameters:
- pdf_file: An open file or path to a PDF file upon which to draw.
- partitioning_data: The output from aryn_sdk.partition.partition_file.
- draw_table_cells: Whether to draw individually detected cells of tables. Default is False.
Returns:
A list of images of pages of the PDF, each with bounding boxes drawn on.
Example:
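A minimal sketch of saving the annotated pages to disk, assuming draw_with_boxes is importable from aryn_sdk.partition and that the returned images are PIL Image objects with a save method. The helpers page_output_paths and save_annotated_pages, and the file naming scheme, are hypothetical.

```python
from pathlib import Path


def page_output_paths(out_dir, page_count, stem="page"):
    """Build one PNG path per page, numbered from 1 (hypothetical naming)."""
    return [Path(out_dir) / f"{stem}_{i:03d}.png" for i in range(1, page_count + 1)]


def save_annotated_pages(pdf_path, partitioning_data, out_dir):
    """Draw bounding boxes on every page and write each page as a PNG."""
    # Imported lazily so this sketch can be loaded without the SDK installed.
    from aryn_sdk.partition import draw_with_boxes

    pages = draw_with_boxes(pdf_path, partitioning_data, draw_table_cells=True)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = page_output_paths(out, len(pages))
    for image, path in zip(pages, paths):
        image.save(path)  # assumes each item is a PIL Image
    return paths
```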
partition_file
Sends a file to the Aryn Partitioning Service (APS) and returns a dict of its document structure and text.
Parameters:
- file: PDF file to partition
- aryn_api_key: Aryn API key, provided as a string
- aryn_config: ArynConfig object, used for finding an API key. If aryn_api_key is set, it overrides this. Default: The default ArynConfig looks in the env var ARYN_API_KEY and the file ~/.aryn/config.yaml
- threshold: value to specify the cutoff for detecting bounding boxes. Must be set to “auto” or a floating point value between 0.0 and 1.0. Default: None (APS will choose)
- use_ocr: extract text using an OCR model instead of extracting the text embedded in the PDF. Default: False
- ocr_images: attempt to use OCR to generate a text representation of detected images. Default: False
- extract_table_structure: extract tables and their structural content. Default: False
- table_extraction_options: options for table extraction. Currently only the boolean ‘include_additional_text’ is supported: if table extraction is enabled, attempt to enhance the table structure by merging in tokens from text extraction. This can be useful for tables with missing or misaligned text. ‘include_additional_text’ is False by default. Default:
- extract_images: extract image contents. Default: False
- selected_pages: list of individual pages (1-indexed) from the PDF to partition. Default: None
- aps_url: URL of the Aryn Partitioning Service endpoint. Default: “https://api.aryn.cloud/v1/document/partition”
- ssl_verify: verify SSL certificates. In Databricks, set this to False to work around SSL incompatibilities.
- output_format: controls output representation; can be set to markdown. Default: None (JSON elements)
Returns:
A dictionary containing “status” and “elements”. If output_format is markdown, the dictionary contains “status” and “markdown” instead.
Example:
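A minimal sketch of a call, assuming partition_file is importable from aryn_sdk.partition, accepts an open binary file, and finds the API key through the default ArynConfig described above. The helpers validate_threshold and partition_pdf are hypothetical, and treating the 0.0 and 1.0 bounds as inclusive is an assumption.

```python
def validate_threshold(threshold):
    """Check a threshold against the documented rule before sending it.

    Accepts None (let the service choose), "auto", or a number in
    [0.0, 1.0]. Treating the bounds as inclusive is an assumption.
    """
    if threshold is None or threshold == "auto":
        return threshold
    value = float(threshold)
    if not 0.0 <= value <= 1.0:
        raise ValueError(
            f"threshold must be 'auto' or in [0.0, 1.0], got {threshold!r}"
        )
    return value


def partition_pdf(path, threshold="auto", selected_pages=None):
    """Partition a PDF with table structure extraction (hypothetical helper)."""
    # Imported lazily so this sketch can be loaded without the SDK installed.
    from aryn_sdk.partition import partition_file

    with open(path, "rb") as f:
        return partition_file(
            f,
            threshold=validate_threshold(threshold),
            extract_table_structure=True,
            selected_pages=selected_pages,  # 1-indexed page numbers, or None
        )
```

With the default output format, the returned dictionary should carry the parsed elements under “elements”, as described under Returns.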
table_elem_to_dataframe
Create a pandas DataFrame representing the tabular data inside the provided table element. If the element is not of type ‘table’ or doesn’t contain any table data, return None instead.
Parameters:
- elem: An element from the ‘elements’ field of a partition_file response.
Example:
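A minimal sketch of converting only the table elements, assuming table_elem_to_dataframe is importable from aryn_sdk.partition and that each element is a dict whose "type" key holds the element type. The key name and the helpers table_elements and dataframes_for_tables are assumptions made for illustration.

```python
def table_elements(response):
    """Return the elements whose type is 'table'.

    Assumes each element is a dict with a "type" key; the key name is an
    assumption based on the description above.
    """
    return [e for e in response.get("elements", []) if e.get("type") == "table"]


def dataframes_for_tables(response):
    """Convert every table element in a response to a pandas DataFrame."""
    # Imported lazily so this sketch can be loaded without the SDK installed.
    from aryn_sdk.partition import table_elem_to_dataframe

    frames = []
    for elem in table_elements(response):
        df = table_elem_to_dataframe(elem)
        if df is not None:  # None means the element held no table data
            frames.append(df)
    return frames
```

For tables to carry structural data, the partition_file call would presumably need extract_table_structure set to True.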
tables_to_pandas
For every table element in the provided partitioning response, create a pandas DataFrame representing the tabular data. Return a list containing all the elements, with tables paired with their corresponding DataFrames.
Parameters:
- data: a response from partition_file
Example:
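A minimal sketch of collecting just the DataFrames, assuming tables_to_pandas is importable from aryn_sdk.partition and returns a list of (element, DataFrame) pairs in which non-table elements are paired with None. That pair shape is an assumption inferred from the description above, and only_dataframes and all_table_dataframes are hypothetical helpers.

```python
def only_dataframes(pairs):
    """Keep the DataFrames from (element, DataFrame-or-None) pairs."""
    return [df for _elem, df in pairs if df is not None]


def all_table_dataframes(response):
    """Return one DataFrame per table found in a partition_file response."""
    # Imported lazily so this sketch can be loaded without the SDK installed.
    from aryn_sdk.partition import tables_to_pandas

    # Assumed shape: [(element, DataFrame or None), ...]
    return only_dataframes(tables_to_pandas(response))
```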