There are several options you can specify when calling DocParse. For example, we can extract the table structure from our document with the following curl command.

export ARYN_API_KEY="PUT API KEY HERE"
    curl -s -N -D headers "https://api.aryn.cloud/v1/document/partition" -H "Authorization: Bearer $ARYN_API_KEY" -F "file=@document.pdf" -F 'options={"extract_table_structure": true}' | tee document.json

All of the available options are listed below:

  • threshold: This represents the threshold for accepting the model’s predicted bounding boxes. It defaults to auto, where the service uses a processing method to find the best prediction for each possible bounding box. This is the recommended setting. However, you can override this by specifying a numerical threshold between 0 and 1. If you specify a numerical threshold, only bounding boxes with confidence scores higher than the threshold will be returned (instead of using the processing method described above). A lower value will include more objects, but may have overlaps, while a higher value will reduce the number of overlaps, but may miss legitimate objects. If you do set the threshold manually, we recommend starting with a value of 0.32.
  • use_ocr: It defaults to false, where the partitioner attempts to directly extract the text from the underlying PDF using PDFMiner. If true, the partitioner detects and extracts text from images using OCR. This is useful when the text is not directly extractable from the PDF, such as when the text is part of an image or when the text is rotated.
  • extract_table_structure: If true, the partitioner runs a table extraction model separate from the segmentation model in order to extract cells from regions of the document identified as tables.
  • extract_images: If true, the partitioner crops each region identified as an image and attaches it to the associated ImageElement. This can later be fed into the SummarizeImages transform when used within Sycamore.
  • selected_pages: You can specify a page (like [11] ), a page range (like [[25,30]] ), or a combination of both (like [11, [25,30]] ) of your PDF to process. The first page of the PDF is 1, not 0.
  • pages_per_call: This is only available when using the Partition function in Sycamore. This option divides the processing of your document into batches of pages, and you specify the size of each batch (number of pages). This is useful when running OCR on large documents.
  • output_format: Defaults to json which yields an array called elements which contains the partitioned elements, represented in JSON. If set to markdown the service response will instead include a field called markdown that contains a string representing the entire document in Markdown format.

Here is an example of how you can use these options in a curl command or in Python code with the Aryn SDK.