Processing Options
Learn about the parameters you can use with Aryn DocParse
There are several options you can specify when calling DocParse. For example, we can extract the table structure from our document with the following curl command.
All of the available options are listed below:
threshold: The confidence threshold for accepting the model’s predicted bounding boxes. It defaults to auto, where the service uses a processing method to find the best prediction for each possible bounding box. This is the recommended setting. However, you can override it by specifying a numerical threshold between 0 and 1. If you specify a numerical threshold, only bounding boxes with confidence scores higher than the threshold will be returned (instead of using the processing method described above). A lower value will include more objects but may produce overlaps, while a higher value will reduce the number of overlaps but may miss legitimate objects. If you do set the threshold manually, we recommend starting with a value of 0.32.
use_ocr: Defaults to false, where the partitioner attempts to directly extract the text from the underlying PDF using PDFMiner. If true, the partitioner detects and extracts text from images using OCR. This is useful when the text is not directly extractable from the PDF, such as when the text is part of an image or when the text is rotated.
extract_table_structure: If true, the partitioner runs a table extraction model, separate from the segmentation model, to extract cells from regions of the document identified as tables.
extract_images: If true, the partitioner crops each region identified as an image and attaches it to the associated ImageElement. This can later be fed into the SummarizeImages transform when used within Sycamore.
selected_pages: You can specify a page (like [11]), a page range (like [[25,30]]), or a combination of both (like [11, [25,30]]) of your PDF to process. The first page of the PDF is 1, not 0.
pages_per_call: This is only available when using the Partition function in Sycamore. This option divides the processing of your document into batches of pages, and you specify the size of each batch (number of pages). This is useful when running OCR on large documents.
output_format: Defaults to json, which yields an array called elements that contains the partitioned elements, represented in JSON. If set to markdown, the service response will instead include a field called markdown that contains a string representing the entire document in Markdown format.
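To make the semantics of threshold, selected_pages, and pages_per_call concrete, here is a minimal Python sketch that mimics them locally. It is illustrative only and is not the service's implementation. The helper names, the "confidence" key on each element, and the assumption that page ranges are inclusive on both ends are hypothetical choices made for this example.

```python
from typing import Any, Dict, Iterable, List, Sequence, Tuple, Union

PageSpec = Union[int, Sequence[int]]


def expand_selected_pages(selected_pages: Iterable[PageSpec]) -> List[int]:
    """Expand a value such as [11, [25, 30]] into individual page numbers.

    Pages are 1-indexed, as described above. Treating a range as inclusive
    on both ends is an assumption of this sketch.
    """
    pages: List[int] = []
    for spec in selected_pages:
        if isinstance(spec, int):
            start = end = spec
        else:
            start, end = spec
        if start < 1 or end < start:
            raise ValueError(f"invalid page spec: {spec!r}")
        pages.extend(range(start, end + 1))
    return pages


def filter_by_threshold(
    elements: List[Dict[str, Any]], threshold: float
) -> List[Dict[str, Any]]:
    """Keep elements whose confidence is strictly higher than the threshold.

    The "confidence" key is a hypothetical field name used for illustration.
    """
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    return [e for e in elements if e["confidence"] > threshold]


def batch_pages(total_pages: int, pages_per_call: int) -> List[Tuple[int, int]]:
    """Split pages 1..total_pages into consecutive (first, last) batches."""
    if pages_per_call < 1:
        raise ValueError("pages_per_call must be at least 1")
    return [
        (first, min(first + pages_per_call - 1, total_pages))
        for first in range(1, total_pages + 1, pages_per_call)
    ]
```

For instance, under these assumptions a 25-page document with pages_per_call set to 10 would be processed as three batches: pages 1 to 10, 11 to 20, and 21 to 25.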
Here is an example of how you can use these options in a curl command or in Python code with the Aryn SDK.
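The following Python sketch shows one way to assemble these options and pass them to the Aryn SDK. Treat it as a sketch under stated assumptions rather than the official example: build_options and partition_with_sdk are hypothetical helpers written for this page, and the code assumes that the aryn-sdk package is installed, that it exposes partition_file in aryn_sdk.partition, that partition_file accepts the options above as keyword arguments, and that your API key is picked up from your environment or Aryn configuration. Check the SDK reference for the exact signature before relying on it.

```python
import json
from typing import Any, Dict, List, Optional, Union


def build_options(
    threshold: Union[str, float] = "auto",
    use_ocr: bool = False,
    extract_table_structure: bool = False,
    extract_images: bool = False,
    selected_pages: Optional[List[Any]] = None,
    output_format: str = "json",
) -> Dict[str, Any]:
    """Validate and collect DocParse options into a plain dictionary."""
    if threshold != "auto":
        is_number = isinstance(threshold, (int, float)) and not isinstance(threshold, bool)
        if not is_number or not 0 <= threshold <= 1:
            raise ValueError("threshold must be 'auto' or a number between 0 and 1")
    if output_format not in ("json", "markdown"):
        raise ValueError("output_format must be 'json' or 'markdown'")

    options: Dict[str, Any] = {
        "threshold": threshold,
        "use_ocr": use_ocr,
        "extract_table_structure": extract_table_structure,
        "extract_images": extract_images,
        "output_format": output_format,
    }
    # Only send selected_pages when the caller restricts the page set.
    if selected_pages is not None:
        options["selected_pages"] = selected_pages
    return options


def partition_with_sdk(path: str, options: Dict[str, Any]) -> Dict[str, Any]:
    """Send a PDF to DocParse through the Aryn SDK (requires network access).

    Assumption: partition_file takes the open file as its first argument and
    the options as keyword arguments.
    """
    from aryn_sdk.partition import partition_file  # imported lazily on purpose

    with open(path, "rb") as f:
        return partition_file(f, **options)


if __name__ == "__main__":
    opts = build_options(
        threshold=0.32,
        extract_table_structure=True,
        selected_pages=[11, [25, 30]],
    )
    # The same dictionary, serialized as JSON, is the kind of payload you
    # would pass as the options value in a curl request.
    print(json.dumps(opts, indent=2))
    # result = partition_with_sdk("my-document.pdf", opts)  # hypothetical file name
```

Keeping validation in a small helper means mistakes such as a threshold of 1.5 fail locally before any request is made.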