There are several options you can specify when calling DocParse. For example, we can extract the table structure from our document with the following curl command.

export ARYN_API_KEY="PUT API KEY HERE"
    curl -s -N -D headers "https://api.aryn.cloud/v1/document/partition" -H "Authorization: Bearer $ARYN_API_KEY" -F "file=@document.pdf" -F 'options={"extract_table_structure": true}' | tee document.json

All of the available options are listed below, and are optional unless specified otherwise.

  • threshold: This represents the threshold for accepting the model’s predicted bounding boxes. It defaults to auto, where the service uses a processing method to find the best prediction for each possible bounding box. This is the recommended setting. However, this can be overridden by specifying a numerical threshold between 0 and 1. If you specify a numerical threshold, only bounding boxes with confidence scores higher than the threshold will be returned (instead of using the processing method described above). A lower value will include more objects, but may have overlaps, while a higher value will reduce the number of overlaps, but may miss legitimate objects. If you do set the threshold manually, we recommend starting with a value of 0.32. Either the specific string auto or a float between 0.0 and 1.0, inclusive. This value specifies the cutoff for detecting bounding boxes. A lower value will include more objects, but may have overlaps, while a higher value will reduce the number of overlaps, but may miss legitimate objects. Default is auto (DocParse will choose optimal bounding boxes).
  • use_ocr: A boolean value that, when set to True, causes DocParse to extract text using an OCR model. This is useful when the text is not directly extractable from the PDF, such as when the text is part of an image or when the text is rotated. When set to False, DocParse extracts embedded text from the input document. Default is False.
  • extract_table_structure: A boolean that, when True, enables DocParse to extract tables and their structural content partitioner using a purpose built table extraction model. If set to False, tables are still identified but not analyzed for their structure; as a result, table cells and their bounding boxes are not included in the response. Default is False.
  • table_extraction_options: A map with string keys specifying options for table extraction, which currently only supports the boolean include_additional_text, which will add in text from OCR boxes. When include_additional_text is set to True and table extraction is enabled, DocParse will attempt to enhance the table structure by merging in tokens from text extraction. This can be useful for working with tables that have missing or misaligned text. include_additional_text is False by default. The default table_extraction_options is {}.
  • extract_images: A boolean that determines whether to extract images from the document. Default: False.
  • ocr_images: A boolean that, when True, causes DocParse to use OCR to attempt to generate a text representation of detected images. When False, images do not contain a text_representation. Default is False.
  • selected_pages: A list specifying individual pages (1-indexed) and page ranges from the document to partition. Single pages are specified as integers and ranges are specified as lists with two integer entries in ascending order. A valid example value for selected_pages is [1, 10, [15, 20]] which would include pages 1, 10, 15, 16, 17 …, 20. selected_pages is None by default, which results in all pages of the document being parsed.
  • chunking_options: A dictionary of options for specifying chunking behavior. Chunking is only performed when this option is present, and default options are chosen when chunking_options is specified as {}.
    • strategy: A string specifying the strategy to use to combine and split chunks. Valid values are context_rich and maximize_within_limit. The default and recommended chunker is context_rich as {'strategy': 'context_rich'}.
      • Behavior of context_rich chunker: The goal of this strategy is to add context to evenly-sized chunks. This is most useful for retrieval based GenAI applications. The context_rich chunking combines adjacent Section-header and Title elements into a new Section-header element. It merges elements into a chunk with its most recent Section-header. If the chunk would contain too many tokens, it starts a new chunk by copying the Section-header to the start of this new chunk and continues. The chunker merges elements on different pages, unless merge_across_pages is set to False.

      • Behavior of maximize_within_limit chunker: The goal of the maximize_within_limit chunker is to make the chunks as large as possible. Merges elements into the last most recently merged set of elements unless doing so would make its token count exceed max_tokens. In that case, it would keep the new element separate and start merging subsequent elements into that one, following the same rule. Merges elements on different pages, unless merge_across_pages is set to False.

    • max_tokens: An integer specifying the cutoff for splitting chunks that are too large. Default value is 512.
    • tokenizer: A string specifying the tokenizer to use when determining how characters in a chunk are grouped. Valid values are openai_tokenizer, character_tokenizer, and huggingface_tokenizer. Defaults to openai_tokenizer.
    • tokenizer_options: A tree with string keys specifying the options for the chosen tokenizer. Defaults to {'model_name': 'text-embedding-3-small'}, which works with the OpenAI tokenizer.
      • Available options for openai_tokenizer:
        • model_name: Accepts all models supported by OpenAI’s tiktoken tokenizer. Default is “text-embedding-3-small”
      • Available options for HuggingFaceTokenizer:
      • character_tokenizer does not take any options.
    • merge_across_pages: A boolean that when True the selected chunker will attempt to merge chunks across page boundaries. Defaults to True.
  • output_format: A string controlling the output representation. Defaults to json which yields an array called elements which contains the partitioned elements, represented in JSON. If set to markdown the service response will instead include a field called markdown that contains a string representing the entire document in Markdown format.
  • pages_per_call: This is only available when using the Partition function in Sycamore. This option divides the processing of your document into batches of pages, and you specify the size of each batch (number of pages). This is useful when running OCR on large documents.
  • output_label_options: A dictionary of options to specify which heuristic to apply to enforce certain label outputs. If this option is not specified, no heuristic is applied. The options the dictionary supports are listed below.
    • promote_title: A boolean that specifies whether to promote an element to title if there’s no title in the output.
    • title_candidate_elements: A list of strings that are candidate elements to be promoted to title.

Here is an example of how you can use some of these options in a curl command or in Python code with the Aryn SDK.

Was this page helpful?