Aryn’s query engine consists of several components that work together to provide an end-to-end natural-language query processing system over complex, unstructured data. The architecture of this system is also described in our Conference on Innovative Data Systems Research (CIDR) paper, “The Design of an LLM-powered Unstructured Analytics System.”

If you just want to get started querying your data, you don’t need to know these specifics; you can use the Workspaces UI.

Query planning

The agentic query planner in Aryn’s engine executes queries against DocSets. During query planning, Aryn gives the planner the Properties Schema of the DocSet: the properties contained in the documents, with their descriptions, data types, and sample values, plus a “text-representation” field representing the entire contents of each Document. The planner then creates the plan, a directed acyclic graph (DAG) built from the engine’s available query operators, and performs several optimization and validation steps before finalizing it.
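As a rough illustration of what validating a DAG-shaped plan against a schema can involve, here is a minimal sketch. It is not Aryn’s planner: the `PlanNode` class, the `validate_plan` function, the schema layout, and the property names (`state`, `year`) are all hypothetical, chosen only to show the idea of checking property references and ordering nodes so that every operator runs after its inputs.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter
from typing import List, Optional

# Hypothetical schema: property name -> (data type, description, sample values).
SCHEMA = {
    "state": ("string", "US state where the incident occurred", ["AK", "TX"]),
    "year": ("int", "Year of the incident", [2021, 2023]),
    "text-representation": ("string", "Entire contents of the document", []),
}


@dataclass
class PlanNode:
    """One operator in a (hypothetical) query plan."""
    node_id: int
    operator: str                                      # e.g. "Filter", "Count"
    inputs: List[int] = field(default_factory=list)    # ids of upstream nodes
    property_name: Optional[str] = None                # property read, if any


def validate_plan(nodes, schema):
    """Return node ids in a valid execution order, or raise ValueError."""
    by_id = {n.node_id: n for n in nodes}
    for n in nodes:
        for parent in n.inputs:
            if parent not in by_id:
                raise ValueError(f"node {n.node_id} has unknown input {parent}")
        if n.property_name is not None and n.property_name not in schema:
            raise ValueError(
                f"node {n.node_id} uses unknown property {n.property_name!r}"
            )
    # Map each node to its predecessors; a cycle means the plan is not a DAG.
    graph = {n.node_id: set(n.inputs) for n in nodes}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError("plan is not a DAG") from exc


plan = [
    PlanNode(0, "QueryDatabase"),
    PlanNode(1, "Filter", inputs=[0], property_name="state"),
    PlanNode(2, "Count", inputs=[1]),
]
execution_order = validate_plan(plan, SCHEMA)
```

A plan that references a property missing from the schema, or whose nodes form a cycle, is rejected before anything runs, which is the kind of check a validation step can perform cheaply.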

UIs like Aryn’s Workspaces let you modify a query plan using natural language if it needs an edit.

Query operators

Aryn provides a set of high-level logical operators for query planning, and rewrites the resulting logical plan into lower-level physical operators before execution. This separation makes execution more robust and makes the plan easier for you to understand and debug.

Many simple logical operators map one-to-one to physical operators, including single-pass, per-document operations like map, filter, and llm-extract. For operations that span multiple documents, we have found that more specific operators often work better than low-level primitives.
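The rewrite from logical to physical operators can be pictured with a small sketch. Only the one-to-one entries (map, filter, llm-extract) come from the text above; the `count-distinct` logical operator and the physical steps it expands into are invented for illustration and are not Aryn’s actual rewrite rules.

```python
# One-to-one mappings named in the text: logical name -> physical name.
ONE_TO_ONE = {
    "map": "map",
    "filter": "filter",
    "llm-extract": "llm-extract",
}

# Hypothetical expansion of a multi-document logical operator into
# several lower-level physical steps (assumption, for illustration only).
EXPANSIONS = {
    "count-distinct": ["project-field", "deduplicate", "count"],
}


def rewrite(logical_plan):
    """Rewrite a linear list of logical operator names into physical ones."""
    physical = []
    for op in logical_plan:
        if op in ONE_TO_ONE:
            physical.append(ONE_TO_ONE[op])
        elif op in EXPANSIONS:
            physical.extend(EXPANSIONS[op])
        else:
            raise ValueError(f"no rewrite rule for logical operator {op!r}")
    return physical


physical_plan = rewrite(["filter", "llm-extract", "count-distinct"])
```

The logical plan stays short and readable, while the physical plan spells out each step that actually runs.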

Database operators

These query operators use Aryn’s underlying indexing and database functionality, which makes them fast and efficient at scanning and processing data.

Operator | Description
Filter | Filters records based on a range or match filter. Can be combined with the Query Database operator.
Count | Returns a count of the number of records provided in the input. It can optionally count the distinct records.
Limit | Limits the number of records returned.
Math | Performs an arithmetic operation on two input numbers. Returns a number.
Query Database | Retrieves data from Aryn’s keyword index using full-text, term-level, and other query types. Uses OpenSearch Query DSL.
Query Vector Database | Retrieves data from Aryn’s vector index using vector search, and returns the top k records. Uses OpenSearch Query DSL.
Sort | Sorts the records based on the value of a Property.
Group By | Finds the top K most frequent occurrences of values for a particular field.
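Since the Query Database and Query Vector Database operators use OpenSearch Query DSL, it may help to see what such queries look like. The sketch below builds two request bodies as Python dictionaries: a `bool` query that combines a full-text `match` with a `range` filter, and a `knn` query that returns the top k nearest vectors. The field names (`text`, `year`, `embedding`) and the helper functions are assumptions; how Aryn names its index fields or wraps these queries is not specified here.

```python
import json


def keyword_query(text, text_field="text", year_from=2020, size=10):
    """Full-text match combined with a range filter (OpenSearch Query DSL)."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Scored full-text match on the (hypothetical) text field.
                "must": [{"match": {text_field: text}}],
                # Non-scoring range filter on a (hypothetical) year property.
                "filter": [{"range": {"year": {"gte": year_from}}}],
            }
        },
    }


def vector_query(embedding, k=5, vector_field="embedding"):
    """k-NN query returning the top k records closest to the embedding."""
    return {
        "size": k,
        "query": {"knn": {vector_field: {"vector": embedding, "k": k}}},
    }


kw_body = keyword_query("engine failure", year_from=2022)
vec_body = vector_query([0.1, 0.2, 0.3], k=3)

# Serialize the way a request body would be sent to a search endpoint.
kw_json = json.dumps(kw_body)
```

Placing the range condition under `filter` rather than `must` keeps it out of relevance scoring, which mirrors how a Filter operator can be combined with a Query Database operator.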

Semantic operators

These query operators use Large Language Models (LLMs) to enable flexible processing that requires an understanding of semantic details or the generation of new text. These operations incur additional latency from the LLM calls, so we recommend filtering records in a query plan before using a semantic operator.

Operator | Description
LLM Extract Entity | Adds a new Property by extracting information from an existing text-representation or Property.
LLM Filter | Filters records based on the value of a field. Used when the semantic understanding of a field is needed.
Summarize Data | Generates an English response to a request based on the input data provided.
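The recommendation to filter records before a semantic operator can be made concrete with a toy comparison. In this sketch the LLM is replaced by a stub that counts how often it is called; the records, the `db_filter` and `llm_filter` helpers, and the question text are all hypothetical. The point is only that both orderings return the same records while the filter-first plan makes fewer LLM calls.

```python
def make_counting_llm():
    """Return a stand-in for an LLM call plus a counter of how often it ran."""
    calls = {"count": 0}

    def answers_yes(question, text):
        calls["count"] += 1
        # Stub logic only: a real model would reason over question and text.
        return "engine" in text.lower()

    return answers_yes, calls


def db_filter(records, field_name, value):
    """Cheap match filter, standing in for a database operator."""
    return [r for r in records if r.get(field_name) == value]


def llm_filter(records, question, llm):
    """Semantic filter: one LLM call per input record."""
    return [r for r in records if llm(question, r["text"])]


RECORDS = [
    {"state": "AK", "text": "Engine failure shortly after takeoff."},
    {"state": "AK", "text": "Hard landing in gusty wind."},
    {"state": "TX", "text": "Engine fire during taxi."},
    {"state": "TX", "text": "Bird strike on approach."},
]
QUESTION = "Does the report involve the engine?"

# Plan A: semantic operator first, then the database filter.
llm_a, calls_a = make_counting_llm()
result_a = db_filter(llm_filter(RECORDS, QUESTION, llm_a), "state", "AK")

# Plan B: database filter first, then the semantic operator.
llm_b, calls_b = make_counting_llm()
result_b = llm_filter(db_filter(RECORDS, "state", "AK"), QUESTION, llm_b)
```

Here plan A calls the stub once per record in the full set, while plan B calls it only for the records that survive the cheap filter.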

Executing query plans

After plan rewriting and optimization, the logical query plan is compiled into the physical plan for execution. Execution on large datasets benefits from distributed processing, allowing Aryn to scale out workloads with minimal overhead.

Output

Aryn will return a result and trace from a query execution. You can also choose to stream the query trace during execution.

Query result

Depending on the query, Aryn will return results in different formats:

  • List of documents: The output of a query can be a new DocSet. For example, a query could filter a DocSet and return the filtered set.
  • Table: The output of a query can be a table with derived values, like counts.
  • String: The output of a query can be a natural language answer, like a summary or question-answering text.
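Client code that consumes query results has to handle all three shapes. The sketch below models them as a tagged union and renders each one; the class names (`DocumentList`, `Table`, `Text`) and the `render` function are hypothetical and do not reflect the types returned by Aryn’s SDK or API.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class DocumentList:
    """A new DocSet, e.g. the output of a filter."""
    documents: List[dict]


@dataclass
class Table:
    """Derived values such as counts."""
    columns: List[str]
    rows: List[list]


@dataclass
class Text:
    """A natural language answer, e.g. a summary."""
    answer: str


QueryResult = Union[DocumentList, Table, Text]


def render(result: QueryResult) -> str:
    """Turn any of the three result shapes into display text."""
    if isinstance(result, DocumentList):
        return f"{len(result.documents)} document(s)"
    if isinstance(result, Table):
        header = " | ".join(result.columns)
        lines = [" | ".join(str(v) for v in row) for row in result.rows]
        return "\n".join([header, *lines])
    if isinstance(result, Text):
        return result.answer
    raise TypeError(f"unsupported result type: {type(result).__name__}")


table_text = render(Table(columns=["state", "count"], rows=[["AK", 2], ["TX", 1]]))
```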

Query trace

Aryn includes a query trace with the query results. The trace is a list of the Documents in the DocSet that were processed at each stage (node) of the query plan. This is helpful when you are debugging a query or want to validate which documents were included. If you are streaming results, you will get a Document Trace in real time as Documents are processed during query execution.
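To show what a per-node trace can capture, here is a minimal sketch of a linear plan executor that records which documents reach each node and optionally pushes each trace entry to a callback as it is produced. The `execute` function, the entry layout (`node`, `doc_ids`), and the sample documents are assumptions for illustration, not Aryn’s trace format.

```python
def execute(docs, stages, on_trace=None):
    """Run stages in order; record the documents each node processes.

    stages is a list of (node_name, function) pairs, where each function
    takes a list of documents and returns a list of documents.
    on_trace, if given, is called with each trace entry as it is created,
    which imitates streaming the trace during execution.
    """
    trace = []
    current = docs
    for name, fn in stages:
        entry = {"node": name, "doc_ids": [d["id"] for d in current]}
        trace.append(entry)
        if on_trace is not None:
            on_trace(entry)
        current = fn(current)
    return current, trace


DOCS = [
    {"id": "d1", "year": 2021},
    {"id": "d2", "year": 2022},
    {"id": "d3", "year": 2023},
]
STAGES = [
    ("filter", lambda ds: [d for d in ds if d["year"] >= 2022]),
    ("limit", lambda ds: ds[:1]),
]

streamed = []  # filled entry by entry while the plan runs
result, trace = execute(DOCS, STAGES, on_trace=streamed.append)
```

Comparing the `doc_ids` of consecutive entries shows exactly which documents a node dropped, which is the kind of question a trace helps answer when a query returns unexpected results.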