Search Pipelines#
The Aryn Conversational Search Stack uses OpenSearch Search Pipelines to orchestrate interactions with LLMs using retrieval-augmented generation (RAG). Search Pipelines are an OpenSearch feature that lets you add pre- and post-processing steps to the search path, enabling server-side computation on search results. A search pipeline is made up of search ‘processors’, each of which represents an individual processing step in the pipeline.
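As a quick reference, search pipelines are managed through the Search Pipeline API; you can list all configured pipelines, or retrieve a specific one by name (the pipeline name below is a placeholder):
GET /_search/pipeline
GET /_search/pipeline/<pipeline_name>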
The Aryn stack uses both the RAG Search Processor and the Hybrid Search Processor in a single Search Pipeline per query. We describe each processor below and conclude with how to use them together. However, the pipeline is composable, and you can choose to use different search techniques with the RAG Search Processor.
Retrieval Augmented Generation#
Aryn added the Retrieval Augmented Generation Search Processor to OpenSearch v2.10 for conversational search. This is a response processor, meaning that it executes after the OpenSearch search process.
The diagram above shows the flow of the RAG Search Processor:
1. The results from a hybrid search query are retrieved as the search context.
2. The previous interactions from conversational memory are retrieved as the conversation context.
3. The processor constructs a prompt for an LLM using the search context, the conversation context, and a prompt template. It sends this to the LLM and gets a response.
4. The response is combined with the question and additional metadata, and saved in conversational memory as an interaction.
5. The generative response and the list of hybrid search results are returned to the application.
If a conversation ID wasn’t supplied (see here), then the processor will not retrieve the conversation context or add an interaction to conversational memory.
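To supply one, include it in the generative_qa_parameters extension of the query. This is a hedged sketch: it assumes the conversation_id parameter name from the OpenSearch 2.10 conversational search API, with <conversation_id> as a placeholder for an ID obtained from conversational memory:
"ext": {
  "generative_qa_parameters": {
    "llm_question": "Who wrote the book of love?",
    "conversation_id": "<conversation_id>"
  }
}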
To create a RAG pipeline, you must first have a remote model wrapping your LLM deployed in ml-commons.
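As a rough sketch of that prerequisite (it assumes you have already created a connector to your LLM provider, and the model name and <connector_id> are placeholders), registering and deploying a remote model through the ml-commons model APIs looks like this:
POST /_plugins/_ml/models/_register
{
  "name": "openai-gpt-3.5-turbo",
  "function_name": "remote",
  "description": "Remote model wrapping OpenAI gpt-3.5-turbo",
  "connector_id": "<connector_id>"
}

POST /_plugins/_ml/models/<remote_model_id>/_deploy
The model ID produced by the register step is the <remote_model_id> referenced in the pipeline definition below.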
Then, for example, to create a RAG pipeline called rag_pipeline using OpenAI GPT-3.5-Turbo:
PUT /_search/pipeline/rag_pipeline
{
"description": "Retrieval Augmented Generation Pipeline",
"response_processors": [
{
"retrieval_augmented_generation": {
"tag": "openai_pipeline_demo",
"model_id": "<remote_model_id>",
"context_field_list": [
"text_representation"
],
"llm_model": "gpt-3.5-turbo"
}
}
]
}
To use this processor, reference the pipeline in your search request and add the generative_qa_parameters extension to the query:
GET <index_name>/_search?search_pipeline=rag_pipeline
{
"query": {
"neural": {
"embedding": {
"query_text": "Who wrote the book of love",
"model_id": "<embedding model id>",
"k": 100
}
}
},
"ext": {
"generative_qa_parameters": {
"llm_question": "Who wrote the book of love?"
}
}
}
The resulting LLM answer is returned in response.ext, alongside the regular search hits.
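The response has roughly the following shape; this is a hedged sketch, assuming the answer field used by the OpenSearch 2.10 processor, with the hits section elided:
{
  "hits": { ... },
  "ext": {
    "retrieval_augmented_generation": {
      "answer": "<generated answer>"
    }
  }
}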
Hybrid Search#
Hybrid Search is a processor that enables relevancy score normalization and combination. This lets you get the best of both keyword and neural search, producing higher-quality results. Ordinarily this is difficult, because neural scores and keyword scores have entirely different ranges. Hybrid search can normalize and combine these scores in a number of ways, so you can customize how your search relevancy is calculated. From our testing, we’ve found that min_max normalization with an arithmetic_mean combination (weighted [0.111, 0.889] towards the neural score) works well, but every dataset will behave differently (a worked example of this scoring follows the pipeline definition below). To create a pipeline called hybrid_pipeline with this configuration:
PUT /_search/pipeline/hybrid_pipeline
{
"description": "Hybrid Search Pipeline",
"phase_results_processors": [
{
"normalization-processor": {
"normalization": {
"technique": "min_max"
},
"combination": {
"technique": "arithmetic_mean",
"parameters": {
"weights": [0.111, 0.889]
}
}
}
}
]
}
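To illustrate how these settings combine scores (with made-up numbers): suppose a document scores 7.5 on the keyword (BM25) sub-query, where BM25 scores for that query range from 2.5 to 12.5, and 0.85 on the neural sub-query, where neural scores range from 0.65 to 0.90. min_max normalization maps these to (7.5 - 2.5) / (12.5 - 2.5) = 0.5 and (0.85 - 0.65) / (0.90 - 0.65) = 0.8, and the weighted arithmetic_mean produces a final score of 0.111 * 0.5 + 0.889 * 0.8 ≈ 0.767. Without normalization, the raw BM25 score would dwarf the neural score.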
The hybrid search processor is called a “phase_results_processor” because it is injected between the two phases of OpenSearch’s main search process. OpenSearch computes search results in two phases, “query” and “fetch”. In the “query” phase, OpenSearch scores documents and produces a list of the top-scoring document IDs. In the “fetch” phase, OpenSearch retrieves the source data for those document IDs and returns the list of search results that the user sees. Hybrid search interjects between the query and fetch phases: it collects the list of top documents and scores for each sub-query, then normalizes and combines them before the fetch phase, so it operates only on document IDs and scores rather than full documents.
To use this hybrid processor, execute a hybrid query:
GET <index-name>/_search?search_pipeline=hybrid_pipeline
{
"query": {
"hybrid": {
"queries": [
{
"match": {
"text_representation": "Who wrote the book of love?"
}
},
{
"neural": {
"embedding": {
"query_text": "Who wrote the book of love",
"model_id": "<embedding model id>",
"k": 100
}
}
}
]
}
}
}
Creating the full Search Pipeline#
For the Aryn Stack, we generally use a Search Pipeline with both Hybrid Search and RAG. The quality of the LLM's answer depends on the quality of the search results it is given, and hybrid search usually produces the most relevant results. To create this Search Pipeline, simply define a pipeline with both search processors:
PUT /_search/pipeline/rag_hybrid_pipeline
{
"description": "Hybrid RAG Pipeline",
"phase_results_processors": [
{
"normalization-processor": {
"normalization": {
"technique": "min_max"
},
"combination": {
"technique": "arithmetic_mean",
"parameters": {
"weights": [0.111, 0.889]
}
}
}
}
],
"response_processors": [
{
"retrieval_augmented_generation": {
"tag": "openai_pipeline_demo",
"model_id": "<remote_model_id>",
"context_field_list": [
"text_representation"
],
"llm_model": "gpt-3.5-turbo"
}
}
]
}
Querying this pipeline likewise looks like the union of the two queries above:
GET <index-name>/_search?search_pipeline=rag_hybrid_pipeline
{
"query": {
"hybrid": {
"queries": [
{
"match": {
"text_representation": "Who wrote the book of love?"
}
},
{
"neural": {
"embedding": {
"query_text": "Who wrote the book of love?",
"model_id": "<embedding model id>",
"k": 100
}
}
}
]
}
},
"ext": {
"generative_qa_parameters": {
"llm_question": "Who wrote the book of love?"
}
}
}
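If you don’t want to pass the search_pipeline parameter on every request, you can set the pipeline as the default search pipeline for an index using the index.search.default_pipeline setting (the index name is a placeholder):
PUT /<index-name>/_settings
{
  "index.search.default_pipeline": "rag_hybrid_pipeline"
}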