Note: If you are just getting started and prefer to use Docker containers, we recommend you deploy the Aryn stack using the Quickstart.

Create a conversational search application (in-depth guide)#

Overview#

This tutorial gives an in-depth look at the Aryn Conversational Search Stack while guiding you through building a conversational application. Aryn’s stack uses semantic data preparation and retrieval-augmented generation to create a natural language search experience with high-quality answers.

Conversational Search is a new kind of interface for search applications. Traditional search is the Google-style experience we’re all familiar with: enter a query and get back a list of documents. With the rapid development of generative AI, however, users are beginning to prefer chat-style interactions with their data. Instead of answering your question with a long list of documents, Conversational Search answers it with a natural language response and supports iterative follow-up interactions.

Retrieval Augmented Generation (RAG) is a technique used in generative AI to ground large language models (LLMs) in truth. LLMs have a tendency to hallucinate facts, and they are generally not trained on private data. This has been hugely problematic: a lawyer was famously sanctioned after using ChatGPT to prepare a court filing, because ChatGPT invented citations and the lawyer presented them as real. To keep LLMs grounded in truth, it helps to include true, relevant facts in their input. RAG accomplishes this by running a search query over a knowledge base and passing the top search results to the LLM as context. The LLM then answers the question using that context. The generation (LLM inference) is augmented by retrieval (the search query).
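Conceptually, the flow looks like the following sketch. The retrieve and generate functions here are hypothetical stand-ins for a real search call and a real LLM call, not part of any library:

# Conceptual sketch of a RAG flow. `retrieve` and `generate` are hypothetical
# placeholders for a real search query and a real LLM call.

def retrieve(query: str, k: int = 5) -> list[str]:
    # In a real system this runs a search over the knowledge base and
    # returns the top-k passages; here we return a canned result.
    return ["Abraham Lincoln led the United States through the Civil War."][:k]

def generate(prompt: str) -> str:
    # In a real system this calls an LLM service with the prompt.
    return f"(LLM answer based on a {len(prompt)}-character prompt)"

def rag_answer(question: str) -> str:
    passages = retrieve(question)
    prompt = (
        "Answer the question using only the passages below.\n\n"
        + "\n\n".join(passages)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)

print(rag_answer("Was Abraham Lincoln a good president?"))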

The Aryn Stack consists of three main components: a new semantic data preparation system called Sycamore, semantic search with OpenSearch, and new conversational capabilities in OpenSearch. Generative AI powers each of these components, leading to higher quality answers and ease of use.

Components#

The Aryn stack contains two main open source software projects:

  1. Sycamore, a robust, scalable, open source semantic data preparation system. Sycamore uses LLMs to unlock the meaning of unstructured data and prepare it for search. This, in turn, enables higher quality retrieval and better conversational search. One of the tenets of programming is “garbage in, garbage out.” Sycamore turns your garbage into gold.

  2. OpenSearch, a tried and true, open source enterprise search engine with vector database and search capabilities, enterprise-grade security, and battle-tested scalability and reliability. In version 2.10, Aryn contributed conversational capabilities, including conversation memory and APIs, so that developers can build conversational apps without needing to stitch together and manage generative AI toolkits and vector databases that are still in their infancy. This new functionality stores the history of conversations and orchestrates interactions with LLMs using retrieval-augmented generation (RAG) pipelines.

Tutorial#

Sycamore#

Sycamore is our semantic data preparation system that we recommend for processing unstructured data of all kinds. For more information on how to use it, please refer to the Sycamore documentation. The important takeaway is that Sycamore can scalably transform, clean, and enrich your data for ingestion into OpenSearch.

You can specialize your Sycamore pipeline for your data, and we provide a tutorial that walks through a sample dataset we’ve published. Once your pipeline is prepared, we’ll move on to setting up the OpenSearch component of the Aryn stack.
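As a rough illustration, a Sycamore preparation script follows a read-partition-embed-write pattern along these lines. This is a hedged sketch: the import paths, partitioner, and embedder shown here are assumptions modeled on the Sycamore documentation, so check the docs for the exact classes and arguments for your version and data:

# Hedged sketch of a Sycamore pipeline; class names and import paths are
# assumptions based on the Sycamore docs and may differ in your version.
import sycamore
from sycamore.transforms.partition import UnstructuredPdfPartitioner
from sycamore.transforms.embed import SentenceTransformerEmbedder

context = sycamore.init()
ds = (
    context.read.binary(["/path/to/your/pdfs/"], binary_format="pdf")
    .partition(partitioner=UnstructuredPdfPartitioner())   # split each PDF into elements
    .explode()                                              # one record per element
    .embed(embedder=SentenceTransformerEmbedder(            # add vector embeddings
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        batch_size=100,
    ))
)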

OpenSearch Setup#

Now that you have a Sycamore script that reads, partitions, and understands your data, you need to configure OpenSearch to index that data and enable conversational search.

Ensure that you’re running OpenSearch 2.10+ with the ML-Commons, Neural-Search, and k-NN plugins installed. This version adds several new features: remote inference, hybrid search, conversation memory, and RAG pipelines. Remote inference lets you connect to machine learning models hosted outside of the OpenSearch cluster, a must-have for using LLM services. Hybrid search combines search relevance scores from multiple sources, such as vector search and keyword search, leading to better search results. Conversation memory stores conversations in your OpenSearch cluster, letting your application make inferences based on past interactions. Finally, RAG pipelines perform retrieval-augmented generation fully within the OpenSearch cluster, requiring only a small amount of additional work to add a generative answer to the search response coming from OpenSearch.

Enable Conversational Features#

Remote Inference and Hybrid Search are enabled by default, but Conversation Memory and RAG Pipelines are not. To enable Conversation Memory:

PUT /_cluster/settings
{
  "persistent": {
    "plugins.ml_commons.memory_feature_enabled": "true"
  }
}

To enable RAG Pipelines:

PUT /_cluster/settings
{
  "persistent": {
    "plugins.ml_commons.rag_pipeline_feature_enabled": "true"
  }
}
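Both flags are ordinary persistent cluster settings, so you can also set them in a single request. Here is a minimal sketch using Python and the requests library; the endpoint, credentials, and TLS handling are placeholders you should adjust for your cluster:

import requests

OPENSEARCH_URL = "https://localhost:9200"  # placeholder endpoint

resp = requests.put(
    f"{OPENSEARCH_URL}/_cluster/settings",
    json={
        "persistent": {
            "plugins.ml_commons.memory_feature_enabled": "true",
            "plugins.ml_commons.rag_pipeline_feature_enabled": "true",
        }
    },
    auth=("admin", "admin"),  # placeholder credentials
    verify=False,             # only for local clusters with self-signed certs
)
resp.raise_for_status()
print(resp.json())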

Create Models#

Now that all of the appropriate features are enabled, it is time to put them to work. First, let’s set up neural search. Neural Search works by embedding a query as a vector and comparing it to the vector embeddings of all the documents in an index. We’ve already created the vector embeddings for our data with Sycamore, so now we just need the query-side embeddings, which are computed at query time. To do this, we must upload our vector embedding model to the OpenSearch cluster.

If your vector embedding model was one of OpenSearch’s default pretrained models, then all you need to do to load it on your cluster is the following:

POST /_plugins/_ml/models/_register
{
  "name": "<model_name>",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}

You may also need to provide a "model_group_id" parameter depending on whether you’ve set up a model group. OpenSearch will respond with a task ID. To get the model ID, simply

GET /_plugins/_ml/tasks/<task_id>

We’re not quite done yet though! At this point, all we have done is load the embedding model into OpenSearch’s index for models. In order to actually use the model, we must first deploy it. Luckily, that’s as easy as

POST /_plugins/_ml/models/<model_id>/_deploy

We can track the progress of this task with the same GetTask request, using the task_id returned by the _deploy request.
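If you are scripting this, you can poll the task API until the task finishes and then pull out the model ID. This is a hedged sketch: the state and model_id fields reflect the usual ML-Commons task response, but confirm them against your version’s documentation:

import time
import requests

OPENSEARCH_URL = "https://localhost:9200"   # placeholder endpoint
AUTH = ("admin", "admin")                   # placeholder credentials

def wait_for_task(task_id: str, timeout_s: int = 300) -> dict:
    # Poll GET /_plugins/_ml/tasks/<task_id> until the task completes or fails.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        task = requests.get(
            f"{OPENSEARCH_URL}/_plugins/_ml/tasks/{task_id}",
            auth=AUTH, verify=False,
        ).json()
        if task.get("state") in ("COMPLETED", "FAILED"):
            return task
        time.sleep(2)
    raise TimeoutError(f"task {task_id} did not finish within {timeout_s}s")

task = wait_for_task("<task_id>")
model_id = task.get("model_id")  # present once registration/deployment succeeds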

If your embedding model was not one of OpenSearch’s defaults, refer to the OpenSearch documentation on how to configure it. You will likely need to compile the model to a specific format (e.g. "Torch-JIT") and then upload it to S3 so that OpenSearch can access it.

So now we have an embedding model; for the rest of this guide I will refer to the embedding model’s ID as <embedding_id>.

We also need a large language model (LLM) in order to implement a RAG pipeline. Since we can’t host LLMs in our OpenSearch cluster, we’ll be using the new Remote Inference feature. Remote Inference allows us to create a connector to an external model-serving service and treat it as a model that behaves as ML-Commons prescribes. (Note that we can also do this for the embedding model, if you want to use vector embeddings from an LLM service.) In this example we’ll be using OpenAI, but there are guides on how to configure other LLM services.

First, we create the connector:

POST /_plugins/_ml/connectors/_create
{
  "name": "OpenAI Chat Connector",
  "description": "The connector to public OpenAI model service for GPT 3.5",
  "version": 2,
  "protocol": "http",
  "parameters": {
    "endpoint": "api.openai.com",
    "model": "gpt-3.5-turbo",
    "temperature": 0
  },
  "credential": {
    "openAI_key": "<your OpenAI key>"
  },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "url": "https://${parameters.endpoint}/v1/chat/completions",
      "headers": {
        "Authorization": "Bearer ${credential.openAI_key}"
      },
      "request_body": "{ \"model\": \"${parameters.model}\", \"messages\": ${parameters.messages}, \"temperature\": ${parameters.temperature} }"
    }
  ]
}

This gives us a connector_id that we can use to register a model:

POST /_plugins/_ml/models/_register
{
    "name": "openAI-gpt-3.5-turbo",
    "function_name": "remote",
    "description": "test model",
    "connector_id": "<connector_id>"
}

As with the embedding model, you may need to provide a "model_group_id" depending on whether you’ve set up a model group. This returns a task ID, which we can use to retrieve the model ID:

GET /_plugins/_ml/tasks/<task_id>

This model ID will henceforth be known as <openai_id>. Now we have to deploy it:

POST /_plugins/_ml/models/<openai_id>/_deploy

And now we can call OpenAI from our OpenSearch cluster.
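To sanity-check the connector before wiring it into a pipeline, you can call the model directly through the ML-Commons predict API. A hedged sketch in Python with requests; the parameters.messages body mirrors the request template defined in the connector above:

import requests

OPENSEARCH_URL = "https://localhost:9200"   # placeholder endpoint and credentials

resp = requests.post(
    f"{OPENSEARCH_URL}/_plugins/_ml/models/<openai_id>/_predict",
    json={
        "parameters": {
            "messages": [
                {"role": "user", "content": "Say hello in five words."}
            ]
        }
    },
    auth=("admin", "admin"),
    verify=False,
)
print(resp.json())  # the OpenAI completion appears inside the inference results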

Ingest Data#

In order to search over our data, we need to add it to our cluster! That’s really easy with Sycamore:

ds.write.opensearch(os_client_args, index_name, index_settings)

Now, your data is loaded in an OpenSearch index with mappings similar to these:

{
  "settings": {
    "index.knn": "true"
  },
  "mappings": {
    "properties": {
      "title": {"type": "text"},
      "text": {"type": "text"},
      "embedding": {
        "type": "knn_vector",
        "dimension": int,
        "method": {
          "name": "string",
          "space_type": "string",
          "engine": "string",
          "parameters": "json_object"
        }
      }
    }
  }
}
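For reference, the os_client_args and index_settings arguments passed to ds.write.opensearch might look roughly like this. This is a sketch: the host, credentials, vector dimension, and k-NN method parameters are placeholders that depend on your cluster and embedding model, and the exact nesting expected by the Sycamore writer is described in the Sycamore documentation:

# Hedged sketch of the writer arguments; values are placeholders.
os_client_args = {
    "hosts": [{"host": "localhost", "port": 9200}],
    "http_auth": ("admin", "admin"),
    "use_ssl": True,
    "verify_certs": False,   # only for development clusters
}

index_settings = {
    "body": {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "text": {"type": "text"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 384,   # must match your embedding model's output size
                    "method": {"name": "hnsw", "space_type": "l2", "engine": "nmslib"},
                },
            }
        },
    }
}

ds.write.opensearch(os_client_args, "<index_name>", index_settings)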

Prepare RAG Pipeline#

Now that we have an index definition and a remote model, we will create the RAG Pipeline using OpenSearch Search Pipelines:

PUT /_search/pipeline/<rag_pipeline_name>
{
    "response_processors": [
        {
            "retrieval_augmented_generation": {
                "tag": "openai_pipeline_demo",
                "description": "Demo pipeline Using OpenAI Connector",
                "model_id": "<openai_id>",
                "context_field_list": ["text"]
            }
        }
    ]
}

The context_field_list parameter represents the fields of the documents that get sent to the LLM as part of the prompt. Since in our example index mappings the body of each document was in the “text” field, that’s what we will choose. But depending on your Sycamore processing script, you may want to use other field names.

Prepare Hybrid Search Pipeline with RAG#

Hybrid Search is implemented as a Search Processor, so in order to use it for the best quality search relevance, we must make a pipeline with both processors:

PUT /_search/pipeline/<hybrid_rag_pipeline>
{
    "description": "RAG + Hybrid Search Pipeline",
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {
                    "technique": "min_max"
                },
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": {
                        "weights": [0.889, 0.111]
                    }
                }
            }
        }
    ],
    "response_processors": [
        {
            "retrieval_augmented_generation": {
                "tag": "openai_pipeline_demo",
                "description": "Demo pipeline Using OpenAI Connector",
                "model_id": "<openai_id>",
                "context_field_list": ["text"]
            }
        }
    ]
}

Use Hybrid Search, RAG Pipeline, and Conversation Memory#

Conversation Memory#

Conversation Memory exposes a set of APIs for managing and storing conversations in an OpenSearch index. A conversation is represented as a list of interactions, each of which has fields that are useful for building and maintaining conversational applications. Those fields are:

| Field           | Type    | Description |
|-----------------|---------|-------------|
| input           | text    | The human input to the application that created this interaction |
| prompt_template | text    | The template used in this interaction. Represents the natural language frame around the input and other information that got sent to the LLM |
| response        | text    | The generative AI response |
| origin          | keyword | The name of the system that generated this interaction |
| additional_info | text    | Any extra information that the LLM was prompted with |
| create_time     | date    | When this interaction was created |
| conversation_id | keyword | The ID of the conversation that this interaction belongs to |

The ‘Conversation’ object also has some higher-level information:

| Field       | Type    | Description |
|-------------|---------|-------------|
| name        | keyword | A human-readable name for this conversation. Useful if you want to allow end-users to choose which conversation to add to |
| create_time | date    | When this conversation started |
| user        | keyword | The name of the user who owns this conversation. Only exists if security is enabled, and you would only see conversations that you own |

The APIs for managing these objects are:

| API                | Method | Path                                        | Params                                                    | Response                    | Description |
|--------------------|--------|---------------------------------------------|-----------------------------------------------------------|-----------------------------|-------------|
| CreateConversation | POST   | …/_ml/memory/conversation                   | name                                                      | conversation_id             | Creates a new top-level conversation object |
| GetConversations   | GET    | …/_ml/memory/conversation                   | max_results, next_token                                   | conversations, [next_token] | Returns a list of top-level conversation objects, paginated, sorted by recency |
| CreateInteraction  | POST   | …/_ml/memory/conversation/{conversation_id} | input, prompt_template, response, origin, additional_info | interaction_id              | Creates an interaction object in conversation conversation_id |
| GetInteractions    | GET    | …/_ml/memory/conversation/{conversation_id} | max_results, next_token                                   | interactions, [next_token]  | Returns a list of interactions belonging to conversation conversation_id, paginated, sorted by recency |
| DeleteConversation | DELETE | …/_ml/memory/conversation/{conversation_id} |                                                           | success                     | Deletes a conversation and all of its interactions |
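As an example of how these fit together, creating a conversation, logging an interaction by hand, and reading it back might look like the sketch below. It uses Python and requests; whether max_results and next_token are passed as query parameters (as assumed here) or elsewhere may vary by version, so check the API documentation:

import requests

OPENSEARCH_URL = "https://localhost:9200"   # placeholder endpoint
AUTH = ("admin", "admin")                   # placeholder credentials

# CreateConversation
conv = requests.post(
    f"{OPENSEARCH_URL}/_plugins/_ml/memory/conversation",
    json={"name": "Demo conversation"},
    auth=AUTH, verify=False,
).json()
conversation_id = conv["conversation_id"]

# CreateInteraction -- normally the RAG pipeline logs interactions for you
requests.post(
    f"{OPENSEARCH_URL}/_plugins/_ml/memory/conversation/{conversation_id}",
    json={
        "input": "Was Abraham Lincoln a good president?",
        "prompt_template": "Answer the question based on the passages.",
        "response": "Yeah, he was a pretty cool dude by all accounts.",
        "origin": "manual_test",
        "additional_info": "[]",
    },
    auth=AUTH, verify=False,
)

# GetInteractions (pagination parameters assumed to be query parameters)
interactions = requests.get(
    f"{OPENSEARCH_URL}/_plugins/_ml/memory/conversation/{conversation_id}",
    params={"max_results": 10},
    auth=AUTH, verify=False,
).json()
print(interactions)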

Use Pipeline for RAG#

Let’s build our query one step at a time. The root goal of this tutorial is to enable conversational search applications. We will start with a simple RAG search query, and then build on it into a query strong enough to drive your application.

First, let’s suppose our user asks, “Was Abraham Lincoln a good president?”

We’ll start with a simple BM25 RAG query that uses keyword search (not hybrid search):

GET <index_name>/_search?search_pipeline=<rag_pipeline_name>
{
    "query": {
        "match": {
            "text": "Was Abraham Lincoln a good president?"
        }
    },
    "size": 10,
    "ext": {
        "generative_qa_parameters": {
            "llm_question": "Was Abraham Lincoln a good president?"
        }
    }
}

In the response we have a list of search hits, as with any OpenSearch query, and in addition we have a field called ext.retrieval_augmented_generation.answer, which contains the LLM’s answer to our question based on the data we presented to it.
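In application code, reading both pieces out of the response is straightforward. A minimal sketch, assuming the index, pipeline, endpoint, and credentials shown elsewhere in this guide:

import requests

OPENSEARCH_URL = "https://localhost:9200"   # placeholder endpoint and credentials

query = {
    "query": {"match": {"text": "Was Abraham Lincoln a good president?"}},
    "size": 10,
    "ext": {"generative_qa_parameters": {"llm_question": "Was Abraham Lincoln a good president?"}},
}
resp = requests.get(
    f"{OPENSEARCH_URL}/<index_name>/_search",
    params={"search_pipeline": "<rag_pipeline_name>"},
    json=query, auth=("admin", "admin"), verify=False,
).json()

hits = [hit["_source"]["text"] for hit in resp["hits"]["hits"]]      # the retrieved passages
answer = resp["ext"]["retrieval_augmented_generation"]["answer"]     # the LLM's generated answer
print(answer)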

Well, BM25 keyword matching doesn’t do too well with a full natural-language question like that. However, dense retrieval using vector search performs much better. So let’s use the Neural Search plugin to (1) embed our query and (2) perform a kNN lookup against our kNN index.

GET <index_name>/_search?search_pipeline=<rag_pipeline_name>
{
    "query": {
        "neural": {
            "embedding": {
                "query_text": "Was Abraham Lincoln a good president?",
                "model_id": "<embedding_id>",
                "k": 100
            }
        }
    },
    "size": 10,
    "ext": {
        "generative_qa_parameters": {
            "llm_question": "Was Abraham Lincoln a good president?"
        }
    }
}

Make it Conversational#

Until now, all this has done is perform one-off inferences. But modern conversational applications like the one we’re building should have a notion of ‘chat history,’ so that the LLM can answer questions based on previous interactions as well as document search results. This is where Conversation Memory steps in.

So, our end-user enters the question “Was Abraham Lincoln a good president?” We’ll create a conversation to track where this goes.

POST /_plugins/_ml/memory/conversation
{
    "name": "Was Abraham Lincoln a good president?"
}

This returns a conversation ID, which I’ll call <conversation_id>. We can hand that to the RAG Pipeline:

GET <index_name>/_search?search_pipeline=<hybrid_rag_pipeline>
{
    "query": {
        "hybrid": {
            "queries": [
                {
                    "neural": {
                        "embedding": {
                            "query_text": "Was Abraham Lincoln a good president?",
                            "model_id": "<embedding_id>",
                            "k": 100
                        }
                    }
                },
                {
                    "match": {
                        "text": "Was Abraham Lincoln a good president?"
                    }
                }
            ]
        }
    },
    "size": 10,
    "ext": {
        "generative_qa_parameters": {
            "llm_question": "Was Abraham Lincoln a good president?",
            "conversation_id": "<conversation_id>"
        }
    }
}

The response is the same as without the conversation ID, so what’s changed? Well, if we do a GetInteractions:

GET /_plugins/_ml/memory/conversation/<conversation_id>

We get back a list with a single interaction in it, representing the interaction we just had:

{
    "interactions": [
        {
            "interaction_id": "1430y8u28905t",
            "conversation_id": "<conversation_id>",
            "input": "Was Abraham Lincoln a good president?",
            "prompt_template": "Answer the question based on the passages. Some more instructions, and maybe a couple more. Keep in mind this extra context.",
            "response": "Yeah, he was a pretty cool dude by all accounts.",
            "origin": "rag_search_pipeline",
            "additional_info": "[document1text, document2text, document3text, ...]",
            "create_time": "2023-09-11 15:33:23.234523Z"
        }
    ]
}

So our interaction has been logged by the pipeline behind the scenes. Now imagine our end-user follows that up with the very simple question, “Why?”

GET <index_name>/_search?search_pipeline=<hybrid_rag_pipeline>
{
    "query": {
        "hybrid": {
            "queries": [
                {
                    "neural": {
                        "embedding": {
                            "query_text": "Why?",
                            "model_id": "<embedding_id>",
                            "k": 100
                        }
                    }
                },
                {
                    "match": {
                        "text": "Why?"
                    }
                }
            ]
        }
    },
    "size": 10,
    "ext": {
        "generative_qa_parameters": {
            "llm_question": "Why?",
            "conversation_id": "<conversation_id>"
        }
    }
}

Our search results will be completely nonsensical, because OpenSearch does not take context into account when returning documents. In this case, it’s going to return anything that matches the question “Why?”. However, the LLM is still going to produce a coherent response. Since we passed in the conversation ID, the last interaction is also in the prompt, so the LLM knows that this is a follow-up to “Was Abraham Lincoln a good president?”, and thus will respond with why Abraham Lincoln was pretty cool.

Further Steps#

Now, this last interaction seems like it could be problematic. The RAG pipeline has no notion of how relevant its search results are to a conversation, so it is sending the LLM everything (well, the top 10 documents) that matches “Why?”. This has the potential to confuse the LLM into giving an undesirable answer. Furthermore, the LLM doesn’t have the information that led to the inference from the first interaction: when asked “Why?”, it can’t reference the search results that led it to claim that Lincoln was pretty cool. In essence, it ‘forgets’ its reasoning (although those documents are stored in conversation memory, the pipeline doesn’t read them). The solution is to rewrite the question that gets sent to OpenSearch, taking the chat history into account. We don’t have a way to do this within OpenSearch, but we can just hit OpenAI directly with something like:

//I'M A QUERY TO OPENAI, NOT OPENSEARCH!!!
POST https://api.openai.com/v1/chat/completions
{
    "model": "gpt-3.5-turbo",
    "messages": [
        {
            "role": "system",
            "content": "Rewrite the question taking into account the context from the previous several interactions"
        },
        {
            "role": "user",
            "content": "Was Abraham Lincoln a good president?"
        },
        {
            "role": "assistant",
            "content": "Yeah, he was a pretty cool dude by all accounts."
        },
        {
            "role": "user",
            "content": "Question: Why? \n Rewritten Question:"
        }
    ]
}

OpenAI will rewrite the question into something like “What qualities made Abraham Lincoln a good president?” Then we can query OpenSearch with RAG:

GET <index_name>/_search?search_pipeline=<hybrid_rag_pipeline>
{
    "query": {
        "hybrid": {
            "queries": [
                {
                    "neural": {
                        "embedding": {
                            "query_text": "What qualities made Abraham Lincoln a pretty cool dude, per se?",
                            "model_id": "<embedding_id>",
                            "k": 100
                        }
                    }
                },
                {
                    "match": {
                        "text": "What qualities made Abraham Lincoln a pretty cool dude, per se?"
                    }
                }
            ]
        }
    },
    "size": 10,
    "ext": {
        "generative_qa_parameters": {
            "llm_question": "Why?",
            "conversation_id": "<conversation_id>"
        }
    }
}
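Putting the rewrite step and the search together in application code might look roughly like the sketch below. The rewrite_question helper and the endpoints and credentials are illustrative, and in a real application you would reuse the conversation history you already track rather than hard-coding it:

import requests

OPENAI_KEY = "<your OpenAI key>"
OPENSEARCH_URL = "https://localhost:9200"   # placeholder endpoint and credentials

def rewrite_question(history: list[dict], question: str) -> str:
    # Ask OpenAI to rewrite a follow-up question so it stands on its own.
    messages = (
        [{"role": "system", "content": "Rewrite the question taking into account the context from the previous several interactions"}]
        + history
        + [{"role": "user", "content": f"Question: {question} \n Rewritten Question:"}]
    )
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENAI_KEY}"},
        json={"model": "gpt-3.5-turbo", "messages": messages},
    ).json()
    return resp["choices"][0]["message"]["content"]

history = [
    {"role": "user", "content": "Was Abraham Lincoln a good president?"},
    {"role": "assistant", "content": "Yeah, he was a pretty cool dude by all accounts."},
]
rewritten = rewrite_question(history, "Why?")

search_body = {
    "query": {"hybrid": {"queries": [
        {"neural": {"embedding": {"query_text": rewritten, "model_id": "<embedding_id>", "k": 100}}},
        {"match": {"text": rewritten}},
    ]}},
    "size": 10,
    "ext": {"generative_qa_parameters": {"llm_question": "Why?", "conversation_id": "<conversation_id>"}},
}
resp = requests.get(
    f"{OPENSEARCH_URL}/<index_name>/_search",
    params={"search_pipeline": "<hybrid_rag_pipeline>"},
    json=search_body, auth=("admin", "admin"), verify=False,
).json()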

Now, the prompt engineering here is by no means optimal. Additionally, you may want to construct even more complicated queries: perhaps you want to pull out specific terms from the end-user’s question, or apply filters based on information external to the question itself. The possibilities are vast.