Sycamore is a robust, scalable, open source system for semantic data preparation. In the Aryn Conversational Search Stack, it is the component that takes data from sources, cleans and enriches it, and loads it into OpenSearch vector databases.


Sycamore uses LLMs to unlock the meaning of unstructured data and prepare it for search. It provides a high-level API for constructing Python-native data pipelines from operations such as data cleaning, information extraction, enrichment, summarization, and generation of vector embeddings that encapsulate the semantics of the data.
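
As a rough illustration of what a pipeline of chained operations looks like, here is a minimal sketch in plain Python. It does not use Sycamore's API: `Doc`, `clean`, `extract_title`, `embed`, and `run_pipeline` are hypothetical names invented for this example, and the extraction and embedding steps are trivial stand-ins for the LLM and embedding model calls a real pipeline would make.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical document record: raw text plus derived data."""
    text: str
    properties: dict = field(default_factory=dict)
    embedding: list = field(default_factory=list)


def clean(doc: Doc) -> Doc:
    # Collapse runs of whitespace left over from extraction.
    text = re.sub(r"\s+", " ", doc.text).strip()
    return Doc(text, dict(doc.properties), list(doc.embedding))


def extract_title(doc: Doc) -> Doc:
    # Stand-in for LLM-based extraction: treat the first sentence as the title.
    title = doc.text.split(".")[0]
    return Doc(doc.text, {**doc.properties, "title": title}, list(doc.embedding))


def embed(doc: Doc) -> Doc:
    # Stand-in for an embedding model: a toy 2-dimensional vector
    # (character count, word count).
    vector = [float(len(doc.text)), float(len(doc.text.split()))]
    return Doc(doc.text, dict(doc.properties), vector)


def run_pipeline(docs, steps):
    # Apply each step to every document, in order.
    for step in steps:
        docs = [step(d) for d in docs]
    return docs


docs = run_pipeline(
    [Doc("  Ray   scales Python.  It is open source. ")],
    [clean, extract_title, embed],
)
print(docs[0].properties["title"], docs[0].embedding)
```

Each step takes a document and returns a new one, so steps can be reordered or swapped without touching the others. A real pipeline would also end with a load step into a vector database, which is omitted here.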

Sycamore uses generative AI to make data extraction and enrichment easy. You can use large language models (LLMs) to extract data from documents and enrich it, and use few-shot examples to instruct these models on what to do. Additionally, Sycamore uses your choice of model to create vector embeddings for your data.
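
To make the few-shot idea concrete, here is a minimal sketch of how an extraction prompt might be assembled from an instruction, a handful of worked examples, and a new input. It is plain Python and not Sycamore's API: `build_few_shot_prompt`, the `Text:`/`Answer:` layout, and the sample documents are all assumptions made for illustration. The resulting string would be sent to whichever LLM you choose.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and a new input into one prompt."""
    parts = [instruction.strip(), ""]
    for source_text, expected in examples:
        # Each example shows the model an input and the answer we want for it.
        parts.append(f"Text: {source_text}")
        parts.append(f"Answer: {expected}")
        parts.append("")
    # The new input ends with an open "Answer:" for the model to complete.
    parts.append(f"Text: {query}")
    parts.append("Answer:")
    return "\n".join(parts)


# Made-up sample documents, used only to show the prompt shape.
examples = [
    ("Quarterly Report Q3. Prepared by the finance team.", "Quarterly Report Q3"),
    ("Onboarding Guide. Welcome to the company!", "Onboarding Guide"),
]

prompt = build_few_shot_prompt(
    "Extract the title of the document.",
    examples,
    "Release Notes 2.1. This version fixes several bugs.",
)
print(prompt)
```

The examples steer the model through the prompt alone; no model weights are updated.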

Sycamore runs on Ray, a scalable compute framework for Python workloads. This enables you to easily scale your Sycamore workloads.

For more information on Sycamore, see Sycamore on GitHub or the Sycamore documentation. A tutorial on developing Sycamore jobs using a Jupyter notebook and the Aryn Quickstart is also available.