Using Sycamore for document ETL

You can use the open source Sycamore document ETL library for advanced data transformation, data cleaning, custom chunking, and more, and load the output as a new DocSet in Aryn. You can write highly customizable data pipelines in Python, and quickly iterate on your pipelines using a Jupyter notebook. Sycamore is also effective at extracting and transforming the Properties extracted from the Documents in your DocSet. Sycamore works hand-in-hand with Aryn, as both are built around the DocSet and Document abstractions. Sycamore can parse documents, extract tables, do OCR, and more using Aryn’s DocParse service, and can scale to efficiently process thousands of documents using a Ray backend. Visit the Sycamore documentation to learn more, or check out example Jupyter notebooks with an ETL pipelines here.

Writing to Aryn

You can use the Aryn connector to write your DocSet to Aryn. You can either specify a DocSet ID to write to a specific DocSet, or provide a DocSet name to create a new DocSet with that name. Vector embedding configuration The DocSet you write must use vector embeddings created with the OpenAI Text Embedding 3 model. If the source DocSet for your Sycamore job is from Aryn, your Documents will already have the properly configured embeddings. If you are reading and processing documents from elsewhere, you will need to include this embedding step in your pipeline:

...
docset.embed(OpenAIEmbedder(model_name="text-embedding-3-small"))
...

Adding Properties extracted in pipeline to Aryn DocSet Properties Schema* Your Sycamore ETL pipeline may extract Properties using LLM-powered transforms or enrich your Documents with metadata from other data sources. For Aryn to use these Properties with its query engine, your Sycamore job will also need to update the Aryn DocSet Properties Schema, which it does with the update_properties_schema parameter. By default, this is set to True. Sycamore only adds Property name and value to Aryn’s Property Schema, so you will need to update other fields like description directly in the Property Schema. You can do that with the update_docset API, get the current Properties Schema (using get_docset), add the additional paramters (like ```description), and the pass in a new Properties Schema with the added values. Writing to an existing Aryn DocSet If you write to an exisitng DocSet in Aryn, the Sycamore job will update that DocSet:

New Properites will be added
Exisitng Properties will be updated
Deleted Properties in your job will not be deleted in Aryn
Element content with the same Element_ID will be updated
New Elements will be added (new Element_IDs)
Deleted Elements in your job will not be deleted in Aryn

You can also write to a new DocSet if you want the output of your Sycamore job to correctly propagate deletions in your job.

...
docset.write.aryn(
	docset_id="<YOUR-DOCSET-ID>"
	aryn_api_key="<YOUR-ARYN-KEY>"
	)

Writing to a new Aryn DocSet To create a new Aryn DocSet and write to it:

...
docset.write.aryn(
	name="<NEW-DOCSET-NAME>"
	aryn_api_key="<YOUR-ARYN-KEY>"

Reading from Aryn

To read a DocSet from Aryn, you will also use the Aryn connector:

...
docset.read.aryn(
	docset_id="<YOUR-DOCSET-ID>"
	aryn_api_key="<YOUR-ARYN-KEY>"

Get Started

Aryn Platform

Writing to Aryn

Reading from Aryn

Get Started

Aryn Platform

​Writing to Aryn

​Reading from Aryn

Writing to Aryn

Reading from Aryn