cygni/cygni-competence-vectordbs

Vector database lab

For this session we'll use a vector database called Chroma together with Infinity, a self-hosted API that serves a SentenceTransformer model for creating embeddings.

This setup could be used as the retrieval step in a RAG (retrieval-augmented generation) pipeline.

Here is a diagram of what we will be building in this lab:

[Image: overall system design]

And a sequence diagram:

sequenceDiagram
    actor User
    User->>Application: User Query
    Application->>Infinity: Create Embedding of User Query
    activate Infinity
    Infinity-->>Application: Vector Embedding
    deactivate Infinity
    Application->>Chroma: Find nearest documents based on Vector Embedding
    activate Chroma
    Chroma-->>Application: Results
    deactivate Chroma
    Application-)User: Results

Make sure you've read the prerequisites for this lab.

First, let's get all the parts running.

Preparations

  1. Infinity
    • To generate vector embeddings we need a language model and an application that can extract the vectors from it. A very common library for this is Sentence Transformers, which is implemented in Python. Another option is to use an online service for the embeddings, for example OpenAI's API. In this lab we'll use a self-hosted API called Infinity. The language model chosen for this lab is all-MiniLM-L6-v2, which is suitable for clustering and semantic search.
    • Start the Infinity API:
    • docker run -it -p 8080:8080 michaelf34/infinity:0.0.20 --model-name-or-path sentence-transformers/all-MiniLM-L6-v2 --port 8080
    • Let's generate a vector embedding! Go to the Swagger docs for the API, click on "Try it out", add a sentence to the "input" array and press "Execute".
    • [Screenshot: Infinity Swagger docs]
    • You should get a response containing a 384-dimensional vector.
  2. Chroma
    • Chroma is a lightweight vector database that has a REST API and language bindings for Python and JavaScript.
    • Start Chroma in Docker:
    • docker run -p 8000:8000 chromadb/chroma
    • Verify that it is up and running by viewing the FastAPI docs at http://localhost:8000/docs.
    • [WINDOWS] On some Windows machines, port 8000 is already in use. To solve this, select another port: docker run -p <ANOTHER_PORT>:8000 chromadb/chroma
    • [WINDOWS] Now verify that it is running by going to http://localhost:<ANOTHER_PORT>/docs.
    • [WINDOWS] Changing the port also affects some of the scripts used later in the different labs. The relevant changes are marked as [WINDOWS] in the code.
  3. Clone this repo and install dependencies
    • git clone git@github.com:cygni/cygni-competence-vectordbs.git
    • cd cygni-competence-vectordbs && npm install

Lab 1

Let's see how well the language model works and how to add records to Chroma by adding a subset of the MS MARCO dataset. MS MARCO is a question answering dataset featuring 100 000 real Bing questions and human-generated answers.

For this exercise we'll use only two of the questions with around 1000 different answers to each question. You can find the file in /msmarco/msmarco-subset.tsv.

The content is tab-separated with the following columns:

qid pid query passage
1133167 4712273 how is the weather in jamaica Jamaica is famous for having gorgeous sunny and warm weather most of the year. That is one of the main reasons that people like to vacation in Jamaica. If you’re making travel plans to visit Jamaica you’re probably trying to find out what the best time of the year is to visit Jamaica so that you can make the most of your vacation. Well, what the best time of year is depends on what you like to do. If you want to sit on a beach with a cold drink and bask in the warm sun for a few days then the best time to plan a trip to Jamaica is during the summer months when the temperature is usually between 80 and 90 degrees every day.
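
As a rough illustration of the file layout, the rows can be turned into objects with a few lines of JavaScript. This is a minimal sketch and not the code in indexMsMarco.mjs: the function name parseMsMarcoTsv is hypothetical, and it assumes the first line of the file is the header row shown above.

```javascript
// Parse tab-separated text into one object per row.
// Assumption: the first non-empty line is the header (qid, pid, query, passage).
function parseMsMarcoTsv(text) {
  const [headerLine, ...rows] = text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "");
  const columns = headerLine.split("\t");
  return rows.map((row) => {
    const cells = row.split("\t");
    // Missing trailing cells become empty strings instead of undefined.
    return Object.fromEntries(columns.map((name, i) => [name, cells[i] ?? ""]));
  });
}

// Shortened version of the example row above.
const sample =
  "qid\tpid\tquery\tpassage\n" +
  "1133167\t4712273\thow is the weather in jamaica\t" +
  "Jamaica is famous for having gorgeous sunny and warm weather most of the year.\n";

const records = parseMsMarcoTsv(sample);
console.log(records[0].query); // how is the weather in jamaica
```

Keeping qid and pid as strings is convenient here, since Chroma expects string IDs.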

The two questions we'll be working with are:

  • How is the weather in Jamaica?
  • Hydrogen is a liquid below what temperature?

Have a look at the dataset and some of the different answers to these questions. Some are relevant, and some do not answer the question at all.

There is a prepared program that reads the dataset and upserts the data to Chroma. Run: node indexMsMarco.mjs

Upserting about 2500 lines of data takes around 4 minutes on an Apple M1 Max. While the program is running, have a look at the code in indexMsMarco.mjs.

It starts with:

const client = new ChromaClient();
const embedder = new SentenceTransformer('key-not-needed');


// Prepare the collection
const collection = await client.getOrCreateCollection({
  name: COLLECTION_NAME,
  embeddingFunction: embedder,
});

What is the embedder? It is a function that is called by the ChromaClient to get the vector embedding for the document or query. In our case SentenceTransformer makes an API call to Infinity that we started in the preparations.
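
To make the idea concrete, here is a minimal sketch of what such an embedder could look like. It is not the lab's SentenceTransformer class: the name InfinityEmbedder is hypothetical, and both the /embeddings path and the OpenAI-style response shape (data[].embedding) are assumptions that should be checked against the Swagger docs of your Infinity version. What the Chroma JavaScript client needs is an object with an async generate(texts) method that returns one vector per text.

```javascript
// Pull the vectors out of an embeddings response.
// Assumption: the response looks like { data: [{ embedding: [...] }, ...] }.
function extractEmbeddings(payload) {
  if (!payload || !Array.isArray(payload.data)) {
    throw new TypeError("Unexpected embeddings response shape");
  }
  return payload.data.map((item) => item.embedding);
}

// Hypothetical embedder that forwards texts to a local Infinity instance.
class InfinityEmbedder {
  constructor({
    url = "http://localhost:8080/embeddings", // assumed path, verify in Swagger
    model = "sentence-transformers/all-MiniLM-L6-v2",
  } = {}) {
    this.url = url;
    this.model = model;
  }

  // Called with an array of texts; resolves to one vector per text.
  async generate(texts) {
    const response = await fetch(this.url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model: this.model, input: texts }),
    });
    if (!response.ok) {
      throw new Error(`Embedding request failed with status ${response.status}`);
    }
    return extractEmbeddings(await response.json());
  }
}
```

An instance of such a class would be passed as embeddingFunction when getting or creating a collection, in the same way as embedder above.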

The Chroma API can work with batches and in this example we add 20 documents per batch.

await collection.upsert({
  ids: ids,
  metadatas: metadatas,
  documents: documents,
});

The vector for each item is based on the contents of the corresponding item in the documents array. Chroma expects a unique ID in string format for each item and can store metadata as well, which is practical if you want to keep extra information about the data.
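
The batching itself can be sketched as a small helper that slices an array into groups of 20. This is an illustration under assumptions, not the code from indexMsMarco.mjs; the name toBatches is hypothetical.

```javascript
// Split an array into consecutive batches of at most batchSize items.
function toBatches(items, batchSize = 20) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let start = 0; start < items.length; start += batchSize) {
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}

// 45 placeholder documents end up in batches of 20, 20 and 5.
const docs = Array.from({ length: 45 }, (_, i) => `doc-${i}`);
const batches = toBatches(docs);
console.log(batches.map((batch) => batch.length)); // [ 20, 20, 5 ]
```

Each batch would then be sent in one upsert call, with ids, metadatas and documents arrays of equal length so that position i in each array describes the same item.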

Now that the data has been embedded and stored in Chroma, let's try a query.

Run: node query.mjs

It's fairly simple: first get the collection and specify the embedding function. Remember that to find documents that are semantically near the query, the query itself also needs to be embedded.

const collection = await client.getCollection({
  name: COLLECTION_MSMARCO,
  embeddingFunction: embedder
});

const results = await collection.query({
  nResults: 10,
  queryTexts: ["What is the weather like in Jamaica?"],
});

console.log(results); 

The response contains nResults items, ordered by their distance to the query.
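
To get a feel for what "distance" means here, the sketch below ranks a few toy vectors against a query vector. It assumes squared Euclidean (L2) distance, which is Chroma's documented default, and also shows cosine distance, which a collection can be configured to use instead. The two-dimensional vectors and function names are made up for illustration; real embeddings from all-MiniLM-L6-v2 have 384 dimensions.

```javascript
// Squared Euclidean distance between two vectors of equal length.
function squaredL2(a, b) {
  if (a.length !== b.length) throw new RangeError("Vectors must have the same length");
  return a.reduce((sum, value, i) => sum + (value - b[i]) ** 2, 0);
}

// Cosine distance: 0 for the same direction, 1 for orthogonal, 2 for opposite.
function cosineDistance(a, b) {
  if (a.length !== b.length) throw new RangeError("Vectors must have the same length");
  const dot = a.reduce((sum, value, i) => sum + value * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, value) => sum + value * value, 0));
  return 1 - dot / (norm(a) * norm(b));
}

// Sort documents by ascending distance to the query vector.
function rankByDistance(queryVector, documents, distance = squaredL2) {
  return documents
    .map((doc) => ({ id: doc.id, distance: distance(queryVector, doc.vector) }))
    .sort((x, y) => x.distance - y.distance);
}

const toyDocs = [
  { id: "opposite", vector: [-1, 0] },
  { id: "close", vector: [0.9, 0.1] },
  { id: "orthogonal", vector: [0, 1] },
];
const ranked = rankByDistance([1, 0], toyDocs);
console.log(ranked.map((r) => r.id)); // [ 'close', 'orthogonal', 'opposite' ]
```

A smaller distance means the document's embedding lies closer to the query's embedding, which is why the first item in the response is the best match.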

Exercise

Try some different queries by editing query.mjs:

  • How well do the results match for more complicated and specific questions?
  • What are the results for questions not relevant to the current dataset?
  • What happens if you give a minimal query like "Jamaica" or "Hydrogen"?
  • If you search for Volvo, which is never mentioned in the dataset, what type of results do you get? Where does this knowledge come from?

Lab 2 - Recipes

Let's see how well a vector database manages more structured data. A couple of years ago there was an initiative to create an open database of recipes. Unfortunately it never took off, but I managed to find a copy of the collected data and placed it in recipes.

The data contains the name of the recipe, its ingredients and a link to the web site that published the recipe.

First unzip the recipes: cd recipes && unzip 20170107-061401-recipeitems.json.zip && cd ..

Run: node indexRecipes.mjs

This will use the name of the recipe and its ingredients to build the vector embedding and store the rest of the data as metadata. The collection contains more than 170 000 recipes and it will take quite a while to add them all. You can stop the indexing process after adding a couple of thousand recipes.
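
The split between embedded text and metadata could look roughly like the sketch below. It is not the code in indexRecipes.mjs: the function name toChromaRecord and the field names name, ingredients and url are assumptions about the recipe JSON. Since Chroma metadata values must be strings, numbers or booleans, nested values are dropped here.

```javascript
// Build the text to embed from name + ingredients and keep the rest as metadata.
// Assumption: each recipe has "name" and "ingredients" string fields.
function toChromaRecord(recipe, id) {
  const { name, ingredients, ...rest } = recipe;
  // Keep only primitive values, since nested objects are not valid metadata.
  const metadata = Object.fromEntries(
    Object.entries(rest).filter(([, value]) =>
      ["string", "number", "boolean"].includes(typeof value)
    )
  );
  return { id: String(id), document: `${name}\n${ingredients}`, metadata };
}

// Made-up recipe used only to show the shape of the result.
const exampleRecipe = {
  name: "Garlic Bread",
  ingredients: "1 baguette\n3 cloves garlic",
  url: "https://example.com/garlic-bread",
  ts: { $date: 1 },
};
const record = toChromaRecord(exampleRecipe, 42);
console.log(record.metadata); // { url: 'https://example.com/garlic-bread' }
```

Only the document text influences the embedding, so a search can match on recipe names and ingredients while the link comes back untouched in the metadata.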

Run: node queryRecipes.mjs

Exercise

Try some different queries by editing queryRecipes.mjs:

  • Try giving a list of ingredients that you would like to use.
  • Try using common names such as poultry, seafood, or meat of a pig. How well does the model understand these words?
  • How would you construct a query that explicitly wants to exclude recipes that contain garlic?

Lab 3 - RAG

In the RAG use case we typically store a knowledge base in the vector database and send the search results, together with the prompt, as input to an LLM.
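
As a small sketch of that last step, retrieved passages can be numbered and placed in front of the user's question. The function name buildPrompt and the prompt wording are only illustrative assumptions; no particular LLM or prompt format is prescribed by this lab.

```javascript
// Combine retrieved passages and the user's question into a single prompt string.
function buildPrompt(question, passages) {
  const context = passages
    .map((text, i) => `[${i + 1}] ${text}`)
    .join("\n\n");
  return [
    "Answer the question using only the context below.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
    "Answer:",
  ].join("\n");
}

const prompt = buildPrompt("How is the weather in Jamaica?", [
  "Jamaica is famous for having gorgeous sunny and warm weather most of the year.",
]);
console.log(prompt);
```

Numbering the passages makes it possible to ask the LLM to cite which passage an answer is based on.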

The embedding model we're using, all-MiniLM-L6-v2, has an input limit of 256 word pieces (tokens), which is roughly the same order of magnitude as 256 words. Anything beyond the limit is cut off. Larger documents therefore need to be split into chunks before they are embedded.
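
A simple way to chunk is to split on whitespace and group the words. The sketch below is an assumption-laden illustration, not the logic in indexBooks.mjs: chunkByWords is a hypothetical name, and counting words is only a rough proxy for the model's own tokenization, so a smaller maxWords leaves more headroom.

```javascript
// Split text into chunks of at most maxWords whitespace-separated words.
function chunkByWords(text, maxWords = 256) {
  if (!Number.isInteger(maxWords) || maxWords < 1) {
    throw new RangeError("maxWords must be a positive integer");
  }
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let start = 0; start < words.length; start += maxWords) {
    chunks.push(words.slice(start, start + maxWords).join(" "));
  }
  return chunks;
}

// A 600-word page becomes chunks of 256, 256 and 88 words.
const page = Array.from({ length: 600 }, (_, i) => `word${i}`).join(" ");
const chunks = chunkByWords(page);
console.log(chunks.map((chunk) => chunk.split(" ").length)); // [ 256, 256, 88 ]
```

Fixed-size chunks can cut a sentence in half. Splitting on sentence boundaries or letting chunks overlap are common refinements, at the cost of a little more code.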

In the folder books you'll find some academic papers and books on different aspects of AI. Let's add the contents of these books in a new collection.

Run: node indexBooks.mjs

It took about 3 minutes and 30 seconds on an Apple M1 Max to add the contents of all the books to the vector database.

If you look at the code in indexBooks.mjs you'll find that most of it deals with parsing PDF files. When handling data from different sources and formats this can become quite a lot of code.

We're using pdf2json, which handles PDF parsing quite well.

Since a full page of text can hold more than 256 words we need to "chunk" it up into smaller pieces. In the metadata we store a reference to the file, page and the chunk index like this:

metadatas.push({
    source: currentBook,
    page: pageNo,
    chunk: index,
    totalNoofChunks: chunks.length,
}); 

The ID for each chunk is constructed like this:

ids.push(currentBook + '_p_' + pageNo + '_c_' + chunkIndex);

Exercise

  • Update query.mjs to use the new collection COLLECTION_BOOKS
  • Play around with some different queries and see if the results seem to be relevant.
  • Extend query.mjs so that it returns a more complete piece of text that could be sent to an LLM chat as context.
    • For each of the top three items in the result, fetch the surrounding chunks and concatenate them to three larger pieces of text. Check out the client API for Chroma here
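
One possible starting point for the last exercise is to compute the IDs of the neighbouring chunks from a hit's metadata and fetch them with collection.get. This is a sketch under assumptions: the helper names are hypothetical, chunk indexes are assumed to start at 0, and neighbours are only looked up within the same page. The ID format mirrors the one shown above.

```javascript
// Rebuild a chunk ID in the same format as used when indexing.
function chunkId(source, page, chunk) {
  return source + "_p_" + page + "_c_" + chunk;
}

// IDs of the hit and up to `radius` chunks on each side, within the same page.
// Assumption: chunk indexes run from 0 to totalNoofChunks - 1.
function neighbourIds(meta, radius = 1) {
  const first = Math.max(0, meta.chunk - radius);
  const last = Math.min(meta.totalNoofChunks - 1, meta.chunk + radius);
  const ids = [];
  for (let chunk = first; chunk <= last; chunk++) {
    ids.push(chunkId(meta.source, meta.page, chunk));
  }
  return ids;
}

// Fetch the neighbouring chunks and join them in reading order.
async function expandHit(collection, meta, radius = 1) {
  const ids = neighbourIds(meta, radius);
  const found = await collection.get({ ids });
  // Do not rely on the order of the returned items; look them up by ID.
  const textById = new Map(found.ids.map((id, i) => [id, found.documents[i]]));
  return ids.map((id) => textById.get(id) ?? "").join(" ");
}

console.log(neighbourIds({ source: "book.pdf", page: 3, chunk: 0, totalNoofChunks: 2 }));
// [ 'book.pdf_p_3_c_0', 'book.pdf_p_3_c_1' ]
```

With the query results in hand, the metadata of the top three hits is available in results.metadatas[0], so expandHit could be called once per hit to build the three larger pieces of text.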

A full RAG-example

Tommy Wassgren has created a full RAG implementation about Digital Sustainability that indexes resources from the web. Check it out here

Prerequisites

  • Node >v21.4.0
  • Docker Desktop (or similar)
  • IDE, for example Visual Studio Code
