For this session we'll be using a vector database called Chroma and a self-hosted embedding API called Infinity, which serves a SentenceTransformer model for creating embeddings.
This setup could be used as the Retrieval step in a RAG.
Here is a diagram for what we will be building in this lab:
And a sequence diagram:
sequenceDiagram
actor User
User->>Application: User Query
Application->>Infinity: Create Embedding of User Query
activate Infinity
Infinity-->>Application: Vector Embedding
deactivate Infinity
Application->>Chroma: Find nearest documents based on Vector Embedding
activate Chroma
Chroma-->>Application: Results
deactivate Chroma
Application-)User: Results
Make sure you've read the prerequisites for this lab.
First, let's get all the parts running.
- Infinity
- To generate the vector embeddings we need a language model and an application that can extract the vector. A very common library for this is SentenceTransformer, which is implemented in Python. Another way is to use an online service to get the embeddings, for example OpenAI's API. In this lab we'll use a self-hosted API called Infinity. The model chosen for this lab is all-MiniLM-L6-v2, which is suitable for clustering or semantic search.
- Start the Infinity API:
docker run -it -p 8080:8080 michaelf34/infinity:0.0.20 --model-name-or-path sentence-transformers/all-MiniLM-L6-v2 --port 8080
- Let's generate a vector embedding! Go to the Swagger docs for the API, click on "Try it out", add a sentence to the "input" array and press "Execute". (If you'd rather do this from a script, a small sketch follows after these setup steps.)
- You should get a response with a vector in 384 dimensions.
- Chroma
- Chroma is a lightweight vector database that has a REST API and language bindings for Python and JavaScript.
- Start Chroma in Docker:
docker run -p 8000:8000 chromadb/chroma
- Verify it is up and running by viewing the FastAPI docs at http://localhost:8000/docs
- [WINDOWS] On some Windows machines, port 8000 is already in use. To solve this, select another port:
docker run -p <ANOTHER_PORT>:8000 chromadb/chroma
- [WINDOWS] Now verify it by going to
http://localhost:<ANOTHER_PORT>/docs
- [WINDOWS] Changing the port also affects some of the scripts used later in the different labs. The relevant changes are marked as [WINDOWS] in the code.
- Clone this repo and install dependencies
git clone git@github.com:cygni/cygni-competence-vectordbs.git
cd cygni-competence-vectordbs && npm install
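If you prefer a script over the Swagger UI, here is a minimal Node sketch that calls the same endpoint. It assumes Infinity exposes the OpenAI-style /embeddings route shown in its Swagger docs and that each input comes back as an object with an embedding field; if your Swagger page shows a different path or schema, adjust accordingly. Save it as, say, verify-infinity.mjs and run it with node:
// verify-infinity.mjs - sanity check of the Infinity embedding service (sketch)
const response = await fetch('http://localhost:8080/embeddings', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'sentence-transformers/all-MiniLM-L6-v2',
    input: ['How is the weather in Jamaica?'],
  }),
});
const json = await response.json();
// Each input sentence should come back as a 384-dimensional vector.
console.log('dimensions:', json.data?.[0]?.embedding?.length);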
Let's see how well the language model works and how to add records to Chroma by adding a subset of the MS MARCO dataset. MS MARCO is a question answering dataset featuring 100 000 real Bing questions and human-generated answers.
For this exercise we'll use only two of the questions with around 1000 different answers to each question. You can find the file in /msmarco/msmarco-subset.tsv.
The content is tab-separated with the following columns:
qid | pid | query | passage |
---|---|---|---|
1133167 | 4712273 | how is the weather in jamaica | Jamaica is famous for having gorgeous sunny and warm weather most of the year. That is one of the main reasons that people like to vacation in Jamaica. If you’re making travel plans to visit Jamaica you’re probably trying to find out what the best time of the year is to visit Jamaica so that you can make the most of your vacation. Well, what the best time of year is depends on what you like to do. If you want to sit on a beach with a cold drink and bask in the warm sun for a few days then the best time to plan a trip to Jamaica is during the summer months when the temperature is usually between 80 and 90 degrees every day. |
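If you want to inspect the file programmatically, each line can be split on its tab characters. A rough sketch of reading the subset (the actual parsing lives in indexMsMarco.mjs and may look different):
import { readFileSync } from 'node:fs';

// Read the subset and split each non-empty line into its four tab-separated columns.
const lines = readFileSync('msmarco/msmarco-subset.tsv', 'utf8')
  .split('\n')
  .filter((line) => line.trim().length > 0);

for (const line of lines.slice(0, 3)) {
  const [qid, pid, query, passage] = line.split('\t');
  console.log({ qid, pid, query, passage: passage.slice(0, 60) });
}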
The two questions we'll be working with are:
- How is the weather in Jamaica?
- Hydrogen is a liquid below what temperature?
Have a look at the dataset and some of the different answers to these questions. Some are relevant, and some don't really even answer the question.
There is a prepared program that reads the dataset and upserts the data to Chroma. Run:
node indexMsMarco.mjs
The upsert of about 2500 lines of data takes around 4 minutes on an Apple M1 Max. While the program is running, have a look at the code in indexMsMarco.mjs.
It starts with:
const client = new ChromaClient();
const embedder = new SentenceTransformer('key-not-needed');
// Prepare the collection
const collection = await client.getOrCreateCollection({
name: COLLECTION_NAME,
embeddingFunction: embedder,
});
What is the embedder? It is a function that is called by the ChromaClient to get the vector embedding for each document or query. In our case, SentenceTransformer makes an API call to the Infinity service we started in the preparations.
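Conceptually, such an embedder is just an object with a generate function that turns an array of texts into an array of vectors. Here is a minimal sketch of what it could look like; the SentenceTransformer class in the repo may be implemented differently, and the /embeddings route and response shape are assumptions based on Infinity's Swagger docs:
// Sketch of an embedding function for the Chroma JS client: it must expose
// generate(texts) and return one vector per input text, in the same order.
class InfinityEmbedder {
  constructor(baseUrl = 'http://localhost:8080') {
    this.baseUrl = baseUrl;
  }

  async generate(texts) {
    const response = await fetch(`${this.baseUrl}/embeddings`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: 'sentence-transformers/all-MiniLM-L6-v2',
        input: texts,
      }),
    });
    const json = await response.json();
    return json.data.map((item) => item.embedding);
  }
}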
The Chroma API can work with batches and in this example we add 20 documents per batch.
await collection.upsert({
ids: ids,
metadatas: metadatas,
documents: documents
});
The vector for each item is based on the contents of the corresponding item in the documents array. Chroma expects a unique ID in string format and can store metadata as well, which is practical if you want to keep extra information about the data.
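The batching mentioned above is just a matter of slicing the parallel arrays before each call to upsert. A rough sketch (indexMsMarco.mjs may structure this loop differently):
const BATCH_SIZE = 20;

// Upsert the parallel arrays in slices of BATCH_SIZE documents at a time.
async function upsertInBatches(collection, { ids, metadatas, documents }) {
  for (let start = 0; start < ids.length; start += BATCH_SIZE) {
    await collection.upsert({
      ids: ids.slice(start, start + BATCH_SIZE),
      metadatas: metadatas.slice(start, start + BATCH_SIZE),
      documents: documents.slice(start, start + BATCH_SIZE),
    });
  }
}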
Now that the data has been embedded and stored in Chroma, let's try a query.
Run: node query.mjs
It's fairly simple: first get the collection and specify the embedding function. Remember that to find documents that are semantically near the query, the query itself also needs to be embedded.
const collection = await client.getCollection({
name: COLLECTION_MSMARCO,
embeddingFunction: embedder
});
const results = await collection.query({
nResults: 10,
queryTexts: ["What is the weather like in Jamaica?"],
});
console.log(results);
The response contains nResults items, ordered by their distance to the query.
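The result object holds parallel arrays with one inner array per query text, so printing the top matches together with their distances can look something like this sketch (check the exact shape in your own console output):
// results.documents and results.distances each contain one inner array per query text.
const topDocuments = results.documents[0];
const topDistances = results.distances[0];

topDocuments.forEach((doc, i) => {
  console.log(`#${i + 1} (distance ${topDistances[i].toFixed(4)}): ${doc.slice(0, 80)}`);
});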
Try some different queries by editing query.mjs:
- How well do the results match for more complicated and specific questions?
- What are the results for questions not relevant to the current dataset?
- What happens if you give a minimal query like "Jamaica" or "Hydrogen"?
- If you search for Volvo, which is never mentioned in the dataset, what type of results do you get? Where does this knowledge come from?
Let's see how well a vector database manages more structured data. A couple of years ago there was an initiative to create an open database of recipes; unfortunately it never took off. But I managed to find a copy of their collected data and placed it in the recipes folder.
The data contains the name of the recipe, its ingredients and a link to the web site that published the recipe.
First unzip the recipes:
cd recipes && unzip 20170107-061401-recipeitems.json.zip && cd ..
Run: node indexRecipes.mjs
This will use the name of the recipe and its ingredients to build the vector embedding and store the rest of the data as metadata. The collection contains more than 170 000 recipes and it will take quite a while to add them all. You can stop the index process after adding a couple of thousand recipes.
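Conceptually, each recipe is turned into a text to embed plus some metadata to keep alongside it. A rough sketch of that mapping; the field names (name, ingredients, url) are assumptions about the recipe JSON, and indexRecipes.mjs may do this differently:
// Sketch: turn one parsed recipe object into the pieces Chroma needs.
function toChromaRecord(recipe, lineNo) {
  return {
    id: String(lineNo),                                   // any unique string works as an ID
    document: `${recipe.name}\n${recipe.ingredients}`,    // the text that gets embedded
    metadata: { name: recipe.name, url: recipe.url },     // stored alongside the vector
  };
}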
Run: node queryRecipes.mjs
Try some different queries by editing queryRecipes.mjs:
- Try giving a list of ingredients that you would like to use.
- Try using common names such as poultry, seafood, or meat of a pig. How well does the model understand these words?
- How would you construct a query that explicitly wants to exclude recipes that contain garlic?
In the RAG use-case we typically store a knowledge base in the vector database and use the search results, together with the prompt, as input to an LLM.
The embedding model we're using, all-MiniLM-L6-v2, has a limit of 256 word pieces – anything beyond that is cut off. Therefore the text needs to be chunked before embedding when dealing with larger documents.
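A chunking step can be as simple as splitting the text on whitespace and grouping a fixed number of words per chunk. A minimal sketch (indexBooks.mjs may use a different chunk size or strategy):
// Split a page of text into chunks of at most maxWords words.
function chunkText(text, maxWords = 200) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let start = 0; start < words.length; start += maxWords) {
    chunks.push(words.slice(start, start + maxWords).join(' '));
  }
  return chunks;
}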
In the folder books you'll find some academic papers and books on different aspects of AI. Let's add the contents of these books to a new collection.
Run: node indexBooks.mjs
It took about 3:30 minutes on an Apple M1 Max to add all the books' contents to the vector database.
If you look at the code in indexBooks.mjs you'll find that most of it deals with parsing the PDF files. When handling data from different sources and formats, this can quickly become a lot of code.
We're using pdf2json, which handles PDF parsing quite well.
Since a full page of text can easily exceed that limit, we need to "chunk" it into smaller pieces. In the metadata we store a reference to the file, the page, and the chunk index like this:
metadatas.push({
source: currentBook,
page: pageNo,
chunk: index,
totalNoofChunks: chunks.length,
});
The ID for each chunk is constructed like this:
ids.push(currentBook + '_p_' + pageNo + '_c_' + chunkIndex);
- Update query.mjs to use the new collection COLLECTION_BOOKS.
- Play around with some different queries and see if the results seem to be relevant.
- Extend query.mjs so that it can return a more complete piece of text that could be sent to an LLM chat as context.
- For each of the top three items in the result, fetch the surrounding chunks and concatenate them into three larger pieces of text. Check out the client API for Chroma here
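One way to approach this: since the chunk IDs follow the <book>_p_<page>_c_<index> pattern, you can reconstruct the IDs of the neighbouring chunks and fetch them with collection.get. A rough sketch, leaving out ID parsing edge cases:
// Given the ID of a matching chunk, fetch it together with its neighbouring chunks.
async function fetchWithNeighbours(collection, id) {
  const [prefix, chunkIndex] = id.split('_c_');
  const index = Number(chunkIndex);

  const neighbourIds = [index - 1, index, index + 1]
    .filter((i) => i >= 0)
    .map((i) => `${prefix}_c_${i}`);

  // collection.get returns the matching documents; IDs that don't exist are simply skipped.
  const result = await collection.get({ ids: neighbourIds });
  return result.documents.join('\n');
}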
Tommy Wassgren has created a full RAG implementation about Digital Sustainability that indexes resources from the web. Check it out here
- Node >v21.4.0
- Docker for Desktop (or similar)
- IDE, for example Visual Studio Code