Search and information retrieval is a challenging problem. With the proliferation of vector search tools on the market, the focus has shifted heavily toward SEO and marketing wins rather than fundamental quality.
The Retrieval Optimizer from Redis focuses on measuring and improving retrieval quality. This framework helps determine optimal embedding models, retrieval strategies, and index configurations for your specific data and use case.
Make sure you have the following tools available:

- git
- Python with Poetry
- Docker (with Docker Compose)
- A running Redis instance
Clone the repository:

```bash
git clone https://github.com/redis-applied-ai/retrieval-optimizer.git
cd retrieval-optimizer
```
The retrieval optimizer requires two sets of data to run an optimization study.
1. **Chunk data**: the core knowledge base of data to be embedded in Redis. Think of these records as your "chunks".
Expected Format:

```json
[
  {
    "text": "example content",
    "item_id": "abc:123"
  }
]
```
2. **Labeled data**: ground truth data for generating the metrics that will be compared between samples.
Expected Format:

```json
[
  {
    "query": "How long have sea turtles existed on Earth?",
    "relevant_item_ids": ["abc:1", "def:54", "hij:42"]
  }
]
```
Under the hood, the `item_id` is used to test whether a vector query found the desired results (chunks), so this identifier must be unique to the text provided as input.
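As a quick sanity check before running a study, you can verify that the identifiers are unique; this snippet is illustrative and the file name is hypothetical:

```python
import json

# Hypothetical corpus file in the expected format shown above.
with open("corpus_chunks.json") as f:
    corpus = json.load(f)

ids = [record["item_id"] for record in corpus]
assert len(ids) == len(set(ids)), "every item_id must be unique to its chunk"
```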
**Important:** The next section covers how to create this input data. If you already have it available, you can skip ahead.
Follow along with examples/getting_started/populate_index.ipynb to see an end-to-end example of data prep for retrieval optimization.
This guide will walk you through:
- chunking source data
- exporting that data to a format for use with the optimizer (these first two steps are sketched after this list)
- creating vector representations of the data
- loading them into a vector index
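For illustration, here is a minimal sketch of the first two steps, chunking a source document and exporting it in the expected corpus format; the file names and chunk size are hypothetical, and the notebook above shows the full workflow including embedding and index loading:

```python
import json

# Hypothetical source document; substitute your own data.
with open("source_document.txt") as f:
    raw_text = f.read()

# Naive fixed-size chunking; real pipelines often split on sentences or tokens.
CHUNK_SIZE = 500
chunks = [raw_text[i : i + CHUNK_SIZE] for i in range(0, len(raw_text), CHUNK_SIZE)]

# Export in the corpus format the optimizer expects: one record per chunk,
# each with an item_id unique to its text.
corpus = [{"text": chunk, "item_id": f"doc:{i}"} for i, chunk in enumerate(chunks)]

with open("corpus_chunks.json", "w") as f:
    json.dump(corpus, f, indent=2)
```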
Sometimes you have a predefined dataset of queries and expected matches, but this is not always the case, so we built a simple web GUI to help. Assuming you have created data and populated an initial vector index with it, you can run the labeling app for a more convenient labeling experience.
- First set up a fresh environment file:

```bash
cp label_app/.env.template label_app/.env
```
- Update the `.env` file (below is an example):

```bash
REDIS_URL=<Redis connection url>
LABELED_DATA_PATH=<file location for exported output>
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
SCHEMA_PATH=schema/index_schema.yaml
# Fields to return from the index; see label_app/main.py for the implementation
ID_FIELD_NAME=<unique id of a chunk or any item stored in the vector index>
CHUNK_FIELD_NAME=<text content>
```
- Environment variable options:

| Variable | Example Value | Description | Required |
|---|---|---|---|
| REDIS_URL | redis://localhost:6379 | Redis connection URL | Yes |
| LABELED_DATA_PATH | label_app/data/labeled.json | File path where labeled data will be exported | Yes |
| EMBEDDING_MODEL | sentence-transformers/all-MiniLM-L6-v2 | Name of the embedding model to use | Yes |
| SCHEMA_PATH | schema/index_schema.yaml | Path to the index schema configuration | Yes |
| ID_FIELD_NAME | item_id | Field name containing the unique identifier in the index | Yes |
| CHUNK_FIELD_NAME | text | Field name containing the text content in the index | Yes |
- Run the data labeling app:

```bash
docker compose up
```
This serves the data labeling app at localhost:8000/label. You can also interact with the Swagger docs at localhost:8000/docs.
The data labeling app connects to the index specified in the file referenced by the SCHEMA_PATH environment variable (by default, label_app/schema/index_schema.yaml). If it connects properly, you will see the name of the index and the number of documents it has indexed.
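For reference, here is a minimal sketch of what such a schema might look like, assuming the RedisVL YAML schema format; the index name, prefix, and vector dimensions are illustrative and should match your own data and embedding model:

```yaml
version: "0.1.0"

index:
  name: optimize        # shown by the labeling app once it connects
  prefix: chunk

fields:
  - name: item_id       # matches ID_FIELD_NAME in .env
    type: tag
  - name: text          # matches CHUNK_FIELD_NAME in .env
    type: text
  - name: vector
    type: vector
    attrs:
      dims: 384         # must match the embedding model (all-MiniLM-L6-v2)
      distance_metric: cosine
      algorithm: hnsw
      datatype: float32
```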
From here you can start making queries against your index, label the relevant chunks, and export them to a JSON file for use in the optimization. This is also a good way to test what's happening with your vector retrieval.
With your data prepared, you can now run optimization studies. A study is defined by a config that specifies the parameters and ranges to test and compare against your data.
Check out the following step-by-step notebooks for running the optimization process:
- Getting started: examples/getting_started/retrieval_optimizer.ipynb
- Adding a custom retriever: examples/getting_started/custom_retriever_optimizer.ipynb
The study config looks like this (see ex_study_config.yaml for an example):

```yaml
# path to data files for easy read
raw_data_path: "label_app/data/2008-mazda3-chunks.json"
input_data_type: "json"
labeled_data_path: "label_app/data/mazda_labeled_items.json"

# metrics to be used in objective function
metric_weights:
  f1_at_k: 1
  embedding_latency: 1
  total_indexing_time: 1

# constraints for the optimization
n_trials: 10
n_jobs: 1
ret_k: [1, 10] # range of values to be sampled during the study
ef_runtime: [10, 50]
ef_construction: [100, 300]
m: [8, 64]

# embedding models to be used
embedding_models:
  - provider: "hf"
    model: "sentence-transformers/all-MiniLM-L6-v2"
    dim: 384
  - provider: "hf"
    model: "intfloat/e5-large-v2"
    dim: 1024
```
| Variable | Example Value | Description | Required |
|---|---|---|---|
| raw_data_path | label_app/data/2008-mazda3-chunks.json | Path to raw data file | ✅ |
| labeled_data_path | label_app/data/mazda-labeled-rewritten.json | Path to labeled data file | ✅ |
| algorithms | flat, hnsw | Indexing algorithms to be tested in optimization | ✅ |
| vector_data_types | float32, float16 | Data types to be tested for vectors | ✅ |
| n_trials | 15 | Number of optimization trials | ✅ |
| n_jobs | 1 | Number of parallel jobs | ✅ |
| ret_k | [1, 10] | Range of values to be tested for k in retrieval | ✅ |
| embedding_models | provider: hf, model: sentence-transformers/all-MiniLM-L6-v2, dim: 384 | List of embedding models and their dimensions | ✅ |
| metric_weights | f1_at_k: 1, embedding_latency: 1, total_indexing_time: 1 | Weight for each metric used in the objective function | Defaults to example |
| input_data_type | json | Type of input data | Defaults to example |
| redis_url | redis://localhost:6379 | Connection string for the Redis instance | Defaults to example |
| ef_runtime | [10, 20, 30, 50] | Max top candidates during search for HNSW | Defaults to example |
| ef_construction | [100, 150, 200, 250, 300] | Max number of connected neighbors to consider during graph building for HNSW | Defaults to example |
| m | [8, 16, 64] | Max number of outgoing edges per node per layer for HNSW | Defaults to example |
Finally, install the dependencies and run the study:

```bash
poetry install
poetry run study --config optimize/ex_study_config.yaml
```
This framework uses Optuna to implement Bayesian Optimization, a common pattern for tuning hyperparameters. Bayesian Optimization works by building a probabilistic model (typically a Gaussian Process) of the objective function and iteratively selecting the most promising configurations to evaluate. Unlike grid or random search, it balances exploration (trying new regions of the parameter space) and exploitation (focusing on promising areas), finding good hyperparameters with fewer evaluations. This is particularly useful for expensive-to-evaluate functions, such as training machine learning models. By guiding the search with prior knowledge and updating beliefs based on observed performance, Bayesian Optimization can significantly improve both accuracy and efficiency in hyperparameter tuning.
In our case, we want to maximize the precision and recall of our vector search system while balancing performance tradeoffs such as embedding and indexing latency. Bayesian optimization gives us an automated way of testing all the knobs at our disposal to see which ones best optimize retrieval.
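As a rough illustration (not the framework's actual code), the optimization loop resembles the following Optuna sketch; `evaluate_trial_config` and its random metrics are hypothetical stand-ins for building an index and running the labeled queries against it:

```python
import random

import optuna


def evaluate_trial_config(model, ret_k, ef_runtime, ef_construction, m):
    # Stub: the real framework builds an index with this configuration and
    # runs the labeled queries against it to measure these metrics.
    return {
        "f1_at_k": random.random(),
        "embedding_latency": random.random(),
        "total_indexing_time": random.random(),
    }


def objective(trial: optuna.Trial) -> float:
    # Sample one configuration from the ranges defined in the study config.
    model = trial.suggest_categorical(
        "model",
        ["sentence-transformers/all-MiniLM-L6-v2", "intfloat/e5-large-v2"],
    )
    ret_k = trial.suggest_int("ret_k", 1, 10)
    ef_runtime = trial.suggest_int("ef_runtime", 10, 50)
    ef_construction = trial.suggest_int("ef_construction", 100, 300)
    m = trial.suggest_int("m", 8, 64)

    metrics = evaluate_trial_config(model, ret_k, ef_runtime, ef_construction, m)

    # Weighted objective: reward retrieval quality, penalize latency costs
    # (the weights mirror metric_weights in the study config).
    return (
        1.0 * metrics["f1_at_k"]
        - 1.0 * metrics["embedding_latency"]
        - 1.0 * metrics["total_indexing_time"]
    )


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10, n_jobs=1)
print(study.best_params)
```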