I built this as an experiment to truly understand how databases work, which I felt was only possible by building one from first principles. GhastlyDB is the result of that experiment, and I'm super excited about how it turned out.
- Multiple embedding providers:
  - OpenAI (using text-embedding-3-small model)
  - NVIDIA (using nv-embedqa-mistral-7b-v2)
  - ColBERT (local embedding support)
- LSM Tree-based storage architecture
- Memory-mapped memtable for fast writes
- SSTable-based persistent storage
- Skip list-backed memtable for efficient sorted inserts and lookups
- Thread-safe operations with concurrent access support
- Multiple similarity metrics:
  - Cosine similarity
  - Dot product
  - L2 distance
- Efficient vector comparison algorithms
- Sorted search results with similarity scores
- Linux (amd64, arm64)
- macOS (amd64, arm64)
- Windows (amd64)
- Go 1.21 or higher
- Make
- pkg-config
- ONNX Runtime (for local embedding model inference)
- Make sure libtokenizers.a is present inside /libs/static/libtokenizers. You can build it from source or find it on the releases page of HuggingFace's tokenizers port for Go. (shoutout @daulet)
brew install pkg-config
brew install onnxruntime
sudo apt-get update
sudo apt-get install build-essential pkg-config
pip install onnxruntime
pip install onnxruntime
- Clone the repository:
git clone https://github.com/ahhcash/ghastly.git
cd ghastly
- Build for your platform:
make build
This will create a binary in the bin/ directory for your current OS and architecture.
- Build for all platforms:
make build-all
This creates binaries for:
- Linux (amd64, arm64)
- macOS (amd64, arm64)
- Windows (amd64)
Building from source / using the docker container is the best way to get started. You can generate gRPC stubs or just use the REST API to perform DB operations!
Default configuration:
Config{
    Path:           "./ghastlydb_data",
    MemtableSize:   64 * 1024 * 1024, // 64MB
    Metric:         "cosine",
    EmbeddingModel: "openai",
}
import "github.com/ahhcash/ghastlydb/db"
// Initialize with default config
database, err := db.OpenDB(db.DefaultConfig())
// Store data
err = database.Put("key", "value")
// Retrieve data
value, err := database.Get("key")
// Semantic search
results, err := database.Search("query")
GhastlyDB uses a Log-Structured Merge Tree (LSM) architecture:
- Writes are buffered in an in-memory memtable (implemented as a skip list)
- When the memtable reaches its size limit, it's flushed to disk as an SSTable
- SSTables are immutable and contain sorted key-value pairs
- Background processes handle SSTable compaction
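The write path above can be sketched in a few lines of Go. This is a toy illustration, not GhastlyDB's actual code: the names `Store`, `SSTable`, `Entry`, and `NewStore` are hypothetical, a plain map stands in for the skip list, "flushed" tables stay in memory instead of going to disk, and compaction is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// Entry is one key-value pair inside a flushed table (hypothetical type).
type Entry struct{ Key, Value string }

// SSTable stands in for an on-disk table: an immutable, key-sorted slice.
type SSTable struct{ entries []Entry }

// Get binary-searches the sorted entries for key.
func (t SSTable) Get(key string) (string, bool) {
	i := sort.Search(len(t.entries), func(i int) bool { return t.entries[i].Key >= key })
	if i < len(t.entries) && t.entries[i].Key == key {
		return t.entries[i].Value, true
	}
	return "", false
}

// Store buffers writes in a memtable and flushes it once it grows past limit.
type Store struct {
	memtable map[string]string // assumption: a map instead of a skip list
	size     int               // rough count of bytes written since the last flush
	limit    int
	tables   []SSTable // oldest first
}

// NewStore creates a store whose memtable flushes at roughly limit bytes.
func NewStore(limit int) *Store {
	return &Store{memtable: map[string]string{}, limit: limit}
}

// Put writes to the memtable and flushes when the size limit is reached.
func (s *Store) Put(key, value string) {
	s.memtable[key] = value
	s.size += len(key) + len(value)
	if s.size >= s.limit {
		s.flush()
	}
}

// flush turns the memtable into a sorted, immutable table and resets it.
func (s *Store) flush() {
	entries := make([]Entry, 0, len(s.memtable))
	for k, v := range s.memtable {
		entries = append(entries, Entry{k, v})
	}
	sort.Slice(entries, func(i, j int) bool { return entries[i].Key < entries[j].Key })
	s.tables = append(s.tables, SSTable{entries})
	s.memtable = map[string]string{}
	s.size = 0
}

// Get checks the memtable first, then tables from newest to oldest,
// so the most recent write for a key always wins.
func (s *Store) Get(key string) (string, bool) {
	if v, ok := s.memtable[key]; ok {
		return v, true
	}
	for i := len(s.tables) - 1; i >= 0; i-- {
		if v, ok := s.tables[i].Get(key); ok {
			return v, true
		}
	}
	return "", false
}

func main() {
	s := NewStore(16)
	s.Put("apple", "red")
	s.Put("banana", "yellow") // pushes the memtable past 16 bytes and triggers a flush
	s.Put("apple", "green")   // newer value lives in the fresh memtable
	v, _ := s.Get("apple")
	fmt.Println(v, len(s.tables)) // prints "green 1"
}
```

Reading newest-first is what lets flushed tables stay immutable: an update never rewrites an old table, it just shadows the older value until compaction merges them.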
The search implementation supports multiple distance metrics:
- Cosine similarity for normalized vectors
- Dot product for raw similarity
- L2 distance for Euclidean space
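For reference, here is a minimal sketch of the three metrics and of sorting results by score. The function names (`Dot`, `Cosine`, `L2`, `Rank`) and the `Result` type are made up for this example and are not necessarily what the `search` package exposes; the sketch also assumes both vectors have the same length.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Dot returns the dot product of two equal-length vectors.
func Dot(a, b []float64) float64 {
	var sum float64
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

// Cosine returns the cosine similarity, or 0 if either vector has zero norm.
func Cosine(a, b []float64) float64 {
	na, nb := math.Sqrt(Dot(a, a)), math.Sqrt(Dot(b, b))
	if na == 0 || nb == 0 {
		return 0
	}
	return Dot(a, b) / (na * nb)
}

// L2 returns the Euclidean distance between two equal-length vectors.
func L2(a, b []float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// Result pairs a stored key with its similarity score (hypothetical type).
type Result struct {
	Key   string
	Score float64
}

// Rank scores every stored vector against the query using cosine
// similarity and returns the results sorted from most to least similar.
func Rank(query []float64, vectors map[string][]float64) []Result {
	results := make([]Result, 0, len(vectors))
	for k, v := range vectors {
		results = append(results, Result{k, Cosine(query, v)})
	}
	sort.Slice(results, func(i, j int) bool { return results[i].Score > results[j].Score })
	return results
}

func main() {
	vectors := map[string][]float64{
		"a": {1, 0},
		"b": {0, 1},
		"c": {1, 1},
	}
	for _, r := range Rank([]float64{1, 0}, vectors) {
		fmt.Printf("%s %.3f\n", r.Key, r.Score)
	}
}
```

Note that cosine and dot product are similarities (higher is closer), while L2 is a distance (lower is closer), so a ranking based on L2 would sort in ascending order instead.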
OpenAI: Cloud-based embeddings using text-embedding-3-small
NVIDIA: Cloud-based embeddings using nv-embedqa-mistral-7b-v2
ColBERT: Local inference using ONNX Runtime and libtokenizers on colBERT-ir/v2
make test # Run tests
make coverage # Generate coverage report
make lint # Run golangci-lint
make fmt # Format code
Directory structure:
└── ahhcash-ghastly/
    ├── README.md
    ├── Dockerfile
    ├── Makefile
    ├── go.mod
    ├── go.sum
    ├── .golangci.yml
    ├── clients/
    │   └── python/
    │       ├── __init__.py
    │       ├── client.py
    │       ├── setup.py
    │       └── test_client.py
    ├── cmd/
    │   └── main.go
    ├── db/
    │   ├── db.go
    │   └── db_test.go
    ├── embed/
    │   ├── embedder.go
    │   ├── local/
    │   │   └── colbert/
    │   │       ├── config.go
    │   │       ├── darwin.go
    │   │       ├── embed.go
    │   │       ├── linux.go
    │   │       ├── platform_specific.go
    │   │       └── windows.go
    │   ├── nvidia/
    │   │   ├── embed.go
    │   │   └── types.go
    │   └── openai/
    │       ├── embed.go
    │       └── types.go
    ├── grpc/
    │   ├── gen/
    │   │   └── grpc/
    │   │       └── proto/
    │   │           ├── ghastly.pb.go
    │   │           └── ghastly_grpc.pb.go
    │   ├── proto/
    │   │   └── ghastly.proto
    │   └── server/
    │       └── server.go
    ├── http/
    │   └── server/
    │       └── server.go
    ├── index/
    │   ├── connections.go
    │   ├── hnsw.go
    │   └── search.go
    ├── libs/
    │   └── static/
    │       └── libtokenizers/
    │           └── .gitkeep
    ├── mocks/
    │   └── embedder.go
    ├── search/
    │   ├── cosine.go
    │   ├── dot.go
    │   ├── l2.go
    │   └── metrics_test.go
    ├── storage/
    │   ├── memtable.go
    │   ├── memtable_test.go
    │   ├── skiplist.go
    │   ├── skiplist_test.go
    │   ├── sstable.go
    │   ├── store.go
    │   └── store_test.go
    └── .github/
        └── workflows/
            └── build_and_deploy.yml
I would absolutely love any feedback / contributions! Please open a PR, and I'll gladly take a look :)