ahhcash/ghastly

GhastlyDB - a super lightweight vector database in Go

I built this as an experiment to truly understand how databases work, which I could only do by building one from first principles. GhastlyDB is the result, and I'm super excited about how it turned out.

Features 💪

Embedding Support

  • Multiple embedding providers:
    • OpenAI (using text-embedding-3-small model)
    • NVIDIA (using nv-embedqa-mistral-7b-v2)
    • ColBERT (local embedding support)

Storage Engine

  • LSM Tree-based storage architecture
  • Memory-mapped memtable for fast writes
  • SSTable-based persistent storage
  • Skip list implementation backing the memtable, for efficient sorted inserts and lookups
  • Thread-safe operations with concurrent access support

Search Capabilities

  • Multiple similarity metrics:
    • Cosine similarity
    • Dot product
    • L2 distance
  • Efficient vector comparison algorithms
  • Sorted search results with similarity scores

Cross-Platform Support

  • Linux (amd64, arm64)
  • macOS (amd64, arm64)
  • Windows (amd64)

Installation 💾

Prerequisites

  • Go 1.21 or higher
  • Make
  • pkg-config

Local inference specific dependencies

  • ONNX Runtime (for local embedding model inference)
  • Make sure libtokenizers.a is present inside /libs/static/libtokenizers. You can build it from source or find it on the releases page of the Go port of HuggingFace's tokenizers. (shoutout @daulet)

Platform-Specific Dependencies

macOS

brew install pkg-config
brew install onnxruntime

Linux

sudo apt-get update
sudo apt-get install build-essential pkg-config
pip install onnxruntime

Windows

pip install onnxruntime

Building From Source

  1. Clone the repository:
git clone https://github.com/ahhcash/ghastly.git
cd ghastly
  2. Build for your platform:
make build

This will create a binary in the bin/ directory for your current OS and architecture.

  3. Build for all platforms:
make build-all

This creates binaries for:

  • Linux (amd64, arm64)
  • macOS (amd64, arm64)
  • Windows (amd64)

Usage 🧑‍💻

Building from source or running the Docker container is the best way to get started. You can generate gRPC stubs or just use the REST API to perform DB operations!

Configuration

Default configuration:

Config{
Path:           "./ghastlydb_data",
MemtableSize:   64 * 1024 * 1024, // 64MB
Metric:         "cosine",
EmbeddingModel: "openai",
}
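
Overriding the defaults might look like the sketch below. The field types, the standalone `Config` struct, and the `"l2"` metric name are assumptions for illustration; check the `db` package for the real definitions.

```go
package main

import "fmt"

// Config mirrors the default configuration shown above.
// Field types are assumptions made for this sketch.
type Config struct {
	Path           string // directory where data files are written
	MemtableSize   int    // flush threshold in bytes
	Metric         string // similarity metric name
	EmbeddingModel string // embedding provider name
}

// DefaultConfig returns the documented default values.
func DefaultConfig() Config {
	return Config{
		Path:           "./ghastlydb_data",
		MemtableSize:   64 * 1024 * 1024, // 64MB
		Metric:         "cosine",
		EmbeddingModel: "openai",
	}
}

func main() {
	// Start from the defaults and change only what differs.
	cfg := DefaultConfig()
	cfg.Metric = "l2"                   // hypothetical metric name
	cfg.MemtableSize = 16 * 1024 * 1024 // smaller memtable, more frequent flushes
	fmt.Printf("%+v\n", cfg)
}
```

A smaller MemtableSize trades write buffering for lower memory use, since the memtable is flushed to disk more often.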

API Usage (Coming soon 🤫)

import "github.com/ahhcash/ghastlydb/db"

// Initialize with default config
database, err := db.OpenDB(db.DefaultConfig())

// Store data
err = database.Put("key", "value")

// Retrieve data
value, err := database.Get("key")

// Semantic search
results, err := database.Search("query")

Architecture 🛠️

Storage Layer

GhastlyDB uses a Log-Structured Merge Tree (LSM) architecture:

  • Writes are buffered in an in-memory memtable (implemented as a skip list)
  • When the memtable reaches its size limit, it's flushed to disk as an SSTable
  • SSTables are immutable and contain sorted key-value pairs
  • Background processes handle SSTable compaction
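
The write path above can be sketched in a few lines. This is a toy model, not GhastlyDB's actual code: a plain map stands in for the skip list, the flushed runs stay in memory instead of being written to disk, compaction is left out, and every name here (`store`, `entry`, `newStore`) is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is one key-value pair inside a flushed run.
type entry struct{ key, value string }

// store is a toy LSM write path: one mutable memtable plus immutable sorted runs.
type store struct {
	mem      map[string]string // stand-in for the skip list
	memBytes int               // approximate size of the memtable contents
	limit    int               // flush threshold in bytes
	tables   [][]entry         // flushed runs, oldest first, each sorted by key
}

func newStore(limit int) *store {
	return &store{mem: map[string]string{}, limit: limit}
}

// Put buffers the write in the memtable and flushes once the limit is reached.
func (s *store) Put(key, value string) {
	if old, ok := s.mem[key]; ok {
		s.memBytes -= len(key) + len(old)
	}
	s.mem[key] = value
	s.memBytes += len(key) + len(value)
	if s.memBytes >= s.limit {
		s.flush()
	}
}

// flush turns the memtable into an immutable, key-sorted run and resets it.
func (s *store) flush() {
	run := make([]entry, 0, len(s.mem))
	for k, v := range s.mem {
		run = append(run, entry{k, v})
	}
	sort.Slice(run, func(i, j int) bool { return run[i].key < run[j].key })
	s.tables = append(s.tables, run)
	s.mem = map[string]string{}
	s.memBytes = 0
}

// Get checks the memtable first, then the runs from newest to oldest.
func (s *store) Get(key string) (string, bool) {
	if v, ok := s.mem[key]; ok {
		return v, true
	}
	for i := len(s.tables) - 1; i >= 0; i-- {
		run := s.tables[i]
		j := sort.Search(len(run), func(j int) bool { return run[j].key >= key })
		if j < len(run) && run[j].key == key {
			return run[j].value, true
		}
	}
	return "", false
}

func main() {
	s := newStore(10)
	s.Put("alpha", "1")
	s.Put("beta", "2") // crosses the 10-byte limit and triggers a flush
	s.Put("alpha", "3") // newer value lives in the fresh memtable
	v, ok := s.Get("alpha")
	fmt.Println(v, ok, len(s.tables))
}
```

Reading newest-first is what lets flushed runs stay immutable: an update never rewrites an old run, it simply shadows it until compaction merges the runs.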

Search Engine

The search implementation supports multiple distance metrics:

  • Cosine similarity for normalized vectors
  • Dot product for raw similarity
  • L2 distance for Euclidean space

Embedding Layer

OpenAI: Cloud-based embeddings using text-embedding-3-small
NVIDIA: Cloud-based embeddings using nv-embedqa-mistral-7b-v2
ColBERT: Local inference on colBERT-ir/v2 using ONNX Runtime and libtokenizers

Development

Testing

make test        # Run tests
make coverage    # Generate coverage report

Code Quality

make lint        # Run golangci-lint
make fmt         # Format code

Directory Structure

Directory structure:
└── ahhcash-ghastly/
    ├── README.md
    ├── Dockerfile
    ├── Makefile
    ├── go.mod
    ├── go.sum
    ├── .golangci.yml
    ├── clients/
    │   └── python/
    │       ├── __init__.py
    │       ├── client.py
    │       ├── setup.py
    │       └── test_client.py
    ├── cmd/
    │   └── main.go
    ├── db/
    │   ├── db.go
    │   └── db_test.go
    ├── embed/
    │   ├── embedder.go
    │   ├── local/
    │   │   └── colbert/
    │   │       ├── config.go
    │   │       ├── darwin.go
    │   │       ├── embed.go
    │   │       ├── linux.go
    │   │       ├── platform_specific.go
    │   │       └── windows.go
    │   ├── nvidia/
    │   │   ├── embed.go
    │   │   └── types.go
    │   └── openai/
    │       ├── embed.go
    │       └── types.go
    ├── grpc/
    │   ├── gen/
    │   │   └── grpc/
    │   │       └── proto/
    │   │           ├── ghastly.pb.go
    │   │           └── ghastly_grpc.pb.go
    │   ├── proto/
    │   │   └── ghastly.proto
    │   └── server/
    │       └── server.go
    ├── http/
    │   └── server/
    │       └── server.go
    ├── index/
    │   ├── connections.go
    │   ├── hnsw.go
    │   └── search.go
    ├── libs/
    │   └── static/
    │       └── libtokenizers/
    │           └── .gitkeep
    ├── mocks/
    │   └── embedder.go
    ├── search/
    │   ├── cosine.go
    │   ├── dot.go
    │   ├── l2.go
    │   └── metrics_test.go
    ├── storage/
    │   ├── memtable.go
    │   ├── memtable_test.go
    │   ├── skiplist.go
    │   ├── skiplist_test.go
    │   ├── sstable.go
    │   ├── store.go
    │   └── store_test.go
    └── .github/
        └── workflows/
            └── build_and_deploy.yml

Contributing 🙏

I would absolutely love any feedback / contributions! Please open a PR, and I'll gladly take a look :)

About

a key value based vector db!
