A powerful, modular, and extensible Retrieval-Augmented Generation (RAG) framework built with Python, supporting multiple LLM providers, embedders, and document types.
**Multiple Document Types**: Support for various document formats including:
- Text files (with extensive format support)
- PDF documents
- Images (with OCR capabilities)
- Source code files
**Flexible Architecture**:
- Modular component design
- Pluggable LLM providers (OpenAI, Ollama)
- Extensible embedding providers
- Vector store integration (LanceDB)
**Advanced RAG Capabilities**:
- Intelligent document chunking
- Context-aware searching
- Query expansion and reranking
- Metadata enrichment
- Source attribution
**Specialized RAG Implementations**:
- `FolderRAG`: Process and analyze entire directory structures
- `CodeAnalysisRAG`: Specialized for source code understanding
- `SimpleRAG`: Basic RAG implementation for text data
- Support for custom RAG implementations
```bash
# Install via pip
pip install ceylon-rag

# Or install from source
git clone https://github.com/ceylonai/ceylon-rag.git
cd ceylon-rag
pip install -e .
```
Here's a simple example using the framework:
```python
import asyncio
import os

from dotenv import load_dotenv

from ceylon_rag import SimpleRAG


async def main():
    # Load environment variables
    load_dotenv()

    # Configure the RAG system
    config = {
        "llm": {
            "type": "openai",
            "model_name": "gpt-4",
            "api_key": os.getenv("OPENAI_API_KEY")
        },
        "embedder": {
            "type": "openai",
            "model_name": "text-embedding-3-small",
            "api_key": os.getenv("OPENAI_API_KEY")
        },
        "vector_store": {
            "type": "lancedb",
            "db_path": "./data/lancedb",
            "table_name": "documents"
        }
    }

    # Initialize RAG
    rag = SimpleRAG(config)
    await rag.initialize()

    try:
        # Process your documents
        documents = await rag.process_documents("path/to/documents")

        # Query the system
        result = await rag.query("What are the main topics in these documents?")
        print(result.response)
    finally:
        await rag.close()


if __name__ == "__main__":
    asyncio.run(main())
```
**Document Loaders**
- `TextLoader`: Handles text-based files
- `PDFLoader`: Processes PDF documents
- `ImageLoader`: Handles images with OCR
- Extensible base class for custom loaders (see the sketch below)
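If you need a format the built-in loaders don't cover, the base class can be extended. The sketch below is hypothetical: the name `BaseDocumentLoader`, its import path, and the `load` signature are assumptions rather than the framework's confirmed API, so check the source for the real interface before subclassing.

```python
from pathlib import Path

# Assumed import path and base-class name -- verify against the actual source.
from ceylon_rag import BaseDocumentLoader


class MarkdownLoader(BaseDocumentLoader):
    """Illustrative custom loader for Markdown files (hypothetical API)."""

    # Assumption: loaders advertise the file extensions they handle.
    extensions = {".md", ".markdown"}

    def load(self, path: str) -> dict:
        # Read the file and return content plus metadata, so downstream
        # steps (chunking, metadata enrichment) have something to work with.
        text = Path(path).read_text(encoding="utf-8")
        return {"content": text, "metadata": {"source": path, "format": "markdown"}}
```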
**Embedders**
- OpenAI embeddings support
- Ollama embeddings support
- Modular design for adding new providers
**LLM Providers**
- OpenAI integration
- Ollama integration (see the config sketch below)
- Async interface for all providers
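Since both the embedder and the LLM provider are pluggable, switching from OpenAI to a local Ollama stack should only require configuration changes. The sketch below mirrors the OpenAI config from the Quick Start; the exact keys and the model names (`llama3`, `nomic-embed-text`) are assumptions about what this framework expects, not confirmed defaults.

```python
# Assumed: provider swaps are config-only, mirroring the "openai" examples above.
config = {
    "llm": {
        "type": "ollama",
        "model_name": "llama3"            # any model pulled into your local Ollama
    },
    "embedder": {
        "type": "ollama",
        "model_name": "nomic-embed-text"  # a common Ollama embedding model
    },
    "vector_store": {
        "type": "lancedb",
        "db_path": "./data/lancedb",
        "table_name": "documents"
    }
}
```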
**Vector Store**
- LanceDB integration
- Efficient vector similarity search
- Metadata storage and retrieval
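For context on what the integration handles under the hood, here is a standalone LanceDB sketch (plain `lancedb`, not ceylon-rag code) showing vector similarity search with metadata columns stored and returned alongside each hit:

```python
import lancedb

# Connect to (or create) a local LanceDB database.
db = lancedb.connect("./data/lancedb")

# Each row stores a vector plus arbitrary metadata columns.
table = db.create_table(
    "demo",
    data=[
        {"vector": [0.1, 0.2], "text": "hello", "source": "a.txt"},
        {"vector": [0.9, 0.8], "text": "world", "source": "b.txt"},
    ],
)

# Nearest-neighbour search; metadata comes back with each result.
hits = table.search([0.1, 0.25]).limit(1).to_list()
print(hits[0]["text"], hits[0]["source"])  # -> hello a.txt
```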
The framework provides sophisticated document processing capabilities:
```python
import asyncio
import os

# Assumed import path, matching SimpleRAG in the Quick Start above.
from ceylon_rag import CodeAnalysisRAG


# Example: Processing a code repository
async def analyze_codebase():
    config = {
        "llm": {
            "type": "openai",
            "model_name": "gpt-4",
            "api_key": os.getenv("OPENAI_API_KEY")
        },
        "embedder": {
            "type": "openai",
            "model_name": "text-embedding-3-small",
            "api_key": os.getenv("OPENAI_API_KEY")
        },
        "vector_store": {
            "type": "lancedb",
            "db_path": "./data/lancedb",
            "table_name": "code_documents"
        },
        "chunk_size": 1000,
        "chunk_overlap": 200
    }

    rag = CodeAnalysisRAG(config)
    await rag.initialize()

    try:
        documents = await rag.process_codebase("./src")
        await rag.index_code(documents)

        result = await rag.analyze_code(
            "Explain the main architecture of this codebase"
        )
        print(result.response)
    finally:
        await rag.close()


if __name__ == "__main__":
    asyncio.run(analyze_codebase())
```
Configure file exclusions using patterns:
```python
config = {
    # ... other config options ...
    "excluded_dirs": [
        "venv",
        "node_modules",
        ".git",
        "__pycache__"
    ],
    "excluded_files": [
        ".env",
        "package-lock.json"
    ],
    "excluded_extensions": [
        ".pyc",
        ".pyo",
        ".pyd"
    ],
    "ignore_file": ".ragignore"  # Similar to .gitignore
}
```
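Since `ignore_file` is described as gitignore-like, a `.ragignore` file might look like the following; the exact pattern syntax the framework supports is an assumption here:

```text
# Hypothetical .ragignore -- assumes gitignore-style patterns
*.log
dist/
data/raw/
```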
Customize document chunking:
```python
config = {
    # ... other config options ...
    "chunk_size": 1000,    # Characters per chunk
    "chunk_overlap": 200,  # Overlap between chunks
}
```
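To make the two knobs concrete, here is a minimal sliding-window chunker. Treat it as a mental model only: the framework's chunking is described as "intelligent", so its actual algorithm is presumably smarter than this fixed-size split.

```python
def chunk(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    # Each chunk starts chunk_size - chunk_overlap characters after the
    # previous one, so consecutive chunks share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]


pieces = chunk("x" * 2500)
print([len(p) for p in pieces])  # [1000, 1000, 900] -- overlapping windows
```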
Contributions are welcome! Please feel free to submit pull requests. For major changes, please open an issue first to discuss what you would like to change.
- OpenAI for GPT and embedding models
- Ollama for local LLM support
- LanceDB team for vector storage
- All contributors and users of the framework
For detailed API documentation, please visit our API Documentation page.