OpenContracts Pipeline Architecture

The OpenContracts pipeline system is a modular and extensible architecture for processing documents through various stages: parsing, thumbnail generation, and embedding. This document provides an overview of the system architecture and guides you through creating new pipeline components.

Architecture Overview

The pipeline system consists of three main component types:

Parsers: Extract text and structure from documents
Thumbnailers: Generate visual previews of documents
Embedders: Create vector embeddings for semantic search

Each component type has a base abstract class that defines the interface and common functionality:

graph TD
    A[Document Upload] --> B[Parser]
    B --> C[Thumbnailer]
    B --> D[Embedder]

    subgraph "Pipeline Components"
        B --> B1[DoclingParser]
        B --> B2[NlmIngestParser]
        B --> B3[TxtParser]

        C --> C1[PdfThumbnailer]
        C --> C2[TextThumbnailer]

        D --> D1[MicroserviceEmbedder]
    end

    C1 --> E[Document Preview]
    C2 --> E
    D1 --> F[Vector Database]

Loading

Component Registration

Components are registered in settings/base.py through configuration dictionaries:

PREFERRED_PARSERS = {
    "application/pdf": "opencontractserver.pipeline.parsers.docling_parser.DoclingParser",
    "text/plain": "opencontractserver.pipeline.parsers.oc_text_parser.TxtParser",
    # ... other mime types
}

THUMBNAIL_TASKS = {
    "application/pdf": "opencontractserver.tasks.doc_tasks.extract_pdf_thumbnail",
    "text/plain": "opencontractserver.tasks.doc_tasks.extract_txt_thumbnail",
    # ... other mime types
}

PREFERRED_EMBEDDERS = {
    "application/pdf": "opencontractserver.pipeline.embedders.sent_transformer_microservice.MicroserviceEmbedder",
    # ... other mime types
}

Component Types

Parsers

Parsers inherit from BaseParser and implement the parse_document method:

class BaseParser(ABC):
    title: str = ""
    description: str = ""
    author: str = ""
    dependencies: list[str] = []
    supported_file_types: list[FileTypeEnum] = []

    @abstractmethod
    def parse_document(
        self, user_id: int, doc_id: int, **kwargs
    ) -> Optional[OpenContractDocExport]:
        pass

Current implementations:

DoclingParser: Advanced PDF parser using machine learning
NlmIngestParser: Alternative PDF parser using NLM ingestor
TxtParser: Simple text file parser

Thumbnailers

Thumbnailers inherit from BaseThumbnailGenerator and implement the _generate_thumbnail method:

class BaseThumbnailGenerator(ABC):
    title: str = ""
    description: str = ""
    author: str = ""
    dependencies: list[str] = []
    supported_file_types: list[FileTypeEnum] = []

    @abstractmethod
    def _generate_thumbnail(
        self,
        txt_content: Optional[str],
        pdf_bytes: Optional[bytes],
        height: int = 300,
        width: int = 300,
    ) -> Optional[tuple[bytes, str]]:
        pass

Current implementations:

PdfThumbnailer: Generates thumbnails from PDF first pages
TextThumbnailer: Creates text-based preview images

Embedders

Embedders inherit from BaseEmbedder and implement the embed_text method:

class BaseEmbedder(ABC):
    title: str = ""
    description: str = ""
    author: str = ""
    dependencies: list[str] = []
    vector_size: int = 0
    supported_file_types: list[FileTypeEnum] = []

    @abstractmethod
    def embed_text(self, text: str) -> Optional[list[float]]:
        pass

Current implementations:

MicroserviceEmbedder: Generates embeddings using a remote service

Creating New Components

To create a new pipeline component:

Choose the appropriate base class (BaseParser, BaseThumbnailGenerator, or BaseEmbedder)
Create a new class inheriting from the base class
Implement required abstract methods
Set component metadata (title, description, author, etc.)
Register the component in the appropriate settings dictionary

Example of a new parser:

from opencontractserver.pipeline.base.parser import BaseParser
from opencontractserver.pipeline.base.file_types import FileTypeEnum

class MyCustomParser(BaseParser):
    title = "My Custom Parser"
    description = "Parses documents in a custom way"
    author = "Your Name"
    dependencies = ["custom-lib>=1.0.0"]
    supported_file_types = [FileTypeEnum.PDF]

    def parse_document(
        self, user_id: int, doc_id: int, **kwargs
    ) -> Optional[OpenContractDocExport]:
        # Implementation here
        pass

Then register it in settings:

PREFERRED_PARSERS = {
    "application/pdf": "path.to.your.MyCustomParser",
    # ... other parsers
}

Best Practices

Error Handling: Always handle exceptions gracefully and return None on failure
Dependencies: List all required dependencies in the component's dependencies list
Documentation: Provide clear docstrings and type hints
Testing: Create unit tests for your component in the tests directory
Metadata: Fill out all metadata fields (title, description, author)

Advanced Topics

Parallel Processing

The pipeline system supports parallel processing through Celery tasks. Each component can be executed asynchronously:

from opencontractserver.tasks.doc_tasks import process_document

# Async document processing
process_document.delay(user_id, doc_id)

Custom File Types

To add support for new file types:

Add the MIME type to ALLOWED_DOCUMENT_MIMETYPES in settings
Update FileTypeEnum in base/file_types.py
Create appropriate parser/thumbnailer/embedder implementations
Register the implementations in settings

Error Handling

Components should implement robust error handling:

def parse_document(self, user_id: int, doc_id: int, **kwargs):
    try:
        # Implementation
        return result
    except Exception as e:
        logger.error(f"Error parsing document {doc_id}: {e}")
        return None

Contributing

When contributing new pipeline components:

Follow the project's coding style
Add comprehensive tests
Update this documentation
Submit a pull request with a clear description

For questions or support, please open an issue on the GitHub repository.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pipeline_overview.md

pipeline_overview.md

OpenContracts Pipeline Architecture

Architecture Overview

Component Registration

Component Types

Parsers

Thumbnailers

Embedders

Creating New Components

Best Practices

Advanced Topics

Parallel Processing

Custom File Types

Error Handling

Contributing

Files

pipeline_overview.md

Latest commit

History

pipeline_overview.md

File metadata and controls

OpenContracts Pipeline Architecture

Architecture Overview

Component Registration

Component Types

Parsers

Thumbnailers

Embedders

Creating New Components

Best Practices

Advanced Topics

Parallel Processing

Custom File Types

Error Handling

Contributing