A systematic approach to discovering genomic variants that distinguish disease cases from controls, with the potential to identify causal mutations. Built for reproducible analysis of large-scale genomic data through model stacking and bootstrap validation.
This pipeline enables:
- Discovery of discriminative genomic variants between cases and controls
- Prioritization of potentially causal mutations through feature importance
- Statistical validation through nested bootstrap resampling
- Tracked experiments with detailed metrics and artifact logging
- Scalable processing of large genomic datasets
- Interactive development and debugging capabilities
graph TD
%% Data Sources & Initial Processing
VCF[VCF Data] --> GP[GenotypeProcessor]
GTF[GTF Data] --> GP
GMT[GMT Data] --> |Optional| GP
GP --> |Encoded Genotypes| PQ[Parquet Conversion]
%% Sample Processing Setup
ST1[Sample Tables] --> SP[SampleProcessor]
%% Outer Bootstrap Loop
subgraph "Outer Bootstrap Iterations"
PQ --> |Feature Matrix| OBS[Outer Bootstrap Split]
SP --> |Stratified Splits| OBS
OBS --> |Feature Selection Set| FS[Feature Selection]
OBS --> |Validation Set| IBS[Inner Bootstrap Split]
FS --> |Selected Features| IBS
%% Inner Bootstrap Loop
subgraph "Inner Bootstrap Iterations"
IBS --> |Train/Test Sets| MS[Model Stack]
subgraph "Model Stack"
MS --> M1[Logistic Regression]
MS --> M2[Naive Bayes]
MS --> M3[SVM]
MS --> M4[Gradient Boosting]
MS --> M5[KNN]
MS --> M6[Neural Net]
%% Each model's evaluation flow
M1 & M2 & M3 & M4 & M5 & M6 --> |BayesSearch| HPO[Hyperparameter Optimization]
HPO --> |Cross-validation| PE[Performance Evaluation]
end
PE --> |Metrics & Artifacts| MLF[MLflow Tracking]
PE --> Sum1[Inner Bootstrap Summary]
end
Sum1 --> Sum2[Outer Bootstrap Summary]
end
%% Final Results
MLF --> |Experiment Tracking| Res[Results & Artifacts]
Sum2 --> |Final Analysis| Res
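The nested resampling structure in the diagram can be summarized as the minimal Python sketch below. The feature selector and classifier here are simple stand-ins (SelectKBest and logistic regression), not the project's feature-selection method or model stack, and the split sizes are illustrative.

```python
# Minimal sketch of the nested bootstrap flow shown in the diagram above.
# SelectKBest / LogisticRegression are stand-ins, not the project's model stack.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

def nested_bootstrap(X, y, n_outer=5, n_inner=5, k_features=20, seed=0):
    rng = np.random.RandomState(seed)
    outer_scores = []
    for _ in range(n_outer):
        # Outer split: one stratified half for feature selection, one for validation.
        fs_idx, val_idx = train_test_split(np.arange(len(y)), test_size=0.5,
                                           stratify=y, random_state=rng)
        selector = SelectKBest(f_classif, k=k_features).fit(X[fs_idx], y[fs_idx])
        cols = selector.get_support(indices=True)

        inner_scores = []
        for _ in range(n_inner):
            # Inner bootstrap: train on a stratified resample of the validation set,
            # score on the out-of-bag samples.
            boot = resample(val_idx, stratify=y[val_idx], random_state=rng)
            oob = np.setdiff1d(val_idx, boot)
            model = LogisticRegression(max_iter=1000).fit(X[np.ix_(boot, cols)], y[boot])
            inner_scores.append(model.score(X[np.ix_(oob, cols)], y[oob]))
        outer_scores.append(np.mean(inner_scores))
    return outer_scores

X, y = make_classification(n_samples=300, n_features=200, random_state=0)
print(nested_bootstrap(X, y))
```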
- Data Processing: Specialized processors for genotype and sample data
- Model Stack: Ensemble of models optimized for genomic feature selection (see the sketch after this list)
- Validation: Nested bootstrap sampling for robust performance estimation
- Results: Comprehensive metrics, feature importance, and SHAP explanations
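A rough sketch of how the model stack and Bayesian hyperparameter optimization fit together is shown below. It assumes scikit-optimize's BayesSearchCV; the estimator choices (e.g. MLPClassifier for the neural net) and search spaces are illustrative, not the project's tuned configuration.

```python
# Sketch of the six-model stack with Bayesian hyperparameter optimization.
# Estimators and search spaces are illustrative placeholders.
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

model_stack = {
    "logistic_regression": (LogisticRegression(max_iter=2000),
                            {"C": Real(1e-3, 1e2, prior="log-uniform")}),
    "naive_bayes": (GaussianNB(),
                    {"var_smoothing": Real(1e-12, 1e-6, prior="log-uniform")}),
    "svm": (SVC(probability=True),
            {"C": Real(1e-3, 1e2, prior="log-uniform"),
             "gamma": Real(1e-4, 1e0, prior="log-uniform")}),
    "gradient_boosting": (XGBClassifier(eval_metric="logloss"),
                          {"max_depth": Integer(2, 8),
                           "n_estimators": Integer(50, 500)}),
    "knn": (KNeighborsClassifier(),
            {"n_neighbors": Integer(3, 25)}),
    "neural_net": (MLPClassifier(max_iter=1000),
                   {"alpha": Real(1e-5, 1e-1, prior="log-uniform")}),
}

def fit_stack(X_train, y_train, cv=5, n_iter=25):
    """Tune each model with BayesSearchCV and return the fitted searches."""
    fitted = {}
    for name, (estimator, space) in model_stack.items():
        search = BayesSearchCV(estimator, space, n_iter=n_iter, cv=cv,
                               scoring="roc_auc", n_jobs=-1)
        fitted[name] = search.fit(X_train, y_train)
    return fitted
```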
- Python >=3.9
- MLflow: Experiment tracking and artifact management
- Hail: Scalable genomic data processing (see the sketch below)
- scikit-learn & XGBoost: Core modeling
- SHAP: Feature importance analysis
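For the genotype-processing stage (VCF → encoded genotypes → Parquet in the diagram), a minimal Hail sketch might look like the following. The paths are placeholders, and the alt-allele-count encoding is an assumption about what GenotypeProcessor produces, not a description of its actual implementation.

```python
# Sketch: VCF -> encoded genotypes -> Parquet with Hail. Paths are placeholders.
import hail as hl

hl.init()

mt = hl.import_vcf("data/cohort.vcf.bgz", reference_genome="GRCh38")
mt = mt.annotate_entries(n_alt=mt.GT.n_alt_alleles())  # encode genotypes as 0/1/2 alt-allele counts

# Flatten (variant, sample) entries to a table and write Parquet via Spark.
entries = mt.select_entries("n_alt").entries().select("n_alt")
entries.to_spark().write.mode("overwrite").parquet("data/genotypes.parquet")
```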
- Set up environment:
# Clone repository
git clone <repository-url>
cd <repository-name>
# Create and activate environment
python -m venv venv
source venv/bin/activate # or `venv\Scripts\activate` on Windows
# Install dependencies
pip install -r requirements.txt
- Configure MLflow:
# Set longer timeout for large artifact handling
export MLFLOW_HTTP_REQUEST_TIMEOUT=1200
# Start MLflow server (default: http://localhost:5000)
mlflow server
- Run analysis:
# via CLI
python src/main.py
├── src/
│   ├── data/       # Data processing modules
│   ├── models/     # Model definitions
│   ├── eval/       # Evaluation utilities
│   └── analysis/   # Analysis notebooks
├── tests/          # Test suite
└── mlruns/         # MLflow experiment tracking
Results are tracked through MLflow, including:
- Model performance metrics
- Feature importance scores
- SHAP values
- Confusion matrices
- ROC and PR curves
- Sample-level predictions
Access the MLflow UI at http://localhost:5000 after starting the server.
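Runs can also be queried programmatically. A minimal sketch assuming the local server above, a hypothetical experiment name, and a placeholder artifact path:

```python
# Sketch: query tracked runs from the local MLflow server.
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")

# List tracked runs for a (hypothetical) experiment name.
runs = mlflow.search_runs(experiment_names=["variant-discovery"])
print(runs[["run_id", "status", "start_time"]].head())

# Download logged artifacts from the most recent run; "shap" is a placeholder artifact path.
if not runs.empty:
    local_dir = mlflow.artifacts.download_artifacts(
        run_id=runs.iloc[0]["run_id"], artifact_path="shap"
    )
    print("Artifacts downloaded to", local_dir)
```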
- Fork repository
- Create feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push branch (`git push origin feature/amazing-feature`)
- Open Pull Request
MIT License - see LICENSE file for details
Built with support from Project MinE and Answer ALS datasets.