
Biotrainer


Overview

Biotrainer is an open-source framework that simplifies machine learning model development for protein analysis. It provides:

  • Easy-to-use training and inference pipelines for protein feature prediction
  • Standardized data formats for various prediction tasks
  • Built-in support for protein language models and embeddings
  • Flexible configuration through simple YAML files

Quick Start

1. Installation

# Install using poetry (recommended)
poetry install
# Adding jupyter notebook (if needed):
poetry add jupyter

# For Windows users with CUDA support:
# Visit https://pytorch.org/get-started/locally/ and follow GPU-specific installation, e.g.:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
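
# Optional sanity check after a CUDA-specific install (torch is installed
# as a biotrainer dependency); this should print True on a working GPU setup:
poetry run python -c "import torch; print(torch.cuda.is_available())"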

2. Basic Usage

# Training
poetry run biotrainer config.yml

# Inference
python3
>>> from biotrainer.inference import Inferencer
>>> inferencer, out_file = Inferencer.create_from_out_file('output/out.yml')
>>> predictions = inferencer.from_embeddings(your_embeddings)
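
The same workflow as a small script might look roughly like the sketch below. The exact container and shape expected by from_embeddings are assumptions here (per-residue numpy arrays keyed by sequence id, ProtT5-sized 1024-dimensional vectors); adjust them to the protocol and embedder used for training.

# Minimal sketch of the inference workflow shown above.
import numpy as np
from biotrainer.inference import Inferencer

# Load the trained model from the training run's output file
inferencer, out_file = Inferencer.create_from_out_file('output/out.yml')

# Hypothetical embeddings: one per-residue array per sequence id
your_embeddings = {
    "Seq1": np.random.rand(50, 1024),  # 50 residues x 1024 dimensions
    "Seq2": np.random.rand(73, 1024),
}

predictions = inferencer.from_embeddings(your_embeddings)
print(predictions)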

Features

Supported Prediction Tasks

  • Residue-level classification (residue_to_class): predicts a class for every residue in a sequence (see the dataset sketch after this list)
  • Residues-level classification (residues_to_class): predicts a class for a whole sequence, like sequence_to_class but computed from per-residue embeddings
  • Sequence-level classification (sequence_to_class): predicts a class for a whole sequence
  • Residues-level regression (residues_to_value): predicts a value for a whole sequence, like sequence_to_value but computed from per-residue embeddings
  • Sequence-level regression (sequence_to_value): predicts a value for a whole sequence
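
For illustration, a residue_to_class dataset consists of a sequence FASTA file and a labels FASTA file with matching ids and one class label per residue. The header attributes shown below (SET, VALIDATION) are an assumption based on common biotrainer examples; consult the documentation for the authoritative format.

sequences.fasta:
>Seq1
SEQWENCE

labels.fasta (one class label per residue, same length as the sequence):
>Seq1 SET=train VALIDATION=False
DVCDVVDD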

Built-in Capabilities

  • Multiple embedding methods (ProtT5, ESM, etc.)
  • Various neural network architectures
  • Cross-validation and model evaluation
  • Performance metrics and visualization
  • Sanity checks and automatic calculation of baselines (such as random, mean...)
  • Docker support for reproducible environments

Documentation

Tutorials

Detailed Guides

Example Configuration

protocol: residue_to_class
sequence_file: sequences.fasta
labels_file: labels.fasta
model_choice: CNN
optimizer_choice: adam
learning_rate: 1e-3
loss_choice: cross_entropy_loss
use_class_weights: True
num_epochs: 200
batch_size: 128
embedder_name: Rostlab/prot_t5_xl_uniref50
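
Saved as config.yml, this file is the argument passed to the training command from the Quick Start; the resulting output directory contains the out.yml consumed by the Inferencer:

poetry run biotrainer config.yml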

Docker Support

# Run using pre-built image
docker run --gpus all --rm \
    -v "$(pwd)/examples/docker":/mnt \
    -u $(id -u ${USER}):$(id -g ${USER}) \
    ghcr.io/sacdallago/biotrainer:latest /mnt/config.yml

More information on running Docker with GPUs: NVIDIA Container Toolkit

Getting Help

Citation

@inproceedings{sanchez2022standards,
    title={Standards, tooling and benchmarks to probe representation learning on proteins},
    author={Joaquin Gomez Sanchez and Sebastian Franz and Michael Heinzinger and Burkhard Rost and Christian Dallago},
    booktitle={NeurIPS 2022 Workshop on Learning Meaningful Representations of Life},
    year={2022},
    url={https://openreview.net/forum?id=adODyN-eeJ8}
}

About

Developing fork of sacdallago/biotrainer
