Bioinformatics pipeline for recovery and analysis of metagenome-assembled genomes

Genome-centric long-read metagenomics workflow for automated recovery and analysis of prokaryotic genomes from Nanopore or PacBio HiFi sequencing data. The mmlong2 workflow is a continuation of mmlong.

Workflow description

Core features

  • Snakemake workflow running dependencies from a Singularity container for enhanced reproducibility
  • Bioinformatics tool and parameter optimizations for processing high-complexity metagenomic samples
  • Circular prokaryotic genome extraction as separate genome bins
  • Eukaryotic contig removal for reduced prokaryotic genome contamination
  • Differential coverage support for improved prokaryotic genome recovery
  • Iterative ensemble binning strategy for improved prokaryotic genome recovery
  • Recovered genome quality classification according to MIMAG guidelines
  • Supplemental genome quality assessment, including microdiversity approximation and chimerism checks
  • Automated taxonomic classification at genome, contig and 16S rRNA levels
  • Generation of analysis-ready dataframes at genome and contig levels

Schematic overview

[Schematic diagram of the mmlong2 workflow]

Installation

Bioconda

The recommended way of installing mmlong2 is by setting up a Conda environment through Bioconda:

mamba install -c bioconda mmlong2
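
The command above installs mmlong2 into the currently active environment. A minimal sketch for installing it into a dedicated environment instead (the environment name is an arbitrary choice, not a project convention):

mamba create -n mmlong2-env -c conda-forge -c bioconda mmlong2 -y
mamba activate mmlong2-env
mmlong2 -h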

From source (Conda)

A local Conda environment with the latest workflow code can also be created with the following commands:

mamba create --prefix mmlong2 -c conda-forge -c bioconda snakemake=8.2.3 singularity=3.8.6 zenodo_get pv pigz tar yq ncbi-amrfinderplus -y
mamba activate ./mmlong2 || source activate ./mmlong2
git clone https://github.com/Serka-M/mmlong2 mmlong2/repo
cp -r mmlong2/repo/src/* mmlong2/bin
chmod +x mmlong2/bin/mmlong2
mmlong2 -h 

Databases and bioinformatics software

Bioinformatics tools and other software dependencies will be automatically installed when running the workflow for the first time. By default, a pre-built Singularity container will be downloaded and set up, although pre-defined Conda environments can also be used by running the workflow with the --conda_envs_only setting.
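
For example, to run the workflow with Conda environments instead of the default container (a sketch; the read file and output directory names are placeholders):

mmlong2 -np reads.fastq.gz -o mmlong2_conda_run -p 30 -env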

Reference databases are required for prokaryotic genome taxonomy and annotation results; missing databases can be installed automatically by running the following command:

mmlong2 --install_databases

If some of the databases are already installed, the workflow can re-use them without re-downloading (e.g. through the --database_gtdb option). Alternatively, a guide for manual database installation is also provided.
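
For example, to point the workflow at already-installed databases (a sketch; the read file and database paths are placeholders):

mmlong2 -np reads.fastq.gz -o results -p 30 -gtdb /path/to/gtdb -kj /path/to/kaiju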

Running mmlong2

Usage examples

For trying out the mmlong2 workflow, small test datasets can be downloaded from Zenodo:

zenodo_get -r 12168493

Once downloaded, to test the workflow in Nanopore mode up until genome binning completes (ETA 2 hours, 110 GB peak RAM):

mmlong2 -np mmlong2_np.fastq.gz -o mmlong2_testrun_np -p 60 -run binning

To test the workflow in PacBio HiFi mode, using metaMDBG as the assembler and performing genome recovery and analysis (ETA 4.5 hours, 170 GB peak RAM):

mmlong2 -pb mmlong2_pb.fastq.gz -o mmlong2_testrun_pb -p 60 -dbg
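
The jobs planned by either command can also be previewed without running them by adding the -n (dry-run) flag, e.g.:

mmlong2 -np mmlong2_np.fastq.gz -o mmlong2_testrun_np -p 60 -run binning -n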

Full usage

MAIN INPUTS:
-np     --nanopore_reads        Path to Nanopore reads
-pb     --pacbio_reads          Path to PacBio HiFi reads
-o      --output_dir            Output directory name (default: mmlong2)
-p      --processes             Number of processes/multi-threading (default: 3)

OPTIONAL SETTINGS:
-db     --install_databases     Install missing databases used by the workflow
-dbd    --database_dir          Output directory for database installation (default: current working directory)
-cov    --coverage              CSV dataframe for differential coverage binning (e.g. NP/PB/IL,/path/to/reads.fastq)
-run    --run_until             Run pipeline until a specified stage completes (e.g. assembly, polishing, filtering, singletons, coverage, binning, taxonomy, annotation, extraqc, stats)
-tmp    --temporary_dir         Directory for temporary files (default: current working directory)
-dbg    --use_metamdbg          Use metaMDBG for assembly of PacBio reads (default: use metaFlye)
-med    --medaka_model          Medaka polishing model (default: r1041_e82_400bps_sup_v5.0.0)
-mo     --medaka_off            Do not run Medaka polishing with Nanopore assemblies (default: use Medaka)
-vmb    --use_vamb              Use VAMB for binning (default: use GraphMB)
-sem    --semibin_model         Binning model for SemiBin (default: global)
-mlc    --min_len_contig        Minimum assembly contig length (default: 3000)
-mlb    --min_len_bin           Minimum genomic bin size (default: 250000)
-rna    --database_rrna         16S rRNA database to use
-gunc   --database_gunc         Gunc database to use
-bkt    --database_bakta        Bakta database to use
-kj     --database_kaiju        Kaiju database to use
-gtdb   --database_gtdb         GTDB-tk database to use
-h      --help                  Print help information
-v      --version               Print workflow version number

ADVANCED SETTINGS:
-fmo    --flye_min_ovlp         Minimum overlap between reads used by Flye assembler (default: auto)
-fmc    --flye_min_cov          Minimum initial contig coverage used by Flye assembler (default: 3)
-env    --conda_envs_only       Use conda environments instead of container (default: use container)
-n      --dryrun                Print summary of jobs for the Snakemake workflow
-t      --touch                 Touch Snakemake output files
-r1     --rule1                 Run specified Snakemake rule for the MAG production part of the workflow
-r2     --rule2                 Run specified Snakemake rule for the MAG processing part of the workflow
-x1     --extra_inputs1         Extra inputs for the MAG production part of the Snakemake workflow
-x2     --extra_inputs2         Extra inputs for the MAG processing part of the Snakemake workflow
-xb     --extra_inputs_bakta    Extra inputs (comma-separated) for MAG annotation using Bakta

Using differential coverage binning

To perform genome recovery with differential coverage, prepare a two-column comma-separated dataframe, where each row indicates the additional read datatype (NP for Nanopore, PB for PacBio, IL for short reads) and the read file location.
Dataframe example:

PB,/path/to/your/reads/file1.fastq
NP,/path/to/your/reads/file2.fastq
IL,/path/to/your/reads/file3.fastq.gz

The prepared dataframe can be provided to the workflow through the -cov option.
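
For example, assuming the dataframe above was saved as coverage.csv (a placeholder name):

mmlong2 -np sample_np.fastq.gz -o results_diffcov -p 60 -cov coverage.csv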

Overview of workflow results

  • <output_name>_assembly.fasta - assembled and polished metagenome
  • <output_name>_16S.fa - 16S rRNA sequences recovered from the polished metagenome
  • <output_name>_bins.tsv - per-bin results dataframe
  • <output_name>_contigs.tsv - per-contig results dataframe
  • <output_name>_general.tsv - workflow result summary as a single-row dataframe
  • dependencies.csv - list of dependencies used and their versions
  • bins - directory for metagenome-assembled genomes
  • bakta - directory containing genome annotation results from Bakta
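
For a quick look at the per-bin results from the command line (a sketch, assuming the default output name, so that the dataframe would be located at mmlong2/mmlong2_bins.tsv):

column -t -s $'\t' mmlong2/mmlong2_bins.tsv | less -S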

Additional documentation

Future improvements

Suggestions for improving the workflow or fixing bugs are always welcome.
Please use the GitHub Issues section or e-mail mase@bio.aau.dk to provide feedback.