This pipeline is used to basecall raw data generated from a nanopore sequencing run (pod5). It performs the following steps:
- Generate a log file with information about the basecalling process.
- Basecall the reads using Dorado. The 24-barcode kit is assumed by default; the 96-barcode kit can be selected with an additional option.
- Split the reads based on detected barcodes and trim the barcodes from the reads (Dorado default behaviour).
- Convert the barcoded reads to FASTQ format.
- Generate statistics for the barcoded reads.
- Combine the statistics files for all the barcodes.
- Generate plots from the combined statistics file.
- Clean up the temporary files created along the way to avoid unnecessary storage use.
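The "combine the statistics files" step can be pictured with a minimal sketch. This is not the pipeline's own code. It assumes, purely for illustration, one TSV per barcode named after the barcode (e.g. barcode01.tsv), all sharing the same header; the function name combine_stats and the added barcode column are hypothetical.

```python
import csv
from pathlib import Path


def combine_stats(stats_files, combined_path):
    """Concatenate per-barcode TSV files into one TSV with a 'barcode' column.

    Assumption (not taken from the pipeline): every input file has the same
    header and is named after its barcode, e.g. barcode01.tsv.
    Returns the number of data rows written.
    """
    header = None
    rows = []
    for path in sorted(Path(p) for p in stats_files):
        with path.open(newline="") as handle:
            reader = csv.reader(handle, delimiter="\t")
            file_header = next(reader, None)
            if file_header is None:
                continue  # completely empty file: nothing to add
            if header is None:
                header = file_header
            elif file_header != header:
                raise ValueError(f"{path} has a different header")
            # Prefix every row with the file stem, used here as barcode name.
            rows.extend([path.stem] + row for row in reader)
    if header is None:
        raise ValueError("no non-empty statistics files given")
    with Path(combined_path).open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["barcode"] + header)
        writer.writerows(rows)
    return len(rows)
```

Keeping the barcode as an explicit column means a single table is enough for plotting per-barcode summaries afterwards.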
The accuracy of de novo assemblies using nanopore only (with the same basecalling as in this pipeline) has been tested here (sup model): https://rrwick.github.io/2023/12/18/ont-only-accuracy-update.html
Start by cloning this repo (on the cluster, if aiming for cluster execution):
git clone https://github.com/vdruelle/nanopore_basecalling.git
Once this is done, download Dorado (https://github.com/nanoporetech/dorado, linux-x64 in our case), unpack it and move the folder called dorado-0.7.0-linux-x64 to the directory softwares/.
You can then download the appropriate Dorado model into the directory softwares/dorado_models by typing:
./softwares/dorado-0.7.0-linux-x64/bin/dorado download --model dna_r10.4.1_e8.2_400bps_sup@v5.0.0 --directory softwares/dorado_models
Lastly, create the conda environment for the pipeline:
conda env create -f conda_env/nanopore_basecalling.yml
We suggest running the pipeline from a tmux session. To do so, launch a tmux session, move to the directory where you cloned the repository, and activate the conda environment with:
conda activate nanopore_basecalling
Your run folder must contain a subfolder named raw, in which all your .pod5 files are located, as well as the params.tsv file, which you need to update. This file is used for logging to keep track of the run. Start the pipeline, giving the path to your run folder as argument:
snakemake --profile cluster --config run_dir=<path to your run folder>
This command launches the pipeline by submitting the appropriate jobs for cluster execution to basecall and split the 24 barcodes. To perform basecalling for 96 barcodes instead, use:
snakemake --profile cluster --config run_dir=<path to your run folder> kit96=True
You can monitor the progress of the pipeline in the console output. For a good nanopore run (20Gbp), the pipeline should take around 1h30 to complete.
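Before launching, it can be worth checking that the run folder matches the layout described above. The following is a minimal sketch, not part of the pipeline: check_run_dir is a hypothetical helper, and it assumes that params.tsv sits at the top level of the run folder, next to raw.

```python
from pathlib import Path


def check_run_dir(run_dir):
    """Return a list of layout problems in the run folder (empty if none).

    Assumed layout: <run_dir>/raw/*.pod5 and <run_dir>/params.tsv.
    """
    run_dir = Path(run_dir)
    problems = []
    raw = run_dir / "raw"
    if not raw.is_dir():
        problems.append(f"missing subfolder: {raw}")
    elif not any(raw.glob("*.pod5")):
        problems.append(f"no .pod5 files in {raw}")
    if not (run_dir / "params.tsv").is_file():
        problems.append(f"missing file: {run_dir / 'params.tsv'}")
    return problems
```

An empty list means the folder looks ready; otherwise each entry names what is missing, which is cheaper to find out before jobs are submitted than after.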
If running locally (which requires a powerful GPU, unless using faster models), also start by activating the conda environment. Then launch the pipeline with:
snakemake --config run_dir=<path to your run folder> --cores <number of cores you want to use>
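The cluster and local invocations differ only in a few arguments, which the sketch below makes explicit. It is a hypothetical helper (snakemake_command is not part of the repository) that assembles the same argument lists as the commands shown in this README; the path /data/run42 is a placeholder.

```python
import shlex


def snakemake_command(run_dir, cluster=True, kit96=False, cores=None):
    """Build the snakemake argument list for a cluster or local run."""
    args = ["snakemake"]
    if cluster:
        args += ["--profile", "cluster"]
    # All config values follow a single --config flag.
    config = [f"run_dir={run_dir}"]
    if kit96:
        config.append("kit96=True")
    args += ["--config"] + config
    if not cluster:
        if cores is None:
            raise ValueError("a local run needs an explicit number of cores")
        args += ["--cores", str(cores)]
    return args


# Placeholder run folder; prints:
# snakemake --profile cluster --config run_dir=/data/run42 kit96=True
print(shlex.join(snakemake_command("/data/run42", kit96=True)))
```

The returned list can be passed directly to subprocess.run, which avoids quoting problems when the run folder path contains spaces.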
Once the pipeline completes, your run folder should contain additional files and directories. You should have:
- A directory final. It contains separate .fastq files for all the barcodes.
- A directory statistics. It contains a .tsv file with some statistics about the run, as well as two figures that can be used to get an idea of how the run went.
- A log file basecalling.log. This file describes when and with which parameters the basecalling happened.
The pipeline will also generate many intermediate outputs while it runs. They are automatically removed at the end.
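A quick way to confirm that a run produced the outputs listed above is sketched below. summarize_outputs is a hypothetical helper, not part of the pipeline; it assumes basecalling.log is written at the top level of the run folder, and it only checks that files exist, not that their contents are valid.

```python
from pathlib import Path


def summarize_outputs(run_dir):
    """List the expected pipeline outputs found in the run folder."""
    run_dir = Path(run_dir)
    return {
        # One FASTQ file per barcode is expected in final/.
        "fastq_files": sorted(p.name for p in (run_dir / "final").glob("*.fastq")),
        # The combined statistics table lives in statistics/.
        "stats_tables": sorted(p.name for p in (run_dir / "statistics").glob("*.tsv")),
        "has_log": (run_dir / "basecalling.log").is_file(),
    }
```

An empty fastq_files list or has_log set to False after a finished run is a hint that the pipeline stopped early and that the console output deserves a closer look.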
This pipeline can also be used to perform modified basecalling of nanopore reads. This is done with the modified base models of Dorado, which are used by the methylation.smk file.
The available models are listed on the Dorado GitHub page: https://github.com/nanoporetech/dorado.
The appropriate model needs to be downloaded and placed in softwares/dorado_models first.
To perform methylation basecalling, modify the methylation.smk file so that the argument DORADO_MODS corresponds to the modified basecalling you want to perform (by default it is 6mA). Once this is done, the pipeline can be run on the cluster by using:
snakemake -s methylation.smk --profile cluster --config run_dir=<path to your run folder>