- Dorado basecaller
- see installation
- snake_make_ont_consensus_build
- will require the conda package manager
- chopper quality filtering
- can be installed via conda
conda create -c bioconda -n chopper chopper
conda activate chopper
- can be installed via conda
- Convert raw signal from nanopore device into a sequence of nucleotides. This step is called basecalling and is very computationally demanding.
- If multiple samples have been sequenced together at the same time (multiplexing), the data will need to be demultiplexed to separate sequences from each sample by identifying a barcode adapter.
- Low quality sequences should be removed before building consensus sequences. We can select only the sequences that meet a minimum mean quality score threshold (in this case Q15 on the Phred scale).
- Recover consensus sequences for each sample.
- Raw Oxford Nanopore data (in
pod5
file format)- if your data is in
fast5
format, convert to pod5 here: https://pod5.nanoporetech.com/ - Record the directory of your raw data, I will refer to this directory as
"raw-data/pod5-dir"
.
- if your data is in
- The barcoding/adapter kit used
- In this case we are using the
SQK-RBK114-24
(Rapid Barcoding Kit 24 V14). - Run
dorado demux --help
to see the list of kits for the--kit-name
option.
- In this case we are using the
- An output directory to store the results. I will refer to this directory as
"output-dir"
.mkdir "output-dir"
Perform the basecalling on your raw pod5
files in the raw-data/pod5-dir
directory.
dorado basecaller \
--no-trim \
--recursive \
--device cpu \
sup "raw-data/pod5-dir" \
> output-dir/basecalled_reads.bam
The --no-trim
option tells dorado to not perform any adapter trimming (which will be done in the demultiplexing step).
The --recursive
option tells dorado to look through all the subfolders of the raw-data/pod5-dir
for pod5
files.
The --device cpu
option tells dorado to use the CPU device to perform the basecalling.
If you have a CUDA enabled GPU or Apple M1/M2 installed run dorado basecaller --help
for other options.
CPU basecalling will be much slower than GPU basecalling.
The sup
option tells dorado to use the 'super accurate' basecalling model which will produce the
highest accuracy sequences at high computational cost.
See which models are available here.
Make a directory to store the demultiplexed sequences.
mkdir "ouput-dir/demux"
Ensure the barcoding kit name (--kit-name
) is the same as the one used in the library preparation or else most sequences will be unclassified.
dorado demux \
--kit-name SQK-RBK114-24 \
--output-dir output-dir/demux \
--emit-fastq \
output-dir/basecalled_reads.bam
The --emit-fastq
option tells dorado to output .fastq
formatted files.
It will output .bam
files by default (without this option).
At the end of this step there should be one file per sample in the output-dir/demux
directory.
Sequences where the barcode could not be classified will be placed in output-dir/demux/unclassified.fastq
.
This command iterates through the .fastq
file for each sample and
removes sequences that have a mean quality score lower than Q15.
for filename in output-dir/demux/*.fastq; do \
cat ${filename} | chopper --quality 15 \
> output-dir/filt-Q15/$(basename $filename);\
done
snake_make_ont_consensus_build
is a snakemake pipeline that can do this.
Details on how this pipeline works can be found here: https://github.com/ritatam/nanopore-biosecurity-workshop-2022
Follow the steps on how to run the pipeline here: https://github.com/ritatam/snakemake_ont_consensus_build . They have been repeated here for completeness:
- Clone the repository
git clone https://github.com/ritatam/snakemake_ont_consensus_build.git
cd snakemake_ont_consensus_build
- Configure paths in the
config.yaml
file. These paths are relative to theconfig.yaml
file.
input_reads_dir: "../output-dir/filt-Q15"
output_dir: "../output-dir/ont_consensus"
- Install snakemake into a conda environment
conda install snakemake
- Install dependencies into a conda environment
snakemake --use-conda --conda-create-envs-only -c1
- Do a dry-run of the pipeline:
snakemake --use-conda -np
- Run the pipeline:
snakemake --use-conda -c8