Getting started
You can learn how to use some of the most important features in this 5-minute tutorial. It assumes that you have HPG Aligner installed; otherwise, please visit the Downloads section to install the binaries or to compile the source code. This tutorial uses HPG Aligner version 2.1; command lines and parameters for other HPG Aligner versions may differ slightly.
To run this worked example you also need:
- A small FASTQ paired-end dataset from Chromosome 20: test_paired_chr20.tar.gz
- Human chromosome 20 reference sequence: Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.tar.gz
Create a folder, download the compressed FASTQ genomic DNA dataset and the human chromosome 20 reference into it, and then uncompress the files:
mkdir aligner-tutorial
cd aligner-tutorial
tar zxvf test_paired_chr20.tar.gz
tar zxvf Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.tar.gz
The HPG Aligner binary is assumed to be in the same directory as the data in this tutorial, so copy it into the folder. You should now have these files:
test_1.fq
test_2.fq
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa
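Before going further, it can help to confirm that everything is in place. The following is a minimal sketch, not part of HPG Aligner: the function name check_tutorial_files is hypothetical, and it assumes the binary was copied into the folder under the name hpg-aligner, as used in the commands of this tutorial.

```shell
# Hypothetical helper: report any tutorial input missing from a directory.
# Prints one line per missing file and returns non-zero if any is absent.
check_tutorial_files() {
    dir="${1:-.}"
    missing=0
    # File names are the ones used in this tutorial; hpg-aligner is the
    # assumed name of the binary copied into the folder.
    for f in test_1.fq test_2.fq Homo_sapiens.GRCh37.70.dna.chromosome.20.fa hpg-aligner; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

From inside aligner-tutorial you could run check_tutorial_files . and continue only when it prints nothing.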
Before mapping, we need to create the index. HPG Aligner is designed to build the index quickly and uses a multi-threaded implementation to speed up the process, but index creation may still take some time and need a lot of memory, depending on the size of the genome. You should not have any problems with this chromosome. Run the command build-sa-index, specifying the FASTA genome with -g and the output directory for the index with -i:
mkdir chr20-index
./hpg-aligner build-sa-index -g Homo_sapiens.GRCh37.70.dna.chromosome.20.fa -i chr20-index
If the index generation succeeds, you'll have the following files in the folder chr20-index:
ls -l chr20-index
dna_compression.bin
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.A
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.CHROM
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.IA
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.JA
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.S
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.SA
index
params.info
params.txt
To map the FASTQ files, run the tool with the dna command, since the dataset contains genomic DNA sequences, and pass the index folder with -i. The default options should provide good performance; for example, all available cores are used to speed up the mapping. We create a folder called mapped to store the results:
mkdir mapped
./hpg-aligner dna -f test_1.fq -i chr20-index -o mapped
The execution should not take long because the files are small. If you have a multi-core machine, you can check that all cores are being used with the command htop. When it finishes, HPG Aligner prints a short report:
----------------------------------------------
Loading SA tables...
End of loading SA tables in 0.02 min. Done!!
----------------------------------------------
Starting mapping...
End of mapping in 0.09 min. Done!!
----------------------------------------------
Output file : mapped/out.sam
Num. reads : 652309
Num. mapped reads : 593322 (90.96 %)
Num. unmapped reads: 58987 (9.04 %)
Num. mappings : 593729
Num. multihit reads: 248
----------------------------------------------
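The figures in this report are easy to cross-check: mapped plus unmapped reads (593322 + 58987) equals the total of 652309, and each percentage is the corresponding count divided by that total. As a small sketch, the helper below (percent is a hypothetical name, not an HPG Aligner command) recomputes the two percentages with awk.

```shell
# Hypothetical helper: percentage of "part" in "total", rounded to two decimals.
percent() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * part / total }'
}

percent 593322 652309   # mapped reads   -> 90.96
percent 58987 652309    # unmapped reads -> 9.04
```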
The output file containing the resulting mappings is located in the folder specified by the parameter -o, in our case, the folder mapped:
ls -l mapped
out.sam
As you can see, SAM is the default output format; with the parameter --bam-format the output is written as BAM instead. Using this parameter makes the process slower.
For paired-end mapping, use the parameter -j to specify the second mate file:
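Because SAM is a tab-separated text format, the output can be inspected with standard tools. The sketch below counts mapped and unmapped records using the FLAG column (column 2), where bit 0x4 marks an unmapped record according to the SAM specification. The function name sam_counts is hypothetical, and the sketch assumes that unmapped reads are written to out.sam with that bit set. It counts alignment records rather than reads, so a read with several mappings is counted once per mapping.

```shell
# Hypothetical helper: count mapped and unmapped alignment records in a SAM file.
# Header lines start with "@"; column 2 is the FLAG, where bit 0x4 means unmapped.
sam_counts() {
    awk -F '\t' '
        /^@/ { next }                              # skip header lines
        int($2 / 4) % 2 == 1 { unmapped++; next }  # bit 0x4 set: unmapped record
        { mapped++ }                               # everything else is a mapping
        END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }
    ' "$1"
}
```

After a run, sam_counts mapped/out.sam prints both counts on one line.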
./hpg-aligner dna -f test_1.fq -j test_2.fq -i chr20-index -o mapped
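Paired-end mapping expects the two mate files to describe the same reads, so a quick sanity check before running the command above is to compare their read counts. This is a minimal sketch with hypothetical function names (fastq_reads, mates_match); it assumes plain four-line FASTQ records and does not verify that read names match.

```shell
# Hypothetical helper: number of reads in a FASTQ file,
# assuming exactly four lines per record.
fastq_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Hypothetical helper: succeeds only when both mate files
# contain the same number of reads.
mates_match() {
    [ "$(fastq_reads "$1")" -eq "$(fastq_reads "$2")" ]
}
```

For example, mates_match test_1.fq test_2.fq || echo "mate files differ in length" warns before a paired-end run on mismatched inputs.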
To see all available parameters, type:
./hpg-aligner -h