
Getting started

jtarraga edited this page Sep 25, 2014 · 36 revisions

You can learn how to use some of the most important features in this 5-minute tutorial. It assumes that HPG Aligner is already installed; if it is not, visit the Downloads section to install the binaries or to compile the source code. This tutorial uses HPG Aligner version 2.1; command lines and parameters may differ slightly in other versions.

To run this worked example you also need:

Preparing the environment

Create a folder, download the compressed FASTQ genomic DNA dataset and the human chromosome 20 FASTA file into it, then uncompress both files:

mkdir aligner-tutorial
cd aligner-tutorial
tar zxvf test_paired_chr20.tar.gz
tar zxvf Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.tar.gz

This tutorial assumes that the HPG Aligner binary is in the same directory as the data, so copy it into the folder. You should now have these files:

test_1.fq
test_2.fq
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa
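Before going further, it can help to confirm that the folder really contains what the next steps expect. The following is a minimal sketch, not part of HPG Aligner: check_inputs is a hypothetical helper name, and the binary is assumed to be called hpg-aligner, as in the commands of this tutorial.

```shell
# check_inputs is an illustrative helper, not part of HPG Aligner.
# It reports every argument that is missing or empty and returns non-zero if any is.
check_inputs() {
    missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f"
            missing=1
        fi
    done
    return "$missing"
}

# File names taken from the listing above, plus the binary copied into the folder.
if check_inputs test_1.fq test_2.fq \
        Homo_sapiens.GRCh37.70.dna.chromosome.20.fa hpg-aligner; then
    echo "all inputs present"
fi
```

Run it from inside aligner-tutorial; any file it reports should be downloaded or uncompressed again before continuing.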

Building the index

Before mapping we need to create the index. HPG Aligner uses a multi-threaded implementation to speed up index creation; even so, this step may take some time and need a lot of memory, depending on the size of the genome. A single chromosome such as this one should not cause any problem. Run the command build-sa-index, specifying the FASTA genome with -g and the output directory for the index with -i:

mkdir chr20-index
./hpg-aligner build-sa-index -g Homo_sapiens.GRCh37.70.dna.chromosome.20.fa -i chr20-index

If the index generation succeeds, you'll have the following files in the folder chr20-index:

ls -l chr20-index
dna_compression.bin
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.A
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.CHROM
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.IA
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.JA
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.S
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.SA
index
params.info
params.txt

DNA Aligning

To map the reads, run the tool with the command dna, since the dataset contains genomic DNA sequences, and pass the index folder with -i. The default options should give good performance; for example, all available cores are used to speed up the mapping. We create a folder called mapped to store the results:

mkdir mapped
./hpg-aligner dna -f test_1.fq -i chr20-index -o mapped

The run should not take long because the files are small. On a multi-core machine you can check with the command htop that all cores are being used. When it finishes, HPG Aligner prints a short report:

----------------------------------------------
Loading SA tables...
End of loading SA tables in 0.02 min. Done!!
----------------------------------------------
Starting mapping...
End of mapping in 0.09 min. Done!!
----------------------------------------------
Output file        : mapped/out.sam

Num. reads         : 652309
Num. mapped reads  : 593322 (90.96 %)
Num. unmapped reads: 58987 (9.04 %)

Num. mappings      : 593729
Num. multihit reads: 248
----------------------------------------------
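The percentages in the report follow directly from the read counts. As a small sketch (pct is a hypothetical helper, and the counts are copied from the report above), they can be recomputed like this:

```shell
# pct PART TOTAL: print PART as a percentage of TOTAL with two decimals.
# Illustrative helper, not part of HPG Aligner.
pct() {
    awk -v p="$1" -v t="$2" 'BEGIN { printf "%.2f", 100 * p / t }'
}

# Counts copied from the report above.
total=652309
mapped=593322
mappings=593729

unmapped=$((total - mapped))
# Mappings beyond one per mapped read; presumably these come from the
# multihit reads (an assumption, the report does not state it).
extra=$((mappings - mapped))

echo "mapped:   $mapped ($(pct "$mapped" "$total") %)"
echo "unmapped: $unmapped ($(pct "$unmapped" "$total") %)"
echo "extra mappings: $extra"
```

The mapped and unmapped lines reproduce the 90.96 % and 9.04 % shown in the report.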

The output file containing the resulting mappings is located in the folder specified by the parameter -o, in our case, the folder mapped:

ls -l mapped
out.sam

As you can see, SAM is the default output format. With the parameter --bam-format the output is written in BAM instead, which makes the process slower.
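Because SAM is plain tab-separated text, the output can be inspected with standard tools. The sketch below counts alignment records and those with the unmapped bit (0x4) set in the FLAG column, as defined by the SAM format. sam_stats is a hypothetical helper name, and whether HPG Aligner writes unmapped reads to out.sam is not stated here, so the unmapped count may simply be zero.

```shell
# sam_stats FILE: count alignment records and records flagged as unmapped.
# Illustrative helper, not part of HPG Aligner.
sam_stats() {
    awk -F '\t' '
        /^@/ { next }                        # skip SAM header lines
        { records++ }
        int($2 / 4) % 2 == 1 { unmapped++ }  # FLAG bit 0x4: read is unmapped
        END { printf "records=%d unmapped=%d\n", records, unmapped }
    ' "$1"
}

# Only runs if the mapping step above has produced the file.
if [ -f mapped/out.sam ]; then
    sam_stats mapped/out.sam
fi
```

The bit test uses integer division rather than a bitwise operator so that it works with any POSIX awk.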

For paired-end mapping, use the parameter -j to pass the second mate file:

./hpg-aligner dna -f test_1.fq -j test_2.fq -i chr20-index -o mapped
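Paired-end mapping expects the two mate files to describe the same reads, so a quick sanity check is to compare their read counts. This is a sketch under the assumption that each FASTQ record occupies exactly four lines (no wrapped sequences); fastq_reads is a hypothetical helper name.

```shell
# fastq_reads FILE: number of reads, assuming four lines per FASTQ record.
# Illustrative helper, not part of HPG Aligner.
fastq_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Only runs if both mate files from this tutorial are present.
if [ -f test_1.fq ] && [ -f test_2.fq ]; then
    n1=$(fastq_reads test_1.fq)
    n2=$(fastq_reads test_2.fq)
    if [ "$n1" -eq "$n2" ]; then
        echo "mate files match: $n1 read pairs"
    else
        echo "mate files differ: $n1 vs $n2 reads"
    fi
fi
```

A mismatch usually points to a truncated download or an incomplete extraction.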

To see all available parameters, type:

./hpg-aligner -h

RNA-seq Aligning