Getting started

jtarraga edited this page Sep 29, 2014 · 36 revisions

You can learn how to use some of the most important features in this 5-minute tutorial. It assumes that HPG Aligner is already installed; otherwise, please visit the Downloads section to install the binaries or compile the source code. This tutorial uses the most recent version of HPG Aligner; command lines and parameters for other versions may differ slightly.

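To run this worked example you also need the following files:

test_paired_chr20.tar.gz (FASTQ genomic DNA dataset)
homo_sapiens.grch37.70.dna.chromosome.20.fa.tar.gz (human chromosome 20)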

Preparing the environment

Create a folder, download the compressed FASTQ genomic DNA dataset and the human chromosome 20 into it, then uncompress those files:

mkdir tutorial
cd tutorial
tar zxvf test_paired_chr20.tar.gz
tar zxvf homo_sapiens.grch37.70.dna.chromosome.20.fa.tar.gz

The HPG Aligner binary is assumed to be in the same directory as the data in this tutorial, so copy it into the folder. You should now have these files:

test_1.fq
test_2.fq
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa
hpg-aligner
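
Before moving on, it can help to confirm that the four files listed above are in place. The following is only a minimal sketch: `check_tutorial_files` is a hypothetical helper written for this tutorial, not something shipped with HPG Aligner, and it only tests that each file exists.

```shell
# check_tutorial_files [DIR]: report which of the tutorial files are missing.
# Hypothetical helper, not part of HPG Aligner. DIR defaults to the
# current directory. Returns 0 when every file is present, 1 otherwise.
check_tutorial_files() {
    dir=${1:-.}
    missing=0
    for f in test_1.fq test_2.fq \
             Homo_sapiens.GRCh37.70.dna.chromosome.20.fa hpg-aligner; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_tutorial_files` inside the tutorial folder prints nothing when all four files are there.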

Building the index

Before mapping, we need to create the index. HPG Aligner is one of the fastest tools at creating an index; it uses a multi-threaded implementation to speed up this process. Depending on the size of the genome, building the index may take some time and need a lot of memory, but you should not have any problem with this chromosome. Run the command build-sa-index, specifying the FASTA genome (-g) and the output directory for the index (-i):

mkdir chr20-index
./hpg-aligner build-sa-index -g Homo_sapiens.GRCh37.70.dna.chromosome.20.fa -i chr20-index/

If the index generation succeeds, you'll have the following files in the folder chr20-index:

dna_compression.bin
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.A
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.CHROM
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.IA
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.JA
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.S
Homo_sapiens.GRCh37.70.dna.chromosome.20.fa.SA
index
params.info
params.txt

DNA Aligning

To map the FASTQ files, run the tool with the command dna, since the dataset contains genomic DNA sequences. The index folder is passed as an argument with -i. The default options should provide good performance; for example, all available cores are used to speed up the mapping. We create a folder called mapped to store the results:

mkdir mapped
./hpg-aligner dna -f test_1.fq -i chr20-index -o mapped

The execution should not take long, as the files are small. If you have a multi-core machine, you can check that all cores are being used with the command htop. HPG Aligner prints a small report:

----------------------------------------------
Loading SA tables...
End of loading SA tables in 0.02 min. Done!!
----------------------------------------------
Starting mapping...
End of mapping in 0.03 min. Done!!
----------------------------------------------
Output file        : mapped/alignments.sam

Num. reads         : 20359
Num. mapped reads  : 20357 (99.99 %)
Num. unmapped reads: 2 (0.01 %)

Num. mappings      : 21106
Num. multihit reads: 145
----------------------------------------------
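
The percentages in the report follow directly from the read counts. As a quick sanity check, the sketch below recomputes them with awk; `pct` is a hypothetical helper name introduced here for illustration, not an HPG Aligner command.

```shell
# pct PART TOTAL: print PART as a percentage of TOTAL with two decimals.
# Hypothetical helper for checking the report; not part of HPG Aligner.
pct() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * part / total }'
}

pct 20357 20359   # mapped reads   -> 99.99
pct 2 20359       # unmapped reads -> 0.01
```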

The output file containing the resulting mappings is located in the folder specified by the parameter -o, in our case, the folder mapped:

ls -l mapped
out.sam

As you can see, SAM is the default output file format; with the parameter --bam-format the output is written in BAM instead. Using this parameter makes the process slower.
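
Because SAM is plain text, the result can be inspected with standard tools. The sketch below counts mapped and unmapped alignment records, assuming HPG Aligner sets the FLAG field (column 2) as the SAM specification describes, where bit 0x4 marks an unmapped record. `sam_counts` is a hypothetical helper, and it counts records rather than reads, so a multihit read may contribute more than one mapped record.

```shell
# sam_counts FILE: print "mapped=<n> unmapped=<n>" for the alignment
# records in a SAM file. Hypothetical helper, not part of HPG Aligner.
sam_counts() {
    awk -F '\t' '
        /^@/ { next }    # skip header lines
        {
            # Bit 0x4 of the FLAG column marks an unmapped record.
            # POSIX awk has no bitwise AND, so test the bit arithmetically.
            if (int($2 / 4) % 2 == 1) unmapped++; else mapped++
        }
        END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }
    ' "$1"
}
```

Call it with the path of the SAM file in your mapped folder, for example `sam_counts mapped/alignments.sam` if that is the name shown in the report.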

For paired-end mapping, use the parameter -j to pass the second mate file:

./hpg-aligner dna -f test_1.fq -j test_2.fq -i chr20-index -o mapped

----------------------------------------------
Loading SA tables...
End of loading SA tables in 0.02 min. Done!!
----------------------------------------------
Starting mapping...
End of mapping in 0.04 min. Done!!
----------------------------------------------
Output file        : mapped/alignments.sam

Num. reads         : 40718
Num. mapped reads  : 40717 (100.00 %)
Num. unmapped reads: 1 (0.00 %)

Num. mappings      : 41075
Num. multihit reads: 136
----------------------------------------------

To see all available parameters, type:

./hpg-aligner -h

RNA-seq Aligning

To map RNA-seq FASTQ files, run the tool with the command rna. The index folder is passed as an argument with -i. The default options should provide good performance; for example, all available cores are used to speed up the mapping. We create a folder called mapped to store the results:

mkdir -p mapped
./hpg-aligner rna -f test_1.fq -i chr20-index -o mapped

For paired-end mapping, use the parameter -j to pass the second mate file:

./hpg-aligner rna -f test_1.fq -j test_2.fq -i chr20-index -o mapped
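
The DNA and RNA-seq command lines in this tutorial differ only in the command name and in the optional -j parameter. The sketch below captures that pattern in a small function that prints the command line instead of running it, so you can review it first. `hpg_map_cmd` is a hypothetical helper; it uses only the parameters shown in this tutorial (-f, -j, -i, -o) and assumes paths without spaces.

```shell
# hpg_map_cmd MODE INDEX OUTDIR READS1 [READS2]
# Print the hpg-aligner command line for single-end or paired-end mapping.
# Hypothetical helper, not part of HPG Aligner; paths must not contain spaces.
hpg_map_cmd() {
    mode=$1
    index=$2
    outdir=$3
    reads1=$4
    reads2=${5:-}
    case $mode in
        dna|rna) ;;
        *) echo "mode must be dna or rna" >&2; return 1 ;;
    esac
    cmd="./hpg-aligner $mode -f $reads1"
    # Add the second mate file only for paired-end mapping.
    if [ -n "$reads2" ]; then
        cmd="$cmd -j $reads2"
    fi
    echo "$cmd -i $index -o $outdir"
}
```

For example, `hpg_map_cmd rna chr20-index mapped test_1.fq test_2.fq` prints the paired-end RNA-seq command used above.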