alignment command

Nacho edited this page Jun 30, 2015 · 1 revision

The 'alignment' command processes BAM sequence files either locally or on a Hadoop cluster.

Assuming you are in the hpg-bigdata folder, type the following command to see the available alignment sub-commands for the Hadoop scenario:

$ build/bin/hpg-bigdata.sh alignment

Usage:   hpg-bigdata.sh alignment <subcommand> [options]

Subcommands:
     convert  Converts BAM files to different big data formats such as Avro and Parquet
       stats  Compute some stats for a file containing alignments according to the GA4GH/Avro model
       depth  Compute the depth (or coverage) for a given file containing alignments according to the GA4GH/Avro model

For a local scenario, use the script hpg-bigdata-local.sh:

$ build/bin/hpg-bigdata-local.sh alignment

Usage:   hpg-bigdata-local.sh alignment <subcommand> [options]

Subcommands:
     convert  Converts BAM files to different big data formats such as Avro

Sub-command: convert

Converts BAM files to different big data formats such as Avro and Parquet according to the GA4GH schema models. In the local scenario, only Avro is available.

Hadoop scenario:

$ build/bin/hpg-bigdata.sh alignment convert -h

Usage:   hpg-bigdata.sh alignment convert [options]

Options:
    * -i, --input          STRING     HDFS input file in BAM format [null]
          --to-parquet                To save the output file in Parquet format [false]
      -L, --log-level      STRING     Set the level log, values: debug, info, warning, error, fatal [info]
    * -o, --output         STRING     HDFS output file to store the BAM alignments according to the GA4GH/Avro model [null]
      -h, --help                      This parameter prints this help [false]
          --conf           STRING     Set the configuration file [null]
      -v, --verbose        BOOLEAN    This parameter set the level of the logging [false]
      -x, --compression    STRING     Accepted values: snappy, deflate, bzip2, xz, null. Default: snappy [snappy]

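Example (with --to-parquet, the Parquet files are written to a to-parquet subdirectory of the output path, next to the Avro files):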

$ hadoop fs -mkdir /test
$ hadoop fs -copyFromLocal build/data/test.bam /test
$ hadoop fs -ls /test
Found 1 items
-rw-r--r--   1 jtarraga supergroup      11755 2015-06-30 16:32 /test/test.bam
$ hadoop fs -mkdir /out
$ build/bin/hpg-bigdata.sh alignment convert -i /test/test.bam -o /out/test.bam.avro --to-parquet
...
...
$ hadoop fs -ls /out/test.bam.avro
Found 4 items
-rw-r--r--   1 jtarraga supergroup          0 2015-06-30 16:33 /out/test.bam.avro/_SUCCESS
-rw-r--r--   1 jtarraga supergroup      32608 2015-06-30 16:33 /out/test.bam.avro/part-r-00000.avro
-rw-r--r--   1 jtarraga supergroup        552 2015-06-30 16:33 /out/test.bam.avro/part-r-00000.avro.header
drwxr-xr-x   - jtarraga supergroup          0 2015-06-30 16:33 /out/test.bam.avro/to-parquet
$ hadoop fs -ls /out/test.bam.avro/to-parquet
Found 4 items
-rw-r--r--   1 jtarraga supergroup          0 2015-06-30 16:33 /out/test.bam.avro/to-parquet/_SUCCESS
-rw-r--r--   1 jtarraga supergroup      16217 2015-06-30 16:33 /out/test.bam.avro/to-parquet/_common_metadata
-rw-r--r--   1 jtarraga supergroup      21021 2015-06-30 16:33 /out/test.bam.avro/to-parquet/_metadata
-rw-r--r--   1 jtarraga supergroup      37268 2015-06-30 16:33 /out/test.bam.avro/to-parquet/part-m-00000.snappy.parquet

Local scenario:

$ build/bin/hpg-bigdata-local.sh alignment convert -h

Usage:   hpg-bigdata-local.sh alignment convert [options]

Options:
          --conf           STRING     Set the configuration file [null]
      -x, --compression    STRING     Accepted values: snappy, deflate, bzip2, xz. Default: snappy [snappy]
      -v, --verbose        BOOLEAN    This parameter set the level of the logging [false]
      -h, --help                      This parameter prints this help [false]
    * -i, --input          STRING     Local input file in BAM format [null]
      -L, --log-level      STRING     Set the level log, values: debug, info, warning, error, fatal [info]
          --to-bam                    Convert back to BAM format. In this case, the input file has to be saved in the GA4GH/Avro model, and the output file will be in BAM format [false]
    * -o, --output         STRING     Local output file to store the BAM alignments according to the GA4GH/Avro model [null]

Example:

$ mkdir /tmp/out
$ build/bin/hpg-bigdata-local.sh alignment convert -i build/data/test.bam -o /tmp/out/test.bam.avro
$ ls -ltr /tmp/out/test.bam.avro 
-rw-rw-r-- 1 jtarraga jtarraga 20348 jun 30 16:37 /tmp/out/test.bam.avro

In the local scenario, you can convert an Avro file back to BAM with the --to-bam option:

$ build/bin/hpg-bigdata-local.sh alignment convert -i /tmp/out/test.bam.avro -o /tmp/out/test.bam.avro.bam --to-bam
$ ls -lt build/data/test.bam /tmp/out/test.bam.avro.bam
-rw-rw-r-- 1 jtarraga jtarraga 11779 jun 30 16:39 /tmp/out/test.bam.avro.bam
-rw-rw-r-- 1 jtarraga jtarraga 11755 jun 30 15:29 build/data/test.bam
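
The listing above shows that the round-tripped file is slightly larger than the original (11779 versus 11755 bytes), so a byte-wise comparison cannot confirm that the alignments survived the conversion. A minimal sketch for comparing the alignment records instead, assuming samtools is installed (it is not part of hpg-bigdata, and the helper name same_records is made up for illustration):

```shell
# Compare the alignment records of two BAM files, ignoring their headers.
# Assumption: samtools is on the PATH; "samtools view" prints one record per line.
# Usage: same_records build/data/test.bam /tmp/out/test.bam.avro.bam && echo "same records"
same_records() {
    a=$(mktemp) && b=$(mktemp) || return 2
    # Dump the records of each file as text, then compare the two dumps.
    samtools view "$1" > "$a" && samtools view "$2" > "$b" && cmp -s "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return $status
}
```

The function returns 0 only when both files decode to the same records in the same order. Header differences are deliberately ignored, since this page does not state whether the header is preserved exactly.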
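
Sub-command: stats

Computes statistics for a file containing alignments stored according to the GA4GH/Avro model. The results are written in JSON format to a local output directory.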

Hadoop scenario:

$ build/bin/hpg-bigdata.sh alignment stats -h

Usage:   hpg-bigdata.sh alignment stats [options]

Options:
    * -o, --output         STRING     Local output directory to save stats results in JSON format  [null]
      -L, --log-level      STRING     Set the level log, values: debug, info, warning, error, fatal [info]
      -h, --help                      This parameter prints this help [false]
          --conf           STRING     Set the configuration file [null]
    * -i, --input          STRING     HDFS input file containing alignments stored accordin

Example:

$ mkdir /tmp/out-bam-stats
$ build/bin/hpg-bigdata.sh alignment stats -i /out/test.bam.avro/part-r-00000.avro -o /tmp/out-bam-stats/
...
...
$ ls -ltr /tmp/out-bam-stats/
total 8
-rw-r--r-- 1 jtarraga jtarraga 4562 jun 30 16:43 stats.json
$ cat /tmp/out-bam-stats/stats.json 
{"num_mapped": 176, "num_unmapped": 0, "num_paired": 176, "num_mapped_first": 88, "num_mapped_second": 88, "num_mismatches": 151, "nu...
...
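
To pull a single value out of stats.json without reading the whole line, something like the following may help. It is only a sketch: it assumes python3 is available, the helper name stats_field is hypothetical, and the only keys known from this page are the ones visible in the truncated output above.

```shell
# Print one top-level field of a stats JSON file.
# Assumption: python3 is on the PATH and the file is a single JSON object.
stats_field() {
    python3 -c 'import json, sys; print(json.load(open(sys.argv[1]))[sys.argv[2]])' "$1" "$2"
}
```

For example, `stats_field /tmp/out-bam-stats/stats.json num_mapped` prints the value stored under the num_mapped key.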
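
Sub-command: depth

Computes the depth (or coverage) for a file containing alignments stored according to the GA4GH/Avro model. The results are written to a text file in a local output directory.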

Hadoop scenario:

$ build/bin/hpg-bigdata.sh alignment depth -h

Usage:   hpg-bigdata.sh alignment depth [options]

Options:
    * -o, --output         STRING     Local output directory to save stats results in a text file  [null]
      -L, --log-level      STRING     Set the level log, values: debug, info, warning, error, fatal [info]
      -h, --help                      This parameter prints this help [false]
          --conf           STRING     Set the configuration file [null]
    * -i, --input          STRING     HDFS input file containing alignments stored accordin

Example:

$ mkdir /tmp/out-bam-depth
$ build/bin/hpg-bigdata.sh alignment depth -i /out/test.bam.avro/part-r-00000.avro -o /tmp/out-bam-depth/
...
...
$ ls -ltr /tmp/out-bam-depth/
total 5088
-rw-r--r-- 1 jtarraga jtarraga 5208096 jun 30 16:47 depth.txt
$ head /tmp/out-bam-depth/depth.txt
1	2080000	0
1	2080001	0
1	2080002	0
1	2080003	0
1	2080004	0
1	2080005	0
1	2080006	0
1	2080007	0
1	2080008	0
1	2080009	0
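
Because depth.txt has one line per position, it is easier to digest as a summary than by paging through it. The sketch below assumes the three columns are chromosome, position and depth, which is how the output above reads but is not stated by the tool's help; the helper name depth_summary is made up for illustration.

```shell
# Summarize a depth file with whitespace-separated columns:
# chromosome, position, depth (assumed layout).
depth_summary() {
    awk '
        NF == 3 { n++; sum += $3; if ($3 > 0) covered++ }
        END {
            if (n == 0) { print "no positions"; exit 1 }
            printf "positions=%d covered=%d mean_depth=%.2f\n", n, covered, sum / n
        }
    ' "$1"
}
```

Running `depth_summary /tmp/out-bam-depth/depth.txt` prints the number of positions, how many of them have a non-zero depth, and the mean depth over all positions.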