ArrowSAM is an in-memory Sequence Alignment/Map (SAM) representation that uses the Apache Arrow framework (a cross-language development platform for in-memory data) and the Plasma shared-memory object store to store and process SAM data in a columnar, in-memory format.
The following papers describe the ArrowSAM format and its use to speed up genomics pipelines. If you use ArrowSAM in your work, please cite them.
Ahmad et al., (2020). "ArrowSAM: In-Memory Genomics Data Processing Using Apache Arrow", ICCAIS. https://doi.org/10.1109/ICCAIS48893.2020.9096725
Ahmad et al., "Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework", BMC Genomics, presented at APBC2020. https://doi.org/10.1186/s12864-020-07013-y
This repo contains the following three components:
- BWA-MEM, Picard and GATK tools integrated with ArrowSAM (the in-memory SAM data representation).
- A Singularity container definition (def) file, used to create an environment with all the Apache Arrow related tools and libraries that ArrowSAM needs.
- Scripts to run the different workflows recommended by the GATK best practices (using in-memory data placement techniques such as ArrowSAM, ramDisk and pipes for fast processing), so that a complete DNA analysis pipeline runs efficiently.
Note: ArrowSAM and all the other workflows target single-node, multi-core machines.
- Install the Singularity container.
- Download our Singularity script and generate a Singularity image (this image contains all the Arrow-related packages needed to build BWA-MEM, Picard and GATK).
- Enter the generated image with the command:
sudo singularity shell <image_name>.simg
- Download BWA-MEM inside the image:
git clone https://github.com/tahashmi/bwa.git
- Go into the bwa directory and compile BWA-MEM:
cd bwa
make
- Now you can run BWA-MEM.
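The steps above can be collected into a small helper script. This is only a sketch under stated assumptions, not the project's own tooling: `arrowsam.def`, `arrowsam.simg`, `ref.fa`, `reads_1.fq` and `reads_2.fq` are hypothetical placeholder names, and the last two commands assume that the fork keeps upstream BWA's `index` and `mem` subcommands. Running the ArrowSAM build may require additional setup (for example a running Plasma store) that is not shown here. By default the script only prints each command; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Hypothetical placeholders; replace with your own files.
DEF_FILE="${DEF_FILE:-arrowsam.def}"   # Singularity definition file from this repo
IMAGE="${IMAGE:-arrowsam.simg}"        # image to generate
REF="${REF:-ref.fa}"                   # reference genome (FASTA)
R1="${R1:-reads_1.fq}"                 # paired-end reads, first file
R2="${R2:-reads_2.fq}"                 # paired-end reads, second file

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "+ $*"
  else
    "$@"
  fi
}

# Run on the host: build the image, then open a shell inside it.
host_steps() {
  run sudo singularity build "$IMAGE" "$DEF_FILE"
  run sudo singularity shell "$IMAGE"
}

# Run inside the image: fetch and compile BWA-MEM, then align.
container_steps() {
  run git clone https://github.com/tahashmi/bwa.git
  run cd bwa
  run make
  # Assumption: upstream BWA command-line interface is unchanged in the fork.
  run ./bwa index "$REF"
  run ./bwa mem "$REF" "$R1" "$R2"
}
```

Source the script and call `host_steps` on the host; once inside the image, source it again and call `container_steps`. The two functions are kept separate because `singularity shell` is interactive, so the remaining commands have to be issued from within the container.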