Download script & command builder: https://ewels.github.io/AWS-iGenomes/
In NGS bioinformatics, a typical analysis run involves aligning raw DNA sequencing reads against a known reference genome. A different reference is needed for every species, and many species have several references to choose from. Each tool then builds its own indices against these references. As such, one analysis run typically requires a number of different files, for example the raw underlying DNA sequence, annotation (GTF files) and index files for use with the chosen alignment tool.
These files are quite large and take time to generate. Downloading and building them for each AWS run often takes a significant portion of the total run time and resources, which is very wasteful. To help with this, we have created an AWS S3 bucket containing the illumina iGenomes references, with a few additional indices for extra tools on top of this base dataset. The iGenomes initiative aims to collect and standardise a number of common species, references and tool indices.
This data is hosted in an S3 bucket (~5TB) and crucially is uncompressed (unlike the .tar.gz files held on the illumina iGenomes FTP servers). AWS runs can pull just the required files to their local file storage before running. This has the advantage of being faster, cheaper and more reproducible.
To make usage easier, this repository contains a script (aws-igenomes.sh) which can sync the AWS-iGenomes for you. It requires the AWS command line tools to be installed and configured with authentication. Required references can be supplied on the command line or given through prompts when running the script.
This repository is hosted using GitHub pages, so the script can be run in a single command as follows:
curl -fsSL https://ewels.github.io/AWS-iGenomes/aws-igenomes.sh | bash
For more details, see https://ewels.github.io/AWS-iGenomes/
If you'd prefer to just get a sync command for the files you need, you can use the web-based command builder that's available at https://ewels.github.io/AWS-iGenomes/
The details of the S3 bucket are as follows:
- Bucket Name: ngi-igenomes
- Bucket ARN: arn:aws:s3:::ngi-igenomes
- Region: EU (Ireland)
A full list of available files can be seen in ngi-igenomes_file_manifest.txt
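As a rough sketch of how the manifest can be used, the function below filters a local copy of it for a single species, source and build. It assumes that the manifest lists one S3 path per line and that paths follow the Species/Source/Build layout of the bucket. The function name list_build_files and the example arguments are hypothetical.

```shell
# Print manifest entries for one species/source/build.
# Assumption: the manifest holds one S3 path per line, shaped like
#   s3://ngi-igenomes/igenomes/<Species>/<Source>/<Build>/...
list_build_files() {
  manifest="$1"
  species="$2"
  source_name="$3"
  build="$4"
  # -F matches the prefix as a fixed string, so dots in build names
  # (e.g. build37.1) are not treated as regex wildcards
  grep -F "igenomes/${species}/${source_name}/${build}/" "$manifest"
}

# Usage, after downloading the manifest from this repository:
# list_build_files ngi-igenomes_file_manifest.txt Mus_musculus Ensembl GRCm38
```

Matching on the full prefix with trailing slashes avoids accidentally picking up builds whose names merely start with the same characters.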
The following species have reference builds available:
- Arabidopsis thaliana
- Bacillus cereus ATCC 10987
- Bacillus subtilis 168
- Bos taurus
- Caenorhabditis elegans
- Canis familiaris
- Danio rerio
- Drosophila melanogaster
- Enterobacteriophage lambda
- Equus caballus
- Escherichia coli K 12 DH10B
- Escherichia coli K 12 MG1655
- Gallus gallus
- Glycine max
- Homo sapiens
- Macaca mulatta
- Mus musculus
- Mycobacterium tuberculosis H37RV
- Oryza sativa japonica
- Pan troglodytes
- PhiX
- Pseudomonas aeruginosa PAO1
- Rattus norvegicus
- Rhodobacter sphaeroides 2.4.1
- Saccharomyces cerevisiae
- Schizosaccharomyces pombe
- Sorangium cellulosum So ce 56
- Sorghum bicolor
- Staphylococcus aureus NCTC 8325
- Sus scrofa
- Zea mays
Most of these species then have references from multiple sources and builds. For example, Mus musculus has the following:
- Ensembl: GRCm38, NCBIM37
- NCBI: build37.1, build37.2, GRCm38
- UCSC: mm10, mm9
Within each reference build, the following resources are typically available (with a few exceptions):
- Gene annotation in GTF and BED format
- Sequence FASTA files:
  - Whole genome files
  - Separate chromosomes
  - Abundant sequences
- Alignment indices for the following tools:
- For some genomes:
  - smRNA (miRBase)
  - Variation
An additional special case is the GATK bundles, available for Homo sapiens (b37, hg19, hg38, GRCh37 and GRCh38).
See Data origin below for more details of how these files were generated.
The S3 bucket is currently set to be completely open access (there were problems with the previous Requester Pays policy). This will remain the case until the credits awarded to fund this project from Amazon run out or expire (hopefully stable for some time yet).
Note that, if possible, it's best for us if you run in the same region as this S3 bucket (eu-west, Ireland). Then there should be no data transfer fees and the resource should stay around for longer.
From the EC2 FAQ:
There is no Data Transfer charge between two Amazon Web Services within the same region (i.e. between Amazon EC2 US West and another AWS service in the US West). Data transferred between AWS services in different regions will be charged as Internet Data Transfer on both sides of the transfer.
How you use this resource largely depends on how you're using AWS. Very generally however, you can retrieve your required data by using the AWS Command Line Interface.
For example, using the aws s3 sync command:
aws s3 sync s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/ ./my_refs/
If the aws tool isn't installed, probably the easiest way to get it is using pip:
pip install --upgrade --user awscli
Remember that you must configure the tool with some kind of AWS authentication to access the contents of the s3 bucket.
For more information and help, see the AWS CLI user guide.
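The path in the sync command follows a Species/Source/Build pattern, so it can be assembled from variables. The helper below is a minimal sketch that only prints the command rather than running it, which makes it easy to check before anything is transferred. The function name igenomes_sync_cmd is hypothetical.

```shell
# Print (do not run) the sync command for one resource directory.
# Arguments: species, source, build, resource path within the build, local destination.
igenomes_sync_cmd() {
  printf 'aws s3 sync s3://ngi-igenomes/igenomes/%s/%s/%s/%s/ %s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Reproduces the STAR index example for Homo sapiens, Ensembl GRCh37
igenomes_sync_cmd Homo_sapiens Ensembl GRCh37 Sequence/STARIndex ./my_refs/
```

Once the printed command looks right, it can be pasted into a terminal or wrapped in a pipeline step.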
Nextflow is a powerful workflow manager allowing the creation of bioinformatics analysis pipelines. It was created to help the transition from traditional academic HPC systems to cloud computing. As such, it has extensive built-in support for a number of AWS features. One such feature is native integration with s3. This means that you can specify paths to required reference files in your pipeline which are stored in s3 and Nextflow will automatically retrieve them.
The repository contains an example Nextflow config file containing common paths and a suggested usage example: nextflow.config
For an example of this in action, see our NGI-RNAseq pipeline. The aws profile config contains s3 paths and our regular HPC config contains comparable regular file paths. This allows us to run the pipeline on either our HPC system or AWS with the same command and no extra setup.
This resource is based on the illumina iGenomes references. These were downloaded and unpacked in April 2016.
After unpacking, references were added for STAR, Bismark and BED12. A new reference directory was created for each reference and the index built (see commands below).
A full list of available files can be seen in this repository: ngi-igenomes_file_manifest.txt
module load star/2.5.1b
STAR --runMode genomeGenerate --runThreadN 8 --genomeDir ./ --genomeFastaFiles genome.fa --sjdbGTFfile genes.gtf --sjdbOverhang 100
(If no GTF file was available, --sjdbGTFfile genes.gtf --sjdbOverhang 100 was not specified.)
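The GTF condition described above could be scripted roughly as follows. This is a sketch and not the script that was actually used to build the indices: the function name star_index_cmd is hypothetical, and it only prints the STAR command so that the logic can be checked without STAR installed.

```shell
# Print the STAR index command, adding the annotation flags
# only when the given GTF file exists.
star_index_cmd() {
  gtf="$1"
  cmd="STAR --runMode genomeGenerate --runThreadN 8 --genomeDir ./ --genomeFastaFiles genome.fa"
  if [ -f "$gtf" ]; then
    # Splice-junction annotation is only usable when a GTF is present
    cmd="$cmd --sjdbGTFfile $gtf --sjdbOverhang 100"
  fi
  printf '%s\n' "$cmd"
}

# Run from inside a reference directory
star_index_cmd genes.gtf
```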
module load bowtie/1.1.2
module load bowtie2/2.2.6
module load bismark/0.14.5
bismark_genome_preparation ./
bismark_genome_preparation --bowtie2 ./
BED12 files were generated using the gtf2bed tool from ea-utils.
Files for the STAR, Bismark and BED12 additions were kindly generated by the UPPMAX team. Full details and the exact scripts used for this can be found at github.com/UPPMAX/bio-data.
The GATK Resource Bundles for builds b37, hg19 and hg38 were downloaded from the Broad FTP server on 2017-05-25. For more information about their contents, please see this article.
Please note that b37/CEUTrio.HiSeq.WGS.b37.NA12878.bam and associated files are not included. This file is ~355GB and, with the download rate limiting on the Broad FTP server, it was going to take nearly a year to transfer.
The Mouse Genome Project data was added to allow for the usage of GRCm38 data with the Sarek pipeline. This data was simply downloaded from the MGP FTP and additional files were created.
These files were added to AWS-iGenomes in November 2019.
These included the dbSNP SNP files with index and the dbSNP Indel files with corresponding index.
mgp.v5.merged.snps_all.dbSNP142.vcf.gz
mgp.v5.merged.snps_all.dbSNP142.vcf.gz.tbi
mgp.v5.merged.indels.dbSNP142.normed.vcf.gz
mgp.v5.merged.indels.dbSNP142.normed.vcf.gz.tbi
While the annotation folder contains a BED file for gene annotation, no intervals BED or interval list, as required for running GATK, was available. This was simply created using the genome.fa.fai of GRCm38, modified as follows:
awk -v FS='\t' -v OFS='\t' '{ print $1, "0", $2 }' genome.fa.fai > wgs_calling_regions.grcm38.bed
Then we created an interval_list file using this command:
gatk BedToIntervalList --INPUT References/GRCm38_calling_list.bed --OUTPUT References/GRCm38_calling_list.list --SEQUENCE_DICTIONARY References/genome.dict
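To illustrate the awk step, here is a small worked example on a made-up two-contig index. It relies on the standard .fai layout, where the first two tab-separated columns are the sequence name and its length. The contig names, lengths and the fai_to_bed wrapper are hypothetical.

```shell
# Convert a FASTA index (.fai) into a BED file covering each sequence in full.
fai_to_bed() {
  awk -v FS='\t' -v OFS='\t' '{ print $1, "0", $2 }' "$1"
}

# Made-up index: name, length, offset, bases per line, bytes per line
tmp_fai="$(mktemp)"
printf 'chr1\t1000\t6\t60\t61\n' > "$tmp_fai"
printf 'chr2\t500\t1030\t60\t61\n' >> "$tmp_fai"

fai_to_bed "$tmp_fai"
rm -f "$tmp_fai"
```

Each contig becomes one BED line running from position 0 to its full length, so chr1 is written as the interval 0 to 1000 and chr2 as 0 to 500.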
AWS-iGenomes is now an AWS Open Data Resource (see https://registry.opendata.aws/aws-igenomes/). AWS has agreed to host up to 8TB of data for the AWS-iGenomes dataset until at least 28th October 2022. The resource has been renewed once so far and I hope that it will continue to be renewed for the foreseeable future.
If you have any questions please get in touch with Phil Ewels (phil.ewels@scilifelab.se, @ewels) or create an issue on this repository.
- Made a web interface for generating aws s3 sync commands (not everyone likes random command line scripts..)
- Now that Amazon are taking the cost of the hosting, everything is fully public
  - Added --no-sign-request to the commands so that they work without authentication
- Added new GRCh37 and GRCh38 builds for GATK
- Different to the existing hg18 and hg19 builds only in that the file organisation is cleaner and consistent with the rest of iGenomes (old builds left for backwards-compatibility)
- Contain new indexes for BWA. More to be added in the future.
Version v0.2 - 2016-05-25
- Added GATK bundles b37, hg19 and hg38 from the Broad FTP download
- Minor download script updates
Version v0.1 - 2016-05-23
Initial release. Repository created with file-list of the iGenomes resource, with added BED12, STAR and Bismark indices. Download bash script written and basic website created at https://ewels.github.io/AWS-iGenomes/
The iGenomes resource was created by illumina. All credit for the collection and standardisation of this data should go to them!
This S3 resource was set up and documented by Phil Ewels (@ewels). The additional references not found in the base iGenomes resource were created with the help of Wesley Schaal (@wschaal) - a system administrator at UPPMAX (Uppsala Multidisciplinary Center for Advanced Computational Science).
The resource was initially developed for use at the National Genomics Infrastructure at SciLifeLab in Stockholm, Sweden.