-
Notifications
You must be signed in to change notification settings - Fork 189
ALLMAPS
The ordering and orientation of genomic scaffolds to reconstruct chromosomes is essential during de novo genome assembly. This process is often assisted by various mapping techniques. Because each map provides a unique line of evidence, a combination of multiple maps can greatly improve the accuracy of the resulting chromosomal assemblies. ALLMAPS is capable of computing a scaffold order that maximizes the colinearity to a collection of maps, including genetic, physical, or comparative maps into the final chromosome build. We highlight several salient features of ALLMAPS.
- Clear, computable objective which is to maximize the colinearity to multiple input maps - reproducible
- Allow variable weights in input maps, leading to better control in conflict resolution - flexible
- Can incorporate many mapping evidence, requiring minimal transformation - powerful
- Streamline the genome build (with FASTA and AGP outputs) and genome update (with CHAIN output) - easy-to-use
Solving scaffold ordering and orientation (OO) in general is NP-hard. ALLMAPS converts the problem into Traveling Salesman Problem (TSP) and refines scaffold OO using Genetic Algorithm. For a rough idea, a 'live' demo of the scaffold OO on yellow catfish chromosome 1 can be viewed in the animation below.
Tang H, Zhang X, Miao C, Zhang J, Ming R, Schnable J, Schnable P, Lyons E, Lu J. (2015) ALLMAPS: robust scaffold ordering based on multiple maps. Genome Biology 16(1):3
Most of these Python libraries can be installed using pip install
, for example:
pip install --user biopython==1.70 numpy deap networkx matplotlib jcvi
Notice the biopython==1.70
since the latest release (v1.72) of biopython contains a buggy function that ALLMAPS uses, so we have to use an older version.
Optional: If you experience a problem installing the jcvi
library, you can clone the jcvi repository directly and put the directory on your PYTHONPATH
. For example, if you want to install it to ~/code
, do the following:
$ cd ~/code
$ git clone git://github.com/tanghaibao/jcvi.git
$ export PYTHONPATH=~/code:$PYTHONPATH
If you are somehow stuck, you might need to install the libraries manually, please check out our installation guide in heavy detail.
We would like to use a small example to showcase what you can do with ALLMAPS. In this example, we have two maps and scaffold sequences. Our goal is to use the two maps, to order and orient the genomic scaffolds into chromosomes. Download the test dataset here. We have the following files:
- scaffold sequences (
scaffolds.fasta
) - Map 1 (
JMMale.csv
) - Map 2 (
JMFemale.csv
)
That's it. Now let's watch how the magic unfolds.
Scaffold ID, scaffold position, LG, genetic position
You can do this in EXCEL, but remember to save as "comma-separated format".
For the test maps, both maps are properly formatted, so you can skip this step. Additionally, please note that ALLMAPS supports a wide array of mapping evidence. For possible conversion for your own data, check out the article on How to use different types of genomic maps.
Step 2. Merge the two maps together. This will generate a weights file (weights.txt
) and the input bed file (JM-2.bed
).
python -m jcvi.assembly.allmaps merge JMMale.csv JMFemale.csv -o JM-2.bed
JMFemale 1
JMMale 1
The first map (first line) in the weights file is important since other maps will be partitioned according to this map. Please pick the map that matches the expected number of chromosomes.
python -m jcvi.assembly.allmaps path JM-2.bed scaffolds.fasta
We now have a release!
-
JM-2.fasta
- reconstructed chromosome sequences -
JM-2.agp
- order and orientations of the scaffolds, which can be used in Genbank submissions -
JM-2.chain
- useful to convert scaffold coordinates to new coordinates, for example, if you annotated gene models using the scaffolds, you can use this file along withliftOver
to transfer the genes onto chromosomes. See ALLMAPS: How to lift over gene annotations.
Each generated PDF file like chr23.pdf
corresponds to a reconstructed chromosome. The left panel contains "side-by-side" alignments between chromosomes and the linkage groups. This type of plot is helpful to reveal conflicting markers as crossing lines. The second visualization on the right is a scatter plot, where the coordinates of the dots represent the physical locations and the map locations of the markers. The scatter plots are a good visualization for illustrating the monotonic trend as well as revealing breaks in colinearity.
A summary report JM-2.summary.txt
is provided to the user during the genome build, with important statistics such as the number of scaffolds anchored, the number of big (N50) scaffolds anchored, the total number of markers included in the final assembly and total length of sequences. These statistics can be useful tools to compare the efficacy of ALLMAPS before and after anchoring.
*** Summary for each individual map ***
======================================================================
o JMFemale JMMale
----------------------------------------------------------------------
Linkage Groups 1 1
Markers (unique) 61 59
Markers per Mb 4.2 4.3
N50 Scaffolds 5 5
Scaffolds 18 18
Scaffolds with 1 marker 7 7
Scaffolds with 2 markers 3 5
Scaffolds with 3 markers 1 0
Scaffolds with >=4 markers 7 6
Total bases 14,691,276 (96.2%) 13,781,640 (90.2%)
----------------------------------------------------------------------
*** Summary for consensus map ***
===================================================================================
o Anchored Oriented Unplaced
-----------------------------------------------------------------------------------
Markers (unique) 120 107 0
Markers per Mb 7.9 8.8 0
N50 Scaffolds 5 5 0
Scaffolds 22 13 0
Scaffolds with 1 marker 5 0 0
Scaffolds with 2 markers 7 3 0
Scaffolds with 3 markers 1 1 0
Scaffolds with >=4 markers 9 9 0
Total bases 15,270,543 (100.0%) 12,156,526 (79.6%) 0 (0.0%)
-----------------------------------------------------------------------------------
The anchor rate for the consensus map reaches 100%, which is better than using any single map alone. Hooray!
ALLMAPS provide some way to splitting chimeric contigs, although most users would prefer to split with their own custom scripts. Please check out ALLMAPS: How to split chimeric contigs.
ALLMAPS can estimate sizes for inter-scaffold gaps, through the conversion from genetic distance (for example, centi-Morgan used in genetic maps) to physical distance. The default inter-scaffold gap size is 100, which according to Genbank convention, indicates a gap with unknown size. The following command will re-estimate gap sizes based on distance conversion, and generate a new AGP file JM-2.estimategaps.agp
.
python -m jcvi.assembly.allmaps estimategaps JM-2.bed
Please check out ALLMAPS: How to estimate gap sizes.
The above example assumes genetic maps as input data. ALLMAPS support other types of maps in BED format, and multiple BED files can be merged into a single input file, using mergebed
.
python -m jcvi.assembly.allmaps mergebed opticalmap.bed geneticmap.bed synteny.bed -o merged.bed
For detailed examples of conversion from other maps, please check out How to use different types of genomic maps.
The pivot map is the map with the heaviest weight or whichever comes first in the event of a tie, based on weights.txt
. The pivot map determines the number of partitions that lead to chromosomes. Make sure you set the weights so that the most confident map as the pivot (either assigned it a larger weight or make it the first line of the weights.txt
in the case of equal weights).
Two studies medicago
and yellow-catfish
, described in detail in the ALLMAPS manuscript, can be run to get further familiarized with ALLMAPS. See run.sh
in the respective folder on running ALLMAPS on these two data sets.
- Tang, Haibao (2014): ALLMAPS supporting data: Yellow catfish genome assembly. figshare. http://dx.doi.org/10.6084/m9.figshare.1057746
- Tang, Haibao (2014): ALLMAPS supporting data: Medicago genome assembly. figshare. http://dx.doi.org/10.6084/m9.figshare.1057745
© Haibao Tang, 2010-2024