General

A repository for code related to genetics/genomics that is under development.

Introgression viewer

A program for quickly viewing a high-level overview of introgressions or contrasting genotypes between two or more individuals. It is designed for high-density SNP data as input and was written for viewing introgressions in barley genotypes. It currently only works with fixed (i.e. non-heterozygous) individuals and will ignore regions of heterozygosity, though there are plans to include het calls.

The genotype of a particular region is determined from a sliding window and parameters set by the user. The larger the sliding window, the more accurate the prediction of the genotype for that region, but there is a trade-off between accuracy and resolution. The histogram of identity produced by the program should be used to fine-tune the parameters. The user can also set a minimum introgression size to avoid false positive recombination events.
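As a rough illustration of the sliding-window idea described above, the sketch below computes percentage identity between two lines for each window. It is not the program's own code: the function name `window_identity` is hypothetical, and it assumes that windows and steps are counted in markers (not base pairs) and that het/NA calls have already been removed.

```python
def window_identity(alleles_a, alleles_b, window_size=100, step_size=10):
    """Return (start_index, percent_identity) for each window of markers.

    Assumption: window_size and step_size are counted in markers.
    """
    if len(alleles_a) != len(alleles_b):
        raise ValueError("allele lists must be the same length")
    if not alleles_a:
        return []
    results = []
    # If there are fewer markers than one window, use a single short window.
    last_start = max(len(alleles_a) - window_size, 0)
    for start in range(0, last_start + 1, step_size):
        a = alleles_a[start:start + window_size]
        b = alleles_b[start:start + window_size]
        matches = sum(x == y for x, y in zip(a, b))
        results.append((start, 100.0 * matches / len(a)))
    return results


# Toy example: the two lines agree on the first four markers only.
line_1 = list("AAAAAAAA")
line_2 = list("AAAATTTT")
identities = window_identity(line_1, line_2, window_size=4, step_size=2)
```

With a window of 4 and a step of 2, the middle window straddles the breakpoint and reports 50% identity, which is the accuracy/resolution trade-off in miniature: larger windows smooth out noise but blur where the boundary lies.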

Dependencies:

  • pandas
  • matplotlib

Introgression viewer requires python >=3.7

Usage

python Introgression_viewer.py [-d] [-c] [-chrom] [-cutoff] [-step_size] [-window_size] [-colour] [-features]

Arguments:

 -d               Input dataset
 -c               The comparisons you would like to make
 -chrom           Optional: The chromosomes you want to visualise
 -cutoff          Optional: Percentage similarity cutoff for determining alleles (default: 90)
 -step_size       Optional: Step size (default: 10)
 -window_size     Optional: Window size (default: 100)
 -colour          Optional: Colours for comparison (default: "purple,green")
 -features        Optional: Genes or other features to add

Default usage would look like this:

python Introgression_viewer.py -d my_genotypes.tab -c "Int_52-Int_17, Int_52-Barke"

Detailed explanation of arguments:

-d Input dataset This is a tab-delimited text file with genotype information for each of your individuals of interest. The columns must be in the order marker name, chromosome, position, line 1 allele, line 2 allele, ..., line n allele, e.g.:

Marker	Chromosome	Barke position	Barke	Int_52	Int_33	Int_42	Int_17	Int_19	Int_56
JHI-Hv50k-2016-7	chr1H	112201	T	T	T	T	T	T	T
JHI-Hv50k-2016-24	chr1H	110264	T	T	T	T	T	T	T
JHI-Hv50k-2016-64	chr1H	106615	A	A	A	A	A	A	A
JHI-Hv50k-2016-66	chr1H	106465	T	T	T	T	T	T	T
JHI-Hv50k-2016-72	chr1H	105691	G	G	G	G	G	G	G
JHI-Hv50k-2016-73	chr1H	105623	NA	NA	NA	NA	NA	NA	NA
JHI-Hv50k-2016-88	chr1H	104313	A	A	A	A	A	A	A
JHI-Hv50k-2016-97	chr1H	103806	A	A	A	A	A	A	A

Genotypes should be in the format A|C|G|T. Heterozygous genotypes (e.g. A/T) and NAs will be removed automatically.
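To make the input format concrete, here is a minimal sketch of reading such a file and keeping only markers where both compared lines carry a fixed A/C/G/T call. It uses only the standard library; `load_fixed_genotypes` is a hypothetical helper, and filtering per pair of lines (rather than across all lines at once) is an assumption, since the program's exact behaviour is not described here.

```python
import csv
import io

VALID_ALLELES = {"A", "C", "G", "T"}


def load_fixed_genotypes(handle, line_a, line_b):
    """Return (marker, chrom, position, allele_a, allele_b) tuples.

    Rows where either line is heterozygous (e.g. A/T) or NA are skipped.
    Assumption: the first three columns are marker, chromosome, position.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    marker_col, chrom_col, pos_col = reader.fieldnames[:3]
    kept = []
    for row in reader:
        a, b = row[line_a], row[line_b]
        if a in VALID_ALLELES and b in VALID_ALLELES:
            kept.append((row[marker_col], row[chrom_col], int(row[pos_col]), a, b))
    return kept


# Small made-up table in the layout shown above.
sample = (
    "Marker\tChromosome\tBarke position\tBarke\tInt_52\n"
    "m1\tchr1H\t112201\tT\tT\n"
    "m2\tchr1H\t110264\tA/T\tT\n"
    "m3\tchr1H\t105623\tNA\tNA\n"
    "m4\tchr1H\t104313\tA\tG\n"
)
rows = load_fixed_genotypes(io.StringIO(sample), "Barke", "Int_52")
```

Here `m2` (heterozygous) and `m3` (NA) are dropped, leaving `m1` and `m4`.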

-c comparisons The genotypes you want to compare. Give one or more comparisons in this format: genotype2-genotype1, genotype3-genotype1. Warning: different comparisons may have different similarities, so you may want to run one at a time with parameters set for each one. Also make sure the genotype names match the genotype names in your -d input file, otherwise nothing will work!
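Because a mismatched genotype name is an easy mistake to make, it can help to check a comparison string against the input file's header before running. The sketch below shows one way to do that; `parse_comparisons` and `missing_names` are hypothetical helpers, not part of the program, and the parsing assumes genotype names do not themselves contain a hyphen.

```python
def parse_comparisons(spec):
    """Split 'g2-g1, g3-g1' into [('g2', 'g1'), ('g3', 'g1')].

    Assumption: genotype names contain no '-' characters.
    """
    pairs = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        first, sep, second = item.partition("-")
        if not sep or not first or not second:
            raise ValueError(f"expected 'genotype2-genotype1', got {item!r}")
        pairs.append((first.strip(), second.strip()))
    return pairs


def missing_names(pairs, columns):
    """Return genotype names used in comparisons but absent from the header."""
    known = set(columns)
    return sorted({name for pair in pairs for name in pair if name not in known})


pairs = parse_comparisons("Int_52-Int_17, Int_52-Barke")
header = ["Marker", "Chromosome", "Barke position", "Barke", "Int_52"]
absent = missing_names(pairs, header)  # names to fix before running
```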

-chrom chromosomes Optional: The chromosomes you want to visualise, e.g. chr1H,chr2H. Default will display all.

-cutoff Optional: Percentage similarity cutoff for determining alleles, default = 90. Use the histogram to set this value correctly.

-step_size Step size (int), default = 10. A small value (e.g. 1) can give more accurate introgression intervals, but they will be fuzzy. A large value gives sharp intervals, but with a loss of accuracy.

-window_size Window size, default = 100. Larger windows will mean fewer false positives/negatives, but could lead to less accurate boundary positions.

-colour Colours for comparison in the format <colour1,colour2>. Default: purple and green. Note: you can use hex colour codes, but replace the # with %, e.g. for yellow use %FFFF00 instead of #FFFF00.
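The % for # substitution can be undone with a one-line conversion before the colours reach matplotlib. The sketch below is only an illustration of that convention; `parse_colours` is a hypothetical name and the program may handle this differently.

```python
def parse_colours(spec="purple,green"):
    """Split a 'colour1,colour2' string and turn a leading % back into #."""
    colours = []
    for item in spec.split(","):
        item = item.strip()
        if item.startswith("%"):
            # '%FFFF00' on the command line stands for the hex code '#FFFF00'.
            item = "#" + item[1:]
        colours.append(item)
    return colours


colours = parse_colours("%FFFF00,blue")
```

Named colours such as `purple` pass through unchanged, so both styles can be mixed in one argument.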

-features Genes or other features to add onto chromosomes. File must be in bed4 format
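For reference, a bed4 file has four tab-separated columns: chromosome, start, end and name. The sketch below reads such a file with the standard library; `read_bed4` is a hypothetical helper, the gene names are made up, and strict tab separation is an assumption.

```python
import io


def read_bed4(handle):
    """Return a list of (chrom, start, end, name) tuples from a bed4 file."""
    features = []
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and the optional header/comment lines BED allows.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name = line.split("\t")[:4]
        features.append((chrom, int(start), int(end), name))
    return features


# Made-up features for illustration only.
example = "chr1H\t1000\t2000\tgeneA\nchr2H\t500\t900\tgeneB\n"
features = read_bed4(io.StringIO(example))
```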

Outputs

Outputs will appear in the same directory as the script.

Genotype2-Genotype1.bed Introgressions (or regions of similarity/difference) in bed4 format. This is used as the input for the chromosome plot. Check here for any oddities.
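To show how per-window calls could become bed4 intervals, the sketch below merges adjacent windows that share a call and drops intervals below a minimum size. This is an assumed approach, not the program's actual algorithm; `merge_calls`, `to_bed4` and the labels `same`/`different` are hypothetical.

```python
def merge_calls(windows, min_size=0):
    """Merge consecutive windows with the same call into intervals.

    windows: (chrom, start, end, call) tuples sorted by chrom and start.
    Intervals shorter than min_size are dropped (a simplification: the
    neighbours of a dropped interval are not re-merged).
    """
    merged = []
    for chrom, start, end, call in windows:
        if merged and merged[-1][0] == chrom and merged[-1][3] == call:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end), call)
        else:
            merged.append((chrom, start, end, call))
    return [iv for iv in merged if iv[2] - iv[1] >= min_size]


def to_bed4(intervals):
    """Format intervals as tab-separated bed4 lines."""
    return "\n".join(f"{c}\t{s}\t{e}\t{n}" for c, s, e, n in intervals)


windows = [
    ("chr1H", 0, 100, "same"),
    ("chr1H", 100, 200, "same"),
    ("chr1H", 200, 210, "different"),
    ("chr1H", 210, 400, "same"),
]
intervals = merge_calls(windows, min_size=50)
bed_text = to_bed4(intervals)
```

In this toy input the 10 bp `different` block is removed by the minimum size filter, which is the kind of spurious short interval the minimum introgression size is meant to suppress.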

Genotype2-Genotype1_chromosomes.png The visualisation of chromosomes

chromosomeN_histogram.png A histogram of genotypic similarity from all windows, for each chromosome. Use these figures to determine your cutoff. The x axis is percentage similarity and the y axis is frequency. You should see a peak close to 100, which represents the areas of the chromosome that are the same, and a further peak representing the areas where there are differences. Set your cutoff between these two peaks. If your genotype data is accurate, this should be set to about 95%.
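The relationship between the histogram and the cutoff can be sketched as follows: window identities are binned, and each window is then called by comparing its identity with the cutoff. The function names, the 5% bin width and the choice of `>=` for the comparison are all assumptions made for illustration.

```python
from collections import Counter


def histogram_counts(identities, bin_width=5):
    """Count window identities per bin; keys are the lower edge of each bin."""
    counts = Counter()
    for value in identities:
        # Put exactly 100% into the top bin so bins cover 0 to 100.
        bin_start = min(int(value // bin_width) * bin_width, 100 - bin_width)
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


def call_windows(identities, cutoff=90):
    """Label each window 'same' if its identity reaches the cutoff."""
    return ["same" if value >= cutoff else "different" for value in identities]


identities = [100, 98, 97, 40, 35, 99]
bins = histogram_counts(identities)
calls = call_windows(identities, cutoff=90)
```

In this made-up data there is one cluster of windows near 100% and another around 35 to 40%, so any cutoff between the two clusters gives the same calls.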

Examples:

Figure 1

Figure 2

In these examples, those regions that have a different genotype between the two lines being compared are in yellow, whilst areas of the same genotype are in blue. If you compare a line from a population with its parent, the areas of difference/similarity will correspond to introgressions (as in these examples).