Read this first:

This repository was last updated for the March 2017 UniProtKB release. I have since moved on to using the BigBed files from UniProt instead of the Bed files used here. That more recent work is available in the repository uniprot_genomic.

UniProt in hg19 coordinates

UniProt provides human genome annotation data that map amino acid annotations directly to reference genome coordinates, but these data are available only in hg38 coordinates. See this publication for more info:

Nucleic Acids Res. 2017 Jan 4;45(D1):D158-D169. doi: 10.1093/nar/gkw1099. UniProt: the universal protein knowledgebase. The UniProt Consortium.

This repository converts and makes this data available in hg19 coordinates.

Files for download:

Besides the conversion to hg19 coordinates, a few changes are made here to suit our purpose, which is to identify whether query amino acids have any UniProt annotation. See the 'Processing pipeline' section for details.

Restructured, hg19-converted Bed files. This is probably what you are interested in.

Two merged files, each containing selected sequence annotations of interest, as listed below.

a. Merged file - Type 1 has the following annotation types merged into a single file.

  1	Active site
  2	Binding site for any chemical group
  3	Calcium binding region
  4	Cross-link between proteins
  5	Disulfide bond
  6	Glycosylation-PTM
  7	Interesting site
  8	Lipidation-PTM
  9	Metal binding site
  10	Motif
  11	Nucleotide binding region
  12	Other PTM
  13	Signal peptide
  14	Transit peptide
  15	Zinc finger region

b. Merged file - Type 0 has the following annotation types merged into a single file.

  1	Active peptide
  2	Chain
  3	Coiled coil
  4	DNA binding domain
  5	Domain
  6	Intramembrane
  7	Natural variant
  8	Region of interest
  9	Repeated motifs or domains
  10	Topological domain
  11	Transmembrane region

Processing pipeline:

Use the liftOver tool to convert hg38 coordinates to hg19. Note: If you are interested in executing the script, download the chain file and store it in the settings_files directory. It is not provided here due to license concerns.
Fix formatting issues in the resulting Bed files.
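
As a rough illustration of the liftOver step, the sketch below builds the command line for one Bed file. It is not the code in 1_liftover_hg38_to_hg19.py. The file paths are hypothetical, and the chain file name hg38ToHg19.over.chain.gz is an assumption based on the name UCSC uses for this conversion. The liftOver argument order (input, chain, output, unmapped) and the optional -bedPlus=N flag are those of the UCSC command-line tool. Whether -bedPlus is needed for these files is left open here.

```python
import subprocess


def build_liftover_command(bed_in, chain_file, bed_out, unmapped_out, bed_plus=None):
    """Return the argument list for a UCSC liftOver call.

    bed_plus: if given, adds -bedPlus=N so that only the first N columns
    are interpreted as standard Bed fields (assumption: may be useful for
    files with extra trailing columns).
    """
    cmd = ["liftOver"]
    if bed_plus is not None:
        cmd.append(f"-bedPlus={bed_plus}")
    cmd += [str(bed_in), str(chain_file), str(bed_out), str(unmapped_out)]
    return cmd


def run_liftover(bed_in, chain_file, bed_out, unmapped_out, bed_plus=None):
    """Run liftOver; requires the liftOver binary to be on PATH."""
    cmd = build_liftover_command(bed_in, chain_file, bed_out, unmapped_out, bed_plus)
    subprocess.run(cmd, check=True)


# Hypothetical paths, shown only to illustrate the call shape.
example = build_liftover_command(
    "hg38_uniprot_bedfiles/example.bed",
    "settings_files/hg38ToHg19.over.chain.gz",
    "example_hg19.bed",
    "example_unmapped.bed",
)
print(" ".join(example))
```

Rows that liftOver cannot map are written to the unmapped file, so it is worth checking that file after each run.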

Reformat Bed files as follows:

a. Replace the score column (5th column), which is zero by default in the UniProt-provided data, with the corresponding sequence annotation type, as shown below.

Original format by UniProt:
>chr1	7970956	7970959	Q99497	0	+	7970956	7970959	255,102,102	1	3	0	.	Nucleophile. Pubmed:20304780, Pubmed:25416785

Format we used here:
>chr1	8031016	8031019	Q99497	Active site	+	8031016	8031019	255,102,102	1	3	0	.	Nucleophile. Pubmed:20304780, Pubmed:25416785
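
The column replacement shown above can be sketched in a few lines. This is a minimal illustration that assumes tab-separated rows and that the annotation type is already known for each input file. The function name set_annotation_type is hypothetical and not taken from this repository.

```python
def set_annotation_type(bed_line, annotation_type):
    """Replace the 5th (score) column of a tab-separated Bed row."""
    fields = bed_line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("expected at least 5 tab-separated columns")
    fields[4] = annotation_type
    return "\t".join(fields)


# Row built from the hg19 example above, with the default score of 0.
row = "\t".join([
    "chr1", "8031016", "8031019", "Q99497", "0", "+",
    "8031016", "8031019", "255,102,102", "1", "3", "0", ".",
    "Nucleophile. Pubmed:20304780, Pubmed:25416785",
])
print(set_annotation_type(row, "Active site"))
```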

b. Restructure rows in the Bed files that describe non-contiguous amino acids, as in the example below.

Original format by UniProt (this line has coordinates for three, non-continuous amino acids):
>chr1	1633782	1633815	O75900	0	+	1633782	1633815	0,153,0	3	3,3,3	0,12,30	.	Zinc; catalytic.

Format we used here (one amino acid per row, if non-continuous):
>chr1	1569161	1569164	O75900	Metal binding site	+	1569161	1569164	0,153,0	1	3	0	.	Zinc; catalytic.
>chr1	1569173	1569176	O75900	Metal binding site	+	1569173	1569176	0,153,0	1	3	0	.	Zinc; catalytic.
>chr1	1569191	1569194	O75900	Metal binding site	+	1569191	1569194	0,153,0	1	3	0	.	Zinc; catalytic.
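
One way to perform this restructuring is to use the block columns of the Bed row (block count, block sizes, and block starts in columns 10 to 12) and emit one row per block. The sketch below assumes tab-separated Bed12-style rows and splits every block into its own row. It is a simplified illustration and may differ from the repository's script, for example in how adjacent blocks are handled. The function name split_blocks is hypothetical.

```python
def split_blocks(bed_line):
    """Split a multi-block Bed row into one single-block row per block."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom_start = int(fields[1])
    if int(fields[9]) <= 1:
        return ["\t".join(fields)]

    # Block sizes and starts are comma-separated; starts are offsets from chromStart.
    sizes = [int(s) for s in fields[10].split(",") if s]
    offsets = [int(s) for s in fields[11].split(",") if s]

    rows = []
    for size, offset in zip(sizes, offsets):
        start = chrom_start + offset
        end = start + size
        row = list(fields)
        row[1], row[2] = str(start), str(end)   # chromStart, chromEnd
        row[6], row[7] = str(start), str(end)   # thickStart, thickEnd
        row[9], row[10], row[11] = "1", str(size), "0"
        rows.append("\t".join(row))
    return rows


# Three-block row using the hg19 coordinates from the example above.
multi = "\t".join([
    "chr1", "1569161", "1569194", "O75900", "Metal binding site", "+",
    "1569161", "1569194", "0,153,0", "3", "3,3,3", "0,12,30", ".",
    "Zinc; catalytic.",
])
for r in split_blocks(multi):
    print(r)
```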

The resulting Bed files are probably what you need if you are looking for an hg19 replacement for the hg38 genome coordinates provided by UniProt.

Further Restructuring:

We further merge the sequence annotation types of interest to us into two Bed files.

For the annotation type 'natural variant', replace disease acronyms with their complete names.
Merge the Bed files of interest by sequence annotation type into two sets of merged files, as customized in the settings file with the values 0 and 1.
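
The merge step can be pictured as grouping rows by a 0/1 flag assigned to each annotation type. The sketch below is only an illustration: the real settings file format is not described here, so the dictionary type_flags and the function merge_by_flag are hypothetical stand-ins. The annotation type is read from the 5th column, as set in the reformatting step.

```python
def merge_by_flag(bed_lines, type_flags):
    """Group Bed rows into Type 0 and Type 1 lists by annotation type.

    type_flags maps an annotation type (5th column) to 0 or 1.
    Rows whose type is not listed are skipped.
    """
    merged = {0: [], 1: []}
    for line in bed_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        flag = type_flags.get(fields[4])
        if flag in merged:
            merged[flag].append(line)
    return merged


# Hypothetical settings covering a few of the annotation types listed earlier.
type_flags = {"Active site": 1, "Metal binding site": 1, "Domain": 0, "Chain": 0}

# Made-up rows for illustration only; these are not real UniProt records.
demo_rows = [
    "\t".join(["chr1", "100", "103", "P00001", "Active site", "+"]),
    "\t".join(["chr1", "200", "500", "P00001", "Domain", "+"]),
    "\t".join(["chr1", "600", "603", "P00001", "Helix", "+"]),
]
merged = merge_by_flag(demo_rows, type_flags)
print(len(merged[1]), len(merged[0]))
```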

Download the resulting merged bed files:

a. Merged Bed file - Type 1

b. Merged Bed file - Type 0
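
Since the stated purpose of these files is to check whether a query amino acid has any UniProt annotation, a lookup against a merged file might look like the sketch below. It assumes tab-separated rows, Bed's 0-based half-open intervals, and a 0-based query position. The function annotations_at is hypothetical and is not the logic of 3_intersect_uniprot.py. For large files, an interval tool such as bedtools would be a more practical choice than this linear scan.

```python
def annotations_at(bed_lines, chrom, position):
    """Return (accession, annotation type, description) for rows covering position.

    position is 0-based; Bed intervals are half-open: start <= position < end.
    """
    hits = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == chrom and int(fields[1]) <= position < int(fields[2]):
            hits.append((fields[3], fields[4], fields[-1]))
    return hits


# Row taken from the hg19 'Active site' example in the processing pipeline.
rows = ["\t".join([
    "chr1", "8031016", "8031019", "Q99497", "Active site", "+",
    "8031016", "8031019", "255,102,102", "1", "3", "0", ".",
    "Nucleophile. Pubmed:20304780, Pubmed:25416785",
])]
print(annotations_at(rows, "chr1", 8031017))
```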

Disclaimer

UniProt's license applies to the genome coordinates data available in this repository. Thanks to UniProt for permitting us to distribute this data in hg19 format. Data is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Code in this repository is distributed under the MIT license.

ManavalanG/UniProt-genome-annotations-hg19

Folders and files

Latest commit

History

Repository files navigation

Read this first:

UniProt in hg19 coordinates

Files for download:

Processing pipeline:

Further Restructuring:

Disclaimer

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages