GitHub - timrpeterson-lab/pubmedquery-tim: ranking genes and disease by how often they are mentioned in all 28M Pubmed publications

Obtaining and analyzing ~27M PubMed citations

How to import data to MySQL and analyze it.

Obtain the TSV files from Max at UCSC (script: max_download_papers.py). There are ~1200 files, each containing a collection of distinct PubMed citations. See papers.txt (or articles_used.txt) for the ~1200 file names used to make the MySQL data snapshot around 7/4/17.
Unzip the .articles files: gunzip *.articles.gz
Upload to MySQL with upload_mysql.php
Generate the gene_disease pivot table: disease_gene_rank_v2.php
Generate the gene_paper pivot table: disease_paper.php
Analyze the data with MySQL. The main issue with the data is that gene symbols that double as English words or common abbreviations are not distinguished well. Genes with official NCBI symbols such as "MICE", "SET", "MET", and "COPD" add noise to the returned results. Other genes, such as p53, have official symbols ("TP53") that are less commonly used than their aliases. The solution is manual curation. See this file for documentation on what each query does: morpheome-db-queries.sql
Perhaps the most useful query for MORPHEOME for top-cited gene ranking is described below. It returns a ranked list of all the genes mentioned in papers whose abstracts match a given search term, in this example "osteoporosis". It is slow (it can take 30 s), so it needs optimization if it will be used on a website. Perhaps we need an index on some of the JOINed tables?

SELECT *
FROM aliases
JOIN (
	SELECT gene_paper_copy.gene_id, COUNT(gene_paper_copy.gene_id) AS count
	FROM gene_paper_copy
	JOIN (
		SELECT * FROM publications
		WHERE MATCH(abstract) AGAINST("+osteoporosis" IN BOOLEAN MODE)
	) p ON gene_paper_copy.pmid = p.PMID
	GROUP BY gene_paper_copy.gene_id
) m ON aliases.gene_id = m.gene_id
WHERE type = "NCBI_official_symbol"
GROUP BY aliases.gene_id
ORDER BY m.count DESC;
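
If this query is ever called from application code, the search term should be passed as a parameter rather than pasted into the SQL. The following is a minimal Python sketch of that idea, plus a simple post-filter for the noisy symbols mentioned above. It is not part of this repository: the helper names (`boolean_term`, `build_rank_query`, `drop_noisy`) are hypothetical, the `(symbol, count)` row shape is an assumption about what the caller extracts from the result set, and the `%s` placeholder assumes a DB-API driver such as PyMySQL or mysql-connector-python.

```python
import re

# SQL taken from the README query, with the search term replaced by a
# driver placeholder (%s) so user input is never concatenated into the SQL.
RANK_QUERY = """
select * from aliases
join (select gene_paper_copy.gene_id, count(gene_paper_copy.gene_id) as count
      from gene_paper_copy
      join (select * from publications
            where match(abstract) against(%s in boolean mode)) p
        on gene_paper_copy.pmid = p.PMID
      group by gene_paper_copy.gene_id) m
  on aliases.gene_id = m.gene_id
where type = "NCBI_official_symbol"
group by aliases.gene_id
order by m.count desc
"""

# Symbols the README names as noisy. A real curated list would be longer,
# and some entries (e.g. MET) are genuine genes that may need flagging
# rather than removal.
NOISY_SYMBOLS = {"MICE", "SET", "MET", "COPD"}


def boolean_term(text):
    """Turn free text into a boolean-mode expression requiring every word."""
    # Keep only alphanumeric runs so operators like + - * " cannot leak in.
    words = re.findall(r"[A-Za-z0-9]+", text)
    return " ".join("+" + word for word in words)


def build_rank_query(text):
    """Return (sql, params) suitable for cursor.execute(sql, params)."""
    return RANK_QUERY, (boolean_term(text),)


def drop_noisy(rows, noisy=NOISY_SYMBOLS):
    """Remove (symbol, count) rows whose symbol is on the curated noise list."""
    return [(symbol, count) for symbol, count in rows if symbol.upper() not in noisy]
```

With a DB-API cursor, usage would look like `sql, params = build_rank_query("osteoporosis")` followed by `cursor.execute(sql, params)`. Prefixing the same statement with `EXPLAIN` in MySQL shows which joins use an index, which is one way to check the indexing question raised above.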

Download a list of common query terms from MeSH and convert it to CSV using mesh_xml2csv.php (for easy import to MySQL) so we can pre-fetch results of top-cited-orphan gene pairs. Otherwise, each user's query would have to run through all 28M papers and the rest of the pipeline, which would be far too slow.
Create the mesh_paper table using mesh_paper.php. Run the PHP scripts as daemons (i.e., in the background) with supervisord, as described at https://dor.ky/run-php-script-as-daemon-using-supervisord/
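
The pre-fetching idea can be sketched in a few lines of Python. This is an illustration only, not the logic of mesh_paper.php: it assumes the CSV has one MeSH term in the first column of each row, and `rank_genes` is a hypothetical stand-in for whatever function runs the slow ranking query for a single term.

```python
import csv


def load_terms(csv_file):
    """Read MeSH terms from the first column of a CSV file object."""
    # Assumption: one term per row, first column; blank rows are skipped.
    return [row[0].strip() for row in csv.reader(csv_file) if row and row[0].strip()]


def prefetch(terms, rank_genes, top_n=50):
    """Run the slow ranking once per term and keep the top results."""
    cache = {}
    for term in terms:
        # Keys are lower-cased so later lookups are case-insensitive.
        cache[term.lower()] = rank_genes(term)[:top_n]
    return cache


def lookup(cache, term):
    """Return pre-computed results for a term, or None if it was not pre-fetched."""
    return cache.get(term.lower())
```

In practice the cache would more likely be a MySQL table than an in-memory dictionary, so that results survive restarts. The dictionary only keeps the sketch self-contained.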

Repository files (14 commits):

.gitignore
README.md
articles_used.txt
disease_gene_rank.php
disease_gene_rank_v2.php
gene_paper.php
max_download_papers.py
mesh_paper.php
mesh_xml2csv.php
morpheome-db-queries.sql
mysql_backup.sh
papers.txt
upload_mysql.php