PMID to PMC API from Medline cannot convert all provided PMID #37

titipata · 2016-12-29T16:27:36Z

The API here cannot convert every PMID it is given. I was trying to parse citations for a given set of PMIDs, but it only returns results for a subset of the PMIDs I provided. One possibility is to host the PMID/PMC pairs somewhere in the cloud and provide a similar API, or a source file that users can use to convert PMID to PMC.

titipata · 2017-01-04T17:35:09Z

I uploaded the PMID-PMC pairs (91 MB, not bad), which can be downloaded as follows:

wget https://s3-us-west-2.amazonaws.com/science-of-science-bucket/nih/pmid_pmc_pair.csv

With this file, you can convert PMID to PMC on your own. From here, we can modify the parse_citation_web function to take just a PMC, as below.

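import requests
from lxml import html

# extract_citations and extract_pmc are assumed to be defined elsewhere in the module
def parse_citation_web(pmc):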
    """
    Parse citations from given PMC 
    Parameters
    ----------
    pmc: str, PMC of the document e.g. 'PMC1217341'
    Returns
    -------
    dict_out: dict, contains following keys
        pmc: Pubmed Central ID
        n_citations: number of citations for given articles
        pmc_cited: list of PMCs that cite the given PMC
    """

    link = "http://www.ncbi.nlm.nih.gov/pmc/articles/%s/citedby/" % str(pmc)
    page = requests.get(link)
    tree = html.fromstring(page.content)
    n_citations = extract_citations(tree)
    n_pages = (n_citations + 29) // 30  # ceiling division, 30 results per page

    pmc_cited_all = list() # all PMC cited
    citations = tree.xpath('//div[@class="rprt"]/div[@class="title"]/a/@href')[1::]
    pmc_cited = list(map(extract_pmc, citations))
    pmc_cited_all.extend(pmc_cited)
    if n_pages >= 2:
        for i in range(2, n_pages+1):
            link = "http://www.ncbi.nlm.nih.gov/pmc/articles/%s/citedby/?page=%s" % (pmc, str(i))
            page = requests.get(link)
            tree = html.fromstring(page.content)
            citations = tree.xpath('//div[@class="rprt"]/div[@class="title"]/a/@href')[1::]
            pmc_cited = list(map(extract_pmc, citations))
            pmc_cited_all.extend(pmc_cited)
    pmc_cited_all = [p for p in pmc_cited_all if p != pmc]  # compare by value, not identity
    dict_out = {'n_citations': n_citations,
                'pmc': pmc,
                'pmc_cited': pmc_cited_all}
    return dict_out
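
The "convert PMID to PMC on your own" step could be sketched as below. This is only a minimal sketch: the column names `PMID` and `PMC` are assumptions about the header of pmid_pmc_pair.csv (adjust them to whatever the file actually uses), and `load_pmid_to_pmc` and `convert_pmids` are hypothetical helper names, not part of the library.

```python
import csv
import io


def load_pmid_to_pmc(csv_file, pmid_col="PMID", pmc_col="PMC"):
    """Build a PMID -> PMC dict from an open CSV file that has a header row.

    The column names are assumptions; pass the real ones if they differ.
    """
    mapping = {}
    for row in csv.DictReader(csv_file):
        pmid = (row.get(pmid_col) or "").strip()
        pmc = (row.get(pmc_col) or "").strip()
        if pmid and pmc:  # skip rows where either ID is missing
            mapping[pmid] = pmc
    return mapping


def convert_pmids(pmids, mapping):
    """Return (found, missing): PMIDs with a known PMC, and PMIDs without one."""
    keys = [str(p) for p in pmids]
    found = {k: mapping[k] for k in keys if k in mapping}
    missing = [k for k in keys if k not in mapping]
    return found, missing


# Tiny in-memory stand-in for the downloaded file
sample = io.StringIO("PMID,PMC\n123,PMC1\n456,\n")
mapping = load_pmid_to_pmc(sample)
found, missing = convert_pmids([123, 789], mapping)
```

With the real file, you would open it with `open('pmid_pmc_pair.csv', newline='')` instead of `io.StringIO`, then pass each value in `found` to `parse_citation_web`. Returning `missing` separately makes it easy to see which PMIDs have no PMC at all.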

titipata · 2017-01-14T03:16:57Z

We also want to add a copyright notice to the scraping function so that users don't scrape too much and get blocked: https://www.ncbi.nlm.nih.gov/pmc/about/copyright/#copy-PMC
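
Besides the notice, one way to keep users from scraping too fast is to pause between requests. The sketch below is an illustration only: `fetch_politely` is a hypothetical helper, and the 1-second default is an arbitrary assumption, not a limit stated by NCBI.

```python
import time


def fetch_politely(items, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(item) for each item, pausing `delay` seconds between calls.

    `sleep` is injectable so the pause can be replaced in tests.
    """
    results = {}
    for i, item in enumerate(items):
        if i > 0:  # no pause needed before the first request
            sleep(delay)
        results[item] = fetch(item)
    return results
```

For example, `fetch_politely(list_of_pmcs, parse_citation_web, delay=1.0)` would call the scraper once per PMC with a pause in between.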

nick-hahner · 2018-03-28T00:16:44Z

What about ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.csv ?
Not every PMCID has a corresponding PMID though according to this list.

titipata · 2018-03-28T00:39:25Z

@nick-hahner, nice! It contains ~1.8M rows of PMID/PMC pairs from the Open Access Subset. I'm still thinking about how to update the list regularly without burdening the repository. I mean, I could upload the PMC-PMID pairs from MEDLINE somewhere, as I mentioned. Do you have any preference or suggestions on how to make it available in the repository?

nick-hahner · 2018-03-28T01:40:14Z

Actually this file is probably better with 4,892,265 rows:
ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz

First rsync or wget -c -N ... the file to some directory like ~/.pp_data
Then you can use an sqlite3 db

# Create an indexed sqlite db 
import pandas as pd
from sqlalchemy import create_engine, text
engine = create_engine('sqlite:///pmid_to_pmcid.db')  # but some better location
df = pd.read_csv('~/.pp_data/PMC-ids.csv.gz', dtype=str)
df[['PMCID', 'PMID']].to_sql('pmc_pmid', engine, index=False, if_exists='replace')
# run the DDL on a connection (works on SQLAlchemy 1.4 and 2.x)
with engine.begin() as conn:
    conn.execute(text('create index pmc_idx on pmc_pmid(PMCID)'))
    conn.execute(text('create index pmid_idx on pmc_pmid(PMID)'))

# then later you can fetch like so:
from sqlalchemy import create_engine, text as sqa_text
def get_pmcid_from_pmid(pmid):
    engine = create_engine('sqlite:///pmid_to_pmcid.db')
    with engine.connect() as conn:
        # bind the :pmid placeholder; IDs are stored as text
        ret = conn.execute(sqa_text('select pmcid from pmc_pmid where pmid = :pmid;'), {'pmid': str(pmid)}).fetchone()
    return ret[0] if ret else None
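
If pulling in pandas and SQLAlchemy is too heavy, the same idea can be sketched with only the standard library. This is a sketch under assumptions: it relies on the `PMCID` and `PMID` column names used above, `build_db` is a hypothetical helper name, and empty PMIDs are stored as NULL since not every PMCID has a PMID.

```python
import csv
import gzip
import sqlite3


def build_db(csv_gz_path, db_path):
    """Load PMCID/PMID pairs from a gzipped CSV into an indexed sqlite table."""
    conn = sqlite3.connect(db_path)
    conn.execute("DROP TABLE IF EXISTS pmc_pmid")
    conn.execute("CREATE TABLE pmc_pmid (PMCID TEXT, PMID TEXT)")
    with gzip.open(csv_gz_path, "rt", newline="") as f:
        # Store an empty PMID as NULL rather than an empty string
        rows = ((r["PMCID"], r["PMID"] or None) for r in csv.DictReader(f))
        conn.executemany("INSERT INTO pmc_pmid VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX pmc_idx ON pmc_pmid(PMCID)")
    conn.execute("CREATE INDEX pmid_idx ON pmc_pmid(PMID)")
    conn.commit()
    return conn


def get_pmcid_from_pmid(conn, pmid):
    """Return the PMCID for a PMID, or None if there is no match."""
    row = conn.execute(
        "SELECT PMCID FROM pmc_pmid WHERE PMID = ?", (str(pmid),)
    ).fetchone()
    return row[0] if row else None
```

Passing the connection in, instead of reopening the database on every call, avoids repeated setup cost when converting many PMIDs in a loop.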

How's that sound?

chengkun-wu · 2018-03-28T02:44:51Z

@nick-hahner Yes! I used the ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz file for my local conversion.

titipata added the enhancement label Dec 29, 2016

titipata changed the title ~~PMID to PMC API from Medline cannot convert all PMID~~ PMID to PMC API from Medline cannot convert all provided PMID Dec 29, 2016
