Deep learning is emerging as a powerful tool in multiple biomedical areas, especially in genetics and genomics. This project is an exercise in automating the annotation of new abstracts as relevant to deep learning in genetics using an existing curated dataset.
All abstract information was queried from the National Library of Medicine's PubMed citation database using the Entrez Direct (EDirect) utilities.
A sample output file is included as input/abstracts-8.26.21.pickle to ensure reproducibility, but notebooks/pubmed-curate.ipynb can easily be adapted to refresh the query results. Note, however, that a new query may return a different set of abstracts, which may not match the provided model predictions.
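As a minimal sketch of inspecting the sample file outside the notebooks, the snippet below loads it with the standard `pickle` module. The helper name `load_abstracts` is hypothetical, and the type of the stored object is not documented here, so the example only prints it. If the file holds a pandas object, pandas must be installed for unpickling to succeed.

```python
import pickle
from pathlib import Path

# Path taken from the README; adjust if the repository layout differs.
SAMPLE_PATH = Path("input/abstracts-8.26.21.pickle")


def load_abstracts(path=SAMPLE_PATH):
    """Load the pickled query results and return the stored object.

    Hypothetical helper, not part of the repository.
    Only unpickle files from a trusted source: pickle can run code on load.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    if SAMPLE_PATH.exists():
        abstracts = load_abstracts()
        # The object type is an unknown here, so inspect it before use.
        print(type(abstracts))
    else:
        print(f"{SAMPLE_PATH} not found; run this from the repository root.")
```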
The temporary files used in the Snorkel analysis are not included but can be regenerated by running notebooks/pubmed-features.ipynb. The classification of PMID 32080846 has been revised: it was initially misclassified as category 4 and is now correctly labeled as category 1. The notebooks still use the original input files, but a revision cell at the end of notebooks/pubmed-features.ipynb applies this correction.
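The relabeling described above could look roughly like the following. This is an illustrative sketch, not the notebook's actual revision cell: the function name `apply_revision`, the list-of-dicts layout, and the keys `"pmid"` and `"category"` are all assumptions made for the example.

```python
def apply_revision(records, pmid="32080846", old_category=4, new_category=1):
    """Return a copy of records with the named PMID relabeled.

    `records` is assumed to be a list of dicts with hypothetical keys
    "pmid" and "category"; the real data layout may differ.
    """
    revised = []
    for record in records:
        record = dict(record)  # copy so the input is left untouched
        if str(record["pmid"]) == pmid and record["category"] == old_category:
            record["category"] = new_category
        revised.append(record)
    return revised


example = [
    {"pmid": "32080846", "category": 4},
    {"pmid": "00000000", "category": 2},  # placeholder record, not real data
]
print(apply_revision(example))
```

Checking the old category before overwriting it keeps the correction idempotent, so running it on already revised data changes nothing.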
This code was run on Python 3.8.12 because of compatibility issues with the Snorkel package. All required packages are listed in requirements.txt and can be installed with the command pip install -r requirements.txt.
Sample training scripts for RandomForestClassifier and BertForDocumentClassification are included in the scripts/ folder. BERT training and prediction will realistically require a CUDA-enabled GPU system. notebooks/pubmed-snorkel.ipynb will not run properly unless the data regeneration steps above are followed.
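Before starting BERT training, it may help to confirm that a CUDA device is visible. The sketch below assumes the BERT scripts run on PyTorch, which this README does not state explicitly; `pick_device` is a hypothetical helper, while `torch.cuda.is_available()` is the standard PyTorch check.

```python
def pick_device():
    """Return "cuda" when PyTorch sees a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so fall back to CPU-only work.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
if device == "cpu":
    print("No CUDA GPU detected; BERT training and prediction will likely be very slow.")
print(f"Using device: {device}")
```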
This research was supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health. This work utilized the computational resources of the NIH HPC Biowulf cluster.
If you find this project useful, please consider citing our paper below. Thanks for reading!
Ledgister Hanchard, S. E.*, Dwyer, M. C.*, Liu, S.*, Hu, P., Tekendo-Ngongang, C., Waikel, R. L., Duong, D., & Solomon, B. D. (2022). Scoping review and classification of Deep Learning in Medical Genetics. Genetics in Medicine. https://doi.org/10.1016/j.gim.2022.04.025 *Co-first authors.