Automatic annotation of new abstracts as relevant to deep learning in genetics using large language models.

simonliu99/classify-medical-genetics-abstracts

Classification of Medical Genetics Abstracts

Summary

Deep learning is emerging as a powerful tool in multiple biomedical areas, especially in genetics and genomics. This project is an exercise in automating the annotation of new abstracts as relevant to deep learning in genetics using an existing curated dataset.

Abstract Queries

All abstract information was queried from the National Library of Medicine's PubMed citation database using the Entrez Direct (EDirect) utilities.
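As a rough illustration of an EDirect query, the sketch below builds an esearch/efetch pipeline from Python. The query string and both helper names are hypothetical and are not taken from this repository; the actual queries live in notebooks/pubmed-curate.ipynb. Running the pipeline assumes EDirect is installed and on the PATH.

```python
import shlex
import subprocess


def build_edirect_pipeline(query: str) -> str:
    """Return a shell pipeline that searches PubMed and fetches the hits as XML."""
    # esearch finds matching records; efetch downloads them in the requested format.
    return f"esearch -db pubmed -query {shlex.quote(query)} | efetch -format xml"


def run_pipeline(pipeline: str) -> str:
    """Run the pipeline and return its stdout (needs EDirect and network access)."""
    result = subprocess.run(
        pipeline, shell=True, check=True, capture_output=True, text=True
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical query string, for illustration only.
    print(build_edirect_pipeline("deep learning AND genetics"))
```

Building the command separately from running it makes it possible to inspect the exact query before contacting PubMed.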

A sample output file is included as input/abstracts-8.26.21.pickle to ensure reproducibility. To refresh the query results, adapt notebooks/pubmed-curate.ipynb. Note that a new query may return a different set of abstracts, which may not line up with the provided model predictions.
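A minimal sketch for inspecting the sample file is shown below. The helper name load_abstracts is hypothetical, and the sketch makes no assumption about what kind of object the pickle holds; if it stores a pandas object, pandas must be installed for unpickling to succeed. As with any pickle, only load files from a source you trust.

```python
import pickle
from pathlib import Path


def load_abstracts(path="input/abstracts-8.26.21.pickle"):
    """Load the pickled query results; the type of the returned object is not assumed."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    abstracts = load_abstracts()
    # Print the type first, since the structure of the file is not documented here.
    print(type(abstracts))
```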

Input File Regeneration

The temporary files used in the Snorkel analysis are not included but can be regenerated by running notebooks/pubmed-features.ipynb. The classification of PMID 32080846 was revised: it was initially misclassified as category 4 and is now labeled category 1. The notebooks still use the original input files, but a revision cell at the end of notebooks/pubmed-features.ipynb applies this correction.
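The kind of correction described here can be sketched as a guarded relabel. This is not the notebook's actual revision cell: the dictionary layout, the function name apply_revision, and the second PMID are placeholders for illustration. Only the change for PMID 32080846 (category 4 to category 1) comes from this README.

```python
def apply_revision(labels, pmid, expected_old, new):
    """Return a copy of labels with one PMID relabeled, checking the old value first."""
    if labels.get(pmid) != expected_old:
        raise ValueError(
            f"PMID {pmid} is not labeled {expected_old}; refusing to overwrite"
        )
    revised = dict(labels)  # copy so the original labels stay untouched
    revised[pmid] = new
    return revised


# Toy labels: 11111111 is a placeholder PMID, not a real record from the dataset.
labels = {32080846: 4, 11111111: 2}
revised = apply_revision(labels, pmid=32080846, expected_old=4, new=1)
```

Checking the expected old value first means the correction fails loudly if it is applied to data that has already been revised.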

Code

This code was run on Python 3.8.12 due to compatibility issues with the Snorkel package. All required packages are listed in requirements.txt and can be installed with the command pip install -r requirements.txt.
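If you want a quick check before installing, the sketch below compares the running interpreter with the 3.8 series reported above. The constant and function names are hypothetical, and the check only flags a mismatch; it does not establish that other Python versions fail.

```python
import sys

# Major and minor version of the interpreter this README reports (3.8.12).
TESTED_VERSION = (3, 8)


def matches_tested_version(version_info=sys.version_info):
    """Return True when the major.minor version equals the reported one."""
    return tuple(version_info[:2]) == TESTED_VERSION


if __name__ == "__main__":
    if not matches_tested_version():
        print("Warning: this project was run on Python 3.8.12; "
              "other versions may have issues with Snorkel.")
```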

Sample training scripts for the RandomForestClassifier and BertForDocumentClassification are included in the scripts/ folder. In practice, BERT training and prediction require a CUDA-enabled GPU. The notebook notebooks/pubmed-snorkel.ipynb will not run properly unless the data regeneration steps above have been followed.

Acknowledgement

This research was supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health. This work utilized the computational resources of the NIH HPC Biowulf cluster.

If you found this project to be useful, consider citing our paper below. Thanks for reading!

Citation

Ledgister Hanchard, S. E.*, Dwyer, M. C.*, Liu, S.*, Hu, P., Tekendo-Ngongang, C., Waikel, R. L., Duong, D., & Solomon, B. D. (2022). Scoping review and classification of Deep Learning in Medical Genetics. Genetics in Medicine. https://doi.org/10.1016/j.gim.2022.04.025 *Co-first authors.
