Simplifying Science: Utilizing BERT and SciBERT for Scientific and Plain Language Text Classification
Independent final project for the UC Berkeley Natural Language Processing with Deep Learning graduate course.
Scientific jargon poses a significant barrier to the accessibility of scientific literature, yet from a researcher's perspective it can be difficult to identify. This study explores the efficacy of advanced natural language processing (NLP) models in distinguishing between scientific and plain language texts using the Plain Language Adaptation of Biomedical Abstracts (PLABA) dataset. Leveraging BERT (Bidirectional Encoder Representations from Transformers) and SciBERT, a BERT variant pre-trained on scientific corpora, I conducted a comparative analysis of their performance in classifying text as either scientific or plain language. My methodology involved preprocessing the texts, implementing a simple neural network as a baseline, and then employing both BERT and SciBERT models. The baseline model, built with Word2Vec embeddings and NLTK preprocessing, achieved only modest accuracy, as expected. BERT improved on this substantially, reaching a test accuracy of 97.01% with high F1 scores and recall, indicating strong contextual understanding. SciBERT slightly outperformed BERT, highlighting the advantage of domain-specific pre-training for NLP tasks. This research offers insights into optimizing NLP models for scientific text identification, which could inform plain language tools that aid scientific communication.
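
To make the classification setup concrete, the sketch below shows one way the BERT/SciBERT comparison could be wired up with the Hugging Face Transformers library. This is a minimal illustration, not the project's actual pipeline: the toy sentences, labels, model name, and hyperparameters are assumptions chosen for readability. Swapping `allenai/scibert_scivocab_uncased` for `bert-base-uncased` gives the BERT arm of the comparison.

```python
"""Minimal sketch of a scientific vs. plain-language classifier (assumed setup,
not the project's exact code)."""
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "allenai/scibert_scivocab_uncased"  # use "bert-base-uncased" for the BERT run

# Hypothetical toy data standing in for the preprocessed PLABA sentences.
texts = ["Myocardial infarction was observed in the treatment cohort.",
         "Some people in the study had heart attacks."]
labels = [1, 0]  # 1 = scientific, 0 = plain language

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Tokenize to fixed-length tensors and wrap in a DataLoader.
enc = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
loader = DataLoader(dataset, batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # a few epochs is typical for fine-tuning
    for input_ids, attention_mask, y in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()  # cross-entropy loss computed by the model head
        optimizer.step()

# Inference: label a new sentence as scientific (1) or plain language (0).
model.eval()
with torch.no_grad():
    test = tokenizer(["The patient exhibited tachycardia."], padding=True,
                     truncation=True, return_tensors="pt")
    pred = model(**test).logits.argmax(dim=-1).item()
print("scientific" if pred == 1 else "plain language")
```

The same loop serves both models because SciBERT shares BERT's architecture and differs only in its pre-training corpus and vocabulary, which is what makes the head-to-head comparison straightforward.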