Simplifying Science: Utilizing BERT and SciBERT for Scientific and Plain Language Text Classification
Independent final project for the UC Berkeley Natural Language Processing with Deep Learning graduate course.
Scientific jargon poses a significant barrier to the accessibility of scientific literature, yet from a researcher's perspective it can be difficult to identify. This study explores the efficacy of advanced natural language processing (NLP) models in distinguishing between scientific and plain language texts using the Plain Language Adaptation of Biomedical Abstracts (PLABA) dataset. Leveraging BERT (Bidirectional Encoder Representations from Transformers) and SciBERT, a BERT variant pre-trained on scientific corpora, I conducted a comparative analysis of their performance in classifying text as either scientific or plain language. My methodology involved preprocessing the texts, implementing a simple neural network as a baseline, and then employing both BERT and SciBERT models. The baseline model, built with Word2Vec embeddings and NLTK preprocessing, achieved only modest accuracy, as expected. BERT improved on this substantially, reaching a test accuracy of 97.01% with high F1 scores and recall, indicating strong contextual understanding. SciBERT slightly outperformed BERT, highlighting the advantage of domain-specific pre-training for NLP tasks. This research offers insights into optimizing NLP models for scientific text identification, which could inform plain language tools that aid scientific communication.
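
To make the classification setup concrete, the sketch below shows one way the BERT/SciBERT comparison could be wired up with the Hugging Face Transformers library. This is a minimal illustration, not the project's actual pipeline: the toy sentences, labels, model name, and hyperparameters are assumptions chosen for readability. Swapping `allenai/scibert_scivocab_uncased` for `bert-base-uncased` gives the BERT arm of the comparison.

```python
"""Minimal sketch of a scientific vs. plain-language classifier (assumed setup,
not the project's exact code)."""
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "allenai/scibert_scivocab_uncased"  # use "bert-base-uncased" for the BERT run

# Hypothetical toy data standing in for the preprocessed PLABA sentences.
texts = ["Myocardial infarction was observed in the treatment cohort.",
         "Some people in the study had heart attacks."]
labels = [1, 0]  # 1 = scientific, 0 = plain language

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Tokenize to fixed-length tensors and wrap in a DataLoader.
enc = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
loader = DataLoader(dataset, batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # a few epochs is typical for fine-tuning
    for input_ids, attention_mask, y in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()  # cross-entropy loss computed by the model head
        optimizer.step()

# Inference: label a new sentence as scientific (1) or plain language (0).
model.eval()
with torch.no_grad():
    test = tokenizer(["The patient exhibited tachycardia."], padding=True,
                     truncation=True, return_tensors="pt")
    pred = model(**test).logits.argmax(dim=-1).item()
print("scientific" if pred == 1 else "plain language")
```

The same loop serves both models because SciBERT shares BERT's architecture and differs only in its pre-training corpus and vocabulary, which is what makes the head-to-head comparison straightforward.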