moulya-sudhir/scientific-entity-recognition

Scientific Entity Recognition

The code for HW4 can be found in the ANLP-hw4_2.ipynb file.

The rest of this repo is legacy code for HW2.

Link to the ACL Anthology website: https://aclanthology.org/

File Structure:

  • code/webscrape.py : Contains a function to scrape and download PDFs from the ACL Anthology website, given a conference name, a year, and the number of PDFs to download

    • Example Usage:
      • from webscrape import scrape_pdfs
      • scrape_pdfs('acl','2022', 5)
  • code/runner.py : Contains functions to train the model and evaluate it on a test set

    • Example Usage:
      • python code/runner.py --train --model_name 'roberta-large' --epochs 5 --batch_size 4 --lr 2e-5 --output_dir 'models/roberta-large' --train_data 'data/train'
      • python code/runner.py --test --model_name 'models/roberta-large' --batch_size 4 --output_dir 'models/roberta-large' --test_data 'data/test'
  • code/prediction.py : Contains code to generate predictions for the Kaggle test sets

    • Example Usage:
      • python code/prediction.py --model_name 'models/roberta-large' --test_csv 'data/test.csv' --output_csv 'data/outputs.csv'
  • data folder : Contains the scientific paper data for all 3 conferences

    • Each conference folder has a pdfs folder, a tokens folder, and an annotations folder.
  • test_webscrape.ipynb : Examples of how to use the scrape_pdfs function from code/webscrape.py
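
The flags in the code/runner.py usage lines above can be summarized as an argument parser. The following is only a minimal sketch of what such a command-line interface might look like: the flag names are taken from the example commands, but the types, defaults, the mutually exclusive --train/--test group, and the build_parser helper are assumptions and may not match the actual runner.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the runner.py examples."""
    parser = argparse.ArgumentParser(
        description="Sketch of a train/test command-line interface"
    )
    # Assumption: exactly one of --train or --test is given per run.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--train", action="store_true", help="train the model")
    mode.add_argument("--test", action="store_true", help="evaluate on a test set")

    # A model identifier (e.g. 'roberta-large') or a path to a saved model.
    parser.add_argument("--model_name", required=True)
    # Assumption: the defaults below simply reuse the values from the examples.
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--lr", type=float, default=2e-5)
    parser.add_argument("--output_dir", required=True)
    parser.add_argument("--train_data", help="folder with training data")
    parser.add_argument("--test_data", help="folder with test data")
    return parser


# Parse the training command from the usage example (quotes removed).
args = build_parser().parse_args(
    "--train --model_name roberta-large --epochs 5 --batch_size 4 "
    "--lr 2e-5 --output_dir models/roberta-large --train_data data/train".split()
)
print(args.model_name, args.epochs, args.lr)
```

Under these assumptions, passing both --train and --test would be rejected by the parser, which is one way to keep the two modes from being mixed in a single run.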
