Assignment for Joyita Di Sayandip Bhattacharyya #3rd Year, B. Tech, CSE 2021-2025 Batch, B. P. Poddar Institute Of Management and Technology
My Approach for this task:
1. Data Reading and Preprocessing:
- I began by reading the citation network dataset (
cit-HepTh.txt
) and the abstracts dataset (cit-HepTh-abstracts.tar.gz
). - The citation network dataset provides information about which papers cite other papers.
- In the abstracts dataset, I found abstracts of papers organized by year.
2. Loading Pretrained Sci-BERT Model:
- I loaded the pretrained Sci-BERT model (
allenai/scibert_scivocab_uncased
) using the Sentence Transformers library. - This model, trained specifically for scientific text, allows me to generate embeddings for sentences or paragraphs.
3. Iterating Over Seed Papers:
- I selected a few seed papers (e.g.,
'9201001', '9203201', '119203001'
) for which I wanted to calculate similarity scores. - For each seed paper:
- If the paper exists in the citation network dataset, I proceeded to find its references and their abstracts.
4. Embedding Abstracts and Calculating Similarity:
- For each reference paper cited by the seed paper:
- I embedded the abstracts of both the seed paper and the reference paper using the Sci-BERT model.
- Then, I calculated the cosine similarity between the embeddings of the seed paper's abstract and the reference paper's abstract.
- Cosine similarity serves as a similarity metric here, indicating how similar the abstracts are.
- Higher similarity scores suggest greater thematic similarity between papers.
5. Output:
- I printed the reference paper IDs along with their similarity scores for each seed paper.
- This information helps in understanding the thematic similarity between papers based on their abstract content.