Code, Data, and Checkpoint For Unleashing the Power of Knowledge Extraction from Scientific Literature in Catalysis
- pytorch==1.8.1
- pytorch-lightning==1.5.10
- transformers==4.6.1
- tqdm
- numpy
- pandas
- seqeval
- stanza==1.2.0
- streamlit==1.1.0
- pyserini==0.13.0
Download the dataset from Zenodo and follow the dataset instructions to extract the data.
The `reproduce.ipynb` notebook provides instructions to reproduce our main experiment results using ALL data.

The `train.ipynb` notebook gives examples of how to train your own model on your own dataset.

The `prediction.ipynb` notebook gives examples of how to use our model to extract catalysis information from given text.
We are sorry that we cannot make our search engine and correlation analysis system public, since some articles in our collection are not open access and sharing them would violate Elsevier's TDM policy. However, we provide the key pieces of code from our system, which should help you build a similar system on your own data. If you want to try our system personally, please get in touch with us.
Here we show how we search for query-related articles in our highly relevant article collection.

To build the index, follow the "Guide to indexing and searching English documents" section of pyserini's guidelines.
- we split each article into paragraphs using a sliding-window algorithm: the 1st paragraph contains the 1st to 10th sentences, the 2nd paragraph contains the 11th to 20th sentences, and so on. Lessons learned from the TREC-COVID challenge and the TREC 2007 Genomics track show that treating the full text as a single document is not a good design. A window size of 10 has proven effective on MS MARCO Document Ranking.
- we use BM25 (k1=1.2, b=0.9) to search paragraphs. We tune the BM25 parameters by optimizing the following retrieval task: we use the title of each article as the query and search the whole collection of article abstracts; the higher the abstract from the same article as the title is ranked, the better the parameters. This is the same approach as in Content-Based Weak Supervision for Ad-Hoc Re-Ranking.
- The final ranking is based on the sum of the top 3 paragraph BM25 scores of each article. Note that taking the maximal score directly is more popular, as suggested by experiments on Robust04; we use the top 3 mainly for better visual display of articles.
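The sentence grouping from the first bullet can be sketched as below. This is a minimal illustration, not the repo's actual code: the helper name `split_into_paragraphs` and the plain-list input are our assumptions, and sentence segmentation (e.g. with stanza) is assumed to have been done beforehand.

```python
def split_into_paragraphs(sentences, window_size=10):
    """Group consecutive sentences into fixed-size pseudo-paragraphs.

    Sentences 1-10 form paragraph 0, sentences 11-20 form
    paragraph 1, and so on; the last paragraph may be shorter.
    """
    return [
        sentences[i:i + window_size]
        for i in range(0, len(sentences), window_size)
    ]
```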
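The title-to-abstract tuning task in the second bullet can be sketched as a small grid search. The function names, the grid values, and the use of mean reciprocal rank as the objective are illustrative assumptions; with pyserini, `search_fn` could be backed by a searcher configured via `set_bm25(k1, b)`.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the correct abstract for each title
    query, or None if that abstract was not retrieved at all."""
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)


def grid_search_bm25(search_fn, titles,
                     k1_grid=(0.9, 1.2, 1.5), b_grid=(0.4, 0.75, 0.9)):
    """search_fn(title, k1, b) -> 1-based rank of the matching abstract.

    Returns (best_mrr, best_k1, best_b) over the parameter grid.
    """
    best = None
    for k1 in k1_grid:
        for b in b_grid:
            mrr = mean_reciprocal_rank([search_fn(t, k1, b) for t in titles])
            if best is None or mrr > best[0]:
                best = (mrr, k1, b)
    return best
```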
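The top-3 aggregation in the last bullet can be sketched as follows; the function name and the `(article_id, score)` input shape are our assumptions.

```python
from collections import defaultdict


def rank_articles(paragraph_hits, top_n=3):
    """paragraph_hits: iterable of (article_id, bm25_score) pairs for
    retrieved paragraphs. Each article is scored by the sum of its
    top_n paragraph scores, then articles are sorted by that score."""
    by_article = defaultdict(list)
    for article_id, score in paragraph_hits:
        by_article[article_id].append(score)
    article_scores = {
        aid: sum(sorted(scores, reverse=True)[:top_n])
        for aid, scores in by_article.items()
    }
    return sorted(article_scores.items(), key=lambda kv: kv[1], reverse=True)
```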
Here we store the metadata of each paper in MongoDB and use Pymongo to access it. Feel free to change the `load_db` and `fetch_paper` functions to use other metadata storage.
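As a minimal illustration of swapping in other metadata storage, here is a hypothetical JSON-file-backed replacement for the two functions. The signatures are assumptions on our part; match whatever signatures the repo's actual `load_db` and `fetch_paper` use.

```python
import json


def load_db(path):
    """Hypothetical replacement for the MongoDB-backed loader: read
    paper metadata from a JSON file mapping DOI -> metadata dict."""
    with open(path) as f:
        return json.load(f)


def fetch_paper(db, doi):
    """Look up one paper's metadata by DOI; returns None if missing."""
    return db.get(doi)
```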
Here we show you how we analyze the correlation between entities
- you need to provide a pandas DataFrame containing five columns: `doi`, `sent_i`, `label`, `span`, and `norm_span`. It should cover all extracted chemical entities, their index (which paper, which sentence), their label (which type of entity), and their normalized form (mainly to aggregate alternative expressions).
- Currently, we search the normalized entity, which could cause the following problem:
searching `Ru on C` returns no result, since `Ru on C` is normalized into `Ru C`. So searching `Ru C` returns results while searching `Ru on C` does not. One alternative is to search the raw entity, but then the co-occurrence patterns would be greatly weakened by the various alternative expressions, so we decided to design the system in its current form. In the future, better normalization techniques may solve this issue.
Please create an issue or email zhangyue@udel.edu should you have any questions about the code and data.