Biomedical Tokenisers -- Evaluation & Benchmarking Framework #2

tsantosh7 · 2022-11-13T14:34:05Z

Lynne created a dataset of example sentences: https://docs.google.com/spreadsheets/d/1HCE5phm0tcdfZCd44DbufXpkHMwYPHGZBjHkALfT0aE/edit

Santosh tested several tokenisers against it: scispacy, NLTK's word and treebank tokenisers, whitespace and punctuation tokenisers, and OpenNLP (https://solr.apache.org/guide/7_3/language-analysis.html). The scispacy and NLTK word and treebank tokenisers scored 80%, while the whitespace- and punctuation-based tokenisers scored 50% and 0% respectively. This shows that bioscience-aware tokenisers are needed for our search.
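The thread does not say how the percentages were computed. One plausible reading is sentence-level exact match: a sentence counts as correct only if the tokeniser reproduces the gold token list exactly. A minimal sketch under that assumption follows; the `GOLD` sentences and the function names are hypothetical and are not taken from Lynne's spreadsheet.

```python
from typing import Callable, Iterable, List, Tuple

Tokeniser = Callable[[str], List[str]]


def whitespace_tokenise(text: str) -> List[str]:
    """Baseline tokeniser: split on runs of whitespace only."""
    return text.split()


def sentence_accuracy(tokenise: Tokeniser,
                      gold: Iterable[Tuple[str, List[str]]]) -> float:
    """Fraction of sentences whose predicted tokens equal the gold tokens."""
    pairs = list(gold)
    if not pairs:
        raise ValueError("gold set is empty")
    correct = sum(1 for sentence, expected in pairs
                  if tokenise(sentence) == expected)
    return correct / len(pairs)


# Hypothetical gold examples (assumption: one sentence plus its expected tokens).
GOLD = [
    ("BRCA1 binds DNA", ["BRCA1", "binds", "DNA"]),
    ("IL-2 levels rose.", ["IL-2", "levels", "rose", "."]),
]

if __name__ == "__main__":
    print(sentence_accuracy(whitespace_tokenise, GOLD))
```

Any other tokeniser can be scored the same way by passing a function that maps a string to a list of token strings.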

However, the biomedical tokenisers are Python-based, which creates a bottleneck when integrating them with Solr. This is why OpenNLP-based tokenisation can be used instead: it supports direct integration with Solr. Currently, the English OpenNLP tokeniser gives 52% accuracy.

Santosh is training the OpenNLP tokeniser model on Xiao's 300 articles to improve its accuracy beyond 70%.
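For reference, OpenNLP's tokeniser trainer expects one sentence per line, with tokens separated either by whitespace or by a `<SPLIT>` tag where two tokens touch in the original text. The sketch below shows one way gold tokenisations could be converted to that format; how Xiao's articles are actually prepared is not described here, and the function name is hypothetical.

```python
from typing import List


def to_opennlp_line(sentence: str, tokens: List[str]) -> str:
    """Render one sentence in OpenNLP tokeniser training format.

    Tokens separated by whitespace in the source keep a single space;
    tokens that are adjacent in the source are joined with <SPLIT>.
    """
    parts: List[str] = []
    pos = 0
    for i, token in enumerate(tokens):
        start = sentence.find(token, pos)
        if start < 0:
            raise ValueError(f"token {token!r} not found after offset {pos}")
        gap = sentence[pos:start]
        if gap.strip():
            # Non-whitespace text between tokens means the gold tokens
            # do not cover the sentence.
            raise ValueError(f"untokenised text {gap!r} before {token!r}")
        if i > 0:
            parts.append(" " if gap else "<SPLIT>")
        parts.append(token)
        pos = start + len(token)
    return "".join(parts)


if __name__ == "__main__":
    print(to_opennlp_line("IL-2 levels rose.", ["IL-2", "levels", "rose", "."]))
```

Lines produced this way can be written to a training file and passed to the OpenNLP `TokenizerTrainer` tool.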

tsantosh7 · 2022-11-13T14:35:26Z

The OpenNLP model (bio-en-token_v03.bin) is now ready, with an accuracy of 65%.

Following the OpenNLP integration described at https://solr.apache.org/guide/7_4/language-analysis.html#opennlp-integration, change the field type definition from the first block below to the second:

<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               tokenizerModel="en-tokenizer.bin"/>
  </analyzer>
</fieldType>
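Once the field type is changed, Solr's field analysis handler (`/analysis/field`) can show what tokens the new analyzer produces for a sample sentence. A minimal sketch that only builds the request URL; the host, port and the core name `articles` are assumptions for illustration, not values from this project.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url: str, core: str,
                       field_type: str, text: str) -> str:
    """Build a URL for Solr's field analysis handler for one field type."""
    params = {
        "analysis.fieldtype": field_type,  # name of the <fieldType> to test
        "analysis.fieldvalue": text,       # text to run through the index analyzer
        "wt": "json",
    }
    return f"{base_url.rstrip('/')}/{core}/analysis/field?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr instance and core name.
    print(field_analysis_url("http://localhost:8983/solr", "articles",
                             "text", "IL-2 levels"))
```

Opening the resulting URL in a browser, or fetching it with any HTTP client, returns the token stream per analysis stage, which makes it easy to compare the standard and OpenNLP tokenisers on the same sentence.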

tsantosh7 moved this to Backlog in Search & Annotation Improvements Nov 13, 2022

tsantosh7 added this to Search & Annotation Improvements Nov 13, 2022
