Biomedical Tokenisers -- Evaluation & Benchmarking Framework #2

Open
tsantosh7 opened this issue Nov 13, 2022 · 1 comment

Lynne created a dataset of example sentences: https://docs.google.com/spreadsheets/d/1HCE5phm0tcdfZCd44DbufXpkHMwYPHGZBjHkALfT0aE/edit

Santosh tested several tokenisers: scispacy, NLTK's word and treebank tokenisers, whitespace- and punctuation-based tokenisers, and OpenNLP (https://solr.apache.org/guide/7_3/language-analysis.html). The scispacy and NLTK word and treebank tokenisers scored 80%, while the whitespace- and punctuation-based tokenisers scored 50% and 0%, respectively. This shows that bioscience-aware tokenisers are needed for our search.
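The issue does not say how the scores were computed. As a rough sketch, assuming a score is the fraction of sentences whose token sequence matches the gold tokenisation exactly, the comparison could look like the following. The two baseline tokenisers, the `accuracy` helper and the `GOLD` sentences are all hypothetical stand-ins, not the contents of Lynne's spreadsheet.

```python
import re
from typing import Callable, List, Sequence, Tuple

Tokeniser = Callable[[str], List[str]]
GoldExample = Tuple[str, List[str]]


def whitespace_tokenise(text: str) -> List[str]:
    """Baseline: split on runs of whitespace only."""
    return text.split()


def punctuation_tokenise(text: str) -> List[str]:
    """Baseline: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def accuracy(tokenise: Tokeniser, gold: Sequence[GoldExample]) -> float:
    """Fraction of sentences tokenised exactly as in the gold standard."""
    if not gold:
        raise ValueError("gold standard is empty")
    hits = sum(1 for sentence, expected in gold if tokenise(sentence) == expected)
    return hits / len(gold)


# Hypothetical gold examples, invented for illustration only.
GOLD: List[GoldExample] = [
    ("IL-2 binds CD25 .", ["IL-2", "binds", "CD25", "."]),
    ("BRCA1 is mutated.", ["BRCA1", "is", "mutated", "."]),
]

if __name__ == "__main__":
    for name, fn in [("whitespace", whitespace_tokenise),
                     ("punctuation", punctuation_tokenise)]:
        print(f"{name}: {accuracy(fn, GOLD):.0%}")
```

Any other tokeniser (scispacy, NLTK, OpenNLP output read from a file) can be scored the same way by wrapping it in a function that takes a string and returns a list of tokens.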

However, the biomedical tokenisers are Python-based, which makes integrating them with Solr a bottleneck. This is why OpenNLP-based tokenisation is a good option: it supports direct integration with Solr. Currently, the English OpenNLP tokeniser gives 52% accuracy.

Santosh is training the OpenNLP tokeniser model on Xiao's 300 articles to improve its accuracy beyond 70%.
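OpenNLP's tokeniser trainer expects one sentence per line, with a `<SPLIT>` tag marking token boundaries that are not already separated by whitespace. The format of Xiao's articles is not described here, so the sketch below assumes each sentence is available together with its gold tokens; `to_opennlp_line` is a hypothetical helper, not part of OpenNLP.

```python
from typing import List


def to_opennlp_line(sentence: str, tokens: List[str]) -> str:
    """Render one sentence in OpenNLP tokeniser training format.

    Tokens separated by whitespace in the sentence are joined with a space;
    tokens that touch each other are joined with <SPLIT>.
    """
    parts: List[str] = []
    pos = 0  # index in `sentence` just past the previous token
    for i, tok in enumerate(tokens):
        start = sentence.find(tok, pos)
        if start < 0:
            raise ValueError(f"token {tok!r} not found after position {pos}")
        gap = sentence[pos:start]
        if gap.strip():
            # Non-whitespace text between tokens means the gold tokens are incomplete.
            raise ValueError(f"untokenised text {gap!r} before {tok!r}")
        if i > 0:
            parts.append(" " if gap else "<SPLIT>")
        parts.append(tok)
        pos = start + len(tok)
    if sentence[pos:].strip():
        raise ValueError(f"untokenised text {sentence[pos:]!r} at end of sentence")
    return "".join(parts)


if __name__ == "__main__":
    print(to_opennlp_line("BRCA1 is mutated.", ["BRCA1", "is", "mutated", "."]))
```

Writing one such line per sentence to a UTF-8 text file gives a file in the shape that OpenNLP's `TokenizerTrainer` tool reads.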


tsantosh7 commented Nov 13, 2022

The OpenNLP model is now ready, with an accuracy of 65%: bio-en-token_v03.bin

Following the OpenNLP integration described at https://solr.apache.org/guide/7_4/language-analysis.html#opennlp-integration, change the field type definition as follows:

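<!-- Before: the existing field type, using the standard tokenizer -->
<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
  </analyzer>
</fieldType>

<!-- After: the same field type, using the OpenNLP tokenizer.
     The Solr guide lists sentenceModel as a required attribute; en-sent.bin is
     the file name from the guide's example, not a file named in this issue.
     Point tokenizerModel at bio-en-token_v03.bin to use the retrained model. -->
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               sentenceModel="en-sent.bin"
               tokenizerModel="en-tokenizer.bin"/>
  </analyzer>
</fieldType>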
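Once the schema change is loaded, one way to check what the new analyzer produces is Solr's field analysis handler (`/analysis/field`). The sketch below only builds and fetches the request. The host, port and core name `biomed` are assumptions, and the response is returned unparsed because its exact layout is best inspected by hand.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_analysis_url(base_url: str, core: str, field_type: str, text: str) -> str:
    """Build a field-analysis request URL for the given field type and text."""
    params = urlencode({
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": text,
        "wt": "json",
    })
    return f"{base_url.rstrip('/')}/solr/{core}/analysis/field?{params}"


def fetch_analysis(url: str) -> dict:
    """Fetch the analysis response as a dict (requires a running Solr instance)."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # "biomed" is a hypothetical core name; "text" is the field type defined above.
    url = build_analysis_url("http://localhost:8983", "biomed", "text", "IL-2 binds CD25.")
    print(url)
    # Uncomment when Solr is running:
    # print(json.dumps(fetch_analysis(url), indent=2))
```

Running the same sentences from the evaluation dataset through this endpoint gives the token output needed to score the Solr-side tokeniser against the Python ones.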
