Santosh tested a few tokenisers: scispacy, NLTK's word and treebank tokenisers, a whitespace tokeniser, a punctuation tokeniser, and OpenNLP (https://solr.apache.org/guide/7_3/language-analysis.html). The scispacy and NLTK word/treebank tokenisers scored around 80% accuracy, while the whitespace- and punctuation-based tokenisers scored 50% and 0% respectively. This shows that bioscience-aware tokenisers are needed for our search.
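To make the comparison concrete, here is a minimal sketch of how each tokeniser's output can be scored against a gold tokenisation by exact match. The example sentence, its gold tokens, and the `en_core_sci_sm` scispacy model name are assumptions for illustration, not taken from the actual evaluation set.

```python
# Sketch: compare tokeniser outputs against a gold tokenisation.
# The sentence, gold tokens, and model name are illustrative assumptions.
import spacy
from nltk.tokenize import word_tokenize, TreebankWordTokenizer

sentence = "IL-2-mediated activation of p38/MAPK was observed (p<0.05)."
gold = ["IL-2-mediated", "activation", "of", "p38/MAPK", "was",
        "observed", "(", "p", "<", "0.05", ")", "."]

nlp = spacy.load("en_core_sci_sm")  # small scispacy biomedical model

candidates = {
    "scispacy": [t.text for t in nlp(sentence)],
    "nltk_word": word_tokenize(sentence),
    "nltk_treebank": TreebankWordTokenizer().tokenize(sentence),
    "whitespace": sentence.split(),
}

for name, tokens in candidates.items():
    # Sentence-level exact match: 1 if the tokenisation equals the gold tokens.
    print(f"{name:14s} match={tokens == gold}")
```

Accuracy over the evaluation set is then the fraction of sentences whose tokenisation exactly matches the gold tokens.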
However, the biomedical tokenisers are Python-based, which makes integrating them with Solr a bottleneck. This is why OpenNLP-based tokenisation is attractive: it supports direct integration with Solr. Currently, the stock English OpenNLP tokeniser gives 52% accuracy.
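For reference, the Solr documentation linked above wires OpenNLP into the analysis chain through a field type along these lines; the model file names below are placeholders for whichever binary models are loaded into the config set.

```xml
<!-- Sketch of a Solr field type using the OpenNLP tokenizer
     (model file names are placeholders; see the language-analysis docs above). -->
<fieldType name="text_opennlp" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               sentenceModel="en-sent.bin"
               tokenizerModel="en-token.bin"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```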
Santosh is training the OpenNLP tokeniser model on Xiao's 300 articles to improve its accuracy beyond 70%.
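OpenNLP's TokenizerTrainer expects one sentence per line, with `<SPLIT>` marking token boundaries that carry no whitespace in the raw text, so the gold tokenisations need a conversion step. The sketch below is a hypothetical helper illustrating that conversion (the sentence and tokens are made up, and this is not the actual training pipeline).

```python
# Sketch (hypothetical helper): convert a gold-tokenised sentence into
# OpenNLP TokenizerTrainer format, where boundaries without whitespace
# are marked with <SPLIT>.
def to_opennlp_format(sentence: str, tokens: list) -> str:
    out, pos = [], 0
    for i, tok in enumerate(tokens):
        start = sentence.index(tok, pos)  # locate the token in the raw text
        if i > 0:
            # whitespace in the raw text stays a space; adjacency becomes <SPLIT>
            out.append(" " if start > pos else "<SPLIT>")
        out.append(tok)
        pos = start + len(tok)
    return "".join(out)

line = to_opennlp_format(
    "IL-2-mediated activation of p38/MAPK was observed (p<0.05).",
    ["IL-2-mediated", "activation", "of", "p38/MAPK", "was",
     "observed", "(", "p", "<", "0.05", ")", "."],
)
print(line)
# IL-2-mediated activation of p38/MAPK was observed (<SPLIT>p<SPLIT><<SPLIT>0.05<SPLIT>)<SPLIT>.
```

A custom model can then be trained from such a file with OpenNLP's command-line tool, typically something like `opennlp TokenizerTrainer -model en-token-bio.bin -lang en -data train.txt -encoding UTF-8`, and the resulting .bin dropped into the Solr config set in place of the stock English model.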
Lynne created a dataset https://docs.google.com/spreadsheets/d/1HCE5phm0tcdfZCd44DbufXpkHMwYPHGZBjHkALfT0aE/edit with example sentences.