natural.Tfidf.listTerms works incorrectly for custom-generated tokens (those passed as array to addDocument(...)) #634

Closed
senatet opened this issue Jan 4, 2022 · 2 comments

senatet commented Jan 4, 2022

natural.TfIdf.addDocument accepts either a string or an array of pre-tokenized terms. When a document is added as an array of tokens, listTerms still runs the tokenizer on each individual token when computing its tfidf score. A token that the tokenizer would split (such as 'google.com') then gets a tfidf score of 0, even though its tf and idf scores are > 0.

(natural version: ^5.1.11)
An example:

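> var natural = require('natural')
> var tfidf = new natural.TfIdf()
> tfidf.addDocument(['domain', 'google.com'])  // assumed: this call is missing from the original transcript
> tfidf.listTerms(0)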
[
  {
    term: 'domain',
    tf: 1,
    idf: 0.3068528194400547,
    tfidf: 0.3068528194400547
  },
  { term: 'google.com', tf: 1, idf: 0.3068528194400547, tfidf: 0 }
]

The second term should have a tfidf score of 0.306... (1 * 0.3068...), but it is 0.

The fix is simple: in listTerms(...), pass the term as an array to the tfidf call. Change
tfidf: _this.tfidf(term, d)
to
tfidf: _this.tfidf([term], d)
(line 174 here: https://github.com/NaturalNode/natural/blob/master/lib/natural/tfidf/tfidf.js ).
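The difference between the two calls can be illustrated with a small self-contained sketch. This is not natural's source code: the `tokenize`, `tf`, `idf` and `tfidf` helpers are simplified stand-ins, the tokenizer is assumed to split on non-word characters, and the single document `['domain', 'google.com']` is an assumption that happens to reproduce the idf value in the output above (1 + ln(1/2) ≈ 0.3069).

```javascript
// Simplified stand-in for the behaviour described in this issue.
// NOT natural's implementation; all helper names here are hypothetical.

// Assumption: the default tokenizer splits on non-word characters.
const tokenize = (text) => text.toLowerCase().split(/\W+/).filter(Boolean)

// Assumption: one document, added as a pre-tokenized array.
const documents = [{ domain: 1, 'google.com': 1 }]

// Raw count of a term in one document.
const tf = (term, doc) => doc[term] || 0

// idf = 1 + ln(N / (1 + number of documents containing the term)).
const idf = (term) => {
  const docsWithTerm = documents.filter((doc) => tf(term, doc) > 0).length
  return 1 + Math.log(documents.length / (1 + docsWithTerm))
}

// A string argument is tokenized first; an array is used as-is.
const tfidf = (terms, d) => {
  const list = Array.isArray(terms) ? terms : tokenize(terms)
  return list.reduce((sum, term) => sum + tf(term, documents[d]) * idf(term), 0)
}

// String path: 'google.com' becomes ['google', 'com'], neither is in the document.
const buggy = tfidf('google.com', 0)

// Array path: the term is looked up unchanged.
const fixed = tfidf(['google.com'], 0)

console.log({ buggy, fixed })
```

In this sketch the string call sums the scores of 'google' and 'com', both of which have tf 0 in the document, while the array call looks up 'google.com' directly and returns tf * idf.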

Thanks.

@DSchmidlin

I ran into the same problem. The workaround I used was to set a custom tokenizer that does this work.

tfidf.setTokenizer( { tokenize(x) { return [x] } });
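Why this workaround helps can be sketched with a toy model. `MiniTfIdf` below is a hypothetical stand-in for natural.TfIdf, reduced to the parts this issue touches; its default tokenizer and internal structure are assumptions, not the library's code. Only the pass-through tokenizer object is taken from the comment above.

```javascript
// Hypothetical stand-in for natural.TfIdf; NOT the library's source.
class MiniTfIdf {
  constructor () {
    this.documents = []
    // Assumed default: split on non-word characters.
    this.tokenizer = {
      tokenize: (text) => text.toLowerCase().split(/\W+/).filter(Boolean)
    }
  }

  setTokenizer (tokenizer) { this.tokenizer = tokenizer }

  // Accepts a string (tokenized) or an array of terms (stored as-is).
  addDocument (doc) {
    const terms = Array.isArray(doc) ? doc : this.tokenizer.tokenize(doc)
    const counts = {}
    for (const term of terms) counts[term] = (counts[term] || 0) + 1
    this.documents.push(counts)
  }

  tf (term, d) { return this.documents[d][term] || 0 }

  idf (term) {
    const docsWithTerm = this.documents.filter((doc) => doc[term] > 0).length
    return 1 + Math.log(this.documents.length / (1 + docsWithTerm))
  }

  // Mirrors the reported behaviour: a bare string term is re-tokenized.
  tfidf (terms, d) {
    const list = Array.isArray(terms) ? terms : this.tokenizer.tokenize(terms)
    return list.reduce((sum, term) => sum + this.tf(term, d) * this.idf(term), 0)
  }

  listTerms (d) {
    return Object.keys(this.documents[d]).map((term) => ({
      term,
      tf: this.tf(term, d),
      idf: this.idf(term),
      tfidf: this.tfidf(term, d) // string argument, as in the reported code
    }))
  }
}

const model = new MiniTfIdf()
model.addDocument(['domain', 'google.com'])

// With the splitting tokenizer, 'google.com' is broken up and scores 0.
const before = model.listTerms(0).find((t) => t.term === 'google.com').tfidf

// With the pass-through tokenizer, the term survives intact.
model.setTokenizer({ tokenize (x) { return [x] } })
const after = model.listTerms(0).find((t) => t.term === 'google.com').tfidf

console.log({ before, after })
```

One consequence of this design, at least in the sketch: once the pass-through tokenizer is set, a document added as a string is stored as a single term, so the workaround fits best when every document is added as an array.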

Hugo-ter-Doest added a commit that referenced this issue Jul 2, 2024
@Hugo-ter-Doest
Collaborator

Solved in #748
