Biterm most frequent topic filtering #7

Closed
BenoitFayolle opened this issue Oct 12, 2021 · 4 comments

Comments

@BenoitFayolle

biterms <- biterms[, topic_freq := .N, by = list(term1, term2)]
biterms <- biterms[, list(best_topic = topic[which.max(topic_freq)], cooc = .N), by = list(term1, term2)]

Correct me if I'm wrong, but these two lines don't actually pick the best (most frequent) topic.
topic_freq gives the number of occurrences of each biterm in the whole corpus, since topic is not included in the by argument of the first line.
Hence the second line takes the maximum of a variable that is constant within each group, so which.max always returns 1 and best_topic is simply the first topic listed for that biterm.
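
A minimal sketch of the problem on toy data. The table contents are made up for illustration, and adding topic to the by argument is only one possible correction, not necessarily the fix that ended up in the package:

```r
library(data.table)

# Hypothetical toy data: biterm (a, b) is assigned topic 1 once and topic 2 three times.
# The minority topic is listed first so the problem is visible.
biterms <- data.table(
  term1 = c("a", "a", "a", "a", "c"),
  term2 = c("b", "b", "b", "b", "d"),
  topic = c(1L, 2L, 2L, 2L, 1L)
)

# Logic as quoted above: topic_freq is the row count per (term1, term2),
# so it is constant within each group and which.max() always returns 1.
buggy <- copy(biterms)[, topic_freq := .N, by = list(term1, term2)]
buggy <- buggy[, list(best_topic = topic[which.max(topic_freq)], cooc = .N),
               by = list(term1, term2)]

# Possible correction (an assumption): count per (term1, term2, topic) instead,
# so topic_freq varies within a biterm and which.max() finds the dominant topic.
fixed <- copy(biterms)[, topic_freq := .N, by = list(term1, term2, topic)]
fixed <- fixed[, list(best_topic = topic[which.max(topic_freq)], cooc = .N),
               by = list(term1, term2)]

print(buggy)  # best_topic for (a, b) is 1, the first topic listed
print(fixed)  # best_topic for (a, b) is 2, the most frequent topic
```

In both versions cooc stays the total number of occurrences of the biterm (4 for a, b); only best_topic changes. Ties are still resolved in favour of whichever topic appears first.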

@jwijffels
Contributor

jwijffels commented Oct 12, 2021

I tried to speed up the logic, which I originally implemented as

# biterms <- biterms[, list(best_topic = utils::head(base::names(base::sort(base::table(topic), decreasing = TRUE)), 1),
# cooc = .N), by = list(term1, term2)]
but the speedup was not implemented correctly.
Later on, I make sure that only biterms whose terms are highly emitted by each topic are shown:
bi <- bi[bi$term1 %in% topictokens & bi$term2 %in% topictokens, ]
This was done to make the graph crisp.
So clearly a bug, but probably one that does not occur much unless you really have completely overlapping topics.

@BenoitFayolle
Author

BenoitFayolle commented Oct 12, 2021

I think you are responding to my other issue, but this one is different. I can send a reprex tomorrow.

@BenoitFayolle
Author

Never mind, I just saw your commit to fix this issue 👍

@jwijffels
Contributor

I pushed the package on CRAN just now.
