option to include 0s when converting textstat_simil to data.frame #18

Open
dhalpern opened this issue Jul 15, 2019 · 3 comments


@dhalpern

Requested feature

Currently, when a textstat_simil matrix is converted to a data.frame, any pairs with a similarity of 0 are dropped. Those 0s might be substantively important, though, so it would be nice to have an option that includes them.

This is the current behavior:

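library(dplyr)     # tibble() and %>%
library(tidytext)  # cast_dfm()
library(quanteda)  # textstat_simil(); lives in quanteda.textstats as of quanteda v3

# Five two-word documents in tidy form (one row per document-word pair)
dat <- tibble(doc = rep(1:5, each = 2),
              word = c("a", "b",
                       "a", "c",
                       "a", "c",
                       "b", "e",
                       "b", "f"),
              count = rep(1, 10))

# Cast to a dfm, then compute document-by-document cosine similarity
tstat_mat <- dat %>%
    cast_dfm(doc, word, count) %>%
    textstat_simil(method = "cosine", margin = "documents")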

tstat_mat
textstat_simil object; method = "cosine"
    1   2   3   4   5
1 1.0 0.5 0.5 0.5 0.5
2 0.5 1.0 1.0   0   0
3 0.5 1.0 1.0   0   0
4 0.5   0   0 1.0 0.5
5 0.5   0   0 0.5 1.0

tstat_mat %>% as.data.frame()
  document1 document2 cosine
1         1         2    0.5
2         1         3    0.5
3         2         3    1.0
4         1         4    0.5
5         1         5    0.5
6         4         5    0.5

It would be great to have an option for the data.frame to include all pairs, with 0s where needed.
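In the meantime, one possible workaround is to go through a dense matrix and list every unique pair by hand. The sketch below is only an illustration: `simil_to_pairs()` is a hypothetical helper, not part of quanteda, and the matrix is typed in from the output above so that the example runs in base R. It assumes the similarity object can be coerced with `as.matrix()` and that the result has row and column names.

```r
# Hypothetical helper (not part of quanteda): list every unique pair from a
# dense, symmetric similarity matrix, keeping pairs whose similarity is 0.
# Assumes m has row and column names.
simil_to_pairs <- function(m, value_name = "cosine") {
  stopifnot(is.matrix(m), nrow(m) == ncol(m))
  idx <- which(upper.tri(m), arr.ind = TRUE)  # (row, col) positions with row < col
  out <- data.frame(
    document1 = rownames(m)[idx[, "row"]],
    document2 = colnames(m)[idx[, "col"]],
    stringsAsFactors = FALSE
  )
  out[[value_name]] <- m[idx]                 # look up each (row, col) cell
  out
}

# The matrix printed above, entered by hand so the sketch runs without quanteda.
# With quanteda loaded, as.matrix(tstat_mat) is assumed to give the same thing.
docs <- as.character(1:5)
m <- matrix(c(1.0, 0.5, 0.5, 0.5, 0.5,
              0.5, 1.0, 1.0, 0.0, 0.0,
              0.5, 1.0, 1.0, 0.0, 0.0,
              0.5, 0.0, 0.0, 1.0, 0.5,
              0.5, 0.0, 0.0, 0.5, 1.0),
            nrow = 5, byrow = TRUE, dimnames = list(docs, docs))

pairs_all <- simil_to_pairs(m)
```

Here `pairs_all` has one row for each of the 10 document pairs, including the four pairs (2-4, 3-4, 2-5, 3-5) whose cosine similarity is 0.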

Use case

Similarities of 0 might be substantively interesting, for example to identify pairs of documents that share no features.

Additional context

@kbenoit
Contributor

kbenoit commented Jul 25, 2019

@koheiw we could probably add this as an option - to keep 0s - to proxy2triplet(). Then add an include_zeros = FALSE argument to as.data.frame.textstat_proxy().

@koheiw
Collaborator

koheiw commented Jul 30, 2019

I think when min_simil is not used, as.data.frame.textstat_proxy() should return all the values.

@kbenoit
Contributor

kbenoit commented Jul 31, 2019

Makes sense to me, and this treats the . as missing rather than zero when min_simil is used.
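To make the proposed semantics concrete, here is a rough base-R sketch. It is not quanteda's implementation: `pairs_from_simil()` and the `simil` column name are hypothetical, and the threshold is applied to a plain dense matrix rather than inside textstat_simil(). Without `min_simil`, every pair comes back, zeros included; with `min_simil`, pairs below the threshold are treated as missing and left out.

```r
# Hypothetical illustration of the behaviour discussed in this thread;
# not quanteda code. Assumes m is a symmetric matrix with dimnames.
pairs_from_simil <- function(m, min_simil = NULL) {
  idx <- which(upper.tri(m), arr.ind = TRUE)  # each unique pair once
  out <- data.frame(
    document1 = rownames(m)[idx[, "row"]],
    document2 = colnames(m)[idx[, "col"]],
    simil = m[idx],
    stringsAsFactors = FALSE
  )
  if (!is.null(min_simil)) {
    # Threshold in use: values below it count as missing, so drop those pairs
    out <- out[out$simil >= min_simil, , drop = FALSE]
    rownames(out) <- NULL
  }
  out
}

# Small made-up similarity matrix for three documents
docs <- c("d1", "d2", "d3")
m <- matrix(c(1.0, 0.5, 0.0,
              0.5, 1.0, 0.0,
              0.0, 0.0, 1.0),
            nrow = 3, byrow = TRUE, dimnames = list(docs, docs))

all_pairs  <- pairs_from_simil(m)                    # all pairs, zeros kept
kept_pairs <- pairs_from_simil(m, min_simil = 0.25)  # only d1-d2 survives
```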

@kbenoit kbenoit transferred this issue from quanteda/quanteda Nov 28, 2020