Add native textmodel_lda #30

Open
koheiw opened this issue Aug 4, 2020 · 5 comments
Labels: enhancement, help wanted

Comments

Collaborator

koheiw commented Aug 4, 2020

topicmodels::LDA is implemented using this library, which I can call directly via Rcpp:

https://sourceforge.net/projects/gibbslda/files/

We can call the library in this way:

https://github.com/cran/topicmodels/blob/ade6dc5698f385ad222fd28aa8e90c1a4bd33cf5/R/lda.R#L134-L155

There is a lot going on there, but it shouldn't be too complex to implement the minimal set of functions that users usually need.

If we implement a quanteda-native LDA, I will move quanteda.seededlda into this package.

https://github.com/koheiw/quanteda.seededlda

koheiw added the enhancement label Aug 4, 2020
Collaborator Author

koheiw commented Aug 4, 2020

GibbsLDA++-0.2.tar.gz

koheiw added the help wanted label Aug 4, 2020
koheiw added a commit that referenced this issue Aug 9, 2020
koheiw added a commit that referenced this issue Aug 10, 2020
Collaborator Author

koheiw commented Aug 10, 2020

I managed to get GibbsLDA++ working, and we now have both seeded and regular LDA.

# seeded LDA (replicates https://github.com/koheiw/quanteda.seededlda)

> result10 <- textmodel_lda(dfmt_spnik, verbose = FALSE, seeds = tfmt_spnik)
> terms(result10)
      economy    politics        society         diplomacy    military   nature      other     
 [1,] "company"  "parliament"    "police"        "diplomatic" "army"     "human"     "going"   
 [2,] "money"    "congress"      "school"        "embassy"    "navy"     "sand"      "really"  
 [3,] "market"   "politicians"   "hospital"      "ambassador" "soldiers" "water"     "come"    
 [4,] "bank"     "parliamentary" "prison"        "treaty"     "marine"   "syria"     "see"     
 [5,] "industry" "lawmakers"     "women"         "diplomat"   "korea"    "syrian"    "american"
 [6,] "banks"    "voters"        "man"           "diplomats"  "korean"   "terrorist" "know"    
 [7,] "markets"  "lawmaker"      "investigation" "sanctions"  "missile"  "daesh"     "facebook"
 [8,] "banking"  "politician"    "found"         "iran"       "air"      "turkish"   "much"    
 [9,] "china"    "uk"            "court"         "deal"       "nuclear"  "turkey"    "good"    
[10,] "chinese"  "eu"            "children"      "meeting"    "force"    "weapons"   "team"  

# regular (unseeded) LDA
> result11 <- textmodel_lda(dfmt_spnik, k = 7, verbose = FALSE)
> terms(result11)
      topic1     topic2      topic3      topic4       topic5      topic6         topic7    
 [1,] "korea"    "china"     "syria"     "eu"         "going"     "uk"           "police"  
 [2,] "korean"   "chinese"   "syrian"    "sanctions"  "really"    "house"        "video"   
 [3,] "nuclear"  "economic"  "israel"    "iran"       "much"      "british"      "women"   
 [4,] "missile"  "india"     "terrorist" "deal"       "know"      "department"   "court"   
 [5,] "air"      "oil"       "daesh"     "union"      "see"       "white"        "man"     
 [6,] "nato"     "billion"   "turkish"   "agreement"  "come"      "campaign"     "found"   
 [7,] "force"    "trade"     "turkey"    "germany"    "good"      "ukrainian"    "children"
 [8,] "japan"    "project"   "weapons"   "elections"  "something" "secretary"    "service" 
 [9,] "kim"      "indian"    "saudi"     "parliament" "facebook"  "ukraine"      "swedish" 
[10,] "aircraft" "companies" "iraq"      "german"     "problem"   "intelligence" "rights" 
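
For readers unfamiliar with the `seeds` argument used above, here is a rough base R sketch of how a topics-by-terms seed weight matrix could be built from a dictionary. The structure of `tfmt_spnik` is not shown in this thread, so `make_seed_matrix()`, the toy vocabulary, the toy dictionary, and the weight value are all assumptions made for illustration, not the package's actual code.

```r
# Hypothetical helper (not part of any package): build a topics-by-terms
# matrix of seed weights. `dict` is a named list of seed words per topic,
# `vocab` is the vocabulary of the document-feature matrix.
make_seed_matrix <- function(dict, vocab, weight = 1) {
  seeds <- matrix(0, nrow = length(dict), ncol = length(vocab),
                  dimnames = list(names(dict), vocab))
  for (topic in names(dict)) {
    # keep only the seed words that actually occur in the vocabulary
    hits <- intersect(dict[[topic]], vocab)
    seeds[topic, hits] <- weight
  }
  seeds
}

# Toy inputs, invented for this example
vocab <- c("bank", "market", "army", "navy", "police", "water")
dict <- list(economy = c("bank", "market", "money"),
             military = c("army", "navy"))

seed_mat <- make_seed_matrix(dict, vocab, weight = 10)
print(seed_mat)
```

Seed words missing from the vocabulary ("money" here) are dropped silently in this sketch; a real implementation might prefer to warn about them.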

My question is: should I split the function into textmodel_lda(x, k) and textmodel_seededlda(x, dictionary), as in my older package?

@JBGruber
Collaborator

Just my very subjective two cents: I think a dedicated textmodel_seededlda() function would be a good advertisement for the concept, as it is not widely known yet.

That doesn't mean textmodel_lda() shouldn't be able to do it as well, much like stringi::stri_detect(), which runs stringi::stri_detect_fixed() if one wants it to.
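
A minimal base R sketch of that design: one general function takes an optional `seeds` argument, and a dedicated wrapper simply forwards to it. The function names mirror the ones discussed in this thread, but the `_sketch` suffix, `fit_lda_stub()`, and the function bodies are hypothetical placeholders, not the package's implementation.

```r
# Hypothetical stand-in for the real sampler: it only records how it was called.
fit_lda_stub <- function(x, k, seeds = NULL) {
  topics <- if (is.null(seeds)) paste0("topic", seq_len(k)) else rownames(seeds)
  list(k = k, seeded = !is.null(seeds), topics = topics)
}

# General entry point: unseeded by default, seeded when `seeds` is supplied.
textmodel_lda_sketch <- function(x, k = NULL, seeds = NULL) {
  if (!is.null(seeds)) k <- nrow(seeds)  # number of topics follows the seeds
  if (is.null(k)) stop("supply either `k` or `seeds`")
  fit_lda_stub(x, k, seeds)
}

# Dedicated, more visible wrapper that delegates to the general function.
textmodel_seededlda_sketch <- function(x, seeds) {
  textmodel_lda_sketch(x, seeds = seeds)
}

# Toy inputs, invented for this example
x <- matrix(1, nrow = 2, ncol = 3)
seeds <- matrix(0, nrow = 2, ncol = 3,
                dimnames = list(c("economy", "politics"), NULL))

fit1 <- textmodel_lda_sketch(x, k = 7)
fit2 <- textmodel_seededlda_sketch(x, seeds)
```

With this layout both entry points share one code path, so the dedicated function adds visibility without duplicating the model code.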

Collaborator Author

koheiw commented Aug 11, 2020

@JBGruber thanks for the input. I added textmodel_seededlda() to make it more visible to users.

Contributor

kbenoit commented Aug 18, 2020

Sorry to be a downer here - and I was offline for 2 weeks - but seeded LDA is already available through topicmodels::LDA(). See #31 (review).

3 participants