-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
[WIP] Add topic coherence pipeline to gensim #710
Conversation
What is "segmentation"? Can you add some background info / motivation? |
@piskvorky sorry I forgot to tag the issue before. I've tagged it now. |
Ah, nice! :-) @tmylk please review. |
@piskvorky @tmylk I've added the API I was thinking of for adding the topic coherence pipeline. I've currently implemented only the U_mass topic coherence (still have to work a lot on it though) using this pipeline. Sorry for not adding the tests yet. Just wanted to show you what kind of an API I was thinking of. Is this API fine? |
Looks ok to me conceptually (but have a look at |
|
||
EPSILON = 1e-12 # Should be small. Value as suggested in paper. | ||
|
||
def Log_Conditional_Probability(segmented_topics, per_topic_probability): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8: function names are lowercase.
I've added initial tests for the segmentation module. Although I'm still thinking about what kind of inputs can I get in the segmentation module (or what different outputs can the users use from LDA to input into the pipeline). |
|
||
logger = logging.getLogger(__name__) | ||
|
||
def s_one_pre(topics): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a model argument here. Handle it inside
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Demonstration for the `u_mass` topic coherence using topic coherence pipeline" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmylk I've added an example notebook for u_mass
topic coherence. Is this alright?
8749621
to
7a04893
Compare
@tmylk @piskvorky I have updated this PR with the latest version. I have also added a |
Added initial draft of |
self.conf = direct_confirmation_measure.log_conditional_probability | ||
self.aggr = aggregation.arithmetic_mean | ||
|
||
elif self.coherence == 'c_v': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'c_v' should be a constant to avoid duplication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry can you please elaborate on this a bit more?
213fcff
to
b45c00e
Compare
I've added |
Thanks! Why doesn't pyldavis show on github? |
Added tests for checking mathematical correctness of the |
Added mathematical correctness test for |
I've modified the ipython notebook a bit. Hopefully it's more human-interpretable now 😄 |
Added introduction to the notebook. @tmylk can you please review? |
@dsquareindia Could you move the new files you created from gensim/.py to gensim/topic_coherence/.py ? These modules are only needed for topic coherence and shouldn't be in root. |
@tmylk done. Could you please check |
4cbd3ae
to
738235d
Compare
@tmylk tests pass now. Just used |
@guntherzhao simple rule: more - better, here - 7 topics. Our coherence is port of https://github.com/dice-group/Palmetto , you can read more about it on Palmetto wiki page or http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf For future - please use mailing list for questions (GitHub only for feature requests & bug reports) |
@menshikh-iv Thanks for your help! |
This comment was marked as spam.
This comment was marked as spam.
This comment was marked as spam.
This comment was marked as spam.
Hi, sorry to resurrect this post but my question relates directly to one of the answers. @menshikh-iv, you said above:
which I read as "for Umass, the more negative value the better." Looking at the example graph, 5 topics has the most negative Umass coherence score which I therefore take to be best. However your answer states 7 topics (which just so happens to be the least negative value). I'm hoping you could clarify this point, especially as there seems to be a lot of confusion on this topic out there in the world (e.g. here) Thanks! |
Addresses #278.
@tmylk @piskvorky is this alright?