Add the 'keep_tokens' parameter to 'filter_extremes' and test it #1210

toltoxgh · 2017-03-13T19:29:06Z

Add the optional 'keep_tokens' parameter to the 'filter_extremes'
method in dictionary.py. This parameter takes a list of tokens that
are kept regardless of the 'no_below' and 'no_above' settings.
This is useful when the research goal requires certain tokens to
appear in topics while all other extremes are still filtered out.

If 'keep_tokens' is not given, the functionality of 'filter_extremes' is
unchanged.

Unit tests covering the cases above are included.
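To make the described behavior concrete, here is a minimal, self-contained sketch. It is not gensim's implementation: the function name `filter_extremes_sketch`, the sample documents, and the way document frequencies are computed are assumptions made for illustration. It only mirrors the rule stated above, that a token survives if its document frequency is within bounds or it is listed in 'keep_tokens'.

```python
from collections import Counter


def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_tokens=None):
    """Return the set of tokens that survive document-frequency filtering.

    Hypothetical stand-in for illustration only; not gensim's code.
    """
    # Document frequency: the number of documents each token appears in.
    dfs = Counter(token for doc in docs for token in set(doc))
    # Convert the fractional upper bound into an absolute document count.
    no_above_abs = int(no_above * len(docs))
    keep = set(keep_tokens or [])
    # Only tokens seen in the corpus are considered, so unknown
    # entries in keep_tokens are silently ignored.
    return {
        token for token, df in dfs.items()
        if no_below <= df <= no_above_abs or token in keep
    }


docs = [
    ["human", "computer", "interface"],
    ["computer", "survey", "system"],
    ["computer", "graph", "trees"],
    ["computer", "graph", "minors"],
]

without_keep = filter_extremes_sketch(docs, no_below=2, no_above=0.5)
with_keep = filter_extremes_sketch(
    docs, no_below=2, no_above=0.5, keep_tokens=["computer", "human"]
)
```

In this toy corpus "computer" is too frequent and "human" too rare, so both are dropped without 'keep_tokens' and retained with it, while "graph" passes the frequency bounds either way.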

toltoxgh · 2017-03-13T19:57:38Z

The travis-ci check failed because:

Traceback (most recent call last):
  File "/home/travis/build/RaRe-Technologies/gensim/gens
    self.assertAlmostEqual(expected, result)
AssertionError: 0.0894502 != 0.089450255 within 7 places

I do not see how my commit could cause this: when the optional 'keep_tokens' parameter is not provided, the behavior of dictionary.py is unchanged (see the unit tests).

What do I need to do to get this commit merged?

tmylk · 2017-03-13T20:26:03Z

@toliwa It's not related to your code changes; it is caused by the new scipy 0.19 release.

Changing it to assertAlmostEqual(expected, result, places=5) will fix the issue.
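As a small illustration of why the suggested change helps, the sketch below reuses the two values from the traceback above. `assertAlmostEqual` rounds the difference to `places` decimal places (7 by default) and compares it to zero. The test class name and the runner lines are assumptions added for the example; the actual gensim test is not shown in this thread.

```python
import unittest


class TestPlaces(unittest.TestCase):
    """Hypothetical test case; values are copied from the traceback."""

    def test_default_precision_fails(self):
        # Default places=7: the difference of about 5.5e-8 rounds to 1e-7, not 0.
        with self.assertRaises(AssertionError):
            self.assertAlmostEqual(0.0894502, 0.089450255)

    def test_relaxed_precision_passes(self):
        # With places=5 the same difference rounds to 0, so the check passes.
        self.assertAlmostEqual(0.0894502, 0.089450255, places=5)


# Run the two tests programmatically and keep the result object.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestPlaces)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```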

tmylk · 2017-03-13T20:34:45Z

gensim/corpora/dictionary.py

+        # add ids of keep_tokens elements to good_ids
+        if keep_tokens:
+            keep_ids = [self.token2id[v] for v in keep_tokens if v in self.token2id]
+            good_ids_copy =  (v for v in itervalues(self.token2id) if no_below <= self.dfs.get(v, 0) <= no_above_abs)


Please keep to a single iteration of token2id by adding an 'or v in keep_tokens' check if keep_tokens is present.
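One way the single-pass suggestion could look is sketched below. This is an assumption about the intended shape, not the code that was merged: `select_good_ids` is a hypothetical helper, and plain dicts stand in for the dictionary's token2id and dfs attributes. Because the loop variable in the diff iterates over ids rather than token strings, the sketch tests membership in a precomputed set of ids.

```python
def select_good_ids(token2id, dfs, no_below, no_above_abs, keep_tokens=None):
    """Hypothetical helper: pick surviving ids in one pass over token2id."""
    # Resolve the protected tokens to ids once, skipping unknown tokens.
    keep_ids = {token2id[t] for t in (keep_tokens or []) if t in token2id}
    # Single pass: an id survives if its document frequency is in range
    # or if it belongs to a protected token.
    return [
        token_id for token_id in token2id.values()
        if no_below <= dfs.get(token_id, 0) <= no_above_abs
        or token_id in keep_ids
    ]


# Toy data standing in for the dictionary's token2id and dfs mappings.
token2id = {"human": 0, "computer": 1, "graph": 2, "trees": 3}
dfs = {0: 1, 1: 4, 2: 2, 3: 1}

ids_plain = select_good_ids(token2id, dfs, no_below=2, no_above_abs=2)
ids_kept = select_good_ids(
    token2id, dfs, no_below=2, no_above_abs=2,
    keep_tokens=["computer", "unknown"],
)
```

With an empty or missing 'keep_tokens' the set of protected ids is empty, so the same expression covers both cases and the filter is built only once.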

tmylk · 2017-03-13T21:06:11Z

Please merge in the latest develop to make the tests pass.

Create good_ids only once, as per the optimization suggestion, regardless of whether 'keep_tokens' is provided.

tmylk · 2017-03-13T21:55:44Z

Thanks for the new feature!

piskvorky · 2017-04-09T07:24:32Z

gensim/corpora/dictionary.py

-            if no_below <= self.dfs.get(v, 0) <= no_above_abs)
+        if keep_tokens:
+            keep_ids = [self.token2id[v] for v in keep_tokens if v in self.token2id]
+            good_ids =  (v for v in itervalues(self.token2id) 


Code style: no vertical indent in gensim -- please use hanging indent.
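For readers unfamiliar with the two terms, the sketch below contrasts them using PEP 8's vocabulary. The variable names and toy data are assumptions for illustration; only the layout differs between the two expressions.

```python
# Toy data standing in for the dictionary's token2id and dfs mappings.
token2id = {"human": 0, "computer": 1, "graph": 2}
dfs = {0: 1, 1: 4, 2: 2}
no_below, no_above_abs = 2, 4

# Vertical alignment: continuation lines line up with the opening bracket.
aligned = [v for v in token2id.values()
           if no_below <= dfs.get(v, 0) <= no_above_abs]

# Hanging indent: break after the opening bracket and indent the
# contents by one level, independent of the bracket's column.
hanging = [
    v for v in token2id.values()
    if no_below <= dfs.get(v, 0) <= no_above_abs
]
```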

tmylk suggested changes Mar 13, 2017

Create good_ids only once

ee6b4f7

tmylk approved these changes Mar 13, 2017

tmylk merged commit 8c869cb into piskvorky:develop Mar 13, 2017

piskvorky reviewed Apr 9, 2017
