Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Add the 'keep_tokens' parameter to 'filter_extremes' and test it #1210

Merged
merged 2 commits into from
Mar 13, 2017

Conversation

toltoxgh
Copy link
Contributor

Add the optional 'keep_tokens' parameter to the 'filter_extremes'
method in dictionary.py. This parameter can contain a list of tokens,
which will be kept regardless of the 'no_below' and 'no_above' settings.
This can be useful if the research goal is to enforce certain tokens to
appear in topics, and still be able to filter all other extremes.

If 'keep_tokens' is not given, the functionality of 'filter_extremes' is
unchanged.

Unit tests are also provided to assert examples of the above.

Add the optional 'keep_tokens' parameter to the 'filter_extremes'
method in dictionary.py. This parameter can contain a list of tokens,
which will be kept regardless of the 'no_below' and 'no_above' settings.
This can be useful if the research goal is to enforce certain tokens to
appear in topics, and still be able to filter all other extremes.

If 'keep_tokens' is not given, the functionality of 'filter_extremes' is
unchanged.

Unit tests are also provided to assert examples of the above.
@toltoxgh
Copy link
Contributor Author

toltoxgh commented Mar 13, 2017

The travis-ci check failed because:

Traceback (most recent call last):

  File "/home/travis/build/RaRe-Technologies/gensim/gens

    self.assertAlmostEqual(expected, result)

AssertionError: 0.0894502 != 0.089450255 within 7 places

I do not know how this can affect my commit, as not providing the optional 'keep_tokens' parameter should not change any functionality in dictionary.py, see the unit tests.

How can this commit be included successfully?

@tmylk
Copy link
Contributor

tmylk commented Mar 13, 2017

@toliwa It's not related to your code changes, just the new release of scipy 0.19

Changing it to assertAlmostEqual(expected, result, places=5) will fix the issue.

Copy link
Contributor

@tmylk tmylk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep to a single iteration

# add ids of keep_tokens elements to good_ids
if keep_tokens:
keep_ids = [self.token2id[v] for v in keep_tokens if v in self.token2id]
good_ids_copy = (v for v in itervalues(self.token2id) if no_below <= self.dfs.get(v, 0) <= no_above_abs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please keep to a single iteration of token2id by adding a or v in keep_tokens check if keep_tokens is present

@tmylk
Copy link
Contributor

tmylk commented Mar 13, 2017

Please merge in the latest develop to make the tests pass.

Create good_ids only once as per optimization
suggestion, regardless if 'keep_tokens' is provided or not.
@tmylk tmylk merged commit 8c869cb into piskvorky:develop Mar 13, 2017
@tmylk
Copy link
Contributor

tmylk commented Mar 13, 2017

Thanks for the new feature!

pranaydeeps pushed a commit to pranaydeeps/gensim that referenced this pull request Mar 21, 2017
* Add the 'keep_tokens' parameter to 'filter_extremes' and test it

Add the optional 'keep_tokens' parameter to the 'filter_extremes'
method in dictionary.py. This parameter can contain a list of tokens,
which will be kept regardless of the 'no_below' and 'no_above' settings.
This can be useful if the research goal is to enforce certain tokens to
appear in topics, and still be able to filter all other extremes.

If 'keep_tokens' is not given, the functionality of 'filter_extremes' is
unchanged.

Unit tests are also provided to assert examples of the above.

* Create good_ids only once

Create good_ids only once as per optimization
suggestion, regardless if 'keep_tokens' is provided or not.
if no_below <= self.dfs.get(v, 0) <= no_above_abs)
if keep_tokens:
keep_ids = [self.token2id[v] for v in keep_tokens if v in self.token2id]
good_ids = (v for v in itervalues(self.token2id)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code style: no vertical indent in gensim -- please use hanging indent.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants