-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Add the 'keep_tokens' parameter to 'filter_extremes' and test it #1210
Add the 'keep_tokens' parameter to 'filter_extremes' and test it #1210
Conversation
Add the optional 'keep_tokens' parameter to the 'filter_extremes' method in dictionary.py. This parameter can contain a list of tokens, which will be kept regardless of the 'no_below' and 'no_above' settings. This can be useful if the research goal is to enforce certain tokens to appear in topics, and still be able to filter all other extremes. If 'keep_tokens' is not given, the functionality of 'filter_extremes' is unchanged. Unit tests are also provided to assert examples of the above.
The travis-ci check failed because:
I do not know how this can affect my commit, as not providing the optional 'keep_tokens' parameter should not change any functionality in dictionary.py, see the unit tests. How can this commit be included successfully? |
@toliwa It's not related to your code changes, just the new release of scipy 0.19 Changing it to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep to a single iteration
gensim/corpora/dictionary.py
Outdated
# add ids of keep_tokens elements to good_ids | ||
if keep_tokens: | ||
keep_ids = [self.token2id[v] for v in keep_tokens if v in self.token2id] | ||
good_ids_copy = (v for v in itervalues(self.token2id) if no_below <= self.dfs.get(v, 0) <= no_above_abs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please keep to a single iteration of token2id
by adding a or v in keep_tokens
check if keep_tokens
is present
Please merge in the latest develop to make the tests pass. |
Create good_ids only once as per optimization suggestion, regardless if 'keep_tokens' is provided or not.
Thanks for the new feature! |
* Add the 'keep_tokens' parameter to 'filter_extremes' and test it Add the optional 'keep_tokens' parameter to the 'filter_extremes' method in dictionary.py. This parameter can contain a list of tokens, which will be kept regardless of the 'no_below' and 'no_above' settings. This can be useful if the research goal is to enforce certain tokens to appear in topics, and still be able to filter all other extremes. If 'keep_tokens' is not given, the functionality of 'filter_extremes' is unchanged. Unit tests are also provided to assert examples of the above. * Create good_ids only once Create good_ids only once as per optimization suggestion, regardless if 'keep_tokens' is provided or not.
if no_below <= self.dfs.get(v, 0) <= no_above_abs) | ||
if keep_tokens: | ||
keep_ids = [self.token2id[v] for v in keep_tokens if v in self.token2id] | ||
good_ids = (v for v in itervalues(self.token2id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code style: no vertical indent in gensim -- please use hanging indent.
Add the optional 'keep_tokens' parameter to the 'filter_extremes'
method in dictionary.py. This parameter can contain a list of tokens,
which will be kept regardless of the 'no_below' and 'no_above' settings.
This can be useful if the research goal is to enforce certain tokens to
appear in topics, and still be able to filter all other extremes.
If 'keep_tokens' is not given, the functionality of 'filter_extremes' is
unchanged.
Unit tests are also provided to assert examples of the above.