-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Refactor code with PEP8 and additional limitations. Fix #1521 #1569
Conversation
gensim/models/wrappers/ldamallet.py
Outdated
doc = [(id_, weight) | ||
for id_, weight in enumerate(map(float, parts)) | ||
if abs(weight) > eps] | ||
doc = [(id_, weight) for id_, weight in enumerate(float(x) for x in parts) if abs(weight) > eps] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@menshikh-iv
line 308:
doc = [(int(id_), float(weight))
for id_, weight in zip(*[iter(parts)] * 2)
if abs(float(weight)) > eps]
line 310 :
doc = [(id_, float(weight))
for id_, weight in enumerate(parts)
if abs(float(weight)) > eps]
How about above code?
Above code is less loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, accepted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zsef123 Both examples above are incorrect. The correct formatting using hanging indent:
doc = [
(id_, float(weight))
for id_, weight in enumerate(parts)
if abs(float(weight)) > eps
]
Actually, this line not too long, so simple:
doc = [(id_, float(weight)) for id_, weight in enumerate(parts) if abs(float(weight)) > eps]
would work too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
already done as your last variant @piskvorky
gensim/test/test_sklearn_api.py
Outdated
@@ -363,7 +363,7 @@ def testPipeline(self): | |||
data = cache | |||
test_data = data.data[0:2] | |||
test_target = data.target[0:2] | |||
id2word = Dictionary(map(lambda x: x.split(), test_data)) | |||
id2word = Dictionary([x.split() for x in data.data]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data.data
need change to test_data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed.
gensim/test/test_atmodel.py
Outdated
@@ -436,7 +436,7 @@ def testPasses(self): | |||
for test_rhot in test_rhots: | |||
model.update(corpus, author2doc) | |||
|
|||
msg = ", ".join(map(str, [passes, model.num_updates, model.state.numdocs])) | |||
msg = ", ".join(str(x) for x in [passes, model.num_updates, model.state.numdocs]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about msg = "%d, %d, %d" % (passes, model.num_updates, model.state.numdocs)
?
This is more simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Epic! I reviewed the diff and left some minor comments inside. I will look at the non-diffed (unchanged) parts next.
gensim/corpora/dictionary.py
Outdated
self.dfs = dict((tokenid, freq) | ||
for tokenid, freq in iteritems(self.dfs) | ||
if tokenid not in bad_ids) | ||
self.token2id = dict((token, tokenid) for token, tokenid in iteritems(self.token2id) if tokenid not in bad_ids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We no longer support py2.6, so this can be a dict comprehension: {token: tokenid for ...}
. Here and elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
gensim/corpora/dictionary.py
Outdated
logger.info( | ||
"keeping %i tokens which were in no less than %i and no more than %i (=%.1f%%) documents", | ||
len(good_ids), no_below, no_above_abs, 100.0 * no_above) | ||
logger.info("keeping %i tokens which were in no less than %i and no more than %i (=%.1f%%) documents", len(good_ids), no_below, no_above_abs, 100.0 * no_above) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line too long, the args should be on their own separate line. Here and elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
gensim/corpora/dictionary.py
Outdated
@@ -333,7 +322,7 @@ def merge_with(self, other): | |||
old2new[other_id] = new_id | |||
try: | |||
self.dfs[new_id] += other.dfs[other_id] | |||
except Exception: | |||
except AttributeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change? What will happen if other.dfs
exists, but is not a dictionary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
gensim/corpora/sharded_corpus.py
Outdated
@@ -254,20 +252,14 @@ def init_shards(self, output_prefix, corpus, shardsize=4096, dtype=_default_dtyp | |||
"""Initialize shards from the corpus.""" | |||
|
|||
if not gensim.utils.is_corpus(corpus): | |||
raise ValueError('Cannot initialize shards without a corpus to read' | |||
' from! (Got corpus type: {0})'.format(type(corpus))) | |||
raise ValueError('Cannot initialize shards without a corpus to read from! (Got corpus type: {0})'.format(type(corpus))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to use one type of formatting consistently (%s
instead of .format
). Here and elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my opinion both variants is good, for this reason we shouldn't use only one
gensim/corpora/sharded_corpus.py
Outdated
@@ -484,8 +456,7 @@ def resize_shards(self, shardsize): | |||
# If something happens when we're in this stage, we're screwed. | |||
except Exception as e: | |||
print(e) | |||
raise RuntimeError('Resizing completely failed for some reason.' | |||
' Sorry, dataset is probably ruined...') | |||
raise RuntimeError('Resizing completely failed for some reason. Sorry, dataset is probably ruined...') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use logger.exception
to log the exception, and then simply re-raise the original error, instead of printing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
gensim/corpora/sharded_corpus.py
Outdated
raise TypeError('Couldn\'t find number of features, ' | ||
'refusing to guess (dimension set to {0},' | ||
'type of corpus: {1}).'.format(self.dim, type(corpus))) | ||
raise TypeError('Couldn\'t find number of features, refusing to guess (dimension set to {0}, type of corpus: {1}).'.format(self.dim, type(corpus))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code style: Better use "
to delimit the string, to avoid the \'
inside, to improve readability. Here and elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed (\'
occurs only in this file several times)
self.dictionary = dictionary.Dictionary() | ||
numPositions = 0 | ||
for docNo, (sourceId, docUri) in enumerate(self.documents): | ||
if docNo % 1000 == 0: | ||
logger.info("PROGRESS: at document #%i/%i (%s, %s)" % | ||
(docNo, len(self.documents), sourceId, docUri)) | ||
logger.info("PROGRESS: at document #%i/%i (%s, %s)", docNo, len(self.documents), sourceId, docUri) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docNo
and resultFile
is Java-style. I think this file (package) should be simply removed, instead of updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completely agree, will removed it soon (but not in this PR)
LOG.warning('no word id mapping provided; initializing from ' | ||
'corpus, assuming identity') | ||
raise ValueError('at least one of corpus/id2word must be specified, to establish input space dimensionality') | ||
LOG.warning('no word id mapping provided; initializing from corpus, assuming identity') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this LOG
? We use logger
in gensim, by convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
No description provided.