The type information about job_queue in word2vec is wrong #2928

Closed
lunastera opened this issue Aug 31, 2020 · 3 comments · Fixed by #2931

lunastera (Contributor) commented Aug 31, 2020

Problem description

The docstrings for _worker_loop and _job_producer say that each job_queue element is of type (list of object, dict of (str, int)), when in fact it appears to be (list of object, float).
Is this a placeholder for a planned change, or is it a mistake?

Steps/code/corpus to reproduce

def _worker_loop(self, job_queue, progress_queue):
        """Train the model, lifting batches of data from the queue.

        This function will be called in parallel by multiple workers (threads or processes) to make
        optimal use of multicore machines.

        Parameters
        ----------
        job_queue : Queue of (list of objects, (str, int))
            A queue of jobs still to be processed. The worker will take up jobs from this queue.
            Each job is represented by a tuple where the first element is the corpus chunk to be processed and
            the second is the dictionary of parameters.

but _job_producer puts the return value of _get_job_params on the queue:

def _job_producer(self, data_iterator, job_queue, cur_epoch=0, total_examples=None, total_words=None):
        ...
        next_job_params = self._get_job_params(cur_epoch)
        job_no = 0

        for data_idx, data in enumerate(data_iterator):
            data_length = self._raw_word_count([data])

            # can we fit this sentence into the existing job batch?
            if batch_size + data_length <= self.batch_words:
                # yes => add it to the current job
                job_batch.append(data)
                batch_size += data_length
            else:
                job_no += 1
                job_queue.put((job_batch, next_job_params))

and _get_job_params returns a plain float:

def _get_job_params(self, cur_epoch):
        """Get the learning rate used in the current epoch.

        Parameters
        ----------
        cur_epoch : int
            Current iteration through the corpus

        Returns
        -------
        float
            The learning rate for this epoch (it is linearly reduced with epochs from `self.alpha` to `self.min_alpha`).

        """
        alpha = self.alpha - ((self.alpha - self.min_alpha) * float(cur_epoch) / self.epochs)
        return alpha
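
The mismatch is easy to confirm without a model: the second tuple element is just the value of this formula (a minimal standalone sketch; the alpha, min_alpha, and epochs values below are Word2Vec's documented defaults, not read from a live instance):

# Standalone re-computation of next_job_params (a sketch, not gensim code)
alpha, min_alpha, epochs = 0.025, 0.0001, 5  # Word2Vec defaults

for cur_epoch in range(epochs):
    next_job_params = alpha - ((alpha - min_alpha) * float(cur_epoch) / epochs)
    print(cur_epoch, type(next_job_params).__name__, round(next_job_params, 5))

# Prints e.g.:
# 0 float 0.025
# 1 float 0.02002
# ...a plain float (the learning rate) every epoch, never a dict of (str, int).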

Versions

Darwin-19.6.0-x86_64-i386-64bit
Python 3.7.4 (default, Jan 24 2020, 20:34:38)
[Clang 11.0.0 (clang-1100.0.33.16)]
Bits 64
NumPy 1.19.1
SciPy 1.5.2
gensim 4.0.0.dev0
gojomo (Collaborator) commented Aug 31, 2020

Looks like a mistake to me – maybe a remnant of some prior implementation. (And, if _get_job_params() in practice just returns a floating-point learning-rate, it's got a bad name that suggests it's more than that.)

lunastera (Contributor, Author) commented:

@gojomo Thanks for the quick response! If there are no objections, can I create a PR to fix this?

gojomo (Collaborator) commented Sep 1, 2020

Sure! Unless there's some other reason for _get_job_params() to remain generically-named like that (as a search for other uses/calls/implementations might reveal), a patch could give it a better name, in addition to ensuring the comments accurately describe the current code.
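
One possible shape for such a patch (a sketch only; the name _get_next_alpha and the exact docstring wording are illustrative assumptions, not necessarily the change merged in #2931):

def _get_next_alpha(self, cur_epoch):
    """Return the learning rate for `cur_epoch`, linearly decayed from
    `self.alpha` down to `self.min_alpha` over `self.epochs` epochs."""
    return self.alpha - ((self.alpha - self.min_alpha) * float(cur_epoch) / self.epochs)

with the _worker_loop docstring updated to match, e.g.:

job_queue : Queue of (list of object, float)
    A queue of jobs still to be processed. Each job is a tuple whose first element
    is the corpus chunk to be processed and whose second element is the learning
    rate (alpha) to use for that chunk.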
