The type information about job_queue in word2vec is wrong #2928

Closed
lunastera opened this issue Aug 31, 2020 · 3 comments · Fixed by #2931

lunastera (Contributor) commented Aug 31, 2020

Problem description

The docstrings for _worker_loop and _job_producer say that each job_queue element is of type (list of object, dict of (str, int)), when in fact it appears to be (list of object, float).
Is this a placeholder for a planned change, or is it a mistake?

Steps/code/corpus to reproduce

def _worker_loop(self, job_queue, progress_queue):
        """Train the model, lifting batches of data from the queue.

        This function will be called in parallel by multiple workers (threads or processes) to make
        optimal use of multicore machines.

        Parameters
        ----------
        job_queue : Queue of (list of objects, (str, int))
            A queue of jobs still to be processed. The worker will take up jobs from this queue.
            Each job is represented by a tuple where the first element is the corpus chunk to be processed and
            the second is the dictionary of parameters.

but _job_producer puts the return value of _get_job_params on the queue:

def _job_producer(self, data_iterator, job_queue, cur_epoch=0, total_examples=None, total_words=None):
        ...
        next_job_params = self._get_job_params(cur_epoch)
        job_no = 0

        for data_idx, data in enumerate(data_iterator):
            data_length = self._raw_word_count([data])

            # can we fit this sentence into the existing job batch?
            if batch_size + data_length <= self.batch_words:
                # yes => add it to the current job
                job_batch.append(data)
                batch_size += data_length
            else:
                job_no += 1
                job_queue.put((job_batch, next_job_params))

and _get_job_params returns a plain float:

def _get_job_params(self, cur_epoch):
        """Get the learning rate used in the current epoch.

        Parameters
        ----------
        cur_epoch : int
            Current iteration through the corpus

        Returns
        -------
        float
            The learning rate for this epoch (it is linearly reduced with epochs from `self.alpha` to `self.min_alpha`).

        """
        alpha = self.alpha - ((self.alpha - self.min_alpha) * float(cur_epoch) / self.epochs)
        return alpha
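
The mismatch is easy to confirm without a model: the second tuple element is just the value of this formula (a minimal standalone sketch; the alpha, min_alpha, and epochs values below are Word2Vec's documented defaults, not read from a live instance):

# Standalone re-computation of next_job_params (a sketch, not gensim code)
alpha, min_alpha, epochs = 0.025, 0.0001, 5  # Word2Vec defaults

for cur_epoch in range(epochs):
    next_job_params = alpha - ((alpha - min_alpha) * float(cur_epoch) / epochs)
    print(cur_epoch, type(next_job_params).__name__, round(next_job_params, 5))

# Prints e.g.:
# 0 float 0.025
# 1 float 0.02002
# ...a plain float (the learning rate) every epoch, never a dict of (str, int).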

Versions

Darwin-19.6.0-x86_64-i386-64bit
Python 3.7.4 (default, Jan 24 2020, 20:34:38)
[Clang 11.0.0 (clang-1100.0.33.16)]
Bits 64
NumPy 1.19.1
SciPy 1.5.2
gensim 4.0.0.dev0
gojomo (Collaborator) commented Aug 31, 2020

Looks like a mistake to me – maybe a remnant of some prior implementation. (And, if _get_job_params() in practice just returns a floating-point learning-rate, it's got a bad name that suggests it's more than that.)

lunastera (Contributor, Author) commented:

@gojomo Thanks for the quick response! If there are no objections, can I create a PR to fix this?

gojomo (Collaborator) commented Sep 1, 2020

Sure! Unless there's some other reason for _get_job_params() to remain generically-named like that (as a search for other uses/calls/implementations might reveal), a patch could give it a better name, in addition to ensuring the comments accurately describe the current code.
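
One possible shape for such a patch (a sketch only; the name _get_next_alpha and the exact docstring wording are illustrative assumptions, not necessarily the change merged in #2931):

def _get_next_alpha(self, cur_epoch):
    """Return the learning rate for `cur_epoch`, linearly decayed from
    `self.alpha` down to `self.min_alpha` over `self.epochs` epochs."""
    return self.alpha - ((self.alpha - self.min_alpha) * float(cur_epoch) / self.epochs)

with the _worker_loop docstring updated to match, e.g.:

job_queue : Queue of (list of object, float)
    A queue of jobs still to be processed. Each job is a tuple whose first element
    is the corpus chunk to be processed and whose second element is the learning
    rate (alpha) to use for that chunk.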
