Distributed lda options #782

Merged
merged 5 commits into from
Jul 13, 2016


menshikh-iv
Contributor

Updates distributed LDA support. Workers and the dispatcher can now run in different network segments (ones not reachable by network broadcast). The broadcast variant is still supported.

If you want to use broadcast, read the tutorial at https://radimrehurek.com/gensim/dist_lsi.html on the official site.

If you want to use the new feature, pass some extra arguments when you run the code, for example:

  1. Execute on all machines
    export PYRO_SERIALIZERS_ACCEPTED=pickle
    export PYRO_SERIALIZER=pickle
  2. On NS server
    python -m Pyro4.naming --host 0.0.0.0 --port <NS_PORT> -x
  3. On workers
    python -m gensim.models.lda_worker --host <NS_HOSTNAME> --port <NS_PORT> --no-broadcast -v
  4. On dispatcher
    python -m gensim.models.lda_dispatcher --host <NS_HOSTNAME> --port <NS_PORT> --no-broadcast -v
  5. Create LdaModel
    lda = LdaModel(..., ns_conf={"host": NS_HOST, "port": NS_PORT, "broadcast": False})
  6. Train it!
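Steps 5 and 6 can be sketched in Python roughly as follows. This is a minimal illustration, not code from this PR: `build_ns_conf` and `train_distributed_lda` are hypothetical helper names, the host and port values are placeholders, and passing `distributed=True` is an assumption taken from gensim's distributed tutorial (the `...` in step 5 does not spell it out). The nameserver, workers and dispatcher from steps 1-4 are assumed to be running already.

```python
def build_ns_conf(host, port, broadcast=False):
    """Return nameserver settings in the dict shape shown in step 5."""
    return {"host": host, "port": int(port), "broadcast": broadcast}


def train_distributed_lda(corpus, id2word, num_topics, ns_conf):
    """Steps 5-6: create the model; training starts when a corpus is passed."""
    # Imported lazily so build_ns_conf() works without gensim installed.
    from gensim.models import LdaModel

    return LdaModel(
        corpus=corpus,
        id2word=id2word,
        num_topics=num_topics,
        distributed=True,  # assumption: needed so the dispatcher/workers are used
        ns_conf=ns_conf,   # tells gensim where to find the Pyro nameserver
    )


if __name__ == "__main__":
    # Placeholder address; substitute the real <NS_HOSTNAME> and <NS_PORT>.
    print(build_ns_conf("ns.example.org", 9090))
```

With a corpus and dictionary at hand, the call would look like `train_distributed_lda(corpus, dictionary, 100, build_ns_conf(NS_HOST, NS_PORT))`.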

@@ -15,14 +15,21 @@


from __future__ import with_statement
import os, sys, logging, threading, time
import argparse
Owner

@piskvorky piskvorky Jul 12, 2016

This is py2.7 only. @tmylk I don't think we can drop support for py2.6 yet... is this import safe?

If it's triggered only on importing lda_dispatcher.py, it's probably fine... but we don't want py2.7+ imports in "core" gensim (at import gensim).

Contributor Author

I checked: this import is triggered only when importing lda_dispatcher.py or lda_worker.py.
For Python < 2.7, setup.py already installs an argparse backport (proof).
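For readers who have not seen the diff, the kind of option parsing discussed here can be sketched as below. This is an illustration under assumptions, not the code merged in this PR: only the flag names `--host`, `--port`, `--no-broadcast` and `-v` come from the commands above, while `build_parser`, `ns_conf_from_args`, the defaults and the help texts are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for the flags used in the worker/dispatcher commands."""
    parser = argparse.ArgumentParser(description="Illustrative nameserver options")
    parser.add_argument("--host", default=None, help="nameserver hostname")
    parser.add_argument("--port", type=int, default=None, help="nameserver port")
    parser.add_argument("--no-broadcast", dest="no_broadcast", action="store_true",
                        help="do not locate the nameserver via network broadcast")
    parser.add_argument("-v", dest="verbose", action="store_true",
                        help="verbose logging")
    return parser


def ns_conf_from_args(args):
    """Translate parsed flags into the ns_conf dict shape used by LdaModel."""
    return {"host": args.host, "port": args.port,
            "broadcast": not args.no_broadcast}


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(ns_conf_from_args(args))
```

Because `argparse` is imported only by this script-level module, a plain `import gensim` would not pull it in, which is the property being checked in this thread.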

@piskvorky
Owner

Awesome! This is a great update, and nicely done too.

If you don't mind me asking, how do you use this distributed LDA @menshikh-iv? What is your usecase/goal?

@menshikh-iv
Contributor Author

menshikh-iv commented Jul 12, 2016

@piskvorky, I have two use cases:

  1. Content classification
  2. Similarity search

I need to train LDA on a large corpus of web page content and then vectorize all the web pages. Training LDA takes a very long time. I can use several dedicated servers for training, but they are not on the same local network, so I modified distributed LDA to fit my case.

@piskvorky
Owner

piskvorky commented Jul 12, 2016

Thanks, interesting! Is this a personal project, academic research or a commercial project? (We keep a list of gensim adopters.)

@menshikh-iv
Contributor Author

@piskvorky personal research for now

@tmylk tmylk merged commit 6a289fe into piskvorky:develop Jul 13, 2016
@tmylk
Contributor

tmylk commented Jul 13, 2016

@menshikh-iv Thanks for the PR! Could you add a short notebook-style tutorial for this feature and a note in the changelog?

@menshikh-iv
Contributor Author

menshikh-iv commented Jul 13, 2016

@tmylk, unfortunately a notebook-style tutorial would not be useful for this feature, because I can't demonstrate it inside a notebook. Maybe I could update this page in the documentation with small examples (like the message above)?

About the changelog: should I add an entry under 0.3.12 in CHANGELOG.md?

And should I create a new PR for these changes?

@tmylk
Contributor

tmylk commented Jul 14, 2016

Hi @menshikh-iv, 0.3.12 is the right version to use. A new small PR would be good.

Updating this page with instructions would be great:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/src/distributed.rst

@manojpandey
Contributor

Documentation changed from rst to markdown here: #859

@menshikh-iv menshikh-iv deleted the distributed-lda-options branch February 19, 2018 04:40