Read NIPS data on the fly (#3082)
* Read NIPS data on the fly

fix #3074

* Simplify download of NIPS data

* Add nmslib to requirements_docs.txt
jonaschn authored Mar 22, 2021
1 parent 04f3414 commit 6851524
Showing 8 changed files with 572 additions and 334 deletions.
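The substance of the commit is easiest to see in isolation. The old notebook cell downloaded the NIPS tarball to local disk first (working around smart_open issue #331, per its own comment) and only then opened it; the new cell hands the remote stream straight to `tarfile`. A minimal sketch of the new pattern, condensed from the `run_lda.ipynb` diff further down:

```python
import re
import tarfile

import smart_open


def extract_documents(url='https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'):
    # smart_open returns a file-like object over the remote URL, so tarfile
    # can consume the archive directly -- no temporary copy on local disk.
    with smart_open.open(url, "rb") as fileobj:
        with tarfile.open(fileobj=fileobj) as tar:
            for member in tar.getmembers():
                # Keep only the per-paper text files, e.g. nipstxt/nips12/0123.txt,
                # skipping directory entries and files like README.
                if member.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', member.name):
                    yield tar.extractfile(member).read().decode('utf-8', errors='replace')
```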
36 changes: 18 additions & 18 deletions docs/src/auto_examples/index.rst
@@ -71,7 +71,7 @@ Understanding this functionality is vital for using gensim effectively.

.. raw:: html

-    <div class="sphx-glr-thumbcontainer" tooltip="Introduces transformations and demonstrates their use on a toy corpus. ">
+    <div class="sphx-glr-thumbcontainer" tooltip="Introduces transformations and demonstrates their use on a toy corpus.">

.. only:: html

@@ -92,7 +92,7 @@ Understanding this functionality is vital for using gensim effectively.

.. raw:: html

-    <div class="sphx-glr-thumbcontainer" tooltip="Demonstrates querying a corpus for similar documents. ">
+    <div class="sphx-glr-thumbcontainer" tooltip="Demonstrates querying a corpus for similar documents.">

.. only:: html

@@ -169,7 +169,7 @@ Learning-oriented lessons that introduce a particular gensim feature, e.g. a mod

.. raw:: html

-    <div class="sphx-glr-thumbcontainer" tooltip="Introduces Gensim&#x27;s fastText model and demonstrates its use on the Lee Corpus. ">
+    <div class="sphx-glr-thumbcontainer" tooltip="Introduces Gensim&#x27;s fastText model and demonstrates its use on the Lee Corpus.">

.. only:: html

@@ -190,7 +190,7 @@ Learning-oriented lessons that introduce a particular gensim feature, e.g. a mod

.. raw:: html

-    <div class="sphx-glr-thumbcontainer" tooltip="Introduces the Annoy library for similarity queries on top of vectors learned by Word2Vec. ">
+    <div class="sphx-glr-thumbcontainer" tooltip="Introduces the Annoy library for similarity queries on top of vectors learned by Word2Vec.">

.. only:: html

@@ -211,14 +211,14 @@ Learning-oriented lessons that introduce a particular gensim feature, e.g. a mod

.. raw:: html

-    <div class="sphx-glr-thumbcontainer" tooltip="Introduces Gensim&#x27;s LDA model and demonstrates its use on the NIPS corpus.">
+    <div class="sphx-glr-thumbcontainer" tooltip="Demonstrates using Gensim&#x27;s implemenation of the SCM.">

.. only:: html

-    .. figure:: /auto_examples/tutorials/images/thumb/sphx_glr_run_lda_thumb.png
-        :alt: LDA Model
+    .. figure:: /auto_examples/tutorials/images/thumb/sphx_glr_run_scm_thumb.png
+        :alt: Soft Cosine Measure

-    :ref:`sphx_glr_auto_examples_tutorials_run_lda.py`
+    :ref:`sphx_glr_auto_examples_tutorials_run_scm.py`

.. raw:: html

@@ -228,18 +228,18 @@ Learning-oriented lessons that introduce a particular gensim feature, e.g. a mod
.. toctree::
:hidden:

-   /auto_examples/tutorials/run_lda
+   /auto_examples/tutorials/run_scm

.. raw:: html

-    <div class="sphx-glr-thumbcontainer" tooltip="Demonstrates using Gensim&#x27;s implemenation of the SCM.">
+    <div class="sphx-glr-thumbcontainer" tooltip="Introduces Gensim&#x27;s LDA model and demonstrates its use on the NIPS corpus.">

.. only:: html

-    .. figure:: /auto_examples/tutorials/images/thumb/sphx_glr_run_scm_thumb.png
-        :alt: Soft Cosine Measure
+    .. figure:: /auto_examples/tutorials/images/thumb/sphx_glr_run_lda_thumb.png
+        :alt: LDA Model

-    :ref:`sphx_glr_auto_examples_tutorials_run_scm.py`
+    :ref:`sphx_glr_auto_examples_tutorials_run_lda.py`

.. raw:: html

@@ -249,7 +249,7 @@ Learning-oriented lessons that introduce a particular gensim feature, e.g. a mod
.. toctree::
:hidden:

-   /auto_examples/tutorials/run_scm
+   /auto_examples/tutorials/run_lda

.. raw:: html

@@ -288,7 +288,7 @@ These **goal-oriented guides** demonstrate how to **solve a specific problem** u

.. raw:: html

-    <div class="sphx-glr-thumbcontainer" tooltip="Demonstrates simple and quick access to common corpora and pretrained models. ">
+    <div class="sphx-glr-thumbcontainer" tooltip="Demonstrates simple and quick access to common corpora and pretrained models.">

.. only:: html

@@ -309,7 +309,7 @@ These **goal-oriented guides** demonstrate how to **solve a specific problem** u

.. raw:: html

-    <div class="sphx-glr-thumbcontainer" tooltip="How to author documentation for Gensim. ">
+    <div class="sphx-glr-thumbcontainer" tooltip="How to author documentation for Gensim.">

.. only:: html

@@ -426,13 +426,13 @@ Blog posts, tutorial videos, hackathons and other useful Gensim resources, from
.. container:: sphx-glr-download sphx-glr-download-python
-    :download:`Download all examples in Python source code: auto_examples_python.zip <//Volumes/work/workspace/gensim/trunk/docs/src/auto_examples/auto_examples_python.zip>`
+    :download:`Download all examples in Python source code: auto_examples_python.zip </auto_examples/auto_examples_python.zip>`
.. container:: sphx-glr-download sphx-glr-download-jupyter
-    :download:`Download all examples in Jupyter notebooks: auto_examples_jupyter.zip <//Volumes/work/workspace/gensim/trunk/docs/src/auto_examples/auto_examples_jupyter.zip>`
+    :download:`Download all examples in Jupyter notebooks: auto_examples_jupyter.zip </auto_examples/auto_examples_jupyter.zip>`
.. only:: html
14 changes: 7 additions & 7 deletions docs/src/auto_examples/tutorials/run_lda.ipynb
@@ -15,7 +15,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"\nLDA Model\n=========\n\nIntroduces Gensim's LDA model and demonstrates its use on the NIPS corpus.\n\n\n"
"\n# LDA Model\n\nIntroduces Gensim's LDA model and demonstrates its use on the NIPS corpus.\n"
]
},
{
@@ -33,7 +33,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The purpose of this tutorial is to demonstrate how to train and tune an LDA model.\n\nIn this tutorial we will:\n\n* Load input data.\n* Pre-process that data.\n* Transform documents into bag-of-words vectors.\n* Train an LDA model.\n\nThis tutorial will **not**:\n\n* Explain how Latent Dirichlet Allocation works\n* Explain how the LDA model performs inference\n* Teach you all the parameters and options for Gensim's LDA implementation\n\nIf you are not familiar with the LDA model or how to use it in Gensim, I (Olavur Mortensen)\nsuggest you read up on that before continuing with this tutorial. Basic\nunderstanding of the LDA model should suffice. Examples:\n\n* `Introduction to Latent Dirichlet Allocation <http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation>`_\n* Gensim tutorial: `sphx_glr_auto_examples_core_run_topics_and_transformations.py`\n* Gensim's LDA model API docs: :py:class:`gensim.models.LdaModel`\n\nI would also encourage you to consider each step when applying the model to\nyour data, instead of just blindly applying my solution. The different steps\nwill depend on your data and possibly your goal with the model.\n\nData\n----\n\nI have used a corpus of NIPS papers in this tutorial, but if you're following\nthis tutorial just to learn about LDA I encourage you to consider picking a\ncorpus on a subject that you are familiar with. Qualitatively evaluating the\noutput of an LDA model is challenging and can require you to understand the\nsubject matter of your corpus (depending on your goal with the model).\n\nNIPS (Neural Information Processing Systems) is a machine learning conference\nso the subject matter should be well suited for most of the target audience\nof this tutorial. You can download the original data from Sam Roweis'\n`website <http://www.cs.nyu.edu/~roweis/data.html>`_. The code below will\nalso do that for you.\n\n.. Important::\n The corpus contains 1740 documents, and not particularly long ones.\n So keep in mind that this tutorial is not geared towards efficiency, and be\n careful before applying the code to a large dataset.\n\n\n"
"The purpose of this tutorial is to demonstrate how to train and tune an LDA model.\n\nIn this tutorial we will:\n\n* Load input data.\n* Pre-process that data.\n* Transform documents into bag-of-words vectors.\n* Train an LDA model.\n\nThis tutorial will **not**:\n\n* Explain how Latent Dirichlet Allocation works\n* Explain how the LDA model performs inference\n* Teach you all the parameters and options for Gensim's LDA implementation\n\nIf you are not familiar with the LDA model or how to use it in Gensim, I (Olavur Mortensen)\nsuggest you read up on that before continuing with this tutorial. Basic\nunderstanding of the LDA model should suffice. Examples:\n\n* `Introduction to Latent Dirichlet Allocation <http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation>`_\n* Gensim tutorial: `sphx_glr_auto_examples_core_run_topics_and_transformations.py`\n* Gensim's LDA model API docs: :py:class:`gensim.models.LdaModel`\n\nI would also encourage you to consider each step when applying the model to\nyour data, instead of just blindly applying my solution. The different steps\nwill depend on your data and possibly your goal with the model.\n\n## Data\n\nI have used a corpus of NIPS papers in this tutorial, but if you're following\nthis tutorial just to learn about LDA I encourage you to consider picking a\ncorpus on a subject that you are familiar with. Qualitatively evaluating the\noutput of an LDA model is challenging and can require you to understand the\nsubject matter of your corpus (depending on your goal with the model).\n\nNIPS (Neural Information Processing Systems) is a machine learning conference\nso the subject matter should be well suited for most of the target audience\nof this tutorial. You can download the original data from Sam Roweis'\n`website <http://www.cs.nyu.edu/~roweis/data.html>`_. The code below will\nalso do that for you.\n\n.. Important::\n The corpus contains 1740 documents, and not particularly long ones.\n So keep in mind that this tutorial is not geared towards efficiency, and be\n careful before applying the code to a large dataset.\n\n\n"
]
},
{
@@ -44,7 +44,7 @@
},
"outputs": [],
"source": [
"import io\nimport os.path\nimport re\nimport tarfile\n\nimport smart_open\n\ndef extract_documents(url='https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'):\n fname = url.split('/')[-1]\n\n # Download the file to local storage first.\n # We can't read it on the fly because of\n # https://github.com/RaRe-Technologies/smart_open/issues/331\n if not os.path.isfile(fname):\n with smart_open.open(url, \"rb\") as fin:\n with smart_open.open(fname, 'wb') as fout:\n while True:\n buf = fin.read(io.DEFAULT_BUFFER_SIZE)\n if not buf:\n break\n fout.write(buf)\n\n with tarfile.open(fname, mode='r:gz') as tar:\n # Ignore directory entries, as well as files like README, etc.\n files = [\n m for m in tar.getmembers()\n if m.isfile() and re.search(r'nipstxt/nips\\d+/\\d+\\.txt', m.name)\n ]\n for member in sorted(files, key=lambda x: x.name):\n member_bytes = tar.extractfile(member).read()\n yield member_bytes.decode('utf-8', errors='replace')\n\ndocs = list(extract_documents())"
"import io\nimport os.path\nimport re\nimport tarfile\n\nimport smart_open\n\ndef extract_documents(url='https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'):\n with smart_open.open(url, \"rb\") as file:\n with tarfile.open(fileobj=file) as tar:\n for member in tar.getmembers():\n if member.isfile() and re.search(r'nipstxt/nips\\d+/\\d+\\.txt', member.name):\n member_bytes = tar.extractfile(member).read()\n yield member_bytes.decode('utf-8', errors='replace')\n\ndocs = list(extract_documents())"
]
},
{
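A quick smoke test of the new cell, assuming the `extract_documents` sketch above; the expected count comes from the tutorial's own prose ("The corpus contains 1740 documents"):

```python
docs = list(extract_documents())
print(len(docs))      # the tutorial states the corpus holds 1740 documents
print(docs[0][:200])  # peek at the start of the first paper
```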
@@ -69,7 +69,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Pre-process and vectorize the documents\n---------------------------------------\n\nAs part of preprocessing, we will:\n\n* Tokenize (split the documents into tokens).\n* Lemmatize the tokens.\n* Compute bigrams.\n* Compute a bag-of-words representation of the data.\n\nFirst we tokenize the text using a regular expression tokenizer from NLTK. We\nremove numeric tokens and tokens that are only a single character, as they\ndon't tend to be useful, and the dataset contains a lot of them.\n\n.. Important::\n\n This tutorial uses the nltk library for preprocessing, although you can\n replace it with something else if you want.\n\n\n"
"## Pre-process and vectorize the documents\n\nAs part of preprocessing, we will:\n\n* Tokenize (split the documents into tokens).\n* Lemmatize the tokens.\n* Compute bigrams.\n* Compute a bag-of-words representation of the data.\n\nFirst we tokenize the text using a regular expression tokenizer from NLTK. We\nremove numeric tokens and tokens that are only a single character, as they\ndon't tend to be useful, and the dataset contains a lot of them.\n\n.. Important::\n\n This tutorial uses the nltk library for preprocessing, although you can\n replace it with something else if you want.\n\n\n"
]
},
{
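The cell above changes only its heading style, but for readers skimming the diff, here is a condensed sketch of the preprocessing steps it describes (tokenize, filter, lemmatize, bigrams, bag-of-words). The `min_count=20` threshold is illustrative, not taken from this diff:

```python
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

from gensim.corpora import Dictionary
from gensim.models import Phrases

tokenizer = RegexpTokenizer(r'\w+')
lemmatizer = WordNetLemmatizer()

tokenized = []
for doc in docs:
    tokens = tokenizer.tokenize(doc.lower())
    # Drop numeric and single-character tokens, which the tutorial notes
    # are rarely useful and very common in this dataset.
    tokens = [t for t in tokens if not t.isnumeric() and len(t) > 1]
    tokenized.append([lemmatizer.lemmatize(t) for t in tokens])

# Append bigrams that occur at least 20 times, e.g. "machine_learning".
bigram = Phrases(tokenized, min_count=20)
for idx in range(len(tokenized)):
    tokenized[idx].extend(tok for tok in bigram[tokenized[idx]] if '_' in tok)

dictionary = Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]
```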
@@ -177,7 +177,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Training\n--------\n\nWe are ready to train the LDA model. We will first discuss how to set some of\nthe training parameters.\n\nFirst of all, the elephant in the room: how many topics do I need? There is\nreally no easy answer for this, it will depend on both your data and your\napplication. I have used 10 topics here because I wanted to have a few topics\nthat I could interpret and \"label\", and because that turned out to give me\nreasonably good results. You might not need to interpret all your topics, so\nyou could use a large number of topics, for example 100.\n\n``chunksize`` controls how many documents are processed at a time in the\ntraining algorithm. Increasing chunksize will speed up training, at least as\nlong as the chunk of documents easily fit into memory. I've set ``chunksize =\n2000``, which is more than the amount of documents, so I process all the\ndata in one go. Chunksize can however influence the quality of the model, as\ndiscussed in Hoffman and co-authors [2], but the difference was not\nsubstantial in this case.\n\n``passes`` controls how often we train the model on the entire corpus.\nAnother word for passes might be \"epochs\". ``iterations`` is somewhat\ntechnical, but essentially it controls how often we repeat a particular loop\nover each document. It is important to set the number of \"passes\" and\n\"iterations\" high enough.\n\nI suggest the following way to choose iterations and passes. First, enable\nlogging (as described in many Gensim tutorials), and set ``eval_every = 1``\nin ``LdaModel``. When training the model look for a line in the log that\nlooks something like this::\n\n 2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - 68/1566 documents converged within 400 iterations\n\nIf you set ``passes = 20`` you will see this line 20 times. Make sure that by\nthe final passes, most of the documents have converged. So you want to choose\nboth passes and iterations to be high enough for this to happen.\n\nWe set ``alpha = 'auto'`` and ``eta = 'auto'``. Again this is somewhat\ntechnical, but essentially we are automatically learning two parameters in\nthe model that we usually would have to specify explicitly.\n\n\n"
"## Training\n\nWe are ready to train the LDA model. We will first discuss how to set some of\nthe training parameters.\n\nFirst of all, the elephant in the room: how many topics do I need? There is\nreally no easy answer for this, it will depend on both your data and your\napplication. I have used 10 topics here because I wanted to have a few topics\nthat I could interpret and \"label\", and because that turned out to give me\nreasonably good results. You might not need to interpret all your topics, so\nyou could use a large number of topics, for example 100.\n\n``chunksize`` controls how many documents are processed at a time in the\ntraining algorithm. Increasing chunksize will speed up training, at least as\nlong as the chunk of documents easily fit into memory. I've set ``chunksize =\n2000``, which is more than the amount of documents, so I process all the\ndata in one go. Chunksize can however influence the quality of the model, as\ndiscussed in Hoffman and co-authors [2], but the difference was not\nsubstantial in this case.\n\n``passes`` controls how often we train the model on the entire corpus.\nAnother word for passes might be \"epochs\". ``iterations`` is somewhat\ntechnical, but essentially it controls how often we repeat a particular loop\nover each document. It is important to set the number of \"passes\" and\n\"iterations\" high enough.\n\nI suggest the following way to choose iterations and passes. First, enable\nlogging (as described in many Gensim tutorials), and set ``eval_every = 1``\nin ``LdaModel``. When training the model look for a line in the log that\nlooks something like this::\n\n 2016-06-21 15:40:06,753 - gensim.models.ldamodel - DEBUG - 68/1566 documents converged within 400 iterations\n\nIf you set ``passes = 20`` you will see this line 20 times. Make sure that by\nthe final passes, most of the documents have converged. So you want to choose\nboth passes and iterations to be high enough for this to happen.\n\nWe set ``alpha = 'auto'`` and ``eta = 'auto'``. Again this is somewhat\ntechnical, but essentially we are automatically learning two parameters in\nthe model that we usually would have to specify explicitly.\n\n\n"
]
},
{
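The parameter discussion in that cell maps onto a single `LdaModel` call. A sketch under the values the prose suggests (10 topics, `chunksize=2000`, automatic priors), assuming the `dictionary` and `corpus` built above; `eval_every=1` is the tuning aid the text recommends and slows training, so drop it once passes and iterations are settled:

```python
from gensim.models import LdaModel

_ = dictionary[0]  # accessing any id initializes dictionary.id2token (lazy attribute)

model = LdaModel(
    corpus=corpus,
    id2word=dictionary.id2token,
    num_topics=10,    # few enough topics to interpret and "label"
    chunksize=2000,   # larger than the corpus, so each pass is one chunk
    passes=20,        # epochs over the full corpus
    iterations=400,   # inner-loop repetitions per document
    alpha='auto',     # learn the document-topic prior
    eta='auto',       # learn the topic-word prior
    eval_every=1,     # log the per-pass convergence line quoted above
)
```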
@@ -213,7 +213,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Things to experiment with\n-------------------------\n\n* ``no_above`` and ``no_below`` parameters in ``filter_extremes`` method.\n* Adding trigrams or even higher order n-grams.\n* Consider whether using a hold-out set or cross-validation is the way to go for you.\n* Try other datasets.\n\nWhere to go from here\n---------------------\n\n* Check out a RaRe blog post on the AKSW topic coherence measure (http://rare-technologies.com/what-is-topic-coherence/).\n* pyLDAvis (https://pyldavis.readthedocs.io/en/latest/index.html).\n* Read some more Gensim tutorials (https://github.com/RaRe-Technologies/gensim/blob/develop/tutorials.md#tutorials).\n* If you haven't already, read [1] and [2] (see references).\n\nReferences\n----------\n\n1. \"Latent Dirichlet Allocation\", Blei et al. 2003.\n2. \"Online Learning for Latent Dirichlet Allocation\", Hoffman et al. 2010.\n\n\n"
"## Things to experiment with\n\n* ``no_above`` and ``no_below`` parameters in ``filter_extremes`` method.\n* Adding trigrams or even higher order n-grams.\n* Consider whether using a hold-out set or cross-validation is the way to go for you.\n* Try other datasets.\n\n## Where to go from here\n\n* Check out a RaRe blog post on the AKSW topic coherence measure (http://rare-technologies.com/what-is-topic-coherence/).\n* pyLDAvis (https://pyldavis.readthedocs.io/en/latest/index.html).\n* Read some more Gensim tutorials (https://github.com/RaRe-Technologies/gensim/blob/develop/tutorials.md#tutorials).\n* If you haven't already, read [1] and [2] (see references).\n\n## References\n\n1. \"Latent Dirichlet Allocation\", Blei et al. 2003.\n2. \"Online Learning for Latent Dirichlet Allocation\", Hoffman et al. 2010.\n\n\n"
]
}
],
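The first experiment that cell suggests is a one-liner; the thresholds below are illustrative starting points, assuming the `dictionary` and token lists from the preprocessing sketch:

```python
# Keep tokens that appear in at least 20 documents but in no more
# than 50% of them, then rebuild the bag-of-words corpus.
dictionary.filter_extremes(no_below=20, no_above=0.5)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]
```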
@@ -233,7 +233,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.7.0"
}
},
"nbformat": 4,
