Efficient vector representation for documents through corruption #1159

Open · Threynaud opened this issue Feb 22, 2017 · 19 comments
Labels: difficulty medium, feature, wishlist

Comments

@Threynaud

I came across this paper and I think it is a very promising alternative to Doc2Vec, which doesn't scale at all with the size of the corpus!

The model architecture seems very similar to Word2Vec and it would be a nice addition to Gensim.
If you think it is a good idea, I would be glad to code a PoC for it!

@tmylk
Contributor

tmylk commented Feb 22, 2017

Thanks for the link. There should at least be an integration of this code with gensim. Hoping the code will appear soon in this GitHub repo:

mchen24/iclr2017#1

It is indeed a very interesting way to expand on paragraph vectors and tf-idf-averaged word2vec. Would be good to compare it to tf-idf-averaged word2vec (not just idf) when the code becomes available.
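
For context, here is a minimal sketch of that idf-weighted averaging baseline, using current gensim APIs; the toy corpus and the function are just illustrative placeholders, not a gensim API proposal:

```python
# Sketch (not gensim API) of the idf-weighted word2vec baseline: a document
# vector is the idf-weighted average of its words' vectors.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, Word2Vec

docs = [["human", "machine", "interface"], ["graph", "of", "trees"]]  # toy corpus
w2v = Word2Vec(docs, min_count=1, vector_size=50)
dictionary = Dictionary(docs)
tfidf = TfidfModel(dictionary=dictionary)  # exposes idf per term id via .idfs

def idf_weighted_docvec(tokens):
    pairs = [(w2v.wv[w], tfidf.idfs[dictionary.token2id[w]])
             for w in tokens if w in w2v.wv and w in dictionary.token2id]
    if not pairs:
        return np.zeros(w2v.wv.vector_size)
    vecs, weights = zip(*pairs)
    return np.average(vecs, axis=0, weights=weights)

print(idf_weighted_docvec(["human", "interface"]))
```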

@Threynaud
Author

I sent her an email about this; I'll keep you posted! If she doesn't plan to make it public in the near future, I'll try to implement it!

@tmylk tmylk added the wishlist label Feb 22, 2017
@zjcerwin

zjcerwin commented Mar 1, 2017

The author made the code public 6 hours ago.

@Threynaud
Author

Looks like it is directly integrated into Mikolov's code, but I don't feel like I can make the change in Gensim by myself; I am too bad with Cython…

@gojomo
Collaborator

gojomo commented Mar 1, 2017

From a quick glance at the paper, the idea of composing a document-vector from a random subset of whole-doc words made sense to me and would be pretty easy to add as an option in gensim. However, I found the paper's description of the drop-out process – "randomly overwrites each dimension of the original document x with probability q" – a bit unclear: does it only happen to the composed doc-vec? Or to the combined doc-vec + neighboring-words-context? (The paper also mentions "each word in a document is corrupted independently of others", which as far as I can tell doesn't match the earlier descriptions of leaving-out-words or zeroing-out-dimensions-of-the-docvec.)
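
To make the leave-out-words reading concrete, here's a rough sketch of that interpretation; all names are placeholders, and this is one reading of the paper, not its reference code:

```python
# Sketch of the "random subset of whole-doc words" reading of Doc2VecC:
# keep each word independently with probability (1 - q), then average the
# survivors' input vectors, rescaled so the expectation matches the full
# uncorrupted average. Whether the paper instead zeroes out *dimensions*
# of the composed doc-vec is exactly the ambiguity discussed above.
import numpy as np

def corrupted_docvec(embeddings, doc_word_ids, q, rng):
    # embeddings: (vocab_size, dim) input vectors; q: corruption rate
    keep = rng.random(len(doc_word_ids)) >= q
    if not keep.any():
        return np.zeros(embeddings.shape[1])
    survivors = embeddings[np.asarray(doc_word_ids)[keep]]
    # sum over survivors, normalized by full doc length and by 1/(1 - q)
    return survivors.sum(axis=0) / (len(doc_word_ids) * (1.0 - q))

rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 100))
print(corrupted_docvec(emb, [3, 17, 256, 999], q=0.5, rng=rng))
```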

But doing a diff of the https://github.com/mchen24/iclr2017 code & Mikolov's "--sentence-vectors" word2vec.c patch should clarify this precisely.

The paper's results on IMDB sentiment-prediction are only very slightly better than plain Paragraph Vectors, though on classification and semantic-relatedness there seem to be bigger improvements. However, the exact metaparameters used aren't specified, so it's unclear whether the new method, highly parameter-optimized, is being compared against just one set of PV parameters. Knowing for sure that the method often/usually helps should be a prerequisite before adding complexity to the existing PV-Doc2Vec code.

@Threynaud
Author

Threynaud commented Mar 2, 2017

@gojomo I agree that the performance is only slightly better than state-of-the-art methods (and only on specific datasets), but I think what makes this paper interesting is that the model scales with the size of the corpus, something I personally struggle with in the current implementation of Doc2Vec!
However, I also agree with your last point! This is the main problem with the majority of ML papers nowadays: parameters are always omitted, which is a shame!

@mchen24

mchen24 commented Apr 28, 2017

Thank you all for the interest in the work. The main motivation of the work is to address the scalability issues of PV: in particular, the number of parameters grows with the training corpus, and generating representations of unseen documents is inefficient. We found the new method also improves the quality of the learned representations. There's only one additional hyper-parameter for Doc2VecC, the corruption rate, and I don't find the method to be particularly sensitive to it. The general guideline is to use a large corruption rate, close to 1, for longer documents, and a lower corruption rate for shorter documents.
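
One way to read that guideline (an interpretation, not from the paper): choose the corruption rate q so that the expected number of surviving words, (1 − q) · T, stays in a workable range. A hypothetical helper:

```python
# Hypothetical helper (not from the paper): with corruption rate q, a
# document of length T keeps about (1 - q) * T words in expectation, so
# longer documents tolerate q closer to 1.
def suggested_corruption_rate(doc_len, target_kept=10, max_q=0.95):
    q = 1.0 - target_kept / max(doc_len, target_kept)
    return min(q, max_q)

print(suggested_corruption_rate(1000))  # 0.95 (long doc: aggressive, capped)
print(suggested_corruption_rate(20))    # 0.5  (short doc: gentler corruption)
```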

@tmylk tmylk added the difficulty easy label May 2, 2017
@ssafty

ssafty commented Jun 13, 2017

I used the C implementation of the corruption model, which gave me some very interesting results.
However, it's not convenient, since I have to use a single core to avoid dirty writes, and I would love to extend it beyond random corruption.

So I wonder if anyone has started implementing it in gensim? If not, is it okay for me to start (from a licensing/copyright point of view)? Thank you

@gojomo
Collaborator

gojomo commented Jun 13, 2017

@Saftophobia I don't know of anyone working on adding the 'Doc2VecC' method to gensim. It looks valuable and seems eligible for re-implementation in Python/Cython. (I see no IP claims and the paper author's own edits to Mikolov's sentence-vectors word2vec.c patch are marked as Apache-licensed.)

So I'd say it'd be a good thing for you to work on! When a PR is started – even a rough, in-progress one seeking early feedback – I'd be looking to make sure any refactoring/extensions to the existing Doc2Vec functionality avoid unnecessary complexity and keep overhead negligible for people not using the extra techniques of Doc2VecC.

Thanks!

@menshikh-iv menshikh-iv added the feature and difficulty medium labels and removed the difficulty easy label Oct 2, 2017
@GeneralSemantics

Is this still underway?

@menshikh-iv
Contributor

@GeneralSemantics As far as I know, nobody is working on it now; the last attempt, #1985, was abandoned 😞

@ssafty

ssafty commented Nov 26, 2018

@GeneralSemantics I'm working on it, waiting for PR review now.

@GeneralSemantics

@Saftophobia Awesome, thanks for the update!

@gojomo
Collaborator

gojomo commented Nov 27, 2018

While I still find this technique interesting, as per my comment #1159 (comment) I'm still wondering if there are any cases where it really shines.

And I'd prefer any implementation of this still-rather-obscure mode to avoid adding complexity/overhead to the mainline modes. That's hard, because the prior code is already a mess, and in my opinion the prior refactoring of #1777 only made things worse.

As it stands, #1985 just layers even more special mode-specific state and branching into the main word2vec code paths (rather than using specialized subclasses, swappable policy-classes, or even Doc2Vec-specific classes). It's a bit hard to review, and will make all other changes to this core code harder to review and understand in the future. It might impose a performance drag on users of word2vec who aren't using this mode. In my opinion, the benefits and popular demand for this mode would have to be quite large to justify this sort of toggled-in complexity.

The threshold for accepting such functionality would be lower if it were simpler to enable, but that could require a refactoring effort of herculean proportions – already difficult given the dual python-and-cython paths, and now the dual object-and-sourcefile paths and the accumulated layering and duplication of prior efforts.

@piskvorky
Owner

piskvorky commented Jan 13, 2019

Agreed with @gojomo .

On the last note: I consider the #1777 refactoring disastrous. It's hard to imagine contributing meaningfully any more (even for me).

Starting the models from a clean design, standalone and nimble (sharing concepts and code only where needed, and avoiding the various weird / hypothetical academic add-ons – which was our original refactoring goal), would simplify developing, experimenting with, and ultimately possibly adding new models (such as the one in the current ticket).

I don't think the effort is herculean, but it'd certainly have to be done by a good engineer intimate with the existing use-cases, models, APIs and code (a rather narrow circle).

@gojomo
Collaborator

gojomo commented Jan 13, 2019

Starting over isn't what would be herculean. Rather, refactoring the existing code, while maintaining all existing interfaces, and all existing layered-on options, would be. (It's the Augean stables in there!)

My suggestions now would be about the same as offered in #1623 & also while #1777 was underway:

  • discard backward compatibility: build the new functionality in parallel new classes (which can then be compared directly, in the same notebook/tests, as they approach or exceed the performance/APIs/functions of the older classes)
  • have a tangible checklist of the kinds of "weird"/advanced functions people want to add, and make sure the core changes actually make some important subset of those easier to add cleanly (rather than just adding a bunch of extra levels of indirection/subclassing that hypothetically add extension possibilities without any proven examples)
  • seriously investigate numba as a way to get cython-like performance via a single pythonic codebase (see the toy sketch after this list)
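
On the numba point, a toy example of the kind of hot loop it can compile – nothing gensim-specific, just an illustration of the "single pythonic codebase" idea:

```python
# Toy numba example: a plain-Python loop compiled to machine code on
# first call, giving cython-like speed without a separate .pyx source.
import numpy as np
from numba import njit

@njit
def dot(a, b):
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
print(dot(a, b))  # first call compiles; later calls run at native speed
```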

@menshikh-iv
Contributor

I'm +1 for

  • discard backward compatibility: we can't do something clear and simple if we continue to support old models (see gensim.models.deprecated, which exists ONLY for backward-compatibility reasons and is a mess in the codebase)
  • have a tangible checklist of the kinds of "weird"/advanced functions people want to add, and make sure the core changes actually make some important subset of those easier to add cleanly: the best option, I guess; we should base this on "what users actually need", not "what users potentially need". Focus on concrete extension examples (that can easily be added in practice, not just in theory, as happened with Re-design "*2vec" implementations #1777)

@mpenkov
Collaborator

mpenkov commented Jan 14, 2019

+1 for the first two points raised by @gojomo (I'm unfamiliar with numba, so I'm unable to comment on the third).

I've been actively digging through our FastText implementation recently, and I concur with @piskvorky and @gojomo: it's a mess. I've untangled some of the worst parts. It looks a lot better now than before (I hope @menshikh-iv can confirm), but there's still a long way to go.

I think going forward, it would be worth making the goals and values of the Gensim project more obvious to potential contributors. For example, one of the problems with the #1777 refactoring is that it appears to have been misguided: the authors prioritized the removal of copypasta above all else, and added layers of abstraction and inheritance to meet that goal. If, for example, it were obvious that we value simplicity and loose coupling (e.g. FastText and word2vec are their own separate things), then the refactoring would have been more of a success.

@ssephillip

Is there any progress on this issue?
For us, Doc2VecC is very interesting because of the much shorter inference time for unseen documents (compared to Doc2Vec). In our use case, the user has to wait while the unseen document is inferred, so Doc2VecC is potentially very valuable to us.
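
For reference, the reason inference on unseen documents is cheap in Doc2VecC is that the paper represents a document as a plain average of its words' learned vectors, so no per-document gradient steps are needed. A minimal sketch, assuming a trained word-vector lookup (the names are placeholders):

```python
# Minimal sketch of Doc2VecC-style inference: average the learned word
# vectors of the unseen document (no per-document training required).
import numpy as np

def infer_docvec(tokens, word_vecs, dim):
    # word_vecs: mapping word -> trained np.ndarray of shape (dim,)
    known = [word_vecs[w] for w in tokens if w in word_vecs]
    return np.mean(known, axis=0) if known else np.zeros(dim)
```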
