Efficient vector representation for documents through corruption #1159

Open · Threynaud opened this issue Feb 22, 2017 · 19 comments
Labels: difficulty medium, feature, wishlist

Comments

@Threynaud

I came across this paper and I think it is a very promising alternative to Doc2Vec, which doesn't scale at all with the size of the corpus!

The model architecture seems very similar to Word2Vec and it would be a nice addition to Gensim.
If you think it is a good idea, I would be glad to code a PoC for it!

@tmylk
Contributor

tmylk commented Feb 22, 2017

Thanks for the link. There should at least be an integration of this code with gensim. Hoping the code will appear soon in this GitHub repo:

mchen24/iclr2017#1

It is indeed a very interesting way to expand on paragraph vectors and tf-idf-averaged word2vec. Would be good to compare it to tf-idf-averaged word2vec (not just idf) when the code becomes available.
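
For context, here is a minimal sketch of that idf-weighted averaging baseline, using current gensim APIs; the toy corpus and the function are just illustrative placeholders, not a gensim API proposal:

```python
# Sketch (not gensim API) of the idf-weighted word2vec baseline: a document
# vector is the idf-weighted average of its words' vectors.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, Word2Vec

docs = [["human", "machine", "interface"], ["graph", "of", "trees"]]  # toy corpus
w2v = Word2Vec(docs, min_count=1, vector_size=50)
dictionary = Dictionary(docs)
tfidf = TfidfModel(dictionary=dictionary)  # exposes idf per term id via .idfs

def idf_weighted_docvec(tokens):
    pairs = [(w2v.wv[w], tfidf.idfs[dictionary.token2id[w]])
             for w in tokens if w in w2v.wv and w in dictionary.token2id]
    if not pairs:
        return np.zeros(w2v.wv.vector_size)
    vecs, weights = zip(*pairs)
    return np.average(vecs, axis=0, weights=weights)

print(idf_weighted_docvec(["human", "interface"]))
```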

@Threynaud
Author

I sent her an email about this; I'll keep you posted! If she doesn't plan to make it public in the near future, I'll try to implement it!

@tmylk tmylk added the wishlist label Feb 22, 2017
@zjcerwin

zjcerwin commented Mar 1, 2017

The author made the code public 6 hours ago.

@Threynaud
Author

Looks like it is directly integrated into Mikolov's code, but I don't feel like I can make the change in Gensim by myself; I am too bad with Cython…

@gojomo
Collaborator

gojomo commented Mar 1, 2017

From a quick glance at the paper, the idea of composing a document-vector from a random subset of whole-doc words made sense to me and would be pretty easy to add as an option in gensim. However, I found the paper's description of the drop-out process – "randomly overwrites each dimension of the original document x with probability q" – a bit unclear: does it only happen to the composed doc-vec? Or to the combined doc-vec + neighboring-words-context? (The paper also mentions "each word in a document is corrupted independently of others", which as far as I can tell doesn't match the earlier descriptions of leaving-out-words or zeroing-out-dimensions-of-the-docvec.)
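
To make the leave-out-words reading concrete, here's a rough sketch of that interpretation; all names are placeholders, and this is one reading of the paper, not its reference code:

```python
# Sketch of the "random subset of whole-doc words" reading of Doc2VecC:
# keep each word independently with probability (1 - q), then average the
# survivors' input vectors, rescaled so the expectation matches the full
# uncorrupted average. Whether the paper instead zeroes out *dimensions*
# of the composed doc-vec is exactly the ambiguity discussed above.
import numpy as np

def corrupted_docvec(embeddings, doc_word_ids, q, rng):
    # embeddings: (vocab_size, dim) input vectors; q: corruption rate
    keep = rng.random(len(doc_word_ids)) >= q
    if not keep.any():
        return np.zeros(embeddings.shape[1])
    survivors = embeddings[np.asarray(doc_word_ids)[keep]]
    # sum over survivors, normalized by full doc length and by 1/(1 - q)
    return survivors.sum(axis=0) / (len(doc_word_ids) * (1.0 - q))

rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 100))
print(corrupted_docvec(emb, [3, 17, 256, 999], q=0.5, rng=rng))
```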

But doing a diff of the https://github.com/mchen24/iclr2017 code & Mikolov's "--sentence-vectors" word2vec.c patch should clarify this precisely.

The paper's results on IMDB sentiment-prediction are only very slightly better than plain Paragraph Vectors, though on classification and semantic-relatedness there seem to be bigger improvements. However, the exact metaparameters used aren't specified, so it's unclear whether the new method, highly parameter-optimized, is being compared against just one set of PV parameters. Knowing for sure that the method often/usually helps should be a prerequisite before adding complexity to the existing PV-Doc2Vec code.

@Threynaud
Author

Threynaud commented Mar 2, 2017

@gojomo I agree that the performance is only slightly better than state-of-the-art methods (and only on specific datasets), but I think what makes this paper interesting is that the model scales with the size of the corpus, something I personally struggle with in the current implementation of Doc2Vec!
However, I also agree with your last point! This is the main problem with the majority of ML papers nowadays: parameters are always omitted, which is a shame!

@mchen24

mchen24 commented Apr 28, 2017

Thank you all for the interest in the work. The main motivation of the work is to address the scalability issues of PV: in particular, the number of parameters grows with the training corpus, and generating representations of unseen documents is inefficient. We found the new method also improves the quality of the learned representations. There's only one additional hyper-parameter for Doc2VecC, the corruption rate, and I don't find the method to be particularly sensitive to it. The general guideline is to use a large corruption rate, close to 1, for longer documents, and a lower corruption rate for shorter documents.
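
One way to read that guideline (an interpretation, not from the paper): choose the corruption rate q so that the expected number of surviving words, (1 − q) · T, stays in a workable range. A hypothetical helper:

```python
# Hypothetical helper (not from the paper): with corruption rate q, a
# document of length T keeps about (1 - q) * T words in expectation, so
# longer documents tolerate q closer to 1.
def suggested_corruption_rate(doc_len, target_kept=10, max_q=0.95):
    q = 1.0 - target_kept / max(doc_len, target_kept)
    return min(q, max_q)

print(suggested_corruption_rate(1000))  # 0.95 (long doc: aggressive, capped)
print(suggested_corruption_rate(20))    # 0.5  (short doc: gentler corruption)
```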

@tmylk tmylk added the difficulty easy label May 2, 2017
@ssafty

ssafty commented Jun 13, 2017

I used the C implementation of the corruption model, which gave me some very interesting results.
However, it's not convenient, since I have to use a single core to avoid dirty writes, and I would love to extend it beyond random corruption.

So I wonder if anyone has started implementing it in gensim? If not, is it okay for me to start (from a licensing/copyright point of view)? Thank you

@gojomo
Collaborator

gojomo commented Jun 13, 2017

@Saftophobia I don't know of anyone working on adding the 'Doc2VecC' method to gensim. It looks valuable and seems eligible for re-implementation in Python/Cython. (I see no IP claims and the paper author's own edits to Mikolov's sentence-vectors word2vec.c patch are marked as Apache-licensed.)

So I'd say it'd be a good thing for you to work on! When a PR is started – even a rough, in-progress one seeking early feedback – I'd be looking to make sure any refactoring/extensions to the existing Doc2Vec functionality avoid unnecessary complexity and keep overhead negligible for people not using the extra techniques of Doc2VecC.

Thanks!

@menshikh-iv menshikh-iv added the feature and difficulty medium labels and removed the difficulty easy label Oct 2, 2017
@GeneralSemantics

Is this still underway?

@menshikh-iv
Contributor

@GeneralSemantics As far as I know, nobody is working on it now; the last attempt, #1985, was abandoned 😞

@ssafty

ssafty commented Nov 26, 2018

@GeneralSemantics I'm working on it, waiting for PR review now.

@GeneralSemantics

@Saftophobia Awesome, thanks for the update!

@gojomo
Collaborator

gojomo commented Nov 27, 2018

While I still find this technique interesting, as per my comment #1159 (comment) I'm still wondering if there are any cases where it really shines.

And I'd prefer any implementation of this still-rather-obscure mode to avoid adding complexity/overhead to the mainline modes. That's hard, because the prior code is already a mess, and in my opinion the prior refactoring of #1777 only made things worse.

As it stands, #1985 just layers even more special mode-specific state and branching into the main word2vec code paths (rather than using specialized subclasses, swappable policy-classes, or even Doc2Vec-specific classes). It's a bit hard to review, and will make all other changes to this core code harder to review and understand in the future. It might impose a performance drag on users of word2vec who aren't using this mode. In my opinion, the benefits and popular demand for this mode would have to be quite large to justify this sort of toggled-in complexity.

The threshold for accepting such functionality would be lower if it were simpler to enable, but that could require a refactoring effort of herculean proportions – already difficult given the dual python-and-cython paths, and now the dual object-and-sourcefile paths and the accumulated layering and duplication of prior efforts.

@piskvorky
Owner

piskvorky commented Jan 13, 2019

Agreed with @gojomo .

On the last note: I consider the #1777 refactoring disastrous. It's hard to imagine contributing meaningfully any more (even for me).

Starting the models from a clean design, standalone and nimble (sharing concepts and code only where needed, and avoiding the various weird / hypothetical academic add-ons – which was our original refactoring goal), would simplify developing, experimenting with, and ultimately possibly adding new models (such as the one in the current ticket).

I don't think the effort is herculean, but it'd certainly have to be done by a good engineer intimate with the existing use-cases, models, APIs and code (a rather narrow circle).

@gojomo
Collaborator

gojomo commented Jan 13, 2019

Starting over isn't what would be herculean. Rather, refactoring the existing code, while maintaining all existing interfaces, and all existing layered-on options, would be. (It's the Augean stables in there!)

My suggestions now would be about the same as offered in #1623 & also while #1777 was underway:

  • discard backward compatibility: build the new functionality in parallel new classes (which can then be compared directly, in the same notebook/tests, as they approach or exceed the performance/APIs/functions of the older classes)
  • have a tangible checklist of the kinds of "weird"/advanced functions people want to add, and make sure the core changes actually make some important subset of those easier to add cleanly (rather than just adding a bunch of extra levels of indirection/subclassing that hypothetically add extension possibilities without any proven examples)
  • seriously investigate numba as a way to get cython-like performance via a single pythonic codebase (see the toy sketch after this list)
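
On the numba point, a toy example of the kind of hot loop it can compile – nothing gensim-specific, just an illustration of the "single pythonic codebase" idea:

```python
# Toy numba example: a plain-Python loop compiled to machine code on
# first call, giving cython-like speed without a separate .pyx source.
import numpy as np
from numba import njit

@njit
def dot(a, b):
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
print(dot(a, b))  # first call compiles; later calls run at native speed
```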

@menshikh-iv
Contributor

I'm +1 for

  • discard backward compatibility: we can't do something clear and simple if we continue to support old models (see gensim.models.deprecated, which exists ONLY for backward-compatibility reasons and is a mess in the codebase)
  • have a tangible checklist of the kinds of "weird"/advanced functions people want to add, and make sure the core changes actually make some important subset of those easier to add cleanly: the best option, I guess; we should base this on "what users actually need", not "what users potentially need". Focus on concrete extension examples (that can easily be added in practice, not just in theory, as happened with Re-design "*2vec" implementations #1777)

@mpenkov
Collaborator

mpenkov commented Jan 14, 2019

+1 for the first two points raised by @gojomo (I'm unfamiliar with numba, so I'm unable to comment on the third).

I've been actively digging through our FastText implementation recently, and I concur with @piskvorky and @gojomo: it's a mess. I've untangled some of the worst parts. It looks a lot better now than before (I hope @menshikh-iv can confirm), but there's still a long way to go.

I think going forward, it would be worth making the goals and values of the Gensim project more obvious to potential contributors. For example, one of the problems with the #1777 refactoring is that it appears to have been misguided: the authors prioritized the removal of copypasta above all else, and added layers of abstraction and inheritance to meet that goal. If, for example, it were obvious that we value simplicity and loose coupling (e.g. FastText and word2vec are their own separate things), then the refactoring would have been more of a success.

@ssephillip

Is there any progress on this issue?
For us, Doc2VecC is very interesting because of the much shorter inference time for unseen documents (compared to Doc2Vec). In our use case, the user has to wait while the unseen document is inferred, so Doc2VecC is potentially very valuable to us.
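
For reference, the reason inference on unseen documents is cheap in Doc2VecC is that the paper represents a document as a plain average of its words' learned vectors, so no per-document gradient steps are needed. A minimal sketch, assuming a trained word-vector lookup (the names are placeholders):

```python
# Minimal sketch of Doc2VecC-style inference: average the learned word
# vectors of the unseen document (no per-document training required).
import numpy as np

def infer_docvec(tokens, word_vecs, dim):
    # word_vecs: mapping word -> trained np.ndarray of shape (dim,)
    known = [word_vecs[w] for w in tokens if w in word_vecs]
    return np.mean(known, axis=0) if known else np.zeros(dim)
```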
