Efficient vector representation for documents through corruption #1159
Thanks for the link. There should at least be an integration of this code with gensim. Hoping the code will appear soon in this GitHub repo. It is indeed a very interesting way to expand on the paragraph vectors and tf-idf-averaged word2vec. It would be good to compare it to tf-idf-averaged word2vec (not just idf) when the code becomes available.
I sent her an email about this; I'll keep you posted! If she doesn't plan to make it public in the near future, I'll try to implement it!
The author made the code public 6 hours ago.
Looks like it is directly integrated into Mikolov's code, but I don't feel like I can make the change in Gensim by myself; I am too bad with Cython.
From a quick glance at the paper, the idea of composing a document-vector from a (random subset) of whole-doc words made sense to me and would be pretty easy to add as an option in gensim. However, I found the paper's description of the drop-out process, which "randomly overwrites each dimension of the original document x with probability q", a bit unclear: does it only happen to the composed doc-vec? Or to the combined doc-vec + neighboring-words context? (The paper also mentions "each word in a document is corrupted independently of others", which as far as I can tell doesn't match the earlier descriptions of leaving-out-words or zeroing-out-dimensions-of-the-docvec.) But doing a diff of the https://github.com/mchen24/iclr2017 code against Mikolov's "--sentence-vectors" word2vec.c patch should clarify this precisely. The paper's results on IMDB sentiment-prediction are only very slightly better than plain Paragraph Vectors, though on classification and semantic-relatedness there seem to be bigger improvements. However, the exact metaparameters used aren't specified, so it's unclear whether the new method, highly parameter-optimized, is being compared against just one set of PV parameters. Knowing for sure that the method often/usually helps should be a prerequisite before adding complexity to the existing PV-Doc2Vec code.
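To make my reading of the composition step concrete, here's a minimal NumPy sketch of how I understand the corrupted document representation: each word's vector is zeroed out independently with probability q, survivors are rescaled by 1/(1-q), and the document vector is the average of the result. The names (`corrupted_doc_vector`, `word_vectors`, etc.) are my own illustrative choices, not from the paper's code:

```python
import numpy as np

def corrupted_doc_vector(doc_word_ids, word_vectors, q, rng=None):
    """Sketch: average of the doc's word vectors under random corruption.

    Each word is dropped independently with probability q (the corruption
    rate); survivors are rescaled by 1/(1-q), so the corrupted average is
    an unbiased estimate of the full-document average.
    """
    rng = rng or np.random.default_rng()
    vecs = word_vectors[np.asarray(doc_word_ids)]      # shape (T, dim)
    keep = rng.random(len(doc_word_ids)) >= q          # Bernoulli keep-mask
    corrupted = np.where(keep[:, None], vecs / (1.0 - q), 0.0)
    return corrupted.mean(axis=0)                      # shape (dim,)
```

If this reading is right, then at inference time one would simply skip the corruption (q = 0) and average all word vectors, which would explain why unseen documents get representations cheaply, with no per-document training.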
@gojomo I agree that the performance is only slightly better than state-of-the-art methods (and only on specific datasets), but I think what makes this paper interesting is that this model scales with the size of the corpus, something I personally struggle with in the current implementation of Doc2Vec!
Thank you all for the interest in the work. The main motivation of the work is to address the scalability issues of PV: in particular, the number of parameters grows with the size of the training corpus, and generating representations for unseen documents is inefficient. We found the new method also improves the quality of the learned representation. There's only one additional hyper-parameter for Doc2VecC, which is the corruption rate. I don't find the method to be particularly sensitive to it. The general guideline is to use a large corruption rate, close to 1, for longer documents, and a lower corruption rate for shorter documents.
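The guideline above has a simple back-of-the-envelope reading (my own illustration, not from the paper): with corruption rate q, a document of T words contributes roughly T·(1−q) surviving word vectors to each corrupted average, so a long document tolerates q close to 1 while a short one does not:

```python
def expected_surviving_words(doc_length, q):
    """Expected number of words that survive corruption at rate q."""
    return doc_length * (1.0 - q)

print(expected_surviving_words(1000, q=0.9))  # -> 100.0: still a rich sample
print(expected_surviving_words(20, q=0.9))    # -> 2.0: almost nothing left
```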
I used the C implementation of the corruption model, which gave me some very interesting results. So I wonder whether anyone has started implementing it in gensim? If not, is it okay for me to start (from a licensing/copyright point of view)? Thank you.
@Saftophobia I don't know of anyone working on adding the 'Doc2VecC' method to gensim. It looks valuable and seems eligible for re-implementation in Python/Cython. (I see no IP claims, and the paper author's own edits to Mikolov's sentence-vectors word2vec.c patch are marked as Apache-licensed.) So I'd say it'd be a good thing for you to work on! When a PR is started – even one that's rough/in-progress, for seeking early feedback – I'd be looking to be sure any refactoring/extensions to the existing Doc2Vec functionality avoid unnecessary complexity and keep overhead negligible for people not using the extra techniques of Doc2VecC. Thanks!
Is this still underway?
@GeneralSemantics As far as I know, nobody is working on it now; the last attempt, #1985, was abandoned 😞
@GeneralSemantics I'm working on it, waiting for PR review now.
@Saftophobia Awesome, thanks for the update!
While I still find this technique interesting, as per my comment #1159 (comment) I'm still wondering if there are any cases where it really shines. And I'd prefer any implementation of this still-rather-obscure mode to avoid adding complexity/overhead to the mainline modes. That's hard, because the prior code is already a mess, and in my opinion the prior refactoring of #1777 only made things worse. As it stands, #1985 just layers even more mode-specific state and branching into the main word2vec code paths (rather than into specialized subclasses, swappable policy-classes, or even Doc2Vec-specific classes). It's a bit hard to review, and will make all other changes to this core code harder to review and understand in the future. It might also impose a performance drag on word2vec users who aren't using this mode. In my opinion, the benefits of and popular demand for this mode would have to be quite large to justify this sort of toggled-in complexity. The threshold for accepting such functionality would be lower if it were simpler to enable, but that could require a refactoring effort of herculean proportions: already difficult via the dual python-and-cython paths, and now the dual object-and-sourcefile paths, plus the accumulated layering and duplication of prior efforts.
Agreed with @gojomo. On the last note: I consider the #1777 refactoring disastrous. It's hard to imagine contributing meaningfully any more (even for me). Starting the models from a clean design, standalone and nimble (sharing concepts and code only where needed, and avoiding the various weird / hypothetical academic add-ons, which was our original refactoring goal), would simplify development, experimentation, and ultimately, possibly, adding new models (such as the current ticket). I don't think the effort is herculean, but it'd certainly have to be done by a good engineer intimate with the existing use-cases, models, APIs and code (a rather narrow circle).
Starting over isn't what would be herculean. Rather, refactoring the existing code, while maintaining all existing interfaces and all existing layered-on options, would be. (It's the Augean stables in there!) My suggestions now would be about the same as those offered in #1623 and while #1777 was underway.
I'm +1 for this.
+1 for the first two points raised by @gojomo (I'm unfamiliar with numba, so unable to comment on the third). I've been actively digging through our FastText implementation recently, and I concur with @piskvorky and @gojomo: it's a mess. I've untangled some of the worst parts. It looks a lot better now than before (I hope @menshikh-iv can confirm), but there's still a long way to go. I think that going forward, it would be worth making the goals and values of the Gensim project more obvious to potential contributors. For example, one of the problems with the #1777 refactoring is that it appears to have been misguided: the authors prioritized the removal of copypasta above all else, and added layers of abstraction and inheritance to meet that goal. If, for example, it were obvious that we value simplicity and loose coupling (e.g. FastText and word2vec are their own separate things), then the refactoring would have been more of a success.
Is there any progress on this issue?
I came across this paper and I think it is a very promising alternative to Doc2Vec, which doesn't scale at all with the size of the corpus!
The model architecture seems very similar to Word2Vec and it would be a nice addition to Gensim.
If you think it is a good idea, I would be glad to code a PoC for it!
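To illustrate the scaling argument with a back-of-the-envelope parameter count (the corpus sizes here are made up for the example): plain Paragraph Vectors learns one vector per document on top of the word vectors, so its model grows with the corpus, whereas Doc2VecC builds document vectors as averages of word vectors and only stores the vocabulary:

```python
def pv_parameters(n_docs, vocab_size, dim):
    # Paragraph Vectors: one trainable vector per document, plus word vectors.
    return (n_docs + vocab_size) * dim

def doc2vecc_parameters(vocab_size, dim):
    # Doc2VecC: document vectors are averages of word vectors, so only
    # the vocabulary contributes trainable parameters.
    return vocab_size * dim

# Hypothetical corpus: 10M documents, 100k vocabulary, 300 dimensions.
print(pv_parameters(10_000_000, 100_000, 300))   # 3,030,000,000 floats
print(doc2vecc_parameters(100_000, 300))         # 30,000,000 floats
```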