Sentence Embeddings Approaches #684
Currently, the answer is no. That said, if you can provide some examples, we can take a look at them to see whether it's worth implementing an annotator for those models in TF Hub (considering accuracy, performance, etc.).
@maziyarpanahi Thanks for the prompt reply! Currently I'm interested in sentence-level embeddings, which can be generated using Universal Sentence Encoder or BERT.
Actually, one of the BERT authors suggested averaging the vectors:
Original comment:
The USE is a whole other approach, and I do agree that simply averaging may not be the best way, especially with contextualized embeddings. I am working on introducing other pooling strategies for BERT to average the last 4 layers instead of just one layer at a time, and also extend the
PS: I would really like to continue this discussion here for the further development of Spark NLP's embeddings toolkit, as I am using those myself for the similarity engine I am working on. I believe this conversation would help develop better approaches to sentence and document embeddings.
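To make the pooling idea concrete, here is a minimal sketch (not Spark NLP's internal implementation; the model name and pooling choices are assumptions) that averages the last four hidden layers of a BERT model from the Hugging Face transformers library and then mean-pools over tokens to get one sentence vector:

```python
# Minimal sketch of "average the last 4 layers" pooling; model name and
# pooling choices are assumptions, not Spark NLP's implementation.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def sentence_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states: tuple of (embedding layer + 12 transformer layers), each [1, seq_len, 768]
    last_four = torch.stack(outputs.hidden_states[-4:])   # [4, 1, seq_len, 768]
    token_vectors = last_four.mean(dim=0)                 # average the last 4 layers
    return token_vectors.mean(dim=1).squeeze(0)           # average over tokens -> [768]

vec = sentence_embedding("Spark NLP sentence embeddings example.")
print(vec.shape)  # torch.Size([768])
```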
Thanks, that sounds useful. The reason I believe we need a generic "plug in your TF/PyTorch model" annotator is that there is no "one size fits all" model; there are different SOTA models for text summarization, text similarity, question answering, and other tasks. Embedding models are also very resource hungry. I was getting OOMs on nodes with 400GB RAM on fairly small documents (< 2M characters), so efficient memory management is also something to think about.
I agree. If I can manage TF Hub, I will make it a bit more generic to have flexibility in which model to use. The new version we released addresses the memory issue for BertEmbeddings. There was a bad memory leak that wasn't visible on datasets of fewer than 1 million sentences. I used the new release myself to test on 18 million sentences. (With 100K sentences it manages to stay right at the 8G we gave to the Spark session locally.)
PS: please give the new release a try and let me know how it goes.
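For readers trying to reproduce the local 8G setup, here is a minimal sketch of starting a Spark session with an explicit driver memory budget; the memory value and the Spark NLP package coordinates/version are assumptions, not settings taken from this thread:

```python
# Minimal sketch: a local Spark session with a fixed driver memory budget for
# Spark NLP. The 8g figure and the package version are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-nlp-embeddings")
    .master("local[*]")
    .config("spark.driver.memory", "8g")
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3")
    .getOrCreate()
)
```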
I'll close this in favor of:
Hi, I wanted to follow up on this, specifically this comment:
If I understand correctly from reading the docs and perusing the example notebooks, there is currently NO way (even ~4 years after this thread was first opened) to import a custom PyTorch model into SparkNLP? The extensive list of annotators you have made available is quite impressive, but I 100% agree with the original author that there are endless tweaks and variations of these base models for different tasks; it would be a shame to force the field into the relatively small number of variations that are officially supported as named models by Huggingface.
Perhaps there is some potential for synergy here: https://github.com/dmmiller612/sparktorch
By this:
Do you mean that we then apply other layers/modifications on top of the BERT embeddings manually? That would be very cumbersome. Here is an example of a great SOTA model that adds modifications on top of an embeddings model: https://github.com/shon-otmazgin/fastcoref/blob/main/models/modeling_lingmess.py. As you can see, there are many, many other steps; it would be a bit weird to build pipeline steps off of an Embeddings model for something like this.
Hi, anything else falls outside the scope of Spark NLP as a library. There are libraries that allow completely custom Torch/TF models, and we won't be reinventing the wheel there. In addition, Spark NLP works as a pipeline where it is crucial to know every input and output of each stage, and that is why the tasks/architectures must be known. (There are good libraries for Scala/Java and Spark that allow custom Torch models.)
Sorry, I don't completely understand.
Does this include any tokens added to the tokenizer? Presumably adding tokens means we are changing the output layer. Besides this, the only thing left to vary, really, is choosing a different dataset to fine-tune on. Am I correct?
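For concreteness, this is roughly what adding tokens to the tokenizer looks like on the Hugging Face side before any export; the model name and the example tokens are assumptions, and note that resizing the token embeddings is what changes the model's vocabulary-dependent weights:

```python
# Sketch of adding custom tokens to a Hugging Face tokenizer and resizing the
# model's embedding matrix to match; model name and tokens are placeholders.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

num_added = tokenizer.add_tokens(["[DOMAIN_TERM]", "[ANOTHER_TERM]"])
# The embedding matrix must grow to cover the enlarged vocabulary.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocabulary size: {len(tokenizer)}")
```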
That would preclude something like the model I linked above, which is built off of an HF embeddings model but has a different head, right? Or different inner architectural steps (e.g. LTG-BERT: https://aclanthology.org/2023.findings-eacl.146.pdf) that don't affect the I/O format. Or am I misunderstanding the meaning of the word "tweak" here? Can you describe in more detail what you mean by "tweak"?
Can you recommend a library in this category? Is there interoperability between these and SparkNLP? That would be really, really cool if so, given, as you said, the efforts SparkNLP makes to standardize the other parts of the pipeline.
If you have a look at the notebooks for importing models into Spark NLP, this becomes clearer. Depending on the model's architecture, the tokenization is different. Some are BPE, some are SentencePiece, etc. So you can change those before exporting them, if we support that exact architecture.
Similarly for this kind of work, adding a different layer in Keras, etc. But the original model is the same in the end.
There is no interoperability. Once you are in another library loading a different model, then depending on what it does, you can use Spark NLP before and after it, where possible.
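To illustrate the import path mentioned above for a supported architecture, here is a rough sketch based on the Hugging Face to Spark NLP import notebooks (TF SavedModel flow); the paths, model name, and export details are assumptions, and the official import notebooks remain the authoritative reference:

```python
# Rough sketch of exporting a supported architecture (BERT) from Hugging Face
# and loading it into Spark NLP; paths and model name are placeholders.
import sparknlp
from transformers import TFBertModel, BertTokenizer
from sparknlp.annotator import BertEmbeddings

spark = sparknlp.start()

MODEL_NAME = "bert-base-cased"
EXPORT_DIR = f"./{MODEL_NAME}"

# 1. Export the model as a TF SavedModel and put the vocab where Spark NLP expects it.
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = TFBertModel.from_pretrained(MODEL_NAME)
model.save_pretrained(EXPORT_DIR, saved_model=True)
tokenizer.save_vocabulary(f"{EXPORT_DIR}/saved_model/1/assets")

# 2. Load the SavedModel into a Spark NLP annotator and save it in Spark NLP format.
bert = (
    BertEmbeddings.loadSavedModel(f"{EXPORT_DIR}/saved_model/1", spark)
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
)
bert.write().overwrite().save(f"{EXPORT_DIR}_spark_nlp")
```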
Original issue: Is there a way to extend sparknlp and create my custom embedder similar to BertEmbeddings? There are some interesting models on TF Hub which I would like to try.
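As a closing note for readers arriving at this thread now: Spark NLP does ship sentence-level annotators such as UniversalSentenceEncoder. A minimal usage sketch follows; the column names and the reliance on the default pretrained model are assumptions:

```python
# Minimal sketch: sentence embeddings with Spark NLP's UniversalSentenceEncoder
# annotator. Column names and the default pretrained model are assumptions.
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
use = (
    UniversalSentenceEncoder.pretrained()   # downloads the default TF Hub USE model
    .setInputCols(["document"])
    .setOutputCol("sentence_embeddings")
)

pipeline = Pipeline(stages=[document, use])
data = spark.createDataFrame([["Spark NLP provides sentence embeddings."]], ["text"])
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(sentence_embeddings.embeddings) as embedding").show(1)
```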