
Sentence Embeddings Approaches #684


Closed
SoundBot opened this issue Nov 18, 2019 · 12 comments

@SoundBot

Is there a way to extend sparknlp and create my custom embedder similar to BertEmbeddings? There are some interesting models on TF Hub which I would like to try.

@maziyarpanahi
Member

maziyarpanahi commented Nov 18, 2019

Currently, the answer is no. The WordEmbeddings annotator has that flexibility: regardless of how the embeddings were produced (GloVe, fastText, Word2Vec, etc.), they can be loaded as long as the file follows the expected format.
However, the BertEmbeddings annotator was built around BERT itself. If you have an embedding model in the same format (e.g. fine-tuned BERT embeddings), you can use the notebook to convert it into a Spark NLP BertEmbeddings model.

That said, if you can provide some examples, we can take a look and see whether it's worth implementing an annotator for those TF Hub models (considering accuracy, performance, etc.).
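
For illustration, here is a minimal sketch of that WordEmbeddings flexibility; the path, dimension, and storage ref are made up:

```python
# Hedged sketch: load a custom embeddings file in the GloVe-style text format
# (token followed by its vector on each line), regardless of how it was trained.
from sparknlp.annotator import WordEmbeddings

custom_embeddings = (
    WordEmbeddings()
    .setStoragePath("/data/my_custom_vectors.txt", "TEXT")  # illustrative path
    .setDimension(300)
    .setStorageRef("my_custom_vectors")
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
)
```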

@SoundBot
Author

@maziyarpanahi Thanks for the prompt reply! Currently I'm interested in sentence-level embeddings, which can be generated using the Universal Sentence Encoder or BERT.
If I read the docs correctly, Spark NLP generates word-level embeddings, which can be fed to the SentenceEmbeddings annotator to average/sum the word-level results. I believe this approach loses a lot of contextual information from a sentence.
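
For reference, this is roughly what that word-level averaging pipeline looks like, as I understand the current API (the pretrained model name is illustrative):

```python
# Word-level BERT embeddings pooled into sentence embeddings by averaging.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertEmbeddings, SentenceEmbeddings
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
bert = (BertEmbeddings.pretrained("bert_base_cased", "en")
        .setInputCols(["document", "token"])
        .setOutputCol("embeddings"))
pooled = (SentenceEmbeddings()
          .setInputCols(["document", "embeddings"])
          .setOutputCol("sentence_embeddings")
          .setPoolingStrategy("AVERAGE"))  # or "SUM"

pipeline = Pipeline(stages=[document, tokenizer, bert, pooled])
df = spark.createDataFrame([["Sentence-level context gets averaged away."]], ["text"])
result = pipeline.fit(df).transform(df)
```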

@maziyarpanahi
Member

maziyarpanahi commented Nov 24, 2019

Actually, one of the BERT authors has weighed in on averaging the vectors:

It should be noted that although the **"[CLS]"** token acts as an "aggregate representation" for classification tasks, it is not the best choice for a high-quality sentence embedding vector. According to BERT author Jacob Devlin (google-research/bert#164):

Original comment:

I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations. And even if they are decent representations when fed into a DNN trained for a downstream task, it doesn't mean that they will be meaningful in terms of cosine distance. (Since cosine distance is a linear space where all dimensions are weighted equally.)

USE is a whole other approach, and I do agree that simply averaging may not be the best way, especially with contextualized embeddings. I am working on introducing other pooling strategies for BERT, such as averaging the last 4 layers instead of just one layer at a time, and on extending SentenceEmbeddings to do more, such as weighted averaging (including TF-IDF as a weight factor) and SIF (Smooth Inverse Frequency).
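
To make the SIF idea concrete, here is a rough NumPy sketch (following Arora et al.'s formulation; the function and its inputs are illustrative, not Spark NLP APIs):

```python
# Smooth Inverse Frequency: weight word vectors by a / (a + p(w)), average them,
# then remove the projection onto the first principal component.
import numpy as np

def sif_embeddings(sentences, word_vectors, word_freq, a=1e-3):
    """sentences: list of token lists; word_vectors: dict token -> np.ndarray;
    word_freq: dict token -> unigram probability."""
    dim = len(next(iter(word_vectors.values())))
    emb = np.zeros((len(sentences), dim))
    for i, tokens in enumerate(sentences):
        vecs = [a / (a + word_freq.get(t, 1e-6)) * word_vectors[t]
                for t in tokens if t in word_vectors]
        if vecs:
            emb[i] = np.mean(vecs, axis=0)
    u = np.linalg.svd(emb, full_matrices=False)[2][0]  # first principal direction
    return emb - emb @ np.outer(u, u)                  # common-component removal
```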

PS: I would really like to continue this discussion here as we further develop Spark NLP's embeddings toolkit, since I am using these embeddings myself for a similarity engine I am working on. I believe this conversation will help us develop better approaches to sentence and document embeddings.

maziyarpanahi changed the title from "Custom TF embeddings" to "Sentence Embeddings Approaches" on Nov 24, 2019
@SoundBot
Author

Thanks, that sounds useful.

The reason I believe we need a generic "plug in your TF/PyTorch model" annotator is that there is no "one size fits all" model; there are different SOTA models for text summarization, text similarity, question answering, and other tasks.

Embedding models are also very resource-hungry. I was getting OOMs on nodes with 400 GB of RAM using fairly small documents (< 2M characters), so efficient memory management is also something to think about.

@maziyarpanahi
Member

maziyarpanahi commented Nov 29, 2019

I agree. If I can manage TF Hub support, I will make it a bit more generic so there is flexibility in which models to use.

The new version we released addresses the memory issue for BertEmbeddings. There was a bad memory leak that was not visible on datasets of fewer than 1 million sentences. I used the new release myself to test on 18 million sentences. (With 100K sentences it manages to stay right at the 8G we gave to the local Spark session.)

PS: please give the new release a try and let me know how it goes.
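
For reference, the 8G figure above is the driver memory given to the local Spark session, roughly like this (the `memory` argument to `sparknlp.start()` is my assumption based on recent versions; setting `spark.driver.memory` on the SparkSession directly works too):

```python
# Start a local Spark NLP session with 8G of driver memory (illustrative).
import sparknlp

spark = sparknlp.start(memory="8G")
print(sparknlp.version(), spark.version)
```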

@maziyarpanahi
Member

I'll close this in favor of:

  • Importing BERT models from TF Hub and Hugging Face for word and sentence embeddings: Import Transformers into Spark NLP 🚀 #5669
  • We now have the BertSentenceEmbeddings annotator (see the sketch below)
  • We also have the SentenceEmbeddings annotator to convert any word embeddings output into sentence embeddings
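
A minimal sketch of BertSentenceEmbeddings in a pipeline (the pretrained model name is illustrative; check the Models Hub for current names):

```python
# Sentence-level BERT embeddings without manual pooling of word vectors.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BertSentenceEmbeddings
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sent_bert = (BertSentenceEmbeddings.pretrained("sent_small_bert_L2_768", "en")
             .setInputCols(["document"])
             .setOutputCol("sentence_embeddings"))

pipeline = Pipeline(stages=[document, sent_bert])
df = spark.createDataFrame([["Sentence embeddings straight from BERT."]], ["text"])
result = pipeline.fit(df).transform(df)
```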

@alex2awesome
Contributor

alex2awesome commented Dec 20, 2023

Hi, I wanted to follow up on this, specifically this comment:

The reason I believe we need a generic "plug in your TF/PyTorch model" annotator is that there is no "one size fits all" model; there are different SOTA models for text summarization, text similarity, question answering, and other tasks.

If I understand correctly from reading the docs and perusing the example notebooks, there is currently NO way (even ~4 years after this thread was first opened) to import a custom PyTorch model into Spark NLP? The extensive list of annotators you have made available is quite impressive, but I 100% agree with the original author that there are endless tweaks and variations of these base models for different tasks; it would be a shame to force the field into the relatively small number of variations that are officially supported as named models by Hugging Face.

@alex2awesome
Contributor

Perhaps there is some potential for synergy here: https://github.com/dmmiller612/sparktorch

@alex2awesome
Contributor

alex2awesome commented Dec 21, 2023

By this:

I'll close this in favor of:

Importing BERT models from TF Hub and Hugging Face for word and sentence embeddings (#5669)
We now have the BertSentenceEmbeddings annotator
We also have the SentenceEmbeddings annotator to convert any word embeddings output into sentence embeddings

Do you mean that we then apply other layers/modifications on top of the BERT embeddings manually? That would be very cumbersome; here is an example of a great SOTA model:

https://github.com/shon-otmazgin/fastcoref/blob/main/models/modeling_lingmess.py

that adds modifications on top of an embeddings model. As you can see, there are many, many other steps; it would be a bit weird to build pipeline steps off of an embeddings model for something like this.

@maziyarpanahi
Member

Hi,
As long as the model has the same architecture, the same tokenizer, and the same input/output layers, you can do whatever you like in between, both while importing and when fine-tuning. Today, Spark NLP supports a very wide range of tasks in NLP, audio, and vision, but as you mentioned, these tasks are restricted to what we support. I personally fine-tune and tweak lots of models in PyTorch via Hugging Face, save them, and make sure they have exactly the same inputs/outputs/dtype before importing them into Spark NLP.

Anything else falls outside the scope of Spark NLP as a library. There are libraries that allow completely custom Torch/TF models, and we won't be reinventing that wheel. In addition, Spark NLP works as a pipeline, where it is crucial to know every input and output of each stage, which is why the tasks/architectures must be known. (There are good libraries for Scala/Java and Spark that allow custom Torch models.)
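
As a rough sketch of that import flow (paths are made up, and the assets step follows the public import notebooks, which may differ by version):

```python
# Export a fine-tuned Hugging Face BERT as a TF SavedModel, then load it as a
# Spark NLP annotator. Inputs/outputs must match the original BERT architecture.
import os
import sparknlp
from sparknlp.annotator import BertEmbeddings
from transformers import BertTokenizerFast, TFBertModel

# 1. Export the fine-tuned checkpoint (from_pt converts PyTorch weights to TF).
model = TFBertModel.from_pretrained("./my-finetuned-bert", from_pt=True)
model.save_pretrained("./export", saved_model=True)

# 2. Put the vocabulary next to the graph, as the import notebooks do.
assets_dir = "./export/saved_model/1/assets"
os.makedirs(assets_dir, exist_ok=True)
BertTokenizerFast.from_pretrained("./my-finetuned-bert").save_vocabulary(assets_dir)

# 3. Load the SavedModel into Spark NLP and save the annotator for reuse.
spark = sparknlp.start()
bert = (BertEmbeddings.loadSavedModel("./export/saved_model/1", spark)
        .setInputCols(["document", "token"])
        .setOutputCol("embeddings")
        .setCaseSensitive(True)
        .setDimension(768))
bert.write().overwrite().save("./bert_spark_nlp")
```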

@alex2awesome
Contributor

alex2awesome commented Dec 21, 2023

Sorry, I don't completely understand.

"same architecture, same tokenizer, same inputs/outputs layers"

Does this include adding tokens to the tokenizer? Presumably adding tokens means we are changing the output layer. Besides this, the only thing left to vary, really, is choosing a different dataset to fine-tune on; is that correct?

I personally fine-tune and tweak lots of models in pytorch via Hugging Face, save them, and make sure they are exactly the same inputs/outputs/dtype before importing them into Spark NLP.

That would preclude something like the model I linked above, which is built off an HF embeddings model but has a different head, right? Or different inner architectural steps (e.g. LTG-BERT: https://aclanthology.org/2023.findings-eacl.146.pdf) that don't affect the I/O format. Or am I misunderstanding the meaning of the word "tweak" here? Can you describe in more detail what you mean by "tweak"?

there are good libraries for Scala/Java and Spark that allow custom Torch models

Can you recommend a library in this category? Is there interoperability between these and Spark NLP? That would be really, really cool, given, as you said, the effort Spark NLP makes to standardize the other parts of the pipeline.

@maziyarpanahi
Member

Sorry, I don't completely understand.

"same architecture, same tokenizer, same inputs/outputs layers"

Does this include adding tokens to the tokenizer? Presumably adding tokens means we are changing the output layer. Besides this, the only thing left to vary, really, is choosing a different dataset to fine-tune on; is that correct?

If you have a look at the notebooks for importing models into Spark NLP, this becomes clearer. Depending on the model's architecture, the tokenization is different: some are BPE, some are SentencePiece, etc. So you can change those before exporting the model, as long as we support that exact architecture.
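
For example, a hedged Hugging Face sketch of the added-tokens question: you can extend the tokenizer and resize the embedding matrix before fine-tuning and exporting, as long as the architecture itself stays one we support:

```python
# Add domain tokens to a BERT tokenizer and grow the embedding layer to match.
from transformers import BertTokenizerFast, TFBertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = TFBertModel.from_pretrained("bert-base-cased")

tokenizer.add_tokens(["[DOMAIN_TERM]"])        # illustrative new token
model.resize_token_embeddings(len(tokenizer))  # keep vocab and embeddings in sync

# ...fine-tune, then export and import into Spark NLP as in the earlier sketch...
```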

I personally fine-tune and tweak lots of models in PyTorch via Hugging Face, save them, and make sure they have exactly the same inputs/outputs/dtype before importing them into Spark NLP.

That would preclude something like the model I linked above, which is built off an HF embeddings model but has a different head, right? Or different inner architectural steps (e.g. LTG-BERT: https://aclanthology.org/2023.findings-eacl.146.pdf) that don't affect the I/O format. Or am I misunderstanding the meaning of the word "tweak" here? Can you describe in more detail what you mean by "tweak"?

Similar to that work: adding different layers in Keras, etc., but the original model stays the same at the end.
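
Roughly, the pattern looks like this (my reading, not an official recipe): train with an extra Keras head, but export only the unchanged base encoder so its inputs/outputs still match what Spark NLP expects.

```python
# Fine-tune BERT behind a task-specific head, then export only the base encoder.
import tensorflow as tf
from transformers import TFBertModel

base = TFBertModel.from_pretrained("bert-base-cased")

input_ids = tf.keras.Input(shape=(128,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(128,), dtype=tf.int32, name="attention_mask")
cls = base(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0, :]
logits = tf.keras.layers.Dense(2, activation="softmax")(cls)  # task-specific head

clf = tf.keras.Model([input_ids, attention_mask], logits)
clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# clf.fit(...) on the downstream task; the tuned weights live inside `base`.

# Export only the base encoder; the head stays behind.
base.save_pretrained("./exported_bert", saved_model=True)
```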

there are good libraries for Scala/Java and Spark that allow custom Torch models

Can you recommend a library in this category? Is there interoperability between these and Spark NLP? That would be really, really cool, given, as you said, the effort Spark NLP makes to standardize the other parts of the pipeline.

There is no interoperability. Once you are in another library loading a different model, then depending on what it does, you can use Spark NLP before and after it where possible.

JohnSnowLabs locked as resolved and limited conversation to collaborators on Dec 24, 2023