
Sentence Embeddings Approaches #684


Closed
SoundBot opened this issue Nov 18, 2019 · 12 comments

@SoundBot

Is there a way to extend sparknlp and create my custom embedder similar to BertEmbeddings? There are some interesting models on TF Hub which I would like to try.

@maziyarpanahi
Member

maziyarpanahi commented Nov 18, 2019

Currently, the answer is no. The WordEmbeddings annotator has that flexibility: regardless of how the embeddings were produced (GloVe, fastText, Word2Vec, etc.), they can be loaded as long as the file follows the expected format.
However, the BertEmbeddings annotator was built around BERT itself. If you have an embedding model in the same format (e.g. fine-tuned BERT embeddings), you can use the notebook to convert it into a Spark NLP BertEmbeddings model.

That said, if you can provide some examples, we can take a look and see whether it's worth implementing an annotator for those TF Hub models (considering accuracy, performance, etc.).
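
For illustration, here is a minimal sketch of that WordEmbeddings flexibility; the path, dimension, and storage ref are made up:

```python
# Hedged sketch: load a custom embeddings file in the GloVe-style text format
# (token followed by its vector on each line), regardless of how it was trained.
from sparknlp.annotator import WordEmbeddings

custom_embeddings = (
    WordEmbeddings()
    .setStoragePath("/data/my_custom_vectors.txt", "TEXT")  # illustrative path
    .setDimension(300)
    .setStorageRef("my_custom_vectors")
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
)
```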

@SoundBot
Author

@maziyarpanahi Thanks for the prompt reply! Currently I'm interested in sentence-level embeddings, which can be generated using the Universal Sentence Encoder or BERT.
If I read the docs correctly, Spark NLP generates word-level embeddings, which can be fed to the SentenceEmbeddings annotator to average/sum the word-level results. I believe this approach loses a lot of contextual information from a sentence.
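
For reference, this is roughly what that word-level averaging pipeline looks like, as I understand the current API (the pretrained model name is illustrative):

```python
# Word-level BERT embeddings pooled into sentence embeddings by averaging.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertEmbeddings, SentenceEmbeddings
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
bert = (BertEmbeddings.pretrained("bert_base_cased", "en")
        .setInputCols(["document", "token"])
        .setOutputCol("embeddings"))
pooled = (SentenceEmbeddings()
          .setInputCols(["document", "embeddings"])
          .setOutputCol("sentence_embeddings")
          .setPoolingStrategy("AVERAGE"))  # or "SUM"

pipeline = Pipeline(stages=[document, tokenizer, bert, pooled])
df = spark.createDataFrame([["Sentence-level context gets averaged away."]], ["text"])
result = pipeline.fit(df).transform(df)
```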

@maziyarpanahi
Member

maziyarpanahi commented Nov 24, 2019

Actually, one of the BERT authors has weighed in on averaging the vectors:

It should be noted that although the **"[CLS]"** token acts as an "aggregate representation" for classification tasks, it is not the best choice for a high-quality sentence embedding vector. According to BERT author Jacob Devlin (google-research/bert#164):

Original comment:

I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations. And even if they are decent representations when fed into a DNN trained for a downstream task, it doesn't mean that they will be meaningful in terms of cosine distance. (Since cosine distance is a linear space where all dimensions are weighted equally.)

USE is a whole other approach, and I do agree that simply averaging may not be the best way, especially with contextualized embeddings. I am working on introducing other pooling strategies for BERT, such as averaging the last 4 layers instead of just one layer at a time, and on extending SentenceEmbeddings to do more, such as weighted averaging (including TF-IDF as a weight factor) and SIF (Smooth Inverse Frequency).
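
To make the SIF idea concrete, here is a rough NumPy sketch (following Arora et al.'s formulation; the function and its inputs are illustrative, not Spark NLP APIs):

```python
# Smooth Inverse Frequency: weight word vectors by a / (a + p(w)), average them,
# then remove the projection onto the first principal component.
import numpy as np

def sif_embeddings(sentences, word_vectors, word_freq, a=1e-3):
    """sentences: list of token lists; word_vectors: dict token -> np.ndarray;
    word_freq: dict token -> unigram probability."""
    dim = len(next(iter(word_vectors.values())))
    emb = np.zeros((len(sentences), dim))
    for i, tokens in enumerate(sentences):
        vecs = [a / (a + word_freq.get(t, 1e-6)) * word_vectors[t]
                for t in tokens if t in word_vectors]
        if vecs:
            emb[i] = np.mean(vecs, axis=0)
    u = np.linalg.svd(emb, full_matrices=False)[2][0]  # first principal direction
    return emb - emb @ np.outer(u, u)                  # common-component removal
```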

PS: I would really like to continue this discussion here as we further develop Spark NLP's embeddings toolkit, since I am using these embeddings myself for a similarity engine I am working on. I believe this conversation will help us develop better approaches to sentence and document embeddings.

maziyarpanahi changed the title from "Custom TF embeddings" to "Sentence Embeddings Approaches" on Nov 24, 2019
@SoundBot
Author

Thanks, that sounds useful.

The reason I believe we need a generic "plug in your TF/PyTorch model" annotator is that there is no "one size fits all" model; there are different SOTA models for text summarization, text similarity, question answering, and other tasks.

Embedding models are also very resource-hungry. I was getting OOMs on nodes with 400 GB of RAM using fairly small documents (< 2M characters), so efficient memory management is also something to think about.

@maziyarpanahi
Member

maziyarpanahi commented Nov 29, 2019

I agree. If I can manage TF Hub support, I will make it a bit more generic so there is flexibility in which models to use.

The new version we released addresses the memory issue for BertEmbeddings. There was a bad memory leak that was not visible on datasets of fewer than 1 million sentences. I used the new release myself to test on 18 million sentences. (With 100K sentences it manages to stay right at the 8G we gave to the local Spark session.)

PS: please give the new release a try and let me know how it goes.
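
For reference, the 8G figure above is the driver memory given to the local Spark session, roughly like this (the `memory` argument to `sparknlp.start()` is my assumption based on recent versions; setting `spark.driver.memory` on the SparkSession directly works too):

```python
# Start a local Spark NLP session with 8G of driver memory (illustrative).
import sparknlp

spark = sparknlp.start(memory="8G")
print(sparknlp.version(), spark.version)
```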

@maziyarpanahi
Member

I'll close this in favor of:

  • Importing BERT models from TF Hub and Hugging Face for word and sentence embeddings: Import Transformers into Spark NLP 🚀 #5669
  • We now have the BertSentenceEmbeddings annotator (see the sketch below)
  • We also have the SentenceEmbeddings annotator to convert any word embeddings output into sentence embeddings
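
A minimal sketch of BertSentenceEmbeddings in a pipeline (the pretrained model name is illustrative; check the Models Hub for current names):

```python
# Sentence-level BERT embeddings without manual pooling of word vectors.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BertSentenceEmbeddings
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sent_bert = (BertSentenceEmbeddings.pretrained("sent_small_bert_L2_768", "en")
             .setInputCols(["document"])
             .setOutputCol("sentence_embeddings"))

pipeline = Pipeline(stages=[document, sent_bert])
df = spark.createDataFrame([["Sentence embeddings straight from BERT."]], ["text"])
result = pipeline.fit(df).transform(df)
```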

@alex2awesome
Contributor

alex2awesome commented Dec 20, 2023

Hi, I wanted to follow up on this, specifically this comment:

The reason I believe we need a generic "plug in your TF/PyTorch model" annotator is that there is no "one size fits all" model; there are different SOTA models for text summarization, text similarity, question answering, and other tasks.

If I understand correctly from reading the docs and perusing the example notebooks, there is currently NO way (even ~4 years after this thread was first opened) to import a custom PyTorch model into Spark NLP? The extensive list of annotators you have made available is quite impressive, but I 100% agree with the original author that there are endless tweaks and variations of these base models for different tasks; it would be a shame to force the field into the relatively small number of variations that are officially supported as named models by Hugging Face.

@alex2awesome
Contributor

Perhaps there is some potential for synergy here: https://github.com/dmmiller612/sparktorch

@alex2awesome
Contributor

alex2awesome commented Dec 21, 2023

By this:

I'll close this in favor of:

Importing BERT models from TF Hub and Hugging Face for word and sentence embeddings (#5669)
We now have the BertSentenceEmbeddings annotator
We also have the SentenceEmbeddings annotator to convert any word embeddings output into sentence embeddings

Do you mean that we then apply other layers/modifications on top of the BERT embeddings manually? That would be very cumbersome; here is an example of a great SOTA model:

https://github.com/shon-otmazgin/fastcoref/blob/main/models/modeling_lingmess.py

that adds modifications on top of an embeddings model. As you can see, there are many, many other steps; it would be a bit weird to build pipeline steps off of an embeddings model for something like this.

@maziyarpanahi
Member

Hi,
As long as the model has the same architecture, the same tokenizer, and the same input/output layers, you can do whatever you like in between, both while importing and when fine-tuning. Today, Spark NLP supports a very wide range of tasks in NLP, audio, and vision, but as you mentioned, these tasks are restricted to what we support. I personally fine-tune and tweak lots of models in PyTorch via Hugging Face, save them, and make sure they have exactly the same inputs/outputs/dtype before importing them into Spark NLP.

Anything else falls outside the scope of Spark NLP as a library. There are libraries that allow completely custom Torch/TF models, and we won't be reinventing that wheel. In addition, Spark NLP works as a pipeline, where it is crucial to know every input and output of each stage, which is why the tasks/architectures must be known. (There are good libraries for Scala/Java and Spark that allow custom Torch models.)
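
As a rough sketch of that import flow (paths are made up, and the assets step follows the public import notebooks, which may differ by version):

```python
# Export a fine-tuned Hugging Face BERT as a TF SavedModel, then load it as a
# Spark NLP annotator. Inputs/outputs must match the original BERT architecture.
import os
import sparknlp
from sparknlp.annotator import BertEmbeddings
from transformers import BertTokenizerFast, TFBertModel

# 1. Export the fine-tuned checkpoint (from_pt converts PyTorch weights to TF).
model = TFBertModel.from_pretrained("./my-finetuned-bert", from_pt=True)
model.save_pretrained("./export", saved_model=True)

# 2. Put the vocabulary next to the graph, as the import notebooks do.
assets_dir = "./export/saved_model/1/assets"
os.makedirs(assets_dir, exist_ok=True)
BertTokenizerFast.from_pretrained("./my-finetuned-bert").save_vocabulary(assets_dir)

# 3. Load the SavedModel into Spark NLP and save the annotator for reuse.
spark = sparknlp.start()
bert = (BertEmbeddings.loadSavedModel("./export/saved_model/1", spark)
        .setInputCols(["document", "token"])
        .setOutputCol("embeddings")
        .setCaseSensitive(True)
        .setDimension(768))
bert.write().overwrite().save("./bert_spark_nlp")
```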

@alex2awesome
Contributor

alex2awesome commented Dec 21, 2023

Sorry, I don't completely understand.

"same architecture, same tokenizer, same inputs/outputs layers"

Does this include adding tokens to the tokenizer? Presumably adding tokens means we are changing the output layer. Besides this, the only thing left to vary, really, is choosing a different dataset to fine-tune on; is that correct?

I personally fine-tune and tweak lots of models in pytorch via Hugging Face, save them, and make sure they are exactly the same inputs/outputs/dtype before importing them into Spark NLP.

That would preclude something like the model I linked above, which is built off an HF embeddings model but has a different head, right? Or different inner architectural steps (e.g. LTG-BERT: https://aclanthology.org/2023.findings-eacl.146.pdf) that don't affect the I/O format. Or am I misunderstanding the meaning of the word "tweak" here? Can you describe in more detail what you mean by "tweak"?

there are good libraries for Scala/Java and Spark that allow custom Torch models

Can you recommend a library in this category? Is there interoperability between these and Spark NLP? That would be really, really cool, given, as you said, the effort Spark NLP makes to standardize the other parts of the pipeline.

@maziyarpanahi
Member

Sorry, I don't completely understand.

"same architecture, same tokenizer, same inputs/outputs layers"

Does this include adding tokens to the tokenizer? Presumably adding tokens means we are changing the output layer. Besides this, the only thing left to vary, really, is choosing a different dataset to fine-tune on; is that correct?

If you have a look at the notebooks for importing models into Spark NLP, this becomes clearer. Depending on the model's architecture, the tokenization is different: some are BPE, some are SentencePiece, etc. So you can change those before exporting the model, as long as we support that exact architecture.
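
For example, a hedged Hugging Face sketch of the added-tokens question: you can extend the tokenizer and resize the embedding matrix before fine-tuning and exporting, as long as the architecture itself stays one we support:

```python
# Add domain tokens to a BERT tokenizer and grow the embedding layer to match.
from transformers import BertTokenizerFast, TFBertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = TFBertModel.from_pretrained("bert-base-cased")

tokenizer.add_tokens(["[DOMAIN_TERM]"])        # illustrative new token
model.resize_token_embeddings(len(tokenizer))  # keep vocab and embeddings in sync

# ...fine-tune, then export and import into Spark NLP as in the earlier sketch...
```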

I personally fine-tune and tweak lots of models in PyTorch via Hugging Face, save them, and make sure they have exactly the same inputs/outputs/dtype before importing them into Spark NLP.

That would preclude something like the model I linked above, which is built off an HF embeddings model but has a different head, right? Or different inner architectural steps (e.g. LTG-BERT: https://aclanthology.org/2023.findings-eacl.146.pdf) that don't affect the I/O format. Or am I misunderstanding the meaning of the word "tweak" here? Can you describe in more detail what you mean by "tweak"?

Similar to that work: adding different layers in Keras, etc., but the original model stays the same at the end.
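
Roughly, the pattern looks like this (my reading, not an official recipe): train with an extra Keras head, but export only the unchanged base encoder so its inputs/outputs still match what Spark NLP expects.

```python
# Fine-tune BERT behind a task-specific head, then export only the base encoder.
import tensorflow as tf
from transformers import TFBertModel

base = TFBertModel.from_pretrained("bert-base-cased")

input_ids = tf.keras.Input(shape=(128,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(128,), dtype=tf.int32, name="attention_mask")
cls = base(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0, :]
logits = tf.keras.layers.Dense(2, activation="softmax")(cls)  # task-specific head

clf = tf.keras.Model([input_ids, attention_mask], logits)
clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# clf.fit(...) on the downstream task; the tuned weights live inside `base`.

# Export only the base encoder; the head stays behind.
base.save_pretrained("./exported_bert", saved_model=True)
```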

there are good libraries for Scala/Java and Spark that allow custom Torch models

Can you recommend a library in this category? Is there interoperability between these and Spark NLP? That would be really, really cool, given, as you said, the effort Spark NLP makes to standardize the other parts of the pipeline.

There is no interoperability. Once you are in another library loading a different model, then depending on what it does, you can use Spark NLP before and after it where possible.

JohnSnowLabs locked as resolved and limited conversation to collaborators on Dec 24, 2023