encoding in batches with token length leads to different results? #2570
Comments
Oh, or do you sort by text length? It appears so.
I put together an approach that first calculates token length instead of text length and sorts by that, here: feel free to modify / tell me your opinion.
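A minimal sketch of the idea (not the actual PR; the model name and variable names are just examples): tokenize once up front to get real token counts, sort by those, encode, then restore the original order.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "a short one",
    "a considerably longer example sentence with many more words",
    "mid length text",
]

# Tokenize once up front to get the true token count per sentence. This is the
# "double tokenization" cost discussed below: encode() will tokenize again.
token_lengths = [len(model.tokenizer(s)["input_ids"]) for s in sentences]

# Sort indices by token length so each batch pads as little as possible.
order = sorted(range(len(sentences)), key=lambda i: token_lengths[i])
sorted_embeddings = model.encode([sentences[i] for i in order])

# Undo the sort so embeddings line up with the original sentence order.
embeddings = np.empty_like(sorted_embeddings)
embeddings[order] = sorted_embeddings
```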
Hello! Thanks for reporting and for the PR! Sorting by token size seems a lot more consistent, but I'm a bit wary of the double tokenization. I understand that sorting samples by length allows you to get less padding in a batch, which should process more quickly, but perhaps efficient string sorting + slightly inefficient batches is faster than inefficient tokenization + efficient batches?

As for your original question, my understanding is that the padding tokens are all fully ignored. They should not influence the eventual embedding results. If they do, please share a script for reproducing that; I would be quite interested.
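A minimal comparison sketch for checking this (the model name is just an example, not necessarily the one from the original report): encode a short sentence on its own, then again in one batch alongside a much longer sentence so it gets padded, and compare the two embeddings.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

short = "Padding should not matter."
long = (
    "This deliberately much longer sentence forces the short one to be "
    "padded up to a larger sequence length inside the same batch."
)

alone = model.encode([short])[0]
# batch_size=2 ensures both sentences land in the same batch, so the short
# one is padded; encode() returns results in the original input order.
padded = model.encode([short, long], batch_size=2)[0]

# If padding tokens are fully masked out of the pooling, the difference
# should be near zero, up to floating-point noise from batched kernels.
print(np.abs(alone - padded).max())
```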
Hi Tom - sure thing, I will do some tests and share them. (Generally speaking I experienced a 2x increase, but I will do a proper test.)
Hi! I was just wondering:
https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/SentenceTransformer.py#L350
You sort by token length, which is really great for performance. However, I was wondering whether sentences with different token lengths can still end up in the same batch and get padded. For some models it has been observed that the pad token changes the mean pooling slightly. Is it the case that sentence-transformers also batches different token lengths together?
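To illustrate what I mean, a tiny check (the tokenizer name is just an example) showing that character count and token count do not have to agree, so batches grouped one way can still mix lengths the other way:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

a = "the cat sat on the mat today"  # longer string, common words
b = "zyzzyva qoph fjords"           # shorter string, rare subword-heavy words

for s in (a, b):
    n_tokens = len(tok(s)["input_ids"])
    print(f"{len(s):2d} chars -> {n_tokens:2d} tokens: {s!r}")

# If the two orderings disagree, batches grouped by character length can
# still mix token lengths, and the shorter sequences get padded in the batch.
```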
Thank you for the great repo!
Aaron