encoding in batches with token length leads to different results? #2570
Comments
Oh, or do you sort by text length? It appears so.
I put together an approach that first calculates token length instead of text length and sorts by that, here: feel free to modify / tell me your opinion.
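A minimal sketch of the idea (not the actual PR; the model name and variable names are just examples): tokenize once up front to get real token counts, sort by those, encode, then restore the original order.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "a short one",
    "a considerably longer example sentence with many more words",
    "mid length text",
]

# Tokenize once up front to get the true token count per sentence. This is the
# "double tokenization" cost discussed below: encode() will tokenize again.
token_lengths = [len(model.tokenizer(s)["input_ids"]) for s in sentences]

# Sort indices by token length so each batch pads as little as possible.
order = sorted(range(len(sentences)), key=lambda i: token_lengths[i])
sorted_embeddings = model.encode([sentences[i] for i in order])

# Undo the sort so embeddings line up with the original sentence order.
embeddings = np.empty_like(sorted_embeddings)
embeddings[order] = sorted_embeddings
```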
Hello! Thanks for reporting and for the PR! Sorting by token size seems a lot more consistent, but I'm a bit wary of the double tokenization. I understand that sorting samples by length allows you to get less padding in a batch, which should process more quickly, but perhaps efficient string sorting + slightly inefficient batches is faster than inefficient tokenization + efficient batches?

As for your original question, my understanding is that the padding tokens are all fully ignored. They should not influence the eventual embedding results. If they do, please share a script for reproducing that; I would be quite interested.
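A minimal comparison sketch for checking this (the model name is just an example, not necessarily the one from the original report): encode a short sentence on its own, then again in one batch alongside a much longer sentence so it gets padded, and compare the two embeddings.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

short = "Padding should not matter."
long = (
    "This deliberately much longer sentence forces the short one to be "
    "padded up to a larger sequence length inside the same batch."
)

alone = model.encode([short])[0]
# batch_size=2 ensures both sentences land in the same batch, so the short
# one is padded; encode() returns results in the original input order.
padded = model.encode([short, long], batch_size=2)[0]

# If padding tokens are fully masked out of the pooling, the difference
# should be near zero, up to floating-point noise from batched kernels.
print(np.abs(alone - padded).max())
```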
Hi Tom - sure thing, I will do some tests and share them. (Generally speaking I experienced a 2x increase, but I will do a proper test.)
Hi! I was just wondering:
https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/SentenceTransformer.py#L350
You sort by token length, which is really great for performance. However, I was wondering whether sentences with different token lengths can still end up in the same batch and get padded. For some models it has been observed that the pad token changes the mean pooling slightly. Is it the case that sentence-transformers also batches different token lengths together?
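To illustrate what I mean, a tiny check (the tokenizer name is just an example) showing that character count and token count do not have to agree, so batches grouped one way can still mix lengths the other way:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

a = "the cat sat on the mat today"  # longer string, common words
b = "zyzzyva qoph fjords"           # shorter string, rare subword-heavy words

for s in (a, b):
    n_tokens = len(tok(s)["input_ids"])
    print(f"{len(s):2d} chars -> {n_tokens:2d} tokens: {s!r}")

# If the two orderings disagree, batches grouped by character length can
# still mix token lengths, and the shorter sequences get padded in the batch.
```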
Thank you for the great repo!
Aaron