Use a strategy for sequence preprocessing #99

SebieF · 2024-08-01T13:10:49Z

The tokenizers of different protein language models (pLMs) work differently under the hood, so each model needs its own strategy for passing sequences to its tokenizer.

ProtT5 pre-processing (https://github.com/agemagician/ProtTrans) separates the amino acids with whitespace.
Ankh pre-processing (https://github.com/agemagician/Ankh) does not put whitespace between the amino acids.
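
To illustrate the difference, here is a minimal sketch of a per-model preprocessing strategy. It is not the code from this PR: the names `whitespace_separated`, `unchanged`, `STRATEGIES`, and `preprocess` are hypothetical, and matching a strategy by a substring of the model name is an assumption made for this example.

```python
from typing import Callable, Dict

# Hypothetical type alias: a strategy turns a raw amino acid sequence
# into the string that is handed to the tokenizer.
PreprocessingStrategy = Callable[[str], str]


def whitespace_separated(sequence: str) -> str:
    """Put a single space between residues (ProtT5-style input)."""
    return " ".join(sequence)


def unchanged(sequence: str) -> str:
    """Return the sequence without separators (Ankh-style input)."""
    return sequence


# Hypothetical registry, keyed by a substring of the model name.
STRATEGIES: Dict[str, PreprocessingStrategy] = {
    "prot_t5": whitespace_separated,
    "ankh": unchanged,
}


def preprocess(sequence: str, model_name: str) -> str:
    """Apply the first strategy whose key occurs in the model name."""
    name = model_name.lower()
    for key, strategy in STRATEGIES.items():
        if key in name:
            return strategy(sequence)
    # Fail loudly rather than guess a format for an unknown tokenizer.
    raise ValueError(f"No preprocessing strategy registered for {model_name!r}")
```

With this sketch, `preprocess("MKV", "Rostlab/prot_t5_xl_uniref50")` yields a space-separated string, while `preprocess("MKV", "ElnaggarLab/ankh-base")` returns the sequence as given. Raising on unknown models is a deliberate choice here: a silently wrong input format would still tokenize, but into the wrong tokens.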

SebieF added the bug and breaking labels Aug 1, 2024

SebieF self-assigned this Aug 1, 2024

Using a strategy for sequence preprocessing

4ecf4f5

SebieF force-pushed the feature/handling_preprocessing_for_transformers branch from 327c7fe to 4ecf4f5 August 1, 2024 13:15

SebieF merged commit 865295f into sacdallago:develop Aug 1, 2024
1 check passed

SebieF deleted the feature/handling_preprocessing_for_transformers branch August 1, 2024 13:18

SebieF mentioned this pull request Aug 26, 2024

Version 0.9.2 #107

Merged
