Use a strategy for sequence preprocessing #99

Conversation

@SebieF (Collaborator) commented Aug 1, 2024

It was found that tokenizers for different pLMs work differently under the hood, so it was necessary to find the correct strategy for providing sequences to each tokenizer:

- ProtT5 pre-processing (https://github.com/agemagician/ProtTrans) uses whitespace between the amino acids.
- Ankh pre-processing (https://github.com/agemagician/Ankh) does not use whitespace between the amino acids.
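
A minimal sketch of how such a strategy could look, assuming a simple dict-based dispatch; the names below (`preprocess_prott5`, `preprocess_ankh`, `PREPROCESSING_STRATEGIES`, `preprocess`) are hypothetical and not taken from this PR's actual implementation:

```python
from typing import Callable, Dict


def preprocess_prott5(sequence: str) -> str:
    """ProtT5-style: insert whitespace between the amino acids."""
    return " ".join(sequence)


def preprocess_ankh(sequence: str) -> str:
    """Ankh-style: pass the raw sequence through unchanged."""
    return sequence


# Hypothetical registry mapping a pLM name to its preprocessing strategy
PREPROCESSING_STRATEGIES: Dict[str, Callable[[str], str]] = {
    "ProtT5": preprocess_prott5,
    "Ankh": preprocess_ankh,
}


def preprocess(sequence: str, plm_name: str) -> str:
    """Apply the preprocessing strategy registered for the given pLM."""
    if plm_name not in PREPROCESSING_STRATEGIES:
        raise ValueError(f"No preprocessing strategy registered for {plm_name}")
    return PREPROCESSING_STRATEGIES[plm_name](sequence)


# Example: ProtT5 gets a whitespace-separated sequence, Ankh the raw one
assert preprocess("SEQVENCE", "ProtT5") == "S E Q V E N C E"
assert preprocess("SEQVENCE", "Ankh") == "SEQVENCE"
```

Registering one strategy per pLM keeps tokenizer-specific quirks in a single place instead of scattering per-model conditionals across the embedding pipeline.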

@SebieF SebieF added bug Something isn't working breaking Breaking change labels Aug 1, 2024
@SebieF SebieF self-assigned this Aug 1, 2024
@SebieF SebieF force-pushed the feature/handling_preprocessing_for_transformers branch from 327c7fe to 4ecf4f5 on August 1, 2024 at 13:15
@SebieF SebieF merged commit 865295f into sacdallago:develop Aug 1, 2024
1 check passed
@SebieF SebieF deleted the feature/handling_preprocessing_for_transformers branch August 1, 2024 13:18
@SebieF SebieF mentioned this pull request Aug 26, 2024