Version 0.9.2 #107

SebieF · 2024-08-26T11:09:03Z

26.08.2024 - Version 0.9.2

Features

Improving memory management of embedding calculation by @SebieF in #96
Use a strategy for sequence preprocessing by @SebieF in #99
Adding ONNX support by @SebieF in #101
Adding saprot embedder example by @SebieF in #106

Maintenance

BREAKING: Improving masking mechanisms in CNN and LightAttention models by @SebieF in #102
Improving embedder model and tokenizer class recognition by @SebieF in #105
Optimize Memory Handling in Embedding Computations and Refactor EmbeddingService by @heispv in #103
Updating dependencies


- Extract core embedding service logic into embedding_service method
- Add special case handling for ultra-long reads:
  - Immediately save ultra-long read embeddings to disk
  - Avoid loading additional sequences when ultra-long read detected
- Dynamically calculate max embeddings that fit in available memory
- Use this to determine when to flush embeddings to disk
- Code cleanup:
  - Use type hints for improved readability
  - Docstrings for key methods
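The flushing behaviour listed above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the biotrainer implementation: `max_embeddings_that_fit`, `compute_with_flushing`, the `safety_fraction` parameter, and the per-embedding byte estimate are hypothetical names introduced for illustration, and the embedder and the disk writer are passed in as plain callables.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def max_embeddings_that_fit(available_bytes: int, bytes_per_embedding: int,
                            safety_fraction: float) -> int:
    """Hypothetical estimate of how many embeddings can be held in RAM at once."""
    if bytes_per_embedding <= 0:
        raise ValueError("bytes_per_embedding must be positive")
    usable = int(available_bytes * safety_fraction)
    # Always allow at least one embedding so that progress is possible
    return max(1, usable // bytes_per_embedding)


def compute_with_flushing(sequences: Iterable[Tuple[str, str]],
                          embed: Callable[[str], List[float]],
                          save: Callable[[Dict[str, List[float]]], None],
                          max_in_memory: int) -> int:
    """Embed sequences one by one and flush the buffer once it is full.

    Returns the number of flushes performed.
    """
    buffer: Dict[str, List[float]] = {}
    flushes = 0
    for seq_id, seq in sequences:  # a for loop terminates on any finite input
        buffer[seq_id] = embed(seq)
        if len(buffer) >= max_in_memory:
            save(buffer)
            buffer = {}  # drop references so the memory can be reclaimed
            flushes += 1
    if buffer:  # write whatever is left after the last full flush
        save(buffer)
        flushes += 1
    return flushes
```

In a real setting `available_bytes` would come from a system query and `save` would write to the embeddings file; both are left abstract here on purpose.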

…ogging
- Renamed embedding_service to _do_embeddings_computation for clarity.
- Replaced while loop with for loop to prevent infinite loops and simplify processing.
- Combined handling of ultra-long and normal reads into a single loop to reduce complexity.
- Updated logging levels for better clarity and reduced unnecessary logs.
- Optimized garbage collection by consolidating gc.collect() calls.

…gement
- Replacing tearDown and manual cleanup with tempfile.TemporaryDirectory() for automatic resource management.
- Adjusting long_length calculation to use a fixed value for local testing and a memory-based value for CI environments.
- Enhancing _run_embedding_test to support both sequence_to_class and residue_to_class protocols.
- Updating _verify_result to validate output based on the protocol used.
- Adding new tests for comprehensive coverage of both embedding protocols.
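As an illustration of the first point, a test that uses `tempfile.TemporaryDirectory()` as a context manager needs no `tearDown` at all. The sketch below is generic and not taken from the biotrainer test suite; the test class name and file name are hypothetical.

```python
import os
import tempfile
import unittest


class ExampleOutputTest(unittest.TestCase):
    def test_writes_file_into_temporary_directory(self):
        # The directory and its contents are removed when the block exits,
        # even if an assertion inside fails, so no tearDown is required.
        with tempfile.TemporaryDirectory() as tmp_dir:
            out_path = os.path.join(tmp_dir, "embeddings.txt")
            with open(out_path, "w") as handle:
                handle.write("placeholder")
            self.assertTrue(os.path.exists(out_path))
            created_dir = tmp_dir
        # After leaving the block the directory no longer exists
        self.assertFalse(os.path.exists(created_dir))
```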

- Moving progress bar initialization to ensure visibility during the first embedding calculation and updated its description.
- Consolidating memory cleanup: moving del self._embedder and gc.collect() to avoid repeated calls.
- Updating docstring in _max_embedding_fit to explain constants (0.75 and 18).

SebieF and others added 30 commits August 1, 2024 11:58

Improving memory management of embedding calculation

84830d0

1. Saving embeddings to file after a threshold is reached
2. Deleting objects and collecting memory after calculation; the embedder model in particular was sometimes causing trouble

Adding testing on develop branch to ci pipeline

1de4904

Using a strategy for sequence preprocessing

865295f

Tokenizers for different pLMs work differently under the hood, so it was necessary to find the correct strategy for providing the sequences to each tokenizer
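A strategy-based preprocessing step could look roughly like the sketch below. It is an assumption-laden illustration, not the code from this commit: the strategy names, the `STRATEGIES` registry, and the `preprocess` helper are hypothetical, and the two strategies shown (whitespace-separated residues with rare residues mapped to `X`, and unchanged input) are only examples of conventions that tokenizers may expect.

```python
import re
from typing import Callable, Dict, List


def preprocess_space_separated(sequences: List[str]) -> List[str]:
    # Map the rare/ambiguous residues U, Z, O, B to X, then put a space
    # between residues so that each one becomes its own token
    return [" ".join(re.sub(r"[UZOB]", "X", seq)) for seq in sequences]


def preprocess_plain(sequences: List[str]) -> List[str]:
    # Hand the sequences to the tokenizer unchanged
    return list(sequences)


# Hypothetical registry: one preprocessing strategy per tokenizer convention
STRATEGIES: Dict[str, Callable[[List[str]], List[str]]] = {
    "space_separated": preprocess_space_separated,
    "plain": preprocess_plain,
}


def preprocess(sequences: List[str], strategy_name: str) -> List[str]:
    try:
        strategy = STRATEGIES[strategy_name]
    except KeyError:
        raise ValueError(f"Unknown preprocessing strategy: {strategy_name}")
    return strategy(sequences)
```

Keeping the strategies in a lookup table means a new tokenizer convention only needs a new function and one registry entry.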

Replacing BatchNorm1D with LayerNorm in LightAttention model

4f12a7c

LayerNorm is more commonly used in NLP and avoids problems with batches of size 1
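One way to see why batches of size 1 are awkward for batch normalization is to compute both normalizations by hand. The following is a plain-Python sketch written for illustration only: it omits the learnable scale and shift, uses training-style batch statistics, and does not reproduce the LightAttention model code.

```python
import math
from typing import List


def layer_norm(features: List[float], eps: float = 1e-5) -> List[float]:
    """Normalise one sample across its own feature dimension."""
    mean = sum(features) / len(features)
    var = sum((x - mean) ** 2 for x in features) / len(features)
    return [(x - mean) / math.sqrt(var + eps) for x in features]


def batch_norm(batch: List[List[float]], eps: float = 1e-5) -> List[List[float]]:
    """Normalise each feature across the samples of a batch."""
    n = len(batch)
    n_features = len(batch[0])
    means = [sum(sample[j] for sample in batch) / n for j in range(n_features)]
    variances = [sum((sample[j] - means[j]) ** 2 for sample in batch) / n
                 for j in range(n_features)]
    return [[(sample[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(n_features)] for sample in batch]


single_sample = [1.0, 2.0, 3.0]
# With one sample, every feature equals its own batch mean, so the
# batch-normalised output carries no information about the input.
print(batch_norm([single_sample]))
# Layer normalisation only looks at the sample itself and is unaffected
# by how many other samples share the batch.
print(layer_norm(single_sample))
```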

Adding residues_to_value test model and updating residues_to_class test model

233a4ca

Adding get_dummy_input function and embedding specific protocols to Protocol class

274f637

Replacing reduced embeddings dict with Protocol methods

87be8ea

Making collate_functions more robust for batches of size 1 or 0

146a52c

Adding inference from and conversion to ONNX models

eda944d

Adding ONNX example

69b4f15

Adding missing empty line at end of file

604602a

Adding docstring for convert_to_onnx

2381380

Applying masking to CNN

7d5aa93

Padded residue embeddings are now masked out, which improves reproducibility and avoids different predictions between batched and single inputs

Improving masking in LightAttention model

484c79f

The mask is now applied before the attention convolution, and `-float('inf')` is used instead of `-1e9`, which seems to improve reproducibility and avoids different predictions between batched and single inputs
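The effect of filling padded positions with `-float('inf')` before a softmax can be checked in a few lines. This is a simplified plain-Python sketch, not the LightAttention code: `masked_softmax` is a hypothetical helper, and the convolution that the model applies is not modelled here.

```python
import math
from typing import List


def masked_softmax(scores: List[float], mask: List[bool]) -> List[float]:
    """Softmax over `scores` where positions with mask == False get zero weight."""
    if not any(mask):
        raise ValueError("at least one position must be unmasked")
    # Replace padded positions with -inf before normalising
    filled = [s if keep else -float("inf") for s, keep in zip(scores, mask)]
    peak = max(filled)  # finite, because at least one position is kept
    exps = [math.exp(v - peak) for v in filled]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# The same sequence, once on its own and once padded to length 5
single = masked_softmax([1.0, 2.0, 3.0], [True, True, True])
padded = masked_softmax([1.0, 2.0, 3.0, 0.0, 0.0],
                        [True, True, True, False, False])
```

Because the padded positions contribute exactly zero to the normalising sum, the weights of the real residues are the same whether the sequence is processed alone or inside a padded batch.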

Updating trained models for example and tests

8fa0860

Adding test_single_vs_batch_prediction

ea74d70

Making error_tolerance(_factor) an attribute to keep it consistent across tests

5d49ad3

Improving embedder model and tokenizer class recognition

505bfb1

Using AutoTokenizer does not work for all models, but can help if it provides the class name

Adding saprot embedder example

03d8fe9

Embedder: https://github.com/westlake-repl/SaProt
Discussion: J-SNACKKB/FLIP#26

Implementing dynamic memory handling for embedding computations

aaecd55

Enhanced the embedding computation function to manage RAM dynamically: it estimates the maximum number of embeddings that fit in memory and automatically saves them once that limit is reached

Removing SAVE_AFTER_N_EMBEDDINGS

9df72b2

Adding unit tests for EmbeddingService

d5420e8

Adding logging to clarify test duration in CI environments

35519a6

Updating dependencies

24ee278

Updating version

73adb65

Updating Changelog

b643d0d

SebieF added the enhancement (New feature or request), breaking (Breaking change), and maintenance (Code or example maintenance) labels Aug 26, 2024

SebieF self-assigned this Aug 26, 2024

SebieF merged commit 76a831e into sacdallago:main Aug 26, 2024
1 check passed

SebieF deleted the release/v-0-9-2 branch August 26, 2024 16:29
