Bug Fix: Preserve full FASTA ID in alignment directory parsing to prevent truncation errors #1

juliocesar-io · 2024-09-20T05:07:30Z

Background

When running inference with run_pretrained_openfold.py on precomputed alignments, the parse_fasta function extracts only part of the FASTA tag/ID that was used to name the alignments output folder. It cuts the ID at the first special character, such as a hyphen (-) or period (.), both of which are common in FASTA IDs.

This causes inference to fail, because the truncated ID does not match the name of the alignments folder.

For example, if you have a FASTA file like this:

>my-fasta-sequence
AABBCC

Then, after running the precompute_alignments.py script, the following alignments are generated (as expected):

├── input
│   └── fasta_dir
│       └── my-fasta-sequence.fasta
├── output
│   ├── alignments
│   │   └── my-fasta-sequence
│   │       ├── bfd_uniclust_hits.a3m
│   │       ├── hhsearch_output.hhr
│   │       ├── mgnify_hits.sto
│   │       └── uniref90_hits.sto

However, when you run the run_pretrained_openfold.py script with the --use_precomputed_alignments flag, you will encounter the following error:

Traceback (most recent call last):
  File "/opt/openfold/run_pretrained_openfold.py", line 499, in <module>
    main(args)
  File "/opt/openfold/run_pretrained_openfold.py", line 299, in main
    feature_dict = generate_feature_dict(
  File "/opt/openfold/run_pretrained_openfold.py", line 151, in generate_feature_dict
    feature_dict = data_processor.process_fasta(
  File "/opt/openfold/openfold/data/data_pipeline.py", line 883, in process_fasta
    hits = self._parse_template_hit_files(
  File "/opt/openfold/openfold/data/data_pipeline.py", line 795, in _parse_template_hit_files
    for f in os.listdir(alignment_dir):
FileNotFoundError: [Errno 2] No such file or directory: '/run_path/output/alignments/my'

Fix

The error occurs because parse_fasta truncates the ID, so the pipeline looks for a directory named "my" instead of the expected "my-fasta-sequence". I have updated the parse_fasta function to fix this issue.

Previously, the part of the code that split the IDs using the regex (re.split('\W|\|', t)) was cutting off parts of the ID. For the workflow using precomputed alignments to function correctly, the full ID must be preserved so that it matches the folder.
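To illustrate the truncation, here is a minimal sketch that applies the same regex to the example tag. It assumes that only the first piece of the split was kept as the ID, which is consistent with the 'my' directory in the traceback; the variable names are illustrative and not taken from the codebase.

```python
import re

tag = "my-fasta-sequence"

# Split on any non-word character (or a literal "|"), as the old code did.
pieces = re.split(r"\W|\|", tag)

# Assumption: the old code kept only the first piece as the ID.
truncated = pieces[0]

print(pieces)     # ['my', 'fasta', 'sequence']
print(truncated)  # my
```

Because the hyphen is a non-word character, everything after the first hyphen is dropped, and the resulting ID no longer matches the folder created by precompute_alignments.py.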

Changes:

Each entry is now split into the tag (header) and the sequence, while preserving the entire header.
The regex splitting that truncated the header has been removed, so the entire line after > is treated as the ID.
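The two changes above can be sketched roughly as follows. This is an illustrative reconstruction, not the merged code: the name parse_fasta_full_header and its line-by-line structure are assumptions, and the actual parse_fasta in the repository may be written differently.

```python
def parse_fasta_full_header(data):
    """Hypothetical sketch: return (tags, seqs), keeping each header whole."""
    tags, seqs = [], []
    for line in data.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Keep everything after ">" as the ID; no regex splitting.
            tags.append(line[1:].strip())
            seqs.append("")
        elif tags:
            # Sequence lines (possibly wrapped) are joined onto the current entry.
            seqs[-1] += line
    return tags, seqs


tags, seqs = parse_fasta_full_header(">my-fasta-sequence\nAABBCC\n")
print(tags)  # ['my-fasta-sequence']
print(seqs)  # ['AABBCC']
```

With the header preserved, the tag matches the my-fasta-sequence folder under output/alignments, so the lookup in the traceback above would find the directory.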

fix bug in parse fasta

3887d40

juliocesar-io self-assigned this Sep 20, 2024

juliocesar-io added the bug Something isn't working label Sep 20, 2024

juliocesar-io merged commit ac3c8bc into fastfold-server-optimizations Sep 20, 2024
