Bug Fix: Preserve full FASTA ID in alignment directory parsing to prevent truncation errors #1

Merged (1 commit), Sep 20, 2024

juliocesar-io
Owner

Background

When running inference with run_pretrained_openfold.py and precomputed alignments, the parse_fasta function extracts only part of the FASTA tag/ID that was used to name the alignments output folder. It cuts the ID at special characters such as hyphens (-) or periods (.), which are common in FASTA IDs.

Inference then fails because the truncated ID does not match the name of the alignments folder.

For example, if you have a FASTA file like this:

>my-fasta-sequence
AABBCC

Then, after running the precompute_alignments.py script, the following alignments are generated (as expected):

├── input
│   └── fasta_dir
│       └── my-fasta-sequence.fasta
├── output
│   ├── alignments
│   │   └── my-fasta-sequence
│   │       ├── bfd_uniclust_hits.a3m
│   │       ├── hhsearch_output.hhr
│   │       ├── mgnify_hits.sto
│   │       └── uniref90_hits.sto

However, when you run the run_pretrained_openfold.py script with the --use_precomputed_alignments flag, you will encounter the following error:

Traceback (most recent call last):
  File "/opt/openfold/run_pretrained_openfold.py", line 499, in <module>
    main(args)
  File "/opt/openfold/run_pretrained_openfold.py", line 299, in main
    feature_dict = generate_feature_dict(
  File "/opt/openfold/run_pretrained_openfold.py", line 151, in generate_feature_dict
    feature_dict = data_processor.process_fasta(
  File "/opt/openfold/openfold/data/data_pipeline.py", line 883, in process_fasta
    hits = self._parse_template_hit_files(
  File "/opt/openfold/openfold/data/data_pipeline.py", line 795, in _parse_template_hit_files
    for f in os.listdir(alignment_dir):
FileNotFoundError: [Errno 2] No such file or directory: '/run_path/output/alignments/my'
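
The traceback above only surfaces deep inside the data pipeline. As a minimal sketch of how the mismatch could be spotted earlier, the check below lists the tags that have no directory of the same name under the alignments folder. The helper name `missing_alignment_dirs` is hypothetical and is not part of OpenFold; the temporary directory stands in for output/alignments.

```python
import os
import tempfile
from typing import Iterable, List


def missing_alignment_dirs(tags: Iterable[str], alignment_root: str) -> List[str]:
    """Return the tags that have no directory named after them under alignment_root."""
    return [
        tag for tag in tags
        if not os.path.isdir(os.path.join(alignment_root, tag))
    ]


# Usage with a throwaway directory standing in for output/alignments.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "my-fasta-sequence"))
    # The full ID is found; the truncated ID "my" is reported as missing.
    print(missing_alignment_dirs(["my-fasta-sequence", "my"], root))  # ['my']
```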

Fix

The error occurs because parse_fasta truncates the ID, so the pipeline looks for an alignment directory named "my" instead of the expected "my-fasta-sequence". I have updated the parse_fasta function to fix this issue.

Previously, the code split each ID with a regex (re.split('\W|\|', t)), which cut off parts of the ID. Because \W matches any character other than a letter, digit, or underscore, both hyphens and periods triggered a split. For the workflow using precomputed alignments to function correctly, the full ID must be preserved so that it matches the folder name.
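
The truncation can be reproduced in isolation. The sketch below applies the same regex to a few IDs; keeping only the first piece of the split is an assumption inferred from the traceback (which looks for 'my'), and `truncated_tag` is a hypothetical helper, not a function in OpenFold.

```python
import re


def truncated_tag(header: str) -> str:
    # Split at every non-word character (or a literal "|") and keep the
    # first piece. Taking index 0 is an assumption based on the traceback.
    return re.split(r'\W|\|', header)[0]


print(truncated_tag("my-fasta-sequence"))  # my
print(truncated_tag("sample.v2"))          # sample
print(truncated_tag("seq_01"))             # seq_01 (underscore is a word character)
```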

Changes:

  • Each entry is now split into the tag (header) and the sequence, while preserving the entire header.
  • The regex splitting that truncated the header has been removed, so the entire line after > is treated as the ID.
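
The two changes above can be sketched as follows. This is an illustration of the approach under stated assumptions, not the code merged in this PR: the function name `parse_fasta_keep_full_id` is hypothetical, and it assumes that headers do not themselves contain a ">" character.

```python
from typing import List, Tuple


def parse_fasta_keep_full_id(data: str) -> Tuple[List[str], List[str]]:
    """Return (tags, seqs), using the whole header line after '>' as the tag."""
    tags: List[str] = []
    seqs: List[str] = []
    for entry in data.split(">"):
        entry = entry.strip()
        if not entry:
            continue  # skip text before the first '>' and empty records
        # Split the entry into the header line and the sequence body.
        header, _, body = entry.partition("\n")
        tags.append(header.strip())
        # Join wrapped sequence lines into a single string.
        seqs.append("".join(body.split()))
    return tags, seqs


tags, seqs = parse_fasta_keep_full_id(">my-fasta-sequence\nAABBCC\n")
print(tags, seqs)  # ['my-fasta-sequence'] ['AABBCC']
```

Because the whole header line is kept, any description text after the ID also becomes part of the tag, so the alignments directory has to be named to match.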

@juliocesar-io juliocesar-io self-assigned this Sep 20, 2024
@juliocesar-io juliocesar-io added the bug Something isn't working label Sep 20, 2024
@juliocesar-io juliocesar-io merged commit ac3c8bc into fastfold-server-optimizations Sep 20, 2024