Usage for structure prediction tasks #726

Open
padix-key opened this issue Dec 20, 2024 · 5 comments
Labels
example idea: An idea for a new example in the gallery

Comments

@padix-key
Member

padix-key commented Dec 20, 2024

Biotite is integrated into the workflows of many structure prediction models. Hence, we could add an example script that serves as a loose collection of possible uses of Biotite in this context.

Input topics:

  • One-hot sequence encoding (using ProteinSequence.code)
  • Getting the assembly (structure.io.pdbx.get_assembly())
  • Filtering high-quality structures from AFDB as training samples
  • Secondary structure as feature (DsspApp, structure.annotate_sse())

Output topics:

This list is probably not exhaustive, so if anyone has additional ideas, please add them to the issue!

Notably, this script should not run any model itself. It is only about preparing features for a hypothetical model and evaluating the output structure poses (e.g. taken from AlphaFold DB).

padix-key added the "example idea" label Dec 20, 2024
@cwognum

cwognum commented Jan 27, 2025

Similar idea, but a bit of a tangent: I think using Biotite for ligand posing tasks (or blind docking) would be useful.

We're currently using OpenStructure's compare-ligand-structures command to get the RMSD, LDDT-PLI and LDDT-LP for evaluating a bunch of protein-ligand co-folding methods. This works, but I would love to switch over to Biotite for this.

OpenStructure supports a much broader use case than just scoring (and is thus a heavy dependency), has limited cross-platform support (and can thus be hard to install), and has not been primarily built for an ML audience (e.g. the input is a PDB or CIF file, rather than some Pythonic representation of a system). On the other hand, we've seen AlphaFold3 and several of its replications reimplement such scores themselves, see e.g. LDDT in AlphaFold, Boltz-1 and OpenFold, but these solutions were implemented specifically for those models and are not as robust as OpenStructure. Having a centralized, more generic solution in Biotite would still be valuable.

After a quick search, I found the following:

  • rmsd() is already supported in Biotite.
  • lddt() is not officially included yet in the latest release, but I did notice that it was implemented in #699.
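To make the two scores concrete, here is a minimal NumPy sketch of what they measure. It is not Biotite's implementation and does not reproduce the API from #699. The lDDT variant is deliberately simplified: it scores all atom pairs globally and does not exclude pairs within the same residue, as a full implementation typically would. The 15 Å inclusion radius and the 0.5/1/2/4 Å thresholds are the commonly used defaults, assumed here.

```python
import numpy as np


def rmsd(reference, model):
    """RMSD between two (N, 3) coordinate arrays, without superposition."""
    diff = reference - model
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))


def lddt(reference, model, inclusion_radius=15.0,
         thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified global lDDT-style score over all atom pairs."""
    ref_dist = np.linalg.norm(reference[:, None] - reference[None, :], axis=-1)
    mod_dist = np.linalg.norm(model[:, None] - model[None, :], axis=-1)
    # Consider each unordered pair once, and only if it is close in the reference
    i, j = np.triu_indices(len(reference), k=1)
    mask = ref_dist[i, j] < inclusion_radius
    if not mask.any():
        return float("nan")
    deviation = np.abs(ref_dist[i, j] - mod_dist[i, j])[mask]
    # Fraction of preserved distances, averaged over the thresholds
    return float(np.mean([(deviation < t).mean() for t in thresholds]))


rng = np.random.default_rng(0)
reference = rng.uniform(0.0, 10.0, size=(30, 3))
shifted = reference + np.array([1.0, 0.0, 0.0])
```

Because lDDT compares internal distances only, a rigid shift such as `shifted` changes the RMSD but leaves the lDDT-style score unaffected.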

As of now, the following functionality seems to be missing from Biotite:

  1. Symmetry corrections by reordering atoms according to the molecular graph isomorphisms. Since Biotite already depends on networkx, we could use its GraphMatcher, similar to what is done in spyrmsd.
  2. The identification of a binding site by filtering the residues of a receptor based on the minimum distance between all of the residue's heavy atoms and all heavy atoms of the ligand. To be fair, this already seems possible using distance() and some clever filtering.
  3. Chain mapping. The purpose of this one is not yet entirely clear to me, but it seems this is done to align the reference and predicted binding site.
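Items 1 and 2 can be sketched with NumPy alone, under a few loud assumptions. The symmetry correction below enumerates graph automorphisms by brute force over all permutations, which is only feasible for a handful of atoms; it stands in for the networkx `GraphMatcher` approach mentioned above and is not how spyrmsd or OpenStructure do it. The binding-site filter assumes that hydrogens were already removed and uses a hypothetical 4 Å cutoff. All function names are made up for illustration.

```python
import itertools

import numpy as np


def automorphisms(elements, bonds):
    """Brute force: yield atom permutations preserving elements and bonds."""
    n = len(elements)
    bond_set = {frozenset(b) for b in bonds}
    for perm in itertools.permutations(range(n)):
        if any(elements[i] != elements[perm[i]] for i in range(n)):
            continue
        if {frozenset((perm[a], perm[b])) for a, b in bonds} == bond_set:
            yield np.array(perm)


def symmetry_corrected_rmsd(reference, model, elements, bonds):
    """Minimum RMSD over all symmetry-equivalent atom orderings."""
    return min(
        float(np.sqrt(((reference - model[perm]) ** 2).sum(axis=1).mean()))
        for perm in automorphisms(elements, bonds)
    )


def binding_site_mask(receptor_coord, residue_ids, ligand_coord, cutoff=4.0):
    """Mask of receptor atoms in residues with any atom near the ligand."""
    dist = np.linalg.norm(
        receptor_coord[:, None] - ligand_coord[None, :], axis=-1
    )
    close_atoms = dist.min(axis=1) <= cutoff
    site_residues = np.unique(residue_ids[close_atoms])
    return np.isin(residue_ids, site_residues)


# Toy carboxylate-like fragment: the two oxygens are interchangeable
elements = ["C", "O", "O"]
bonds = [(0, 1), (0, 2)]
ref_ligand = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0], [-1.2, 0.0, 0.0]])
swapped_ligand = ref_ligand[[0, 2, 1]]

# Toy receptor: two residues with two atoms each
receptor = np.array(
    [[3.0, 0.0, 0.0], [12.0, 0.0, 0.0], [20.0, 0.0, 0.0], [21.0, 0.0, 0.0]]
)
residue_ids = np.array([1, 1, 2, 2])
site = binding_site_mask(receptor, residue_ids, ref_ligand)
```

In the toy example, a naive RMSD between `ref_ligand` and `swapped_ligand` is non-zero although both describe the same pose; taking the minimum over the automorphisms removes this artifact. The mask selects whole residues, so the second atom of residue 1 is included even though it lies far from the ligand.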

Is this something you would be interested in supporting through Biotite? If so, any thoughts on how to implement this? I would be happy to help!

@cwognum

cwognum commented Jan 29, 2025

@padix-key I've raised the above proposal in various other groups and there's a need that Biotite could address.

I think I can get some folks together to work on this. If you and the other maintainers agree that these are features that you would like to have in Biotite, I would really appreciate your guidance on how to go about implementing this.

@padix-key
Member Author

Hi @cwognum, at VantAI we are currently polishing a package that does more or less exactly what you proposed: It performs atom matching between the reference and the predicted model (from small molecules to chains) and runs metrics on the matched AtomArrays, for both pure protein and protein-ligand models. We plan to make it openly available as an extension package soon, but it will still take a few weeks. It is designed with extensibility in mind, so I would appreciate collaboration then to add further evaluation metrics and improve the atom matching. I will keep you up to date!

@cwognum

cwognum commented Jan 30, 2025

@padix-key Cool stuff! Is there any way in which we can help accelerate the release of the package? Like I said, there's a group of folks who would love to see this happen and who are open to contributing. Could it be an idea to open-source it already and have some folks test it prior to the official release and launch?

@Croydon-Brixton
Contributor

Croydon-Brixton commented Feb 1, 2025

Great to see this discussion (:
Just to weigh in here for further support: Biotite is used extensively at the IPD / Baker lab as well for bio-data wrangling and in ML workflows (dataset preparation & evaluation). So this would be very much of interest to the academic community too. Happy to help out where I can be helpful.
