mobley_3323117 (sulfolane) has non-standard SMILES #51

Open
jchodera opened this issue Dec 10, 2022 · 1 comment

Comments

@jchodera
Contributor

Molecule mobley_3323117 (sulfolane) is written with the non-standard SMILES C1CC[S+2](C1)([O-])[O-], rather than the more standard C1CCS(=O)(=O)C1.

Although the two forms have the same total charge, they are not equivalent: the FreeSolv SMILES assigns formal charges (+2 on S, -1 on each O), whereas every atom in the standard SMILES has a formal charge of 0. The OpenFF toolkit (with the OpenEye backend) therefore produces different molecular representations for them:

>>> from openff.toolkit.topology import Molecule
>>> freesolv_molecule = Molecule.from_smiles('C1CC[S+2](C1)([O-])[O-]')
>>> standard_molecule = Molecule.from_smiles('C1CCS(=O)(=O)C1')
>>> freesolv_molecule.generate_unique_atom_names()
>>> standard_molecule.generate_unique_atom_names()
>>> [(atom.name, atom.formal_charge.m) for atom in freesolv_molecule.atoms]
[('C1x', 0), ('C2x', 0), ('C3x', 0), ('S1x', 2), ('C4x', 0), ('O1x', -1), ('O2x', -1), ('H1x', 0), ('H2x', 0), ('H3x', 0), ('H4x', 0), ('H5x', 0), ('H6x', 0), ('H7x', 0), ('H8x', 0)]
>>> [(atom.name, atom.formal_charge.m) for atom in standard_molecule.atoms]
[('C1x', 0), ('C2x', 0), ('C3x', 0), ('S1x', 0), ('O1x', 0), ('O2x', 0), ('C4x', 0), ('H1x', 0), ('H2x', 0), ('H3x', 0), ('H4x', 0), ('H5x', 0), ('H6x', 0), ('H7x', 0), ('H8x', 0)]
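
To make the difference explicit, here is a minimal sketch that works only on the two lists printed above (copied verbatim, so it needs no toolkit). The helper names `total_charge` and `charged_atoms` are made up for illustration and are not part of any library.

```python
# Formal charges as printed by the session above (copied verbatim).
freesolv = [('C1x', 0), ('C2x', 0), ('C3x', 0), ('S1x', 2), ('C4x', 0),
            ('O1x', -1), ('O2x', -1), ('H1x', 0), ('H2x', 0), ('H3x', 0),
            ('H4x', 0), ('H5x', 0), ('H6x', 0), ('H7x', 0), ('H8x', 0)]
standard = [('C1x', 0), ('C2x', 0), ('C3x', 0), ('S1x', 0), ('O1x', 0),
            ('O2x', 0), ('C4x', 0), ('H1x', 0), ('H2x', 0), ('H3x', 0),
            ('H4x', 0), ('H5x', 0), ('H6x', 0), ('H7x', 0), ('H8x', 0)]


def total_charge(atoms):
    """Sum the formal charges over all (name, charge) pairs."""
    return sum(charge for _, charge in atoms)


def charged_atoms(atoms):
    """Return only the atoms that carry a non-zero formal charge."""
    return {name: charge for name, charge in atoms if charge != 0}


# Both molecules are neutral overall ...
print(total_charge(freesolv), total_charge(standard))  # 0 0
# ... but only the FreeSolv form carries per-atom formal charges.
print(charged_atoms(freesolv))  # {'S1x': 2, 'O1x': -1, 'O2x': -1}
print(charged_atoms(standard))  # {}
```

Note that the atom ordering also differs between the two lists (the ring-closing carbon comes before the oxygens in one and after them in the other), so the lists are compared by name rather than by position.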

Would it be reasonable to correct the non-standard SMILES string and re-generate the database?
Or are there ways to automatically standardize the formal charges?

@davidlmobley
Member

This SMILES (or all of them) could be canonicalized. I think the best option, however, is to fix only THIS SMILES. Otherwise we run into the problem of "what should be considered the authoritative identifier for a molecule, from which everything else can be generated?" We've moved to treating the SMILES as the authoritative source data, so re-generating all SMILES from the existing SMILES is probably unwise: we would be overwriting our authoritative source data.

So, perhaps we should correct only this one?
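
A one-entry correction could look roughly like the sketch below. It assumes the database is loaded as a dict keyed by compound ID with a `smiles` field per entry; that layout, the helper name `fix_single_smiles`, and the second toy entry are assumptions for illustration, not the actual FreeSolv code or data.

```python
NONSTANDARD = "C1CC[S+2](C1)([O-])[O-]"
STANDARD = "C1CCS(=O)(=O)C1"


def fix_single_smiles(database, compound_id, expected_old, new_smiles):
    """Return a copy of `database` with one entry's SMILES replaced.

    Refuses to act unless the stored SMILES matches `expected_old`, so an
    entry that has already been changed is never silently overwritten.
    """
    entry = database[compound_id]
    if entry["smiles"] != expected_old:
        raise ValueError(
            f"{compound_id}: expected {expected_old!r}, found {entry['smiles']!r}"
        )
    patched = dict(database)  # shallow copy; all other entries are untouched
    patched[compound_id] = {**entry, "smiles": new_smiles}
    return patched


# Toy stand-in for the database (assumed layout, hypothetical second entry).
database = {
    "mobley_3323117": {"smiles": NONSTANDARD},
    "example_other": {"smiles": "CCO"},
}

fixed = fix_single_smiles(database, "mobley_3323117", NONSTANDARD, STANDARD)
print(fixed["mobley_3323117"]["smiles"])  # C1CCS(=O)(=O)C1
```

Checking the old value before writing keeps every other SMILES exactly as stored, which fits treating them as the authoritative source.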
