Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Fix 491 #509

Merged
merged 11 commits into from
Feb 14, 2020
Merged

Fix 491 #509

merged 11 commits into from
Feb 14, 2020

Conversation

trevorgokey
Copy link
Collaborator

@trevorgokey trevorgokey commented Feb 9, 2020

  • Tag issue being addressed
  • Add tests

This turns off the max search limit in the toolkits so that parameters are assigned to the entire molecule rather than terminating early. This PR tests the ability to

  • Parse a large molecule, in this case a C-C bond in an n-alkane where n=3000
  • Create an OpenMM system of T4-protein.mol2, which is in the data/proteins folder

An issue that this brings (#510) is that after loading the protein, the formal charge is believed to be -55, although the partial charges in the mol2 sum to 8. This check is skipped in OpenMM, and is not addressed in this PR.

Closes #491

@codecov-io
Copy link

Codecov Report

❗ No coverage uploaded for pull request base (master@f73a7dc). Click here to learn what that means.
The diff coverage is 88.76%.

@jchodera
Copy link
Member

jchodera commented Feb 9, 2020

@trevorgokey : Thanks for tackling this, and for adding the test!

Why would you need to use allow_nonintegral_charges if the charge is integral but nonzero? I imagine it's the case that whatever charge sum error is causing issues is resulting in a net charge that is not close to the integer -55, which is the real issue in requiring allow_nonintegral_charges=True.

Is there any danger in removing the bound on the maximum number of matches for all cases?

Finally, out of curiosity, how slow is it in applying parameters to an entire protein? I think we'll need to look into how badly this scales with biopolymer size.

Copy link
Member

@jchodera jchodera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, provided nobody else can come up with a reason why removing the limit on the number of matches in all cases might be a bad idea. I can't think of a reason why this would be problematic at this point.

@trevorgokey
Copy link
Collaborator Author

@jchodera Good to hear that it seems ok; I tried to write tests that match the others as closely as possible.

The allow_nonintegral_charges issue had to be addressed because OpenMM was raising an exception, and causing the tests to fail. The error is not really that the partial charges don't sum to an integer, but that they don't match the formal charge. I think the error is misleading in this case. You can see the output of this error in #510.

And as for the allow_undefined_stereo issue, I just bypassed it since it was not really part of the tests. With the updates described below, this can be set to False again since it checks out now.

I used both my own old amber14 installation, and ambermini from my conda env (16.16.0) and I get the same output. The bond orders are all 1. That said, plugging this into OE works and gets me 'am' and 'ar' bond orders. I inserted the charges I got from ff14SB and reran the tests with sanity checks turned back on, and they passed.

I want to say label_parameters takes about ~2 minutes, and create_openmm_system about ~5 minutes for this protein with 2634 atoms.

Thanks for the catch, I was weary using/saving the mol2 file, and glad you had the eyes to quickly catch what was wrong.

@jchodera
Copy link
Member

Thanks for the timing info! It's clear we will need to start another issue to talk about how to make biopolymer parameterizarion fast enough to be usable with openff-1.0.0. If you have a chance to do some timings with the RDKit and OpenEye backends and open a new issue about addressing the speed problems, that would be awesome!

Copy link
Member

@j-wags j-wags left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

provided nobody else can come up with a reason why removing the limit on the number of matches in all cases might be a bad idea

I can't see any reason why this would be the case. This seems like a complete win.

Looks great -- Could you just update the comment in the protein test, and mention the new entries in docs/releasehistory.rst before you merge?

"""Test OpenEyeToolkitWrapper substructure search when a large number hits are found"""

tk = OpenEyeToolkitWrapper()
smiles = "C"*3000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha!

Comment on lines 2423 to 2425
# choose the largest unsigned int without overflow
# since the C++ signature is a uint
max_matches = np.iinfo(np.uintc).max
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fancy. I like it.

Comment on lines 1962 to 1967
# Need to set allow_nonintegral_charges=True in create_openmm_system
# since it believes the formal charge is -55
# and the sum of partial charges is ~8.0
# Happens because all bond orders in the mol2 file were 1 (made from tleap),
# causing the errors in formal charge calculation (#512)
# Fixed using OE to make the mol2, and inserting ff14SB charges
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first part of this comment is no longer applicable -- Could you reword it?

@davidlmobley
Copy link
Contributor

@trevorgokey looks like you are almost done here, just needs to have a couple changes and resolve conflicts.

@j-wags j-wags added this to the 0.7.0 milestone Feb 14, 2020
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Maximum Smarts Matches for Larger Molecules
5 participants