The 2019.03 RDKit release includes an optional integration with a semi-empirical quantum mechanics package called YAeHMOP. If you build the RDKit yourself, this is disabled by default, but I include it when I do the RDKit conda builds.
There's no documentation for this integration yet... thus this blog post.
There are a couple of applications where I think it might be interesting to have access to quick, interpretable QM results from the RDKit. These include getting partial charges and starting to think about quick and dirty descriptors for things like bond strengths and reactivity. So I decided to do an integration of YAeHMOP into the RDKit.
There are a bunch of books and papers describing extended Hueckel theory and the various types of analysis that one can do with it; I'm not going to cite all of those here. There's also no formal publication describing YAeHMOP (what a shock!), but there is a chapter in my PhD thesis that provides an fairly comprehensive tutorial overview of the theory.
There's no documentation for this integration yet... thus this blog post.
Some background¶
One of my major projects in grad school was to write a software package for performing extended Hueckel calculations and analyzing and visualizing the results. I released the source for YAeHMOP ("Yet Another extended Hueckel Molecular Orbital Package") way back when using some weird license (the open-source landscape in chemistry was "a bit" different back in the mid-90s). After changing fields, I more or less forgot about YAeHMOP until a couple of years ago when Patrick Avery contacted me about integrating it into Avogadro. To make things easier, I put the source up on Github and switched to a BSD license. Since that point we've done a bit of additional cleanup work on the eHT code (the visualization pieces are still there, but really only have historic interest) and made it useable as a library.There are a couple of applications where I think it might be interesting to have access to quick, interpretable QM results from the RDKit. These include getting partial charges and starting to think about quick and dirty descriptors for things like bond strengths and reactivity. So I decided to do an integration of YAeHMOP into the RDKit.
There are a bunch of books and papers describing extended Hueckel theory and the various types of analysis that one can do with it; I'm not going to cite all of those here. There's also no formal publication describing YAeHMOP (what a shock!), but there is a chapter in my PhD thesis that provides an fairly comprehensive tutorial overview of the theory.
In [1]:
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import rdDepictor
rdDepictor.SetPreferCoordGen(True)
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_3d = True
# this is the package including the connection to YAeHMOP
from rdkit.Chem import rdEHTTools
from rdkit.Chem import AllChem
import numpy as np
import rdkit
print(rdkit.__version__)
import time
print(time.asctime())
In [2]:
%pylab inline
Start by showing how to run an eHT calculation:
In [3]:
# imatinib
m = Chem.MolFromSmiles('Cc1ccc(NC(=O)c2ccc(CN3CCN(C)CC3)cc2)cc1Nc1nccc(-c2cccnc2)n1')
m
Out[3]:
We need to add Hs and generate a conformation:
In [4]:
mh = Chem.AddHs(m)
AllChem.EmbedMultipleConfs(mh,50);
In [5]:
passed,res = rdEHTTools.RunMol(mh)
The first return value lets us know if the calculation succeeded:
In [6]:
passed
Out[6]:
The second contains the results of the calculation:
In [7]:
res
Out[7]:
We have access to the atomic charges:
In [8]:
res.GetAtomicCharges()
Out[8]:
The reduced overlap population matrix provides the Mulliken overlap population between atoms in the molecule. It's returned as a vector representing a symmetric matrix:
In [9]:
res.GetReducedOverlapPopulationMatrix()[:6]
Out[9]:
In [10]:
# convenience function to set a "MulilkenOverlapPopulation" property on bonds.
def set_overlap_populations(m,ehtRes):
rop = ehtRes.GetReducedOverlapPopulationMatrix()
for bnd in m.GetBonds():
a1 = bnd.GetBeginAtom()
a2 = bnd.GetEndAtom()
if a1.GetAtomicNum()==1:
continue
if a2.GetAtomicNum()==1:
continue
# symmetric matrix:
i1 = max(a1.GetIdx(),a2.GetIdx())
i2 = min(a1.GetIdx(),a2.GetIdx())
idx = (i1*(i1+1))//2 + i2
bnd.SetDoubleProp("MullikenOverlapPopulation",rop[idx])
In [11]:
set_overlap_populations(mh,res)
Look at a few of the overlap populations:
In [12]:
for bnd in list(mh.GetBonds())[:5]:
if not bnd.HasProp("MullikenOverlapPopulation"):
continue
print(f'{bnd.GetIdx()} {bnd.GetBeginAtom().GetSymbol()}({bnd.GetBeginAtomIdx()})-{bnd.GetEndAtom().GetSymbol()}({bnd.GetEndAtomIdx()}) {bnd.GetBondType()} {bnd.GetDoubleProp("MullikenOverlapPopulation") :.3f}')
The wrapper also currently provides the reduced charge matrix (described in that thesis chapter linked above). I'm not going to discuss that here.
Look at variability of charges between conformers¶
The rest of this blog post is going to show a straightforward analysis one can do with the eHT results: looking at the variability of the QM charges across the different conformers.
In [13]:
def runEHT(mh):
""" convenience function to run an eHT calculation on all of a molecule's conformers
returns a list of the results structures as well as a list of lists containing the
charges on each atom in each conformer: chgs[atomId][confId] to use this
"""
eres = []
charges = [[] for x in range(mh.GetNumHeavyAtoms())]
for cid in range(mh.GetNumConformers()):
passed,res = rdEHTTools.RunMol(mh,confId=cid)
eres.append(res)
if not passed:
raise ValueError("eHT failed")
hvyIdx = 0
echgs = res.GetAtomicCharges()
for atom in mh.GetAtoms():
if atom.GetAtomicNum()==1:
continue
charges[hvyIdx].append(echgs[atom.GetIdx()])
hvyIdx+=1
return (eres,charges)
In [14]:
eres,charges = runEHT(mh)
In [15]:
figsize(18,12)
violinplot(charges);
title("ETKDG");
xlabel('atom ID');
ylabel("partial charge");
To make sense of that, draw the molecule with the atoms numbered:
In [16]:
for at in m.GetAtoms():
at.SetProp("atomLabel",f'{at.GetSymbol()}{at.GetIdx()}')
m
Out[16]:
Look at the results for a couple of the atoms that show the most variability
In [17]:
tchgs = sorted((y,x) for x,y in enumerate(charges[13]))
tchgs[:2],tchgs[-2:]
Out[17]:
In [18]:
tchgs = sorted((y,x) for x,y in enumerate(charges[16]))
tchgs[:2],tchgs[-2:]
Out[18]:
In [19]:
tm1 = Chem.MolFromMolBlock(Chem.MolToMolBlock(mh,confId=1))
tm1
Out[19]:
In [20]:
tm24 = Chem.MolFromMolBlock(Chem.MolToMolBlock(mh,confId=24))
tm24
Out[20]:
The difference here is the conformation of the piperazine ring. In the first case (more negative N), the ring is in the chair form. In the second (less negative N), it's in the boat form.
In [21]:
tchgs = sorted((y,x) for x,y in enumerate(charges[36]))
tchgs[:2],tchgs[-2:]
Out[21]:
In [22]:
tchgs = sorted((y,x) for x,y in enumerate(charges[26]))
tchgs[:2],tchgs[-2:]
Out[22]:
In [23]:
tm5 = Chem.MolFromMolBlock(Chem.MolToMolBlock(mh,confId=5))
tm5
Out[23]:
In [24]:
tm2 = Chem.MolFromMolBlock(Chem.MolToMolBlock(mh,confId=2))
tm2
Out[24]:
The difference here is the orientation of the two Ns in the pyrimidine ring relative to the H on the exocyclic amine (atom 24). The N that is closer to the H has a more negative charge than the other.
What if we do an MMFF minimization of each conformer?
In [25]:
mh_mmff = Chem.Mol(mh)
optRes = AllChem.MMFFOptimizeMoleculeConfs(mh_mmff,maxIters=500)
for i,needsMore in enumerate(optRes):
while needsMore:
needsMore = AllChem.MMFFOptimizeMolecule(mh_mmff,maxIters=500,confId=i)
In [26]:
eres,mmff_charges = runEHT(mh_mmff)
In [27]:
figsize(18,12)
violinplot(mmff_charges);
title("MMFF");
xlabel('atom ID');
ylabel("partial charge");
Are the MMFF conformers actually different from each other?
In [28]:
rmsds = []
cp1 = Chem.RemoveHs(mh_mmff)
cp2 = Chem.RemoveHs(mh_mmff)
for i in range(mh_mmff.GetNumConformers()):
for j in range(i+1,mh_mmff.GetNumConformers()):
rmsds.append(AllChem.GetBestRMS(cp1,cp2,prbId=i,refId=j))
figsize(12,6)
hist(rmsds,bins=20);
The inter-conformer charge differences we saw before have vanished, are the structural features still different?
In [29]:
tm1 = Chem.MolFromMolBlock(Chem.MolToMolBlock(mh_mmff,confId=1))
tm1
Out[29]:
In [30]:
tm24 = Chem.MolFromMolBlock(Chem.MolToMolBlock(mh_mmff,confId=24))
tm24
Out[30]:
Nope, both piperazine rings are now chair. That explains the charges being more or less constant
In [31]:
tm5 = Chem.MolFromMolBlock(Chem.MolToMolBlock(mh_mmff,confId=5))
tm5
Out[31]:
In [32]:
tm2 = Chem.MolFromMolBlock(Chem.MolToMolBlock(mh_mmff,confId=2))
tm2
Out[32]:
Here the difference in the conformations is still present; the charge differences are just much smaller than previously
What about differences in charges between the conformation generation methods?
In [33]:
charge_compare = []
for i in range(len(charges)):
both = list(zip(charges[i],mmff_charges[i]))
charge_compare.extend(both)
In [34]:
scatter([y for x,y in charge_compare],[y-x for x,y in charge_compare])
xlabel('MMFF')
ylabel('MMFF-ETKDG');
title('eHT Charges');
There are differences, but they are mostly reasonably small.
Repeat that analysis for another molecule¶
In [35]:
# rivaroxaban
m2 = Chem.MolFromSmiles('O=C(NC[C@H]1CN(c2ccc(N3CCOCC3=O)cc2)C(=O)O1)c1ccc(Cl)s1')
m2
Out[35]:
In [36]:
mh2 = Chem.AddHs(m2)
AllChem.EmbedMultipleConfs(mh2,50)
mh2_mmff = Chem.Mol(mh2)
optRes = AllChem.MMFFOptimizeMoleculeConfs(mh2_mmff,maxIters=500)
for i,needsMore in enumerate(optRes):
while needsMore:
needsMore = AllChem.MMFFOptimizeMolecule(mh2_mmff,maxIters=500,confId=i)
In [37]:
eres,charges2 = runEHT(mh2)
eres,mmff_charges2 = runEHT(mh2_mmff)
In [38]:
figsize(18,12)
violinplot(charges2);
title("ETKDG");
xlabel('atom ID');
ylabel("partial charge");
In [39]:
for at in m2.GetAtoms():
at.SetProp("atomLabel",f'{at.GetSymbol()}{at.GetIdx()}')
m2
Out[39]:
In [40]:
figsize(18,12)
violinplot(mmff_charges2);
title("MMFF");
xlabel('atom ID');
ylabel("partial charge");
Are the MMFF conformers actually different?
In [41]:
rmsds = []
cp1 = Chem.RemoveHs(mh2_mmff)
cp2 = Chem.RemoveHs(mh2_mmff)
for i in range(mh2_mmff.GetNumConformers()):
for j in range(i+1,mh2_mmff.GetNumConformers()):
rmsds.append(AllChem.GetBestRMS(cp1,cp2,prbId=i,refId=j))
figsize(12,6)
hist(rmsds,bins=20);
In [42]:
charge_compare2 = []
for i in range(len(charges2)):
both = list(zip(charges2[i],mmff_charges2[i]))
charge_compare2.extend(both)
scatter([y for x,y in charge_compare2],[y-x for x,y in charge_compare2])
xlabel('MMFF')
ylabel('MMFF-ETKDG');
title('eHT Charges');
Larger differences between conformation-generation methods here.
Performance¶
How long does it take to do this analysis?
Start by looking at the time required to generate a conformer:
In [44]:
tempM = Chem.Mol(mh2)
%timeit _ = AllChem.EmbedMolecule(tempM)
Or to generate a conformer and MMFF minimize it:
In [47]:
%timeit _ = AllChem.EmbedMolecule(tempM);AllChem.MMFFOptimizeMolecule(tempM,maxIters=500)
Do the eHT calculation:
In [48]:
%timeit _ = rdEHTTools.RunMol(tempM)
So we can generate a conformer for rivaroxaban, MMFF minimize it, and then do the eHT calculation to get, for example, partial charges (remember these don't seem to change that much across MMFF conformers) in a bit more than 1/4 of a second. Not bad!
No comments:
Post a Comment