Thursday, September 22, 2016

Avoiding unnecessary work and constructing molecules more quickly

I have previously talked/posted about ways to construct RDKit molecules more quickly. This post revisits that topic.

By default the RDKit does a lot of work when constructing a molecule. The idea here is to set things up so that we only have to do that work once for a set of molecules that we're going to work with repeatedly.

There's also a diversion into thinking about what chemistry information is actually needed for things like substructure searching and tuning the molecule construction to only perceive that information.

In [1]:
from rdkit import Chem
from rdkit import RDConfig
import os,gzip
from rdkit import rdBase
rdBase.rdkitVersion
Out[1]:
'2016.09.1.dev1'

Setup

Start by reading in a set of ChEMBL molecules that we've used before and then reducing that to 50K examples.

In [2]:
ind = [x.strip().split() for x in open('../data/chembl16_2010-2012.smi')]
len(ind)
Out[2]:
234681
In [3]:
import random
random.seed(0xf00d)
random.shuffle(ind)
ind = ind[:50000]
In [4]:
ms = [Chem.MolFromSmiles(x[0]) for x in ind]
In [5]:
ms = [x for x in ms if x is not None]
len(ms)
Out[5]:
49994

Let's get RDKit-generated representations of the molecules:

  • A molecule's ToBinary() method returns a serialized form of the molecule that can be saved and then later efficiently converted back into a molecule.
  • RDKit SMILES
In [6]:
pkls = [x.ToBinary() for x in ms]
In [7]:
smis = [Chem.MolToSmiles(x,True) for x in ms]

Timing of standard parsing

How long does it take to generate the molecules from SMILES?

In [8]:
%timeit [Chem.MolFromSmiles(x) for x in smis]
1 loop, best of 3: 15.5 s per loop

And from the binary format?

In [9]:
%timeit [Chem.Mol(x) for x in pkls]
1 loop, best of 3: 1.96 s per loop

That dramatic difference isn't really that surprising given that the format was designed to be easy to parse and that no chemistry perception needs to be done.

So the binary format is really efficient, but it's not human readable and it's only useable with the RDKit. It would be cool to be able to take advantage of the portability and readability of SMILES but to have the processing not take quite so long.

As an aside: another benefit of using SMILES, for at least some applications, is that they are an order of magnitude smaller than the binary format:

In [12]:
print("binary:",sum(len(x) for x in pkls))
print("smiles:",sum(len(x) for x in smis))
binary: 22456727
smiles: 3029371

If we turn off the chemistry perception when parsing from the SMILES, we can see what a difference it makes:

In [10]:
%timeit [Chem.MolFromSmiles(x,sanitize=False) for x in smis]
1 loop, best of 3: 2.47 s per loop

Unfortunately, with no sanitization at all done, these molecules aren't that useful in many of the RDKit algorithms.

Partial sanitization

One of the standard operations done on reading molecules from SMILES is a call to the stereochemistry assignment code, which also removes redundant or extraneous stereochemistry specifications. This can be computationally expensive and most likely not needed when reading from an RDKit-generated canonical SMILES since that already has had the extranous and redundant specifications removed.

Let's see how long it takes if we skip that part (which is part of the call to MolFromSmiles()) and just do a normal sanitization post-SMILES parsing:

In [26]:
def partialSanit1(smi):
    m = Chem.MolFromSmiles(smi,sanitize=False)
    Chem.SanitizeMol(m,sanitizeOps=Chem.SANITIZE_ALL^Chem.SANITIZE_CLEANUP^Chem.SANITIZE_CLEANUPCHIRALITY)
    return m
In [27]:
%timeit [partialSanit1(x) for x in smis]
1 loop, best of 3: 9.91 s per loop

That's a solid reduction over the 15.5 seconds originally required, but is still a long way off from the 2.5 seconds for the completely unsanitized version.

Since the RDKit SMILES also contains information about aromaticity, we can also skip the kekulization and aromatization steps of the sanitization:

In [36]:
def partialSanit2(smi):
    m = Chem.MolFromSmiles(smi,sanitize=False)
    Chem.SanitizeMol(m,sanitizeOps=Chem.SANITIZE_ALL^Chem.SANITIZE_KEKULIZE^\
                     Chem.SANITIZE_SETAROMATICITY^Chem.SANITIZE_CLEANUP^Chem.SANITIZE_CLEANUPCHIRALITY)
    return m
In [37]:
%timeit [partialSanit2(x) for x in smis]
1 loop, best of 3: 8.04 s per loop

Even better.

We are still calling the ring-finding code, and sometimes we don't need information about rings (for example, all substructure queries from SMILES and many queries from SMILES), so what if we skip that?

In [30]:
def partialSanit3(smi):
    m = Chem.MolFromSmiles(smi,sanitize=False)
    Chem.SanitizeMol(m,sanitizeOps=Chem.SANITIZE_ALL^Chem.SANITIZE_KEKULIZE^\
                     Chem.SANITIZE_SETAROMATICITY^Chem.SANITIZE_CLEANUP^Chem.SANITIZE_CLEANUPCHIRALITY^\
                    Chem.SANITIZE_SYMMRINGS)
    return m
In [31]:
%timeit [partialSanit3(x) for x in smis]
1 loop, best of 3: 3.87 s per loop

If we're just concerned about doing substructure searches or generating RDKit fingerprints, this is still doing some extra work. Let's go to the bare minimum of sanitization: only updating the explicit and implicit valences of the atoms:

In [32]:
def partialSanit4(smi):
    m = Chem.MolFromSmiles(smi,sanitize=False)
    m.UpdatePropertyCache()
    return m
In [33]:
%timeit [partialSanit4(x) for x in smis]
1 loop, best of 3: 2.71 s per loop

That's pretty fast and those molecules are actually useful for many, many calculations.

We can add some ring information by calling FastFindRings(). This algorithm provides reliable information about whether or not atoms or bonds are in rings - and so can help with basic ring-membership queries on atoms and bonds in SMARTS or for generating Pattern fingerprints or standard Morgan fingerprints- but doesn't help with the number of smallest rings that an atom/bond is in.

In [34]:
def partialSanit5(smi):
    m = Chem.MolFromSmiles(smi,sanitize=False)
    m.UpdatePropertyCache()
    Chem.FastFindRings(m)
    return m
In [35]:
%timeit [partialSanit5(x) for x in smis]
1 loop, best of 3: 3.46 s per loop

I think those last two are pretty good. By thinking about the information we need and starting from a reliable SMILES we can get molecules that are useful for many RDKit operations much more quickly than the default.

Thursday, August 4, 2016

A question: RDKit performance on Windows

Updated 6 August 2016 to fix an incomplete sentence.

This one is a request for advice/expertise on performance tuning/compiler flag tweaking on Windows. The short story is that when the RDKit is built using Visual Studio on Windows it ends up being substantially slower than when it's built with g++ and run using the Windows Subsystem for Linux. This doesn't seem like it should be true, but I'm not an expert with either Visual Studio or Windows, so I'm asking for help.

Some more details:

When I've used the RDKit on Windows machines it has always seemed slower than it should. I've never really quantified that and so I've always just kind of shrugged and moved on. Now I've measured a real difference and I'd like to try and do something about it.

Some experiments that I did with Docker on Windows convinced me that the effect was real, but with the advent of Bash on Windows 10 (https://msdn.microsoft.com/en-us/commandline/wsl/install_guide) - an awesome thing, by the way - I have some real numbers.

The RDKit includes some code that I've used over the years to track the performance of some basic tasks. This script - https://github.com/rdkit/rdkit/blob/master/Regress/Scripts/timings.py -  looks at a broad subset of RDKit functionality.

The tests are:
  1. construct 1000 molecules from sdf
  2. construct 1000 molecules from smiles
  3. construct 823 fragment molecules from SMARTS (smiles really)
  4. 1000 x 100 HasSubstructMatch (100 from t3)
  5. 1000 x 100 GetSubstructMatches (100 from t3)
  6. construct 428 queries from RLewis_smarts.txt
  7. 1000 x 428 HasSubstructMatch
  8. 1000 x 428 GetSubstructMatches
  9. Generate canonical SMILES for 1000 molecules
  10. Generate mol blocks for 1000 molecules
  11. RECAP decomposition of the 1000 molecules
  12. Generate 2D coordinates for the 1000 molecules
  13. Generate 3D coordinates for the 1000 molecules
  14. Optimize those conformations using UFF
  15. Generate unique subgraphs of length 6 for the 1000 molecules
  16. Generate RDK fingerprints for the 1000 molecules
  17. Optimize the conformations above (test 13) using MMFF
Here are the results using the 2016.03 release conda builds (available from the rdkit channel in conda), the first line is the Windows build, the second is the Linux build, the tests were run directly after each other on the same laptop (a Dell XPS13 running Win10 Anniversary Edition):
0.8 || 0.4 || 0.1 || 1.2 || 1.3 || 0.0 || 4.2 || 4.2 || 0.2 || 0.3 || 7.4 || 0.3 || 7.5 || 18.5 || 2.2 || 1.4 || 41.7
0.6 || 0.3 || 0.1 || 0.9 || 1.0 || 0.0 || 3.1 || 3.2 || 0.1 || 0.2 || 6.3 || 0.3 || 6.2 || 15.2 || 2.1 || 1.0 || 29.8
that's a real difference.

The Windows build is done using build files generated by cmake. It's a release mode build with Visual Studio using the flags: "/MD /O2 /Ob2 /D NDEBUG" (those are the defaults that cmake creates).

It doesn't seem right to me that the code generated by Visual Studio and running under Windows should be so much slower than the code generated by g++ and running under the Windows Linux subsystem. I'm hoping, for the good of all of the users of the RDKit on Windows, to find a tweak for the Visual C++ command-line options that produces faster compild code.

For what it's worth, here's a different set of benchmarks, run on a larger set of molecules. The script is here (https://github.com/rdkit/rdkit/blob/master/Regress/Scripts/new_timings.py):

  1. construct 50K molecules from SMILES
  2. generate canonical SMILES for those
  3. construct 10K molecules from SDF
  4. construct 823 fragment molecules from SMARTS (smiles really)
  5. 60K x 100 HasSubstructMatch
  6. 60K x 100 GetSubstructMatches
  7. construct 428 queries from RLewis_smarts.txt
  8. 60K x 428 HasSubstructMatch
  9. 60K x 428 GetSubstructMatches
  10. Generate 60K mol blocks
  11. BRICS decomposition of the 60K molecules
  12. Generate 2D coords for the 60K molecules
  13. Generate RDKit fingerpirnts for the 60K molecules
  14. Generate Morgan (radius=2) fingerprints for the 60K molecules.

The timings show the same, at times dramatic, performance differences :
18.8 || 8.5 || 6.8 || 0.1 || 85.8 || 106.2 || 0.0 || 264.2 || 268.6 || 14.0 || 77.2 || 20.9 || 104.7 || 13.0
17.5 || 9.8 || 6.7 || 0.1 || 68.0 ||  74.2 || 0.0 || 204.6 || 208.2 ||  9.6 || 56.5 || 20.6 ||  89.0 ||  6.6

If you have thoughts about what's going on here, please comment here, reach out on twitter, google+, or linkedin, or post to the mailing list.
Thanks!

Wednesday, July 27, 2016

3D structure of a natural product

A recent post on In the Pipeline talked about a recent JACS communication on the total synthesis of the natural product maoecrystal V. The article is open access: http://pubs.acs.org/doi/abs/10.1021/jacs.6b06623
Here's the ToC figure, which shows a sketch of the final product on the left: ToC Figure
Since I've been looking at how well the RDKit does at conformation generation for highly constrained systems, I thought this would be a nice example molecule to try out. I was also quite curious about the shape of that beast.
The first problem was getting the structure in. I wasn't completely confident in my ability to reproduce that sketch and JACS, of course, doesn't have a way to provide structures in electronic form. Fortunately the authors had solved x-ray structures of their final product and a bunch of intermediates and the CIF files were in the supplementary material. So I converted the CIF to a mol file with OpenBabel (hopefully using the correct CIF file... they aren't particularly well labelled as far as I can tell) and got started from there.

Note: I haven't been able to make blogger work well with py3Dmol, so there are static screenshots below of the 3D structures. Interactive structures are available by viewing the original notebook: https://github.com/greglandrum/rdkit_blog/blob/master/notebooks/3D%20structure%20of%20a%20natural%20product.ipynb with the fantastic nbviewer: http://nbviewer.jupyter.org/github/greglandrum/rdkit_blog/blob/master/notebooks/3D%20structure%20of%20a%20natural%20product.ipynb

In [1]:
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
print(rdBase.rdkitVersion)
import os,time
print( time.asctime())
2016.09.1.dev1
Wed Jul 27 18:48:35 2016
We'll use py3Dmol to show structures in the notebook
In [5]:
import py3Dmol
def drawit(m,p=None,confId=-1):
        mb = Chem.MolToMolBlock(m,confId=confId)
        if p is None:
            p = py3Dmol.view(width=400,height=400)
        p.removeAllModels()
        p.addModel(mb,'sdf')
        p.setStyle({'stick':{}})
        p.setBackgroundColor('0xeeeeee')
        p.zoomTo()
        return p.show()
Start by reading in the mol file and displaying that:
In [8]:
mExpt = Chem.MolFromMolFile("/home/glandrum/Downloads/paper/ja6b06623_si_010.cif.mol",removeHs=False)
drawit(mExpt)
Out[8]:

Now generate a conformer for the molecule:
In [14]:
m = Chem.MolFromMolFile("/home/glandrum/Downloads/paper/ja6b06623_si_010.cif.mol",removeHs=False)
Chem.AssignAtomChiralTagsFromStructure(m)
AllChem.EmbedMolecule(m,useExpTorsionAnglePrefs=True,useBasicKnowledge=True)
drawit(m)
Out[14]:

At least by eye that looks very close, let's check the RMS:
In [20]:
print("RMS: ",AllChem.GetBestRMS(m,mExpt),
      "Heavy Atom RMS:",
      AllChem.GetBestRMS(Chem.RemoveHs(m),Chem.RemoveHs(mExpt)))
RMS:  0.6165388593927497 Heavy Atom RMS: 0.18186257676731582
Not bad! But then, this is a highly constrained structure.
Because we can, let's look at the two structures superimposed:
In [21]:
AllChem.AlignMol(m,mExpt)
mb = Chem.MolToMolBlock(m)
mbExpt = Chem.MolToMolBlock(mExpt)
p = py3Dmol.view(width=400,height=400)
p.removeAllModels()
p.addModel(mb,'sdf')
p.addModel(mbExpt,'sdf')

p.setStyle({'stick':{}})
p.setBackgroundColor('0xeeeeee')
p.zoomTo()
p.show()
Out[21]:

Finally, a bit of 2D work.
Just to have it, here's the SMILES for maoecrystal V
In [12]:
Chem.MolToSmiles(Chem.RemoveHs(m),True)
Out[12]:
'C[C@H]1C(=O)[C@@]23CC[C@@H]1C[C@@]21O[C@@H]2C(C)(C)C=CC(=O)[C@]23COC1=O'
Let's see how well the RDKit's 2D coordinate generation works here:
In [13]:
Chem.MolFromSmiles('C[C@H]1C(=O)[C@@]23CC[C@@H]1C[C@@]21O[C@@H]2C(C)(C)C=CC(=O)[C@]23COC1=O')
Out[13]:
Not great, but definitely not as bad as I had expected.
In [ ]:
 

Tuesday, July 26, 2016

Tuning substructure queries II

Post updated 6.11.2016 to reflect changes to the parameter names

Quite a while ago I did a blog post describing a method to exert more control over the way substructure queries match. The approach described in that post was integrated into the RDKit for the 2015.09 release; since then it's been expanded a bit.

This post shows how to use the approach from Python and directly within the RDKit PostgreSQL cartridge. Note that, as usual, I am using a development version of the RDKit, so a fair amount of the functionality I demo here is new/changed relative to the 2016.03 release.

In [1]:
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
%load_ext sql
print(rdBase.rdkitVersion)
import time
print(time.asctime())
/home/glandrum/anaconda3/lib/python3.5/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.
  "You should import from traitlets.config instead.", ShimWarning)
/home/glandrum/anaconda3/lib/python3.5/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.
  warn("IPython.utils.traitlets has moved to a top-level traitlets package.")
2016.09.1.dev1
Wed Jul 27 06:05:49 2016

Adjusting queries in the cartridge

Our starting point is a query against ChEMBL for molecules that contain a pyridine ring.

In [2]:
data = %sql postgresql://localhost/chembl_21 \
    select molregno,molfile from rdk.mols join compound_structures \
    using (molregno) where m@>'c1ccccn1' limit 10;
mols = [Chem.MolFromMolBlock(y) for x,y in data]
Draw.MolsToGridImage(mols,legends=[str(x) for x,y in data],molsPerRow=4,useSVG=False)
10 rows affected.
Out[2]:

If we add two dummy atoms to that in order to look for 2-6 di-substituted pyridines we initially get no results:

In [3]:
data = %sql \
    select molregno,molfile from rdk.mols join compound_structures \
    using (molregno) where m@>'*c1cccc(*)n1' limit 10 ;
0 rows affected.

But we can change that by calling the new mol_adjust_query_properties() function:

In [4]:
data = %sql \
    select molregno,molfile from rdk.mols join compound_structures \
    using (molregno) where m@>mol_adjust_query_properties('*c1cccc(*)n1') limit 10 ;
mols = [Chem.MolFromMolBlock(y) for x,y in data]
Draw.MolsToGridImage(mols,legends=[str(x) for x,y in data],molsPerRow=4,useSVG=False)
10 rows affected.
Out[4]:

This function makes a number of changes by default:

  • Converts dummy atoms into "any" queries
  • Adds a degree query to every ring atom so that its substitution must match what was provided
  • Aromaticity perception is done (if it hasn't been done already)

The addition of degree queries has the consequence that we only find di-substituted pyridines. We can lift that restriction by providing some additional configuration information to mol_adjust_query_properties() as JSON:

In [5]:
data = %sql \
    select molregno,molfile from rdk.mols join compound_structures \
    using (molregno) where m@>mol_adjust_query_properties('*c1cccc(*)n1',\
                                                         '{"adjustDegree"\:false}') limit 10 ;
mols = [Chem.MolFromMolBlock(y) for x,y in data]
Draw.MolsToGridImage(mols,legends=[str(x) for x,y in data],molsPerRow=4,useSVG=False)
10 rows affected.
Out[5]:

Notice that now we have additional substitution options on the rings. However now there are also some examples where the pyridine is part of a fused ring. We can get rid of these by telling mol_adjust_query_properties() to also add a ring-count query to every ring atom :

In [6]:
data = %sql \
    select molregno,molfile from rdk.mols join compound_structures \
    using (molregno) where m@>mol_adjust_query_properties('*c1cccc(*)n1',\
                                                         '{"adjustDegree"\:false,\
                                                          "adjustRingCount"\:true}') limit 10 ;
mols = [Chem.MolFromMolBlock(y) for x,y in data]
Draw.MolsToGridImage(mols,legends=[str(x) for x,y in data],molsPerRow=4,useSVG=False)
10 rows affected.
Out[6]:

To explore the next set of options, we extend the query a little bit to search for 2-6 substituted pyridines where one of the substituents starts with an NC=O:

In [7]:
data = %sql \
    select molregno,molfile from rdk.mols join compound_structures \
    using (molregno) where m@>mol_adjust_query_properties('*c1cccc(NC(=O)*)n1') limit 10 ;
mols = [Chem.MolFromMolBlock(y) for x,y in data]
Draw.MolsToGridImage(mols,legends=[str(x) for x,y in data],molsPerRow=4,useSVG=False)
10 rows affected.
Out[7]:

Let's allow the degrees of ring atoms or dummies to change, but not the degree of chain atoms, this should remove matches like those to molecule 952459 above:

In [8]:
data = %sql \
    select molregno,molfile from rdk.mols join compound_structures \
    using (molregno) where m@>mol_adjust_query_properties('*c1cccc(NC(=O)*)n1',\
                                                         '{"adjustDegree"\:true,\
                                                          "adjustDegreeFlags"\:"IGNORERINGS|IGNOREDUMMIES"\
                                                          }') limit 10 ;
mols = [Chem.MolFromMolBlock(y) for x,y in data]
Draw.MolsToGridImage(mols,legends=[str(x) for x,y in data],molsPerRow=4,useSVG=False)
10 rows affected.
Out[8]:

And now get rid of the fused rings by turning on the ring-count changes too:

In [9]:
data = %sql \
    select molregno,molfile from rdk.mols join compound_structures \
    using (molregno) where m@>mol_adjust_query_properties('*c1cccc(NC(=O)*)n1',\
                                                         '{"adjustDegree"\:true,\
                                                          "adjustDegreeFlags"\:"IGNORERINGS|IGNOREDUMMIES",\
                                                          "adjustRingCount"\:true}') limit 10 ;
mols = [Chem.MolFromMolBlock(y) for x,y in data]
Draw.MolsToGridImage(mols,legends=[str(x) for x,y in data],molsPerRow=4,useSVG=False)
10 rows affected.
Out[9]:

The same adjustments can be used on query molecules constructed from mol blocks:

In [10]:
mb="""
  Mrv1561 07261609522D          

  8  8  0  0  0  0            999 V2000
   -1.9866    0.7581    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -2.7011    0.3455    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.7011   -0.4795    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.9866   -0.8920    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.2721   -0.4795    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.2721    0.3455    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.4155    0.7580    0.0000 A   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5577    0.7580    0.0000 A   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  6  2  0  0  0  0
  2  3  2  0  0  0  0
  3  4  1  0  0  0  0
  4  5  2  0  0  0  0
  5  6  1  0  0  0  0
  2  7  1  0  0  0  0
  6  8  1  0  0  0  0
M  END
"""
Chem.MolFromMolBlock(mb)
Out[10]:

The semantics of mol block queries are a bit different since we already have query atoms:

In [11]:
data = %sql \
    select molregno,molfile from rdk.mols join compound_structures \
    using (molregno) where m@>mol_from_ctab(:mb) limit 10 ;
mols = [Chem.MolFromMolBlock(y) for x,y in data]
Draw.MolsToGridImage(mols,legends=[str(x) for x,y in data],molsPerRow=4,useSVG=False)
10 rows affected.
Out[11]:

Calling mol_adjust_query_properties() here effectively just adds the degree queries (since the dummies are already queries), providing the same results as in Block 4 above:

In [12]:
data = %sql \
    select molregno,molfile from rdk.mols join compound_structures \
    using (molregno) where m@>mol_adjust_query_properties(mol_from_ctab(:mb)) limit 10 ;
mols = [Chem.MolFromMolBlock(y) for x,y in data]
Draw.MolsToGridImage(mols,legends=[str(x) for x,y in data],molsPerRow=4,useSVG=False)
10 rows affected.
Out[12]:

Using that from Python

The same options are available from Python using the function Chem.AdjustQueryProperties():

In [13]:
m1 = Chem.MolFromSmiles('Cc1cc(C)nc(NC(=O)c2ccc3c(c2)OCO3)c1')
m2 = Chem.MolFromSmiles('c1cc(C)nc(NC(=O)c2ccc3c(c2)OCO3)c1')
m3 = Chem.MolFromSmiles('c1cc(C)nc(N(C)C(=O)c2ccc3c(c2)OCO3)c1')
Draw.MolsToGridImage((m1,m2,m3),legends=['m1','m2','m3'])
Out[13]:
N NH O O O m1