Thursday, September 22, 2016

Avoiding unnecessary work and constructing molecules more quickly

I have previously talked/posted about ways to construct RDKit molecules more quickly. This post revisits that topic.

By default the RDKit does a lot of work when constructing a molecule. The idea here is to set things up so that we only have to do that work once for a set of molecules that we're going to work with repeatedly.

There's also a diversion into thinking about what chemistry information is actually needed for things like substructure searching and tuning the molecule construction to only perceive that information.

In [1]:
from rdkit import Chem
from rdkit import RDConfig
import os,gzip
from rdkit import rdBase
rdBase.rdkitVersion
Out[1]:
'2016.09.1.dev1'

Setup

Start by reading in a set of ChEMBL molecules that we've used before and then reducing that to 50K examples.

In [2]:
ind = [x.strip().split() for x in open('../data/chembl16_2010-2012.smi')]
len(ind)
Out[2]:
234681
In [3]:
import random
random.seed(0xf00d)
random.shuffle(ind)
ind = ind[:50000]
In [4]:
ms = [Chem.MolFromSmiles(x[0]) for x in ind]
In [5]:
ms = [x for x in ms if x is not None]
len(ms)
Out[5]:
49994

Let's get RDKit-generated representations of the molecules:

  • A molecule's ToBinary() method returns a serialized form of the molecule that can be saved and then later efficiently converted back into a molecule.
  • RDKit SMILES
In [6]:
pkls = [x.ToBinary() for x in ms]
In [7]:
smis = [Chem.MolToSmiles(x,True) for x in ms]

Timing of standard parsing

How long does it take to generate the molecules from SMILES?

In [8]:
%timeit [Chem.MolFromSmiles(x) for x in smis]
1 loop, best of 3: 15.5 s per loop

And from the binary format?

In [9]:
%timeit [Chem.Mol(x) for x in pkls]
1 loop, best of 3: 1.96 s per loop

That dramatic difference isn't really that surprising given that the format was designed to be easy to parse and that no chemistry perception needs to be done.

So the binary format is really efficient, but it's not human readable and it's only useable with the RDKit. It would be cool to be able to take advantage of the portability and readability of SMILES but to have the processing not take quite so long.

As an aside: another benefit of using SMILES, for at least some applications, is that they are an order of magnitude smaller than the binary format:

In [12]:
print("binary:",sum(len(x) for x in pkls))
print("smiles:",sum(len(x) for x in smis))
binary: 22456727
smiles: 3029371

If we turn off the chemistry perception when parsing from the SMILES, we can see what a difference it makes:

In [10]:
%timeit [Chem.MolFromSmiles(x,sanitize=False) for x in smis]
1 loop, best of 3: 2.47 s per loop

Unfortunately, with no sanitization at all done, these molecules aren't that useful in many of the RDKit algorithms.

Partial sanitization

One of the standard operations done on reading molecules from SMILES is a call to the stereochemistry assignment code, which also removes redundant or extraneous stereochemistry specifications. This can be computationally expensive and most likely not needed when reading from an RDKit-generated canonical SMILES since that already has had the extranous and redundant specifications removed.

Let's see how long it takes if we skip that part (which is part of the call to MolFromSmiles()) and just do a normal sanitization post-SMILES parsing:

In [26]:
def partialSanit1(smi):
    m = Chem.MolFromSmiles(smi,sanitize=False)
    Chem.SanitizeMol(m,sanitizeOps=Chem.SANITIZE_ALL^Chem.SANITIZE_CLEANUP^Chem.SANITIZE_CLEANUPCHIRALITY)
    return m
In [27]:
%timeit [partialSanit1(x) for x in smis]
1 loop, best of 3: 9.91 s per loop

That's a solid reduction over the 15.5 seconds originally required, but is still a long way off from the 2.5 seconds for the completely unsanitized version.

Since the RDKit SMILES also contains information about aromaticity, we can also skip the kekulization and aromatization steps of the sanitization:

In [36]:
def partialSanit2(smi):
    m = Chem.MolFromSmiles(smi,sanitize=False)
    Chem.SanitizeMol(m,sanitizeOps=Chem.SANITIZE_ALL^Chem.SANITIZE_KEKULIZE^\
                     Chem.SANITIZE_SETAROMATICITY^Chem.SANITIZE_CLEANUP^Chem.SANITIZE_CLEANUPCHIRALITY)
    return m
In [37]:
%timeit [partialSanit2(x) for x in smis]
1 loop, best of 3: 8.04 s per loop

Even better.

We are still calling the ring-finding code, and sometimes we don't need information about rings (for example, all substructure queries from SMILES and many queries from SMILES), so what if we skip that?

In [30]:
def partialSanit3(smi):
    m = Chem.MolFromSmiles(smi,sanitize=False)
    Chem.SanitizeMol(m,sanitizeOps=Chem.SANITIZE_ALL^Chem.SANITIZE_KEKULIZE^\
                     Chem.SANITIZE_SETAROMATICITY^Chem.SANITIZE_CLEANUP^Chem.SANITIZE_CLEANUPCHIRALITY^\
                    Chem.SANITIZE_SYMMRINGS)
    return m
In [31]:
%timeit [partialSanit3(x) for x in smis]
1 loop, best of 3: 3.87 s per loop

If we're just concerned about doing substructure searches or generating RDKit fingerprints, this is still doing some extra work. Let's go to the bare minimum of sanitization: only updating the explicit and implicit valences of the atoms:

In [32]:
def partialSanit4(smi):
    m = Chem.MolFromSmiles(smi,sanitize=False)
    m.UpdatePropertyCache()
    return m
In [33]:
%timeit [partialSanit4(x) for x in smis]
1 loop, best of 3: 2.71 s per loop

That's pretty fast and those molecules are actually useful for many, many calculations.

We can add some ring information by calling FastFindRings(). This algorithm provides reliable information about whether or not atoms or bonds are in rings - and so can help with basic ring-membership queries on atoms and bonds in SMARTS or for generating Pattern fingerprints or standard Morgan fingerprints- but doesn't help with the number of smallest rings that an atom/bond is in.

In [34]:
def partialSanit5(smi):
    m = Chem.MolFromSmiles(smi,sanitize=False)
    m.UpdatePropertyCache()
    Chem.FastFindRings(m)
    return m
In [35]:
%timeit [partialSanit5(x) for x in smis]
1 loop, best of 3: 3.46 s per loop

I think those last two are pretty good. By thinking about the information we need and starting from a reliable SMILES we can get molecules that are useful for many RDKit operations much more quickly than the default.