Sunday, July 7, 2019

Using Intake for Chemistry Data

I noticed a mention of Intake in an Anaconda blog post last year and filed it away as: "oooo, that's interesting, I should look into that at some point." I really like the idea of having a straightforward Python-based system for creating and working with a data catalog; I've got way too many random datasets floating around that I use for various things, and I think Intake could help bring some order to that.
Today I decided to celebrate Friday with a bit of tech exploration and picked Intake. After quickly going through the documentation and making sure the library is still alive (i.e. active on GitHub), I dove in.
Of course what I'm most interested in doing is figuring out how to add chemistry file formats to Intake. So after some initial experimentation to make sure I can use the library at all and that it's actually as straightforward as it seems, I did the obvious first thing and added an SDF loader.
This post is mostly about that. Well, that and a short demo of using Intake.
I'm going to create a data catalog that contains three different data sources, all connected to the same dataset: a collection of compounds with hERG data that came from ChEMBL. I'm not 100% sure where this dataset came from (see, I need something to help with this!), but I think it's from this blog post. I wanted multiple linked source files, so in addition to the original file, I created a CSV file with calculated 2D descriptors and an SDF with 3D structures of the molecules and some 3D descriptors.
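I won't reproduce the data prep here, but here's a sketch of how a descriptor table like that can be built with the RDKit (the descriptor names match the CSV columns that show up below; the helper function is just illustrative):

from rdkit.Chem import Descriptors

# sketch: calculate the 2D descriptors used in the CSV for one molecule
def calc_props(mol):
    return {'MolLogP': Descriptors.MolLogP(mol),
            'TPSA': Descriptors.TPSA(mol),
            'NumRotatableBonds': Descriptors.NumRotatableBonds(mol),
            'MolWt': Descriptors.MolWt(mol),
            'NumHAcceptors': Descriptors.NumHAcceptors(mol),
            'NumHDonors': Descriptors.NumHDonors(mol)}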
Let's get started.
In order to use the code in this notebook, you will need to install intake and hvplot. I'm using conda (of course), so I got intake from conda-forge and used the hvplot version from the default channel:
conda install -c conda-forge intake
conda install hvplot
Note This is another one of those posts where blogger doesn't handle the output from nbconvert at all well. You can see the notebook with live bokeh plots on nbviewer or download it directly from github.
In [1]:
from rdkit import Chem
import intake
import pandas as pd
import hvplot.pandas
intake.output_notebook()
Look at the data sources that intake supports by default:
In [2]:
list(intake.registry)
Out[2]:
['yaml_file_cat',
 'yaml_files_cat',
 'csv',
 'textfiles',
 'catalog',
 'intake_remote',
 'numpy',
 'ndzarr']

Using data sources

In [3]:
activity_source = intake.open_csv('../data/herg_data.txt',csv_kwargs={'sep':' '})
props_source = intake.open_csv('../data/herg_mols_props.csv')
In [4]:
activity_source.discover()
Out[4]:
{'datashape': None,
 'dtype': {'canonical_smiles': 'object',
  'molregno': 'int64',
  'activity_id': 'int64',
  'standard_value': 'float64',
  'standard_units': 'object'},
 'shape': (None, 5),
 'npartitions': 1,
 'metadata': {}}
In [5]:
props_source.discover()
Out[5]:
{'datashape': None,
 'dtype': {'molregno': 'int64',
  'MolLogP': 'float64',
  'TPSA': 'float64',
  'NumRotatableBonds': 'int64',
  'MolWt': 'float64',
  'NumHAcceptors': 'int64',
  'NumHDonors': 'int64'},
 'shape': (None, 7),
 'npartitions': 1,
 'metadata': {}}
The most efficient way to work with the data in a data source is to read() it in. For these data sources, this returns a pandas dataframe:
In [6]:
props_df = props_source.read()
props_df.head()
Out[6]:

molregno MolLogP TPSA NumRotatableBonds MolWt NumHAcceptors NumHDonors
0 29272 1.6996 92.50 5 419.469 4 2
1 29758 1.4269 46.33 2 228.361 3 1
2 29449 2.1136 75.43 4 383.464 4 2
3 29244 1.7516 92.50 5 401.479 4 2
4 29265 2.3200 101.73 6 467.485 5 2
And since we've imported hvplot, we now have some nice plotting available:
In [2]:
props_df.hvplot(kind='scatter',x='MolWt',y='MolLogP')

Teaching intake to read SDF files

In [39]:
%%writefile ./intake_sdf.py
from rdkit import Chem
import pandas as pd
import intake
import numpy as np
class SDFSource(intake.source.base.DataSource):
    " Inspired by example here: https://github.com/intake/intake-examples/blob/master/tutorial/data_engineer.ipynb "
    container = 'dataframe'
    name = 'sdf'
    version = '0.0.1'
    partition_access = True
    suppl_factory = Chem.SDMolSupplier
    
    def __init__(self, filename, metadata=None, nPartitions=1, **kwargs):
        super(SDFSource, self).__init__(
            metadata=metadata
        )
        self._fname = filename
        self._supplkwargs = kwargs
        # this is, hopefully, for future use:
        self._nPartitions = nPartitions

    def _get_schema(self):
        " reads property names from the first molecule and uses those "
        suppl = self.suppl_factory(self._fname,**self._supplkwargs)
        m0 = suppl[0]
        if m0 is None:
            # fail clearly if the first molecule can't be parsed
            raise ValueError(f"could not read the first molecule from {self._fname}")
        # use a name that doesn't shadow the pandas import
        props = m0.GetPropsAsDict()
        dt = {'mol':'O'}
        for pn,pv in props.items():
            dt[pn] = np.dtype(type(pv)).name
        return intake.source.base.Schema(
            datashape=None,
            dtype=dt,
            shape=(None, len(dt)),
            npartitions=self._nPartitions,
            extra_metadata=self._supplkwargs
        )

    def _get_partition(self, i):
        suppl = self.suppl_factory(self._fname,**self._supplkwargs)
        res = []
        partSize = len(suppl)//self._nPartitions
        start = i*partSize
        if i==self._nPartitions-1:
            end = len(suppl)
        else:
            end = start + partSize
            
        for idx in range(start,end):
            m = suppl[idx]
            # skip molecules that fail to parse; appending outside this block
            # would duplicate the previous row (or raise a NameError)
            if m is not None:
                row = {'mol':m}
                row.update(m.GetPropsAsDict())
                res.append(row)
        return pd.DataFrame(res)

    def read(self):
        self._load_metadata()
        return pd.concat([self.read_partition(i) for i in range(self.npartitions)])
        
    def _close(self):
        # close any files, sockets, etc
        self.suppl=None
Overwriting ./intake_sdf.py
Intake looks for data source drivers when it is imported, so we need to reload the intake module after creating the driver file (I use imp.reload here; on current Python versions importlib.reload does the same thing):
In [9]:
import imp
imp.reload(intake)
Out[9]:
<module 'intake' from '/other_linux/home/glandrum/anaconda3/envs/rdkit_blog/lib/python3.7/site-packages/intake/__init__.py'>
Notice that SDF is now in the registry:
In [10]:
list(intake.registry)
Out[10]:
['yaml_file_cat',
 'yaml_files_cat',
 'csv',
 'textfiles',
 'catalog',
 'intake_remote',
 'numpy',
 'ndzarr',
 'sdf']
Now we can load and work with the SDF:
In [11]:
sdf_source = intake.open_sdf('../data/herg_3dmols.sdf',nPartitions=4,removeHs=False)
In [12]:
sdf_source.discover()
Out[12]:
{'datashape': None,
 'dtype': {'mol': 'O',
  'SASA': 'float64',
  'molregno': 'int64',
  'NPR1': 'float64',
  'NPR2': 'float64',
  'asphericity': 'float64',
  'eccentricity': 'float64',
  'atom.dprop.EHTCharge': 'str'},
 'shape': (None, 8),
 'npartitions': 4,
 'metadata': {'removeHs': False}}
In [13]:
sdf_source.read().head()
Out[13]:

NPR1 NPR2 SASA asphericity atom.dprop.EHTCharge eccentricity mol molregno
0 0.344558 0.790223 474.773633 0.295012 -0.65766151302259046 0.28616869584385629 0.058... 0.938765 <rdkit.Chem.rdchem.Mol object at 0x7f2baa1c3170> 29272
1 0.174056 0.953789 312.566799 0.570840 -0.65058437663054747 0.29115551530406947 0.036... 0.984736 <rdkit.Chem.rdchem.Mol object at 0x7f2baa1c33f0> 29758
2 0.063609 0.975599 446.033757 0.822029 -0.65604834706359316 0.28231130455311593 0.045... 0.997975 <rdkit.Chem.rdchem.Mol object at 0x7f2baa1c35d0> 29449
3 0.284320 0.803002 473.661314 0.376431 -0.70531069101833488 0.30094745736190931 0.077... 0.958730 <rdkit.Chem.rdchem.Mol object at 0x7f2baa1c3710> 29244
4 0.253028 0.828362 520.615556 0.424006 -0.70345551365543968 0.29596092056375445 0.061... 0.967459 <rdkit.Chem.rdchem.Mol object at 0x7f2baa1c34e0> 29265

Set up the data catalog

Now we're going to combine those three data sources into a single catalog that we can use later. We only have to do this once and can then use the catalog in other scripts/notebooks.
Start by looking at the YAML describing each of the data sources; this is the information we will put into the data catalog:
In [14]:
print(activity_source.yaml())
sources:
  csv:
    args:
      csv_kwargs:
        sep: ' '
      urlpath: ../data/herg_data.txt
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}

In [15]:
print(props_source.yaml())
sources:
  csv:
    args:
      urlpath: ../data/herg_mols_props.csv
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}

In [16]:
print(sdf_source.yaml())
sources:
  sdf:
    args:
      filename: ../data/herg_3dmols.sdf
      nPartitions: 4
      removeHs: false
    description: ''
    driver: intake_sdf.SDFSource
    metadata:
      removeHs: false

Now create the YAML file that describes the catalog:
In [26]:
%%writefile ../data/herg.yaml
sources:
  activities:
    args:
      csv_kwargs:
        sep: ' '
      urlpath: ../data/herg_data.txt
    description: 'measured values from ChEMBL'
    driver: intake.source.csv.CSVSource
    metadata: {}
  props:
    args:
      urlpath: ../data/herg_mols_props.csv
    description: 'calculated values'
    driver: intake.source.csv.CSVSource
    metadata:
        plots:
          logp_mw:
            kind: scatter
            x: MolWt
            y: MolLogP
            width: 700
            height: 400
  sdf3d:
    args:
      filename: ../data/herg_3dmols.sdf
      nPartitions: 4
      removeHs: false
    description: '3D structures and descriptors of the compounds'
    driver: intake_sdf.SDFSource
    metadata:
      removeHs: true
      plots:
        shape:
           kind: scatter
           x: asphericity
           y: NPR1
           width: 700
           height: 400
Overwriting ../data/herg.yaml

Work with an existing data catalog

For future work with this data, the pieces below, together with the intake_sdf.py file created above, are all we need.
In [27]:
cat = intake.open_catalog('../data/herg.yaml')
We can introspect the catalog to see what's there:
In [28]:
cat.items()
Out[28]:
dict_items([('activities', name: activities
container: dataframe
plugin: ['csv']
description: measured values from ChEMBL
direct_access: forbid
user_parameters: []
metadata: 
args: 
  csv_kwargs: 
    sep:  
  urlpath: ../data/herg_data.txt), ('props', name: props
container: dataframe
plugin: ['csv']
description: calculated values
direct_access: forbid
user_parameters: []
metadata: 
  plots: 
    logp_mw: 
      kind: scatter
      x: MolWt
      y: MolLogP
      width: 700
      height: 400
args: 
  urlpath: ../data/herg_mols_props.csv), ('sdf3d', name: sdf3d
container: dataframe
plugin: ['sdf']
description: 3D structures and descriptors of the compounds
direct_access: forbid
user_parameters: []
metadata: 
  removeHs: True
  plots: 
    shape: 
      kind: scatter
      x: asphericity
      y: NPR1
      width: 700
      height: 400
args: 
  filename: ../data/herg_3dmols.sdf
  nPartitions: 4
  removeHs: False)])
Here's how you work with one of the sources:
In [29]:
cat.props.describe()
Out[29]:
{'name': 'props',
 'container': 'dataframe',
 'plugin': ['csv'],
 'description': 'calculated values',
 'direct_access': 'forbid',
 'user_parameters': [],
 'metadata': {'plots': {'logp_mw': {'kind': 'scatter',
    'x': 'MolWt',
    'y': 'MolLogP',
    'width': 700,
    'height': 400}}},
 'args': {'urlpath': '../data/herg_mols_props.csv'}}
The plots we defined in the metadata are available:
In [3]:
cat.props.plot.logp_mw()

Here's a plot we defined as part of the catalog, but haven't actually looked at yet:
In [4]:
cat.sdf3d.plot.shape()

And, of course, we can create new plots:
In [5]:
# it's more efficient to create a data frame from the datasource once:
sdf_df = cat.sdf3d.read()
sdf_df.hvplot(kind='scatter',x='NPR1',y='NPR2')

An aside: the plots here are being done with bokeh, which has all kinds of cool functionality that is worth exploring.
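For example, hvplot passes plot options through to the bokeh backend; here's a minimal sketch adding a hover tooltip showing each compound's molregno (hover_cols is an hvplot option; the column names are from our dataset):

# show the ChEMBL molregno in the bokeh hover tooltip
sdf_df.hvplot(kind='scatter', x='NPR1', y='NPR2',
              hover_cols=['molregno'], width=700, height=400)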
It's also interesting to combine the datasets:
In [34]:
activity_df = cat.activities.read()
props_df = cat.props.read()
In [35]:
merged_df = pd.merge(pd.merge(activity_df,props_df,on='molregno'),sdf_df,on='molregno')
merged_df.head()
Out[35]:

canonical_smiles molregno activity_id standard_value standard_units MolLogP TPSA NumRotatableBonds MolWt NumHAcceptors NumHDonors NPR1 NPR2 SASA asphericity atom.dprop.EHTCharge eccentricity mol
0 N[C@@H]([C@@H]1CC[C@H](CC1)NS(=O)(=O)c2ccc(F)c... 29272 671631 49000.0 nM 1.6996 92.5 5 419.469 4 2 0.344558 0.790223 474.773633 0.295012 -0.65766151302259046 0.28616869584385629 0.058... 0.938765 <rdkit.Chem.rdchem.Mol object at 0x7f2ba8e4ee90>
1 N[C@@H]([C@@H]1CC[C@H](CC1)NS(=O)(=O)c2ccc(F)c... 29272 671631 49000.0 nM 1.6996 92.5 5 419.469 4 2 0.344558 0.790223 474.773633 0.295012 -0.65766151302259046 0.28616869584385629 0.058... 0.938765 <rdkit.Chem.rdchem.Mol object at 0x7f2ba8555490>
2 N[C@@H]([C@@H]1CC[C@H](CC1)NS(=O)(=O)c2ccc(F)c... 29272 671631 49000.0 nM 1.6996 92.5 5 419.469 4 2 0.344558 0.790223 474.773633 0.295012 -0.65766151302259046 0.28616869584385629 0.058... 0.938765 <rdkit.Chem.rdchem.Mol object at 0x7f2ba8e4ee90>
3 N[C@@H]([C@@H]1CC[C@H](CC1)NS(=O)(=O)c2ccc(F)c... 29272 671631 49000.0 nM 1.6996 92.5 5 419.469 4 2 0.344558 0.790223 474.773633 0.295012 -0.65766151302259046 0.28616869584385629 0.058... 0.938765 <rdkit.Chem.rdchem.Mol object at 0x7f2ba8555490>
4 N[C@@H]([C@@H]1CC[C@H](CC1)NS(=O)(=O)c2ccc(F)c... 29272 2606579 49000.0 nM 1.6996 92.5 5 419.469 4 2 0.344558 0.790223 474.773633 0.295012 -0.65766151302259046 0.28616869584385629 0.058... 0.938765 <rdkit.Chem.rdchem.Mol object at 0x7f2ba8e4ee90>
Look at the impact of MolLogP on activity:
In [6]:
import math
merged_df['pActivity'] = merged_df.apply(lambda row: -1*math.log10(row.standard_value*1e-9),axis=1)
merged_df.hvplot(kind='scatter',x='MolLogP',y='pActivity')
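An aside: the apply()/lambda form above is easy to read, but the same column can be computed in vectorized form, which is much faster on larger frames; a minimal equivalent sketch:

import numpy as np
# vectorized equivalent of the apply()/lambda above
merged_df['pActivity'] = -np.log10(merged_df.standard_value * 1e-9)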

What about the SASA value and TPSA?
In [7]:
merged_df.hvplot(kind='scatter',x='SASA',y='TPSA')

I think this is super promising and there's a lot more we can do. I will probably be doing another post or two as I learn more and start making more use of intake to organize my data sets.

Thursday, July 4, 2019

A couple of substructure search topics

I had a couple of related conversations at Sheffield about substructure searching which seemed like they could be combined into a short and hopefully interesting blog post.
The first conversation actually started before the meeting: John Mayfield (at NextMove Software) submitted a pull request with some improvements to the RDKit's implementation of the VF2 algorithm, which is what is used to actually do substructure matching. John did a presentation about speeding up substructure matches at the meeting and implemented some of the improvements he described in his talk for the RDKit. If you're interested in details, here's the pull request that has been merged and will be part of the next release. As I'll show below, this leads to a significant improvement in the performance of the RDKit's substructure matcher. Thanks John!
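For context, this is the operation being sped up - a minimal substructure match with purely illustrative molecules:

from rdkit import Chem

mol = Chem.MolFromSmiles('CCOC(=O)c1ccccc1')
query = Chem.MolFromSmarts('C(=O)O')
# the VF2-based matcher is what runs under the hood here
print(mol.HasSubstructMatch(query))    # True
print(mol.GetSubstructMatches(query))  # atom indices of each match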
Of course, the best way to speed up substructure matching is to not have to do it at all. This is what the RDKit's pattern fingerprint is for, and the second conversation was about how exactly the pattern fingerprint works. Fortunately I'd recently written this up and added it to the RDKit docs. The new documentation is now part of the "RDKit Book" and will be in the online documentation for the next release. Here's the version of that documentation as of the writing of this blog post:
These fingerprints were designed to be used in substructure screening. They are, as far as I know, unique to the RDKit. The algorithm identifies features in the molecule by doing substructure searches using a small number (12 in the 2019.03 release of the RDKit) of very generic SMARTS patterns - like [*]~[*]~[*](~[*])~[*] or [R]~1[R]~[R]~[R]~1 - and then hashing each occurrence of a pattern based on the atom and bond types involved. The fact that a particular pattern matched the molecule at all is also stored by hashing the pattern ID and size. If a particular feature contains either a query atom or a query bond (e.g. something generated from SMARTS), the only information that is hashed is the fact that the generic pattern matched.
For the 2019.03 release, the atom types use just the atomic number of the atom, and the bond types use the bond type (or AROMATIC for aromatic bonds).
NOTE: Because it plays an important role in substructure screenout, the internals of this fingerprint (the generic patterns used and/or the details of the hashing algorithm) may change from one release to the next.
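To make the screenout idea concrete, here's a minimal sketch using the RDKit API (the molecules are just illustrative, and the bit-subset test is one way to express the check):

from rdkit import Chem

mol = Chem.MolFromSmiles('CCOC(=O)c1ccccc1')
query = Chem.MolFromSmarts('C(=O)O')
mfp = Chem.PatternFingerprint(mol)
qfp = Chem.PatternFingerprint(query)
# every bit set in the query fingerprint must also be set in the molecule
# fingerprint; if any is missing there can be no substructure match and
# the (much more expensive) graph search can be skipped
if (qfp & mfp) == qfp:
    print(mol.HasSubstructMatch(query))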
Doing this work made me realize that it's been quite a while since I did any benchmarking to see how effective the pattern fingerprint is. The last time I looked at this was a series of blog posts in 2013: here, here, and here. There have been a number of changes to the toolkit since then, so it seemed worthwhile to revisit that benchmarking exercise.
Since benchmarking screenout performance and substructure search speeds seems like a useful thing to be able to do, I created a new script that will be part of the RDKit source distribution in future releases; that's here.
Rather than duplicating a bunch of code from the new benchmarking script here, I'll just show how to run it and then talk about the results.
Note that if you want to follow along with this, you will need to download the datasets that are being used (they are big enough that I didn't want to make them part of the RDKit source distro; URLs for where to find the data are in the script's source). The datasets themselves are described in the first blog post mentioned above (though note that the 25K pairs of molecules are pulled from ChEMBL21, not ChEMBL16 as in the original post).
Running the script in Jupyter is pretty easy:
In [2]:
from rdkit import RDConfig
In [8]:
loc = f"{RDConfig.RDBaseDir}/Regress/Scripts"
!cd $loc;python fingerprint_screenout.py
[07:21:19] INFO: mols from smiles
[07:21:27] INFO: Results1:  7.77 seconds, 50000 mols
[07:21:27] INFO: queries from smiles
[07:21:27] INFO: Results2:  0.16 seconds
[07:21:27] INFO: generating pattern fingerprints for mols
[07:21:43] INFO: Results3:  16.11 seconds
[07:21:43] INFO: generating pattern fingerprints for queries
[07:21:43] INFO: Results4:  0.34 seconds
[07:21:43] INFO: testing frags queries
[07:22:03] INFO: Results5:  19.90 seconds. 6753 tested (0.0003 of total), 3989 found,  0.59 accuracy. 0 errors.
[07:22:03] INFO: testing leads queries
[07:22:23] INFO: Results6:  19.77 seconds. 1586 tested (0.0001 of total), 1067 found,  0.67 accuracy. 0 errors.
[07:22:23] INFO: testing pieces queries
[07:23:19] INFO: Results7:  55.37 seconds. 3333202 tested (0.0810 of total), 1925628 found,  0.58 accuracy. 0 errors.
| 2019.09.1dev1 | 7.8 | 0.2 | 16.1 | 0.3 | 19.9 | 19.8 | 55.4 |
Let's start by looking at how effective the screenout is. This is captured in the lines for Results5 (fragments), Results6 (leads), and Results7 (pieces). Comparing the screenout accuracy (this is the fraction of compounds passing the fingerprint screen that actually had a match; e.g. 3989/6753 ≈ 0.59 for the fragments), we see values comparable to 2013:
|           | 2013 | 2019 |
| Fragments | 0.59 | 0.59 |
| Leads     | 0.72 | 0.67 |
| Pieces    | 0.57 | 0.58 |
This is what I'd hope to see: we haven't made any substantial changes to the fingerprinter itself since 2013 except to improve the performance on single-atom queries of "strange" elements, so the screenout accuracy shouldn't have changed much. Note that we wouldn't expect the results to be exactly the same since the set of molecules being searched through is different.
It's worth pointing out the effectiveness of the pattern fingerprints in reducing the number of substructure searches that actually have to be done here: for the fragments queries only 6753 of the 25 million possible comparisons, 0.03%, actually need to be done. For the leads it's 0.01%, and for the pieces it's 8.1%. Nice!
Now let's look at the impact of John's VF2 changes on the overall runtime. For that we can just look at the summary timing information in the last line of the output above and compare it to what I got when I ran the same script using the most recent RDKit release:
| 2019.03.2     | 7.8 | 0.2 | 24.9 | 0.5 | 20.6 | 20.3 | 65.5 |
| 2019.09.1dev1 | 7.8 | 0.2 | 16.1 | 0.3 | 19.9 | 19.8 | 55.4 |
Given the low number of substructure searches run for the fragments and leads queries, the times in columns 6 and 7 are dominated by the fingerprint comparisons, so there's not much difference. The pieces queries, on the other hand, do show a nice improvement: the overall runtime (including the fingerprint screening) drops from 65.5 seconds to 55.4 seconds. The other nice improvement is in the amount of time required to generate the pattern fingerprints for the 50K molecules (column 4): this drops from 24.9 seconds to 16.1 seconds: it's now running in 65% of the time.
In order to get a better feeling for the speedup from the VF2 changes, I ran the fingerprint_screenout.py script with the --validate argument; this performs all of the substructure matches in order to validate that the pattern fingerprints aren't filtering out any true matches (they aren't). That takes a lot longer to run, so I will just show the results I got when I ran it:
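(The invocation is the same as above, just with the extra flag:)

!cd $loc;python fingerprint_screenout.py --validate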
| 2019.03.2     | 8.1 | 0.2 | 24.9 | 0.5 | 356.3 | 372.1 | 432.1 |
| 2019.09.1dev1 | 7.9 | 0.2 | 16.2 | 0.3 | 205.1 | 207.4 | 276.4 |
From these results we can see that the substructure searches now run in 56% to 64% of the time. Very nice!