I noticed the mention of Intake in an Anaconda blog post last year and filed it away as: "oooo, that's interesting, I should look into that at some point." I really like the idea of having a straightforward Python-based system for creating and working with a data catalog; I've got way too many random datasets floating around that I use for various things and I think Intake could help bring some order to that.
Today I decided to celebrate Friday by doing a bit of tech exploration and picked Intake. After going quickly through the documentation and making sure the library is still alive (i.e. active on github), I dove in.
Of course what I'm most interested in doing is figuring out how to add chemistry file formats to Intake. So after some initial experimentation to make sure I can use the library at all and that it's actually as straightforward as it seems, I did the obvious first thing and added an SDF loader.
This post is mostly about that. Well, that and a short demo of using Intake.
I'm going to create a data catalog that contains three different data sources, all connected to the same dataset: a collection of compounds with hERG data that came from ChEMBL. I'm not 100% sure where this dataset came from (see, I need something to help with this!), but I think it's from this blog post. I wanted multiple linked source files, so in addition to the original file, I created a CSV file with calculated 2D descriptors and an SDF with 3D structures of the molecules and some 3D descriptors.
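For context, here's roughly how derived files like that could be generated with the RDKit. This is just a sketch under assumptions: the SMILES column name (canonical_smiles) and the exact descriptor set are hypothetical, not taken from the actual files used below.
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, rdMolDescriptors
import pandas as pd

df = pd.read_csv('../data/herg_data.txt', sep=' ')
# 'canonical_smiles' is a guess at the SMILES column name in the ChEMBL export
mols = [Chem.MolFromSmiles(smi) for smi in df['canonical_smiles']]

# 2D descriptors -> CSV
pd.DataFrame({
    'molregno': df['molregno'],
    'MolWt': [Descriptors.MolWt(m) for m in mols],
    'MolLogP': [Descriptors.MolLogP(m) for m in mols],
    'TPSA': [Descriptors.TPSA(m) for m in mols],
}).to_csv('../data/herg_mols_props.csv', index=False)

# 3D structures plus a few shape descriptors -> SDF
w = Chem.SDWriter('../data/herg_3dmols.sdf')
for regno, m in zip(df['molregno'], mols):
    mh = Chem.AddHs(m)
    if AllChem.EmbedMolecule(mh, randomSeed=0xf00d) < 0:
        continue
    mh.SetProp('molregno', str(regno))
    mh.SetProp('NPR1', str(rdMolDescriptors.CalcNPR1(mh)))
    mh.SetProp('NPR2', str(rdMolDescriptors.CalcNPR2(mh)))
    mh.SetProp('asphericity', str(rdMolDescriptors.CalcAsphericity(mh)))
    w.write(mh)
w.close()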
Let's get started.
In order to use the code in this notebook, you will need to install intake and hvplot. I'm using conda (of course), so I got intake from conda-forge and used the hvplot version from the default channel:
conda install -c conda-forge intake
conda install hvplot
Note: This is another one of those posts where Blogger doesn't handle the output from nbconvert at all well. You can see the notebook with live bokeh plots on nbviewer or download it directly from github.
In [1]:
from rdkit import Chem
import intake
import pandas as pd
import hvplot.pandas
intake.output_notebook()
Look at the data sources that intake supports by default:
In [2]:
list(intake.registry)
Out[2]:
Using data sources¶
In [3]:
activity_source = intake.open_csv('../data/herg_data.txt',csv_kwargs={'sep':' '})
props_source = intake.open_csv('../data/herg_mols_props.csv')
In [4]:
activity_source.discover()
Out[4]:
In [5]:
props_source.discover()
Out[5]:
The most efficient way to work with the data in a data source is to read() it in. For these data sources, this returns a pandas dataframe:
In [6]:
props_df = props_source.read()
props_df.head()
Out[6]:
And since we've imported hvplot, we now have some nice plotting available:
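For example, a scatter of MolLogP against MolWt (a minimal sketch; it assumes the descriptors CSV includes MolWt and MolLogP columns, as the catalog plots defined below do):
props_df.hvplot(kind='scatter', x='MolWt', y='MolLogP')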
Teaching intake to read SDF files¶
In [39]:
%%writefile ./intake_sdf.py
from rdkit import Chem
import pandas as pd
import intake
import numpy as np


class SDFSource(intake.source.base.DataSource):
    " Inspired by example here: https://github.com/intake/intake-examples/blob/master/tutorial/data_engineer.ipynb "
    container = 'dataframe'
    name = 'sdf'
    version = '0.0.1'
    partition_access = True
    suppl_factory = Chem.SDMolSupplier

    def __init__(self, filename, metadata=None, nPartitions=1, **kwargs):
        super(SDFSource, self).__init__(metadata=metadata)
        self._fname = filename
        self._supplkwargs = kwargs
        # this is, hopefully, for future use:
        self._nPartitions = nPartitions

    def _get_schema(self):
        " reads property names from the first molecule and uses those "
        # FIX: handle the error if the first molecule can't be read
        suppl = self.suppl_factory(self._fname, **self._supplkwargs)
        m0 = suppl[0]
        props = m0.GetPropsAsDict()
        # the molecules themselves go in an object ('O') column:
        dt = {'mol': 'O'}
        for pn, pv in props.items():
            dt[pn] = np.dtype(type(pv)).name
        return intake.source.base.Schema(
            datashape=None,
            dtype=dt,
            shape=(None, len(dt)),
            npartitions=self._nPartitions,
            extra_metadata=self._supplkwargs
        )

    def _get_partition(self, i):
        suppl = self.suppl_factory(self._fname, **self._supplkwargs)
        res = []
        partSize = len(suppl) // self._nPartitions
        start = i * partSize
        if i == self._nPartitions - 1:
            end = len(suppl)
        else:
            end = start + partSize
        for idx in range(start, end):
            m = suppl[idx]
            if m is not None:
                row = {'mol': m}
                row.update(m.GetPropsAsDict())
                res.append(row)
        return pd.DataFrame(res)

    def read(self):
        self._load_metadata()
        return pd.concat([self.read_partition(i) for i in range(self.npartitions)])

    def _close(self):
        # close any files, sockets, etc.
        self.suppl = None
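Before wiring this into Intake's registry, it can be handy to sanity-check the driver directly (a quick sketch using the 3D SDF created earlier):
from intake_sdf import SDFSource
src = SDFSource('../data/herg_3dmols.sdf', nPartitions=2)
src.discover()['dtype']   # column names/dtypes inferred from the first molecule
src.read().shape          # all partitions concatenated into a single DataFrame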
Intake looks for data source drivers when it is imported, so we need to reload the intake module after creating the driver file:
In [9]:
import imp
imp.reload(intake)
Out[9]:
Notice that SDF is now in the registry:
In [10]:
list(intake.registry)
Out[10]:
Now we can load and work with the SDF:
In [11]:
sdf_source = intake.open_sdf('../data/herg_3dmols.sdf',nPartitions=4,removeHs=False)
In [12]:
sdf_source.discover()
Out[12]:
In [13]:
sdf_source.read().head()
Out[13]:
Set up the data catalog¶
Now we're going to combine those three data sources into a single catalog that we can use later. We only have to do this once and can then use it in other scripts/notebooks.
Look at the YAML describing each of the data sources; this is the information we will put into the data catalog:
In [14]:
print(activity_source.yaml())
In [15]:
print(props_source.yaml())
In [16]:
print(sdf_source.yaml())
Now create the YAML file that describes the catalog:
In [26]:
%%writefile ../data/herg.yaml
sources:
  activities:
    args:
      csv_kwargs:
        sep: ' '
      urlpath: ../data/herg_data.txt
    description: 'measured values from ChEMBL'
    driver: intake.source.csv.CSVSource
    metadata: {}
  props:
    args:
      urlpath: ../data/herg_mols_props.csv
    description: 'calculated values'
    driver: intake.source.csv.CSVSource
    metadata:
      plots:
        logp_mw:
          kind: scatter
          x: MolWt
          y: MolLogP
          width: 700
          height: 400
  sdf3d:
    args:
      filename: ../data/herg_3dmols.sdf
      nPartitions: 4
      removeHs: false
    description: '3D structures and descriptors of the compounds'
    driver: intake_sdf.SDFSource
    metadata:
      removeHs: true
      plots:
        shape:
          kind: scatter
          x: asphericity
          y: NPR1
          width: 700
          height: 400
Work with an existing data catalog¶
For future work with this data, the pieces below, together with the intake_sdf.py file created above, are all we need.
In [27]:
cat = intake.open_catalog('../data/herg.yaml')
We can introspect in the catalog to see what's there:
In [28]:
cat.items()
Out[28]:
Here's how you work with one of the sources:
In [29]:
cat.props.describe()
Out[29]:
The plots we defined in the metadata are available:
In [3]:
cat.props.plot.logp_mw()
In [4]:
cat.sdf3d.plot.shape()
And, of course, we can create new plots:
In [5]:
# it's more efficient to create a data frame from the datasource once:
sdf_df = cat.sdf3d.read()
sdf_df.hvplot(kind='scatter',x='NPR1',y='NPR2')
An aside: the plots here are being done with bokeh, which has all kinds of cool functionality that is worth exploring.
It's also interesting to combine the datasets together:
In [34]:
activity_df = cat.activities.read()
props_df = cat.props.read()
In [35]:
merged_df = pd.merge(pd.merge(activity_df,props_df,on='molregno'),sdf_df,on='molregno')
merged_df.head()
Out[35]:
Look at the impact of MolLogP on activity:
In [6]:
import math
merged_df['pActivity'] = merged_df.apply(lambda row: -1*math.log10(row.standard_value*1e-9),axis=1)
merged_df.hvplot(kind='scatter',x='MolLogP',y='pActivity')
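As an aside, the same transform can be written in vectorized form instead of an apply() over rows (a sketch; it assumes, as the 1e-9 factor above does, that standard_value is in nM):
import numpy as np
merged_df['pActivity'] = -np.log10(merged_df['standard_value'] * 1e-9)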
What about the SASA value and TPSA?
In [7]:
merged_df.hvplot(kind='scatter',x='SASA',y='TPSA')
I think this is super promising and there's a lot more we can do. I will probably be doing another post or two as I learn more and start making more use of intake to organize my data sets.