A set of scaffolds from CheMBL papers¶
Andrew Dalke has been working on putting together a dataset for testing substructure screenout performance for a while now. Due to concerns about privacy and/or confidentiality it's really difficult to get "real world" data on the kinds of things that scientists (really mostly chemists) look for in large-scale research systems like the data warehouses one finds in pharma companies. We have the data, but because of the nature of the searches we can't share it. Andrew visited us recently and the topic of substructure screening came up again. (Aside: There were three people in that room who cared deeply about substructure screening and who have invested some real thought and effort in it, how often does that happen? Sometimes I love my job!) During the conversation we came up with a way to fake having access to our actual queries. This post is about that.
One of the common use cases we see for substructure searching in the warehouse is to find all compounds that are in a particular chemical series. We obviously can't disclose the series that people are searching our database for, but we can fake it by collecting a bunch of substructure queries defining medchem-relevant chemical scaffolds and using those. This would have the added advantage of including things that have query features (variable atom and bond lists). Now we just need a source of medchem relevant scaffolds.
I did a post a while ago on the composition of the molecule sets we called "Datasets 2" in the model fusion paper. Since these are pulled from individual publications found in ChEMBL they tend to be made up of a set of molecules from one scaffold and then a few other molecules. One of the things I showed in that post was a method for estimating what the scaffold is in these sets using the RDKit's MCS code. This builds on that.
On to the work¶
Start with the preliminaries¶
We need to do a bunch of imports and define some functions we'll use later.
Note: This is a Python3 notebook, so the code below may not work in Python2.
import numpy
from rdkit.Chem.Draw import IPythonConsole
import pandas as pd
from rdkit.Chem import PandasTools
from rdkit.Chem import AllChem as Chem
from rdkit.Chem import rdFMCS as MCS
import pickle
from rdkit import rdBase
print(rdBase.rdkitVersion)
import time
print(time.asctime())
# define a function to convert a query molecule to SVG using the new rendering code.
from IPython.display import SVG
from rdkit.Chem import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
def moltosvg(mol,molSize=(450,250),kekulize=True):
mc = Chem.Mol(mol.ToBinary())
if kekulize:
try:
Chem.Kekulize(mc)
except:
mc = Chem.Mol(mol.ToBinary())
if not mc.GetNumConformers():
rdDepictor.Compute2DCoords(mc)
drawer = rdMolDraw2D.MolDraw2DSVG(molSize[0],molSize[1])
# the MolDraw2D code is not very good at the moment at dealing with atom queries,
# this is a workaround until that's fixed.
# The rendering is still not going to be perfect because query bonds are not properly indicated
opts = drawer.drawOptions()
for atom in mc.GetAtoms():
if atom.HasQuery() and atom.DescribeQuery().find('AtomAtomicNum')!=0:
opts.atomLabels[atom.GetIdx()]=atom.GetSmarts()
drawer.DrawMolecule(mc)
drawer.FinishDrawing()
svg = drawer.GetDrawingText()
# It seems that the svg renderer used doesn't quite hit the spec.
# Here are some fixes to make it work in the notebook, although I think
# the underlying issue needs to be resolved at the generation step
return svg.replace('svg:','')
# define a function to find the MCS for a set of molecules and return some summary info about that MCS
import time
# we will use a namedtuple to return the results
from collections import namedtuple
MCSRes=namedtuple('MCSRes',('smarts','numAtoms','numMols','avgNumMolAtoms','mcsTime'))
def MCS_Report(ms,printSmarts=True,
atomCompare=MCS.AtomCompare.CompareAny,
bondCompare=MCS.BondCompare.CompareAny,
completeRingsOnly=True,
**kwargs):
t1=time.time()
mcs = MCS.FindMCS(ms,atomCompare=atomCompare,bondCompare=bondCompare,timeout=30,completeRingsOnly=completeRingsOnly,
**kwargs)
t2=time.time()
nAts = numpy.array([x.GetNumAtoms() for x in ms])
print('Mean nAts %.1f, mcs nAts: %d'%(nAts.mean(),mcs.numAtoms))
if printSmarts:
print(mcs.smartsString)
mcsM = Chem.MolFromSmarts(mcs.smartsString)
mcsM.UpdatePropertyCache(False)
Chem.SetHybridization(mcsM)
svg = moltosvg(mcsM,kekulize=False)
tpl = MCSRes(mcs.smartsString,mcs.numAtoms,len(ms),nAts,t2-t1)
return tpl,svg
Grab the benchmarking data¶
import glob
pkls = glob.glob('/home/glandrum/Code/benchmarking_platform/compounds/ChEMBL_II/*.pkl')
print(len(pkls))
Get target information from our local ChEMBL copy:
%load_ext sql
%config SqlMagic.feedback = False
import os
tgts={}
for pkl in pkls:
fn = os.path.split(pkl)[-1]
basen = os.path.splitext(fn)[0]
tgtn=int(basen.split('_')[-1])
data = %sql postgresql://localhost/chembl_20 \
select tid,pref_name,organism from target_dictionary where tid=:tgtn ;
tgts[data[0][0]]=(pkl,data[0][1],data[0][2])
for idx in sorted(tgts.keys()):
pkl,tgt,species = tgts[idx]
print(idx,tgt,species)
The data is organized in a dictionary with one entry per paper.
The MCS_Report()
function defined above prints out the mean number of atoms per molecule in the input along with the size of the MCS. This is intended to help assess whether or not the MCS is actually a scaffold.
sets = pickle.load(open(tgts[11631][0],'rb'))
docid,cmpds = next(iter(sets.items()))
ids,smis= zip(*cmpds)
mols = [Chem.MolFromSmiles(x) for x in smis]
tpl,svg = MCS_Report(mols,threshold=0.8,completeRingsOnly=True)
SVG(svg)
Find scaffolds in bulk¶
Now let's look at all the papers
def processTarget(tid):
pkl,target,species=tgts[tid]
print('-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*')
print('\t',target,species)
sets = pickle.load(open(pkl,'rb'))
print('\t\t %d papers'%(len(sets)))
alldata=[]
for docid,cmpds in sets.items():
data = %sql \
select journal,volume,first_page,year from docs where doc_id=:docid ;
ref=', '.join(str(x) for x in data[0])
print( '----------------')
print(ref)
ids,smis= zip(*cmpds)
mols = [Chem.MolFromSmiles(x) for x in smis]
tpl,svg = MCS_Report(mols,printSmarts=True,threshold=0.8,completeRingsOnly=True)
alldata.append((tid,docid,ref,tpl,mols))
return alldata
alldata=processTarget(11631)
Cool, do it for every target. Note that this takes a while; it would have been clever to do this using IPython's clustering, but I ended up just starting it and going off to do something else.
with open('results/papers_mcs_results.pkl','wb+') as outf:
for tid in tgts:
data = processTarget(tid)
pickle.dump(data,outf)
Read the data back in (this allows us to come back later and repeat analysis work without having to re-calculate all the scaffolds).
alldata = []
with open('results/papers_mcs_results.pkl','rb') as inf:
while 1:
try:
row = pickle.load(inf)
except EOFError:
break
alldata.append(row)
len(alldata)
alldata[0]
Create a Pandas DataFrame with the data so that we can explore it a bit.
Note The molecule renderings in this table are not optimal. I was unable to figure out how to get the SVG to render properly in the notebook. Still, this is hopefully better than nothing.
tbl = []
for row in alldata:
for r in row:
tid,docid,ref,tpl,_ = r
m = Chem.MolFromSmarts(tpl.smarts)
m.UpdatePropertyCache(False)
#svg = moltosvg(m,kekulize=False)
tbl.append((tid,tpl.numAtoms,tpl.avgNumMolAtoms.mean(),m,tpl.smarts,tpl.mcsTime))
df = pd.DataFrame(data=tbl,columns=('tid','numAtoms','avgNumMolAtoms','mol','smarts','time'))
PandasTools.RenderImagesInAllDataFrames(images=True)
df.head(5)
How many "scaffolds" do we have in total?
len(df)
How many are reasonably sized?
len(df[df.numAtoms>6])
Not bad! It's still worth looking at the ones that are super small:
df[df.numAtoms<=6]
We can also render those better:
SVG(moltosvg(df[df.numAtoms<=6].iloc[0,3],kekulize=False,molSize=(250,150)))
SVG(moltosvg(df[df.numAtoms<=6].iloc[1,3],kekulize=False,molSize=(250,150)))
SVG(moltosvg(df[df.numAtoms<=6].iloc[2,3],kekulize=False,molSize=(250,150)))
SVG(moltosvg(df[df.numAtoms<=6].iloc[3,3],kekulize=False,molSize=(250,150)))
Wrapping up¶
There's a lot more we can do with this set of scaffolds. There are more posts to come...
No comments:
Post a Comment