This one was inspired by a thread on the rdkit-discuss mailing list: https://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg09068.html
An SAR table is often a convenient way to summarize a dataset. These compact views of an SAR dataset have R1 structures in the rows, R2 structures in the columns, and measured values in the cells. I guess you probably know what I'm talking about even if I'm not describing it particularly well; if not, scroll down a bit and you'll see one. :-)
The question on the mailing list was how to build one of these with the RDKit and display it in Jupyter. That's what we'll do in this blog post.
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import rdRGroupDecomposition
from rdkit.Chem import rdDepictor
import rdkit
import pandas as pd
import numpy as np
from rdkit.Chem import PandasTools
import time
print(rdkit.__version__)
print(time.ctime())
Read in our dataset¶
This is data from ChEMBL for a set of compounds with EC50 values measured against S1P1.
!head ../data/S1P1_data.csv
df = pd.read_csv('../data/S1P1_data.csv')
PandasTools.AddMoleculeColumnToFrame(df,smilesCol='canonical_smiles')
df.head()
def groups_to_df(groups,mols,include_core=False,redraw_sidechains=False):
""" converts the results of an r-group decomposition into a humanly useful
DataFrame
"""
cols = ['Mol']+list(groups.keys())
if redraw_sidechains:
for k,vl in groups.items():
if k=='Core':
continue
for i,v in enumerate(vl):
vl[i] = Chem.RemoveHs(v)
rdDepictor.Compute2DCoords(vl[i])
if not include_core:
cols.remove('Core')
del groups['Core']
groups['Mol'] = mols
frame = pd.DataFrame(groups,columns=cols)
PandasTools.ChangeMoleculeRendering(frame)
return frame
Do the R-group decomposition to get something to work with¶
For the demo here I'm going to use the structures from this paper, which is in ChEMBL as CHEMBL3352346.
To get the data in the form we need, we start with an R-group decomposition. This is another one of those topics where I need to do a blog post and/or write more documentation... fortunately, we're just doing the basics here:
# we need to provide a scaffold to use for the r-group decomposition:
doc_scaffold = Chem.MolFromSmarts('[*:1]c1nc([*:2])on1')
# filter the data to just include the rows from our document:
doc_id = 89753
docdf = df[df.doc_id==doc_id]
# align all the molecules to the scaffold:
rdDepictor.Compute2DCoords(doc_scaffold)
for m in docdf.ROMol:
rdDepictor.GenerateDepictionMatching2DStructure(m,doc_scaffold)
# do an r-group decomposition:
groups,unmatched = rdRGroupDecomposition.RGroupDecompose([doc_scaffold],docdf.ROMol,asSmiles=False,asRows=False)
# and look at the results:
ms = [y for x,y in enumerate(docdf.ROMol) if x not in unmatched]
res = groups_to_df(groups,ms,include_core=False)
res.head()
For the purposes of this analysis, it's quicker/easier to get the data from the RGD code in a different format:
groups,unmatched = rdRGroupDecomposition.RGroupDecompose([doc_scaffold],docdf.ROMol,asSmiles=False,asRows=False)
This next block defines the function that actually creates the SAR matrix as HTML
from collections import Counter
from IPython.display import HTML
import base64
def mol_to_img(m):
dm = Draw.PrepareMolForDrawing(m)
d2d = Draw.MolDraw2DCairo(250,200)
dopts = d2d.drawOptions()
dopts.dummiesAreAttachments=True
d2d.DrawMolecule(dm)
d2d.FinishDrawing()
png_data = d2d.GetDrawingText()
png_data = base64.encodebytes(png_data)
html ='<img src="data:image/png;base64,%s">'%png_data.decode()
return html
def run_groups(groups,unmatched,mols,values,r1_label='R1',r2_label='R2',threshold=1):
# generate SAR matrix
# generate SMILES for each of the R-groups and map those
# to the R-group's molecule objects:
r1_smiles = [Chem.MolToSmiles(x) for x in groups[r1_label]]
r2_smiles = [Chem.MolToSmiles(x) for x in groups[r2_label]]
r1_lookup = dict(zip(r1_smiles,groups[r1_label]))
r2_lookup = dict(zip(r2_smiles,groups[r2_label]))
# all_r1s and all_r2s map R indices to the corresponding SMILES:
all_r1s = dict([(y,x) for x,y in enumerate(r1_lookup.keys())])
all_r2s = dict([(y,x) for x,y in enumerate(r2_lookup.keys())])
# labelled_mols will contain 3-tuples:
# (molecule_index,R1_index,R2_index)
labelled_mols = []
residx = 0
for i,m in enumerate(mols):
if i in unmatched:
continue
r1_idx = all_r1s[r1_smiles[residx]]
r2_idx = all_r2s[r2_smiles[residx]]
residx += 1
labelled_mols.append((i,r1_idx,r2_idx))
# We only keep r groups that appear at least `threshold times in the full list:
c1 = Counter()
c2 = Counter()
for idx,i,j in labelled_mols:
c1[i] += 1
c2[j] += 1
freq1 = [x for x,y in c1.items() if y>=threshold]
freq2 = [x for x,y in c2.items() if y>=threshold]
reverse_r1s = dict([(y,x) for x,y in all_r1s.items()])
reverse_r2s = dict([(y,x) for x,y in all_r2s.items()])
freq_r1s = [reverse_r1s[i] for i in freq1]
freq_r2s = [reverse_r2s[i] for i in freq2]
n_r1 = len(freq1)
n_r2 = len(freq2)
# now construct a matrix
matrix = [None]*(n_r1*n_r2)
matrix = np.reshape(matrix,(n_r1,n_r2))
for idx,i,j in labelled_mols:
if i not in freq1 or j not in freq2:
continue
r1idx = freq1.index(i)
r2idx = freq2.index(j)
matrix[r1idx,r2idx] = idx
# now create the html from that
html = "<table>"
ths = "".join("<th>%s</th>"%mol_to_img(r2_lookup[x]) for x in freq_r2s)
html += f"<tr><td></td>{ths}</tr>"
for i1,x in enumerate(freq_r1s):
img = mol_to_img(r1_lookup[x])
row = f"<tr><td>{img}</td>"
for i2,y in enumerate(freq_r2s):
if matrix[i1,i2] is not None:
elem = matrix[i1,i2]
elem = values.iloc[elem]
else:
elem = ''
row += f'<td>{elem}</td>'
row += "</tr>"
html += row
html += "</table>"
return html
html = run_groups(groups,unmatched,docdf.ROMol,docdf.pchembl_value)
HTML(html)
Let's make that more convenient by creating a single wrapper function:
def create_SAR_table(mols,scaffold,values,r1_label='R1',r2_label='R2',threshold=1):
# align all the molecules to the scaffold:
rdDepictor.Compute2DCoords(scaffold)
for m in mols:
if m.HasSubstructMatch(scaffold):
rdDepictor.GenerateDepictionMatching2DStructure(m,scaffold)
# do an r-group decomposition:
groups,unmatched = rdRGroupDecomposition.RGroupDecompose([scaffold],mols,asSmiles=False,asRows=False)
html = run_groups(groups,unmatched,mols,values,
r1_label=r1_label,r2_label=r2_label,threshold=threshold)
return html
# Pull all the structures that contain a common scaffold:
doc_scaffold = Chem.MolFromSmarts('[*:1]c1nc([*:2])on1')
match_df = df[df.ROMol>=doc_scaffold]
# Show the table for R-groups that appear at least 5 times:
html = create_SAR_table(match_df.ROMol,doc_scaffold,match_df.pchembl_value,threshold=5)
HTML(html)
It'd be shiny if we could get that to pop up into a separate HTML window so that the images were a bit bigger, but I'm not willing to dive down that hole at the moment.
This is clearly just a starting point (particularly since it can only show two sets of R groups at a time), but hopefully it's already useful.
2 comments:
Hello Greg,
Thank you so much for your post! It was very helpful. I was wondering if you knew what parameters to tweak in order to get an R decomposition where the R groups contained the same atom indices for each of the query molecules and, if so, how or where to access this information.
I did the R group decomposition by specifying the following:
params = rdRGroupDecomposition.RGroupDecompositionParameters()
params.labels = rdRGroupDecomposition.RGroupLabels.AtomIndexLabels
groups, unmatched = rdRGroupDecomposition.RGroupDecompose([scaffold],mols,asSmiles=False,asRows=False,options=params)
I thought I could retrieve the atom indices by simply iterating through every atom in every R group in groups (nested for loop) and use the .GetIdx() method on each atom. However, the indices that I get for each R group are not the same as those corresponding to the same R group in each query molecule (contained in mols). Any help on this would be greatly appreciated.
Thanks!
Post a Comment