Wednesday, November 6, 2019

Constructing SAR tables in Jupyter

This one was inspired by a thread on the rdkit-discuss mailing list: https://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg09068.html

An SAR table is often a convenient way to summarize a dataset. These compact views of an SAR dataset have R1 structures in the rows, R2 structures in the columns, and measured values in the cells. I guess you probably know what I'm talking about even if I'm not describing it particularly well; if not, scroll down a bit and you'll see one. :-)

The question on the mailing list was how to build one of these with the RDKit and display it in Jupyter. That's what we'll do in this blog post.

In [1]:
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import rdRGroupDecomposition
from rdkit.Chem import rdDepictor
import rdkit
import pandas as pd
import numpy as np
from rdkit.Chem import PandasTools
import time
print(rdkit.__version__)
print(time.ctime())
2019.09.1
Wed Nov  6 09:14:28 2019
RDKit WARNING: [09:14:28] Enabling RDKit 2019.09.1 jupyter extensions

Read in our dataset

This is data from ChEMBL for a set of compounds with EC50 values measured against S1P1.

In [2]:
!head ../data/S1P1_data.csv
"doc_id","molregno","standard_relation","standard_value","standard_units","standard_flag","standard_type","pchembl_value","canonical_smiles","compound_chembl_id"
5839,189018,"=",82,"nM",1,"EC50",7.09,"CCCCCCCCc1ccc(NC(=O)[C@@H](N)COP(=O)(O)O)cc1.OC(=O)C(F)(F)F","CHEMBL332050"
5839,188442,"=",322.1,"nM",1,"EC50",6.49,"CCCCCCCCCCCCCCONC(=O)[C@@H](N)COP(=O)(O)O.OC(=O)C(F)(F)F","CHEMBL115505"
5839,188375,"=",260,"nM",1,"EC50",6.58,"CCCCCCc1ccc(NC(=O)[C@H](N)COP(=O)(O)O)cc1.OC(=O)C(F)(F)F","CHEMBL115344"
5839,188376,"=",598.4,"nM",1,"EC50",6.22,"CCCCCCCCCCCCNC(=O)[C@@H](N)COP(=O)(O)O.OC(=O)C(F)(F)F","CHEMBL324358"
5839,188766,"=",12.7,"nM",1,"EC50",7.9,"CCCCCCCCCCCCCCNC(=O)[C@H](N)COP(=O)(O)O.OC(=O)C(F)(F)F","CHEMBL332472"
5839,188278,"=",58,"nM",1,"EC50",7.24,"CCCCCCCCc1ccc(NC(=O)[C@H](N)COP(=O)(O)O)cc1.OC(=O)C(F)(F)F","CHEMBL422074"
5839,188644,"=",130,"nM",1,"EC50",6.89,"CCCCCCCCCCCCc1ccc(NC(=O)[C@H](N)COP(=O)(O)O)cc1.OC(=O)C(F)(F)F","CHEMBL116953"
5839,188322,"=",8.6,"nM",1,"EC50",8.07,"CCCCCCCCCCc1ccc(NC(=O)[C@H](N)COP(=O)(O)O)cc1.OC(=O)C(F)(F)F","CHEMBL334038"
5839,188407,"=",1805,"nM",1,"EC50",5.74,"CCCCCCCCCCCCCCCCNC(=O)[C@@H](N)COP(=O)(O)O.OC(=O)C(F)(F)F","CHEMBL325408"
In [3]:
df = pd.read_csv('../data/S1P1_data.csv')
PandasTools.AddMoleculeColumnToFrame(df,smilesCol='canonical_smiles')
df.head()
Out[3]:
doc_id molregno standard_relation standard_value standard_units standard_flag standard_type pchembl_value canonical_smiles compound_chembl_id ROMol
0 5839 189018 = 82.0 nM 1 EC50 7.09 CCCCCCCCc1ccc(NC(=O)[C@@H](N)COP(=O)(O)O)cc1.O... CHEMBL332050 Mol
1 5839 188442 = 322.1 nM 1 EC50 6.49 CCCCCCCCCCCCCCONC(=O)[C@@H](N)COP(=O)(O)O.OC(=... CHEMBL115505 Mol
2 5839 188375 = 260.0 nM 1 EC50 6.58 CCCCCCc1ccc(NC(=O)[C@H](N)COP(=O)(O)O)cc1.OC(=... CHEMBL115344 Mol
3 5839 188376 = 598.4 nM 1 EC50 6.22 CCCCCCCCCCCCNC(=O)[C@@H](N)COP(=O)(O)O.OC(=O)C... CHEMBL324358 Mol
4 5839 188766 = 12.7 nM 1 EC50 7.90 CCCCCCCCCCCCCCNC(=O)[C@H](N)COP(=O)(O)O.OC(=O)... CHEMBL332472 Mol
In [4]:
def groups_to_df(groups,mols,include_core=False,redraw_sidechains=False):
    """ converts the results of an r-group decomposition into a humanly useful
    DataFrame
    
    """
    cols = ['Mol']+list(groups.keys())
    if redraw_sidechains:
        for k,vl in groups.items():
            if k=='Core':
                continue
            for i,v in enumerate(vl):
                vl[i] = Chem.RemoveHs(v)
                rdDepictor.Compute2DCoords(vl[i])

    
    if not include_core:
        cols.remove('Core')
        del groups['Core']
    groups['Mol'] = mols
    frame = pd.DataFrame(groups,columns=cols)
    PandasTools.ChangeMoleculeRendering(frame)
    return frame

Do the R-group decomposition to get something to work with

For the demo here I'm going to use the structures from this paper, which is in ChEMBL as CHEMBL3352346.

To get the data in the form we need, we start with an R-group decomposition. This is another one of those topics where I need to do a blog post and/or write more documentation... fortunately, we're just doing the basics here:

In [5]:
# we need to provide a scaffold to use for the r-group decomposition:
doc_scaffold = Chem.MolFromSmarts('[*:1]c1nc([*:2])on1')

# filter the data to just include the rows from our document:
doc_id = 89753
docdf = df[df.doc_id==doc_id]

# align all the molecules to the scaffold:
rdDepictor.Compute2DCoords(doc_scaffold)
for m in docdf.ROMol:
    rdDepictor.GenerateDepictionMatching2DStructure(m,doc_scaffold)

# do an r-group decomposition:
groups,unmatched = rdRGroupDecomposition.RGroupDecompose([doc_scaffold],docdf.ROMol,asSmiles=False,asRows=False) 

# and look at the results:
ms = [y for x,y in enumerate(docdf.ROMol) if x not in unmatched]
res = groups_to_df(groups,ms,include_core=False)
res.head()
Out[5]:
Mol R1 R2
0 Mol Mol Mol
1 Mol Mol Mol
2 Mol Mol Mol
3 Mol Mol Mol
4 Mol Mol Mol

For the purposes of this analysis, it's quicker/easier to get the data from the RGD code in a different format:

In [6]:
groups,unmatched = rdRGroupDecomposition.RGroupDecompose([doc_scaffold],docdf.ROMol,asSmiles=False,asRows=False) 

This next block defines the function that actually creates the SAR matrix as HTML

In [7]:
from collections import Counter
from IPython.display import HTML
import base64
def mol_to_img(m):
    dm = Draw.PrepareMolForDrawing(m)
    d2d = Draw.MolDraw2DCairo(250,200)
    dopts = d2d.drawOptions()
    dopts.dummiesAreAttachments=True
    d2d.DrawMolecule(dm)
    d2d.FinishDrawing()
    png_data = d2d.GetDrawingText() 
    png_data = base64.encodebytes(png_data)
    html ='<img src="data:image/png;base64,%s">'%png_data.decode()
    return html

def run_groups(groups,unmatched,mols,values,r1_label='R1',r2_label='R2',threshold=1):
    # generate SAR matrix
    
    # generate SMILES for each of the R-groups and map those 
    # to the R-group's molecule objects:
    r1_smiles = [Chem.MolToSmiles(x) for x in groups[r1_label]]
    r2_smiles = [Chem.MolToSmiles(x) for x in groups[r2_label]]
    r1_lookup = dict(zip(r1_smiles,groups[r1_label]))
    r2_lookup = dict(zip(r2_smiles,groups[r2_label]))
    
    # all_r1s and all_r2s map R indices to the corresponding SMILES:
    all_r1s = dict([(y,x) for x,y in enumerate(r1_lookup.keys())])
    all_r2s = dict([(y,x) for x,y in enumerate(r2_lookup.keys())])
    
    # labelled_mols will contain 3-tuples:
    #   (molecule_index,R1_index,R2_index)
    labelled_mols = []
    residx = 0
    for i,m in enumerate(mols):
        if i in unmatched:
            continue
        r1_idx = all_r1s[r1_smiles[residx]]
        r2_idx = all_r2s[r2_smiles[residx]]
        residx += 1
        labelled_mols.append((i,r1_idx,r2_idx))
        
    # We only keep r groups that appear at least `threshold times in the full list:
    c1 = Counter()
    c2 = Counter()
    for idx,i,j in labelled_mols:
        c1[i] += 1
        c2[j] += 1
    freq1 = [x for x,y in c1.items() if y>=threshold]
    freq2 = [x for x,y in c2.items() if y>=threshold]
    reverse_r1s = dict([(y,x) for x,y in all_r1s.items()])
    reverse_r2s = dict([(y,x) for x,y in all_r2s.items()])
    freq_r1s = [reverse_r1s[i] for i in freq1]
    freq_r2s = [reverse_r2s[i] for i in freq2]
    n_r1 = len(freq1)
    n_r2 = len(freq2)
    
    # now construct a matrix 
    matrix = [None]*(n_r1*n_r2)
    matrix = np.reshape(matrix,(n_r1,n_r2))
    for idx,i,j in labelled_mols:
        if i not in freq1 or j not in freq2:
            continue
        r1idx = freq1.index(i)
        r2idx = freq2.index(j)
        matrix[r1idx,r2idx] = idx
        
    # now create the html from that
    html = "<table>"
    ths = "".join("<th>%s</th>"%mol_to_img(r2_lookup[x]) for x in freq_r2s)
    html += f"<tr><td></td>{ths}</tr>"
    for i1,x in enumerate(freq_r1s):
        img = mol_to_img(r1_lookup[x])
        row = f"<tr><td>{img}</td>"
        for i2,y in enumerate(freq_r2s):
            if matrix[i1,i2] is not None:
                elem = matrix[i1,i2]
                elem = values.iloc[elem]
            else:
                elem = ''
            row += f'<td>{elem}</td>'    
        row += "</tr>"
        html += row
    html += "</table>"
    return html
In [8]:
html = run_groups(groups,unmatched,docdf.ROMol,docdf.pchembl_value)
HTML(html)
Out[8]:
8.07.4
8.3
8.6
8.3
7.78.3
8.1
8.1
8.4
7.4
8.1
8.2
7.2
6.97.3
5.6
6.8
6.36.4
6.5
6.2
6.46.9
6.57.1
6.87.6
6.97.6
7.1
7.4
8.4
7.88.7
8.1
7.7
7.5
6.6

Let's make that more convenient by creating a single wrapper function:

In [9]:
def create_SAR_table(mols,scaffold,values,r1_label='R1',r2_label='R2',threshold=1):
    # align all the molecules to the scaffold:
    rdDepictor.Compute2DCoords(scaffold)
    for m in mols:
        if m.HasSubstructMatch(scaffold):
            rdDepictor.GenerateDepictionMatching2DStructure(m,scaffold)

    # do an r-group decomposition:
    groups,unmatched = rdRGroupDecomposition.RGroupDecompose([scaffold],mols,asSmiles=False,asRows=False) 
    html = run_groups(groups,unmatched,mols,values,
                      r1_label=r1_label,r2_label=r2_label,threshold=threshold)
    return html
In [10]:
# Pull all the structures that contain a common scaffold:
doc_scaffold = Chem.MolFromSmarts('[*:1]c1nc([*:2])on1')
match_df = df[df.ROMol>=doc_scaffold]

# Show the table for R-groups that appear at least 5 times:
html = create_SAR_table(match_df.ROMol,doc_scaffold,match_df.pchembl_value,threshold=5)
HTML(html)
Out[10]:
9.1410.19.829.310.1
9.7
7.417.51
8.198.07
8.4
9.410.98.5
8.127.127.44
9.218.688.32
10.0
8.51
8.77
7.8910.08.397.78.4
9.77nannan
nan10.310.3
nan9.310.3
8.54
8.466.46
8.17

It'd be shiny if we could get that to pop up into a separate HTML window so that the images were a bit bigger, but I'm not willing to dive down that hole at the moment.

This is clearly just a starting point (particularly since it can only show two sets of R groups at a time), but hopefully it's already useful.

2 comments:

Luis Fernando said...
This comment has been removed by the author.
Luis Fernando said...

Hello Greg,

Thank you so much for your post! It was very helpful. I was wondering if you knew what parameters to tweak in order to get an R decomposition where the R groups contained the same atom indices for each of the query molecules and, if so, how or where to access this information.

I did the R group decomposition by specifying the following:
params = rdRGroupDecomposition.RGroupDecompositionParameters()
params.labels = rdRGroupDecomposition.RGroupLabels.AtomIndexLabels

groups, unmatched = rdRGroupDecomposition.RGroupDecompose([scaffold],mols,asSmiles=False,asRows=False,options=params)


I thought I could retrieve the atom indices by simply iterating through every atom in every R group in groups (nested for loop) and use the .GetIdx() method on each atom. However, the indices that I get for each R group are not the same as those corresponding to the same R group in each query molecule (contained in mols). Any help on this would be greatly appreciated.

Thanks!