Friday, November 29, 2019

A buried data treasure

This is a relatively short one: I just want to point people to an older (and thus buried) dataset from some former colleagues that I've found really useful in the past but that I think a lot of folks aren't aware of. The dataset is in the supplementary material to this paper: https://pubs.acs.org/doi/abs/10.1021/jm020472j "Informative Library Design as an Efficient Strategy to Identify and Optimize Leads: Application to Cyclin-Dependent Kinase 2 Antagonists" by Erin Bradley et al. The paper itself is worth reading, but the buried treasure is the Excel file in the supplementary material, which contains SMILES and measured data for >17K compounds. There's also a very useful PDF that explains the columns in that file.

What's very cool is that the compounds are a mix of things from a small general-purpose screening library and compounds purchased or synthesized for a med chem project. I'm not aware of any other public datasets that have this type of information.

Let's look at what's there.

Note: This is another one of those posts that Blogger couldn't quite deal with, so I did some editing here. There's a bit more in the Jupyter notebook on GitHub.
In [1]:
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import rdFMCS
import pandas as pd
import rdkit
print(rdkit.__version__)
RDKit WARNING: [11:36:05] Enabling RDKit 2020.03.1dev1 jupyter extensions
2020.03.1dev1

Start by reading in the file and adding a molecule column:

In [2]:
df = pd.read_excel('../data/jm020472j_s2.xls')
df.head()
Out[2]:
   Smiles                                             mol_name  cdk2_ic50  cdk2_inhib  scaffold     sourcepool  cdk_act_bin_1
0  C#CC(/C=C/C)OC(=O)c(ccc1)c(c1)C(=O)O               mol_1     None       23.3        Scaffold_00  divscreen   0
1  C#CC(C)(C)N(CC1C)CC\C1=N\OC(=O)c2cc([N+]([O-])...  mol_2     None       20.5        Scaffold_00  divscreen   0
2  C#CC(C)(C)N(CN1)CNC1=S                             mol_3     None       20.7        Scaffold_00  divscreen   0
3  C#CC(C)(C)N(CN1)C\N=C/1SC                          mol_4     None       1.9         Scaffold_00  divscreen   0
4  C#CC(C)(C)NC(=O)CN(c(c1)cccc1)[S](=O)(=O)c2ccc...  mol_5     None       19.8        Scaffold_00  divscreen   0
In [3]:
PandasTools.AddMoleculeColumnToFrame(df)
df = df.drop('Smiles',axis=1)
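
Before dropping the Smiles column it can be worth checking that every SMILES actually parsed. Here's a minimal sketch on a made-up three-row frame (the toy data and the names `toy`, `failed`, and `n_failed` are hypothetical, and the real file may well parse cleanly). RDKit returns None for a SMILES it can't parse, so failures show up as missing values in the ROMol column:

```python
import pandas as pd
from rdkit.Chem import PandasTools

# Toy stand-in for the real spreadsheet; the last entry is deliberately invalid
toy = pd.DataFrame({'Smiles': ['CCO', 'c1ccccc1', 'not_a_smiles'],
                    'mol_name': ['toy_1', 'toy_2', 'toy_3']})

# Same call as above; by default it reads 'Smiles' and writes 'ROMol'
PandasTools.AddMoleculeColumnToFrame(toy)

# Unparseable SMILES end up as None in the molecule column
failed = toy[toy['ROMol'].isna()]
n_failed = len(failed)
print(n_failed, list(failed['mol_name']))
```

If anything shows up in `failed` on the real data, that's the moment to look at the original SMILES, since the column is gone after the drop.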

The "sourcepool" column indicates where each compound came from:

  • divscreen: compounds from the screening library
  • simscreen: vendor compounds picked by chemists using similarity screening
  • synscreen: compounds made and screened during the course of the project
In [4]:
df.groupby('sourcepool')['mol_name'].count()
Out[4]:
sourcepool
divscreen    13359
simscreen      951
synscreen     3240
Name: mol_name, dtype: int64
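
For a sense of proportion, those counts can be turned into shares of the whole dataset. This is just arithmetic on the numbers copied from Out[4]; nothing here touches the actual file, and the variable names are mine:

```python
# Counts copied from Out[4] above
counts = {'divscreen': 13359, 'simscreen': 951, 'synscreen': 3240}

total = sum(counts.values())
# Percentage of the full dataset contributed by each source pool
shares = {pool: 100 * n / total for pool, n in counts.items()}

for pool, pct in shares.items():
    print(f'{pool:10s} {pct:5.1f}%')
print('total:', total)
```

So roughly three quarters of the 17,550 compounds come from the screening library and a bit under a fifth were made during the project.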

Another interesting column is the scaffold, which contains human-assigned scaffolds for the compounds.

In [5]:
df[df.sourcepool=='divscreen'].groupby('scaffold')['mol_name'].count()
Out[5]:
scaffold
Scaffold_00    12280
Scaffold_01       32
Scaffold_02      111
Scaffold_03       57
Scaffold_04      461
Scaffold_05       37
Scaffold_06       16
Scaffold_07       15
Scaffold_08      103
Scaffold_09       88
Scaffold_11        7
Scaffold_12      109
Scaffold_18        2
Scaffold_20       41
Name: mol_name, dtype: int64
In [6]:
df[df.sourcepool=='simscreen'].groupby('scaffold')['mol_name'].count()
Out[6]:
scaffold
Scaffold_00    460
Scaffold_01     90
Scaffold_03      2
Scaffold_04     91
Scaffold_05     18
Scaffold_08    216
Scaffold_10     37
Scaffold_11      8
Scaffold_12     29
Name: mol_name, dtype: int64
In [7]:
synscreen = df[df.sourcepool=='synscreen']
synscreen.groupby('scaffold')['mol_name'].count()
Out[7]:
scaffold
Scaffold_01    204
Scaffold_02    106
Scaffold_04     78
Scaffold_05    265
Scaffold_06    273
Scaffold_07     76
Scaffold_08     13
Scaffold_09    344
Scaffold_10    502
Scaffold_12     61
Scaffold_13     17
Scaffold_14     12
Scaffold_15     11
Scaffold_16    269
Scaffold_17    157
Scaffold_18     38
Scaffold_19    256
Scaffold_21    558
Name: mol_name, dtype: int64
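
The three per-pool tables above can also be collected into a single scaffold-by-pool table. A sketch using pd.crosstab on a small made-up frame (the toy values are hypothetical; only the two column names match the real data):

```python
import pandas as pd

# Toy frame with the same two columns as the real data (values are made up)
toy = pd.DataFrame({
    'scaffold':   ['Scaffold_00', 'Scaffold_00', 'Scaffold_01',
                   'Scaffold_01', 'Scaffold_01', 'Scaffold_08'],
    'sourcepool': ['divscreen', 'simscreen', 'synscreen',
                   'synscreen', 'divscreen', 'simscreen'],
})

# Rows are scaffolds, columns are source pools, cells are compound counts;
# combinations that never occur are filled with 0
table = pd.crosstab(toy['scaffold'], toy['sourcepool'])
print(table)
```

On the real frame the equivalent call would be pd.crosstab(df.scaffold, df.sourcepool). That layout makes it easy to spot things like Scaffold_00 showing up in divscreen and simscreen but not in synscreen.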

Let's look at some compounds:

In [8]:
IPythonConsole.drawOptions.bondLineWidth=1
IPythonConsole.ipython_useSVG = True
PandasTools.FrameToGridImage(synscreen[synscreen.scaffold=='Scaffold_18'],molsPerRow=4,maxMols=8)
/scratch/RDKit_git/rdkit/Chem/Draw/IPythonConsole.py:188: UserWarning: Truncating the list of molecules to be displayed to 8. Change the maxMols value to display more.
  % (maxMols))
Out[8]:
[Image: grid of the first 8 Scaffold_18 compounds from synscreen, 4 per row]

We don't have the actual scaffold definitions, but we can use the standard MCS trick to guess:

In [10]:
scaff = 'Scaffold_18'
params = rdFMCS.MCSParameters()
params.BondCompareParameters.CompleteRingsOnly=True
params.Threshold=0.8
mcs = rdFMCS.FindMCS(list(synscreen[synscreen.scaffold==scaff].ROMol),params)
print(mcs.smartsString)
tmp = synscreen[synscreen.scaffold==scaff]
tmp[tmp.ROMol>=Chem.MolFromSmarts(mcs.smartsString)].head()
[#6]-&!@[#8]-&!@[#6](=&!@[#8])-&!@[#7]-&!@[#6]
Out[10]:
      mol_name  cdk2_ic50  cdk2_inhib  scaffold     sourcepool  cdk_act_bin_1  ROMol
 946  mol_947   None       11.1111     Scaffold_18  synscreen   0              Mol
1528  mol_1529  None       12.9218     Scaffold_18  synscreen   0              Mol
1529  mol_1530  None       6.47714     Scaffold_18  synscreen   0              Mol
4923  mol_4924  None       0           Scaffold_18  synscreen   0              Mol
4924  mol_4925  None       0.905505    Scaffold_18  synscreen   0              Mol
In [11]:
scaff = 'Scaffold_12'
params = rdFMCS.MCSParameters()
params.BondCompareParameters.CompleteRingsOnly=True
params.Threshold=0.8
mcs = rdFMCS.FindMCS(list(synscreen[synscreen.scaffold==scaff].ROMol),params)
print(mcs.smartsString)
tmp = synscreen[synscreen.scaffold==scaff]
tmp[tmp.ROMol>=Chem.MolFromSmarts(mcs.smartsString)].head()
[#6]-&!@[#7]1:&@[#6](=&!@[#8]):&@[#6]:&@[#6]:&@[#7]:&@[#6]:&@1=&!@[#8]
Out[11]:
      mol_name  cdk2_ic50  cdk2_inhib  scaffold     sourcepool  cdk_act_bin_1  ROMol
1202  mol_1203  None       19.7348     Scaffold_12  synscreen   0              Mol
3236  mol_3237  None       32.6813     Scaffold_12  synscreen   50             Mol
3318  mol_3319  None       18.6061     Scaffold_12  synscreen   0              Mol
3769  mol_3770  None       16.5904     Scaffold_12  synscreen   0              Mol
3787  mol_3788  None       5.70459     Scaffold_12  synscreen   0              Mol
In [99]:
from collections import defaultdict
params = rdFMCS.MCSParameters()
params.BondCompareParameters.CompleteRingsOnly=True
params.Threshold=0.8
scaffs = defaultdict(list)
for scaff in synscreen.scaffold.unique():
    # take an explicit copy so the assignment below doesn't write to a slice of synscreen
    subset = synscreen[synscreen.scaffold==scaff].copy()
    subset['ROMol'] = [Chem.Mol(x) for x in list(subset.ROMol)]
#     if len(subset)>100:
#         continue
    print(f'Doing {scaff} with {len(subset)} mols')
    mcs = rdFMCS.FindMCS(list(subset.ROMol),params)
    print(mcs.smartsString)
    # keep the first molecule matching the MCS as a representative of the scaffold
    matches = subset[subset.ROMol>=Chem.MolFromSmarts(mcs.smartsString)]
    mol = matches.iloc[0].ROMol
    scaffs['scaffold'].append(scaff)
    scaffs['ROMol'].append(mol)
    scaffs['smarts'].append(mcs.smartsString)
    # record whether the MCS search hit its timeout
    scaffs['timed_out'].append(mcs.canceled)
scaffs = pd.DataFrame(scaffs)
Doing Scaffold_05 with 265 mols
[#6]-&!@[#7]-&!@[#6]1:&@[#6]:&@[#6]:&@[#7]:&@[#6](:&@[#7]:&@1)-&!@[#7]-&!@[#6]
Doing Scaffold_10 with 502 mols
[#6]12:&@[#6]:&@[#6]:&@[#7]:&@[#7]:&@1:&@[#6](:&@[#6]:&@[#6](:&@[#7]:&@2)-&!@[#6])=&!@[#8]
Doing Scaffold_09 with 344 mols
[#6]-&!@[#6]1=&@[#7]-&@[#7](-&@[#6](-&@[#6]-&@1)-&!@[#6])-&!@[#6]
Doing Scaffold_16 with 269 mols
[#6]-&!@[#8]-&!@[#6](=&!@[#8])-&!@[#6]1(-&!@[#6])-&@[#7]-&@[#6](-&@[#6]2-&@[#6]-&@1-&@[#6](-&@[#7]-&@[#6]-&@2=&!@[#8])=&!@[#8])-&!@[#6]
Doing Scaffold_07 with 76 mols
[#6]-&!@[#6]1=&@[#7]-&@[#7]-&@[#6](-&@[#6]-&@1)=&!@[#8]
Doing Scaffold_02 with 106 mols
[#6]1:&@[#6]:&@[#6]:&@[#6]:&@[#6](:&@[#6]:&@1)-&!@[#6]1:&@[#6](:&@[#16]:&@[#6](:&@[#7]:&@1)-&!@[#7])-&!@[#6]1:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@1
Doing Scaffold_06 with 273 mols
[#6]-&!@[#6](-&!@[#6]1:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@1-&!@[#8])-&!@[#7]-&!@[#6]
Doing Scaffold_01 with 204 mols
[#6]-&!@[#6]-&!@[#7]1:&@[#6]:&@[#7]:&@[#6]:&@[#6]:&@1-&!@[#6]
Doing Scaffold_13 with 17 mols
[#6]-&!@[#7]-&!@[#6](=&!@[#8])-&!@[#6]1:&@[#6](-&!@[#8]):&@[#6](=&!@[#8]):&@[#6]:&@[#6](:&@[#8]:&@1)-&!@[#6](=&!@[#8])-&!@[#7]-&!@[#6]
Doing Scaffold_19 with 256 mols
[#6]-&!@[#6]-&!@[#7]-&!@[#6]1:&@[#7]:&@[#6]2:&@[#6](:&@[#16]:&@1):&@[#7]:&@[#6]:&@[#7]:&@[#6]:&@2-&!@[#7]-&!@[#6]-&!@[#6]1:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@1
Doing Scaffold_18 with 38 mols
[#6]-&!@[#8]-&!@[#6](=&!@[#8])-&!@[#7]-&!@[#6]
Doing Scaffold_04 with 78 mols
[#6]-&!@[#7]-&!@[#6](=&!@[#8])-&!@[#7]-&!@[#6]-&!@[#6]
Doing Scaffold_17 with 157 mols
[#6]-&!@[#7]1:&@[#6]:&@[#6](:&@[#6]:&@[#7]:&@1)-&!@[#6](=&!@[#8])-&!@[#6]1:&@[#6](:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@1)-&!@[#8]
Doing Scaffold_12 with 61 mols
[#6]-&!@[#7]1:&@[#6](=&!@[#8]):&@[#6]:&@[#6]:&@[#7]:&@[#6]:&@1=&!@[#8]
Doing Scaffold_21 with 558 mols
[#6]-&!@[#6]-&!@[#7]-&!@[#6]1:&@[#6](-&!@[#6]):&@[#7]:&@[#6]:&@[#7]:&@1
Doing Scaffold_15 with 11 mols
[#6]1:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@1
Doing Scaffold_14 with 12 mols
[#6]1:&@[#6]:&@[#6](-&!@[#9]):&@[#6]:&@[#6]:&@[#6]:&@1-&!@[#6]1:&@[#6]2:&@[#6](:&@[#7]:&@[#6]:&@1-&!@[#6]1:&@[#6]:&@[#6]:&@[#7]:&@[#6](:&@[#6]:&@1)-&!@[#7]-&!@[#6]):&@[#7]:&@[#6]:&@[#6]:&@[#7]:&@2
Doing Scaffold_08 with 13 mols
[#6]1:&@[#7]:&@[#7]:&@[#6](:&@[#6]:&@1)-&!@[#6]
In [100]:
counts = [len(synscreen[synscreen.scaffold == scaff]) for scaff in synscreen.scaffold.unique()]
#scaffs['count'] = [x for x in counts if x<=100]
scaffs['count'] = counts
In [101]:
scaffs
Out[101]:
    scaffold     ROMol  smarts                                             timed_out  count
 0  Scaffold_05  Mol    [#6]-&!@[#7]-&!@[#6]1:&@[#6]:&@[#6]:&@[#7]:&@[...  False      265
 1  Scaffold_10  Mol    [#6]12:&@[#6]:&@[#6]:&@[#7]:&@[#7]:&@1:&@[#6](...  False      502
 2  Scaffold_09  Mol    [#6]-&!@[#6]1=&@[#7]-&@[#7](-&@[#6](-&@[#6]-&@...  False      344
 3  Scaffold_16  Mol    [#6]-&!@[#8]-&!@[#6](=&!@[#8])-&!@[#6]1(-&!@[#...  False      269
 4  Scaffold_07  Mol    [#6]-&!@[#6]1=&@[#7]-&@[#7]-&@[#6](-&@[#6]-&@1...  False      76
 5  Scaffold_02  Mol    [#6]1:&@[#6]:&@[#6]:&@[#6]:&@[#6](:&@[#6]:&@1)...  False      106
 6  Scaffold_06  Mol    [#6]-&!@[#6](-&!@[#6]1:&@[#6]:&@[#6]:&@[#6]:&@...  False      273
 7  Scaffold_01  Mol    [#6]-&!@[#6]-&!@[#7]1:&@[#6]:&@[#7]:&@[#6]:&@[...  False      204
 8  Scaffold_13  Mol    [#6]-&!@[#7]-&!@[#6](=&!@[#8])-&!@[#6]1:&@[#6]...  False      17
 9  Scaffold_19  Mol    [#6]-&!@[#6]-&!@[#7]-&!@[#6]1:&@[#7]:&@[#6]2:&...  False      256
10  Scaffold_18  Mol    [#6]-&!@[#8]-&!@[#6](=&!@[#8])-&!@[#7]-&!@[#6]     False      38
11  Scaffold_04  Mol    [#6]-&!@[#7]-&!@[#6](=&!@[#8])-&!@[#7]-&!@[#6]...  False      78
12  Scaffold_17  Mol    [#6]-&!@[#7]1:&@[#6]:&@[#6](:&@[#6]:&@[#7]:&@1...  False      157
13  Scaffold_12  Mol    [#6]-&!@[#7]1:&@[#6](=&!@[#8]):&@[#6]:&@[#6]:&...  False      61
14  Scaffold_21  Mol    [#6]-&!@[#6]-&!@[#7]-&!@[#6]1:&@[#6](-&!@[#6])...  False      558
15  Scaffold_15  Mol    [#6]1:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@[#6]:&@1       False      11
16  Scaffold_14  Mol    [#6]1:&@[#6]:&@[#6](-&!@[#9]):&@[#6]:&@[#6]:&@...  False      12
17  Scaffold_08  Mol    [#6]1:&@[#7]:&@[#7]:&@[#6](:&@[#6]:&@1)-&!@[#6]    False      13
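
To make the role of Threshold in the MCS searches above a bit more concrete, here's a small sketch on a made-up set of molecules (the SMILES are hypothetical and have nothing to do with the dataset). With Threshold=0.8 the MCS only has to be present in 80% of the input molecules, so a single outlier among five doesn't wreck the result:

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

# Hypothetical toy series: four anilides plus one unrelated long alkane
smiles = ['CC(=O)Nc1ccccc1', 'CCC(=O)Nc1ccc(F)cc1', 'CC(=O)Nc1ccc(C)cc1',
          'CC(=O)Nc1ccc(Cl)cc1', 'CCCCCCCCCCCCCCCC']
mols = [Chem.MolFromSmiles(s) for s in smiles]

# Same settings as in the scaffold loop above
params = rdFMCS.MCSParameters()
params.BondCompareParameters.CompleteRingsOnly = True
params.Threshold = 0.8  # the MCS must be found in at least 80% of the molecules

mcs = rdFMCS.FindMCS(mols, params)
core = Chem.MolFromSmarts(mcs.smartsString)

# Count how many of the input molecules actually contain the MCS
n_matching = sum(m.HasSubstructMatch(core) for m in mols)
print(mcs.smartsString, mcs.numAtoms, n_matching, mcs.canceled)
```

This is also why the recovered SMARTS are only a guess at the real scaffold definitions: the result is whatever is shared by most of the compounds carrying a given label, which can be as generic as the bare benzene ring found for Scaffold_15.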

The measured data

Finally, let's look at the measured data that's present.

Every compound has a measured % inhibition value (the cdk2_inhib column) and is assigned to an "active", "inactive", or "gray" bin in the cdk_act_bin_1 column (the assignment scheme is described in the PDF mentioned above).

In [113]:
df.groupby('cdk_act_bin_1')['mol_name'].count()
Out[113]:
cdk_act_bin_1
0      15276
50      1906
100      368
Name: mol_name, dtype: int64
In [118]:
df.groupby(['sourcepool','cdk_act_bin_1'])['mol_name'].count()
Out[118]:
sourcepool  cdk_act_bin_1
divscreen   0                11972
            50                1180
            100                207
simscreen   0                  824
            50                  80
            100                 47
synscreen   0                 2480
            50                 646
            100                114
Name: mol_name, dtype: int64
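
Those counts are easier to compare as per-pool rates. A quick sketch using the numbers copied from Out[118], under the assumption (mine, not checked against the PDF here) that bin 100 is the "active" bin:

```python
# Counts copied from Out[118] above: {sourcepool: {bin: count}}
bins = {
    'divscreen': {0: 11972, 50: 1180, 100: 207},
    'simscreen': {0: 824, 50: 80, 100: 47},
    'synscreen': {0: 2480, 50: 646, 100: 114},
}

# Assumption: cdk_act_bin_1 == 100 marks the "active" compounds
hit_rates = {pool: 100 * b[100] / sum(b.values()) for pool, b in bins.items()}

for pool, rate in hit_rates.items():
    print(f'{pool:10s} {rate:4.2f}% in bin 100')
```

If that reading of the bins is right, the similarity-picked vendor compounds land in the top bin about three times as often as the compounds from the screening library.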

There is also a smaller number of measured IC50 values:

In [139]:
df[df.cdk2_ic50 != 'None'].groupby('sourcepool')['mol_name'].count()
Out[139]:
sourcepool
divscreen    51
simscreen    26
synscreen    34
Name: mol_name, dtype: int64
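
The comparison with 'None' in that filter suggests that missing IC50 values are stored as the string 'None', which leaves the column as a mix of strings and numbers. A minimal sketch of one way to get a numeric column instead, on a made-up frame (the toy values and the cdk2_ic50_num column name are hypothetical):

```python
import pandas as pd

# Toy stand-in: 'None' strings mixed with a measured value
toy = pd.DataFrame({'mol_name': ['toy_1', 'toy_2', 'toy_3'],
                    'cdk2_ic50': ['None', 0.25, 'None']})

# errors='coerce' turns anything that isn't a number into NaN
toy['cdk2_ic50_num'] = pd.to_numeric(toy['cdk2_ic50'], errors='coerce')

# Rows with a measured IC50 are now simply the non-missing ones
n_measured = int(toy['cdk2_ic50_num'].notna().sum())
print(toy)
print(n_measured)
```

With a numeric column, the usual pandas tools (sorting, describe(), plotting) work without special-casing the missing entries.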

I'll be using this dataset in future blog posts, but I figured it was worth pointing out on its own here.

Enjoy! :-)

1 comment:

Hugo de Almeida said...

Hi, great blog post. I'm trying to use RDKit and pandas to import data from an Excel file, but in this case we have MOL information inside the xlsx (the file has a ChemOffice workbook). I know this is a long shot, but do you know of any tools to automatically export this table? (I know I could probably export from Excel using the ChemOffice extension, but it's not working for some reason.) I would like to use the MOL info instead of SMILES because we have enhanced stereochemistry in some compounds (and the idea was to use RDKit to write CXSMILES for all compounds). Thanks for any insights!