This is a relatively short one because I just wanted to point people to an older (thus buried) dataset from some former colleagues that I've found really useful in the past but that I think a lot of folks aren't aware of. The dataset is in the supplementary material to this paper: https://pubs.acs.org/doi/abs/10.1021/jm020472j "Informative Library Design as an Efficient Strategy to Identify and Optimize Leads: Application to Cyclin-Dependent Kinase 2 Antagonists" by Erin Bradley et al. The paper itself is worth reading, but the buried treasure is the Excel file in the supplementary material, which contains SMILES and measured data for >17K compounds. There's also a very useful PDF which explains the columns in that file.
What's very cool is that the compounds are a mix of things from a small general-purpose screening library and compounds purchased or synthesized for a med chem project. I'm not aware of any other public datasets that have this type of information.
Let's look at what's there.
Note: This is another one of those posts that blogger couldn't quite deal with, so I did some editing here. There's a bit more in the jupyter notebook on GitHub.
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import rdFMCS
import pandas as pd
import rdkit
print(rdkit.__version__)
Start by reading in the file and adding a molecule column:
df = pd.read_excel('../data/jm020472j_s2.xls')
df.head()
PandasTools.AddMoleculeColumnToFrame(df)
df = df.drop('Smiles',axis=1)
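Before dropping the SMILES column it can be worth checking whether every SMILES was parsed. The post doesn't say whether any rows in the spreadsheet fail, so this is just a minimal sketch on a made-up three-row frame (the names and SMILES below are invented, and one is deliberately broken). It builds the molecules with Chem.MolFromSmiles directly rather than with PandasTools:

```python
import pandas as pd
from rdkit import Chem

# Toy data, invented for illustration; not taken from the spreadsheet.
toy = pd.DataFrame({
    'mol_name': ['toy_1', 'toy_2', 'toy_3'],
    # the last SMILES has an unclosed ring and is deliberately invalid
    'Smiles': ['c1ccccc1O', 'CC(=O)Nc1ccccc1', 'c1ccccc'],
})

# MolFromSmiles returns None when a SMILES can't be parsed
toy['ROMol'] = [Chem.MolFromSmiles(smi) for smi in toy['Smiles']]

# rows where parsing failed
failed = toy[toy['ROMol'].isna()]
n_failed = len(failed)
print(f'{n_failed} of {len(toy)} SMILES failed to parse')
print(failed[['mol_name', 'Smiles']])
```

The same isna() check should work on the ROMol column that PandasTools adds, as long as it is run before the SMILES column is dropped.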
The "sourcepool" column indicates which kind of compound each row is:
divscreen: compounds from the screening library
simscreen: vendor compounds picked by chemists using similarity screening
synscreen: compounds made and screened during the course of the project
df.groupby('sourcepool')['mol_name'].count()
Another interesting column is scaffold, which contains human-assigned scaffolds for the compounds:
df[df.sourcepool=='divscreen'].groupby('scaffold')['mol_name'].count()
df[df.sourcepool=='simscreen'].groupby('scaffold')['mol_name'].count()
synscreen = df[df.sourcepool=='synscreen']
synscreen.groupby('scaffold')['mol_name'].count()
Let's look at some compounds:
IPythonConsole.drawOptions.bondLineWidth=1
IPythonConsole.ipython_useSVG = True
PandasTools.FrameToGridImage(synscreen[synscreen.scaffold=='Scaffold_18'],molsPerRow=4,maxMols=8)
We don't have the actual scaffold definitions, but we can use the standard MCS trick to guess:
scaff = 'Scaffold_18'
params = rdFMCS.MCSParameters()
params.BondCompareParameters.CompleteRingsOnly=True
params.Threshold=0.8
mcs = rdFMCS.FindMCS(list(synscreen[synscreen.scaffold==scaff].ROMol),params)
print(mcs.smartsString)
tmp = synscreen[synscreen.scaffold==scaff]
tmp[tmp.ROMol>=Chem.MolFromSmarts(mcs.smartsString)].head()
scaff = 'Scaffold_12'
params = rdFMCS.MCSParameters()
params.BondCompareParameters.CompleteRingsOnly=True
params.Threshold=0.8
mcs = rdFMCS.FindMCS(list(synscreen[synscreen.scaffold==scaff].ROMol),params)
print(mcs.smartsString)
tmp = synscreen[synscreen.scaffold==scaff]
tmp[tmp.ROMol>=Chem.MolFromSmarts(mcs.smartsString)].head()
from collections import defaultdict
params = rdFMCS.MCSParameters()
params.BondCompareParameters.CompleteRingsOnly=True
params.Threshold=0.8
scaffs = defaultdict(list)
for scaff in synscreen.scaffold.unique():
    # copy the slice so that replacing the ROMol column doesn't write into a view of synscreen
    subset = synscreen[synscreen.scaffold==scaff].copy()
    subset['ROMol'] = [Chem.Mol(x) for x in list(subset.ROMol)]
    # if len(subset)>100:
    #     continue
    print(f'Doing {scaff} with {len(subset)} mols')
    mcs = rdFMCS.FindMCS(list(subset.ROMol),params)
    print(mcs.smartsString)
    matches = subset[subset.ROMol>=Chem.MolFromSmarts(mcs.smartsString)]
    # use the first compound that matches the MCS as the representative for this scaffold
    mol = matches.iloc[0].ROMol
    scaffs['scaffold'].append(scaff)
    scaffs['ROMol'].append(mol)
    scaffs['smarts'].append(mcs.smartsString)
    scaffs['timed_out'].append(mcs.canceled)
scaffs = pd.DataFrame(scaffs)
counts = [len(synscreen[synscreen.scaffold == scaff]) for scaff in synscreen.scaffold.unique()]
#scaffs['count'] = [x for x in counts if x<=100]
scaffs['count'] = counts
scaffs
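To see what the Threshold setting does in the MCS trick used above, here's a small self-contained sketch on a toy set. The six SMILES are invented for illustration and aren't from the dataset: five share a phenethyl core and one is an unrelated alkane. With Threshold=0.8 the MCS only has to be present in 80% of the molecules, so the outlier shouldn't collapse the result.

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

# Toy molecules (invented): five phenethyl derivatives plus one outlier (dodecane)
smiles = ['NCCc1ccccc1', 'OCCc1ccccc1', 'CCCc1ccccc1',
          'ClCCc1ccccc1', 'FCCc1ccccc1', 'CCCCCCCCCCCC']
mols = [Chem.MolFromSmiles(smi) for smi in smiles]

params = rdFMCS.MCSParameters()
# don't allow partial rings in the MCS
params.BondCompareParameters.CompleteRingsOnly = True
# the MCS only needs to be found in 80% of the input molecules
params.Threshold = 0.8

mcs = rdFMCS.FindMCS(mols, params)
print(mcs.smartsString, mcs.numAtoms)

# check which molecules actually contain the MCS
core = Chem.MolFromSmarts(mcs.smartsString)
matches = [m.HasSubstructMatch(core) for m in mols]
print(matches)
```

Because the MCS doesn't have to match everything, it's worth checking which molecules really contain it, which is what the substructure filter in the code above does.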
The measured data
Finally, let's look at the measured data that's present.
Every compound has measured % inhibition values and an assignment to "active", "inactive", and "gray" bins in the cdk_act_bin_1 column (the assignment scheme for this is in that PDF).
df.groupby('cdk_act_bin_1')['mol_name'].count()
df.groupby(['sourcepool','cdk_act_bin_1'])['mol_name'].count()
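The two-level groupby gives a long list of counts; the same information can be arranged as a table with pd.crosstab. This is only a sketch on invented rows. The column names follow the ones used above, but the values are made up and say nothing about the real distribution:

```python
import pandas as pd

# Toy rows, invented for illustration
toy = pd.DataFrame({
    'mol_name': ['a', 'b', 'c', 'd', 'e', 'f'],
    'sourcepool': ['divscreen', 'divscreen', 'simscreen',
                   'synscreen', 'synscreen', 'synscreen'],
    'cdk_act_bin_1': ['inactive', 'gray', 'inactive',
                      'active', 'active', 'inactive'],
})

# rows: source pool, columns: activity bin, cells: number of compounds
table = pd.crosstab(toy['sourcepool'], toy['cdk_act_bin_1'])
print(table)
```

Combinations that never occur show up as 0 in the table, where the groupby output simply omits them.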
There are also a smaller number of measured IC50 values:
df[df.cdk2_ic50 != 'None'].groupby('sourcepool')['mol_name'].count()
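The filter above suggests that missing IC50 values are stored as the string 'None'. If that's the case, the column has to be converted before doing any arithmetic with it. Here's a minimal sketch on invented rows; it assumes the measured values can be parsed as numbers, which I haven't checked against the real file:

```python
import pandas as pd

# Toy rows, invented for illustration; the IC50 values are made up
toy = pd.DataFrame({
    'mol_name': ['cmpd_a', 'cmpd_b', 'cmpd_c', 'cmpd_d'],
    'sourcepool': ['divscreen', 'synscreen', 'synscreen', 'simscreen'],
    'cdk2_ic50': ['None', '0.25', 'None', '1.5'],
})

# anything that can't be parsed as a number (like 'None') becomes NaN
toy['ic50_numeric'] = pd.to_numeric(toy['cdk2_ic50'], errors='coerce')

# keep only the rows that have a measured value
measured = toy.dropna(subset=['ic50_numeric'])
counts = measured.groupby('sourcepool')['mol_name'].count()
print(counts)
```

Note that errors='coerce' silently turns every unparseable entry into NaN, so it's worth comparing the number of NaNs with the number of 'None' strings to catch anything unexpected.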
I'll be using this dataset in future blog posts, but I figured it was worth pointing out here.
Enjoy! :-)
1 comment:
Hi, great blog post. I'm trying to use rdkit and pandas to import data from an excel file, but in this case we have MOL information inside the xlsx (this file has a chemoffice workbook). I know this is a long shot, but do you know any tools to automatically export this table (I know I could probably export from excel using the chemoffice extension, but it's not working for some reason). I would like to use the MOL info instead of SMILEs because we have enhanced stereochemistry in some compounds (and the idea was to use rdkit to write cxsmiles for all compounds). Thanks for any insights!