Sunday, February 2, 2014

More on Datasets II

More information about the second set of benchmarking data sets.

In our benchmarking paper, we focused on simulating one of the classic virtual screening use cases: you have a set of diverse actives (e.g. from an HTS experiment) and want to identify the next compounds that should be tested. When we did the model fusion paper we wanted to also simulate a second use case: you have a small set of similar actives (e.g. from a paper or patent) and want to identify the next compounds that should be tested. To accomplish this we added a second type of dataset to the benchmarking platform, for want of a better term we called these "Data sets II" (I was advocating for "leave one paper out"; it's probably good that Nikolas and Sereina prevailed).

In the paper itself we didn't have the space to do a great job of describing these datasets and why they are interesting. I'm going to try and at least partially remedy that here. It also gives me a chance to play around with an idea for pulling the scaffold out of a set of compounds.

To build datasets II, Sereina started with the 50 ChEMBL targets we had identified as being difficult enough to be interesting for the machine-learning exercise. For each of those targets, here are the steps:

  1. Find all ChEMBL documents that contain an activity value measured against that target.
  2. For each document, find the number of active molecules. We define actives as compounds with a potency < 10uM.
  3. If the document has less than ten actives, remove it.
  4. If the target has less than four documents remaining, remove it.

This process left us with 37 targets. The targets had from 4-37 papers each and the papers had 10-112 actives.

Given that the documents in ChEMBL are mainly from med chem papers, and knowing the content of the typical med chem paper, we asserted that each paper that makes up these datasets is likely to contain data about one or two chemical series along with a small number of reference compounds. I'll show here that, at least for a random selection of sets, this is actually true.

The approach is a simple one: pull the compounds and calculate the MCS that covers at least 80% of the compounds in the dataset. This arbitrary cutoff allows the reference compound(s) to be ignored. To allow for the small changes that are sometimes made to scaffolds, I use generic atom types for the MCS-finding and then, in a post-processing step, assign the atom types that are preserved across all compounds matching the MCS.

The datasets are in github:

NOTE: This is another example where the post is a bit too big for blogger to handle. You can find the full post here in the nbviewer.

On to the analysis

In [1]:
import pandas as pd

from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
Draw.DrawingOptions.elemDict[0]=(0.,0.,0.)  # draw dummy atoms in black

from rdkit.Chem import PandasTools
from rdkit.Chem import AllChem as Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import MCS
import cPickle
from rdkit import rdBase

import time
print time.asctime()
Sun Feb  2 07:34:15 2014

Grab the data

In [2]:
import glob
pkls = glob.glob('/home/glandrum/Code/benchmarking_platform/compounds/ChEMBL_II/*.pkl')
print len(pkls)

Get target information from our local ChEMBL copy:

In [3]:
%load_ext sql
%config = False
In [4]:
import os
for pkl in pkls:
    fn = os.path.split(pkl)[-1]
    basen = os.path.splitext(fn)[0]
    data = %sql postgresql://localhost/chembl_16 \
          select tid,pref_name,organism from target_dictionary where tid=:tgtn ;
In [5]:
for idx in sorted(tgts.keys()):
    pkl,tgt,species = tgts[idx]
    print idx,tgt,species
15 Carbonic anhydrase II Homo sapiens
25 Glucocorticoid receptor Homo sapiens
43 Beta-2 adrenergic receptor Homo sapiens
51 Serotonin 1a (5-HT1a) receptor Homo sapiens
61 Muscarinic acetylcholine receptor M1 Homo sapiens
65 Cytochrome P450 19A1 Homo sapiens
72 Dopamine D2 receptor Homo sapiens
87 Cannabinoid CB1 receptor Homo sapiens
90 Dopamine D4 receptor Homo sapiens
93 Acetylcholinesterase Homo sapiens
100 Norepinephrine transporter Homo sapiens
107 Serotonin 2a (5-HT2a) receptor Homo sapiens
108 Serotonin 2c (5-HT2c) receptor Homo sapiens
114 Adenosine A1 receptor Homo sapiens
121 Serotonin transporter Homo sapiens
126 Cyclooxygenase-2 Homo sapiens
130 Dopamine D3 receptor Homo sapiens
165 HERG Homo sapiens
259 Cannabinoid CB2 receptor Homo sapiens
10188 MAP kinase p38 alpha Homo sapiens
10193 Carbonic anhydrase I Homo sapiens
10260 Vanilloid receptor Homo sapiens
10280 Histamine H3 receptor Homo sapiens
10434 Tyrosine-protein kinase SRC Homo sapiens
10980 Vascular endothelial growth factor receptor 2 Homo sapiens
11140 Dipeptidyl peptidase IV Homo sapiens
11365 Cytochrome P450 2D6 Homo sapiens
11489 11-beta-hydroxysteroid dehydrogenase 1 Homo sapiens
11534 Cathepsin S Homo sapiens
11575 C-C chemokine receptor type 2 Homo sapiens
11631 Sphingosine 1-phosphate receptor Edg-1  Homo sapiens
12209 Carbonic anhydrase XII Homo sapiens
12252 Beta-secretase 1 Homo sapiens
12952 Carbonic anhydrase IX Homo sapiens
13001 Matrix metalloproteinase-2 Homo sapiens
17045 Cytochrome P450 3A4 Homo sapiens
19905 Melanin-concentrating hormone receptor 1 Homo sapiens

In [6]:
sets = cPickle.load(file(tgts[11631][0]))

The data is organized in a dictionary with one entry per paper.

In [7]:
docid,cmpds = sets.iteritems().next()
data = %sql postgresql://localhost/chembl_16 \
          select journal,volume,first_page,year from docs where doc_id=:docid ;
print ', '.join(str(x) for x in data[0])
Bioorg. Med. Chem. Lett., 22, 144, 2012

In [8]:
ids,smis= zip(*cmpds)
mols = [Chem.MolFromSmiles(x) for x in smis]

Define an MCS-based approach to find the scaffold from the paper:

In [19]:
def MCS_Report(ms,printSmarts=True,atomCompare='any',**kwargs):
      the "convert to specific" algorithm used isn't perfect because it doesn't deal correctly
      with possible symmetries in the MCS-molecule match, but it's at least a start.
    mcs = MCS.FindMCS(ms,atomCompare=atomCompare,timeout=60,**kwargs)
    nAts = numpy.array([x.GetNumAtoms() for x in ms])
    print 'Mean nAts %.1f, mcs nAts: %d'%(nAts.mean(),mcs.numAtoms)
    if printSmarts: 
        print mcs.smarts
    mcsM = Chem.MolFromSmarts(mcs.smarts)
    # find the common atoms
    if atomCompare == 'any':
        mcsM2 = Chem.MolFromSmiles(mcs.smarts,sanitize=False)
        for m in ms:
            match = m.GetSubstructMatch(mcsM) 
            if not match:
            for qidx,midx in enumerate(match):
                anum = m.GetAtomWithIdx(midx).GetAtomicNum()
                if atNums[qidx]==-1:
                elif anum!=atNums[qidx]:
        for idx,atnum in enumerate(atNums):
            if atnum>0:
    return mcsM,img

The function prints out the mean number of atoms per molecule in the input along with the size of the MCS. This is intended to help assess whether or not the MCS is actually a scaffold.

In [10]:
mcs,img = MCS_Report(mols,threshold=0.8)
Mean nAts 30.6, mcs nAts: 29


Now let's look at all the papers

In [11]:
def processTarget(tid):
    print '-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*'
    print '\t',target,species
    sets = cPickle.load(file(pkl))
    print '\t\t %d papers'%(len(sets))
    for docid,cmpds in sets.iteritems():
        data = %sql postgresql://localhost/chembl_16 \
              select journal,volume,first_page,year from docs where doc_id=:docid ;
        ref=', '.join(str(x) for x in data[0])
        print '----------------'
        print ref
        ids,smis= zip(*cmpds)
        mols = [Chem.MolFromSmiles(x) for x in smis]
        mcs,img = MCS_Report(mols,printSmarts=False,threshold=0.8,atomCompare='any')
    return alldata 
In [12]:
Draw.MolsToGridImage([x[2] for x in alldata],molsPerRow=5,kekulize=False)   
 Sphingosine 1-phosphate receptor Edg-1  Homo sapiens
   7 papers
Bioorg. Med. Chem. Lett., 22, 144, 2012
Mean nAts 30.6, mcs nAts: 29
Bioorg. Med. Chem. Lett., 22, 1779, 2012
Mean nAts 31.2, mcs nAts: 29
In [13]:
Draw.MolsToGridImage([x[2] for x in alldata],molsPerRow=5,kekulize=False)   
 Matrix metalloproteinase-2 Homo sapiens
   10 papers
Bioorg. Med. Chem. Lett., 21, 2820, 2011
Mean nAts 37.8, mcs nAts: 28
Bioorg. Med. Chem. Lett., 19, 3445, 2009
Mean nAts 36.1, mcs nAts: 33

No comments: