Sunday, August 10, 2014

Picking diverse compounds from large sets

Draft post This post will probably be edited over the next day or so.

This post explores two different approaches for picking diverse small subsets of compounds from large sets: random selection and the MaxMin algorithm (Ashton, M. et. al., Quant. Struct.-Act. Relat., 21 (2002), 598-604).

In [1]:
import numpy as np
from __future__ import print_function

from rdkit import Chem
from rdkit.Chem import Draw,rdMolDescriptors,AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit import SimDivFilters,DataStructs

import gzip
%pylab inline

from rdkit import rdBase
import time
Populating the interactive namespace from numpy and matplotlib
Mon Aug 11 07:46:10 2014

Start by reading in a set of molecules from the Zinc Biogenic Compounds (formerly known as "Zinc Natural Products") subset and generating Morgan2 fingerprints for the diversity calculation. The compound structures are available here:

In [2]:
fps = []
inf ='../data/znp.sdf.gz')
suppl = Chem.ForwardSDMolSupplier(inf)
for m in suppl:
    if m is None:
    if not len(fps)%5000: print("Done: %d"%len(fps))

print("Got %d fingerprints"%len(fps))        
Done: 5000
Done: 10000
Done: 15000
Done: 20000
Done: 25000
Done: 30000
Done: 35000
Done: 40000
Done: 45000
Done: 50000
Done: 55000
Done: 60000
Done: 65000
Done: 70000
Done: 75000
Done: 80000
Done: 85000
Done: 90000
Done: 95000
Done: 100000
Done: 105000
Done: 110000
Done: 115000
Done: 120000
Done: 125000
Done: 130000
Done: 135000
Done: 140000
Got 144069 fingerprints

In [5]:
import time
mmp =SimDivFilters.MaxMinPicker()
bv_ids = mmp.LazyBitVectorPick(fps,len(fps),50)
print("That took %.2f seconds"%(t2-t1))
That took 43.99 seconds

In [10]:
for i in range(len(bv_ids)):
    for j in range(i+1,len(bv_ids)):
_=title("MaxMin Picks")

Those are certainly diverse

What about simply picking random compounds?

In [11]:
import random
rr = list(range(len(fps)))
r_ids = rr[:50]
for i in range(len(r_ids)):
    for j in range(i+1,len(r_ids)):
_=title("Random Picks")

At first glance, it looks like the random picks are much less diverse than the picks with the MaxMin algorithm: the peak in the first histogram is significantly shifted to higher similarity.

However: there is an important detail to keep in mind: the "noise level" of the fingerprint. In an earlier post (, I looked at similarities between random pairs of "drug-like" compounds from ChEMBL in order to identify thresholds for when a similarity value is at the noise level (defined by finding the value which 99% of random pairs have a lower similarity). For the fingerprint used here, Morgan2, this value is 0.321.

Using that cutoff, we can easily figure out what fraction of the pairs in the set above are "noise":

In [12]:
count = sum([1 for x in dist_hist2 if x<noise])
print("Count: %d, fraction: %.3f"%(count,1.*count/len(dist_hist2)))
Count: 1210, fraction: 0.988

So almost 99% of the pairs are below the noise level.

Verify that this wasn't just lucky by doing some stats:

In [15]:
for rep in range(100):
    rr = list(range(len(fps)))
    r_ids = rr[:50]
    for i in range(len(r_ids)):
        for j in range(i+1,len(r_ids)):
            if d<noise:
counts = numpy.array(counts,numpy.double)
counts /= nTot
mean = numpy.mean(counts)
dev = numpy.std(counts)
print("Mean: %.3f, std dev: %.3f"%(mean,dev))
Mean: 0.991, std dev: 0.004

Fazit: randomly picking small numbers of compounds from the large set gives subsets that are quite diverse. It's also a lot quicker than using the diversity picker: not only can we skip the diversity picking algorithm, but the molecules and fingerprints don't even need to be constructed.

Clearly this is not going to work as well when picking from compounds that are reasonably closely related to each other, but for small subsets of large diverse sets, it's a solid approach.

1 comment:

George Papadatos said...

Welcome back!
I agree 100% with your post and I've always been skeptical towards presentations/papers of overly sophisticated diverse selection methods which are effectively just a tiny bit more than expensive random number generators.
Btw, this has been confirmed in the bioactivity space too: