Wednesday, October 9, 2013

Fingerprint Thresholds

Thresholds for "random" in fingerprints the RDKit supports

Updated 27 May, 2019 to use Python 3 and add the count-based Avalon fingerprints
A frequent question that comes up when considering fingerprint similarity is: "What threshold should I use to determine what a neighbor is?" The answer is poorly defined. Of course it depends heavily on the details of the fingerprint, but there's also a very subjective component: you want to pick a low enough threshold that you're sure you won't miss anything, but you don't want to pick up too much noise.
The goal here is to systematically come up with some guidelines that can be used for fingerprints supported within the RDKit. We will do that by looking at similarities between random "drug-like" (MW<600) molecules picked from ChEMBL.
For the analysis, the 25K similarity values are sorted and the values at particular levels (the 90%, 95%, and 99% points of the sorted list) are examined.
There's a fair amount of code and results below, so here's the summary table. To help interpret this: 22500 of the 25000 pairs (90%) have a MACCS keys similarity value less than 0.528.
Fingerprint                 Metric    90% level  95% level  99% level
MACCS                       Tanimoto  0.528      0.573      0.652
MFP0 (counts)               Tanimoto  0.525      0.568      0.649
MFP1 (counts)               Tanimoto  0.333      0.365      0.428
MFP2 (counts)               Tanimoto  0.230      0.255      0.306
MFP3 (counts)               Tanimoto  0.178      0.198      0.240
MFP0 (bits)                 Tanimoto  0.529      0.571      0.654
MFP1 (bits)                 Tanimoto  0.341      0.372      0.437
MFP2 (bits)                 Tanimoto  0.246      0.272      0.321
MFP3 (bits)                 Tanimoto  0.203      0.224      0.264
FeatMFP0 (counts)           Tanimoto  0.692      0.742      0.825
FeatMFP1 (counts)           Tanimoto  0.472      0.512      0.584
FeatMFP2 (counts)           Tanimoto  0.333      0.364      0.425
FeatMFP3 (counts)           Tanimoto  0.255      0.278      0.331
FeatMFP0 (bits)             Tanimoto  0.692      0.741      0.825
FeatMFP1 (bits)             Tanimoto  0.476      0.516      0.587
FeatMFP2 (bits)             Tanimoto  0.346      0.376      0.438
FeatMFP3 (bits)             Tanimoto  0.275      0.299      0.349
RDKit4                      Tanimoto  0.283      0.325      0.425
RDKit5                      Tanimoto  0.255      0.286      0.370
RDKit6                      Tanimoto  0.282      0.307      0.367
RDKit7                      Tanimoto  0.390      0.429      0.510
RDKit4 (linear)             Tanimoto  0.307      0.354      0.462
RDKit5 (linear)             Tanimoto  0.268      0.307      0.406
RDKit6 (linear)             Tanimoto  0.248      0.280      0.366
RDKit7 (linear)             Tanimoto  0.235      0.263      0.339
Atom pairs (counts)         Tanimoto  0.238      0.266      0.326
Torsions (counts)           Tanimoto  0.165      0.198      0.264
Atom pairs (bits)           Tanimoto  0.336      0.364      0.415
Torsions (bits)             Tanimoto  0.190      0.221      0.290
Avalon (512 bits)           Tanimoto  0.462      0.504      0.579
Avalon (1024 bits)          Tanimoto  0.341      0.377      0.451
Avalon counts (512 bits)    Tanimoto  0.381      0.417      0.497
Avalon counts (1024 bits)   Tanimoto  0.346      0.384      0.469
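
As a minimal sketch of how a level in the table is read off: sort the similarity values and take the one at index int(level * n), the same indexing used in the compareFPs function later in this post. The helper name level_value and the toy values are invented for illustration and are not part of the original notebook.

```python
def level_value(sims, level):
    """Return the value at the given level (0 <= level <= 1) of the sorted similarities.

    Hypothetical helper for illustration; it mirrors the sl[int(level * n)] lookup.
    """
    sl = sorted(sims)
    n = len(sl)
    # Clamp the index so that level=1.0 returns the largest value
    # instead of raising an IndexError.
    idx = min(int(level * n), n - 1)
    return sl[idx]


# Toy similarity values, not real data.
toy_sims = [0.40, 0.10, 0.30, 0.20]
print(level_value(toy_sims, 0.5))
print(level_value(toy_sims, 0.75))
```

With 25000 pairs and a level of 0.9 this lookup lands on index 22500, which is the value that 22500 of the pairs fall below.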

In [2]:
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit.Avalon import pyAvalonTools
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
from rdkit import DataStructs
from collections import defaultdict
import pickle,random,gzip
print(rdBase.rdkitVersion)
2013.09.1pre

Read in the data

We're using the set of 25K reference pairs generated in an earlier post: http://rdkit.blogspot.ch/2013/10/building-similarity-comparison-set-goal.html
As a quick reminder: these are pairs of molecules taken from ChEMBL with MW<600 and a count-based MFP0 similarity of at least 0.7 to each other.
In [6]:
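# open in text mode ('rt') so that split() yields str rather than bytes under Python 3
with gzip.open('../data/chembl16_25K.pairs.txt.gz','rt') as inf:
    ind = [x.split() for x in inf]
ms1 = []
ms2 = []
# columns used below: 0 = id of molecule 1, 1 = its SMILES, 2 = id of molecule 2, 3 = its SMILES
for row in ind:
    m1 = Chem.MolFromSmiles(row[1])
    ms1.append((row[0],m1))
    m2 = Chem.MolFromSmiles(row[3])
    ms2.append((row[2],m2))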
    
Those pairs are related to each other, but we want random pairs, so shuffle the second list:
In [7]:
random.seed(23)
random.shuffle(ms2)
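
To see why shuffling only one of the two lists is enough to break the pairing, here is a small sketch with toy identifiers standing in for the (id, molecule) tuples. The names left and right and the labels are invented for illustration.

```python
import random

# Toy stand-ins for the (id, molecule) tuples; left[i] starts out paired with right[i].
left = [("a%d" % i, None) for i in range(1000)]
right = [("b%d" % i, None) for i in range(1000)]

random.seed(23)        # fixed seed so the shuffle is reproducible
random.shuffle(right)  # in-place shuffle of the second list only

# Count the positions that still hold their original partner (same numeric suffix).
still_paired = sum(1 for l, r in zip(left, right) if l[0][1:] == r[0][1:])
print(still_paired, "of", len(left), "positions still hold the original partner")
```

A random permutation leaves about one position unchanged on average, whatever the list length, so a stray original pair can survive the shuffle; among 25K pairs that is negligible.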
In [8]:
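# hist and xlabel were used unqualified in the original notebook (presumably via pylab mode)
from matplotlib.pyplot import hist, xlabel

def compareFPs(ms1,ms2,fpfn,fpName):
    # fingerprint both lists, then compare them position by position
    fps = [fpfn(x[1]) for x in ms1]
    fp2s = [fpfn(x[1]) for x in ms2]
    sims = [DataStructs.TanimotoSimilarity(x,y) for x,y in zip(fps,fp2s)]
    sl = sorted(sims)
    npairs = len(sl)
    # print the similarity value that each fraction of the pairs falls below
    for frac in (.7,.8,.9,.95,.99):
        print(frac,sl[int(frac*npairs)])
    hist(sims,bins=20)
    xlabel(fpName)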
    

MACCS

In [10]:
compareFPs(ms1,ms2,lambda x:rdMolDescriptors.GetMACCSKeysFingerprint(x),"MACCS")
0.7 0.430555555556
0.8 0.470588235294
0.9 0.52808988764
0.95 0.573529411765
0.99 0.655913978495

Morgan FPs

count based

In [11]:
compareFPs(ms1,ms2,lambda x:rdMolDescriptors.GetMorganFingerprint(x,0),"Morgan0")
0.7 0.428571428571
0.8 0.470588235294
0.9 0.526315789474
0.95 0.571428571429
0.99 0.653846153846
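
The bit-based and count-based rows in the summary table differ in what goes into the Tanimoto calculation. The following is a rough pure-Python sketch, not the RDKit implementation: tanimoto_bits and tanimoto_counts are illustrative names, and the count form shown is the usual min/max generalization, assumed here to match what DataStructs.TanimotoSimilarity computes for count vectors.

```python
def tanimoto_bits(a, b):
    """Tanimoto for two collections of on-bit indices: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def tanimoto_counts(a, b):
    """Tanimoto for two {feature: count} dicts: sum of minima / sum of maxima."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 0.0


# Toy fingerprints (feature id -> count), not real molecules.
fp1 = {1: 3, 2: 1, 5: 2}
fp2 = {1: 1, 2: 1, 7: 1}
print(tanimoto_bits(fp1.keys(), fp2.keys()))  # only presence/absence matters
print(tanimoto_counts(fp1, fp2))              # differing counts lower the score
```

In this toy case the two fingerprints share two of four features, but feature 1 occurs three times in one and once in the other, so the count-based value comes out lower than the bit-based one.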

Loading the full notebook results seems to break Blogger. You can access them on GitHub here or view them in the nbviewer here.

2 comments:

Unknown said...

Hi Greg
This can also be thought of in the context of neighbourhood behaviour. See http://pubs.acs.org/doi/abs/10.1021/ci800302g and references therein for some alternative strategies for arriving at the "optimal" threshold.
Regards
Stephen

greg landrum said...

First: sorry for the slow reply.
I thought I had blogger set up to email me when comments are posted, but obviously I don't.

Thanks for the pointer to the neighborhood behavior paper. I'll take a look and either revisit this post or do a follow-up.