Sunday, October 27, 2013

Comparing fingerprints to each other. Part 1

Goal: Look at how much the similarity values produced by different fingerprint methods differ from each other.

This uses a set of pairs of molecules that have a baseline similarity: a Tanimoto similarity using count-based Morgan0 fingerprints of at least 0.7. The construction of this set was presented in an earlier post: http://rdkit.blogspot.com/2013/10/building-similarity-comparison-set-goal.html.
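As a reminder of what the baseline measure computes, here is a minimal sketch of Tanimoto similarity for count-based and bit-based fingerprints. It is written for Python 3 and does not use the RDKit; the function names and the toy feature ids are made up for illustration. The count version is the standard min/sum generalization of Tanimoto; whether it matches `DataStructs.TanimotoSimilarity` in every edge case (for example, two empty fingerprints) is an assumption here.

```python
def tanimoto_counts(fp1, fp2):
    """Tanimoto similarity for count fingerprints stored as {feature_id: count} dicts."""
    # Shared part: for every feature present in both, take the smaller count.
    shared = sum(min(count, fp2[feat]) for feat, count in fp1.items() if feat in fp2)
    total = sum(fp1.values()) + sum(fp2.values())
    if total == 0:
        return 0.0  # assumption: two empty fingerprints get similarity 0
    return shared / (total - shared)


def tanimoto_bits(bits1, bits2):
    """Tanimoto similarity for bit fingerprints stored as sets of on-bit ids."""
    union = len(bits1 | bits2)
    return len(bits1 & bits2) / union if union else 0.0


# Hypothetical fingerprints: feature ids and counts are invented.
fp_a = {101: 3, 202: 2, 303: 1}
fp_b = {101: 2, 202: 2, 404: 1}

print(tanimoto_counts(fp_a, fp_b))          # 4 shared counts out of 7 -> 4/7
print(tanimoto_bits(set(fp_a), set(fp_b)))  # 2 shared bits out of 4 -> 0.5
```

The two values differ for the same pair of molecules, which is one reason the count-based and bit-based variants of a fingerprint are compared separately below.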

Note: this notebook and the data it uses/generates can be found in the GitHub repo: https://github.com/greglandrum/rdkit_blog

Set up

Do the usual imports, read in the molecules, set up the fingerprints we'll compare, and calculate the similarities between the pairs of molecules using those fingerprints.

In [1]:
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit.Avalon import pyAvalonTools
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
from rdkit import DataStructs
from collections import defaultdict
import cPickle,random,gzip
import scipy as sp
import pandas
from scipy import stats
from IPython.core.display import display,HTML,Javascript
print rdBase.rdkitVersion
2013.09.1beta

In [2]:
ind = [x.split() for x in gzip.open('../data/chembl16_25K.pairs.txt.gz')]
ms1 = []
ms2 = []
for i,row in enumerate(ind):
    m1 = Chem.MolFromSmiles(row[1])
    ms1.append((row[0],m1))
    m2 = Chem.MolFromSmiles(row[3])
    ms2.append((row[2],m2))
    
In [2]:
methods = [(lambda x:Chem.RDKFingerprint(x,maxPath=4),'RDKit4'),
           (lambda x:Chem.RDKFingerprint(x,maxPath=5),'RDKit5'),
           (lambda x:Chem.RDKFingerprint(x,maxPath=6),'RDKit6'),
           (lambda x:Chem.RDKFingerprint(x,maxPath=7),'RDKit7'),
           (lambda x:Chem.RDKFingerprint(x,maxPath=4,branchedPaths=False),'RDKit4-linear'),
           (lambda x:Chem.RDKFingerprint(x,maxPath=5,branchedPaths=False),'RDKit5-linear'),
           (lambda x:Chem.RDKFingerprint(x,maxPath=6,branchedPaths=False),'RDKit6-linear'),
           (lambda x:Chem.RDKFingerprint(x,maxPath=7,branchedPaths=False),'RDKit7-linear'),
           
           (lambda x:rdMolDescriptors.GetMorganFingerprint(x,0),'MFP0'),
           (lambda x:rdMolDescriptors.GetMorganFingerprint(x,1),'MFP1'),
           (lambda x:rdMolDescriptors.GetMorganFingerprint(x,2),'MFP2'),
           (lambda x:rdMolDescriptors.GetMorganFingerprint(x,3),'MFP3'),
           (lambda x:rdMolDescriptors.GetMorganFingerprint(x,0,useFeatures=True),'FeatMFP0'),
           (lambda x:rdMolDescriptors.GetMorganFingerprint(x,1,useFeatures=True),'FeatMFP1'),
           (lambda x:rdMolDescriptors.GetMorganFingerprint(x,2,useFeatures=True),'FeatMFP2'),
           (lambda x:rdMolDescriptors.GetMorganFingerprint(x,3,useFeatures=True),'FeatMFP3'),
           (lambda x:rdMolDescriptors.GetHashedMorganFingerprint(x,0),'MFP0-bits'),
           (lambda x:rdMolDescriptors.GetHashedMorganFingerprint(x,1),'MFP1-bits'),
           (lambda x:rdMolDescriptors.GetHashedMorganFingerprint(x,2),'MFP2-bits'),
           (lambda x:rdMolDescriptors.GetHashedMorganFingerprint(x,3),'MFP3-bits'),
           
           (lambda x:rdMolDescriptors.GetAtomPairFingerprint(x),'AP'),
           (lambda x:rdMolDescriptors.GetTopologicalTorsionFingerprint(x),'TT'),
           (lambda x:rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect(x),'AP-bits'),
           (lambda x:rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect(x),'TT-bits'),
            
           (lambda x:rdMolDescriptors.GetMACCSKeysFingerprint(x),'MACCS'),
           (lambda x:pyAvalonTools.GetAvalonFP(x,512),'Avalon-512'),
           (lambda x:pyAvalonTools.GetAvalonFP(x,1024),'Avalon-1024'),
           
           ]
In [30]:
scoredLists={}
In [35]:
for method,nm in methods:
    if not scoredLists.has_key(nm):
        print 'Doing: ',nm
        rl=[]
        for i,(m1,m2) in enumerate(zip(ms1,ms2)):
            fp1 = method(m1[-1])
            fp2 = method(m2[-1])
            sim = DataStructs.TanimotoSimilarity(fp1,fp2)
            rl.append((sim,i))
        scoredLists[nm]=rl
Doing:  AP-bits
Doing:  TT-bits
Doing: 
In [36]:
cPickle.dump(scoredLists,gzip.open('../data/chembl16_25K.pairs.sims.pkl.gz','wb+'))

Set up the comparison code

In [2]:
scoredLists = cPickle.load(gzip.open('../data/chembl16_25K.pairs.sims.pkl.gz','rb'))
In [3]:
def directCompare(scoredLists,fp1,fp2,plotIt=True,silent=False):
    """ We return: Kendall tau, Spearman rho, and Pearson R
    
    """
    l1 = scoredLists[fp1]
    l2 = scoredLists[fp2]
    rl1=[x[-1] for x in l1]
    rl2=[x[-1] for x in l2]
    vl1=[x[0] for x in l1]
    vl2=[x[0] for x in l2]
    if plotIt:
        _=scatter(vl1,vl2,edgecolors='none')
        maxv=max(max(vl1),max(vl2))
        minv=min(min(vl1),min(vl2))
        _=plot((minv,maxv),(minv,maxv),color='k',linestyle='-')
        xlabel(fp1)
        ylabel(fp2)
    
    tau,tau_p=stats.kendalltau(vl1,vl2)
    spearman_rho,spearman_p=stats.spearmanr(vl1,vl2)
    pearson_r,pearson_p = stats.pearsonr(vl1,vl2)
    if not silent:
        print fp1,fp2,tau,tau_p,spearman_rho,spearman_p,pearson_r,pearson_p
    return tau,spearman_rho,pearson_r
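Since Kendall's tau is the main statistic used in the rest of this post, here is a rough pure-Python sketch (Python 3) of what it measures. This is the tau-b variant, which corrects for ties; it is believed to be the default of `stats.kendalltau` in current SciPy releases, but that is an assumption about the library and not something this notebook checks. The function name is made up for illustration.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Naive O(n^2) Kendall tau-b between two equally long sequences."""
    concordant = discordant = ties_x = ties_y = n_pairs = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        n_pairs += 1
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1  # pair is tied in the first ranking
        if dy == 0:
            ties_y += 1  # pair is tied in the second ranking
        if dx * dy > 0:
            concordant += 1  # both rankings order the pair the same way
        elif dx * dy < 0:
            discordant += 1  # the rankings disagree on the order
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else float('nan')


# One swapped neighbour out of four items: 5 concordant, 1 discordant pair.
print(kendall_tau_b([1, 2, 3, 4], [1, 3, 2, 4]))  # (5 - 1) / 6
```

The quadratic loop is only meant to show the definition. For the 25K pairs used here it would be far too slow, which is why the notebook relies on SciPy.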

And now compare a few methods to each other

Start with two very closely related fingerprints:

In [4]:
_=directCompare(scoredLists,'MFP0','MFP0-bits')
MFP0 MFP0-bits 0.948023508497 0.0 0.958016189378 0.0 0.968091057994 0.0

What about two different Morgan fingerprint radii?

In [5]:
_=directCompare(scoredLists,'MFP1','MFP2')
MFP1 MFP2 0.837913477353 0.0 0.961737756836 0.0 0.961445138489 0.0

And a couple of RDKit fingerprint path lengths:

In [6]:
_=directCompare(scoredLists,'RDKit4','RDKit6')
RDKit4 RDKit6 0.671796319619 0.0 0.84783440863 0.0 0.927591672276 0.0

Do all the comparisons so that we can do some statistics on them

In [7]:
ks = sorted(scoredLists.keys())
kappas={}
spearmans={}
pearsons={}
for i,ki in enumerate(ks):
    for j in range(i+1,len(ks)):
        kappa,spearman,pearson=directCompare(scoredLists,ki,ks[j],plotIt=False,silent=True)
        kappas[(ki,ks[j])]=kappa
        spearmans[(ki,ks[j])]=spearman
        pearsons[(ki,ks[j])]=pearson
       
In [8]:
cPickle.dump((ks,kappas,spearmans,pearsons),gzip.open('../data/chembl16_25K.pairs.sim_workup.pkl.gz','wb+'))
In [10]:
(ks,kappas,spearmans,pearsons)=cPickle.load(gzip.open('../data/chembl16_25K.pairs.sim_workup.pkl.gz','rb'))

Load the data into a Pandas dataframe

In [11]:
rows=[]
for k in kappas.keys():
    rows.append([k[0],k[1],kappas[k],spearmans[k],pearsons[k]])

df = pandas.DataFrame(data=rows,columns=['Sim1','Sim2','Tau','Spearman','Pearson'])
df.sort(columns=('Sim1',),inplace=True)
df.head()
Out[11]:
Sim1 Sim2 Tau Spearman Pearson
16 AP MFP3-bits 0.371669 0.524035 0.752549
26 AP MFP3 0.371409 0.523917 0.753598
54 AP TT-bits 0.474149 0.650504 0.782760
81 AP FeatMFP2 0.360262 0.508026 0.695196
86 AP MFP1 0.416702 0.579992 0.746307

Let's get a feeling for what the correlations look like for various tau values.

In [12]:
figure(figsize=(24,4))
subplot(1,5,1)
tau,s,p=directCompare(scoredLists,'MFP1','MFP1-bits',silent=True)
title('tau=%.2f'%tau)
subplot(1,5,2)
tau,s,p=directCompare(scoredLists,'FeatMFP1','FeatMFP3',silent=True)
title('tau=%.2f'%tau)
subplot(1,5,3)
tau,s,p=directCompare(scoredLists,'Avalon-1024','RDKit6',silent=True)
title('tau=%.2f'%tau)
subplot(1,5,4)
tau,s,p=directCompare(scoredLists,'RDKit7','TT',silent=True)
title('tau=%.2f'%tau)
subplot(1,5,5)
tau,s,p=directCompare(scoredLists,'MFP0','RDKit7',silent=True)
_=title('tau=%.2f'%tau)

That last one is somewhat artificial due to the lower bound on MFP0 enforced by the data set.

There's not much correlation left at tau=0.25.
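One way to put numbers on that: when there are no ties, tau is the fraction of concordant pairs minus the fraction of discordant pairs, so the fraction of pairs that both fingerprints order the same way is (1 + tau)/2. The short Python 3 sketch below applies that to the tau values of the five plots above (taken from the full results table at the end of the post). With ties, as in the real data, this is only an approximation, and the function name is made up for illustration.

```python
def concordant_fraction(tau):
    """Fraction of item pairs ordered the same way by both rankings (no ties assumed)."""
    return (1.0 + tau) / 2.0


# Tau values for the five panels, rounded from the results table.
for tau in (0.95, 0.76, 0.52, 0.25, 0.07):
    print('tau=%.2f -> %.1f%% of pairs ordered the same way'
          % (tau, 100 * concordant_fraction(tau)))
```

At tau=0.25 the two fingerprints agree on the order of only about 62% of the pairs, not much better than the 50% expected for unrelated rankings.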

Similar Fingerprints: all the pairs of fingerprints where tau>0.85

In [13]:
HTML(df[df.Tau>0.85].sort(columns=['Tau'],ascending=False).to_html(float_format=lambda x: '%4.3f' % x,
    classes="table display"))
Out[13]:
Sim1 Sim2 Tau Spearman Pearson
80 MFP1 MFP1-bits 0.953 0.992 0.996
141 MFP0 MFP0-bits 0.948 0.958 0.968
240 MFP2 MFP3 0.926 0.992 0.989
48 MFP2 MFP2-bits 0.923 0.989 0.996
205 FeatMFP2 FeatMFP3 0.919 0.990 0.981
106 RDKit5-linear RDKit6-linear 0.917 0.990 0.993
203 RDKit4-linear RDKit5-linear 0.914 0.990 0.990
251 RDKit6-linear RDKit7-linear 0.899 0.985 0.994
165 RDKit4 RDKit4-linear 0.893 0.984 0.988
161 MFP2-bits MFP3-bits 0.892 0.983 0.986
247 MFP3 MFP3-bits 0.887 0.979 0.996
242 MFP2-bits MFP3 0.883 0.980 0.985
22 RDKit4 RDKit5-linear 0.877 0.978 0.985
329 RDKit4 RDKit5 0.871 0.976 0.983
224 MFP2 MFP3-bits 0.867 0.974 0.984
158 TT TT-bits 0.857 0.969 0.981
109 RDKit4-linear RDKit6-linear 0.856 0.971 0.973

Nothing terribly surprising there.

Different Fingerprints: all the pairs of fingerprints where tau<0.3

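We also exclude all of the MFP0 variants, since the lower bound on MFP0 similarity that was used to build the data set makes those comparisons somewhat artificial.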

In [14]:
subset=df[df.Tau<0.3]\
   [~df.Sim1.isin(('MFP0','MFP0-bits','FeatMFP0',))]\
   [~df.Sim2.isin(('MFP0','MFP0-bits','FeatMFP0',))]

HTML(subset.sort(columns=['Tau'],ascending=False).to_html(float_format=lambda x: '%4.3f' % x,
    classes="table display"))
/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py:2021: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)

Out[14]:
Sim1 Sim2 Tau Spearman Pearson
281 AP RDKit6 0.299 0.426 0.663
333 AP RDKit4 0.299 0.428 0.623
278 AP RDKit5-linear 0.298 0.427 0.634
236 AP-bits Avalon-1024 0.296 0.425 0.620
168 Avalon-1024 FeatMFP2 0.296 0.425 0.653
338 Avalon-1024 TT 0.293 0.420 0.683
94 Avalon-1024 FeatMFP3 0.291 0.418 0.679
190 Avalon-1024 FeatMFP1 0.291 0.420 0.563
170 AP-bits MACCS 0.289 0.417 0.517
289 AP RDKit4-linear 0.284 0.409 0.599
122 AP-bits Avalon-512 0.279 0.402 0.574
56 RDKit7 TT-bits 0.278 0.398 0.668
85 Avalon-1024 MFP1 0.261 0.377 0.615
34 FeatMFP1 MACCS 0.260 0.377 0.462
317 Avalon-1024 MFP1-bits 0.259 0.375 0.613
174 Avalon-1024 MFP2-bits 0.257 0.372 0.656
331 Avalon-1024 MFP2 0.257 0.371 0.658
139 Avalon-1024 MFP3-bits 0.255 0.369 0.662
4 FeatMFP2 MACCS 0.253 0.367 0.492
128 Avalon-1024 MFP3 0.251 0.363 0.663
273 RDKit7 TT 0.250 0.360 0.647
241 FeatMFP3 MACCS 0.248 0.361 0.496
138 AP Avalon-1024 0.248 0.360 0.599
212 Avalon-512 TT-bits 0.245 0.354 0.604
87 AP RDKit7 0.244 0.352 0.589
183 MACCS MFP1 0.243 0.354 0.481
19 MACCS TT-bits 0.243 0.354 0.490
127 FeatMFP2 RDKit7 0.240 0.347 0.594
160 MACCS MFP1-bits 0.239 0.348 0.476
301 FeatMFP3 RDKit7 0.230 0.332 0.626
336 MACCS MFP2 0.229 0.335 0.480
107 MACCS MFP2-bits 0.228 0.333 0.478
223 FeatMFP1 RDKit7 0.228 0.331 0.495
119 MACCS MFP3-bits 0.227 0.332 0.474
132 MACCS MFP3 0.225 0.330 0.474
27 Avalon-512 TT 0.224 0.326 0.588
277 MACCS TT 0.224 0.327 0.470
172 Avalon-512 FeatMFP1 0.212 0.310 0.461
219 Avalon-512 FeatMFP2 0.211 0.309 0.540
218 AP Avalon-512 0.206 0.301 0.522
24 Avalon-512 FeatMFP3 0.203 0.297 0.567
196 AP MACCS 0.200 0.293 0.433
7 MFP2-bits RDKit7 0.191 0.279 0.600
238 MFP3-bits RDKit7 0.190 0.278 0.617
268 Avalon-512 MFP2-bits 0.184 0.270 0.551
51 MFP2 RDKit7 0.183 0.269 0.597
176 Avalon-512 MFP3-bits 0.181 0.267 0.560
104 Avalon-512 MFP2 0.179 0.265 0.550
201 Avalon-512 MFP1 0.178 0.262 0.502
312 Avalon-512 MFP1-bits 0.177 0.262 0.500
272 MFP3 RDKit7 0.173 0.254 0.610
229 MFP1-bits RDKit7 0.173 0.254 0.532
243 MFP1 RDKit7 0.173 0.254 0.533
258 Avalon-512 MFP3 0.171 0.252 0.557
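The `UserWarning` printed by the filtering cell above comes from chaining boolean selections: the second and third masks are built from the full `df` but applied to an already filtered frame, so pandas has to realign them. A possible way to avoid that, sketched here for Python 3 and a recent pandas, is to combine the three conditions into a single mask first. The tiny frame below reuses a few rows from the results table purely as example data.

```python
import pandas as pd

# A handful of rows copied from the results table, just for illustration.
df = pd.DataFrame(
    [('AP', 'RDKit6', 0.299),
     ('MFP0', 'RDKit7', 0.072),
     ('MFP1', 'MFP1-bits', 0.953),
     ('Avalon-512', 'MFP3', 0.171),
     ('FeatMFP0', 'TT', 0.159)],
    columns=['Sim1', 'Sim2', 'Tau'])

excluded = ('MFP0', 'MFP0-bits', 'FeatMFP0')

# Build one boolean mask on the full frame, then index once.
mask = (df.Tau < 0.3) & ~df.Sim1.isin(excluded) & ~df.Sim2.isin(excluded)
subset = df[mask].sort_values('Tau', ascending=False)
print(subset)
```

Because the mask and the frame share the same index, no realignment is needed and the selection is the same as in the chained version.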

What about methods that work well for similarity-based virtual screening?

Look at the methods that we found to be "best" as measured by AUC for similarity-based virtual screening in our benchmarking paper (http://www.jcheminf.com/content/5/1/26 ). The table itself is here: http://www.jcheminf.com/content/5/1/26/table/T1

I've got best in quotes here because there wasn't a statistically significant difference in performance.

In [15]:
subset=df[df.Sim1.isin(('AP','Avalon-1024','TT','RDKit5'))][df.Sim2.isin(('AP','Avalon-1024','TT','RDKit5'))]
HTML(subset.to_html(float_format=lambda x: '%4.3f' % x,
    classes="table display"))
Out[15]:
Sim1 Sim2 Tau Spearman Pearson
130 AP RDKit5 0.308 0.439 0.656
138 AP Avalon-1024 0.248 0.360 0.599
348 AP TT 0.484 0.662 0.788
126 Avalon-1024 RDKit5 0.468 0.641 0.824
338 Avalon-1024 TT 0.293 0.420 0.683
191 RDKit5 TT 0.397 0.552 0.742

That is the correlation over the entire range of similarities. What if we just look at the top pairs for each fingerprint?

In [16]:
nToDo=200
apl = sorted(scoredLists['AP'],reverse=True)[:nToDo]
ttl = sorted(scoredLists['TT'],reverse=True)[:nToDo]
avl = sorted(scoredLists['Avalon-1024'],reverse=True)[:nToDo]
rdkl = sorted(scoredLists['RDKit5'],reverse=True)[:nToDo]
In [17]:
idsToKeep=set()
idsToKeep.update([x[1] for x in apl])
idsToKeep.update([x[1] for x in ttl])
idsToKeep.update([x[1] for x in avl])
idsToKeep.update([x[1] for x in rdkl])
print len(idsToKeep)
384

In [18]:
limitedLists={}
for fp in ('AP','TT','Avalon-1024','RDKit5'):
    limitedLists[fp]=[scoredLists[fp][x] for x in idsToKeep]
In [19]:
figure(figsize=(30,4))
subplot(1,6,1)
tau,s,p=directCompare(limitedLists,'AP','TT',silent=True)
title('tau=%.2f'%tau)
subplot(1,6,2)
tau,s,p=directCompare(limitedLists,'AP','Avalon-1024',silent=True)
title('tau=%.2f'%tau)
subplot(1,6,3)
tau,s,p=directCompare(limitedLists,'AP','RDKit5',silent=True)
title('tau=%.2f'%tau)
subplot(1,6,4)
tau,s,p=directCompare(limitedLists,'TT','Avalon-1024',silent=True)
title('tau=%.2f'%tau)
subplot(1,6,5)
tau,s,p=directCompare(limitedLists,'TT','RDKit5',silent=True)
title('tau=%.2f'%tau)
subplot(1,6,6)
tau,s,p=directCompare(limitedLists,'Avalon-1024','RDKit5',silent=True)
_=title('tau=%.2f'%tau)

The Tau values are still pretty low. The rankings from these fingerprints tend to have a low correlation with each other.

The comparison in the benchmarking paper showed, on the other hand, that across a broad range of data sets the fingerprints perform at about the same level when it comes to enrichment. It seems like there's either a contradiction or this set of pairs isn't particularly representative of what we used for that paper.
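It may help to see that equal enrichment and low rank correlation can coexist, at least in principle. The toy example below (Python 3, invented scores and labels, nothing to do with the benchmark data) has two rankings that both place every active ahead of every inactive, so both have an AUC of 1, while their Kendall tau with each other is 0. This is only an illustration of the arithmetic, not an explanation of the results above; the helper names are made up.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau for tie-free data: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(zip(xs, ys), 2))
    score = 0
    for (x1, y1), (x2, y2) in pairs:
        score += 1 if (x1 - x2) * (y1 - y2) > 0 else -1
    return score / len(pairs)


def auc(scores, labels):
    """ROC AUC as the fraction of (active, inactive) pairs ranked correctly."""
    actives = [s for s, lab in zip(scores, labels) if lab]
    inactives = [s for s, lab in zip(scores, labels) if not lab]
    wins = sum((a > i) + 0.5 * (a == i) for a in actives for i in inactives)
    return wins / (len(actives) * len(inactives))


# Hypothetical data: 3 actives, 6 inactives, higher score = ranked earlier.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0]
scores_a = [9, 8, 7, 6, 5, 4, 3, 2, 1]
scores_b = [7, 8, 9, 1, 2, 3, 4, 5, 6]  # same groups, reversed order within each

print(auc(scores_a, labels), auc(scores_b, labels))
print(kendall_tau(scores_a, scores_b))
```

The 18 active/inactive pairs are concordant and the 18 within-group pairs are discordant, which is where the tau of 0 comes from.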

Even more concrete: look at the number of overlapping pairs in that pick

Look at the overlap between the top picks of those fingerprints.

In [20]:
nToDo=200
apl = sorted(scoredLists['AP'],reverse=True)[:nToDo]
ttl = sorted(scoredLists['TT'],reverse=True)[:nToDo]
avl = sorted(scoredLists['Avalon-1024'],reverse=True)[:nToDo]
rdkl = sorted(scoredLists['RDKit5'],reverse=True)[:nToDo]
idsToKeep=set()
idsToKeep.update([x[1] for x in apl])
idsToKeep.update([x[1] for x in ttl])
idsToKeep.update([x[1] for x in avl])
idsToKeep.update([x[1] for x in rdkl])
print 'Overall number:',len(idsToKeep)
ids={}
ids['AP']=set([x[1] for x in apl])
ids['TT']=set([x[1] for x in ttl])
ids['Avalon-1024']=set([x[1] for x in avl])
ids['RDKit5']=set([x[1] for x in rdkl])

ks = sorted(ids.keys())
for i,k in enumerate(ks):
    for j in range(i+1,len(ks)):
        overlap=len(ids[k].intersection(ids[ks[j]]))
        print ks[i],ks[j],overlap,'%.2f'%(float(overlap)/len(apl))
Overall number: 384
AP Avalon-1024 102 0.51
AP RDKit5 112 0.56
AP TT 137 0.69
Avalon-1024 RDKit5 125 0.62
Avalon-1024 TT 111 0.56
RDKit5 TT 117 0.58

So each pair of pick sets differs in a good fraction of its compounds: from roughly 30% (AP vs. TT) up to about 50% (AP vs. Avalon-1024) of the 200 picks are not shared. Nice!
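The overlap calculation is repeated with a different number of picks, so it could also be wrapped in a small helper. The following is a sketch in Python 3 under the assumption that `scored_lists` has the same layout as `scoredLists` above (a dict mapping a fingerprint name to a list of `(similarity, pair_index)` tuples); the function name and the toy data are made up.

```python
from itertools import combinations


def top_pick_overlaps(scored_lists, names, n_picks):
    """Return (size of the union of top picks, {(name1, name2): (overlap, fraction)}).

    Assumes every list has at least n_picks entries.
    """
    top_ids = {}
    for name in names:
        ranked = sorted(scored_lists[name], reverse=True)[:n_picks]
        top_ids[name] = {idx for _sim, idx in ranked}
    union = set().union(*top_ids.values())
    overlaps = {}
    for a, b in combinations(sorted(top_ids), 2):
        shared = len(top_ids[a] & top_ids[b])
        overlaps[(a, b)] = (shared, shared / n_picks)
    return len(union), overlaps


# Toy data: four molecule pairs scored by two hypothetical fingerprints.
toy = {
    'A': [(0.9, 0), (0.8, 1), (0.7, 2), (0.1, 3)],
    'B': [(0.2, 0), (0.9, 1), (0.8, 2), (0.7, 3)],
}
n_union, overlaps = top_pick_overlaps(toy, ('A', 'B'), n_picks=2)
print(n_union, overlaps)
```

With the real data this would be called once per value of `nToDo`, for example with the four fingerprint names used above.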

Repeat that for fewer picks:

In [21]:
nToDo=100
apl = sorted(scoredLists['AP'],reverse=True)[:nToDo]
ttl = sorted(scoredLists['TT'],reverse=True)[:nToDo]
avl = sorted(scoredLists['Avalon-1024'],reverse=True)[:nToDo]
rdkl = sorted(scoredLists['RDKit5'],reverse=True)[:nToDo]
idsToKeep=set()
idsToKeep.update([x[1] for x in apl])
idsToKeep.update([x[1] for x in ttl])
idsToKeep.update([x[1] for x in avl])
idsToKeep.update([x[1] for x in rdkl])
print 'Overall number:',len(idsToKeep)
ids={}
ids['AP']=set([x[1] for x in apl])
ids['TT']=set([x[1] for x in ttl])
ids['Avalon-1024']=set([x[1] for x in avl])
ids['RDKit5']=set([x[1] for x in rdkl])

ks = sorted(ids.keys())
for i,k in enumerate(ks):
    for j in range(i+1,len(ks)):
        overlap=len(ids[k].intersection(ids[ks[j]]))
        print ks[i],ks[j],overlap,'%.2f'%(float(overlap)/len(apl))
Overall number: 175
AP Avalon-1024 58 0.58
AP RDKit5 69 0.69
AP TT 86 0.86
Avalon-1024 RDKit5 56 0.56
Avalon-1024 TT 60 0.60
RDKit5 TT 70 0.70

Still a significant number of unique compounds when considering pairwise overlaps. AP--TT is, of course, something of an exception.

Idle Curiosity: Difference between the correlation coefficients

In [22]:
xk='Tau'
yk='Spearman'
tplt=df.plot(x=xk,y=yk,style='o')
minV=min(min(df[xk]),min(df[yk]))
maxV=max(max(df[xk]),max(df[yk]))
tplt.plot((minV,maxV),(minV,maxV))
xlabel(xk)
_=ylabel(yk)
In [23]:
xk='Tau'
yk='Pearson'
tplt=df.plot(x=xk,y=yk,style='o')
minV=min(min(df[xk]),min(df[yk]))
maxV=max(max(df[xk]),max(df[yk]))
tplt.plot((minV,maxV),(minV,maxV))
xlabel(xk)
_=ylabel(yk)
In [24]:
xk='Spearman'
yk='Pearson'
tplt=df.plot(x=xk,y=yk,style='o')
minV=min(min(df[xk]),min(df[yk]))
maxV=max(max(df[xk]),max(df[yk]))
tplt.plot((minV,maxV),(minV,maxV))
xlabel(xk)
_=ylabel(yk)
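For context on why the coefficients need not agree: Spearman's rho is Pearson's r computed on ranks instead of raw values, so it only cares about ordering, while Pearson's r also responds to how linear the relationship is. Below is a small pure-Python sketch (Python 3; helper names made up, ties handled with average ranks) that shows the two diverging on a monotonic but non-linear toy relationship.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]  # monotonic but not linear
print(spearman(xs, ys), pearson(xs, ys))
```

Here the ordering is identical, so Spearman's rho is 1, while Pearson's r is slightly below 1.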

Here's the full set of results... there are a lot

In [25]:
from IPython.core.display import display,HTML,Javascript
HTML(df.sort(columns=['Tau'],ascending=False).to_html(float_format=lambda x: '%4.3f' % x,
    classes="table display"))
Out[25]:
Sim1 Sim2 Tau Spearman Pearson
80 MFP1 MFP1-bits 0.953 0.992 0.996
141 MFP0 MFP0-bits 0.948 0.958 0.968
240 MFP2 MFP3 0.926 0.992 0.989
48 MFP2 MFP2-bits 0.923 0.989 0.996
205 FeatMFP2 FeatMFP3 0.919 0.990 0.981
106 RDKit5-linear RDKit6-linear 0.917 0.990 0.993
203 RDKit4-linear RDKit5-linear 0.914 0.990 0.990
251 RDKit6-linear RDKit7-linear 0.899 0.985 0.994
165 RDKit4 RDKit4-linear 0.893 0.984 0.988
161 MFP2-bits MFP3-bits 0.892 0.983 0.986
247 MFP3 MFP3-bits 0.887 0.979 0.996
242 MFP2-bits MFP3 0.883 0.980 0.985
22 RDKit4 RDKit5-linear 0.877 0.978 0.985
329 RDKit4 RDKit5 0.871 0.976 0.983
224 MFP2 MFP3-bits 0.867 0.974 0.984
158 TT TT-bits 0.857 0.969 0.981
109 RDKit4-linear RDKit6-linear 0.856 0.971 0.973
152 RDKit4 RDKit6-linear 0.847 0.967 0.973
70 RDKit5 RDKit6-linear 0.845 0.966 0.980
112 RDKit5-linear RDKit7-linear 0.843 0.964 0.979
162 MFP1 MFP2 0.838 0.962 0.961
256 RDKit5 RDKit7-linear 0.837 0.962 0.976
287 RDKit5 RDKit5-linear 0.836 0.962 0.977
239 MFP1-bits MFP2 0.823 0.955 0.958
129 MFP1-bits MFP2-bits 0.819 0.953 0.958
12 MFP1 MFP2-bits 0.813 0.950 0.957
315 FeatMFP1 FeatMFP2 0.805 0.947 0.937
271 RDKit4-linear RDKit5 0.804 0.947 0.960
105 Avalon-1024 Avalon-512 0.800 0.943 0.963
60 RDKit4 RDKit7-linear 0.800 0.943 0.958
309 MFP1 MFP3 0.793 0.939 0.923
110 RDKit4-linear RDKit7-linear 0.792 0.940 0.952
45 MFP1-bits MFP3 0.780 0.931 0.920
156 RDKit5 RDKit6 0.768 0.919 0.974
57 MFP1-bits MFP3-bits 0.765 0.923 0.918
46 MFP1 MFP3-bits 0.762 0.921 0.919
116 FeatMFP1 FeatMFP3 0.755 0.916 0.873
101 RDKit6 RDKit7-linear 0.745 0.906 0.959
100 RDKit6 RDKit6-linear 0.692 0.865 0.946
93 AP AP-bits 0.683 0.862 0.928
209 RDKit6 RDKit7 0.683 0.861 0.931
193 RDKit4 RDKit6 0.672 0.848 0.928
148 RDKit5-linear RDKit6 0.655 0.834 0.927
77 RDKit4-linear RDKit6 0.624 0.807 0.894
325 FeatMFP3 MFP3 0.620 0.804 0.902
91 MFP1 TT 0.618 0.802 0.871
33 MFP1-bits TT 0.615 0.799 0.868
88 FeatMFP3 MFP2 0.608 0.791 0.897
252 MFP2 TT 0.605 0.789 0.893
300 FeatMFP3 MFP3-bits 0.604 0.788 0.897
89 MFP2-bits TT 0.600 0.784 0.890
184 FeatMFP3 MFP2-bits 0.598 0.782 0.893
291 FeatMFP2 MFP2 0.583 0.768 0.852
142 FeatMFP2 MFP3 0.582 0.768 0.836
344 MFP1 TT-bits 0.582 0.767 0.851
2 MFP1-bits TT-bits 0.579 0.764 0.849
283 FeatMFP3 MFP1 0.574 0.759 0.850
222 MFP3 TT 0.574 0.758 0.876
28 FeatMFP2 MFP2-bits 0.574 0.759 0.848
95 FeatMFP2 MFP1 0.570 0.754 0.838
335 FeatMFP3 MFP1-bits 0.569 0.754 0.847
154 FeatMFP2 MFP3-bits 0.569 0.754 0.832
284 MFP2 TT-bits 0.567 0.751 0.872
262 MFP3-bits TT 0.567 0.750 0.873
13 FeatMFP2 MFP1-bits 0.565 0.750 0.835
49 MFP2-bits TT-bits 0.563 0.746 0.870
265 FeatMFP0 FeatMFP1 0.553 0.737 0.770
97 Avalon-512 RDKit7 0.541 0.723 0.821
121 MFP3 TT-bits 0.536 0.718 0.855
155 MFP3-bits TT-bits 0.531 0.712 0.853
143 Avalon-1024 RDKit6 0.516 0.694 0.850
245 Avalon-512 RDKit6 0.486 0.663 0.795
348 AP TT 0.484 0.662 0.788
10 FeatMFP1 MFP1 0.482 0.658 0.718
83 FeatMFP2 RDKit5 0.479 0.655 0.774
71 FeatMFP1 MFP1-bits 0.478 0.653 0.715
276 FeatMFP2 RDKit4 0.477 0.654 0.760
303 RDKit7 RDKit7-linear 0.475 0.648 0.831
54 AP TT-bits 0.474 0.651 0.783
11 FeatMFP3 RDKit5 0.473 0.646 0.788
257 Avalon-1024 MACCS 0.473 0.650 0.697
231 FeatMFP2 RDKit6-linear 0.472 0.648 0.766
250 FeatMFP3 RDKit4 0.471 0.646 0.763
304 RDKit5 RDKit7 0.469 0.638 0.835
126 Avalon-1024 RDKit5 0.468 0.641 0.824
195 FeatMFP3 RDKit6-linear 0.468 0.642 0.782
69 FeatMFP2 RDKit5-linear 0.467 0.642 0.756
290 FeatMFP1 RDKit4 0.467 0.643 0.699
305 Avalon-1024 RDKit7 0.466 0.637 0.798
318 FeatMFP3 RDKit5-linear 0.463 0.637 0.766
228 FeatMFP2 RDKit7-linear 0.462 0.635 0.764
173 FeatMFP1 RDKit5 0.461 0.636 0.689
263 FeatMFP1 MFP2 0.459 0.633 0.685
134 FeatMFP2 TT 0.458 0.629 0.766
337 FeatMFP1 RDKit6-linear 0.458 0.632 0.685
39 FeatMFP3 RDKit7-linear 0.457 0.628 0.787
153 FeatMFP1 RDKit5-linear 0.457 0.631 0.687
194 Avalon-1024 RDKit7-linear 0.455 0.625 0.808
320 FeatMFP3 TT 0.454 0.624 0.799
68 FeatMFP1 MFP3 0.454 0.628 0.659
302 FeatMFP2 RDKit4-linear 0.453 0.627 0.735
159 Avalon-512 MACCS 0.453 0.627 0.684
213 FeatMFP1 MFP2-bits 0.452 0.625 0.681
197 FeatMFP3 RDKit4-linear 0.449 0.621 0.738
140 FeatMFP1 RDKit4-linear 0.449 0.623 0.684
55 FeatMFP1 MFP3-bits 0.445 0.617 0.656
332 FeatMFP1 RDKit7-linear 0.444 0.616 0.673
14 FeatMFP0 FeatMFP2 0.443 0.614 0.622
114 MACCS RDKit6 0.439 0.609 0.659
295 FeatMFP2 TT-bits 0.438 0.605 0.753
113 Avalon-1024 RDKit6-linear 0.433 0.599 0.794
310 Avalon-1024 RDKit4 0.432 0.599 0.781
266 FeatMFP3 TT-bits 0.432 0.597 0.785
297 AP-bits TT-bits 0.425 0.590 0.741
25 MACCS RDKit5 0.424 0.591 0.658
74 Avalon-1024 RDKit5-linear 0.420 0.584 0.776
166 MACCS RDKit4 0.417 0.583 0.659
210 FeatMFP0 FeatMFP3 0.417 0.583 0.554
86 AP MFP1 0.417 0.580 0.746
151 AP MFP1-bits 0.416 0.579 0.745
167 AP-bits TT 0.413 0.575 0.729
66 RDKit5 TT-bits 0.412 0.571 0.752
189 RDKit6-linear RDKit7 0.412 0.570 0.795
349 Avalon-1024 RDKit4-linear 0.410 0.573 0.749
330 FeatMFP2 RDKit6 0.410 0.569 0.742
319 MACCS RDKit4-linear 0.410 0.575 0.649
260 MACCS RDKit7-linear 0.408 0.571 0.634
78 MACCS RDKit5-linear 0.402 0.565 0.641
84 MACCS RDKit6-linear 0.401 0.563 0.636
99 FeatMFP3 RDKit6 0.401 0.557 0.767
169 AP MFP2 0.398 0.557 0.761
164 AP MFP2-bits 0.397 0.556 0.760
191 RDKit5 TT 0.397 0.552 0.742
199 RDKit7-linear TT-bits 0.396 0.551 0.737
296 RDKit6-linear TT-bits 0.395 0.550 0.727
249 RDKit4 TT-bits 0.391 0.546 0.705
29 FeatMFP1 RDKit6 0.390 0.547 0.642
53 RDKit4 RDKit7 0.389 0.540 0.759
311 MACCS RDKit7 0.387 0.545 0.620
200 AP-bits RDKit6 0.386 0.539 0.696
202 RDKit5-linear TT-bits 0.385 0.538 0.705
230 RDKit6 TT-bits 0.384 0.535 0.761
102 RDKit6-linear TT 0.382 0.534 0.717
198 RDKit7-linear TT 0.380 0.532 0.727
282 Avalon-512 RDKit5 0.380 0.533 0.734
342 MFP1 RDKit4 0.378 0.533 0.674
40 RDKit4 TT 0.378 0.530 0.695
5 RDKit5-linear RDKit7 0.378 0.527 0.760
188 FeatMFP1 TT 0.378 0.531 0.614
187 MFP1 RDKit5 0.377 0.530 0.695
227 Avalon-512 RDKit7-linear 0.376 0.528 0.720
135 MFP1 RDKit6-linear 0.375 0.528 0.685
217 MFP1 RDKit5-linear 0.373 0.526 0.673
328 MFP1-bits RDKit4 0.372 0.525 0.668
178 MFP1-bits RDKit5 0.372 0.524 0.691
16 AP MFP3-bits 0.372 0.524 0.753
146 RDKit5-linear TT 0.372 0.522 0.695
26 AP MFP3 0.371 0.524 0.754
346 MFP1-bits RDKit6-linear 0.369 0.521 0.681
347 MFP2 RDKit5 0.368 0.519 0.730
30 MFP1-bits RDKit5-linear 0.367 0.518 0.668
275 FeatMFP1 TT-bits 0.366 0.516 0.607
288 MFP2 RDKit6-linear 0.366 0.516 0.721
192 MFP2 RDKit4 0.366 0.517 0.694
323 MFP3 RDKit5 0.366 0.516 0.732
123 MFP3 RDKit4 0.365 0.516 0.691
65 MFP2-bits RDKit5 0.365 0.515 0.727
32 MFP3 RDKit6-linear 0.365 0.515 0.725
292 MFP3-bits RDKit5 0.364 0.513 0.729
259 RDKit4-linear TT-bits 0.363 0.511 0.667
186 AP-bits RDKit7-linear 0.363 0.511 0.679
108 MFP1 RDKit7-linear 0.362 0.511 0.686
76 MFP3 RDKit5-linear 0.362 0.512 0.702
341 MFP2 RDKit5-linear 0.362 0.511 0.701
246 MFP2-bits RDKit6-linear 0.362 0.510 0.718
235 RDKit6 TT 0.361 0.507 0.748
1 MFP3-bits RDKit6-linear 0.361 0.510 0.722
3 AP-bits RDKit7 0.361 0.508 0.663
261 MFP2-bits RDKit4 0.361 0.510 0.690
81 AP FeatMFP2 0.360 0.508 0.695
52 MFP1 RDKit4-linear 0.360 0.509 0.646
96 MFP3-bits RDKit4 0.359 0.508 0.687
322 AP-bits MFP1-bits 0.358 0.504 0.689
294 AP-bits RDKit5 0.358 0.504 0.670
313 MFP1-bits RDKit7-linear 0.357 0.506 0.682
299 AP-bits MFP1 0.357 0.503 0.690
298 MFP2-bits RDKit5-linear 0.357 0.504 0.697
215 RDKit4-linear RDKit7 0.357 0.501 0.718
204 MFP2 RDKit7-linear 0.356 0.503 0.729
41 MFP3-bits RDKit5-linear 0.356 0.504 0.698
6 MFP3 RDKit7-linear 0.354 0.500 0.737
136 MFP2-bits RDKit7-linear 0.354 0.500 0.727
255 MFP1-bits RDKit4-linear 0.354 0.501 0.640
314 MFP3-bits RDKit7-linear 0.353 0.499 0.735
73 RDKit4-linear TT 0.350 0.495 0.656
293 MFP3 RDKit4-linear 0.349 0.495 0.663
270 MFP2 RDKit4-linear 0.348 0.493 0.665
58 AP-bits MFP2-bits 0.346 0.490 0.708
274 AP FeatMFP3 0.346 0.489 0.720
42 AP-bits FeatMFP2 0.345 0.487 0.671
103 AP-bits RDKit6-linear 0.343 0.485 0.659
149 AP-bits MFP2 0.342 0.485 0.707
157 MFP2-bits RDKit4-linear 0.342 0.486 0.660
182 MFP3-bits RDKit4-linear 0.342 0.486 0.659
345 Avalon-512 RDKit6-linear 0.339 0.481 0.696
144 AP-bits RDKit4 0.333 0.473 0.635
115 Avalon-512 RDKit4 0.333 0.473 0.679
18 AP-bits MFP3-bits 0.330 0.469 0.705
75 AP-bits RDKit5-linear 0.329 0.468 0.639
234 AP-bits FeatMFP3 0.328 0.465 0.688
324 AP FeatMFP1 0.327 0.465 0.588
181 AP-bits FeatMFP1 0.326 0.463 0.584
279 Avalon-512 RDKit5-linear 0.322 0.458 0.673
8 AP-bits MFP3 0.320 0.455 0.701
214 AP-bits RDKit4-linear 0.316 0.450 0.610
321 AP RDKit7-linear 0.315 0.448 0.672
206 MFP2-bits RDKit6 0.315 0.448 0.720
43 MFP3-bits RDKit6 0.314 0.446 0.729
248 MFP2 RDKit6 0.312 0.445 0.721
343 Avalon-512 RDKit4-linear 0.312 0.446 0.644
47 MFP1 RDKit6 0.312 0.445 0.671
37 MFP1-bits RDKit6 0.309 0.441 0.667
118 Avalon-1024 TT-bits 0.308 0.439 0.693
180 AP RDKit6-linear 0.308 0.439 0.656
130 AP RDKit5 0.308 0.439 0.656
79 MFP3 RDKit6 0.306 0.437 0.729
281 AP RDKit6 0.299 0.426 0.663
333 AP RDKit4 0.299 0.428 0.623
278 AP RDKit5-linear 0.298 0.427 0.634
236 AP-bits Avalon-1024 0.296 0.425 0.620
168 Avalon-1024 FeatMFP2 0.296 0.425 0.653
338 Avalon-1024 TT 0.293 0.420 0.683
94 Avalon-1024 FeatMFP3 0.291 0.418 0.679
190 Avalon-1024 FeatMFP1 0.291 0.420 0.563
170 AP-bits MACCS 0.289 0.417 0.517
289 AP RDKit4-linear 0.284 0.409 0.599
122 AP-bits Avalon-512 0.279 0.402 0.574
56 RDKit7 TT-bits 0.278 0.398 0.668
171 FeatMFP0 RDKit4 0.269 0.386 0.422
111 FeatMFP0 RDKit4-linear 0.266 0.383 0.423
82 FeatMFP0 RDKit5-linear 0.265 0.382 0.413
163 FeatMFP0 RDKit6-linear 0.264 0.380 0.405
316 FeatMFP0 RDKit5 0.263 0.379 0.405
85 Avalon-1024 MFP1 0.261 0.377 0.615
34 FeatMFP1 MACCS 0.260 0.377 0.462
120 MFP0-bits MFP1-bits 0.260 0.372 0.600
317 Avalon-1024 MFP1-bits 0.259 0.375 0.613
174 Avalon-1024 MFP2-bits 0.257 0.372 0.656
179 FeatMFP0 RDKit7-linear 0.257 0.370 0.395
331 Avalon-1024 MFP2 0.257 0.371 0.658
286 MFP0 MFP1 0.255 0.365 0.605
254 MFP0 MFP1-bits 0.255 0.365 0.603
139 Avalon-1024 MFP3-bits 0.255 0.369 0.662
150 MFP0-bits MFP1 0.253 0.362 0.593
4 FeatMFP2 MACCS 0.253 0.367 0.492
128 Avalon-1024 MFP3 0.251 0.363 0.663
273 RDKit7 TT 0.250 0.360 0.647
241 FeatMFP3 MACCS 0.248 0.361 0.496
138 AP Avalon-1024 0.248 0.360 0.599
212 Avalon-512 TT-bits 0.245 0.354 0.604
87 AP RDKit7 0.244 0.352 0.589
183 MACCS MFP1 0.243 0.354 0.481
19 MACCS TT-bits 0.243 0.354 0.490
127 FeatMFP2 RDKit7 0.240 0.347 0.594
160 MACCS MFP1-bits 0.239 0.348 0.476
301 FeatMFP3 RDKit7 0.230 0.332 0.626
117 FeatMFP0 RDKit6 0.230 0.334 0.373
336 MACCS MFP2 0.229 0.335 0.480
107 MACCS MFP2-bits 0.228 0.333 0.478
223 FeatMFP1 RDKit7 0.228 0.331 0.495
119 MACCS MFP3-bits 0.227 0.332 0.474
132 MACCS MFP3 0.225 0.330 0.474
27 Avalon-512 TT 0.224 0.326 0.588
63 MFP0-bits MFP2-bits 0.224 0.322 0.565
277 MACCS TT 0.224 0.327 0.470
9 MFP0-bits MFP2 0.220 0.316 0.563
35 MFP0 MFP2 0.219 0.315 0.570
31 MFP0 MFP2-bits 0.218 0.313 0.568
216 MFP0-bits MFP3-bits 0.215 0.310 0.559
207 MFP0-bits MFP3 0.213 0.307 0.558
172 Avalon-512 FeatMFP1 0.212 0.310 0.461
226 MFP0 MFP3 0.211 0.304 0.565
36 FeatMFP0 MFP1 0.211 0.308 0.369
219 Avalon-512 FeatMFP2 0.211 0.309 0.540
92 FeatMFP0 MFP1-bits 0.210 0.305 0.366
244 MFP0 MFP3-bits 0.209 0.300 0.562
218 AP Avalon-512 0.206 0.301 0.522
24 Avalon-512 FeatMFP3 0.203 0.297 0.567
50 Avalon-1024 FeatMFP0 0.200 0.294 0.341
196 AP MACCS 0.200 0.293 0.433
221 AP MFP0 0.198 0.285 0.545
326 FeatMFP0 MFP3 0.198 0.290 0.328
125 FeatMFP0 MFP2 0.197 0.288 0.340
20 FeatMFP1 MFP0-bits 0.197 0.283 0.442
334 AP-bits FeatMFP0 0.196 0.287 0.346
327 AP MFP0-bits 0.196 0.282 0.533
177 FeatMFP0 MFP0 0.195 0.259 0.319
253 FeatMFP0 MFP3-bits 0.195 0.286 0.327
208 FeatMFP1 MFP0 0.195 0.280 0.447
38 FeatMFP0 MFP2-bits 0.194 0.284 0.338
225 FeatMFP0 MFP0-bits 0.193 0.256 0.313
7 MFP2-bits RDKit7 0.191 0.279 0.600
238 MFP3-bits RDKit7 0.190 0.278 0.617
0 FeatMFP2 MFP0-bits 0.188 0.271 0.487
268 Avalon-512 MFP2-bits 0.184 0.270 0.551
59 FeatMFP3 MFP0-bits 0.183 0.264 0.508
237 FeatMFP2 MFP0 0.183 0.264 0.490
51 MFP2 RDKit7 0.183 0.269 0.597
176 Avalon-512 MFP3-bits 0.181 0.267 0.560
67 FeatMFP0 MACCS 0.180 0.263 0.308
104 Avalon-512 MFP2 0.179 0.265 0.550
201 Avalon-512 MFP1 0.178 0.262 0.502
131 FeatMFP3 MFP0 0.178 0.256 0.512
312 Avalon-512 MFP1-bits 0.177 0.262 0.500
272 MFP3 RDKit7 0.173 0.254 0.610
229 MFP1-bits RDKit7 0.173 0.254 0.532
243 MFP1 RDKit7 0.173 0.254 0.533
258 Avalon-512 MFP3 0.171 0.252 0.557
350 AP-bits MFP0-bits 0.170 0.246 0.492
98 AP-bits MFP0 0.169 0.244 0.501
233 MFP0-bits TT 0.162 0.235 0.467
124 AP FeatMFP0 0.161 0.236 0.308
64 FeatMFP0 TT 0.159 0.233 0.295
264 MFP0 TT 0.159 0.231 0.471
308 Avalon-512 FeatMFP0 0.158 0.233 0.289
147 MFP0-bits TT-bits 0.157 0.228 0.457
90 FeatMFP0 TT-bits 0.156 0.228 0.293
21 MFP0 TT-bits 0.154 0.223 0.461
211 MFP0-bits RDKit4 0.149 0.217 0.411
232 MFP0-bits RDKit4-linear 0.148 0.214 0.397
269 MFP0-bits RDKit5-linear 0.148 0.214 0.415
185 MFP0 RDKit4 0.147 0.213 0.413
44 MACCS MFP0 0.147 0.213 0.356
220 FeatMFP0 RDKit7 0.146 0.215 0.288
306 MFP0-bits RDKit6-linear 0.146 0.212 0.426
15 MACCS MFP0-bits 0.145 0.211 0.349
23 MFP0 RDKit4-linear 0.145 0.211 0.400
137 MFP0 RDKit5-linear 0.144 0.210 0.417
17 MFP0-bits RDKit5 0.144 0.209 0.425
175 MFP0 RDKit6-linear 0.142 0.206 0.428
339 MFP0-bits RDKit7-linear 0.140 0.204 0.432
340 MFP0 RDKit5 0.140 0.203 0.427
280 Avalon-1024 MFP0 0.139 0.202 0.414
145 Avalon-1024 MFP0-bits 0.138 0.201 0.409
307 MFP0 RDKit7-linear 0.136 0.198 0.434
267 MFP0-bits RDKit6 0.121 0.176 0.414
133 MFP0 RDKit6 0.116 0.169 0.416
61 Avalon-512 MFP0 0.108 0.158 0.352
62 Avalon-512 MFP0-bits 0.107 0.157 0.347
72 MFP0-bits RDKit7 0.077 0.112 0.340
285 MFP0 RDKit7 0.072 0.105 0.341