Revisiting bit collisions in Morgan fingerprints with a larger dataset
I looked at the number of collisions in Morgan fingerprints in an earlier post. The topic came up again in discussions about the recent post on Morgan fingerprint stats, which used a much larger data set.
Here we repeat the earlier collision analysis using the larger dataset. I will look at fingerprints with different radii -- 1, 2, and 3 -- folded to a set of different sizes -- 1K, 2K, 4K, and 8K bits.
TL;DR version: the conclusions match what we observed before: there are a fair number of collisions at fingerprint sizes below 4K and, as expected, higher radii lead to more collisions.
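As a quick reminder of what a "collision" means here: the unfolded Morgan fingerprint uses 32-bit hash values as bit ids, and folding to a fixed length maps those ids into a smaller bit space, so two different atom environments can land on the same bit. Here's a minimal sketch of the idea (aspirin is just an arbitrary example molecule, and the modulo folding is my understanding of what the bit-vector fingerprint does internally):

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors as rdmd

m = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')  # aspirin, arbitrary example
onbits = rdmd.GetMorganFingerprint(m,2).GetNonzeroElements()  # sparse bit id -> count
folded = {bid%1024 for bid in onbits}  # fold the 32-bit ids into 1024 bits (assumed folding scheme)
print('collisions:',len(onbits)-len(folded))
# this difference is what the loop below tabulates, via GetNumOnBits():
bv = rdmd.GetMorganFingerprintAsBitVect(m,2,1024)
print('collisions:',len(onbits)-bv.GetNumOnBits())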
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import rdMolDescriptors as rdmd
from rdkit.Chem.Draw import IPythonConsole
from rdkit import DataStructs
from rdkit import rdBase
import time,random,gzip,pickle,copy
import numpy as np
from collections import defaultdict
%pylab inline
print(rdBase.rdkitVersion)
print(time.asctime())
For test data I'll use the same 16 million ZINC compounds I used in the bit statistics post.
filen='/scratch/RDKit_git/Data/Zinc/zinc_all_clean.pkl.gz'
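For anyone wanting to reproduce this with their own data: the file is a gzipped stream of pickled (mol, name) tuples, which gets read back one tuple at a time below. Here's a hypothetical sketch of how such a file could be built from a SMILES file; the input filename is made up and this is not the actual script used to prepare the ZINC set:

import gzip,pickle
from rdkit import Chem

with gzip.open('zinc_subset.smi.gz','rt') as inf, \
     gzip.open('zinc_subset.pkl.gz','wb') as outf:
    for line in inf:
        smi,name = line.split()[:2]
        m = Chem.MolFromSmiles(smi)
        pickle.dump((m,name),outf)  # one (mol, name) tuple per record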
Loop over the molecules and build fingerprints of multiple radii and folded lengths. For each molecule, the number of collisions is the difference between the number of unique bits set in the unfolded fingerprint and the number of bits set in the folded one.
history={}    # periodic snapshots so we can see how quickly the results converge
counts=defaultdict(lambda:defaultdict(int))
t1 = time.time()
with gzip.open(filen,'rb') as inf:
    i = 0
    while 1:
        try:
            m,nm = pickle.load(inf)
        except EOFError:
            break
        if not m: continue
        i+=1
        for v in 1,2,3:
            # number of unique bits in the unfolded fingerprint:
            onbits=len(rdmd.GetMorganFingerprint(m,v).GetNonzeroElements())
            counts[(v,-1)][onbits]+=1
            for l in 1024,2048,4096,8192:
                # collisions = unfolded on-bit count minus folded on-bit count
                dbits = onbits-rdmd.GetMorganFingerprintAsBitVect(m,v,l).GetNumOnBits()
                counts[(v,l)][dbits]+=1
        if not i%20000:
            t2 = time.time()
            print("Done %d in %.2f sec"%(i,t2-t1))
        if not i%100000:
            history[i] = copy.deepcopy(counts)
pickle.dump(dict(counts),gzip.open('../data/fp_collision_counts.pkl.gz','wb+'))
for k,d in history.items():
    history[k] = dict(d)   # convert the outer defaultdicts so they pickle cleanly
pickle.dump(dict(history),gzip.open('../data/fp_collision_counts.history.pkl.gz','wb+'))
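Before plotting, here's a quick way to summarize what was just accumulated: the fraction of molecules that hit at least one collision at each radius/size combination (this just tabulates the counts dict built above):

for v in (1,2,3):
    for l in (1024,2048,4096,8192):
        d = counts[(v,l)]
        nMols = sum(d.values())
        nCollided = sum(cnt for ncoll,cnt in d.items() if ncoll>0)
        print("r=%d %5d bits: %.1f%% of molecules have at least one collision"%(
              v,l,100.*nCollided/nMols))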
Now plot histograms of the number of collisions, along with the distributions of the number of bits set in the unfolded fingerprints.
figure(figsize=(16,20))
pidx=1
#----------------------------
for nbits in (8192,4096,2048,1024):
    # left panel: histogram of per-molecule collision counts
    subplot(4,2,pidx)
    pidx+=1
    maxCollisions = max(counts[3,nbits].keys())+1   # r=3 has the most collisions
    d1=np.zeros(maxCollisions,int)
    for k,v in counts[1,nbits].items():
        d1[k]=v
    d2=np.zeros(maxCollisions,int)
    for k,v in counts[2,nbits].items():
        d2[k]=v
    d3=np.zeros(maxCollisions,int)
    for k,v in counts[3,nbits].items():
        d3[k]=v
    barWidth=.25
    locs = np.array(range(maxCollisions))
    bar(locs,d1,barWidth,color='b',label="r=1")
    bar(locs+barWidth,d2,barWidth,color='g',label="r=2")
    bar(locs+2*barWidth,d3,barWidth,color='r',label="r=3")
    title('%d bits'%nbits)
    _=yscale("log")
    _=legend()
    # right panel: distribution of the number of bits set in the unfolded FPs
    subplot(4,2,pidx)
    pidx+=1
    plot([x for x,y in counts[1,-1].items()],[y for x,y in counts[1,-1].items()],'.b-',label="r=1")
    plot([x for x,y in counts[2,-1].items()],[y for x,y in counts[2,-1].items()],'.g-',label="r=2")
    plot([x for x,y in counts[3,-1].items()],[y for x,y in counts[3,-1].items()],'.r-',label="r=3")
    _=xlabel("num bits set")
    _=ylabel("count")
    _=legend()
So, there are definitely some collisions. But not a huge number.
It's worth looking at the histograms without the log scale, just to maintain some perspective:
figure(figsize=(16,20))
pidx=1
#----------------------------
for nbits in (8192,4096,2048,1024):
    # left panel: histogram of per-molecule collision counts (linear scale)
    subplot(4,2,pidx)
    pidx+=1
    maxCollisions = max(counts[3,nbits].keys())+1
    d1=np.zeros(maxCollisions,int)
    for k,v in counts[1,nbits].items():
        d1[k]=v
    d2=np.zeros(maxCollisions,int)
    for k,v in counts[2,nbits].items():
        d2[k]=v
    d3=np.zeros(maxCollisions,int)
    for k,v in counts[3,nbits].items():
        d3[k]=v
    barWidth=.25
    locs = np.array(range(maxCollisions))
    bar(locs,d1,barWidth,color='b',label="r=1")
    bar(locs+barWidth,d2,barWidth,color='g',label="r=2")
    bar(locs+2*barWidth,d3,barWidth,color='r',label="r=3")
    title('%d bits'%nbits)
    _=legend()
    # right panel: distribution of the number of bits set in the unfolded FPs
    subplot(4,2,pidx)
    pidx+=1
    plot([x for x,y in counts[1,-1].items()],[y for x,y in counts[1,-1].items()],'.b-',label="r=1")
    plot([x for x,y in counts[2,-1].items()],[y for x,y in counts[2,-1].items()],'.g-',label="r=2")
    plot([x for x,y in counts[3,-1].items()],[y for x,y in counts[3,-1].items()],'.r-',label="r=3")
    _=xlabel("num bits set")
    _=ylabel("count")
    _=legend()
It's tempting to look at the effect on similarity, but looking at random pairs basically implies looking at differences in the noise. I'm not sure that really helps, so I won't do it here.
1 comment:
There is a related paper on how the bit-based nature of fingerprints affects clustering here:
Ties in proximity and clustering compounds.
This is caused by similarity measures having a finite number of bits available to the union and intersection operators, which means there is only a finite number of Tanimoto values, say, that can actually be calculated. Collisions make this behavior worse, although, ironically, they can make identifying similar compounds with fingerprints more effective thanks to locality-sensitive hashing (i.e. similar molecules SHOULD have collisions more often). An RDKit-based example is here: http://chembl.blogspot.com/2015/08/lsh-based-similarity-search-in-mongodb.html
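A small illustration of the point about the finite number of possible similarity values (my sketch, not part of the comment): for two fingerprints with a and b bits set, the Tanimoto value is c/(a+b-c) for some intersection count c, so only a limited set of distinct values is possible:

from fractions import Fraction
a = b = 30   # bits set in each fingerprint (arbitrary small example)
values = {Fraction(c,a+b-c) for c in range(0,min(a,b)+1)}
print(len(values))  # only min(a,b)+1 distinct Tanimoto values -> ties are common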