Sunday, February 7, 2016

Colliding Bits III

Revisiting bit collisions in Morgan fingerprints with a larger dataset

I looked at the number of collisions in Morgan fingerprints in an earlier post. The topic came up again in discussions about the recent post on Morgan fingerprint stats, which used a much larger data set.

Here we repeat earlier collision analysis, usting the larger dataset. I will look at fingerprints with different radii -- 1, 2, and 3 -- folded to a set of different sizes -- 1K, 2K, 4K, 8K.

TL;DR version: The conclusions match what we observed before, there are a fair number of collisions at fingerprint sizes below 4K. As expected, higher radii have more collisions.

In [1]:
from rdkit import Chem,DataStructs
import time,random,gzip,pickle,copy
import numpy as np
from collections import defaultdict
from rdkit.Chem import Draw
from rdkit.Chem import rdMolDescriptors as rdmd
from rdkit.Chem.Draw import IPythonConsole
from rdkit import DataStructs
from rdkit import rdBase
%pylab inline

print(rdBase.rdkitVersion)
import time
print(time.asctime())
/home/glandrum/anaconda3/lib/python3.4/site-packages/IPython/html.py:14: ShimWarning: The `IPython.html` package has been deprecated. You should import from `notebook` instead. `IPython.html.widgets` has moved to `ipywidgets`.
  "`IPython.html.widgets` has moved to `ipywidgets`.", ShimWarning)
WARNING: 
Populating the interactive namespace from numpy and matplotlib
2016.03.1.dev1
Sat Feb  6 08:29:19 2016
pylab import has clobbered these variables: ['copy', 'random']
`%matplotlib` prevents importing * from pylab and numpy

For test data I'll use the same 16 million ZINC compounds I used in the bit statistics post.

In [2]:
filen='/scratch/RDKit_git/Data/Zinc/zinc_all_clean.pkl.gz'

Loop over the molecules and build fingerprints of multiple radii and folded lengths.

In [25]:
import copy
history={} # we will use this to see how quickly the results converge
counts=defaultdict(lambda:defaultdict(int))
t1 = time.time()
with gzip.open(filen,'rb') as inf:
    i = 0
    while 1:
        try:
            m,nm = pickle.load(inf)
        except EOFError:
            break
        if not m: continue
        i+=1
        for v in 1,2,3:
            onbits=len(rdmd.GetMorganFingerprint(m,v).GetNonzeroElements())
            counts[(v,-1)][onbits]+=1
            for l in 1024,2048,4096,8192:
                dbits = onbits-rdmd.GetMorganFingerprintAsBitVect(m,v,l).GetNumOnBits()
                counts[(v,l)][dbits]+=1
        if not i%20000:
            t2 = time.time()
            print("Done %d in %.2f sec"%(i,t2-t1))
        if not i%100000:
            history[i] = copy.deepcopy(counts)
Done 20000 in 14.47 sec
Done 40000 in 29.38 sec
 ... snip ...
Done 16380000 in 14358.79 sec
Done 16400000 in 14376.86 sec
In [26]:
pickle.dump(dict(counts),gzip.open('../data/fp_collision_counts.pkl.gz','wb+'))
In [27]:
for k,d in history.items():
    history[k] = dict(d)
pickle.dump(dict(history),gzip.open('../data/fp_collision_counts.history.pkl.gz','wb+'))

Now plot histograms of the numbers of collisions along with the distributions of the number of bits set in the non-folded FPs

In [58]:
figure(figsize=(16,20))

pidx=1
#----------------------------
for nbits in (8192,4096,2048,1024):
    subplot(4,2,pidx)
    pidx+=1
    maxCollisions = max(counts[3,nbits].keys())+1
    d1=np.zeros(maxCollisions,np.int)
    for k,v in counts[1,nbits].items():
        d1[k]=v
    d2=np.zeros(maxCollisions,np.int)
    for k,v in counts[2,nbits].items():
        d2[k]=v
    d3=np.zeros(maxCollisions,np.int)
    for k,v in counts[3,nbits].items():
        d3[k]=v
    barWidth=.25
    locs = np.array(range(maxCollisions))

    bar(locs,d1,barWidth,color='b',label="r=1")
    bar(locs+barWidth,d2,barWidth,color='g',label="r=2")
    bar(locs+2*barWidth,d3,barWidth,color='r',label="r=3")
    
    #_=hist((d1,d2,d3),bins=20,log=True,label=("r=1","r=2","r=3"))
    title('%d bits'%nbits)
    _=yscale("log")
    _=legend()

    subplot(4,2,pidx)
    pidx+=1
    plot([x for x,y in counts[1,-1].items()],[y for x,y in counts[1,-1].items()],'.b-',label=
        "r=1")
    plot([x for x,y in counts[2,-1].items()],[y for x,y in counts[2,-1].items()],'.g-',label=
        "r=2")
    plot([x for x,y in counts[3,-1].items()],[y for x,y in counts[3,-1].items()],'.r-',label=
        "r=3")
    _=xlabel("num bits set")
    _=ylabel("count")
    _=legend()
    
    

So, there are definitely some collisions. But not a huge number.

It's worth looking at the histograms without the log scale, just to maintain some perspective:

In [59]:
figure(figsize=(16,20))

pidx=1
#----------------------------
for nbits in (8192,4096,2048,1024):
    subplot(4,2,pidx)
    pidx+=1
    maxCollisions = max(counts[3,nbits].keys())+1
    d1=np.zeros(maxCollisions,np.int)
    for k,v in counts[1,nbits].items():
        d1[k]=v
    d2=np.zeros(maxCollisions,np.int)
    for k,v in counts[2,nbits].items():
        d2[k]=v
    d3=np.zeros(maxCollisions,np.int)
    for k,v in counts[3,nbits].items():
        d3[k]=v
    barWidth=.25
    locs = np.array(range(maxCollisions))

    bar(locs,d1,barWidth,color='b',label="r=1")
    bar(locs+barWidth,d2,barWidth,color='g',label="r=2")
    bar(locs+2*barWidth,d3,barWidth,color='r',label="r=3")
    
    #_=hist((d1,d2,d3),bins=20,log=True,label=("r=1","r=2","r=3"))
    title('%d bits'%nbits)
    _=legend()

    subplot(4,2,pidx)
    pidx+=1
    plot([x for x,y in counts[1,-1].items()],[y for x,y in counts[1,-1].items()],'.b-',label=
        "r=1")
    plot([x for x,y in counts[2,-1].items()],[y for x,y in counts[2,-1].items()],'.g-',label=
        "r=2")
    plot([x for x,y in counts[3,-1].items()],[y for x,y in counts[3,-1].items()],'.r-',label=
        "r=3")
    _=xlabel("num bits set")
    _=ylabel("count")
    _=legend()
    
    

It's tempting to look at the effect on similarity, but looking at random pairs basically implies looking at differences in the noise. I'm not sure that really helps, so I won't do it here.

In [ ]:
 

1 comment:

Brian Kelley said...

There is a related paper on how the bit-based nature of fingerprints affects clustering here:

Ties in proximity and clustering compounds.

This is caused by similarity measures having a finite number of bits that can be used in union and intersection operators which means that is a finite number of tanimoto values, say, that can actually be calculated. Collisions affect this behavior adversely, although ironically, can make identify similar compounds with fingerprints more effective due to locality sensitive hashing (i.e. similar molecules SHOULD have collisions more often). An rdkit based example is here http://chembl.blogspot.com/2015/08/lsh-based-similarity-search-in-mongodb.html