Looking at bit collisions in Morgan fingerprints
Over the past few days we've had a few conversations in the group about bit collisions in Morgan fingerprints. These happen when we fold a fingerprint down to a particular size and two different bits (corresponding to two different atom environments) end up on the same bit.
This post is an exploration of how often that happens. I will look at fingerprints with different radii -- 1, 2, and 3 -- folded to a set of different sizes -- 1K, 2K, 4K, 8K.
TL;DR version: There are a fair number of collisions at fingerprint sizes below 4K. As expected, higher radii have more collisions. The collisions don't end up making much difference in terms of calculated similarity, but we have observed (not shown here) that they can make a measurable difference in the performance of some machine-learning algorithms.
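To make that concrete before diving in, here's a minimal sketch (the SMILES is just an arbitrary example, and the 64-bit size is deliberately tiny so that collisions are likely): if the folded fingerprint has fewer bits set than the unfolded fingerprint has distinct environments, at least two environments collided.
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors as rdmd
m = Chem.MolFromSmiles('c1ccccc1CC(=O)NC1CC1')        # arbitrary example molecule
full = rdmd.GetMorganFingerprint(m, 2)                 # unfolded, sparse
folded = rdmd.GetMorganFingerprintAsBitVect(m, 2, 64)  # folded to 64 bits
# fewer on bits than distinct environments means a collision happened
print(len(full.GetNonzeroElements()), folded.GetNumOnBits())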
from __future__ import print_function  # must precede the other imports
from rdkit import Chem, DataStructs, rdBase
from rdkit.Chem import Draw
from rdkit.Chem import rdMolDescriptors as rdmd
from rdkit.Chem.Draw import IPythonConsole
from collections import defaultdict
import numpy as np
import random
import time
print(rdBase.rdkitVersion)
print(time.asctime())
For test data I'll use the ChEMBL compounds from 2010-2012 that I built in an earlier post. This is a set of about 230K compounds.
filen='../data/chembl16_2010-2012.smi'
!wc -l $filen
!head $filen
Loop over the molecules and build fingerprints of multiple radii and folded lengths.
counts = defaultdict(list)
with open(filen) as inf:
    for i, line in enumerate(inf):
        m = Chem.MolFromSmiles(line.split()[0])
        if not m:
            continue
        for v in 1, 2, 3:
            # unfolded fingerprint: number of distinct environments
            counts[(v, -1)].append(len(rdmd.GetMorganFingerprint(m, v).GetNonzeroElements()))
            for l in 1024, 2048, 4096, 8192:
                # folded fingerprint: number of bits actually set
                counts[(v, l)].append(rdmd.GetMorganFingerprintAsBitVect(m, v, l).GetNumOnBits())
        if not (i + 1) % 5000:
            print("Done {0}".format(i + 1))
Now plot histograms of the numbers of collisions along with the fractions of collisions.
The two plots in each row show the same data: the left one shows counts of collisions, the right the fraction of bits that are collisions. The right-hand plots also include lines showing what fraction of fingerprints have at least that fraction of collisions.
figure(figsize=(16, 20))
pidx = 1
#----------------------------
for nbits in (8192, 4096, 2048, 1024):
    # collision counts (d) and collision fractions (dp) for each radius
    ds = []
    dps = []
    for r in (1, 2, 3):
        v1 = np.array(counts[r, -1])
        v2 = np.array(counts[r, nbits])
        d = v1 - v2                       # number of collisions per molecule
        ds.append(d)
        dps.append(d.astype(float) / v1)  # fraction of bits that are collisions
    subplot(4, 2, pidx)
    pidx += 1
    _ = hist(ds, bins=20, log=True, label=("r=1", "r=2", "r=3"))
    title('%d bits' % nbits)
    _ = legend()
    subplot(4, 2, pidx)
    pidx += 1
    _ = hist(dps, bins=20, log=True, label=("r=1", "r=2", "r=3"))
    _ = hist(dps, bins=20, histtype='step', cumulative=-1, normed=True, color=['b', 'g', 'r'])
    _ = legend()
So, there are definitely some collisions.
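To put rough numbers on that, here's a quick summary sketch (it assumes the `counts` dict built in the loop above): the fraction of molecules showing at least one collision when folding to 1024 bits.
# fraction of molecules with at least one collision at 1024 bits
for r in (1, 2, 3):
    full = np.array(counts[r, -1])      # distinct environments in the unfolded fp
    folded = np.array(counts[r, 1024])  # bits set in the 1024-bit fp
    frac = float(((full - folded) > 0).sum()) / len(full)
    print("r=%d: %.1f%% of molecules have at least one collision" % (r, 100 * frac))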
Do they make a difference in similarity values? Let's look at 10K random pairs of molecules to find out.
random.seed(0xF00D)
with open(filen) as inf:
    smis = [x.split()[0] for x in inf]
ivs = [random.randint(0, len(smis) - 1) for x in range(10000)]
jvs = [random.randint(0, len(smis) - 1) for x in range(10000)]
pairs = list(zip(ivs, jvs))

sims = defaultdict(list)
for i, j in pairs:
    mi = Chem.MolFromSmiles(smis[i])
    if not mi:
        continue
    mj = Chem.MolFromSmiles(smis[j])
    if not mj:
        continue
    for r in 0, 1, 2, 3:
        # unfolded fingerprints; set each count to 1 so the Tanimoto values
        # are comparable to the ones from the (binary) bit vectors
        fpi = rdmd.GetMorganFingerprint(mi, r)
        for k, v in fpi.GetNonzeroElements().items():
            fpi[k] = 1
        fpj = rdmd.GetMorganFingerprint(mj, r)
        for k, v in fpj.GetNonzeroElements().items():
            fpj[k] = 1
        sims[(r, -1)].append(DataStructs.TanimotoSimilarity(fpi, fpj))
        for l in 1024, 2048, 4096, 8192:
            fpi = rdmd.GetMorganFingerprintAsBitVect(mi, r, l)
            fpj = rdmd.GetMorganFingerprintAsBitVect(mj, r, l)
            sims[(r, l)].append(DataStructs.TanimotoSimilarity(fpi, fpj))
figure(figsize=(12, 20))
pidx = 1
#----------------------------
for nbits in (8192, 4096, 2048, 1024):
    subplot(4, 1, pidx)
    pidx += 1
    ds = []
    for r in (1, 2, 3):
        v1 = np.array(sims[r, -1])
        v2 = np.array(sims[r, nbits])
        ds.append(v1 - v2)  # change in similarity caused by folding
    _ = hist(ds, bins=20, log=True, label=("r=1", "r=2", "r=3"))
    title('%d bits' % nbits)
    _ = legend()
The changes in similarity are minimal.
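To quantify "minimal", here's a short sketch (it assumes the `sims` dict built above) that reports the largest and mean absolute change in Tanimoto similarity caused by folding:
# largest and mean absolute change in Tanimoto similarity due to folding
for l in (1024, 2048, 4096, 8192):
    for r in (1, 2, 3):
        diffs = np.abs(np.array(sims[r, -1]) - np.array(sims[r, l]))
        print("r=%d, %4d bits: max %.3f, mean %.4f" % (r, l, diffs.max(), diffs.mean()))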
3 comments:
Cool, but a quick question:
Why the series of lengths 1024, 2048, 4096, 9192? I would have expected 8192 for the largest. Of course a power of two is just a convention, but it aligns with word sizes in hardware/libraries.
It's a technical thing having to do with the details of the hashing algorithm used to do the folding.
Oh, um, wait, no it's not. It's an embarrassing typo.
Thanks for pointing that out. I'll rerun the analysis and update the post.
Updated.