Wednesday, October 9, 2013

Building a similarity comparison set

Building a Similarity Comparison Set

Update 21.04.2016 This exercise has been redone in a more recent post.

Goal: construct a set of molecular pairs that can be used to compare similarity methods to each other.

I want to start with molecules that have some connection to each other, so I will pick pairs that have a baseline similarity: a Tanimoto similarity using count based Morgan0 fingerprints of at least 0.7. This threshold was selected empirically.

Note: this notebook and the data it uses/generates can be found in the github repo: https://github.com/greglandrum/rdkit_blog

I'm going to use ChEMBL as my data source, so I'll start by adding a table with Morgan0 fingerprints to my local copy of ChEMBL 16:

chembl_16=# select molregno,morganbv_fp(m,0) as mfp0 into rdk.tfps from rdk.mols;
SELECT 1291111
chembl_16=# create index fps_mfp0_idx on rdk.tfps using gist(mfp0);
CREATE INDEX

And now create a smaller table that only contains molecules with molwt<600:

chembl_16=# select molregno,mfp0 into table rdk.tfps_smaller from rdk.tfps join rdk.props using (molregno) where amw<600; 
SELECT 1182986
chembl_16=# create index sfps_mfp0_idx on rdk.tfps_smaller using gist(mfp0);
CREATE INDEX

And now I'll build the set of pairs using Python. This is definitely doable in SQL, but my SQL-fu isn't that strong.

Start by getting a set of 35K random small molecules with MW<600:

In [82]:
import psycopg2
cn = psycopg2.connect(dbname='chembl_16')
curs = cn.cursor()
curs.execute('select molregno,m from rdk.mols join rdk.props using (molregno) where amw<=600 order by random() limit 35000')
qs = curs.fetchall()

And now find one neighbor for 25K of those from the mfp0 table of smallish molecules:

In [83]:
curs.execute('set rdkit.tanimoto_threshold=0.7')

keep=[]
for i,row in enumerate(qs):
    curs.execute('select molregno,m from rdk.mols join (select molregno from rdk.tfps_smaller where mfp0%%morgan_fp(%s,0) and molregno!=%s limit 1) t2 using (molregno)',(row[1],row[0]))
    d = curs.fetchone()
    if not d: continue
    keep.append((row[0],row[1],d[0],d[1]))
    if len(keep)==25000: break
    if not i%500: print 'Done: %d'%i
Done: 0
Done: 500
Done: 1000

Finally, write those out to a file so that we can use them elsewhere:

In [84]:
import gzip
outf = gzip.open('../data/chembl16_25K.pairs.txt.gz','wb+')
for idx1,smi1,idx2,smi2 in keep: outf.write('%d %s %d %s\n'%(idx1,smi1,idx2,smi2))
outf=None

Early analysis of the data

Start by loading the pairs from the file we saved and creating RDKit molecules from them

In [85]:
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
import gzip
In [86]:
rows=[]
for row in gzip.open('../data/chembl16_25K.pairs.txt.gz').readlines():
    row = row.split()
    row[1] = Chem.MolFromSmiles(row[1])
    row[3] = Chem.MolFromSmiles(row[3])
    rows.append(row)

Look at some pairs:

In [87]:
t = []
for x in rows[:5]:
    t.append(x[1])
    t.append(x[3])
    
Draw.MolsToGridImage(t,molsPerRow=2)
Out[87]:

Take a look at property distributions.

Each plot below contains two histograms. The one in blue is for the first set of molecules, the one in green is for the neighbor molecules.

In [88]:
from rdkit.Chem import Descriptors
In [89]:
mws = [(Descriptors.MolWt(x[1]),Descriptors.MolWt(x[3])) for x in rows]
nrots = [(Descriptors.NumRotatableBonds(x[1]),Descriptors.NumRotatableBonds(x[3])) for x in rows]
logps = [(Descriptors.MolLogP(x[1]),Descriptors.MolLogP(x[3])) for x in rows]
In [101]:
_=hist(([x for x,y in mws],[y for x,y in mws]),bins=20,histtype='bar')
xlabel('AMW')
Out[101]:
<matplotlib.text.Text at 0x8d6e6890>
In [102]:
_=hist(([x for x,y in logps],[y for x,y in logps]),bins=20,histtype='bar')
xlabel('mollogp')
Out[102]:
<matplotlib.text.Text at 0x47288310>
In [103]:
_=hist(([x for x,y in nrots],[y for x,y in nrots]),bins=20,histtype='bar')
xlabel('num rotatable bonds')
Out[103]:
<matplotlib.text.Text at 0x56ef06d0>

and a histogram of the similarities we used to construct the set

In [104]:
from rdkit import DataStructs
from rdkit.Chem import rdMolDescriptors
sims = [DataStructs.TanimotoSimilarity(rdMolDescriptors.GetMorganFingerprint(x[1],0),rdMolDescriptors.GetMorganFingerprint(x[3],0)) for x in rows]
In [105]:
_=hist(sims,bins=20)
xlabel('MFP0 sims within pairs')
Out[105]:
<matplotlib.text.Text at 0x472ae0d0>

compare to MFP2 similarity (more on this in a later post)

In [107]:
sims2 = [DataStructs.TanimotoSimilarity(rdMolDescriptors.GetMorganFingerprint(x[1],2),rdMolDescriptors.GetMorganFingerprint(x[3],2)) for x in rows]
In [108]:
_=scatter(sims,sims2,marker='o',edgecolors='none')
xlabel('MFP0 sim')
ylabel('MFP2 sim')
Out[108]:
<matplotlib.text.Text at 0x7d634250>

Look at the distribution of MFP0 similarities in random molecule pairs (more on this in a later post)

In [110]:
import random
idxs = list(range(len(rows)))
random.shuffle(idxs)
ms1 = [x[1] for x in rows]
ms2 = [rows[x][3] for x in idxs]
sims = [DataStructs.TanimotoSimilarity(rdMolDescriptors.GetMorganFingerprint(x,0),rdMolDescriptors.GetMorganFingerprint(y,0)) for x,y in zip(ms1,ms2)]
In [111]:
_=hist(sims,bins=20)
xlabel('MFP0 sim in random pairs')
Out[111]:
<matplotlib.text.Text at 0x452b390>
In []:

No comments: