Mapping fingerprint data into 2D
Inspired by this notebook from George Papadatos.
Scikit-learn makes it quite easy to apply multi-dimensional scaling (MDS) either to reduce the dimensionality of a dataset or to embed distance data into cartesian space. This enables one of the favorite activities of the cheminformatician: producing plots of where compounds land in a 2D space. Things like this one, embedding a set of ~120 compounds with Serine-protein kinase ATR data in ChEMBL:
Here I'm going to take a look at how accurate these embeddings are and how sensitive that accuracy is to the embedding method, dataset size, and dataset composition.
TL;DR Version
Here's a plot of the Euclidean distance in the projected 2D space vs the original Tanimoto distance for George's compounds. There are clearly a lot of distances that are not being accurately represented. It is somewhat encouraging that no distances that should be small (<0.4) end up being large, but the correlation really isn't all that great.
Plotting the residual stress per point versus embedding dimension makes it clear that, at least according to the MDS, this dataset really isn't 2D.
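For concreteness, here's a minimal, self-contained sketch of the stress-per-point check used throughout this post (scikit-learn's stress_ is the sum of squared differences between the embedded distances and the input dissimilarities, so dividing by the number of points gives a rough per-point figure). A random-point distance matrix stands in for the Tanimoto matrix constructed below:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn import manifold
# toy stand-in for the Tanimoto distance matrix built below:
# pairwise Euclidean distances between 50 random 10D points
rng = np.random.RandomState(3)
dmat = squareform(pdist(rng.rand(50, 10)))
mds = manifold.MDS(n_components=2, dissimilarity="precomputed", random_state=3)
coords = mds.fit(dmat).embedding_
print('stress per point: %.3f' % (mds.stress_ / len(dmat)))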
On to the analysis
# pylab mode supplies the plotting functions (scatter, yscale, etc.) used below
%pylab inline
import requests
import pandas as pd
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
from rdkit.Chem import AllChem as Chem
from rdkit.Chem import DataStructs
from scipy.spatial.distance import *
import numpy as np
from sklearn import manifold
from rdkit import rdBase
print(rdBase.rdkitVersion)
import time
print time.asctime()
Grab the data
I'm going to start with a larger dataset so that I can explore the impact of dataset size on the results. The analysis of George's results is below.
I'll use the Dopamine D3 receptor as the target for this exercise. It's one of the targets we used in both the benchmarking and model fusion papers.
target='CHEMBL234'
re = requests.get('https://www.ebi.ac.uk/chemblws/targets/{0}/bioactivities.json'.format(target))
data = pd.DataFrame(re.json()['bioactivities'])
data.head()
data.shape
Remove duplicates
data = data.drop_duplicates(['parent_cmpd_chemblid'])
data.shape
Remove assays with too many or too few compounds
assays = data[['assay_chemblid','target_chemblid']].groupby('assay_chemblid').count()
print assays.shape
assays.head(5)
assays.assay_chemblid.hist(bins=20)
goodassays = assays.ix[(assays.assay_chemblid >= 10)&(assays.assay_chemblid <= 30)]
print goodassays.shape
goodassays.assay_chemblid.hist(bins=20)
data2 = data.ix[data.assay_chemblid.isin(list(goodassays.index))]
data2.shape
Now get the SMILES
Note: George's original notebook used the ChEMBL web services to grab the structures. Since the ChEMBL web services don't seem to have any way to do bulk retrieval, this requires one call per compound, which takes a long time. For larger datasets it's almost definitely a better idea to do queries against a local copy of ChEMBL, so that's what we'll do here.
%load_ext sql
%config SqlMagic.feedback = False
def fetch_SMILES(chemblid):
data = %sql postgresql://localhost/chembl_16 \
select canonical_smiles smiles from compound_structures \
join molecule_dictionary using (molregno) \
where chembl_id=:chemblid ;
return data[0][0]
data2['SMILES'] = data2['parent_cmpd_chemblid'].map(fetch_SMILES)
PandasTools.AddMoleculeColumnToFrame(data2, smilesCol = 'SMILES',includeFingerprints=True)
Let's start by looking at assays that have between 10 and 12 compounds:
assays = data2[['assay_chemblid','target_chemblid']].groupby('assay_chemblid').count()
print assays.shape
goodassays = assays.ix[(assays.assay_chemblid >= 10)&(assays.assay_chemblid <= 12)]
subset = data2.ix[data2.assay_chemblid.isin(list(goodassays.index))]
print subset.shape
mols = subset[['parent_cmpd_chemblid','name_in_reference','SMILES', 'ROMol', 'assay_chemblid']]
mols.shape
mols.head(5)
Construct fingerprints:
fps = [Chem.GetMorganFingerprintAsBitVect(m,2,nBits=2048) for m in mols['ROMol']]
Now the distance matrix:
dist_mat = []
for i,fp in enumerate(fps):
    # row i: Tanimoto distances (1 - similarity) from fp to every fingerprint
    dist_mat.append(DataStructs.BulkTanimotoSimilarity(fp,fps,returnDistance=1))
dist_mat = np.array(dist_mat)
And now do the MDS into 2 dimensions:
mds = manifold.MDS(n_components=2, dissimilarity="precomputed", random_state=3, n_jobs = 4, verbose=1,max_iter=1000)
results = mds.fit(dist_mat)
coords = results.embedding_
print 'Final stress:',mds.stress_
And plot the points:
rcParams['figure.figsize'] = 6,6
scatter([x for x,y in coords], [y for x,y in coords],edgecolors='none')
Nice picture. How accurate is it?
The stress values above lead me to believe that there are probably some real problems here.
The overall stress value isn't particularly easy to interpret, so we'll check up on things by comparing the embedded distances with the Tanimoto distances we're trying to reproduce.
import random
def distCompare(dmat,coords,nPicks=5000,seed=0xf00d):
""" picks a random set of pairs of points to compare distances """
nPts=len(coords)
random.seed(seed)
res=[]
keep=set()
if nPicks>0:
while len(res)<nPicks:
idx1 = random.randint(0,nPts-1)
idx2 = random.randint(0,nPts-1)
if idx1==idx2:
continue
if idx1>idx2:
idx1,idx2=idx2,idx1
if (idx1,idx2) in keep:
continue
keep.add((idx1,idx2))
p1 = coords[idx1]
p2 = coords[idx2]
v = p1-p2
d = sqrt(v.dot(v))
res.append((dmat[idx1][idx2],d))
else:
for idx1 in range(nPts):
for idx2 in range(idx1+1,nPts):
p1 = coords[idx1]
p2 = coords[idx2]
v = p1-p2
d = sqrt(v.dot(v))
res.append((dmat[idx1][idx2],d))
return res
d = distCompare(dist_mat,coords)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')
Yikes... that's no good. This really isn't so surprising given how much stress was left.
Let's try increasing the dimensionality to see how much improvement we can get:
mds2 = manifold.MDS(n_components=20, dissimilarity="precomputed", random_state=3, n_jobs = 4, verbose=1,max_iter=1000)
results = mds2.fit(dist_mat)
coords = results.embedding_
print 'Final stress:',mds2.stress_
d = distCompare(dist_mat,coords)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')
Not so bad... of course, we can't visualize 20 dimensions.
What about if we try doing dimensionality reduction on the 20D results instead of using the distance matrix?
mds3 = manifold.MDS(n_components=2, random_state=3, n_jobs = 4, verbose=1,max_iter=1000)
results3 = mds3.fit(coords)
coords3 = results3.embedding_
print 'Final stress:',mds3.stress_
The stress is still really high. This isn't that surprising given that we already tried the same thing from the distance matrix.
Ok, so MDS doesn't work. Now that we have actual coordinates, we can use some of the other scikit-learn manifold learning approaches for embedding the points.
Instead of jumping all the way down to 2D, where we're really unlikely to get decent results, we'll start by dropping down to 10D.
First MDS in 10D:
mds3 = manifold.MDS(n_components=10, random_state=3, n_jobs = 4, verbose=1,max_iter=1000)
results3 = mds3.fit(coords)
coords3 = results3.embedding_
print 'Final stress:',results3.stress_
d = distCompare(dist_mat,coords3)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')
Now try locally linear embedding:
lle = manifold.LocallyLinearEmbedding(n_components=10,random_state=3,max_iter=1000)
results3=lle.fit(coords)
coords3=results3.embedding_
print results3.reconstruction_error_
d = distCompare(dist_mat,coords3)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')
Ick! What about Isomap?
embed = manifold.Isomap(n_components=10,max_iter=1000)
results3=embed.fit(coords)
coords3=results3.embedding_
d = distCompare(dist_mat,coords3)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')
Also pretty grim. Does spectral embedding work?
embed = manifold.SpectralEmbedding(n_components=10)
results3=embed.fit(coords)
coords3=results3.embedding_
d = distCompare(dist_mat,coords3)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')
Hmm, there's at least some signal there. How does spectral embedding do with 2D?
embed = manifold.SpectralEmbedding(n_components=2)
results3=embed.fit(coords)
coords3=results3.embedding_
d = distCompare(dist_mat,coords3)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')
Not so hot...
Look at stress vs dimension
We're able to do a reasonable job of embedding into 20 dimensions with MDS, and 10 even looks somewhat ok. Let's look at the behavior of the residual stress as a function of dimension:
vs={}
for nComps in (140,120,100,80,60,40,20,10,5,2):
mds3 = manifold.MDS(n_components=nComps, random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
results3 = mds3.fit(dist_mat)
print 'Final stress at dim=%d:'%nComps,results3.stress_,'stress per element:',results3.stress_/len(dist_mat)
vs[nComps]=results3.stress_/len(dist_mat)
scatter(vs.keys(),vs.values(),edgecolors='none')
yscale('log')
Yeah, the stress rises more or less exponentially as the dimension drops. These data really don't want to be in a 2D space.
Try non-metric embedding
vs={}
for nComps in (140,120,100,80,60,40,20,10,5,2):
mds3 = manifold.MDS(metric=False,n_components=nComps,random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
results3 = mds3.fit(dist_mat)
print 'Final stress at dim=%d:'%nComps,results3.stress_,'stress per element:',results3.stress_/len(dist_mat)
vs[nComps]=results3.stress_/len(dist_mat)
scatter(vs.keys(),vs.values(),edgecolors='none')
yscale('log')
Non-metric embedding helps somewhat at lower dimension, but not dramatically. Plus it's a lot slower.
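To back up that "a lot slower" claim, here's a hedged sketch (not from the original analysis) that times a metric and a non-metric 2D fit on the same distance matrix; the absolute numbers will of course depend on hardware and dataset size:

import time
from sklearn import manifold
for useMetric in (True, False):
    mds_t = manifold.MDS(metric=useMetric, n_components=2, dissimilarity="precomputed",
                         random_state=3, n_jobs=4, max_iter=1000)
    t0 = time.time()
    mds_t.fit(dist_mat)
    print('metric=%s: %.1f s, stress/point=%.3f' % (useMetric, time.time() - t0,
                                                    mds_t.stress_ / len(dist_mat)))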
A larger data set
Maybe it's possible to do better with a larger data set? This seems unlikely, but we can at least try:
assays = data2[['assay_chemblid','target_chemblid']].groupby('assay_chemblid').count()
print assays.shape
goodassays = assays.ix[(assays.assay_chemblid >= 10)&(assays.assay_chemblid <= 15)]
subset = data2.ix[data2.assay_chemblid.isin(list(goodassays.index))]
print subset.shape
mols = subset[['parent_cmpd_chemblid','name_in_reference','SMILES', 'ROMol', 'assay_chemblid']]
print mols.shape
fps = [Chem.GetMorganFingerprintAsBitVect(m,2,nBits=2048) for m in mols['ROMol']]
dist_mat = []
for i,fp in enumerate(fps):
dist_mat.append(DataStructs.BulkTanimotoSimilarity(fps[i],fps,returnDistance=1))
dist_mat = np.array(dist_mat)
print dist_mat.shape
vs={}
for nComps in (140,120,100,80,60,40,20,10,5,2):
mds3 = manifold.MDS(n_components=nComps, random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
results3 = mds3.fit(dist_mat)
print 'Final stress at dim=%d:'%nComps,results3.stress_,'stress per element:',results3.stress_/len(dist_mat)
vs[nComps]=results3.stress_/len(dist_mat)
scatter(vs.keys(),vs.values(),edgecolors='none')
yscale('log')
Nope, that definitely made things worse.
And larger still
assays = data2[['assay_chemblid','target_chemblid']].groupby('assay_chemblid').count()
print assays.shape
goodassays = assays.ix[(assays.assay_chemblid >= 10)&(assays.assay_chemblid <= 18)]
subset = data2.ix[data2.assay_chemblid.isin(list(goodassays.index))]
print subset.shape
mols = subset[['parent_cmpd_chemblid','name_in_reference','SMILES', 'ROMol', 'assay_chemblid']]
print mols.shape
fps = [Chem.GetMorganFingerprintAsBitVect(m,2,nBits=2048) for m in mols['ROMol']]
dist_mat = []
for i,fp in enumerate(fps):
dist_mat.append(DataStructs.BulkTanimotoSimilarity(fps[i],fps,returnDistance=1))
dist_mat = np.array(dist_mat)
print dist_mat.shape
vs={}
for nComps in (140,120,100,80,60,40,20,10,5,2):
mds3 = manifold.MDS(n_components=nComps, random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
results3 = mds3.fit(dist_mat)
print 'Final stress at dim=%d:'%nComps,results3.stress_,'stress per element:',results3.stress_/len(dist_mat)
vs[nComps]=results3.stress_/len(dist_mat)
scatter(vs.keys(),vs.values(),edgecolors='none')
yscale('log')
Continuing to get worse.
Try less data
assays = data2[['assay_chemblid','target_chemblid']].groupby('assay_chemblid').count()
print assays.shape
goodassays = assays.ix[assays.assay_chemblid == 15]
subset = data2.ix[data2.assay_chemblid.isin(list(goodassays.index))]
print subset.shape
mols = subset[['parent_cmpd_chemblid','name_in_reference','SMILES', 'ROMol', 'assay_chemblid']]
print mols.shape
fps = [Chem.GetMorganFingerprintAsBitVect(m,2,nBits=2048) for m in mols['ROMol']]
dist_mat = []
for i,fp in enumerate(fps):
dist_mat.append(DataStructs.BulkTanimotoSimilarity(fps[i],fps,returnDistance=1))
dist_mat = np.array(dist_mat)
print dist_mat.shape
vs={}
for nComps in (140,120,100,80,60,40,20,10,5,2):
mds3 = manifold.MDS(n_components=nComps, random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
results3 = mds3.fit(dist_mat)
print 'Final stress at dim=%d:'%nComps,results3.stress_,'stress per element:',results3.stress_/len(dist_mat)
vs[nComps]=results3.stress_/len(dist_mat)
scatter(vs.keys(),vs.values(),edgecolors='none')
yscale('log')
That's a lot less stress per point than before, but it's still pretty high. Look at the distances to be sure:
mds3 = manifold.MDS(n_components=2, random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
results3 = mds3.fit(dist_mat)
coords3=results3.embedding_
d = distCompare(dist_mat,coords3)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')
Embedding 100 random compounds:
data = %sql postgresql://localhost/chembl_16 \
select canonical_smiles smiles from compound_structures limit 50000 ;
random.seed(0xf00d)
random.shuffle(data)
smis = data[:100]
mols = [Chem.MolFromSmiles(x[0]) for x in smis]
fps = [Chem.GetMorganFingerprintAsBitVect(m,2,nBits=2048) for m in mols]
dist_mat = []
for i,fp in enumerate(fps):
dist_mat.append(DataStructs.BulkTanimotoSimilarity(fps[i],fps,returnDistance=1))
dist_mat = np.array(dist_mat)
print dist_mat.shape
vs={}
for nComps in (140,120,100,80,60,40,20,10,5,2):
mds3 = manifold.MDS(n_components=nComps, random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
results3 = mds3.fit(dist_mat)
print 'Final stress at dim=%d:'%nComps,results3.stress_,'stress per element:',results3.stress_/len(dist_mat)
vs[nComps]=results3.stress_/len(dist_mat)
scatter(vs.keys(),vs.values(),edgecolors='none')
yscale('log')
mds3 = manifold.MDS(n_components=2, random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
results3 = mds3.fit(dist_mat)
coords3=results3.embedding_
d = distCompare(dist_mat,coords3,nPicks=-1)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')
Yeah, that's really bad.
What about George's results?
TRGT = 'CHEMBL5024'
re = requests.get('https://www.ebi.ac.uk/chemblws/targets/{0}/bioactivities.json'.format(TRGT))
data = pd.DataFrame(re.json()['bioactivities'])
data = data.drop_duplicates(['parent_cmpd_chemblid'])
assays = data[['assay_chemblid','target_chemblid']].groupby('assay_chemblid').count()
goodassays = list(assays.ix[assays.assay_chemblid >= 4].index)
data = data.ix[data.assay_chemblid.isin(goodassays)]
data.shape
def fetch_SMILES2(chemblid):
return str(requests.get('https://www.ebi.ac.uk/chemblws/compounds/{}.json'.format(chemblid)).json()['compound']['smiles'])
data['SMILES'] = data['parent_cmpd_chemblid'].map(fetch_SMILES2)
PandasTools.AddMoleculeColumnToFrame(data, smilesCol = 'SMILES')
mols = data[['parent_cmpd_chemblid','name_in_reference','SMILES', 'ROMol', 'assay_chemblid']]
fps = [Chem.GetMorganFingerprintAsBitVect(m,2,nBits=2048) for m in mols['ROMol']]
dist_mat = []
for i,fp in enumerate(fps):
dist_mat.append(DataStructs.BulkTanimotoSimilarity(fps[i],fps,returnDistance=1))
dist_mat = np.array(dist_mat)
print dist_mat.shape
vs={}
for nComps in (140,120,100,80,60,40,20,10,5,2):
mds3 = manifold.MDS(n_components=nComps, random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
results3 = mds3.fit(dist_mat)
print 'Final stress at dim=%d:'%nComps,results3.stress_,'stress per element:',results3.stress_/len(dist_mat)
vs[nComps]=results3.stress_/len(dist_mat)
scatter(vs.keys(),vs.values(),edgecolors='none')
yscale('log')
See if we can improve things by adjusting the MDS tolerance.
vs2={}
for nComps in (140,120,100,80,60,40,20,10,5,2):
mds3 = manifold.MDS(eps=1e-8,n_components=nComps, random_state=3, dissimilarity="precomputed",n_jobs = 4,verbose=0,max_iter=1000)
results3 = mds3.fit(dist_mat)
print 'Final stress at dim=%d:'%nComps,results3.stress_,'stress per element:',results3.stress_/len(dist_mat)
vs2[nComps]=results3.stress_/len(dist_mat)
scatter(vs2.keys(),vs2.values(),c='r',edgecolors='none')
scatter(vs.keys(),vs.values(),edgecolors='none')
yscale('log')
That definitely helps at higher dimensions, but by the time we get down to 5D or 2D the improvement is mostly lost.
Try an even stricter eps value:
vs3={}
for nComps in (60,40,20,10,5,2):
mds3 = manifold.MDS(eps=1e-12,n_components=nComps, random_state=3, dissimilarity="precomputed",n_jobs = 4,verbose=0,max_iter=1000)
results3 = mds3.fit(dist_mat)
print 'Final stress at dim=%d:'%nComps,results3.stress_,'stress per element:',results3.stress_/len(dist_mat)
vs3[nComps]=results3.stress_/len(dist_mat)
scatter(vs.keys(),vs.values(),edgecolors='none')
scatter(vs2.keys(),vs2.values(),c='r',edgecolors='none')
scatter(vs3.keys(),vs3.values(),c='g',edgecolors='none')
yscale('log')
No additional benefit there.
The fact that increasing the precision of the MDS doesn't help at all with embedding into 2 dimensions seems to be yet another sign that this data set really, really doesn't fit into 2D.
Embed into 2D and then look at the coords:
mds3 = manifold.MDS(eps=1e-8,n_components=2, random_state=3, dissimilarity="precomputed",n_jobs = 4, verbose=0,max_iter=1000)
results3 = mds3.fit(dist_mat)
coords3 = results3.embedding_
scatter([x for x,y in coords3],[y for x,y in coords3],edgecolors='none')
That's a similar picture to the one George found.
Here's the distance-distance plot:
d = distCompare(dist_mat,coords3,nPicks=-1)
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')
That's not too terrible in at least one respect: things that are supposed to be reasonably close together (dists < 0.4) are close together.
Zoom in to the small-distance region and see what that looks like:
scatter([x for x,y in d],[y for x,y in d],edgecolors='none')
ylim((-.05,.5))
_=xlim((-.05,.5))
Well, there's at least a correlation.
A summary of the situation: local environments within clusters (things close to each other) should be reasonably well represented, but the layout of the clusters relative to each other is probably not to be trusted.
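To put a rough number on that "local environments are ok" claim, here's a hedged sketch (not part of the original analysis) that measures, for each compound, how many of its k nearest neighbors in Tanimoto space are also among its k nearest neighbors in the 2D embedding; an overlap near 1.0 would mean the neighborhoods survive the projection:

import numpy as np
from scipy.spatial.distance import pdist, squareform
def neighborOverlap(dmat, coords, k=5):
    embDist = squareform(pdist(coords))
    overlaps = []
    for i in range(len(dmat)):
        # position 0 in each argsort is the point itself (distance 0), skip it
        origNbrs = set(np.argsort(dmat[i])[1:k+1])
        embNbrs = set(np.argsort(embDist[i])[1:k+1])
        overlaps.append(len(origNbrs & embNbrs)/float(k))
    return np.mean(overlaps)
print('mean 5-NN overlap: %.2f' % neighborOverlap(dist_mat, coords3))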