Wednesday, August 9, 2017

Chemical Topic Modeling with the RDKit and KNIME

We recently published a paper on the application of topic modeling, a method developed in the text mining community, to chemical data: http://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00249. Here I'm going to show how to use this approach inside of KNIME. I'm really pleased with how the paper turned out and think the approach is a really useful one for efficiently organizing chemical datasets. We've got a bunch of cool ideas about what to do next too, so keep your eyes open...

An aside/apology: while doing the literature search, we somehow completely missed Rajarshi's blog post from 2012 (http://blog.rguha.net/?p=997). This is really embarrassing. Sorry Rajarshi...

Since we continue to work on the code that Nadine wrote while doing the research, called CheTo (for ChemicalTopic), Nadine and I have put it on GitHub (https://github.com/rdkit/CheTo). We'd also like to make it easy for other people to use the code, so we built a conda package for it and added it to the RDKit channel. If you're using the Anaconda Python distribution (and you should be!), you can install the package in your conda enviroment with a single command conda install -c rdkit cheto. If you don't already have the RDKit installed, it will automatically be installed for you. We'll be updating the git repository in the coming weeks to provide more information about and examples of how to use the CheTo python code. This blog post, though, is about using it from KNIME.

Let's start with the pre-requisites. You need an installation of at least v3.4.0 of KNIME (released July 2017). That installation should have the KNIME text mining extensions and the Python Integration version that supports Python 2 and Python 3. At the time of writing these are both in KNIME Labs. It's not a bad idea to have the RDKit nodes installed too (these are available in the KNIME Community Extensions section in the "Install KNIME Extensions" dialog). You also need to have the Python extension properly configured, I covered this in a post on the KNIME blog: https://www.knime.com/blog/setting-up-the-knime-python-extension-revisited-for-python-30-and-20. The condo environment you are using from KNIME should have both the RDKit and CheTo installed from the rdkit channel (see the CheTo installation instructions above).

phew... now we're ready to go. Here's a screenshot of the sample workflow for doing chemical topic modeling:

The table reader at the beginning brings in a set of a couple hundred molecules taken from 12 ChEMBL documents. The real work is done in the "fingerprint and do LDA" wrapped metanode, which expects an input table that has a column named "smiles" that contains SMILES. We won't get into the details of the contents of this node here, but if you configure the node (double click or right click and "configure") you'll get a dialog which allows you to change the important parameters:


Executing the node, which can take a while since it's not currently very well optimized, gives you two tables. The first has the individual molecules assigned to topics:

and the second has the bits that define the topics themselves,  including Nadine's very nice depictions of the fingerprint bits:


The GroupBy nodes provide a summary of the number of documents each topic shows up in as well as the number of topics that are identified in each document. This last was one of the validation metrics that we used in the paper; here's what we get with the sample data set:
You can see that the majority of the documents contain compounds that are assigned to a single topic, while a few documents contain compounds assigned to two topics and one, doc_id 44596, has compounds from three topics.

There's a lot more detail in the paper about what this all means and what you might do with it; the goal for this post was to provide a very quick overview of how to do the analysis and look at the results inside of KNIME. I hope I was successful with that.

The workflow itself is on the KNIME public examples server, you find that in KNIME by logging into the examples server and then navigating to Examples/99_Community/03_RDKit/08_Chemical_Topic_Modeling:



Wednesday, May 17, 2017

Looking at the Platinum dataset

A recent paper out of the ZBH in Hamburg published a new set of ligand structures from the PDB to be used for benchmarking conformational analysis methods: http://pubs.acs.org/doi/abs/10.1021/acs.jcim.6b00613 The dataset is available here: http://www.zbh.uni-hamburg.de/?id=628 Along with the paper the authors presented results from multiple conformation generation tools, including the RDKit. Sereina Riniker's ETKDG implementation (http://pubs.acs.org/doi/abs/10.1021/acs.jcim.5b00654) in the RDKit ended up doing quite well relative to the other approaches. That, of course, made me quite happy.
I hadn't done anything with the Platinum dataset at all yet, but an exchange with David Koes on Twitter - https://twitter.com/david_koes/status/862831249617862656 - prompted me to take a look. It's an interesting one to look at from the "predicting bioactive conformations" perspective since it has >4500 PDB ligands. In the ETKDG paper we only looked at 238 PDB structures; that paper also included 1290 CSD structures.
In [204]:
from collections import defaultdict
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_3d=True
%pylab inline
Populating the interactive namespace from numpy and matplotlib
/home/glandrum/anaconda3/envs/py36/lib/python3.6/site-packages/IPython/core/magics/pylab.py:160: UserWarning: pylab import has clobbered these variables: ['indices']
`%matplotlib` prevents importing * from pylab and numpy
  "\n`%matplotlib` prevents importing * from pylab and numpy"
Read the molecules in and assign atomic chirality based on the structures provided.
In [193]:
ms = [x for x in Chem.SDMolSupplier('../data/platinum_dataset_2017_01.sdf',removeHs=False)]
# Assign atomic chirality based on the structures:
for m in ms: Chem.AssignAtomChiralTagsFromStructure(m)
len(ms)
Out[193]:
4548

Conformation generation

Start by generating 50 conformations via ETKDG
In [77]:
import time
tms = [Chem.Mol(x) for x in ms]
ps = AllChem.ETKDG()
ps.pruneRmsThresh=0.5
ps.numThreads=0
ts = []
for i,m in enumerate(tms):
    t1 = time.time()
    AllChem.EmbedMultipleConfs(m,50,ps)
    t2=time.time()
    ts.append((t2-t1))
    if not (i%50): print("done %d"%i)
done 0
done 50
done 100
done 150
  ... 
done 4350
done 4400
done 4450
done 4500
Repeat the conformation generation using standard DG
In [139]:
tms2 = [Chem.Mol(x) for x in ms]
ps = AllChem.ETKDG()
ps.useBasicKnowledge = False
ps.useExpTorsionAnglePrefs = False
ps.pruneRmsThresh=0.5
ps.numThreads=0
ts2 = []
for i,m in enumerate(tms2):
    t1 = time.time()
    AllChem.EmbedMultipleConfs(m,50,ps)
    t2=time.time()
    ts2.append((t2-t1))
    if not (i%50): print("done %d"%i)
done 0
done 50
done 100
done 150
 ...
done 4350
done 4400
done 4450
done 4500

RMSD calculation

For the RMS analysis we want molecules without Hs:
In [140]:
ms_noh = [Chem.RemoveHs(m) for m in ms]
tms_noh = [Chem.RemoveHs(m) for m in tms]
tms2_noh = [Chem.RemoveHs(m) for m in tms2]
In [141]:
import pickle
with open('../data/platinum_dataset_2017_01.chiral_confs.pkl','wb+') as outf:
    pickle.dump((ms_noh,tms_noh,tms2_noh,ts,ts2),outf)
Now generate RMS values to the crystal structure for 1, 5, 20, 25, and 50 conformer ensembles.
In [143]:
# ETKDG
rmsvs = defaultdict(list)
for m,tm in zip(ms_noh,tms_noh):
    best = 1e8
    cids = [x.GetId() for x in tm.GetConformers()]
    for i,cid in enumerate(cids):
        rms = AllChem.GetBestRMS(m,tm,probeConfId=cid)
        best = min(rms,best)
        if not i:
            best_1 = best
        if i<5:
            best_5 = best
        if i<20:
            best_20 = best
        if i<25:
            best_25 = best
    rmsvs[50].append(best)
    rmsvs[1].append(best_1)
    rmsvs[5].append(best_5)
    rmsvs[20].append(best_20)
    rmsvs[25].append(best_25)
# DG
rmsvs2 = defaultdict(list)
for m,tm in zip(ms_noh,tms2_noh):
    best = 1e8
    cids = [x.GetId() for x in tm.GetConformers()]
    for i,cid in enumerate(cids):
        rms = AllChem.GetBestRMS(m,tm,probeConfId=cid)
        best = min(rms,best)
        if not i:
            best_1 = best
        if i<5:
            best_5 = best
        if i<20:
            best_20 = best
        if i<25:
            best_25 = best
    rmsvs2[50].append(best)
    rmsvs2[1].append(best_1)
    rmsvs2[5].append(best_5)
    rmsvs2[20].append(best_20)
    rmsvs2[25].append(best_25)    
There are a couple molecules where we didn't manage to generate conformations:
In [181]:
len([1 for x in rmsvs[50] if x>1000]),len([1 for x in rmsvs2[50] if x>1000])
Out[181]:
(2, 1)
Let's look at those:
In [184]:
[i for i,x in enumerate(rmsvs[50]) if x > 1000],[i for i,x in enumerate(rmsvs2[50]) if x > 1000]
Out[184]:
([1641, 2708], [1641])
In [192]:
from rdkit.Chem import Draw
indices = [i for i,x in enumerate(rmsvs[50]) if x > 1000]
badms = [Draw.PrepareMolForDrawing(ms_noh[x],forceCoords=True) for x in indices]
legends = [ms[x].GetProp("_Name") for x in indices]
Draw.MolsToGridImage(badms,legends=legends)
Out[192]:
I'm surprised that there were problems with the second structure (H52). It doesn't look like it should be that hard to embed, so that's something to look into. The first structure (840) has multiple specified chiral centers in a highly constrained fused ring system, so it's not horribly surprising to encounter problems.

Looking at RMSD values

In [145]:
def values_there(seq,threshold=1000):
    return [x for x in seq if x<threshold]
Let's look at histograms of RMS values
In [248]:
figsize(18,6)
hist([values_there(rmsvs[x]) for x in sorted(rmsvs)],bins=20,label=[x for x in sorted(rmsvs)]);
legend();
title('ETKDG');
xlabel('RMSD')
ylabel('Number of structures');
In [249]:
figsize(18,6)
hist([values_there(rmsvs2[x]) for x in sorted(rmsvs2)],bins=20,label=[x for x in sorted(rmsvs2)]);
legend();
title('default DG');
xlabel('RMSD')
ylabel('Number of structures');
Another way of looking at that is with cumulative box plots similar to what was used in Fig 7 of the ETKDG paper. These show the fraction of structures that have a minimum RMSD less than a particular value.
In [250]:
def fig7_plot(rmsvs,nmols,title,blocks=(0.1,0.5,1.0,1.5,2.0)):
    pltVals=[]
    for thresh in blocks:
        tVals = []
        for k in sorted(rmsvs):
            tVals.append(len([1 for x in rmsvs[k] if x<thresh])/nmols)
        pltVals.append(tVals)
    N = len(rmsvs.keys())
    ind = np.arange(N)
    w=0.1
    figsize(18,6)
    fig,ax = subplots()
    rects=[]
    for i,nc in enumerate(rmsvs):
        rects.append(ax.bar(ind+i*w,[x[i] for x in pltVals],w))
    ax.set_xticks(ind+(w*len(blocks)/2))
    ax.set_xticklabels('<%.2f'%x for x in blocks)
    ax.set_ylabel('Fraction of structures')
    ax.set_xlabel('RMSD')
    ax.set_title(title)
    ax.legend((r[0] for r in rects),('%d'%k for k in sorted(rmsvs)))
    ax.set_ybound(lower=0,upper=1)
    ax.grid(axis='y')
In [251]:
fig7_plot(rmsvs,len(ms),'ETKDG');
In [252]:
fig7_plot(rmsvs2,len(ms),'DG');
Those show what we'd expect: ETKDG is significantly better than DG and adding conformations tends to help. With 50 conformations ETKDG finds a result within 1.0A of the crystal structure for about 70% of the structures.
David Koes made some interesting plots that looked at average RMSD in structures as a function of the number of rotatable bonds. Let's try a few of those.
In [151]:
nrots = [AllChem.CalcNumRotatableBonds(x) for x in ms_noh]
rms_by_nrot = defaultdict(lambda:defaultdict(list))
rms_by_nrot2 = defaultdict(lambda:defaultdict(list))
for i,nrot in enumerate(nrots):
    for k in rmsvs:
        rms_by_nrot[k][nrot].append(rmsvs[k][i])
        rms_by_nrot2[k][nrot].append(rmsvs2[k][i])
In [152]:
import numpy as np
counts = sorted([(x,len(values_there(rms_by_nrot[50][x]))) for x in rms_by_nrot[50]])
In [270]:
figsize(8,8)
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()

# integral = [counts[0][1]]
# for i in range(1,len(counts)): 
#     integral.append(counts[i][1]+integral[-1])
# integral = [x/len(ms_noh) for x in integral]
# ax2.plot([x for x,y in counts],[y for y in integral],linestyle='--',color='r')
# ax2.set_ylabel('frac structs')

ax2.plot([x for x,y in counts],[y for x,y in counts],linestyle='--', color='r')
ax2.set_ylabel('num structs')

for k in sorted(rms_by_nrot):
    means = [(x,np.average(values_there(rms_by_nrot[k][x]))) for x,_ in counts]
    means = sorted(means)
    ax1.plot([x for x,y in means],[y for x,y in means],label="ETKDG"+str(k))
for k in (5,50):
    means = [(x,np.average(values_there(rms_by_nrot2[k][x]))) for x,_ in counts]
    means = sorted(means)
    ax1.plot([x for x,y in means],[y for x,y in means],linestyle=':',label="DG"+str(k))

ax1.set_ylabel('mean(RMSD)');
ax1.set_xlabel('num_rotatable')
ax1.legend();
ax2.grid(axis='y')
ax1.grid(axis='x')
I've deviated a bit from the way David presents this by adding a line showing the number of structures in the dataset with the corresponding number of rotatable bonds (the red dashed line and the right-hand Y axis). There aren't a lot of structures with more than 10 rotatable bonds in the dataset.
From the graph we can see the expected result: mean(RMSD) increases steadily with the number of rotatable bonds and ETKDG is consistently better than DG.

Timing information

We know that the ETKDG procedure is more computationally intensive than DG. Let's look at how much of a difference it makes on this data set:
In [292]:
figsize(8,8)
ptimes = [x for i,x in enumerate(ts) if tms[i].GetNumConformers()]
ptimes2 = [x for i,x in enumerate(ts2) if tms[i].GetNumConformers()]
scatter(ptimes2,ptimes)
plot((0,12),(0,12),'r')
A = np.vstack([ptimes2, np.ones(len(ptimes2))]).T
m, c = np.linalg.lstsq(A, ptimes)[0]
print(m,c)
grid()
plot(ptimes2,m*np.array(ptimes2)+c,color='r',linestyle='--')             
xlabel('Time-DG (s)');
ylabel('Time-ETKDG (s)');
1.73472205238 0.0968224468535
So ETKDG takes, on average, 1.7 times as long.
A different look at timing: The fraction of structures that are completed in less than given time.
In [303]:
time_integration = defaultdict(int)
time_integration2 = defaultdict(int)
for tv in (0.1,.25, .5, 1, 1.5, 2, 3, 5,10,20):
    time_integration[tv] = len([1 for x in ts if x<=tv])
    time_integration2[tv] = len([1 for x in ts2 if x<=tv])
figsize(8,8)
plot(sorted(time_integration),[time_integration[x]/len(ms_noh) for x in sorted(time_integration)],label='ETKDG')
plot(sorted(time_integration),[time_integration2[x]/len(ms_noh) for x in sorted(time_integration)],label='DG')
grid()
xlabel('time (s)')
ylabel('fraction of structures')
legend()
xscale('log');
Interpretation of this: both ETKDG and DG are able to generate conformations (remember that the target number was 50 conformations with an RMS threshold of 0.5A between them) for >80% of the structures in less than 1 second. ETKDG completes ~85% and DG completes about 95%.
Look at how sensitive those times are to the number of rotatable bonds:
In [293]:
times_by_nrot = defaultdict(list)
times_by_nrot2 = defaultdict(list)
for i,nrot in enumerate(nrots):
    if not tms[i].GetNumConformers(): continue
    times_by_nrot[nrot].append(ts[i])
    times_by_nrot2[nrot].append(ts2[i])
In [294]:
figsize(8,8)
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()

ax2.plot([x for x,y in counts],[y for x,y in counts],linestyle='--', color='r')
ax2.set_ylabel('num structs')

means = [(x,np.average(times_by_nrot[x]),np.std(times_by_nrot[x])) for x,_ in counts]
means = sorted(means)
ax1.errorbar([x for x,y,z in means],[y for x,y,z in means],yerr=[z for x,y,z in means],label="ETKDG-50")
means = [(x,np.average(times_by_nrot2[x]),np.std(times_by_nrot2[x])) for x,_ in counts]
means = sorted(means)
ax1.errorbar([x for x,y,z in means],[y for x,y,z in means],yerr=[z for x,y,z in means],label="DG-50")

ax1.set_ylabel('time (s)');
ax1.set_xlabel('num_rotatable')
ax1.legend();
ax2.grid(axis='y')
ax1.grid(axis='x')
ax1.set_ybound(0,3)
Adding rotatable bonds causes conformation generation to take longer. This isn't a huge surprise.

Impact of force-field minimization

David also does an analysis of the impact of UFF minimization on the RMSD values. I'm not going to repeat exactly that, but it is worth seeing if minimizing the conformations improves the RMSDs. We'll do that here with UFF.
In [172]:
minms = [Chem.Mol(m) for m in tms]
mintimes = []
for i,m in enumerate(minms):
    t1 = time.time()
    if not m.GetNumConformers():
        continue
    AllChem.UFFOptimizeMoleculeConfs(m,numThreads=0)
    t2 = time.time()
    mintimes.append(t2-t1)
    if not i%500:
        print("Done:",i)
Done: 0
RDKit ERROR: [05:11:24] UFFTYPER: Unrecognized charge state for atom: 14
Done: 500
 ...
Done: 4000
RDKit ERROR: [05:26:56] UFFTYPER: Unrecognized charge state for atom: 13
RDKit ERROR: [05:27:20] UFFTYPER: Unrecognized charge state for atom: 12
Done: 4500
In [173]:
minms_noh = [Chem.RemoveHs(m) for m in minms]
In [174]:
with open('../data/platinum_dataset_2017_01.chiral_confs.minimized.pkl','wb+') as outf:
    pickle.dump((minms_noh,mintimes),outf)
In [175]:
rmsvs_min = defaultdict(list)
for m,tm in zip(ms_noh,minms_noh):
    best = 1e8
    cids = [x.GetId() for x in tm.GetConformers()]
    for i,cid in enumerate(cids):
        rms = AllChem.GetBestRMS(m,tm,probeConfId=cid)
        best = min(rms,best)
        if not i:
            best_1 = best
        if i<5:
            best_5 = best
        if i<20:
            best_20 = best
        if i<25:
            best_25 = best
    rmsvs_min[50].append(best)
    rmsvs_min[1].append(best_1)
    rmsvs_min[5].append(best_5)
    rmsvs_min[20].append(best_20)
    rmsvs_min[25].append(best_25)
In [176]:
rms_by_nrot_min = defaultdict(lambda:defaultdict(list))
for i,nrot in enumerate(nrots):
    for k in rmsvs_min:
        rms_by_nrot_min[k][nrot].append(rmsvs_min[k][i])
In [304]:
figsize(8,8)
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax2.plot([x for x,y in counts],[y for x,y in counts],linestyle='--', color='r')
ax2.set_ylabel('num structs')

for k in sorted(rms_by_nrot):
    means = [(x,np.average(values_there(rms_by_nrot[k][x]))) for x,_ in counts]
    means = sorted(means)
    ax1.plot([x for x,y in means],[y for x,y in means],label="ETKDG"+str(k))
for k in (1,5,25,50):
    means = [(x,np.average(values_there(rms_by_nrot_min[k][x]))) for x,_ in counts]
    means = sorted(means)
    ax1.plot([x for x,y in means],[y for x,y in means],linestyle=':',label="ETKDG_min"+str(k))

ax1.set_ylabel('mean(RMSD)');
ax1.set_xlabel('num_rotatable')
ax1.legend();
ax2.grid(axis='y')
ax1.grid(axis='x')
The impact is not large.
There's a lot more to be looked at here with the force fields, but this blog post is already getting fairly long, so I'm going to come back to that in a future post.