
Wednesday, August 9, 2017

Chemical Topic Modeling with the RDKit and KNIME

We recently published a paper on the application of topic modeling, a method developed in the text mining community, to chemical data: http://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00249. Here I'm going to show how to use this approach inside of KNIME. I'm really pleased with how the paper turned out and think the approach is a really useful one for efficiently organizing chemical datasets. We've got a bunch of cool ideas about what to do next too, so keep your eyes open...

An aside/apology: while doing the literature search, we somehow completely missed Rajarshi's blog post from 2012 (http://blog.rguha.net/?p=997). This is really embarrassing. Sorry Rajarshi...

Since we continue to work on the code that Nadine wrote while doing the research, called CheTo (for ChemicalTopic), Nadine and I have put it on GitHub (https://github.com/rdkit/CheTo). We'd also like to make it easy for other people to use the code, so we built a conda package for it and added it to the RDKit channel. If you're using the Anaconda Python distribution (and you should be!), you can install the package in your conda environment with a single command: conda install -c rdkit cheto. If you don't already have the RDKit installed, it will automatically be installed for you. We'll be updating the git repository in the coming weeks with more information about, and examples of, how to use the CheTo python code. This blog post, though, is about using it from KNIME.
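For the curious, the core idea is simple enough to sketch outside of CheTo. The snippet below is not the CheTo API (see the GitHub repo for that); it just illustrates the underlying approach from the paper, treating Morgan fingerprint bits as the "words" of each molecule and running scikit-learn's LDA over the resulting document-term matrix. The molecules are toy examples.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.decomposition import LatentDirichletAllocation

# toy molecules standing in for a real dataset
smiles = ["c1ccccc1O", "c1ccccc1N", "CCO", "CCN", "CCOC", "CCNC"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# document-term matrix: one row per molecule, one column per fingerprint bit
n_bits = 1024
X = np.zeros((len(mols), n_bits), dtype=int)
for i, m in enumerate(mols):
    fp = AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=n_bits)
    for bit in fp.GetOnBits():
        X[i, bit] = 1

# fit an LDA "topic model" over the fingerprint-bit "words"
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # one row per molecule; each row sums to 1
```

The per-molecule topic weights in doc_topics are what drives the molecule-to-topic assignment you'll see in the KNIME output below.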

Let's start with the pre-requisites. You need an installation of at least v3.4.0 of KNIME (released July 2017). That installation should have the KNIME text mining extensions and the Python Integration version that supports Python 2 and Python 3. At the time of writing these are both in KNIME Labs. It's not a bad idea to have the RDKit nodes installed too (these are available in the KNIME Community Extensions section in the "Install KNIME Extensions" dialog). You also need to have the Python extension properly configured; I covered this in a post on the KNIME blog: https://www.knime.com/blog/setting-up-the-knime-python-extension-revisited-for-python-30-and-20. The conda environment you are using from KNIME should have both the RDKit and CheTo installed from the rdkit channel (see the CheTo installation instructions above).

phew... now we're ready to go. Here's a screenshot of the sample workflow for doing chemical topic modeling:

The table reader at the beginning brings in a set of a couple hundred molecules taken from 12 ChEMBL documents. The real work is done in the "fingerprint and do LDA" wrapped metanode, which expects an input table that has a column named "smiles" that contains SMILES. We won't get into the details of the contents of this node here, but if you configure the node (double click or right click and "configure") you'll get a dialog which allows you to change the important parameters:
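To give a feel for the boundary between KNIME and Python here: a Python Script node inside a metanode like this sees its input as a pandas DataFrame named input_table (in the 2017-era integration) and hands back output_table. The sketch below, with a stand-in DataFrame in place of KNIME's real input, just shows the "smiles" column contract being honored; the actual fingerprinting/LDA logic lives inside the metanode.

```python
import pandas as pd
from rdkit import Chem

# stand-in for the DataFrame that KNIME passes in as input_table
input_table = pd.DataFrame({"smiles": ["c1ccccc1C(=O)O", "CCOC(C)=O"]})

# parse the required "smiles" column and drop anything that fails to parse
mols = [Chem.MolFromSmiles(s) for s in input_table["smiles"]]
valid = [m is not None for m in mols]
output_table = input_table[valid].copy()
output_table["canonical_smiles"] = [
    Chem.MolToSmiles(m) for m, ok in zip(mols, valid) if ok]
```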


Executing the node, which can take a while since it's not currently very well optimized, gives you two tables. The first has the individual molecules assigned to topics:

and the second has the bits that define the topics themselves, including Nadine's very nice depictions of the fingerprint bits:


The GroupBy nodes provide a summary of the number of documents each topic shows up in as well as the number of topics that are identified in each document. This last was one of the validation metrics that we used in the paper; here's what we get with the sample data set:
You can see that the majority of the documents contain compounds that are assigned to a single topic, while a few documents contain compounds assigned to two topics and one, doc_id 44596, has compounds from three topics.

There's a lot more detail in the paper about what this all means and what you might do with it; the goal for this post was to provide a very quick overview of how to do the analysis and look at the results inside of KNIME. I hope I was successful with that.

The workflow itself is on the KNIME public examples server; you can find it in KNIME by logging into the examples server and navigating to Examples/99_Community/03_RDKit/08_Chemical_Topic_Modeling:



Saturday, February 22, 2014

RDKit Knime Workflows V: Building a predictive model

Another workflow from the RDKit workshop at the Knime UGM in Zürich

This time we're going to look at applying a machine-learning method to building predictive models. We start with a dataset containing 85K activity values measured against a panel of cytochrome P450s. The data come from ChEMBL, but the panel assay itself is from PubChem (http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=1851). The workflow starts with the extracted values (the metanode in the lower left can be used to extract the data from a local ChEMBL install), labels any rows with AC50 < 10uM as active (this is the definition used in the PubChem assay), and then, on the top branch, picks out the ~5700 measurements for CYP 2D6. After generating RDKit fingerprints, we build a random forest using Knime's very nice "Tree Ensemble Learner" node:

The rest of the top branch is concerned with evaluating the accuracy of the model. I'd normally also use some hold-out data to confirm accuracy, but in the interest of brevity here I use the out-of-bag predictions from the random-forest learner. In my experience these track reasonably well with hold-out data.
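For readers who prefer scripts: roughly the same fingerprint + random forest + out-of-bag setup can be sketched with the RDKit and scikit-learn. The molecules and labels below are toy stand-ins, not the ChEMBL/PubChem data, and the Tree Ensemble Learner's exact settings aren't reproduced.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

# toy molecules and activity labels standing in for the CYP 2D6 data
smiles = ["CCO", "CCN", "c1ccccc1", "c1ccccc1O",
          "CCCC", "CCCO", "CCCN", "c1ccncc1"]
labels = [0, 1, 0, 1, 0, 1, 1, 0]

def fp_array(smi, n_bits=2048):
    """Morgan fingerprint as a numpy array."""
    fp = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(smi), 2, nBits=n_bits)
    arr = np.zeros((n_bits,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

X = np.array([fp_array(s) for s in smiles])
# oob_score=True gives the out-of-bag estimate used here in place of hold-out data
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, labels)
oob_probs = rf.oob_decision_function_  # per-class out-of-bag probabilities
```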

The scoring node provides the first glimpse at model quality:
I've set the confusion matrix up so that the experimental values are in rows and the predictions are in columns. The node tells us that the model is overall 64% accurate with a Cohen's kappa value of 0.275; not particularly impressive. Looking at the confusion matrix, it appears that the errors are pretty well balanced across the predictions, so the model at least isn't doing anything overly strange.

Predicting CYP inhibition is not an easy problem, so it's not all that surprising that the model isn't doing a great job overall. It's still worth taking a look at a few alternate ways of evaluating model performance. One classic approach to this is using an ROC curve. We construct this by ranking the compounds by the model's predicted probability that each is a CYP 2D6 inhibitor and then plotting the fraction of inhibitors found against the fraction of non-inhibitors found:

The straight gray line in this plot shows what we'd expect by randomly picking compounds. The area under the ROC curve (AUROC) is given at the lower portion of the plot: 0.70. A perfect model will give AUROC=1.0 while random picking gives AUROC=0.5. AUROC=0.7 is, once again, not great. However, if you look at the lower left part of the plot, you can see that the initial performance of the model is actually pretty good: when the model is confident in its prediction it seems to be more accurate.
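The same curve is easy to reproduce in a script from the model's predicted probabilities; the numbers below are made up for illustration.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# toy labels and predicted P(inhibitor) values, sorted here for readability
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

# fraction of non-inhibitors found (fpr) vs fraction of inhibitors found (tpr)
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auroc = roc_auc_score(y_true, y_prob)  # 1.0 = perfect, 0.5 = random picking
```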

This leads us to a third evaluation strategy: looking at prediction accuracy as a function of prediction confidence. This is done inside the metanode on the third branch and then plotted using the very nice 2D/3D scatterplot node that's provided as part of the Erlwood Knime nodes:

This plots model accuracy versus prediction confidence; the points are sized by the number of predictions being made. At the 80% confidence level (only generating predictions for 1597 of the 5687 rows), the model accuracy is 81%. The same type of behavior is seen for Cohen's kappa (not shown here): at a confidence of 0.8 kappa is 0.6, quite a respectable value.
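The calculation behind that plot is simple: call a prediction "confident" when the model's top class probability clears a threshold, and only score those rows. With toy numbers:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0])
proba = np.array([[0.10, 0.90],   # per-row class probabilities [P(inactive), P(active)]
                  [0.85, 0.15],
                  [0.45, 0.55],
                  [0.30, 0.70],
                  [0.60, 0.40],
                  [0.95, 0.05]])

confidence = proba.max(axis=1)    # how sure the model is for each row
predicted = proba.argmax(axis=1)  # which class it picks

threshold = 0.8
keep = confidence >= threshold    # only the confident predictions
n_kept = int(keep.sum())
accuracy = float((predicted[keep] == y_true[keep]).mean())
```

Sweeping threshold over a range of values and recording n_kept and accuracy at each step gives the accuracy-vs-confidence curve.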

So we've built a model that isn't particularly impressive in its overall accuracy, but that does quite well when only high-confidence predictions are considered. Maybe the model is actually useful after all.

Until I figure out how to host these in a sensible way on a public knime server, I'll just provide things via dropbox links.
Here's the data directory that is used in this and all other workflows: https://db.tt/qSdJn0St
And here's the link to this workflow: https://db.tt/ZTYQvRXn

RDKit Knime Workflows IV: Rooted fingerprints

Another workflow from the RDKit workshop at the Knime UGM in Zürich

This workflow shows how to use the RDKit's rooted fingerprints. The idea of a rooted fingerprint is to allow calculation of a molecular fingerprint that only includes bits that are rooted at particular atoms. In effect, we're only including information about specific pieces of the molecule in the fingerprint.

The specific application in this workflow is to use rooted torsion fingerprints to pick sets of molecules with diverse F environments and diverse CF3 environments. We published the idea for this a few years ago here: Vulpetti, A., Hommel, U., Landrum, G., Lewis, R. & Dalvit, C. "Design and NMR-Based Screening of LEF, a Library of Chemical Fragments with Different Local Environment of Fluorine." J. Am. Chem. Soc. 131, 12949–12959 (2009) http://pubs.acs.org/doi/abs/10.1021/ja905207t.

Here's the full workflow:

The metanode in the middle identifies and labels molecules that have either a single -F or a single -CF3. For the molecules with a single -F (top branch from the metanode) we identify the F atoms using a Substructure Filter node:

And use those labels for the rooted fingerprints:

These are provided as input to the standard RDKit diversity picker node. Looking at the molecules picked shows that the F environments are, indeed, diverse (though the overall molecule diversity isn't that high):
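Outside of Knime, the rooted-fingerprint trick can be sketched with the RDKit's fromAtoms argument: pass the matched F atom indices and only torsions involving those atoms contribute bits. The molecule below is just an example.

```python
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

mol = Chem.MolFromSmiles("Fc1ccc(CCO)cc1")

# find the F atoms, as the Substructure Filter node does in the workflow
f_atoms = [match[0] for match in
           mol.GetSubstructMatches(Chem.MolFromSmarts("[F]"))]

# torsion fingerprint rooted at the F atoms vs. the full fingerprint
rooted_fp = rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect(
    mol, nBits=2048, fromAtoms=f_atoms)
full_fp = rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect(
    mol, nBits=2048)
# the rooted version only describes the local environment of the F
```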


Until I figure out how to host these in a sensible way on a public knime server, I'll just provide things via dropbox links.
Here's the data directory that is used in this and all other workflows: https://db.tt/qSdJn0St
And here's the link to this workflow: https://db.tt/Y7MCj2BK

Thursday, February 20, 2014

RDKit Knime Workflows III: Flexible alignment

Another workflow from the RDKit workshop at the Knime UGM in Zürich

This workflow demonstrates a simple (and pretty standard) flexible molecular alignment workflow: diverse conformers are generated for a set of database molecules which are then aligned to a query molecule. The highest scoring conformer for each database molecule is kept. 

Conformer generation is done using the RDKit distance geometry implementation. The resulting conformers are cleaned up with the MMFF94 force field and then an RMS filtering step is applied to remove redundant conformers. The meat of the workflow is mostly hidden in the metanodes, so I set up some quickforms to make configuration easy. Here's the configure dialog for the "Generate Diverse Conformers" metanode:
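In script form, the steps hidden in that metanode look roughly like this (the molecule and parameter values are illustrative, not the quickform defaults):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# distance-geometry embedding with RMS-based pruning of redundant conformers
mol = Chem.AddHs(Chem.MolFromSmiles("CCCCc1ccccc1"))
conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=20, randomSeed=42,
                                      pruneRmsThresh=0.5)

# clean up each conformer with the MMFF94 force field
results = AllChem.MMFFOptimizeMoleculeConfs(mol)  # (converged flag, energy) pairs
```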

Neither the RDKit nor Knime itself provides a 3D molecule visualizer, so you'll need to install something else to see the results. Here I used the Interactive BioSolveIT Viewer, which is freely available from their website; there's also a link in the workflow. Here are a couple of the alignments that were generated; the query molecule is shown in green and the database molecule in gray.


You can see that decent conformers are being generated and that the alignments are quite good.

Until I figure out how to host these in a sensible way on a public knime server, I'll just provide things via dropbox links.
Here's the data directory that is used in this and all other workflows: https://db.tt/qSdJn0St
And here's the link to this workflow: https://db.tt/4WOxcXWi

Wednesday, February 19, 2014

RDKit Knime Workflows II: Reaction enumeration

Another workflow from the RDKit workshop at the Knime UGM in Zürich

This one deals with reaction-based library enumeration. It starts by reading in a set of compounds from the Zinc BB set, filters those by properties, and then uses the RDKit Functional Group Filter node to pick subsets that are compatible with the reaction. After some final randomization, the library is enumerated using an RDKit Two Component Reaction node to attach the building blocks to a core.
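At the script level, the enumeration step boils down to a reaction SMARTS applied to pairs of building blocks. The amide coupling below is illustrative; it is not the reaction used in the workflow.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# illustrative two-component reaction: acid + amine -> amide
rxn = AllChem.ReactionFromSmarts("[C:1](=[O:2])[OH].[N:3]>>[C:1](=[O:2])[N:3]")
acids = [Chem.MolFromSmiles(s) for s in ("CC(=O)O", "OC(=O)c1ccccc1")]
amines = [Chem.MolFromSmiles(s) for s in ("NCC", "NC1CC1")]

# enumerate every acid x amine combination
products = []
for acid in acids:
    for amine in amines:
        for prods in rxn.RunReactants((acid, amine)):
            product = prods[0]
            Chem.SanitizeMol(product)
            products.append(Chem.MolToSmiles(product))
```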

The results look like this:

Since the molecules share a common core, I've taken the additional step of generating 2D coordinates where those cores are aligned in order to make it easier to see how the products are related.
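That depiction trick is available from Python too, via GenerateDepictionMatching2DStructure; the core and products below are made-up examples:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# the shared core, with fixed 2D coordinates
core = Chem.MolFromSmiles("NC(=O)c1ccccc1")
AllChem.Compute2DCoords(core)

products = [Chem.MolFromSmiles(s) for s in
            ("CCNC(=O)c1ccccc1", "OCCNC(=O)c1ccccc1")]
for p in products:
    # lay out each product so the core atoms get the core's coordinates
    AllChem.GenerateDepictionMatching2DStructure(p, core)
```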

Until I figure out how to host these in a sensible way on a public knime server, I'll just provide things via dropbox links.
Here's the data directory that is used in this and all other workflows: https://db.tt/qSdJn0St
And here's the link to this workflow: https://db.tt/XZeTRwsi

Tuesday, February 18, 2014

RDKit Knime Workflows I

Things have been quiet lately since I spent a fair amount of time getting ready for last week's Knime UGM in Zürich. On Friday I did a 90 minute workshop that introduced some (I think) interesting and useful functionality that is available via the RDKit Knime nodes.

Over the next few days I will post slightly cleaned up versions of those workflows here.

Here's the first example, a workflow that uses some computed property filters and the PAINS substructure patterns to clean up a set of compounds:
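For those who prefer scripts: current RDKit versions ship the PAINS patterns as a FilterCatalog, so the same sort of cleanup can be sketched like this (the property cutoff and molecules are illustrative):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

# build a catalog holding the PAINS substructure patterns
params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
pains = FilterCatalog(params)

smiles = ["CCO", "Oc1ccccc1O"]  # the second (catechol) is a classic PAINS motif
keep = []
for smi in smiles:
    mol = Chem.MolFromSmiles(smi)
    if Descriptors.MolWt(mol) > 600:  # simple computed-property filter
        continue
    if pains.HasMatch(mol):           # PAINS substructure filter
        continue
    keep.append(smi)
```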



In addition to the always useful descriptor calculator and substructure counter nodes, this also uses the RDKit Interactive Table, which has the nice feature of being able to display molecule renderings in column headers:



Until I figure out how to host these in a sensible way on a public knime server, I'll just provide things via dropbox links.
Here's the data directory that is used in this and all other workflows: https://db.tt/qSdJn0St
And here's the link to this workflow: https://db.tt/crzUy5i1