Wednesday, August 9, 2017

Chemical Topic Modeling with the RDKit and KNIME

We recently published a paper on the application of topic modeling, a method developed in the text mining community, to chemical data: Here I'm going to show how to use this approach inside of KNIME. I'm really pleased with how the paper turned out and think the approach is a really useful one for efficiently organizing chemical datasets. We've got a bunch of cool ideas about what to do next too, so keep your eyes open...

An aside/apology: while doing the literature search, we somehow completely missed Rajarshi's blog post from 2012 ( This is really embarrassing. Sorry Rajarshi...

Since we continue to work on the code that Nadine wrote while doing the research, called CheTo (for ChemicalTopic), Nadine and I have put it on GitHub ( We'd also like to make it easy for other people to use the code, so we built a conda package for it and added it to the RDKit channel. If you're using the Anaconda Python distribution (and you should be!), you can install the package in your conda enviroment with a single command conda install -c rdkit cheto. If you don't already have the RDKit installed, it will automatically be installed for you. We'll be updating the git repository in the coming weeks to provide more information about and examples of how to use the CheTo python code. This blog post, though, is about using it from KNIME.

Let's start with the pre-requisites. You need an installation of at least v3.4.0 of KNIME (released July 2017). That installation should have the KNIME text mining extensions and the Python Integration version that supports Python 2 and Python 3. At the time of writing these are both in KNIME Labs. It's not a bad idea to have the RDKit nodes installed too (these are available in the KNIME Community Extensions section in the "Install KNIME Extensions" dialog). You also need to have the Python extension properly configured, I covered this in a post on the KNIME blog: The condo environment you are using from KNIME should have both the RDKit and CheTo installed from the rdkit channel (see the CheTo installation instructions above).

phew... now we're ready to go. Here's a screenshot of the sample workflow for doing chemical topic modeling:

The table reader at the beginning brings in a set of a couple hundred molecules taken from 12 ChEMBL documents. The real work is done in the "fingerprint and do LDA" wrapped metanode, which expects an input table that has a column named "smiles" that contains SMILES. We won't get into the details of the contents of this node here, but if you configure the node (double click or right click and "configure") you'll get a dialog which allows you to change the important parameters:

Executing the node, which can take a while since it's not currently very well optimized, gives you two tables. The first has the individual molecules assigned to topics:

and the second has the bits that define the topics themselves,  including Nadine's very nice depictions of the fingerprint bits:

The GroupBy nodes provide a summary of the number of documents each topic shows up in as well as the number of topics that are identified in each document. This last was one of the validation metrics that we used in the paper; here's what we get with the sample data set:
You can see that the majority of the documents contain compounds that are assigned to a single topic, while a few documents contain compounds assigned to two topics and one, doc_id 44596, has compounds from three topics.

There's a lot more detail in the paper about what this all means and what you might do with it; the goal for this post was to provide a very quick overview of how to do the analysis and look at the results inside of KNIME. I hope I was successful with that.

The workflow itself is on the KNIME public examples server, you find that in KNIME by logging into the examples server and then navigating to Examples/99_Community/03_RDKit/08_Chemical_Topic_Modeling: