Saturday, February 22, 2014

RDKit Knime Workflows V: Building a predictive model

Another workflow from the RDKit workshop at the Knime UGM in Zürich

This time we're going to look at applying a machine-learning method to build a predictive model. We start with a dataset of 85K activity values measured against a panel of cytochrome P450s. The data come from ChEMBL, but the panel assay itself is from PubChem (http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=1851). The workflow starts with the extracted values (the metanode in the lower left can be used to extract the data from a local ChEMBL install), labels any row with AC50 < 10 µM as active (the definition used in the PubChem assay), and then, on the top branch, picks out the ~5700 measurements for CYP 2D6. After generating RDKit fingerprints, we build a random forest using Knime's very nice "Tree Ensemble Learner" node:
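
For anyone who'd rather follow along outside of Knime, here's a rough Python sketch of the same steps, with scikit-learn's RandomForestClassifier standing in for the Tree Ensemble Learner. The file name and column names here are invented for illustration; the real data are in the download linked below:

    import numpy as np
    import pandas as pd
    from rdkit import Chem
    from sklearn.ensemble import RandomForestClassifier

    # hypothetical extract of the panel data; the file and column names are invented
    data = pd.read_csv('cyp_panel_activities.csv')
    data = data[data['target'] == 'CYP 2D6']              # the ~5700 CYP 2D6 measurements
    labels = (data['ac50_uM'] < 10).astype(int).values     # active = AC50 < 10 uM

    # RDKit fingerprints, one bit vector per molecule, collected into a numpy matrix
    fps = np.array([list(Chem.RDKFingerprint(Chem.MolFromSmiles(smi)))
                    for smi in data['smiles']])

    # random forest with settings close to the Tree Ensemble Learner's:
    # 100 trees, max depth 15, sqrt(#features) tried at each split,
    # and out-of-bag bookkeeping turned on for the validation below
    rf = RandomForestClassifier(n_estimators=100, max_depth=15, max_features='sqrt',
                                oob_score=True, random_state=0)
    rf.fit(fps, labels)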

The rest of the top branch is concerned with evaluating the accuracy of the model. I'd normally also use some hold-out data to confirm accuracy, but in the interest of brevity here I use the out-of-bag predictions from the random-forest learner. In my experience these track reasonably well with hold-out data.
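
In scikit-learn terms, the out-of-bag predictions come from the oob_decision_function_ of a forest trained with oob_score=True; continuing the sketch above:

    # out-of-bag class probabilities: one row per molecule, columns = [inactive, active]
    oob_probs = rf.oob_decision_function_
    oob_pred = oob_probs.argmax(axis=1)        # OOB predicted class for each molecule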

The scoring node provides the first glimpse at model quality:
I've set the confusion matrix up so that the experimental values are in rows and the predictions are in columns. The node tells us that the model is overall 64% accurate with a Cohen's kappa value of 0.275; not particularly impressive. Looking at the confusion matrix, it appears that the errors are pretty well balanced across the predictions, so the model at least isn't doing anything overly strange.
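
The same statistics are straightforward to compute from the OOB predictions in the sketch:

    from sklearn.metrics import confusion_matrix, accuracy_score, cohen_kappa_score

    # rows are the experimental classes, columns are the predictions
    print(confusion_matrix(labels, oob_pred))
    print('accuracy:', accuracy_score(labels, oob_pred))
    print('kappa:', cohen_kappa_score(labels, oob_pred))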

Predicting CYP inhibition is not an easy problem, so it's not all that surprising that the model isn't doing a great job overall. It's still worth taking a look at a few alternate ways of evaluating model performance. One classic approach to this is using an ROC curve. We construct this by ranking the compounds by the model's predicted probability that each is a CYP 2D6 inhibitor and then plotting the fraction of inhibitors found against the fraction of non-inhibitors found:

The straight gray line in this plot shows what we'd expect from randomly picking compounds. The area under the ROC curve (AUROC) is given in the lower portion of the plot: 0.70. A perfect model gives AUROC=1.0 while random picking gives AUROC=0.5, so 0.70 is, once again, not great. However, if you look at the lower left part of the plot, you can see that the initial performance of the model is actually pretty good: when the model is confident in its predictions, it seems to be more accurate.
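
In the sketch, the curve and its area can be generated from the OOB probability of the active class:

    from sklearn.metrics import roc_curve, auc
    import matplotlib.pyplot as plt

    score = oob_probs[:, 1]                    # predicted probability of being a 2D6 inhibitor
    fpr, tpr, _ = roc_curve(labels, score)
    print('AUROC:', auc(fpr, tpr))

    plt.plot(fpr, tpr, label='random forest')
    plt.plot([0, 1], [0, 1], color='gray', label='random picking')
    plt.xlabel('fraction of non-inhibitors found')
    plt.ylabel('fraction of inhibitors found')
    plt.legend()
    plt.show()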

This leads us to a third evaluation strategy, looking at prediction accuracy as a function of prediction confidence. This is done inside the metanode on the third branch and then plotted using the very nice 2D/3D scatterplot node that's provided as part of the Erlwood Knime nodes:

This plots model accuracy versus prediction confidence; the points are sized by the number of predictions being made. At an 80% confidence level (where predictions are only generated for 1597 of the 5687 rows), the model accuracy is 81%. The same type of behavior is seen for Cohen's kappa (not shown here): at a confidence of 0.8, kappa is 0.6, quite a respectable value.
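
The equivalent calculation in the sketch uses the forest's winning-class probability as the confidence and only scores the rows at or above each threshold:

    confidence = oob_probs.max(axis=1)         # probability of the winning class
    for thresh in np.arange(0.5, 1.0, 0.05):
        mask = confidence >= thresh
        if not mask.any():
            break
        print('confidence >= %.2f: %4d predictions, accuracy %.2f' %
              (thresh, mask.sum(), accuracy_score(labels[mask], oob_pred[mask])))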

So we've built a model that isn't particularly impressive in its overall accuracy, but that does quite well when only high-confidence predictions are considered. Maybe the model is actually useful after all.

Until I figure out how to host these in a sensible way on a public Knime server, I'll just provide things via Dropbox links.
Here's the data directory that is used in this and all other workflows: https://db.tt/qSdJn0St
And here's the link to this workflow: https://db.tt/ZTYQvRXn

3 comments:

tantrev said...

I probably misunderstand the KNIME settings, but it looks like you're doing a 30% holdout with 100 trees generated with 15 max levels? I couldn't quite tell what the setting for the max number of features considered at each split was, but the default usually tends to be the square root of the number of supplied features for classification problems.

It might be interesting to see what effect changing some of these variables has on the classification accuracy. :)

greg landrum said...

Looks like you understood the settings perfectly. :-)
And, yes, it's picking from a random selection of the square root of the number of features at each node.

Other than the fingerprint, the only parameters here that I normally do much tweaking of are the number of trees and the max depth of the trees. I did spend some time exploring parameter space while putting this together, which is how I landed on a depth of 15.

Since I ended up using out-of-bag classification for validation, I really should have increased the number of trees: each final prediction is being made by, on average, only 70 trees. That feels a bit light to me.

Angus said...

Hi,

I tried to download this workflow from the provided link, but it seems to no longer work. Would you be able to provide it again?

Many thanks,

Angus