Wednesday, November 28, 2018

Working with unbalanced data, part I

If you are involved in building predictive models for biological end points in pharma research, you've certainly encountered the problem that our datasets tend to be very highly unbalanced - we usually have many, many more inactive molecules than active ones. The imbalance in the data can present challenges when evaluating model performance and "confuse" some of the machine-learning algorithms we use.
Let's start with evaluating models. Although at first glance a classifier (we'll focus on classifiers here) that shows 99% accuracy on hold-out data sounds pretty good, it's not going to be practically useful if the data set is 99% inactive and the model is just always predicting "inactive". There are a number of balanced accuracy metrics available to solve this problem; for this post we will use Cohen's kappa, which is readily available in scikit-learn. As an example of what a difference it makes, the normal accuracy for this confusion matrix:
[[400   0]
 [ 19   1]]
is 95.4% accurate but kappa, which can adopt values from 0 (bad) to 1 (good) is only 0.091. Paul Czodrowski has written an nice article discussing this in more detail that's well worth reading: Paul has also done a presentation summarizing this.
There are multiple strategies available for improving the performance of predictive models built on unbalanced data sets. Most of these focus on making the training set more balanced by either oversampling the active examples or undersampling the inactive examples. There's a broad literature on this and you can find plenty of other information online and if I end up having time I'll do a post looking at these approaches later. Here we're going to explore a different approach that I first learned about from Dean Abbott at a workshop he held at a KNIME User Group Meeting several years ago: adjusting the threshold a model uses to make its predictions. I've talked about this in a few venues but this is the first time that I've actually been somewhat systematic and written stuff down. There is also a body of literature available about the approach, which I will call adjusting the decision threshold, but since I've never formally written this up, I haven't yet done a real literature review. It looks like this article from Provost and Fawcett is one of the first publications:
Most classification algorithms can provide predicted probabilities for each class; the actual class predicted for any given new instance is the one which has the highest predicted probability. For a two-class problem this means that the class with a predicted probability of more than 0.5 is the final prediction. The key idea I learned from Dean (and that you can find in the literature) is that you often get much better balanced accuracy (as measured by things like Cohen's kappa) on unbalanced datasets by shifting the decision boundary and allowing the model to predict the minority class even when the predicted probability of that class is not the highest. To make that concrete: if I'm building a model to distinguish actives from inactives based on a training set where there are far more inactives than actives, I may benefit by assigning "active" to data points where the predicted probability of being active is smaller, often considerably smaller, than 0.5. For example, if I set the decision threshold for the model that produced the confusion matrix above to 0.1, I get the following confusion matrix:
[[384  16]
 [  6  14]]
This model is clearly producing far more false positives (16 instead of none) but finds far more of the true positives and has a kappa of 0.533. That's not great, but it is significantly better than the original 0.091.
It's possible to imagine many different approaches for picking the "best" decision threshold for a model as well as multiple definitions of what "best" means at all. The lower the threshold is set, the more true actives that will be retrieved, together with more false positives. Since I wanted to consider a large number of models here, I chose an automated approach: I try a number of different decision thresholds and calculate kappa using the out-of-bag prediction probabilities for the training set; the threshold that produces the best kappa value is picked as the best.
To demonstrate why this method works and when it doesn't work particularly well, here are two more examples taken from the ChEMBL datasets presented below; both are PubChem qHTS confirmatory assays. In the first example, assay ID CHEMBL1614421,the balancing approach works well: we start with a kappa of 0.033 on the holdout data, but by using the OOB predictions to pick a threshold of 0.200 we arrive at a significantly better kappa of 0.451.
assay_id CHEMBL1614421, description: PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation, Thioflavin T Binding. (Class of assay: confirmatory) [Related pubchem assays: 596 ]
--------- Default -----------
ratio: 0.129 kappa: 0.033, AUC: 0.860, OOB_AUC: 0.850
[[8681    4]
 [1101   22]]
--------- Balanced -----------
thresh: 0.200 kappa: 0.451
[[8261  424]
 [ 595  528]]
For assay CHEMBL1614249, on the other hand, kappa starts at 0.000 (no compounds from the holdout set are classified as active) and the best the balancing procedure (again based on OOB predictions) can do is a kappa of 0.102 when the threshold is set to 0.050.
assay_id CHEMBL1614249, description: PUBCHEM_BIOASSAY: qHTS Assay for Identification of Novel General Anesthetics. In this assay, a GABAergic mimetic model system, apoferritin and a profluorescent 1-aminoanthracene ligand (1-AMA), was used to construct a competitive binding assay for identification of novel general anesthetics (Class of assay: confirmatory) [Related pubchem assays: 2385 (Probe Development Summary for Identification of Novel General Anesthetics), 2323 (Validation apoferritin assay run on SigmaAldrich LOPAC1280 collection)]
--------- Default -----------
ratio: 0.006 kappa: 0.000, AUC: 0.725, OOB_AUC: 0.741
[[8356    0]
 [  51    0]]
--------- Balanced -----------
thresh: 0.050 kappa: 0.102
[[8302   54]
 [  45    6]]
What's the difference here? Why does the balancing procedure produce reasonable results for one of the datasets and poor quality results for the other? A quite obvious difference between the two models can be seen in the ROC curves generated for the OOB predictions on the training sets. Note that these curves, which are generated solely using the predicted probabilities of being active, are not sensitive to the decision threshold.
The ROC curve for assay CHEMBL1614249 shows significantly better early enrichment than that for CHEMBL1614421 and an overall better AUC. The nice-looking ROC curve indicates that the model for CHEMBL1614249 is actually pretty good at distinguishing actives from inactives, adjusting the decision threshold allows us to take advantage of this.
In order to try and determine how general the method is, I've applied the same balancing procedure to a collection of diverse datasets pulled from ChEMBL:
  1. 13 Ki datasets for Serotonin receptors. download
  2. The 80 "Dataset 1" datasets from our benchmarking set. download
  3. 8 large PubChem qHTS confirmatory assays. download
  4. 44 DrugMatrix assays. download
Here's a summary graphic plotting the balanced kappa values (obtained by shifting the decision threshold) against the original kappa values for all the data sets. The points are colored by the out-of-bag AUROC. You can see that the balancing procedure results in better kappa values, often dramatically better values, in almost all cases and that the datasets with low AUROC values tend to show less improvement.

The full details, along with the code to run the experiment, are in the jupyter notebook associated with this post, view it here. Since a couple of the datasets are a bit large, I haven't checked them all into github with this jupyter notebook, there are links above to download the data. I'm also happy to provide the KNIME workflows that I used to construct the datasets.
I plan a couple more blog posts in this series. One will show how to adjust the decision threshold when making predictions within KNIME, one will look at using machine learning methods other than random forests, and I'd also like to include a comparison to the results obtained using standard resampling methods.
I think this approach should be a standard part of our toolbox when building predictive models for unbalanced datasets. What do you think?

Tuesday, October 16, 2018

Using the new fingerprint bit rendering code


A frequent desire when working with chemical fingerprints is to see what the indvidual bits actually "mean". For most of the fingerprinting algorithms in the RDKit this question only means something in the context of a molecule, so there have been a number of code samples floating around (including in the RDKit documentation) allowing the user to show the atoms/bonds in a molecule that are responsible for setting a particular bit. In the 2018.09 release we made this easier and added a few functions to the rdkit.Chem.Draw package that allow direct visualization of the atoms and bonds from a molecule that set bits from Morgan and RDKit fingerprints.
This notebook shows how to use this functionality. Note that since this is new code, it will only work in RDKit versions 2018.09 and later.
The code that is used for this is derived from the bit rendering code that Nadine Schneider wrote for CheTo.
Start by importing the required packages:
In [1]:
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
We'll use epinephrine as the demo molecule:
In [2]:
epinephrine = Chem.MolFromSmiles('CNC[C@H](O)c1ccc(O)c(O)c1')

Looking at Morgan bits

Generate a Morgan fingerprint and save information about the bits that are set using the bitInfo argument:
In [3]:
bi = {}
fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(epinephrine, radius=2, bitInfo=bi)
# show 10 of the set bits:
[1, 80, 227, 315, 589, 606, 632, 807, 875, 1057]
In its simplest form, the new code lets you display the atomic environment that sets a particular bit. Here we will look at bit 589:
In [4]:
There's also a function, DrawMorganBits(), for drawing multiple bits at once (thanks to Pat Walters for suggesting this one):
In [5]:
tpls = [(epinephrine,x,bi) for x in fp.GetOnBits()]
Draw.DrawMorganBits(tpls[:12],molsPerRow=4,legends=[str(x) for x in fp.GetOnBits()][:12])
Some notes about the rendering here:
  • The molecule fragment is drawn with the atoms in the same positions as in the original molecule.
  • The central atom is highlighted in blue.
  • Aromatic atoms are highlighted in yellow
  • Aliphatic ring atoms (none present here) are highlighted in dark gray
  • Atoms/bonds that are drawn in light gray indicate pieces of the structure that influence the atoms' connectivity invariants but that are not directly part of the fingerprint.
We can also take advantage of the interactivity features of the Jupyter notebook and create a widget allowing is to browse through the bits. Note that this does not work in the blog post or the github view of the notebook; you will need to download the notebook and try it yourself to see this interactivity.
In [6]:
from ipywidgets import interact,fixed,IntSlider
def renderFpBit(mol,bitIdx,bitInfo,fn):
    bid = bitIdx
In [8]:
interact(renderFpBit, bitIdx=list(bi.keys()),mol=fixed(epinephrine),
This interactivity does not work in the blogger view, so here's an animated GIF of what it looks like:

Looking at RDKit bits

We can do the same thing with RDKit bits:
In [9]:
rdkbi = {}
rdkfp = Chem.RDKFingerprint(epinephrine, maxPath=5, bitInfo=rdkbi)
# show 10 of the set bits:
[93, 103, 112, 122, 148, 149, 161, 166, 194, 208]
In [10]:
128 128
In [11]:
In [12]:
tpls = [(epinephrine,x,rdkbi) for x in rdkbi]
Draw.DrawRDKitBits(tpls[:12],molsPerRow=4,legends=[str(x) for x in rdkbi][:12])
Some notes about the rendering here:
  • The molecule fragment is drawn with the atoms in the same positions as in the original molecule.
  • All bonds that are drawn are part of the bit.
  • Aromatic atoms are highlighted in yellow
  • By default there is no additional highlighting for aliphatic ring atoms, but this can be changed using the nonAromaticColor optional argument.
We can, of course, use the same jupyter interactivity features here as we did for the Morgan fingerprint:
In [13]:
interact(renderFpBit, bitIdx=list(rdkbi.keys()),mol=fixed(epinephrine),
This interactivity does not work in the blogger view, so here's an animated GIF of what it looks like: