## Wednesday, November 28, 2018

### Working with unbalanced data, part I

If you are involved in building predictive models for biological end points in pharma research, you've certainly encountered the problem that our datasets tend to be very highly unbalanced - we usually have many, many more inactive molecules than active ones. The imbalance in the data can present challenges when evaluating model performance and "confuse" some of the machine-learning algorithms we use.
Let's start with evaluating models. Although at first glance a classifier (we'll focus on classifiers here) that shows 99% accuracy on hold-out data sounds pretty good, it's not going to be practically useful if the data set is 99% inactive and the model is just always predicting "inactive". There are a number of balanced accuracy metrics available to solve this problem; for this post we will use Cohen's kappa, which is readily available in scikit-learn. As an example of what a difference it makes, the normal accuracy for this confusion matrix:
```
[[400   0]
 [ 19   1]]
```
the accuracy is 95.4%, but kappa, which ranges from 0 (no better than chance agreement) to 1 (perfect agreement), is only 0.091. Paul Czodrowski has written a nice article discussing this in more detail that's well worth reading: https://link.springer.com/article/10.1007/s10822-014-9759-6. Paul has also given a presentation summarizing it.
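As a sanity check, both numbers can be reproduced with scikit-learn by reconstructing labels that match the confusion matrix above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

# reconstruct labels matching the confusion matrix [[400 0] [19 1]]:
# 400 true negatives, 0 false positives, 19 false negatives, 1 true positive
y_true = np.array([0] * 400 + [1] * 20)
y_pred = np.array([0] * 400 + [0] * 19 + [1] * 1)

print(round(accuracy_score(y_true, y_pred), 3))    # 0.955
print(round(cohen_kappa_score(y_true, y_pred), 3))  # 0.091
```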
Most classification algorithms can provide predicted probabilities for each class; the actual class predicted for any given new instance is the one which has the highest predicted probability. For a two-class problem this means that the class with a predicted probability of more than 0.5 is the final prediction. The key idea I learned from Dean (and that you can find in the literature) is that you often get much better balanced accuracy (as measured by things like Cohen's kappa) on unbalanced datasets by shifting the decision boundary and allowing the model to predict the minority class even when the predicted probability of that class is not the highest. To make that concrete: if I'm building a model to distinguish actives from inactives based on a training set where there are far more inactives than actives, I may benefit by assigning "active" to data points where the predicted probability of being active is smaller, often considerably smaller, than 0.5. For example, if I set the decision threshold for the model that produced the confusion matrix above to 0.1, I get the following confusion matrix:
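A minimal sketch of the threshold shift itself; the probabilities below are made up for illustration, not from a real model:

```python
import numpy as np

# hypothetical predicted probabilities of the "active" class
active_probs = np.array([0.05, 0.12, 0.30, 0.55, 0.08])

default_preds = (active_probs >= 0.5).astype(int)  # standard decision boundary
shifted_preds = (active_probs >= 0.1).astype(int)  # boundary shifted to 0.1

print(default_preds.tolist())  # [0, 0, 0, 1, 0]
print(shifted_preds.tolist())  # [0, 1, 1, 1, 0]
```

With the boundary at 0.1, three compounds are called active instead of one: we pick up more of the minority class at the cost of some false positives.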
```
[[384  16]
 [  6  14]]
```
This model clearly produces far more false positives (16 instead of none), but it finds far more of the true positives (14 instead of 1) and has a kappa of 0.533. That's not great, but it is a significant improvement over the original 0.091.
It's possible to imagine many different approaches for picking the "best" decision threshold for a model, as well as multiple definitions of what "best" means. The lower the threshold is set, the more true actives are retrieved, but at the cost of more false positives. Since I wanted to consider a large number of models here, I chose an automated approach: I try a number of different decision thresholds, calculate kappa for each using the out-of-bag prediction probabilities for the training set, and pick the threshold that produces the best kappa value.
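Here's a minimal sketch of that automated scan using scikit-learn's out-of-bag probabilities; the dataset and parameter values are made up for illustration and are not the ChEMBL data discussed below:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score

# toy unbalanced dataset: roughly 5% actives (illustrative only)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

# out-of-bag predicted probability of the active (minority) class
oob_probs = rf.oob_decision_function_[:, 1]

# scan candidate thresholds and keep the one with the best OOB kappa
best_thresh, best_kappa = 0.5, -1.0
for thresh in np.arange(0.05, 0.55, 0.05):
    kappa = cohen_kappa_score(y, (oob_probs >= thresh).astype(int))
    if kappa > best_kappa:
        best_thresh, best_kappa = thresh, kappa

print(best_thresh, best_kappa)
```

Because the threshold is chosen from out-of-bag predictions, the hold-out data is never touched during the selection.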
To demonstrate why this method works and when it doesn't work particularly well, here are two more examples taken from the ChEMBL datasets presented below; both are PubChem qHTS confirmatory assays. In the first example, assay ID CHEMBL1614421, the balancing approach works well: we start with a kappa of 0.033 on the holdout data, but by using the OOB predictions to pick a threshold of 0.200 we arrive at a significantly better kappa of 0.451.
```
assay_id CHEMBL1614421, description: PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation, Thioflavin T Binding. (Class of assay: confirmatory) [Related pubchem assays: 596 ]
--------- Default -----------
ratio: 0.129 kappa: 0.033, AUC: 0.860, OOB_AUC: 0.850
[[8681    4]
 [1101   22]]
--------- Balanced -----------
thresh: 0.200 kappa: 0.451
[[8261  424]
 [ 595  528]]
```
For assay CHEMBL1614249, on the other hand, kappa starts at 0.000 (no compounds from the holdout set are classified as active) and the best the balancing procedure (again based on OOB predictions) can do is a kappa of 0.102 when the threshold is set to 0.050.
```
assay_id CHEMBL1614249, description: PUBCHEM_BIOASSAY: qHTS Assay for Identification of Novel General Anesthetics. In this assay, a GABAergic mimetic model system, apoferritin and a profluorescent 1-aminoanthracene ligand (1-AMA), was used to construct a competitive binding assay for identification of novel general anesthetics (Class of assay: confirmatory) [Related pubchem assays: 2385 (Probe Development Summary for Identification of Novel General Anesthetics), 2323 (Validation apoferritin assay run on SigmaAldrich LOPAC1280 collection)]
--------- Default -----------
ratio: 0.006 kappa: 0.000, AUC: 0.725, OOB_AUC: 0.741
[[8356    0]
 [  51    0]]
--------- Balanced -----------
thresh: 0.050 kappa: 0.102
[[8302   54]
 [  45    6]]
```
What's the difference here? Why does the balancing procedure produce reasonable results for one of the datasets and poor quality results for the other? A quite obvious difference between the two models can be seen in the ROC curves generated for the OOB predictions on the training sets. Note that these curves, which are generated solely using the predicted probabilities of being active, are not sensitive to the decision threshold.
The ROC curve for assay CHEMBL1614421 shows significantly better early enrichment than that for CHEMBL1614249, and an overall better AUC (0.850 vs 0.741 on the OOB predictions). The nice-looking ROC curve indicates that the model for CHEMBL1614421 is actually pretty good at distinguishing actives from inactives; adjusting the decision threshold allows us to take advantage of this.
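For reference, such a curve can be generated directly from the OOB probabilities with scikit-learn; the score distributions below are synthetic stand-ins for illustration, not the actual models:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)

# hypothetical OOB probabilities: 50 actives scoring somewhat higher
# than 950 inactives (both drawn from made-up beta distributions)
y_true = np.array([0] * 950 + [1] * 50)
scores = np.concatenate([rng.beta(2.0, 5.0, 950), rng.beta(4.0, 3.0, 50)])

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
print(round(auc, 3))
```

Note that the curve depends only on the ranking induced by the predicted probabilities, which is why it is insensitive to where the decision threshold is placed.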
In order to try and determine how general the method is, I've applied the same balancing procedure to a collection of diverse datasets pulled from ChEMBL:
Here's a summary graphic plotting the balanced kappa values (obtained by shifting the decision threshold) against the original kappa values for all the data sets. The points are colored by the out-of-bag AUROC. You can see that the balancing procedure results in better kappa values, often dramatically better values, in almost all cases and that the datasets with low AUROC values tend to show less improvement.

The full details, along with the code to run the experiment, are in the jupyter notebook associated with this post. Since a couple of the datasets are a bit large, I haven't checked them all into github with the notebook; there are links above to download the data. I'm also happy to provide the KNIME workflows that I used to construct the datasets.
I plan a couple more blog posts in this series. One will show how to adjust the decision threshold when making predictions within KNIME, one will look at using machine learning methods other than random forests, and I'd also like to include a comparison to the results obtained using standard resampling methods.
I think this approach should be a standard part of our toolbox when building predictive models for unbalanced datasets. What do you think?

# Motivation

A frequent desire when working with chemical fingerprints is to see what the individual bits actually "mean". For most of the fingerprinting algorithms in the RDKit this question only means something in the context of a molecule, so there have been a number of code samples floating around (including in the RDKit documentation) allowing the user to show the atoms/bonds in a molecule that are responsible for setting a particular bit. In the 2018.09 release we made this easier and added a few functions to the `rdkit.Chem.Draw` package that allow direct visualization of the atoms and bonds from a molecule that set bits from Morgan and RDKit fingerprints.
This notebook shows how to use this functionality. Note that since this is new code, it will only work in RDKit versions 2018.09 and later.
The code that is used for this is derived from the bit rendering code that Nadine Schneider wrote for CheTo.
Start by importing the required packages:
In [1]:
```from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
```
We'll use epinephrine as the demo molecule:
In [2]:
```epinephrine = Chem.MolFromSmiles('CNC[C@H](O)c1ccc(O)c(O)c1')
epinephrine
```
Out[2]:

## Looking at Morgan bits

Generate a Morgan fingerprint and save information about the bits that are set using the `bitInfo` argument:
In [3]:
```bi = {}
fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(epinephrine, radius=2, bitInfo=bi)
# show 10 of the set bits:
list(fp.GetOnBits())[:10]
```
Out[3]:
`[1, 80, 227, 315, 589, 606, 632, 807, 875, 1057]`
In its simplest form, the new code lets you display the atomic environment that sets a particular bit. Here we will look at bit 589:
In [4]:
```Draw.DrawMorganBit(epinephrine,589,bi)
```
Out[4]:
There's also a function, `DrawMorganBits()`, for drawing multiple bits at once (thanks to Pat Walters for suggesting this one):
In [5]:
```tpls = [(epinephrine,x,bi) for x in fp.GetOnBits()]
Draw.DrawMorganBits(tpls[:12],molsPerRow=4,legends=[str(x) for x in fp.GetOnBits()][:12])
```
Out[5]:
Some notes about the rendering here:
• The molecule fragment is drawn with the atoms in the same positions as in the original molecule.
• The central atom is highlighted in blue.
• Aromatic atoms are highlighted in yellow.
• Aliphatic ring atoms (none present here) are highlighted in dark gray.
• Atoms/bonds that are drawn in light gray indicate pieces of the structure that influence the atoms' connectivity invariants but that are not directly part of the fingerprint.
We can also take advantage of the interactivity features of the Jupyter notebook and create a widget allowing us to browse through the bits. Note that this does not work in the blog post or the github view of the notebook; you will need to download the notebook and run it yourself to see the interactivity.
In [6]:
```from ipywidgets import interact,fixed,IntSlider
def renderFpBit(mol,bitIdx,bitInfo,fn):
    bid = bitIdx
    return display(fn(mol,bid,bitInfo))
```
In [8]:
```interact(renderFpBit, bitIdx=list(bi.keys()), mol=fixed(epinephrine),
         bitInfo=fixed(bi), fn=fixed(Draw.DrawMorganBit));
```
This interactivity does not work in the blogger view, so here's an animated GIF of what it looks like:

## Looking at RDKit bits

We can do the same thing with RDKit bits:
In [9]:
```rdkbi = {}
rdkfp = Chem.RDKFingerprint(epinephrine, maxPath=5, bitInfo=rdkbi)
# show 10 of the set bits:
list(rdkfp.GetOnBits())[:10]
```
Out[9]:
`[93, 103, 112, 122, 148, 149, 161, 166, 194, 208]`
In [10]:
```print(rdkfp.GetNumOnBits(),len(rdkbi))
```
```128 128
```
In [11]:
```Draw.DrawRDKitBit(epinephrine,103,rdkbi)
```
Out[11]:
In [12]:
```tpls = [(epinephrine,x,rdkbi) for x in rdkbi]
Draw.DrawRDKitBits(tpls[:12],molsPerRow=4,legends=[str(x) for x in rdkbi][:12])
```
Out[12]:
Some notes about the rendering here:
• The molecule fragment is drawn with the atoms in the same positions as in the original molecule.
• All bonds that are drawn are part of the bit.
• Aromatic atoms are highlighted in yellow.
• By default there is no additional highlighting for aliphatic ring atoms, but this can be changed using the `nonAromaticColor` optional argument.
We can, of course, use the same jupyter interactivity features here as we did for the Morgan fingerprint:
In [13]:
```interact(renderFpBit, bitIdx=list(rdkbi.keys()), mol=fixed(epinephrine),
         bitInfo=fixed(rdkbi), fn=fixed(Draw.DrawRDKitBit));
```
This interactivity does not work in the blogger view, so here's an animated GIF of what it looks like: