Wednesday, November 28, 2018

Working with unbalanced data, part I

If you are involved in building predictive models for biological end points in pharma research, you've certainly encountered the problem that our datasets tend to be very highly unbalanced - we usually have many, many more inactive molecules than active ones. The imbalance in the data can present challenges when evaluating model performance and "confuse" some of the machine-learning algorithms we use.
Let's start with evaluating models. Although at first glance a classifier (we'll focus on classifiers here) that shows 99% accuracy on hold-out data sounds pretty good, it's not going to be practically useful if the data set is 99% inactive and the model is just always predicting "inactive". There are a number of balanced accuracy metrics available to solve this problem; for this post we will use Cohen's kappa, which is readily available in scikit-learn. As an example of what a difference it makes, the normal accuracy for this confusion matrix:
[[400   0]
 [ 19   1]]
is 95.4% accurate but kappa, which can adopt values from 0 (bad) to 1 (good) is only 0.091. Paul Czodrowski has written an nice article discussing this in more detail that's well worth reading: Paul has also done a presentation summarizing this.
There are multiple strategies available for improving the performance of predictive models built on unbalanced data sets. Most of these focus on making the training set more balanced by either oversampling the active examples or undersampling the inactive examples. There's a broad literature on this and you can find plenty of other information online and if I end up having time I'll do a post looking at these approaches later. Here we're going to explore a different approach that I first learned about from Dean Abbott at a workshop he held at a KNIME User Group Meeting several years ago: adjusting the threshold a model uses to make its predictions. I've talked about this in a few venues but this is the first time that I've actually been somewhat systematic and written stuff down. There is also a body of literature available about the approach, which I will call adjusting the decision threshold, but since I've never formally written this up, I haven't yet done a real literature review. It looks like this article from Provost and Fawcett is one of the first publications:
Most classification algorithms can provide predicted probabilities for each class; the actual class predicted for any given new instance is the one which has the highest predicted probability. For a two-class problem this means that the class with a predicted probability of more than 0.5 is the final prediction. The key idea I learned from Dean (and that you can find in the literature) is that you often get much better balanced accuracy (as measured by things like Cohen's kappa) on unbalanced datasets by shifting the decision boundary and allowing the model to predict the minority class even when the predicted probability of that class is not the highest. To make that concrete: if I'm building a model to distinguish actives from inactives based on a training set where there are far more inactives than actives, I may benefit by assigning "active" to data points where the predicted probability of being active is smaller, often considerably smaller, than 0.5. For example, if I set the decision threshold for the model that produced the confusion matrix above to 0.1, I get the following confusion matrix:
[[384  16]
 [  6  14]]
This model is clearly producing far more false positives (16 instead of none) but finds far more of the true positives and has a kappa of 0.533. That's not great, but it is significantly better than the original 0.091.
It's possible to imagine many different approaches for picking the "best" decision threshold for a model as well as multiple definitions of what "best" means at all. The lower the threshold is set, the more true actives that will be retrieved, together with more false positives. Since I wanted to consider a large number of models here, I chose an automated approach: I try a number of different decision thresholds and calculate kappa using the out-of-bag prediction probabilities for the training set; the threshold that produces the best kappa value is picked as the best.
To demonstrate why this method works and when it doesn't work particularly well, here are two more examples taken from the ChEMBL datasets presented below; both are PubChem qHTS confirmatory assays. In the first example, assay ID CHEMBL1614421,the balancing approach works well: we start with a kappa of 0.033 on the holdout data, but by using the OOB predictions to pick a threshold of 0.200 we arrive at a significantly better kappa of 0.451.
assay_id CHEMBL1614421, description: PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation, Thioflavin T Binding. (Class of assay: confirmatory) [Related pubchem assays: 596 ]
--------- Default -----------
ratio: 0.129 kappa: 0.033, AUC: 0.860, OOB_AUC: 0.850
[[8681    4]
 [1101   22]]
--------- Balanced -----------
thresh: 0.200 kappa: 0.451
[[8261  424]
 [ 595  528]]
For assay CHEMBL1614249, on the other hand, kappa starts at 0.000 (no compounds from the holdout set are classified as active) and the best the balancing procedure (again based on OOB predictions) can do is a kappa of 0.102 when the threshold is set to 0.050.
assay_id CHEMBL1614249, description: PUBCHEM_BIOASSAY: qHTS Assay for Identification of Novel General Anesthetics. In this assay, a GABAergic mimetic model system, apoferritin and a profluorescent 1-aminoanthracene ligand (1-AMA), was used to construct a competitive binding assay for identification of novel general anesthetics (Class of assay: confirmatory) [Related pubchem assays: 2385 (Probe Development Summary for Identification of Novel General Anesthetics), 2323 (Validation apoferritin assay run on SigmaAldrich LOPAC1280 collection)]
--------- Default -----------
ratio: 0.006 kappa: 0.000, AUC: 0.725, OOB_AUC: 0.741
[[8356    0]
 [  51    0]]
--------- Balanced -----------
thresh: 0.050 kappa: 0.102
[[8302   54]
 [  45    6]]
What's the difference here? Why does the balancing procedure produce reasonable results for one of the datasets and poor quality results for the other? A quite obvious difference between the two models can be seen in the ROC curves generated for the OOB predictions on the training sets. Note that these curves, which are generated solely using the predicted probabilities of being active, are not sensitive to the decision threshold.
The ROC curve for assay CHEMBL1614249 shows significantly better early enrichment than that for CHEMBL1614421 and an overall better AUC. The nice-looking ROC curve indicates that the model for CHEMBL1614249 is actually pretty good at distinguishing actives from inactives, adjusting the decision threshold allows us to take advantage of this.
In order to try and determine how general the method is, I've applied the same balancing procedure to a collection of diverse datasets pulled from ChEMBL:
  1. 13 Ki datasets for Serotonin receptors. download
  2. The 80 "Dataset 1" datasets from our benchmarking set. download
  3. 8 large PubChem qHTS confirmatory assays. download
  4. 44 DrugMatrix assays. download
Here's a summary graphic plotting the balanced kappa values (obtained by shifting the decision threshold) against the original kappa values for all the data sets. The points are colored by the out-of-bag AUROC. You can see that the balancing procedure results in better kappa values, often dramatically better values, in almost all cases and that the datasets with low AUROC values tend to show less improvement.

The full details, along with the code to run the experiment, are in the jupyter notebook associated with this post, view it here. Since a couple of the datasets are a bit large, I haven't checked them all into github with this jupyter notebook, there are links above to download the data. I'm also happy to provide the KNIME workflows that I used to construct the datasets.
I plan a couple more blog posts in this series. One will show how to adjust the decision threshold when making predictions within KNIME, one will look at using machine learning methods other than random forests, and I'd also like to include a comparison to the results obtained using standard resampling methods.
I think this approach should be a standard part of our toolbox when building predictive models for unbalanced datasets. What do you think?