Background and Intro
Back in the dark ages before we had access to ChEMBL (it's hard to believe that it was only 10 years ago!) there weren't a lot of datasets available for testing/validating new computational methods. One that became something of a standard for the validation of similarity-based methods was published by Uta Lessel and Hans Briem back in 2000: Lessel, U. F. & Briem, H. Flexsim-X: A Method for the Detection of Molecules with Similar Biological Activity. J. Chem. Inf. Comput. Sci. 40, 246–253 (2000). There's also a follow-up paper from later that year that uses the dataset again: Briem, H. & Lessel, U. F. In vitro and in silico affinity fingerprints: Finding similarities beyond structural classes. Perspectives in Drug Discovery and Design 20, 231–244 (2000). This dataset was derived from the MDDR ("MDL Drug Data Report") and included five sets of actives (ligands for particular targets) and a set of assumed inactives (compounds drawn randomly from the MDDR).
At the RDKit UGM last week Roger Sayle - another member of the RDKit community who remembers and bears the scars from the pre-ChEMBL dark ages - mentioned the Briem and Lessel dataset and used it in his presentation about a new clustering algorithm implementation. I had some questions about Roger's presentation and wanted to play with the data a little bit, but I encountered a problem: licensing restrictions on the MDDR prevented Lessel and Briem from publishing the structures of the compounds they used. They did include the MDDR IDs in the supplementary material of the paper, so anyone with a copy of the MDDR could reproduce their work. (An aside: this seems pretty normal these days, but back then it wasn't all that common.) However, I don't have an MDDR license and Roger can't legally send me the data, so I was stuck. Since we do have ChEMBL now, I decided to create a similar type of dataset that I could use to answer some questions about similarity and activity. As you might expect, this led me off on a big yak-shaving expedition, but I did get a few potentially interesting datasets along with ideas/material for some blog posts. This is the first of those.
First a word on why I didn't just use the ChEMBL datasets from either our benchmarking paper (https://doi.org/10.1186/1758-2946-5-26) or the model-fusion paper (https://doi.org/10.1021/ci400466r). The datasets from the first paper (and "datasets 1" from the second one) are derived from a publication by Heikamp and Bajorath (https://doi.org/10.1021/ci200199u) and include "actives" for a set of targets along with assumed inactives randomly picked from ChEMBL. For those sets "active" was defined to be a potency < 10uM against the target. For these new datasets I wanted to use a much stricter definition of "active".
I will do another post with a bit more detail about how I constructed the dataset (along with code), but here's a quick summary. For the purposes of this dataset I worked with ChEMBL25 and only considered Ki values (so that I can compare results between assays on the same target), and an "active" is defined to be a compound with a Ki value <= 1nM (ignoring qualified values). I then picked targets that have at least 200 "actives". This leaves me with 35 targets with between 205 and 934 "actives", listed below. For later use I also defined "inactive" for these targets to be compounds with a Ki value >= 100nM (again, ignoring qualified values). The exported dataset includes SMILES, ChEMBL IDs, active/inactive labels, and measured Ki values.
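That follow-up post will have the real code, but the filtering logic amounts to something like this minimal pandas sketch. It assumes a flat export of the ChEMBL25 Ki measurements to a CSV; the file name and column names (chembl_id, target_chembl_id, standard_relation, standard_value in nM) are my assumptions here, not the actual export:

```python
import pandas as pd

# Hypothetical flat export of ChEMBL25 Ki measurements (file and column names assumed).
acts = pd.read_csv("chembl25_ki_activities.csv")

# Ignore qualified values: keep only unqualified ('=') measurements.
acts = acts[acts.standard_relation == "="]

# "active": Ki <= 1 nM; "inactive": Ki >= 100 nM (standard_value assumed to be in nM).
actives = acts[acts.standard_value <= 1.0]
inactives = acts[acts.standard_value >= 100.0]

# Only keep targets with at least 200 actives.
counts = actives.groupby("target_chembl_id").size()
keep = counts[counts >= 200].index
actives = actives[actives.target_chembl_id.isin(keep)]
inactives = inactives[inactives.target_chembl_id.isin(keep)]

print(f"{len(keep)} targets; {len(actives)} actives; {len(inactives)} inactives")
```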
I also needed a set of decoys that we will assume are inactive. For this dataset I picked molecules from ChEMBL25 that have a molecular weight between 250 and 800. Since they're assumed inactives, I figured I should avoid picking anything that has actually been seen to be active in any assay, so I further filtered the set to compounds that have only been tested in a single assay. I required that the single assay be one that measures IC50 and that the measured IC50 be > 1uM. I randomly picked about 4000 molecules satisfying these criteria.
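Again, just a hedged sketch of that decoy filter rather than the real code, assuming the same kind of flat export but covering all activity types (the assay_id, standard_type, standard_value, and mw_freebase column names are my assumptions):

```python
import pandas as pd

# Hypothetical flat export of all ChEMBL25 activity measurements (columns assumed).
all_acts = pd.read_csv("chembl25_all_activities.csv")

# Compounds that have only ever been tested in a single assay.
n_assays = all_acts.groupby("chembl_id").assay_id.nunique()
single_assay = set(n_assays[n_assays == 1].index)

# That single measurement must be an IC50 > 1 uM (values assumed to be in nM),
# and the compound's molecular weight must be between 250 and 800.
cands = all_acts[all_acts.chembl_id.isin(single_assay) &
                 (all_acts.standard_type == "IC50") &
                 (all_acts.standard_value > 1000.0) &
                 (all_acts.mw_freebase >= 250) & (all_acts.mw_freebase <= 800)]

# Randomly pick ~4000 of them.
decoys = cands.drop_duplicates("chembl_id").sample(n=4000, random_state=42)
```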
It's worth mentioning at least one other dataset of the Lessel and Briem type (it's a blog, I'm not doing a full literature survey here): a larger (11 targets instead of 5), also MDDR-based dataset, for the validation of similarity-based virtual screening methods was published by the Sheffield and Novartis groups in 2004: Hert, J. et al. Comparison of Fingerprint-Based Methods for Virtual Screening Using Multiple Bioactive Reference Structures. Journal of Chemical Information and Modeling 44, 1177–1185 (2004). That dataset was also interesting, and the paper is an absolute classic, but we're not talking about that here.
Looking at the Lessel and Briem datasets
The goal of this exercise is to come up with something analogous to the Lessel and Briem dataset for at least some similarity comparison tests. Let's start by characterizing that data.
As I mentioned above, I don't have access to the data, but Roger agreed to run some comparisons and share the results with me. Here are the images along with a brief bit of description. The code to generate the images using the new dataset is below.
Let's start by creating plots similar to what Roger showed. To generate these, at each Tanimoto threshold point we look at the neighbor list for every active compound and calculate the fraction of neighbors that are from the same activity class. A "neighbor" in this context is any compound that has a similarity value to the active compound that is >= the threshold value. For the purposes of this plot, each compound is included in its own neighbor list.
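Here's a minimal sketch of that calculation using the RDKit and numpy. It isn't the exact code from my notebook; it assumes you already have a list of fingerprints (fps) and a parallel list of class labels (labels) for everything in the pool, and that you pass in the indices of the actives you want to average over:

```python
import numpy as np
from rdkit import DataStructs
from rdkit.Chem import rdMolDescriptors

def same_class_fraction_curve(fps, labels, thresholds, query_idx):
    """At each similarity threshold, the average (over the query compounds)
    fraction of neighbors - compounds with Tanimoto >= threshold, the query
    itself included - that belong to the query's activity class."""
    labels = np.asarray(labels)
    curve = np.zeros(len(thresholds))
    for i in query_idx:
        sims = np.array(DataStructs.BulkTanimotoSimilarity(fps[i], fps))
        for j, thresh in enumerate(thresholds):
            neighbors = sims >= thresh
            curve[j] += (labels[neighbors] == labels[i]).mean()
    return curve / len(query_idx)

# Usage sketch: MFP2 fingerprints for the full pool, averaged over class 1.
# fps = [rdMolDescriptors.GetMorganFingerprintAsBitVect(m, 2, 2048) for m in mols]
# thresholds = np.arange(0.0, 1.001, 0.05)
# curve = same_class_fraction_curve(fps, labels, thresholds,
#                                   [i for i, l in enumerate(labels) if l == 1])
```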
Here's the plot showing the curve for each of the five targets in the B&L sets using the MFP2 fingerprint (Morgan radius 2 with default parameters) to generate similarity. The vertical black line indicates the "noise threshold" for the MFP2 fingerprint that I calculated in an earlier blog post.
And here are all five curves for the RDKit fingerprint (default parameters):
To make comparing the fingerprints easier, here are the averages of the two sets of curves:
Another plot worth looking at is the in-class similarity for each class of actives:
and for the RDKit fingerprint:
In each of these plots class 0 corresponds to the in-class similarity for the decoy set.
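For reference, here's a rough sketch of how those in-class similarity distributions can be computed (again assuming fps and labels as above, with label 0 for the decoys; this isn't the exact notebook code):

```python
from collections import defaultdict
from rdkit import DataStructs

def in_class_similarities(fps, labels):
    """All pairwise Tanimoto similarities within each activity class."""
    by_class = defaultdict(list)
    for fp, lbl in zip(fps, labels):
        by_class[lbl].append(fp)
    dists = {}
    for lbl, class_fps in by_class.items():
        sims = []
        for i in range(1, len(class_fps)):
            sims.extend(DataStructs.BulkTanimotoSimilarity(class_fps[i], class_fps[:i]))
        dists[lbl] = sims
    return dists
```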
Unsurprisingly, the RDKit fingerprint similarities are shifted towards higher values than the MFP2 similarities. The RDKit fingerprint tends to have more bits set for a given molecule (I should do a blog post on that too) and its "noise threshold" is significantly higher (0.51 instead of 0.31).
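If you want to see the bit-count difference yourself, a quick check looks something like this (the 2048-bit folded Morgan fingerprint is my assumption for "MFP2 with default parameters", and aspirin is just an arbitrary example molecule):

```python
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

# Compare the number of bits set by MFP2 and the RDKit fingerprint for one molecule.
m = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, just as an example
mfp2 = rdMolDescriptors.GetMorganFingerprintAsBitVect(m, 2, 2048)
rdkfp = Chem.RDKFingerprint(m)  # default parameters: 2048 bits, path-based
print(mfp2.GetNumOnBits(), rdkfp.GetNumOnBits())
```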
What we see here more or less matches what Roger has in slides 4 and 5 of his presentation, with the addition of the RDKit fingerprints.
I'm going to skip a huge amount of detail here since Blogger isn't cooperating and jump ahead to the end.
The data for this is all on GitHub:
- The final set of actives picked: https://github.com/greglandrum/rdkit_blog/blob/master/data/BLSets_selected_actives.txt
- The decoys: https://github.com/greglandrum/rdkit_blog/blob/master/data/BLSets_singleic50_decoys.txt
- The full collection of Ki data: https://github.com/greglandrum/rdkit_blog/blob/master/data/BLSets_actives.txt
- Metadata about the targets: https://github.com/greglandrum/rdkit_blog/blob/master/data/BLSets_actives_target_key.txt
Many thanks to Roger for the inspiration for this, generating the data on the original dataset, and for an email discussion as I was putting this together.