Saturday, October 5, 2019

A new "Lessel and Briem like" dataset

This is another one that ends up being a bit beyond what Blogger can handle. So the post is just a summary; full details are in the Jupyter notebook in GitHub, here's the nbviewer link.

Background and Intro

Back in the dark ages before we had access to ChEMBL (it's hard to believe that it was only 10 years ago!) there weren't a lot of datasets available for testing/validating new computational methods. One that became something of a standard for the validation of similarity-based methods was published by Uta Lessel and Hans Briem back in 2000: Lessel, U. F. & Briem, H. Flexsim-X:  A Method for the Detection of Molecules with Similar Biological Activity. J. Chem. Inf. Comput. Sci. 40, 246–253 (2000). There's also a followup paper from later that year that uses the dataset again Briem, H. & Lessel, U. F. In vitro and in silico affinity fingerprints: Finding similarities beyond structural classes. Perspectives in Drug Discovery and Design 20, 231–244 (2000). This dataset was derived from the MDDR ("MACCS Drug Data Report") and included five sets of actives (ligands for particular targets) and a set of assumed inactives (compounds drawn randomly from the MDDR).
At the RDKit UGM last week Roger Sayle - another member of the RDKit community who remembers and bears the scars from the pre-ChEMBL dark ages - mentioned the Lessel and Briem dataset and used it in his presentation about a new clustering algorithm implementation. I had some questions about Roger's presentation and wanted to play with the data a little bit, but I encountered a problem: licensing restrictions on the MDDR prevented Lessel and Briem from publishing the structures of the compounds they used. They did include the MDDR IDs in the supplementary material of the paper, so anyone with a copy of the MDDR could reproduce their work. (An aside: this seems pretty normal these days, but back then it wasn't all that common.) However, I don't have an MDDR license and Roger can't legally send me the data, so there's a problem. Since we do have ChEMBL now, I decided to create a similar type of dataset that I could use to answer some questions about similarity and activity. As you might expect, this led me off on a big yak-shaving expedition, but I did get a few potentially interesting datasets along with ideas/material for some blog posts. This is the first of those.
First, a word on why I didn't just use the ChEMBL datasets from either our benchmarking paper (https://doi.org/10.1186/1758-2946-5-26) or the model-fusion paper (https://doi.org/10.1021/ci400466r). The dataset from the first paper (and "dataset 1" from the second one) is derived from a publication from Heikamp and Bajorath (https://doi.org/10.1021/ci200199u) and includes "actives" for a set of targets along with assumed inactives randomly picked from ChEMBL. For those sets "active" was defined to be a potency < 10uM against the target. For these new datasets I wanted to use a much stricter definition of "active".

I will do another post with a bit more detail about how I constructed the dataset (along with code), but here's a quick summary. For the purposes of this dataset I worked with ChEMBL25 and only considered Ki values (so that I can compare results between assays on the same target), and an "active" is defined to be a compound with a Ki value <= 1nM (ignoring qualified values). I then picked targets that have at least 200 "actives". This leaves me with 35 targets with between 205 and 934 "actives", listed below. For later use I also defined "inactive" for these targets to be compounds with a Ki value >= 100nM (again, ignoring qualified values). The exported dataset includes SMILES, ChEMBL IDs, active/inactive labels, and measured Ki values.
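Just to make those rules concrete, here's a minimal sketch of the labelling logic (the actual extraction code will be in the follow-up post; the DataFrame ki_df and its column names are hypothetical stand-ins for a ChEMBL25 Ki export, not the real schema I queried):

    import pandas as pd

    def label_ki_data(ki_df, active_cutoff=1.0, inactive_cutoff=100.0, min_actives=200):
        # keep only unqualified Ki values reported in nM
        df = ki_df[(ki_df.standard_type == 'Ki') &
                   (ki_df.standard_relation == '=') &
                   (ki_df.standard_units == 'nM')].copy()
        # label: "active" if Ki <= 1nM, "inactive" if Ki >= 100nM
        df['label'] = 'other'
        df.loc[df.standard_value <= active_cutoff, 'label'] = 'active'
        df.loc[df.standard_value >= inactive_cutoff, 'label'] = 'inactive'
        # only keep targets that have at least 200 "actives"
        active_counts = df[df.label == 'active'].groupby('tid').size()
        keep_tids = active_counts[active_counts >= min_actives].index
        return df[df.tid.isin(keep_tids)]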
I also needed a set of decoys that we will assume are inactive. For this dataset I picked molecules from ChEMBL25 that have a molecular weight between 250 and 800. Since they're assumed inactives, I figured I should avoid picking anything that has actually been seen to be active in any assay, so I further filtered the set to compounds that have only been tested in a single assay. I required that the single assay be one that measures IC50 and that the measured IC50 be > 1uM. I randomly picked about 4000 molecules satisfying these criteria.
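Again purely as an illustration of the selection rules described above (the column names here are assumptions, not the actual ChEMBL schema), the decoy picking boils down to something like this:

    import pandas as pd

    def pick_decoys(activity_df, n_decoys=4000, seed=42):
        # compounds that have only been tested in a single assay
        n_assays = activity_df.groupby('molecule_chembl_id')['assay_chembl_id'].nunique()
        singletons = n_assays[n_assays == 1].index
        df = activity_df[activity_df.molecule_chembl_id.isin(singletons)]
        # molecular weight between 250 and 800
        df = df[(df.mol_weight >= 250) & (df.mol_weight <= 800)]
        # the single assay must measure an IC50 and that IC50 must be > 1uM (1000 nM)
        df = df[(df.standard_type == 'IC50') & (df.standard_value > 1000)]
        # randomly pick ~4000 of what's left
        return df.sample(n=min(n_decoys, len(df)), random_state=seed)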
It's worth mentioning at least one other dataset of the Lessel and Briem type (it's a blog, I'm not doing a full literature survey here): a larger (11 targets instead of 5), also MDDR-based dataset, for the validation of similarity-based virtual screening methods was published by the Sheffield and Novartis groups in 2004: Hert, J. et al. Comparison of Fingerprint-Based Methods for Virtual Screening Using Multiple Bioactive Reference Structures. Journal of Chemical Information and Modeling 44, 1177–1185 (2004). That dataset was also interesting, and the paper is an absolute classic, but we're not talking about that here.

Looking at the Lessel and Briem datasets

The goal of this exercise is to come up with something analogous to the Lessel and Briem dataset for at least some similarity comparison tests. Let's start by characterizing that data.
As I mentioned above, I don't have access to the data, but Roger agreed to run some comparisons and share the results with me. Here are the images along with a brief description. The code to generate the corresponding images for the new dataset is below.
Let's start by creating plots similar to what Roger showed. To generate these, at each Tanimoto threshold point we look at the neighbor list for every active compound and calculate the fraction of neighbors that are from the same activity class. A "neighbor" in this context is any compound that has a similarity value to the active compound that is >= the threshold value. For the purposes of this plot, each compound is included in its own neighbor list.
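For reference, here's a minimal sketch of that calculation (the actual code, including the plotting, is in the linked notebook; the function name and the 2048-bit MFP2 setup are my own choices for this illustration):

    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import rdMolDescriptors

    def same_class_neighbor_fractions(smiles, classes, thresholds, decoy_class=0):
        # classes: one integer label per compound; class 0 is the decoy set
        mols = [Chem.MolFromSmiles(s) for s in smiles]
        fps = [rdMolDescriptors.GetMorganFingerprintAsBitVect(m, 2, 2048) for m in mols]
        classes = np.asarray(classes)
        curves = {}
        for cls in sorted(set(classes) - {decoy_class}):
            idxs = np.where(classes == cls)[0]
            # similarity of each active in this class to every compound in the set
            sims = np.array([DataStructs.BulkTanimotoSimilarity(fps[i], fps) for i in idxs])
            fracs = []
            for t in thresholds:
                nbrs = sims >= t  # each compound is a neighbor of itself (self-similarity = 1.0)
                same_cls = (classes == cls)[None, :] & nbrs
                fracs.append(np.mean(same_cls.sum(axis=1) / nbrs.sum(axis=1)))
            curves[cls] = fracs
        return curves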
Here's the plot showing the curve for each of the five targets in the B&L sets using the MFP2 fingerprint (Morgan radius 2 with default parameters) to generate similarity. The vertical black line indicates the "noise threshold" for the MFP2 fingerprint that I calculated in an earlier blog post. The next plot shows all five curves for the RDKit fingerprint (default parameters).
To make comparing the fingerprints easier, here are the averages of the two sets of curves:
Another plot worth looking at is the in-class similarity for each class of actives, first for MFP2 and then for the RDKit fingerprint. In each of these plots class 0 corresponds to the in-class similarity for the decoy set.
Unsurprisingly, the curves for the RDKit fingerprint are shifted towards higher similarity values than those for MFP2. The RDKit fingerprint tends to have more bits set for a given molecule (I should do a blog post on that too) and its "noise threshold" is significantly higher (0.51 instead of 0.31).
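Here's a quick, purely illustrative way to see that bit-count difference for a single molecule (aspirin is just an arbitrary example):

    from rdkit import Chem
    from rdkit.Chem import rdMolDescriptors

    mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')  # aspirin, just as an example
    mfp2 = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, 2, 2048)
    rdkfp = Chem.RDKFingerprint(mol, fpSize=2048)
    print('MFP2 bits set:    ', mfp2.GetNumOnBits())
    print('RDKit FP bits set:', rdkfp.GetNumOnBits())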
What we see here more or less matches what Roger has in slides 4 and 5 of his presentation, with the addition of the RDKit fingerprints.
I'm going to skip a huge amount of detail here since blogger isn't cooperating.

Let's jump ahead to the end. Apologies that things get even uglier than normal from here on down:

Look at the cumulative distributions
That looks a lot better. Let's try the other plots again:
These are still shifted quite a bit further to the left than the plots for the original data. This is very likely due to the differences in the way decoys are used in the two experiments.
For now, let's take a look at the targets that have survived the analysis:
     | tid    | chembl_id     | pref_name                                     | organism          | nActives | nInactives
  1  | 11     | CHEMBL204     | Thrombin                                      | Homo sapiens      | 452      | 2098
  2  | 15     | CHEMBL205     | Carbonic anhydrase II                         | Homo sapiens      | 283      | 3094
  4  | 10280  | CHEMBL264     | Histamine H3 receptor                         | Homo sapiens      | 492      | 527
  5  | 51     | CHEMBL214     | Serotonin 1a (5-HT1a) receptor                | Homo sapiens      | 229      | 1079
  6  | 72     | CHEMBL217     | Dopamine D2 receptor                          | Homo sapiens      | 273      | 3502
  7  | 10841  | CHEMBL4552    | Peripheral-type benzodiazepine receptor       | Rattus norvegicus | 205      | 188
  9  | 107    | CHEMBL224     | Serotonin 2a (5-HT2a) receptor                | Homo sapiens      | 290      | 981
 10  | 125    | CHEMBL229     | Alpha-1a adrenergic receptor                  | Homo sapiens      | 250      | 154
 12  | 130    | CHEMBL234     | Dopamine D3 receptor                          | Homo sapiens      | 373      | 1369
 21  | 194    | CHEMBL244     | Coagulation factor X                          | Homo sapiens      | 637      | 1752
 22  | 14037  | CHEMBL339     | Dopamine D2 receptor                          | Rattus norvegicus | 260      | 1422
 24  | 252    | CHEMBL251     | Adenosine A2a receptor                        | Homo sapiens      | 238      | 2449
 25  | 259    | CHEMBL253     | Cannabinoid CB2 receptor                      | Homo sapiens      | 277      | 1660
 28  | 10529  | CHEMBL270     | Mu opioid receptor                            | Rattus norvegicus | 580      | 940
 29  | 11085  | CHEMBL1946    | Melatonin receptor 1B                         | Homo sapiens      | 265      | 78
 30  | 10576  | CHEMBL273     | Serotonin 1a (5-HT1a) receptor                | Rattus norvegicus | 250      | 1223
 31  | 104290 | CHEMBL1907596 | Neuronal acetylcholine receptor; alpha4/beta2 | Rattus norvegicus | 529      | 800
 32  | 10627  | CHEMBL3371    | Serotonin 6 (5-HT6) receptor                  | Homo sapiens      | 256      | 726
 33  | 12198  | CHEMBL313     | Serotonin transporter                         | Rattus norvegicus | 220      | 1272
And let's write the actives from those sets out to a file so that it's easy for others to use them later. The decoys are in the decoys file ../data/BLSets_singleic50_decoys.txt.
The data, as well as this notebook, are all in the github repo for this blog: https://github.com/greglandrum/rdkit_blog

The data for this is all in github:
  • The final set of actives picked: https://github.com/greglandrum/rdkit_blog/blob/master/data/BLSets_selected_actives.txt
  • The decoys: https://github.com/greglandrum/rdkit_blog/blob/master/data/BLSets_singleic50_decoys.txt
  • The full collection of Ki data: https://github.com/greglandrum/rdkit_blog/blob/master/data/BLSets_actives.txt
  • Metadata about the targets: https://github.com/greglandrum/rdkit_blog/blob/master/data/BLSets_actives_target_key.txt

Many thanks to Roger for the inspiration for this, generating the data on the original dataset, and for an email discussion as I was putting this together.
