The idea here is to try out using Pandas to visualize and work with the output of Jameed Hussain's MMPA code in the IPython notebook. The code is available in the RDKit Contrib dir. Jameed gave a couple tutorials on use of the tools at the 2013 UGM. The notebooks from his tutorials are here and here.
I'll use a ChEMBL hERG dataset. This was somewhat inspired/informed by Paul's work here: https://github.com/pzc/herg_chembl_jcim
This is another post that was too big for blogger. The full post in viewable in the nbviewer here: http://nbviewer.ipython.org/github/greglandrum/rdkit_blog/blob/master/notebooks/ChEMBL%20hERG%20MMPA.ipynb?raw=true.
I'll use a ChEMBL hERG dataset. This was somewhat inspired/informed by Paul's work here: https://github.com/pzc/herg_chembl_jcim
In [1]:
from rdkit import Chem,DataStructs
import time,random
from collections import defaultdict
import psycopg2
from rdkit.Chem import Draw,PandasTools,rdMolDescriptors
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
from __future__ import print_function
import requests
from xml.etree import ElementTree
import pandas as pd
%load_ext sql
print(rdBase.rdkitVersion)
Preparation
Start by finding the hERG data in ChEMBL
In [3]:
%sql postgresql://localhost/chembl_19 \
select * from chembl_id_lookup where chembl_id = 'CHEMBL240';
Out[3]:
In [10]:
%sql select count(*) from activities join assays using (assay_id) where tid=165;
Out[10]:
Look at all the activity units available.
In [11]:
%sql select distinct(standard_type) from activities join assays using (assay_id) where tid=165;
Out[11]:
In [12]:
%sql select count(*) from activities join assays using (assay_id) where tid=165 and standard_type='Ki';
Out[12]:
Create the data set
Pull all hERG assay Ki values where the value is not qualified and the SMILES doesn't include a dot (the MMPA code doesn't get along with dot-separated SMILES).Reproducibility note: though the queries here are shown against chembl_19, I did the original data export from chembl_18, so the pairs shown may differ somewhat from what you'd get.
In [4]:
data = %sql select canonical_smiles,molregno,activity_id,standard_value,standard_units from activities \
join assays using (assay_id) \
join compound_structures using (molregno) \
where tid=165 and standard_type='Ki' and standard_value is not null and standard_relation='=' \
and canonical_smiles not like '%.%';
Convert to a Pandas DataFrame and write it to a text file
In [5]:
df = data.DataFrame()
In [42]:
df.to_csv('../data/herg_data.txt',sep=" ",index=False)
Build the matched pairs
Call the MMPA fragmentation program.This can take a few minutes.
In [43]:
!python $RDBASE/Contrib/mmpa/rfrag.py < ../data/herg_data.txt > ../data/herg_fragmented.txt
Now generate the MMPs that differ by less than 10% of the molecule. Generate symmetrically so that we can detect tforms in both directions.
In [153]:
!python $RDBASE/Contrib/mmpa/indexing.py -s -r 0.1 < ../data/herg_fragmented.txt > ../data/mmps_default.txt
Read those into a Pandas data frame and look at some of the data.
In [6]:
mmps = pd.read_csv('../data/mmps_default.txt',header=None,names=('smiles1','smiles2','molregno1','molregno2','tform','core'))
In [7]:
mmps[mmps.molregno1==290813]
Out[7]:
There are dupes... drop them:
In [8]:
mmps=mmps.drop_duplicates(subset=("molregno1","molregno2"))
Add a couple molecule columns and remove the SMILES columns:
In [9]:
PandasTools.AddMoleculeColumnToFrame(mmps,'smiles1','mol1')
PandasTools.AddMoleculeColumnToFrame(mmps,'smiles2','mol2')
mmps = mmps[['mol1','mol2','molregno1','molregno2','tform','core']]
mmps.head()
Out[9]:
Now join back on the original data so that we have activities in the table again. Pandas makes this easy.
In [10]:
t1=df[['molregno','standard_value']]
mmpdds = mmps.merge(t1,left_on='molregno1',right_on='molregno',suffixes=("_1","_2")).\
merge(t1,left_on='molregno2',right_on='molregno',suffixes=("_1","_2"))
mmpdds.head()
Out[10]:
Calculate pKi values and the difference between them:
In [11]:
import math
mmpdds['pKi_1']=mmpdds.apply(lambda row:-1*math.log10(float(row['standard_value_1'])*1e-9),axis=1)
mmpdds['pKi_2']=mmpdds.apply(lambda row:-1*math.log10(float(row['standard_value_2'])*1e-9),axis=1)
mmpdds['delta']=mmpdds['pKi_2']-mmpdds['pKi_1']
And, remove some extra columns:
In [12]:
mmpdds=mmpdds[['mol1','mol2','molregno1','molregno2','pKi_1','pKi_2','delta','tform','core']]
mmpdds.head()
Out[12]: