Monday, December 8, 2014

Exploring Blaze

Continuum's Blaze package provides an interesting way of working with datasets from within Python. Inspired by this blog post, here's a quick exploration of using Blaze Expressions

In [5]:

%pylab inline
import blaze
import pandas as pd

Populating the interactive namespace from numpy and matplotlib

Some technical details before getting started.

This is the first post I've done using Python 3.4. I'm using the anaconda distribution and its excellent conda package manager to handle installing packages.

The Blaze version I'm using is from the blaze channel: conda install -c blaze blaze and I got psycopg2 for Python3.4 (to allow connections to PostgreSQL) from the pandas channel: conda install -c pandas psycopg2.

In [2]:

chembl = blaze.Data('postgresql://localhost/chembl_19')

Blaze collects information about the schema while connecting, so I have access to all the tables and their columns. This doesn't show in the static blog post, but I can use IPython's tab completion on table names and/or columns.

Start with a basic join, blaze figures out the column(s) to use:

In [3]:

docs_and_compounds = blaze.join(chembl.docs,chembl.compound_records)

In [4]:

docs_and_compounds.fields

Out[4]:

['doc_id',
 'journal',
 'year',
 'volume',
 'issue',
 'first_page',
 'last_page',
 'pubmed_id',
 'doi',
 'chembl_id',
 'title',
 'doc_type',
 'authors',
 'abstract',
 'record_id',
 'molregno',
 'compound_key',
 'compound_name',
 'src_id',
 'src_compound_id']

In [5]:

docs_and_compounds.nrows

Out[5]:

1637862

In [6]:

blaze.sort(docs_and_compounds,key='year',ascending=False)

Out[6]:

	doc_id	journal	year	volume	issue	first_page	last_page	pubmed_id	doi	chembl_id	title	doc_type	authors	abstract	record_id	molregno	compound_key	compound_name	src_id	src_compound_id
0	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979019	1626883	MMV672625	None	23	MMV672625
1	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979020	1626884	MMV672626	None	23	MMV672626
2	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979021	1626885	MMV672686	None	23	MMV672686
3	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979022	1626886	MMV672687	None	23	MMV672687
4	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979023	1626887	MMV672688	None	23	MMV672688
5	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979024	1626888	MMV672689	None	23	MMV672689
6	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979025	1626889	MMV672723	None	23	MMV672723
7	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979026	1626890	MMV672725	None	23	MMV672725
8	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979027	1626891	MMV672726	None	23	MMV672726
9	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979028	1626892	MMV672727	None	23	MMV672727
10	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979029	1626893	MMV672730	None	23	MMV672730

Get rid of the unpublished datasets (doc_id=-1) and datasets where the year is None by doing a query on the docs_and_compounds object:

In [6]:

docs_and_compounds = docs_and_compounds[(docs_and_compounds['year']>0) & (docs_and_compounds['doc_id']>-1)]
docs_and_compounds.nrows

Out[6]:

1095936

In [7]:

docs_and_compounds

Out[7]:

	doc_id	journal	year	volume	issue	first_page	last_page	pubmed_id	doi	chembl_id	title	doc_type	authors	abstract	record_id	molregno	compound_key	compound_name	src_id	src_compound_id
0	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350140	292848	15	(3R,4S)-1,4-Bis-(4-methoxy-phenyl)-3-((Z)-3-ph...	1	None
1	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350141	292849	16	(3R,4S)-1,4-Bis-(4-methoxy-phenyl)-3-((E)-3-ph...	1	None
2	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350142	292858	17	(S)-7-(4-Chloro-phenyl)-3-(4-methoxy-phenyl)-2...	1	None
3	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350143	292858	18	(S)-7-(4-Chloro-phenyl)-3-(4-methoxy-phenyl)-2...	1	None
4	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350144	292870	20	4-[(2S,3R)-1-(4-Hydroxy-phenyl)-4-oxo-3-(3-phe...	1	None
5	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350145	292107	21	(3R,4S)-1,4-Bis-(4-hydroxy-phenyl)-3-(3-phenyl...	1	None
6	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350146	292254	22	4-[1-(4-formylphenyl)-3-[3-(4-hydroxyphenyl)pr...	1	None
7	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350147	292403	23	4-[1-(4-formylphenyl)-3-(3-hydroxy-3-phenylpro...	1	None
8	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350148	292557	24	4-[1-(4-formylphenyl)-3-(3-hydroxy-3-phenylpro...	1	None
9	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350149	292687	25	(3R,4S)-1,4-Bis-(4-acetyl-phenyl)-3-(3-oxo-3-p...	1	None
10	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350150	175106	Ezetimibe	(3R,4S)-1-(4-Fluoro-phenyl)-3-[(S)-3-(4-fluoro...	1	None

Find the first year each molregno appeared using blaze.by. This is analogous to SQLs "group by".

In [8]:

minyear = blaze.by(docs_and_compounds.molregno,yr=docs_and_compounds.year.min())
minyear

Out[8]:

	molregno	yr
0	1	1983
1	2	1983
2	3	1983
3	4	1983
4	5	1983
5	6	1983
6	7	1983
7	8	1983
8	9	1983
9	10	1997
10	11	1993

In [9]:

year_counts = blaze.by(minyear.yr,cnt=minyear.molregno.count())

In [10]:

blaze.sort(year_counts,key='yr',ascending=False)

Out[10]:

	yr	cnt
0	2014	15131
1	2013	70429
2	2012	69407
3	2011	67026
4	2010	84525
5	2009	65118
6	2008	61158
7	2007	52060
8	2006	33837
9	2005	30778
10	2004	31118

That's what we had before. Excellent.

Let's try something else: the number of documents per compound. We'll use the limited set of docs that have years and doc_ids

In [11]:

docs_per_compound = blaze.by(docs_and_compounds.molregno,ndocs=docs_and_compounds.doc_id.nunique())

In [12]:

blaze.sort(docs_per_compound,key='ndocs',ascending=False)

Out[12]:

	molregno	ndocs
0	78759	1140
1	241	1045
2	6579	761
3	8062	660
4	8873	617
5	173	593
6	13758	564
7	305519	544
8	27307	539
9	365189	530
10	70140	501

That's also more or less what we got before.

Blaze also makes it easy to pull the data from the query (which is only executed in a lazy manner) into other formats. Here's an example of grabbing it as a Pandas DataFrame (for more on this, look at the blaze migrations blog post)

In [13]:

df = blaze.into(pd.DataFrame,docs_per_compound)

In [14]:

_=hist(df['ndocs'],bins=20,log=True)
_=xlabel('num docs per compound')

In [18]:

len(df[df['ndocs']>10])

Out[18]:

In []:

Sunday, November 23, 2014

Switching to Python 3

Intro

Now that the RDKit supports Python 3, I am planning to make Python 3 my standard environment. Here are some notes I collected while doing so.

Rather than mess around on my own with trying to get a working Python 3 install that co-exists with Python 2, I'm going to use anaconda, which provides most every package I'd be interested in as well as a very capable package manager. Riccardo's conda-rdkit helps a lot. For those wanting a pre-built version of the RDKit, we've made binaries available via the RDKit binstar channel.

I'm want to set up a development environment where I can do my usual RDKit work, so the standard conda stuff doesn't help me. The rest of this post is about making that work.

Linux

Here's what I did on my linux box (Ubuntu 14.04).

After installing anaconda and ensuring that it's in the PATH, setup a python 3.4 environment:
conda create -n py34 python=3.4 anaconda
source activate py34

Then grab a copy of conda-rdkit from github and, from within the conda-rdkit directory, build adnd install boost:
CONDA_PY=34 conda build boost
CONDA_PY=34 conda install --use-local boost

Note: if I'd been thinking about it, I probably could have skipped this very time-consuming step by installing the boost binaries from binstar.

Now go to the RDKit source directory, create a build dir, configure the build using cmake, and build normally (note, I'm buildnig with the optional InChI and avalontools support enabled) :
BOOST_ROOT=~/anaconda/envs/py34 cmake -D RDK_BUILD_SWIG_WRAPPERS=OFF \
-D AVALONTOOLS_DIR=$RDBASE/External/AvalonTools/distrib/SourceDistribution \
-D RDK_BUILD_AVALON_SUPPORT=ON -D RDK_BUILD_INCHI_SUPPORT=ON -D RDK_INSTALL_STATIC_LIBS=OFF \
-D RDK_USE_FLEXBISON=OFF \
-D Boost_NO_SYSTEM_PATHS=ON \
-D CMAKE_SYSTEM_PREFIX_PATH=~/anaconda/envs/py34 \
..
make -j4 install

To test the build, you need to have $RDBASE set and ensure your environment is properly configured:

declare -x LD_LIBRARY_PATH="$RDBASE/lib"

export PYTHONPATH="$RDBASE"

At this point running ctest should show all tests passing and you should have a working RDKit build.

MacOS

[Update 20 December, 2014. Thanks to Pat for reminding me that I hadn't done this.]

The conda-rdkit recipes also support building on the Mac. I have not yet managed to get a free-standing build like what's described above working, but using the master version of the conda-rdkit recipe does now allow you to build a working RDKit version of the most recent release for either python 2.7 or 3.4. Pre-built MacOS binaries are also available via the RDKit binstar channel. We'll work on getting the development branch (which does a build based on the current github master) working and I will, hopefully, in a future post be able to document how to do a "standard" free-standing build working with anaconda on the Mac.

Sunday, October 26, 2014

Molecule counts in ChEMBL documents

I was curious about how many compounds are added to the data sources that are loaded into ChEMBL each year. This is an exploration around that theme, with a digression or two at the end, but no real conclusions

In [1]:

import time,random
import psycopg2
from __future__ import print_function
import pandas as pd
%pylab inline
%load_ext sql

Populating the interactive namespace from numpy and matplotlib

WARNING: pylab import has clobbered these variables: ['random']
`%matplotlib` prevents importing * from pylab and numpy

Start by pulling the number of new compounds appearing per year.

In [2]:

data = %sql postgresql://localhost/chembl_19 \
    select miny as year,count(molregno) from \
       (select molregno,min(year) as miny from compound_records join docs using (doc_id) where year>0 group by molregno) tmp \
    group by miny order by miny desc;
df1 = data.DataFrame()

37 rows affected.

In [3]:

df1.head(10)

Out[3]:

	year	count
0	2014	15131
1	2013	70429
2	2012	69407
3	2011	67026
4	2010	84525
5	2009	65118
6	2008	61158
7	2007	52060
8	2006	33837
9	2005	30778

Obviously 2014 isn't complete yet, so we should ignore that one.

How many documents per year show up?

In [5]:

data = %sql \
       select year,count(doc_id) as cnt from docs where year>0 \
       group by year order by year desc ;
df2 = data.DataFrame()

37 rows affected.

In [8]:

data=pd.merge(df1,df2)
data.rename(columns={'count':'molCount','cnt':'docCount'},inplace=True)
data.head(10)

Out[8]:

	year	molCount	docCount
0	2014	15131	895
1	2013	70429	4211
2	2012	69407	4174
3	2011	67026	4022
4	2010	84525	4599
5	2009	65118	4443
6	2008	61158	4063
7	2007	52060	3822
8	2006	33837	2132
9	2005	30778	2009

2010 was a big year.

Look at the general relationship between number of documents and number of compounds:

In [10]:

scatter(data['docCount'],data['molCount'])
_=xlabel('doc count')
_=ylabel('mol count')

That looks pretty linear; we can confirm that by looking at the ratio:

In [12]:

data['ratio']=data['molCount']/data['docCount']

In [13]:

scatter(data['year'],data['ratio'])
_=xlabel('year')
_=ylabel('mols / doc')

There's a clear early outlier, leave it out:

In [15]:

data2=data[data.year>=1980]
scatter(data2['year'],data2['ratio'])
_=xlabel('year')
_=ylabel('mols / doc')

Out[15]:

<matplotlib.collections.PathCollection at 0x10b810950>

Looks like there was dip in the late 90s. Odd.

Compounds per paper

A more systematic look at the number of compounds appearing in each paper.

In [16]:

data = %sql postgresql://localhost/chembl_19 \
    select count(molregno) numcmpds,doc_id from compound_records join docs using (doc_id) \
    where doc_id>0 and year>1 group by doc_id
df1 = data.DataFrame()

57115 rows affected.

Let's ignore documents that have more than 200 compounds; they probably aren't SAR papers anyway:

In [24]:

smaller=df1[df1.numcmpds<200]
print(len(smaller))
_=hist(smaller['numcmpds'],bins=20,log=True)

Papers per compound

Turn the question around and ask how many papers each compound appears in:

In [17]:

data = %sql postgresql://localhost/chembl_19 \
    select count(doc_id) numdocs,molregno from compound_records join docs using (doc_id) \
    where doc_id>0 and year>1 group by molregno order by numdocs desc
df1 = data.DataFrame()

874168 rows affected.

In [19]:

_=hist(df1['numdocs'],bins=20,log=True)
_=xlabel('num papers')

In [20]:

df1.head()

Out[20]:

	numdocs	molregno
0	1142	78759
1	1049	241
2	763	6579
3	667	8062
4	618	8873

top 10 appearing molecules

In [21]:

data = %sql postgresql://localhost/chembl_19 \
    select molregno,m as smiles,numdocs from ( \
      select count(doc_id) numdocs,molregno from compound_records join docs using (doc_id) \
      where doc_id>0 and year>1 group by molregno order by numdocs desc limit 10 \
    ) tmp join rdk.mols using (molregno)
df1 = data.DataFrame()

10 rows affected.

In [22]:

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools

In [23]:

PandasTools.AddMoleculeColumnToFrame(df1,smilesCol='smiles')

In [24]:

PandasTools.FrameToGridImage(df1,legendsCol='numdocs')

Out[24]:

In [25]:

df1

Out[25]:

	molregno	smiles	numdocs
0	78759	COc1cccc2c1C(=O)c1c(O)c3c(c(O)c1C2=O)C[C@@](O)(C(=O)CO)C[C@@H]3O[C@H]1C[C@H](N)[C@H](O)[C@H](C)O1	1142
1	241	O=C(O)c1cn(C2CC2)c2cc(N3CCNCC3)c(F)cc2c1=O	1049
2	6579	CCN(CC)CCCC(C)Nc1ccnc2cc(Cl)ccc21	763
3	8062	CC(=O)O[C@@H]1C2=C(C)[C@@H](OC(=O)[C@H](O)[C@@H](NC(=O)c3ccccc3)c3ccccc3)C[C@@](O)([C@@H](OC(=O)c3ccccc3)[C@@H]3[C@]4(OC(C)=O)CO[C@@H]4C[C@H](O)[C@@]3(C)C1=O)C2(C)C	667
4	8873	C[C@@H]1[C@H](O)[C@@H](C)C=CC=CC=CC=CC=CC=CC=C[C@H](O[C@@H]2O[C@H](C)[C@@H](O)[C@H](N)[C@@H]2O)C[C@@H]2O[C@](O)(C[C@@H](O)C[C@@H](O)[C@H](O)CC[C@@H](O)C[C@@H](O)CC(=O)O[C@H]1C)C[C@H](O)[C@H]2C(=O)O	618
5	173	COc1ccc2c(c1)c(CC(=O)O)c(C)n2C(=O)c1ccc(Cl)cc1	596
6	13758	OC(Cn1cncn1)(Cn1cncn1)c1ccc(F)cc1F	565
7	305519	CC1(C)S[C@@H]2[C@H](NC(=O)[C@H](N)c3ccccc3)C(=O)N2[C@H]1C(=O)O	545
8	27307	Cc1cn([C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2)c(=O)[nH]c1=O	543
9	365189	CO[C@H]1C=CO[C@@]2(C)Oc3c(c4c(O)c(/C=N/N5CCN(C)CC5)c(c(O)c4c(O)c3C)NC(=O)C(C)=CC=C[C@H](C)[C@H](O)[C@@H](C)[C@@H](O)[C@@H](C)[C@H](OC(C)=O)[C@@H]1C)C2=O	532

Let's look some of these popular compounds up via the NCI Resolver. I would use ChemSpider for this, but their API is overly painful to use for these simple tasks.

In [34]:

import requests
def lookup(x):
    resp = requests.get('http://cactus.nci.nih.gov/chemical/structure/'+x+'%09/names')
    return '|'.join([x for x in resp.content.split('\n') if x.find('USAN')>-1 or x.find('USP')>-1])
df1['names']=df1['smiles'].map(lookup)

In [40]:

df1.drop('smiles',1)

Out[40]:

	molregno	numdocs	names
0	78759	1142	Doxorubicin [USAN:BAN:INN]\|Doxorubicin (USAN)
1	241	1049	Ciprofloxacin (JAN/USP/INN)\|Ciprofloxacin [USAN:BAN:INN]
2	6579	763	Chloroquine (USP)\|Chloroquine [USAN:BAN:INN]
3	8062	667	Paclitaxel [USAN:BAN:INN]
4	8873	618
5	173	596	USAN\|Indomethacin [USAN:BAN]\|Indomethacin (USP)
6	13758	565	Fluconazole (JAN/USAN/INN)\|Fluconazole [USAN:BAN:INN:JAN]
7	305519	545	Ampicillin (USP/INN)\|Ampicillin [USAN:BAN:INN:JAN]
8	27307	543	Zidovudine [USAN:BAN:INN:JAN]\|Zidovudine (JAN/USP/INN)
9	365189	532

I could look the missing two up in chemspider manually. Number 4 is Amphotericin B, and number 9 is Rifampicin. Definitely lots of antibiotics in that list.

In []:

Monday, December 8, 2014

Exploring Blaze

Sunday, November 23, 2014

Switching to Python 3

Intro

Linux

MacOS

Sunday, October 26, 2014

Molecule counts in ChEMBL documents

Compounds per paper

Papers per compound

top 10 appearing molecules

About Me

Blog Archive