Monday, December 8, 2014

Exploring Blaze

Continuum's Blaze package provides an interesting way of working with datasets from within Python. Inspired by this blog post, here's a quick exploration of using Blaze Expressions

In [5]:
%pylab inline
import blaze
import pandas as pd
Populating the interactive namespace from numpy and matplotlib

Some technical details before getting started.

This is the first post I've done using Python 3.4. I'm using the anaconda distribution and its excellent conda package manager to handle installing packages.

The Blaze version I'm using is from the blaze channel: conda install -c blaze blaze and I got psycopg2 for Python3.4 (to allow connections to PostgreSQL) from the pandas channel: conda install -c pandas psycopg2.

In [2]:
chembl = blaze.Data('postgresql://localhost/chembl_19')

Blaze collects information about the schema while connecting, so I have access to all the tables and their columns. This doesn't show in the static blog post, but I can use IPython's tab completion on table names and/or columns.

Start with a basic join, blaze figures out the column(s) to use:

In [3]:
docs_and_compounds = blaze.join(chembl.docs,chembl.compound_records)
In [4]:
docs_and_compounds.fields
Out[4]:
['doc_id',
 'journal',
 'year',
 'volume',
 'issue',
 'first_page',
 'last_page',
 'pubmed_id',
 'doi',
 'chembl_id',
 'title',
 'doc_type',
 'authors',
 'abstract',
 'record_id',
 'molregno',
 'compound_key',
 'compound_name',
 'src_id',
 'src_compound_id']
In [5]:
docs_and_compounds.nrows
Out[5]:
1637862
In [6]:
blaze.sort(docs_and_compounds,key='year',ascending=False)
Out[6]:
doc_id journal year volume issue first_page last_page pubmed_id doi chembl_id title doc_type authors abstract record_id molregno compound_key compound_name src_id src_compound_id
0 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979019 1626883 MMV672625 None 23 MMV672625
1 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979020 1626884 MMV672626 None 23 MMV672626
2 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979021 1626885 MMV672686 None 23 MMV672686
3 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979022 1626886 MMV672687 None 23 MMV672687
4 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979023 1626887 MMV672688 None 23 MMV672688
5 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979024 1626888 MMV672689 None 23 MMV672689
6 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979025 1626889 MMV672723 None 23 MMV672723
7 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979026 1626890 MMV672725 None 23 MMV672725
8 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979027 1626891 MMV672726 None 23 MMV672726
9 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979028 1626892 MMV672727 None 23 MMV672727
10 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979029 1626893 MMV672730 None 23 MMV672730

Get rid of the unpublished datasets (doc_id=-1) and datasets where the year is None by doing a query on the docs_and_compounds object:

In [6]:
docs_and_compounds = docs_and_compounds[(docs_and_compounds['year']>0) & (docs_and_compounds['doc_id']>-1)]
docs_and_compounds.nrows
Out[6]:
1095936
In [7]:
docs_and_compounds
Out[7]:
doc_id journal year volume issue first_page last_page pubmed_id doi chembl_id title doc_type authors abstract record_id molregno compound_key compound_name src_id src_compound_id
0 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350140 292848 15 (3R,4S)-1,4-Bis-(4-methoxy-phenyl)-3-((Z)-3-ph... 1 None
1 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350141 292849 16 (3R,4S)-1,4-Bis-(4-methoxy-phenyl)-3-((E)-3-ph... 1 None
2 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350142 292858 17 (S)-7-(4-Chloro-phenyl)-3-(4-methoxy-phenyl)-2... 1 None
3 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350143 292858 18 (S)-7-(4-Chloro-phenyl)-3-(4-methoxy-phenyl)-2... 1 None
4 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350144 292870 20 4-[(2S,3R)-1-(4-Hydroxy-phenyl)-4-oxo-3-(3-phe... 1 None
5 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350145 292107 21 (3R,4S)-1,4-Bis-(4-hydroxy-phenyl)-3-(3-phenyl... 1 None
6 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350146 292254 22 4-[1-(4-formylphenyl)-3-[3-(4-hydroxyphenyl)pr... 1 None
7 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350147 292403 23 4-[1-(4-formylphenyl)-3-(3-hydroxy-3-phenylpro... 1 None
8 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350148 292557 24 4-[1-(4-formylphenyl)-3-(3-hydroxy-3-phenylpro... 1 None
9 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350149 292687 25 (3R,4S)-1,4-Bis-(4-acetyl-phenyl)-3-(3-oxo-3-p... 1 None
10 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350150 175106 Ezetimibe (3R,4S)-1-(4-Fluoro-phenyl)-3-[(S)-3-(4-fluoro... 1 None

Find the first year each molregno appeared using blaze.by. This is analogous to SQLs "group by".

In [8]:
minyear = blaze.by(docs_and_compounds.molregno,yr=docs_and_compounds.year.min())
minyear
Out[8]:
molregno yr
0 1 1983
1 2 1983
2 3 1983
3 4 1983
4 5 1983
5 6 1983
6 7 1983
7 8 1983
8 9 1983
9 10 1997
10 11 1993
In [9]:
year_counts = blaze.by(minyear.yr,cnt=minyear.molregno.count())
In [10]:
blaze.sort(year_counts,key='yr',ascending=False)
Out[10]:
yr cnt
0 2014 15131
1 2013 70429
2 2012 69407
3 2011 67026
4 2010 84525
5 2009 65118
6 2008 61158
7 2007 52060
8 2006 33837
9 2005 30778
10 2004 31118

That's what we had before. Excellent.

Let's try something else: the number of documents per compound. We'll use the limited set of docs that have years and doc_ids

In [11]:
docs_per_compound = blaze.by(docs_and_compounds.molregno,ndocs=docs_and_compounds.doc_id.nunique())
In [12]:
blaze.sort(docs_per_compound,key='ndocs',ascending=False)
Out[12]:
molregno ndocs
0 78759 1140
1 241 1045
2 6579 761
3 8062 660
4 8873 617
5 173 593
6 13758 564
7 305519 544
8 27307 539
9 365189 530
10 70140 501

That's also more or less what we got before.

Blaze also makes it easy to pull the data from the query (which is only executed in a lazy manner) into other formats. Here's an example of grabbing it as a Pandas DataFrame (for more on this, look at the blaze migrations blog post)

In [13]:
df = blaze.into(pd.DataFrame,docs_per_compound)
In [14]:
_=hist(df['ndocs'],bins=20,log=True)
_=xlabel('num docs per compound')
In [18]:
len(df[df['ndocs']>10])
Out[18]:
2381
In []:
 

Sunday, November 23, 2014

Switching to Python 3

Intro

Now that the RDKit supports Python 3, I am planning to make Python 3 my standard environment. Here are some notes I collected while doing so.

Rather than mess around on my own with trying to get a working Python 3 install that co-exists with Python 2, I'm going to use anaconda, which provides most every package I'd be interested in as well as a very capable package manager. Riccardo's conda-rdkit helps a lot. For those wanting a pre-built version of the RDKit, we've made binaries available via the RDKit binstar channel.

I'm want to set up a development environment where I can do my usual RDKit work, so the standard conda stuff doesn't help me. The rest of this post is about making that work.

Linux

Here's what I did on my linux box (Ubuntu 14.04).

After installing anaconda and ensuring that it's in the PATH, setup a python 3.4 environment:
conda create -n py34 python=3.4 anaconda
source activate py34 

Then grab a copy of conda-rdkit from github and, from within the conda-rdkit directory, build adnd install boost:
CONDA_PY=34 conda build boost
CONDA_PY=34 conda install --use-local boost

Note: if I'd been thinking about it, I probably could have skipped this very time-consuming step by installing the boost binaries from binstar.

Now go to the RDKit source directory, create a build dir, configure the build using cmake, and build normally (note, I'm buildnig with the optional InChI and avalontools support enabled) :
BOOST_ROOT=~/anaconda/envs/py34 cmake -D RDK_BUILD_SWIG_WRAPPERS=OFF \
 -D AVALONTOOLS_DIR=$RDBASE/External/AvalonTools/distrib/SourceDistribution \
 -D RDK_BUILD_AVALON_SUPPORT=ON -D RDK_BUILD_INCHI_SUPPORT=ON -D RDK_INSTALL_STATIC_LIBS=OFF \
 -D RDK_USE_FLEXBISON=OFF \
 -D Boost_NO_SYSTEM_PATHS=ON \
 -D CMAKE_SYSTEM_PREFIX_PATH=~/anaconda/envs/py34 \
 ..
make -j4 install

To test the build, you need to have $RDBASE set and ensure your environment is properly configured:
declare -x LD_LIBRARY_PATH="$RDBASE/lib"
export PYTHONPATH="$RDBASE"

At this point running ctest should show all tests passing and you should have a working RDKit build.

MacOS

[Update 20 December, 2014. Thanks to Pat for reminding me that I hadn't done this.]

The conda-rdkit recipes also support building on the Mac. I have not yet managed to get a free-standing build like what's described above working, but using the master version of the conda-rdkit recipe does now allow you to build a working RDKit version of the most recent release for either python 2.7 or 3.4. Pre-built MacOS binaries are also available via the RDKit binstar channel. We'll work on getting the development branch (which does a build based on the current github master) working and I will, hopefully, in a future post be able to document how to do a "standard" free-standing build working with anaconda on the Mac.



Sunday, October 26, 2014

Molecule counts in ChEMBL documents

I was curious about how many compounds are added to the data sources that are loaded into ChEMBL each year. This is an exploration around that theme, with a digression or two at the end, but no real conclusions

In [1]:
import time,random
import psycopg2
from __future__ import print_function
import pandas as pd
%pylab inline
%load_ext sql
Populating the interactive namespace from numpy and matplotlib

WARNING: pylab import has clobbered these variables: ['random']
`%matplotlib` prevents importing * from pylab and numpy

Start by pulling the number of new compounds appearing per year.

In [2]:
data = %sql postgresql://localhost/chembl_19 \
    select miny as year,count(molregno) from \
       (select molregno,min(year) as miny from compound_records join docs using (doc_id) where year>0 group by molregno) tmp \
    group by miny order by miny desc;
df1 = data.DataFrame()
37 rows affected.

In [3]:
df1.head(10)
Out[3]:
year count
0 2014 15131
1 2013 70429
2 2012 69407
3 2011 67026
4 2010 84525
5 2009 65118
6 2008 61158
7 2007 52060
8 2006 33837
9 2005 30778

Obviously 2014 isn't complete yet, so we should ignore that one.

How many documents per year show up?

In [5]:
data = %sql \
       select year,count(doc_id) as cnt from docs where year>0 \
       group by year order by year desc ;
df2 = data.DataFrame()
37 rows affected.

In [8]:
data=pd.merge(df1,df2)
data.rename(columns={'count':'molCount','cnt':'docCount'},inplace=True)
data.head(10)
Out[8]:
year molCount docCount
0 2014 15131 895
1 2013 70429 4211
2 2012 69407 4174
3 2011 67026 4022
4 2010 84525 4599
5 2009 65118 4443
6 2008 61158 4063
7 2007 52060 3822
8 2006 33837 2132
9 2005 30778 2009

2010 was a big year.

Look at the general relationship between number of documents and number of compounds:

In [10]:
scatter(data['docCount'],data['molCount'])
_=xlabel('doc count')
_=ylabel('mol count')

That looks pretty linear; we can confirm that by looking at the ratio:

In [12]:
data['ratio']=data['molCount']/data['docCount']
In [13]:
scatter(data['year'],data['ratio'])
_=xlabel('year')
_=ylabel('mols / doc')

There's a clear early outlier, leave it out:

In [15]:
data2=data[data.year>=1980]
scatter(data2['year'],data2['ratio'])
_=xlabel('year')
_=ylabel('mols / doc')
Out[15]:
<matplotlib.collections.PathCollection at 0x10b810950>

Looks like there was dip in the late 90s. Odd.

Compounds per paper

A more systematic look at the number of compounds appearing in each paper.

In [16]:
data = %sql postgresql://localhost/chembl_19 \
    select count(molregno) numcmpds,doc_id from compound_records join docs using (doc_id) \
    where doc_id>0 and year>1 group by doc_id
df1 = data.DataFrame()
57115 rows affected.

Let's ignore documents that have more than 200 compounds; they probably aren't SAR papers anyway:

In [24]:
smaller=df1[df1.numcmpds<200]
print(len(smaller))
_=hist(smaller['numcmpds'],bins=20,log=True)
57031

Papers per compound

Turn the question around and ask how many papers each compound appears in:

In [17]:
data = %sql postgresql://localhost/chembl_19 \
    select count(doc_id) numdocs,molregno from compound_records join docs using (doc_id) \
    where doc_id>0 and year>1 group by molregno order by numdocs desc
df1 = data.DataFrame()
874168 rows affected.

In [19]:
_=hist(df1['numdocs'],bins=20,log=True)
_=xlabel('num papers')
In [20]:
df1.head()
Out[20]:
numdocs molregno
0 1142 78759
1 1049 241
2 763 6579
3 667 8062
4 618 8873

top 10 appearing molecules

In [21]:
data = %sql postgresql://localhost/chembl_19 \
    select molregno,m as smiles,numdocs from ( \
      select count(doc_id) numdocs,molregno from compound_records join docs using (doc_id) \
      where doc_id>0 and year>1 group by molregno order by numdocs desc limit 10 \
    ) tmp join rdk.mols using (molregno)
df1 = data.DataFrame()
10 rows affected.

In [22]:
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
In [23]:
PandasTools.AddMoleculeColumnToFrame(df1,smilesCol='smiles')
In [24]:
PandasTools.FrameToGridImage(df1,legendsCol='numdocs')
Out[24]:
In [25]:
df1
Out[25]:
molregno smiles numdocs ROMol
0 78759 COc1cccc2c1C(=O)c1c(O)c3c(c(O)c1C2=O)C[C@@](O)(C(=O)CO)C[C@@H]3O[C@H]1C[C@H](N)[C@H](O)[C@H](C)O1 1142 Mol
1 241 O=C(O)c1cn(C2CC2)c2cc(N3CCNCC3)c(F)cc2c1=O 1049 Mol
2 6579 CCN(CC)CCCC(C)Nc1ccnc2cc(Cl)ccc21 763 Mol
3 8062 CC(=O)O[C@@H]1C2=C(C)[C@@H](OC(=O)[C@H](O)[C@@H](NC(=O)c3ccccc3)c3ccccc3)C[C@@](O)([C@@H](OC(=O)c3ccccc3)[C@@H]3[C@]4(OC(C)=O)CO[C@@H]4C[C@H](O)[C@@]3(C)C1=O)C2(C)C 667 Mol
4 8873 C[C@@H]1[C@H](O)[C@@H](C)C=CC=CC=CC=CC=CC=CC=C[C@H](O[C@@H]2O[C@H](C)[C@@H](O)[C@H](N)[C@@H]2O)C[C@@H]2O[C@](O)(C[C@@H](O)C[C@@H](O)[C@H](O)CC[C@@H](O)C[C@@H](O)CC(=O)O[C@H]1C)C[C@H](O)[C@H]2C(=O)O 618 Mol
5 173 COc1ccc2c(c1)c(CC(=O)O)c(C)n2C(=O)c1ccc(Cl)cc1 596 Mol
6 13758 OC(Cn1cncn1)(Cn1cncn1)c1ccc(F)cc1F 565 Mol
7 305519 CC1(C)S[C@@H]2[C@H](NC(=O)[C@H](N)c3ccccc3)C(=O)N2[C@H]1C(=O)O 545 Mol
8 27307 Cc1cn([C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2)c(=O)[nH]c1=O 543 Mol
9 365189 CO[C@H]1C=CO[C@@]2(C)Oc3c(c4c(O)c(/C=N/N5CCN(C)CC5)c(c(O)c4c(O)c3C)NC(=O)C(C)=CC=C[C@H](C)[C@H](O)[C@@H](C)[C@@H](O)[C@@H](C)[C@H](OC(C)=O)[C@@H]1C)C2=O 532 Mol

Let's look some of these popular compounds up via the NCI Resolver. I would use ChemSpider for this, but their API is overly painful to use for these simple tasks.

In [34]:
import requests
def lookup(x):
    resp = requests.get('http://cactus.nci.nih.gov/chemical/structure/'+x+'%09/names')
    return '|'.join([x for x in resp.content.split('\n') if x.find('USAN')>-1 or x.find('USP')>-1])
df1['names']=df1['smiles'].map(lookup)
In [40]:
df1.drop('smiles',1)
Out[40]:
molregno numdocs ROMol names
0 78759 1142 Mol Doxorubicin [USAN:BAN:INN]|Doxorubicin (USAN)
1 241 1049 Mol Ciprofloxacin (JAN/USP/INN)|Ciprofloxacin [USAN:BAN:INN]
2 6579 763 Mol Chloroquine (USP)|Chloroquine [USAN:BAN:INN]
3 8062 667 Mol Paclitaxel [USAN:BAN:INN]
4 8873 618 Mol
5 173 596 Mol USAN|Indomethacin [USAN:BAN]|Indomethacin (USP)
6 13758 565 Mol Fluconazole (JAN/USAN/INN)|Fluconazole [USAN:BAN:INN:JAN]
7 305519 545 Mol Ampicillin (USP/INN)|Ampicillin [USAN:BAN:INN:JAN]
8 27307 543 Mol Zidovudine [USAN:BAN:INN:JAN]|Zidovudine (JAN/USP/INN)
9 365189 532 Mol

I could look the missing two up in chemspider manually. Number 4 is Amphotericin B, and number 9 is Rifampicin. Definitely lots of antibiotics in that list.

In []: