Monday, December 8, 2014

Exploring Blaze

Continuum's Blaze package provides an interesting way of working with datasets from within Python. Inspired by this blog post, here's a quick exploration of using Blaze Expressions

In [5]:
%pylab inline
import blaze
import pandas as pd
Populating the interactive namespace from numpy and matplotlib

Some technical details before getting started.

This is the first post I've done using Python 3.4. I'm using the anaconda distribution and its excellent conda package manager to handle installing packages.

The Blaze version I'm using is from the blaze channel: conda install -c blaze blaze and I got psycopg2 for Python3.4 (to allow connections to PostgreSQL) from the pandas channel: conda install -c pandas psycopg2.

In [2]:
chembl = blaze.Data('postgresql://localhost/chembl_19')

Blaze collects information about the schema while connecting, so I have access to all the tables and their columns. This doesn't show in the static blog post, but I can use IPython's tab completion on table names and/or columns.

Start with a basic join, blaze figures out the column(s) to use:

In [3]:
docs_and_compounds = blaze.join(chembl.docs,chembl.compound_records)
In [4]:
docs_and_compounds.fields
Out[4]:
['doc_id',
 'journal',
 'year',
 'volume',
 'issue',
 'first_page',
 'last_page',
 'pubmed_id',
 'doi',
 'chembl_id',
 'title',
 'doc_type',
 'authors',
 'abstract',
 'record_id',
 'molregno',
 'compound_key',
 'compound_name',
 'src_id',
 'src_compound_id']
In [5]:
docs_and_compounds.nrows
Out[5]:
1637862
In [6]:
blaze.sort(docs_and_compounds,key='year',ascending=False)
Out[6]:
doc_id journal year volume issue first_page last_page pubmed_id doi chembl_id title doc_type authors abstract record_id molregno compound_key compound_name src_id src_compound_id
0 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979019 1626883 MMV672625 None 23 MMV672625
1 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979020 1626884 MMV672626 None 23 MMV672626
2 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979021 1626885 MMV672686 None 23 MMV672686
3 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979022 1626886 MMV672687 None 23 MMV672687
4 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979023 1626887 MMV672688 None 23 MMV672688
5 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979024 1626888 MMV672689 None 23 MMV672689
6 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979025 1626889 MMV672723 None 23 MMV672723
7 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979026 1626890 MMV672725 None 23 MMV672725
8 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979027 1626891 MMV672726 None 23 MMV672726
9 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979028 1626892 MMV672727 None 23 MMV672727
10 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979029 1626893 MMV672730 None 23 MMV672730

Get rid of the unpublished datasets (doc_id=-1) and datasets where the year is None by doing a query on the docs_and_compounds object:

In [6]:
docs_and_compounds = docs_and_compounds[(docs_and_compounds['year']>0) & (docs_and_compounds['doc_id']>-1)]
docs_and_compounds.nrows
Out[6]:
1095936
In [7]:
docs_and_compounds
Out[7]:
doc_id journal year volume issue first_page last_page pubmed_id doi chembl_id title doc_type authors abstract record_id molregno compound_key compound_name src_id src_compound_id
0 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350140 292848 15 (3R,4S)-1,4-Bis-(4-methoxy-phenyl)-3-((Z)-3-ph... 1 None
1 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350141 292849 16 (3R,4S)-1,4-Bis-(4-methoxy-phenyl)-3-((E)-3-ph... 1 None
2 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350142 292858 17 (S)-7-(4-Chloro-phenyl)-3-(4-methoxy-phenyl)-2... 1 None
3 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350143 292858 18 (S)-7-(4-Chloro-phenyl)-3-(4-methoxy-phenyl)-2... 1 None
4 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350144 292870 20 4-[(2S,3R)-1-(4-Hydroxy-phenyl)-4-oxo-3-(3-phe... 1 None
5 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350145 292107 21 (3R,4S)-1,4-Bis-(4-hydroxy-phenyl)-3-(3-phenyl... 1 None
6 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350146 292254 22 4-[1-(4-formylphenyl)-3-[3-(4-hydroxyphenyl)pr... 1 None
7 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350147 292403 23 4-[1-(4-formylphenyl)-3-(3-hydroxy-3-phenylpro... 1 None
8 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350148 292557 24 4-[1-(4-formylphenyl)-3-(3-hydroxy-3-phenylpro... 1 None
9 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350149 292687 25 (3R,4S)-1,4-Bis-(4-acetyl-phenyl)-3-(3-oxo-3-p... 1 None
10 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350150 175106 Ezetimibe (3R,4S)-1-(4-Fluoro-phenyl)-3-[(S)-3-(4-fluoro... 1 None

Find the first year each molregno appeared using blaze.by. This is analogous to SQLs "group by".

In [8]:
minyear = blaze.by(docs_and_compounds.molregno,yr=docs_and_compounds.year.min())
minyear
Out[8]:
molregno yr
0 1 1983
1 2 1983
2 3 1983
3 4 1983
4 5 1983
5 6 1983
6 7 1983
7 8 1983
8 9 1983
9 10 1997
10 11 1993
In [9]:
year_counts = blaze.by(minyear.yr,cnt=minyear.molregno.count())
In [10]:
blaze.sort(year_counts,key='yr',ascending=False)
Out[10]:
yr cnt
0 2014 15131
1 2013 70429
2 2012 69407
3 2011 67026
4 2010 84525
5 2009 65118
6 2008 61158
7 2007 52060
8 2006 33837
9 2005 30778
10 2004 31118

That's what we had before. Excellent.

Let's try something else: the number of documents per compound. We'll use the limited set of docs that have years and doc_ids

In [11]:
docs_per_compound = blaze.by(docs_and_compounds.molregno,ndocs=docs_and_compounds.doc_id.nunique())
In [12]:
blaze.sort(docs_per_compound,key='ndocs',ascending=False)
Out[12]:
molregno ndocs
0 78759 1140
1 241 1045
2 6579 761
3 8062 660
4 8873 617
5 173 593
6 13758 564
7 305519 544
8 27307 539
9 365189 530
10 70140 501

That's also more or less what we got before.

Blaze also makes it easy to pull the data from the query (which is only executed in a lazy manner) into other formats. Here's an example of grabbing it as a Pandas DataFrame (for more on this, look at the blaze migrations blog post)

In [13]:
df = blaze.into(pd.DataFrame,docs_per_compound)
In [14]:
_=hist(df['ndocs'],bins=20,log=True)
_=xlabel('num docs per compound')
In [18]:
len(df[df['ndocs']>10])
Out[18]:
2381
In []: