Continuum's Blaze package provides an interesting way of working with datasets from within Python. Inspired by this blog post, here's a quick exploration of using Blaze Expressions

In [5]:

%pylab inline
import blaze
import pandas as pd

Populating the interactive namespace from numpy and matplotlib

Some technical details before getting started.

This is the first post I've done using Python 3.4. I'm using the anaconda distribution and its excellent conda package manager to handle installing packages.

The Blaze version I'm using is from the blaze channel: conda install -c blaze blaze and I got psycopg2 for Python3.4 (to allow connections to PostgreSQL) from the pandas channel: conda install -c pandas psycopg2.

In [2]:

chembl = blaze.Data('postgresql://localhost/chembl_19')

Blaze collects information about the schema while connecting, so I have access to all the tables and their columns. This doesn't show in the static blog post, but I can use IPython's tab completion on table names and/or columns.

Start with a basic join, blaze figures out the column(s) to use:

In [3]:

docs_and_compounds = blaze.join(chembl.docs,chembl.compound_records)

In [4]:

docs_and_compounds.fields

Out[4]:

['doc_id',
 'journal',
 'year',
 'volume',
 'issue',
 'first_page',
 'last_page',
 'pubmed_id',
 'doi',
 'chembl_id',
 'title',
 'doc_type',
 'authors',
 'abstract',
 'record_id',
 'molregno',
 'compound_key',
 'compound_name',
 'src_id',
 'src_compound_id']

In [5]:

docs_and_compounds.nrows

Out[5]:

1637862

In [6]:

blaze.sort(docs_and_compounds,key='year',ascending=False)

Out[6]:

	doc_id	journal	year	volume	issue	first_page	last_page	pubmed_id	doi	chembl_id	title	doc_type	authors	abstract	record_id	molregno	compound_key	compound_name	src_id	src_compound_id
0	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979019	1626883	MMV672625	None	23	MMV672625
1	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979020	1626884	MMV672626	None	23	MMV672626
2	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979021	1626885	MMV672686	None	23	MMV672686
3	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979022	1626886	MMV672687	None	23	MMV672687
4	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979023	1626887	MMV672688	None	23	MMV672688
5	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979024	1626888	MMV672689	None	23	MMV672689
6	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979025	1626889	MMV672723	None	23	MMV672723
7	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979026	1626890	MMV672725	None	23	MMV672725
8	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979027	1626891	MMV672726	None	23	MMV672726
9	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979028	1626892	MMV672727	None	23	MMV672727
10	77449	None	None	None	None	None	None	None	10.6019/CHEMBL3137547	CHEMBL3137547	Open Source Malaria Deposition 2. http://malar...	DATASET	None	None	1979029	1626893	MMV672730	None	23	MMV672730

Get rid of the unpublished datasets (doc_id=-1) and datasets where the year is None by doing a query on the docs_and_compounds object:

In [6]:

docs_and_compounds = docs_and_compounds[(docs_and_compounds['year']>0) & (docs_and_compounds['doc_id']>-1)]
docs_and_compounds.nrows

Out[6]:

1095936

In [7]:

docs_and_compounds

Out[7]:

	doc_id	journal	year	volume	issue	first_page	last_page	pubmed_id	doi	chembl_id	title	doc_type	authors	abstract	record_id	molregno	compound_key	compound_name	src_id	src_compound_id
0	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350140	292848	15	(3R,4S)-1,4-Bis-(4-methoxy-phenyl)-3-((Z)-3-ph...	1	None
1	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350141	292849	16	(3R,4S)-1,4-Bis-(4-methoxy-phenyl)-3-((E)-3-ph...	1	None
2	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350142	292858	17	(S)-7-(4-Chloro-phenyl)-3-(4-methoxy-phenyl)-2...	1	None
3	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350143	292858	18	(S)-7-(4-Chloro-phenyl)-3-(4-methoxy-phenyl)-2...	1	None
4	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350144	292870	20	4-[(2S,3R)-1-(4-Hydroxy-phenyl)-4-oxo-3-(3-phe...	1	None
5	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350145	292107	21	(3R,4S)-1,4-Bis-(4-hydroxy-phenyl)-3-(3-phenyl...	1	None
6	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350146	292254	22	4-[1-(4-formylphenyl)-3-[3-(4-hydroxyphenyl)pr...	1	None
7	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350147	292403	23	4-[1-(4-formylphenyl)-3-(3-hydroxy-3-phenylpro...	1	None
8	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350148	292557	24	4-[1-(4-formylphenyl)-3-(3-hydroxy-3-phenylpro...	1	None
9	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350149	292687	25	(3R,4S)-1,4-Bis-(4-acetyl-phenyl)-3-(3-oxo-3-p...	1	None
10	1	J. Med. Chem.	2004	47	1	1	9	14695813	None	CHEMBL1139451	None	PUBLICATION	None	None	350150	175106	Ezetimibe	(3R,4S)-1-(4-Fluoro-phenyl)-3-[(S)-3-(4-fluoro...	1	None

Find the first year each molregno appeared using blaze.by. This is analogous to SQLs "group by".

In [8]:

minyear = blaze.by(docs_and_compounds.molregno,yr=docs_and_compounds.year.min())
minyear

Out[8]:

	molregno	yr
0	1	1983
1	2	1983
2	3	1983
3	4	1983
4	5	1983
5	6	1983
6	7	1983
7	8	1983
8	9	1983
9	10	1997
10	11	1993

In [9]:

year_counts = blaze.by(minyear.yr,cnt=minyear.molregno.count())

In [10]:

blaze.sort(year_counts,key='yr',ascending=False)

Out[10]:

	yr	cnt
0	2014	15131
1	2013	70429
2	2012	69407
3	2011	67026
4	2010	84525
5	2009	65118
6	2008	61158
7	2007	52060
8	2006	33837
9	2005	30778
10	2004	31118

That's what we had before. Excellent.

Let's try something else: the number of documents per compound. We'll use the limited set of docs that have years and doc_ids

In [11]:

docs_per_compound = blaze.by(docs_and_compounds.molregno,ndocs=docs_and_compounds.doc_id.nunique())

In [12]:

blaze.sort(docs_per_compound,key='ndocs',ascending=False)

Out[12]:

	molregno	ndocs
0	78759	1140
1	241	1045
2	6579	761
3	8062	660
4	8873	617
5	173	593
6	13758	564
7	305519	544
8	27307	539
9	365189	530
10	70140	501

That's also more or less what we got before.

Blaze also makes it easy to pull the data from the query (which is only executed in a lazy manner) into other formats. Here's an example of grabbing it as a Pandas DataFrame (for more on this, look at the blaze migrations blog post)

In [13]:

df = blaze.into(pd.DataFrame,docs_per_compound)

In [14]:

_=hist(df['ndocs'],bins=20,log=True)
_=xlabel('num docs per compound')

In [18]:

len(df[df['ndocs']>10])

Out[18]:

In []:

Monday, December 8, 2014

Exploring Blaze

No comments:

About Me

Blog Archive