%pylab inline
import blaze
import pandas as pd
Some technical details before getting started.
This is the first post I've done using Python 3.4. I'm using the anaconda distribution and its excellent conda package manager to handle installing packages.
The Blaze version I'm using is from the blaze channel: conda install -c blaze blaze
and I got psycopg2 for Python3.4 (to allow connections to PostgreSQL) from the pandas channel: conda install -c pandas psycopg2
.
chembl = blaze.Data('postgresql://localhost/chembl_19')
Blaze collects information about the schema while connecting, so I have access to all the tables and their columns. This doesn't show in the static blog post, but I can use IPython's tab completion on table names and/or columns.
Start with a basic join, blaze figures out the column(s) to use:
docs_and_compounds = blaze.join(chembl.docs,chembl.compound_records)
docs_and_compounds.fields
docs_and_compounds.nrows
blaze.sort(docs_and_compounds,key='year',ascending=False)
Get rid of the unpublished datasets (doc_id=-1) and datasets where the year is None by doing a query on the docs_and_compounds object:
docs_and_compounds = docs_and_compounds[(docs_and_compounds['year']>0) & (docs_and_compounds['doc_id']>-1)]
docs_and_compounds.nrows
docs_and_compounds
Find the first year each molregno appeared using blaze.by
. This is analogous to SQLs "group by".
minyear = blaze.by(docs_and_compounds.molregno,yr=docs_and_compounds.year.min())
minyear
year_counts = blaze.by(minyear.yr,cnt=minyear.molregno.count())
blaze.sort(year_counts,key='yr',ascending=False)
That's what we had before. Excellent.
Let's try something else: the number of documents per compound. We'll use the limited set of docs that have years and doc_ids
docs_per_compound = blaze.by(docs_and_compounds.molregno,ndocs=docs_and_compounds.doc_id.nunique())
blaze.sort(docs_per_compound,key='ndocs',ascending=False)
That's also more or less what we got before.
Blaze also makes it easy to pull the data from the query (which is only executed in a lazy manner) into other formats. Here's an example of grabbing it as a Pandas DataFrame (for more on this, look at the blaze migrations blog post)
df = blaze.into(pd.DataFrame,docs_per_compound)
_=hist(df['ndocs'],bins=20,log=True)
_=xlabel('num docs per compound')
len(df[df['ndocs']>10])