Sunday, October 26, 2014

Molecule counts in ChEMBL documents

I was curious about how many compounds are added to the data sources that are loaded into ChEMBL each year. This is an exploration around that theme, with a digression or two at the end, but no real conclusions

In [1]:
import time,random
import psycopg2
from __future__ import print_function
import pandas as pd
%pylab inline
%load_ext sql
Populating the interactive namespace from numpy and matplotlib

WARNING: pylab import has clobbered these variables: ['random']
`%matplotlib` prevents importing * from pylab and numpy

Start by pulling the number of new compounds appearing per year.

In [2]:
data = %sql postgresql://localhost/chembl_19 \
    select miny as year,count(molregno) from \
       (select molregno,min(year) as miny from compound_records join docs using (doc_id) where year>0 group by molregno) tmp \
    group by miny order by miny desc;
df1 = data.DataFrame()
37 rows affected.

In [3]:
df1.head(10)
Out[3]:
year count
0 2014 15131
1 2013 70429
2 2012 69407
3 2011 67026
4 2010 84525
5 2009 65118
6 2008 61158
7 2007 52060
8 2006 33837
9 2005 30778

Obviously 2014 isn't complete yet, so we should ignore that one.

How many documents per year show up?

In [5]:
data = %sql \
       select year,count(doc_id) as cnt from docs where year>0 \
       group by year order by year desc ;
df2 = data.DataFrame()
37 rows affected.

In [8]:
data=pd.merge(df1,df2)
data.rename(columns={'count':'molCount','cnt':'docCount'},inplace=True)
data.head(10)
Out[8]:
year molCount docCount
0 2014 15131 895
1 2013 70429 4211
2 2012 69407 4174
3 2011 67026 4022
4 2010 84525 4599
5 2009 65118 4443
6 2008 61158 4063
7 2007 52060 3822
8 2006 33837 2132
9 2005 30778 2009

2010 was a big year.

Look at the general relationship between number of documents and number of compounds:

In [10]:
scatter(data['docCount'],data['molCount'])
_=xlabel('doc count')
_=ylabel('mol count')

That looks pretty linear; we can confirm that by looking at the ratio:

In [12]:
data['ratio']=data['molCount']/data['docCount']
In [13]:
scatter(data['year'],data['ratio'])
_=xlabel('year')
_=ylabel('mols / doc')

There's a clear early outlier, leave it out:

In [15]:
data2=data[data.year>=1980]
scatter(data2['year'],data2['ratio'])
_=xlabel('year')
_=ylabel('mols / doc')
Out[15]:
<matplotlib.collections.PathCollection at 0x10b810950>

Looks like there was dip in the late 90s. Odd.

Compounds per paper

A more systematic look at the number of compounds appearing in each paper.

In [16]:
data = %sql postgresql://localhost/chembl_19 \
    select count(molregno) numcmpds,doc_id from compound_records join docs using (doc_id) \
    where doc_id>0 and year>1 group by doc_id
df1 = data.DataFrame()
57115 rows affected.

Let's ignore documents that have more than 200 compounds; they probably aren't SAR papers anyway:

In [24]:
smaller=df1[df1.numcmpds<200]
print(len(smaller))
_=hist(smaller['numcmpds'],bins=20,log=True)
57031

Papers per compound

Turn the question around and ask how many papers each compound appears in:

In [17]:
data = %sql postgresql://localhost/chembl_19 \
    select count(doc_id) numdocs,molregno from compound_records join docs using (doc_id) \
    where doc_id>0 and year>1 group by molregno order by numdocs desc
df1 = data.DataFrame()
874168 rows affected.

In [19]:
_=hist(df1['numdocs'],bins=20,log=True)
_=xlabel('num papers')
In [20]:
df1.head()
Out[20]:
numdocs molregno
0 1142 78759
1 1049 241
2 763 6579
3 667 8062
4 618 8873

top 10 appearing molecules

In [21]:
data = %sql postgresql://localhost/chembl_19 \
    select molregno,m as smiles,numdocs from ( \
      select count(doc_id) numdocs,molregno from compound_records join docs using (doc_id) \
      where doc_id>0 and year>1 group by molregno order by numdocs desc limit 10 \
    ) tmp join rdk.mols using (molregno)
df1 = data.DataFrame()
10 rows affected.

In [22]:
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
In [23]:
PandasTools.AddMoleculeColumnToFrame(df1,smilesCol='smiles')
In [24]:
PandasTools.FrameToGridImage(df1,legendsCol='numdocs')
Out[24]:
In [25]:
df1
Out[25]:
molregno smiles numdocs ROMol
0 78759 COc1cccc2c1C(=O)c1c(O)c3c(c(O)c1C2=O)C[C@@](O)(C(=O)CO)C[C@@H]3O[C@H]1C[C@H](N)[C@H](O)[C@H](C)O1 1142 Mol
1 241 O=C(O)c1cn(C2CC2)c2cc(N3CCNCC3)c(F)cc2c1=O 1049 Mol
2 6579 CCN(CC)CCCC(C)Nc1ccnc2cc(Cl)ccc21 763 Mol
3 8062 CC(=O)O[C@@H]1C2=C(C)[C@@H](OC(=O)[C@H](O)[C@@H](NC(=O)c3ccccc3)c3ccccc3)C[C@@](O)([C@@H](OC(=O)c3ccccc3)[C@@H]3[C@]4(OC(C)=O)CO[C@@H]4C[C@H](O)[C@@]3(C)C1=O)C2(C)C 667 Mol
4 8873 C[C@@H]1[C@H](O)[C@@H](C)C=CC=CC=CC=CC=CC=CC=C[C@H](O[C@@H]2O[C@H](C)[C@@H](O)[C@H](N)[C@@H]2O)C[C@@H]2O[C@](O)(C[C@@H](O)C[C@@H](O)[C@H](O)CC[C@@H](O)C[C@@H](O)CC(=O)O[C@H]1C)C[C@H](O)[C@H]2C(=O)O 618 Mol
5 173 COc1ccc2c(c1)c(CC(=O)O)c(C)n2C(=O)c1ccc(Cl)cc1 596 Mol
6 13758 OC(Cn1cncn1)(Cn1cncn1)c1ccc(F)cc1F 565 Mol
7 305519 CC1(C)S[C@@H]2[C@H](NC(=O)[C@H](N)c3ccccc3)C(=O)N2[C@H]1C(=O)O 545 Mol
8 27307 Cc1cn([C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2)c(=O)[nH]c1=O 543 Mol
9 365189 CO[C@H]1C=CO[C@@]2(C)Oc3c(c4c(O)c(/C=N/N5CCN(C)CC5)c(c(O)c4c(O)c3C)NC(=O)C(C)=CC=C[C@H](C)[C@H](O)[C@@H](C)[C@@H](O)[C@@H](C)[C@H](OC(C)=O)[C@@H]1C)C2=O 532 Mol

Let's look some of these popular compounds up via the NCI Resolver. I would use ChemSpider for this, but their API is overly painful to use for these simple tasks.

In [34]:
import requests
def lookup(x):
    resp = requests.get('http://cactus.nci.nih.gov/chemical/structure/'+x+'%09/names')
    return '|'.join([x for x in resp.content.split('\n') if x.find('USAN')>-1 or x.find('USP')>-1])
df1['names']=df1['smiles'].map(lookup)
In [40]:
df1.drop('smiles',1)
Out[40]:
molregno numdocs ROMol names
0 78759 1142 Mol Doxorubicin [USAN:BAN:INN]|Doxorubicin (USAN)
1 241 1049 Mol Ciprofloxacin (JAN/USP/INN)|Ciprofloxacin [USAN:BAN:INN]
2 6579 763 Mol Chloroquine (USP)|Chloroquine [USAN:BAN:INN]
3 8062 667 Mol Paclitaxel [USAN:BAN:INN]
4 8873 618 Mol
5 173 596 Mol USAN|Indomethacin [USAN:BAN]|Indomethacin (USP)
6 13758 565 Mol Fluconazole (JAN/USAN/INN)|Fluconazole [USAN:BAN:INN:JAN]
7 305519 545 Mol Ampicillin (USP/INN)|Ampicillin [USAN:BAN:INN:JAN]
8 27307 543 Mol Zidovudine [USAN:BAN:INN:JAN]|Zidovudine (JAN/USP/INN)
9 365189 532 Mol

I could look the missing two up in chemspider manually. Number 4 is Amphotericin B, and number 9 is Rifampicin. Definitely lots of antibiotics in that list.

In []: