Friday, November 15, 2013

Substructure fingerprints and ChEMBL

Substructure fingerprints and the PostgreSQL cartridge 2: Application to ChEMBL

This is a quick addendum to the previous post about using fingerprints for substructure screening in the RDKit PostgreSQL cartridge. The earlier post showed performance statistics for my benchmarking data set.

A more interesting and relevant test is how well the fingerprints perform for substructure screening across a larger dataset: the full set of ChEMBL molecules.

Start by building the molecular index:

chembl_16=# create index molidx on rdk.mols using gist(m);
CREATE INDEX
Time: 1638865.515 ms

Now move on to our standard queries:

chembl_16=# select count(*) from rdk.mols mt cross join rdk.zinc_leads qt where mt.m@>qt.m;
count 
-------
32518
(1 row)

Time: 50870.368 ms
chembl_16=# select count(*) from rdk.mols mt cross join rdk.zinc_frags qt where mt.m@>qt.m;
count  
--------
118223
(1 row)

Time: 95133.888 ms
chembl_16=# select count(*) from rdk.mols mt cross join rdk.zinc_leads_scaffolds qt where mt.m@>qt.m;
count  
---------
7949449
(1 row)

Time: 664180.683 ms
chembl_16=# select count(*) from rdk.mols mt cross join rdk.pubchem_pieces qt where mt.m@>qt.m;
count   
----------
51987307
(1 row)

Time: 7035820.224 ms

These take a lot longer than the previous examples, but the queries run against a much larger set of molecules (about 1.3 million instead of 50K) and return many more results. To set expectations, the zinc_leads, zinc_frags, and zinc_leads_scaffolds queries each amount to, theoretically, about 645 million comparisons (roughly 1.3 million molecules * 500 queries). The pubchem_pieces queries involve more than a billion comparisons.
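To put those timings in perspective, here is a small Python sketch that turns the numbers reported above into a nominal rate and a hit fraction. It uses the post's own estimate of 645 million molecule-query pairs for the three 500-query sets. The `summarize` helper and the "pairs per second" figure are illustrative assumptions, not something the cartridge reports. The rate is nominal because the index screens out most pairs without a full substructure match. The pubchem_pieces set is left out because its exact query count is not stated here.

```python
# Figures copied from the psql output above. 645 million is the post's own
# estimate of (ChEMBL molecules) x (500 query molecules).
COMPARISONS = 645_000_000

# query set -> (elapsed time in ms, number of matching pairs)
results = {
    "zinc_leads": (50870.368, 32518),
    "zinc_frags": (95133.888, 118223),
    "zinc_leads_scaffolds": (664180.683, 7949449),
}


def summarize(elapsed_ms, hits, comparisons=COMPARISONS):
    """Return (seconds, nominal pairs per second, fraction of pairs that match)."""
    seconds = elapsed_ms / 1000.0
    return seconds, comparisons / seconds, hits / comparisons


for name, (elapsed_ms, hits) in results.items():
    seconds, rate, hit_fraction = summarize(elapsed_ms, hits)
    print(f"{name}: {seconds:.1f} s, {rate:.3g} pairs/s, "
          f"hit fraction {hit_fraction:.2e}")
```

The hit fraction makes the spread between the query sets visible: the scaffold queries match far more often than the lead queries, which is consistent with them taking much longer.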

It's worth looking at the same screenout query as in the previous post:

chembl_16=# explain analyze select * from rdk.mols where m@>'c1ccc([C@@H]2CO2)cc1';
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on mols  (cost=284.96..4917.12 rows=1291 width=423) (actual time=71.638..602.236 rows=394 loops=1)
   Recheck Cond: (m @> 'c1ccc([C@@H]2CO2)cc1'::mol)
   Rows Removed by Index Recheck: 681
   ->  Bitmap Index Scan on molidx  (cost=0.00..284.64 rows=1291 width=0) (actual time=71.358..71.358 rows=1075 loops=1)
         Index Cond: (m @> 'c1ccc([C@@H]2CO2)cc1'::mol)
 Total runtime: 604.713 ms
(6 rows)

The accuracy for this query has dropped to about 0.37 (394 of the 1075 rows that passed the index scan survived the recheck), but it's still only taking 0.6 seconds to find the 394 result rows in the 1.3 million compound table.
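The accuracy figure can be recomputed from the EXPLAIN ANALYZE output. The sketch below assumes that accuracy means the fraction of rows passed by the fingerprint index that turn out to be true substructure matches. The function name `screenout_accuracy` is just an illustrative label.

```python
def screenout_accuracy(rows_returned, rows_removed_by_recheck):
    """Fraction of index candidates that survive the exact substructure recheck."""
    candidates = rows_returned + rows_removed_by_recheck
    return rows_returned / candidates


# Values from the plan above: 394 final rows, 681 removed by the index recheck
# (394 + 681 = 1075, the row count reported by the Bitmap Index Scan).
accuracy = screenout_accuracy(394, 681)
print(f"{accuracy:.2f}")  # prints 0.37
```

Put another way, roughly two of every three molecules the index lets through for this query are false positives that the recheck then has to discard.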
