Friday, November 15, 2013

Substructure fingerprints and ChEMBL

Substructure fingerprints and the PostgreSQL cartridge 2: Application to ChEMBL

This is a quick addendum to the previous post about using fingerprints for substructure screening in the RDKit PostgreSQL cartridge. The earlier post showed performance statistics for my benchmarking data set.

A more interesting and relevant example would be to see how well the fingerprints do for substructure screening across a larger dataset: the full set of ChEMBL molecules.

Start by building the molecular index:

chembl_16=# create index molidx on rdk.mols using gist(m);
CREATE INDEX
Time: 1638865.515 ms

Now move on to our standard queries:

chembl_16=# select count(*) from rdk.mols mt cross join rdk.zinc_leads qt where mt.m@>qt.m;
count 
-------
32518
(1 row)

Time: 50870.368 ms
chembl_16=# select count(*) from rdk.mols mt cross join rdk.zinc_frags qt where mt.m@>qt.m;
count  
--------
118223
(1 row)

Time: 95133.888 ms
chembl_16=# select count(*) from rdk.mols mt cross join rdk.zinc_leads_scaffolds qt where mt.m@>qt.m;
count  
---------
7949449
(1 row)

Time: 664180.683 ms
chembl_16=# select count(*) from rdk.mols mt cross join rdk.pubchem_pieces qt where mt.m@>qt.m;
count   
----------
51987307
(1 row)

Time: 7035820.224 ms

These take a lot longer than the previous examples, but the queries are against a much larger set of molecules -- about 1.3 million instead of 50K -- and return many more results. To set expectations, the zinc_leads, zinc_frags, and zinc_leads_scaffolds queries are, theoretically, 645 million comparisons (1.3 million*500). The pubchem_pieces queries involve more than a billion comparisons.

It's worth looking at the same screenout query as in the previous post:

chembl_16=# explain analyze select * from rdk.mols where m@>'c1ccc([C@@H]2CO2)cc1';
QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on mols  (cost=284.96..4917.12 rows=1291 width=423) (actual time=71.638..602.236 rows=394 loops=1)
Recheck Cond: (m @> 'c1ccc([C@@H]2CO2)cc1'::mol)
Rows Removed by Index Recheck: 681
->  Bitmap Index Scan on molidx  (cost=0.00..284.64 rows=1291 width=0) (actual time=71.358..71.358 rows=1075 loops=1)
Index Cond: (m @> 'c1ccc([C@@H]2CO2)cc1'::mol)
Total runtime: 604.713 ms
(6 rows)

The accuracy for this query has dropped to about 0.40, but it's still only taking 0.6 seconds to find the 394 result rows in the 1.3 million compound table.

Substructure fingerprints and the cartridge

Substructure fingerprints and the PostgreSQL cartridge

The previous post about using fingerprints for substructure screening covered the basics and looked at screenout rates. A more important point is how well the fingerprints perform in a real-world application. The use I will pick here is integration with the RDKit cartridge for PostgreSQL.

The RDKit cartridge includes integration with PostgreSQL's Generalized Search Tree (GiST) index engine and allows indices to be built for molecule columns that help with substructure search.

To demonstrate this, I'll start by collecting the 50K compounds that I've been using into a database and building an index:

~/RDKit_blog/data > psql -c 'create table raw_data2 (chemblid integer, smiles text)' fptest
CREATE TABLE
~/RDKit_blog/data > zcat chembl16_50K_sss.txt.gz | sed 's/\\/\\\\/g' | psql -c "copy raw_data2 (chemblid,smiles) from stdin with delimiter ' '" fptest
~/RDKit_blog/data > psql fptest
psql (9.2.4)
Type "help" for help.

fptest=# \timing
Timing is on.
fptest=# select * into mols from (select chemblid,mol_from_smiles(smiles::cstring) m from raw_data2) tmp where m is not null;
SELECT 50000
Time: 36408.113 ms
fptest=# create index molidx on mols using gist(m);
CREATE INDEX
Time: 56578.796 ms

Add an additional query set, the Murcko scaffolds for the ZINC_leads set:

fptest=# select id,mol_murckoscaffold(m) m into zinc_leads_scaffolds from zinc_leads;
SELECT 500
Time: 108.871 ms

Now try a couple substructure queries using some random ZINC_leads scaffolds:

fptest=# select * from zinc_leads_scaffolds order by random() limit 5;
      id      |               m               
--------------+-------------------------------
 ZINC35324029 | c1ccc([C@@H]2CO2)cc1
 ZINC05920112 | S=c1ssc2c1SCS2
 ZINC16848052 | c1nnc[nH]1
 ZINC00102676 | O=c1[nH]c(=O)n(-c2ccccc2)s1
 ZINC00265495 | O=C1O[C@H](c2ccoc2)C2CCCC=C12
(5 rows)

Time: 3.029 ms
fptest=# select * from mols where m@>'c1ccc([C@@H]2CO2)cc1';
 chemblid |                   m                   
----------+---------------------------------------
  1355230 | C=CCOC1=CC(=O)c2ccccc2C12CO2
  1122083 | OC1c2ccc3c4ccccc4c4ccccc4c3c2C2OC2C1O
   238553 | NC(=O)N1c2ccccc2C2OC2c2ccccc21
   624377 | Cc1ccc(C2OC2C(=O)c2ccc3c(c2)CCCC3)cc1
(4 rows)

Time: 13.225 ms
fptest=# select * from mols where m@>'O=c1[nH]c(=O)n(-c2ccccc2)s1';
 chemblid |              m               
----------+------------------------------
   495448 | O=c1sn(-c2ccccc2)c(=O)n1CCCl
(1 row)

Time: 3.698 ms
fptest=# select count(*) from mols where m@>'c1nnc[nH]1';
 count 
-------
   986
(1 row)

Time: 80.336 ms

The individual queries using the index are nice and fast. How about if I turn the index off?

fptest=# SET enable_indexscan=off;
SET
Time: 0.424 ms
fptest=# SET enable_bitmapscan=off;
SET
Time: 0.206 ms
fptest=# SET enable_seqscan=on;
SET
Time: 0.195 ms
fptest=# select * from mols where m@>'c1ccc([C@@H]2CO2)cc1';
 chemblid |                   m                   
----------+---------------------------------------
  1355230 | C=CCOC1=CC(=O)c2ccccc2C12CO2
  1122083 | OC1c2ccc3c4ccccc4c4ccccc4c3c2C2OC2C1O
   238553 | NC(=O)N1c2ccccc2C2OC2c2ccccc21
   624377 | Cc1ccc(C2OC2C(=O)c2ccc3c(c2)CCCC3)cc1
(4 rows)

Time: 4191.875 ms

That certainly makes a difference. We can see why by turning the index back on and looking at the query analysis:

fptest=# SET enable_indexscan=on;
SET
Time: 0.447 ms
fptest=# SET enable_bitmapscan=on;
SET
Time: 0.158 ms
fptest=# explain analyze select * from mols where m@>'c1ccc([C@@H]2CO2)cc1';
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on mols  (cost=12.74..192.77 rows=50 width=389) (actual time=10.848..11.482 rows=4 loops=1)
   Recheck Cond: (m @> 'c1ccc([C@@H]2CO2)cc1'::mol)
   Rows Removed by Index Recheck: 3
   ->  Bitmap Index Scan on molidx  (cost=0.00..12.72 rows=50 width=0) (actual time=10.262..10.262 rows=7 loops=1)
         Index Cond: (m @> 'c1ccc([C@@H]2CO2)cc1'::mol)
 Total runtime: 11.702 ms
(6 rows)

Time: 13.743 ms

There's lots of information there, most of which I'm going to ignore. For the purposes of this post, the important bits are the row counts from the index scan -- rows=7 -- and the final bitmap heap scan -- rows=4. For this query, seven database molecules passed the fingerprint scan and needed to have an actual substructure search done. Four of those actually contained the substructure. That's about the same screenout rate observed in the previous post.

If you're interested, you can get more information about explain output here or here.

The value of the fingerprints and the index really becomes clear when we do bulk searches using the cross join operation in SQL:

fptest=# select count(*) from mols cross join zinc_leads qt where mols.m@>qt.m;
 count 
-------
  1274
(1 row)

Time: 715.373 ms

This query does a substructure search across the molecules for each row in the ZINC_leads set. Without the index, the database would have to load all 50K molecules from the mols table and do 25000000 (50000*500) substructure comparisons. This takes a while:

fptest=# SET enable_indexscan=off;
SET
Time: 0.306 ms
fptest=# SET enable_bitmapscan=off;
SET
Time: 0.222 ms
fptest=# SET enable_seqscan=on;
SET
Time: 0.200 ms
fptest=# select count(*) from mols cross join zinc_leads qt where mols.m@>qt.m;
 count 
-------
  1274
(1 row)

Time: 1234519.855 ms

The index enables a speedup of >1700 here. Awesome. :-)

If we make the queries more generic by using their Murcko scaffolds, we get more results and the queries take longer:

fptest=# select count(*) from mols cross join zinc_leads_scaffolds qt where mols.m@>qt.m;
 count  
--------
 310719
(1 row)

Time: 19654.489 ms

Finding the 310000 matches takes just under 20 seconds.

For reference, here are the results for the other two query sets:

fptest=# select count(*) from mols cross join pubchem_pieces qt where mols.m@>qt.m;
  count  
---------
 1935315
(1 row)

Time: 214718.647 ms
fptest=# select count(*) from mols cross join zinc_frags qt where mols.m@>qt.m;
 count 
-------
  4659
(1 row)

Switching to a fingerprint that was less effective in the screenout benchhmarking, the Layered fingerprint, we see a substantial increase in search times:

fptest=# select count(*) from mols cross join pubchem_pieces qt where mols.m@>qt.m;
  count  
---------
 1935315
(1 row)

Time: 359185.914 ms
fptest=# select count(*) from mols cross join zinc_frags qt where mols.m@>qt.m;
 count 
-------
  4659
(1 row)

Time: 5478.102 ms
fptest=# select count(*) from mols cross join zinc_leads qt where mols.m@>qt.m;
 count 
-------
  1274
(1 row)

Time: 2861.711 ms
fptest=# select count(*) from mols cross join zinc_leads_scaffolds qt where mols.m@>qt.m;
 count  
--------
 310719
(1 row)

Time: 34194.532 ms

Saturday, November 9, 2013

Fingerprints for Substructure Screening 1

Fingerprint-based substructure screening 1

[Edited 10/11/2013 due to the bug fix for https://github.com/rdkit/rdkit/issues/151]
Some form of fingerprint is often used to make substructure searching more efficient. The idea is to use a fingerprinting algorithm with the property:
FP(query) & FP(mol) = FP(query)
if query is a substructure of mol. In words: if query is a substructure of mol then every bit set in the fingerprint of query should also be set in mol.
A bunch of different approaches to this have been developed, I'm not going to review them all here. :-)
Andrew Dalke has done some writing on the topic: http://dalkescientific.com/writings/diary/archive/2012/06/11/optimizing_substructure_keys.html
One of the best-known approaches is the Daylight algorithm: http://www.daylight.com/dayhtml/doc/theory/theory.finger.html
As Andrew mentions in the post I link to above, one of the big problems here is the lack of a reasonable query dataset for benchmarking. The approach I've taken is to use collections of small molecules from ZINC as well as some fragments of PubChem molecules. The database I query is a set of ChEMBL molecules from an earlier post.
Here's the atom-number count distributions of the queries and molecules:

It's not perfect, but it's a start.

TL;DR summary

Using the RDKit pattern fingerprinter (the default used in the postgresql cartridge, more on that in another post), a screenout accuracy of around 60% is achievable with these three query sets. This means that 60% of molecules passing the fingerprint screen actually have a substructure match.
To put that in perspective: with the leads query set and the pattern fingerprint, 4650 substructure matches are found in a total of 7601 searches, that's 61% accuracy for the fingerprint. If the fingerprint had not been used to pre-screen, 25000000 (500*50000) searches would have needed to be done, so we've reduced the number of substructure searches by 99.97%. By this logic, even a comparatively poor fingerprint with an accuracy of 5% would help enormously: reducing the number of searches by 99.7%. This will become important when actually using the fingerprints in a database index, but that's a different post.
On the technical side: there's some info below about using IPython's ability to run code in parallel.

In [1]:

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit.Avalon import pyAvalonTools
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
from rdkit import DataStructs
import cPickle,random,gzip,time
from __future__ import print_function
print(rdBase.rdkitVersion)

2014.03.1pre

Read in the data

Database molecules

Here we will use the 50K molecules that make up the set of 25K reference pairs generated in an earlier post: http://rdkit.blogspot.ch/2013/10/building-similarity-comparison-set-goal.html
As a quick reminder: these are pairs of molecules taken from ChEMBL with MW<600 and a count-based MFP0 similarity of at least 0.7 to each other.

In [2]:

ind = [x.split() for x in gzip.open('../data/chembl16_25K.pairs.txt.gz')]
mols = []
for i,row in enumerate(ind):
    m1 = Chem.MolFromSmiles(row[1])
    mols.append(m1)
    m2 = Chem.MolFromSmiles(row[3])
    mols.append(m2)

Query molecules

We'll use three sets:

Fragments: 500 diverse molecules taken from the ZINC Fragments set
Leads: 500 diverse molecules taken from the ZINC Lead-like set
Pieces: 823 pieces of molecules obtained by doing a BRICS fragmentation of some molecules from the pubchem screening set.

These sets were discussed in this thread on the mailing list: http://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg02066.html and this presentation: http://www.hinxton.wellcome.ac.uk/advancedcourses/MIOSS%20Greg%20Landrum.pdf

In [3]:

frags = [Chem.MolFromSmiles(x.split()[0]) for x in file('../data/zinc.frags.500.q.smi')]
leads = [Chem.MolFromSmiles(x.split()[0]) for x in file('../data/zinc.leads.500.q.smi')]
pieces = [Chem.MolFromSmiles(x) for x in file('../data/fragqueries.q.txt')]

Look at the size distributions:

In [97]:

figure(figsize=(16,4))
subplot('141')
hist([x.GetNumAtoms() for x in pieces],bins=20)
title('Pieces')
xlabel('NumAtoms')
subplot('142')
hist([x.GetNumAtoms() for x in frags],bins=20)
title('Fragments')
xlabel('NumAtoms')
subplot('143')
hist([x.GetNumAtoms() for x in leads],bins=20)
title('Leads')
_=xlabel('NumAtoms')
subplot('144')
hist([x.GetNumAtoms() for x in mols],bins=20)
title('Mols')
_=xlabel('NumAtoms')

Setting up the parallelization

I'm going to be doing a fair amount of embarassingly parallel computation, so I'll take advantage of IPython's support for parallel computing. For this to work, you need to have a cluster of workers set up. This is easy in the Notebook (there's a tab in the dashboard), and it's doable from the command line (http://ipython.org/ipython-doc/stable/parallel/parallel_process.html)
It took a bit of time to figure out how to do this, but once over that hurdle, this thing is super easy and useful.

In [4]:

from IPython.parallel import Client
rc = Client()
dview = rc[:]

In [5]:

with dview.sync_imports():
    from rdkit import Chem
    from rdkit import DataStructs
    from rdkit.Avalon import pyAvalonTools

importing Chem from rdkit on engine(s)
importing DataStructs from rdkit on engine(s)
importing pyAvalonTools from rdkit.Avalon on engine(s)

Verify that it works:

In [28]:

t1=time.time()
_ = [Chem.RDKFingerprint(m) for m in mols[:10000]]
t2=time.time()
print('Serial: %.2f'%(t2-t1))

Serial: 15.39

In [70]:

fn = lambda x:Chem.RDKFingerprint(x)
t1=time.time()
_ = dview.map_sync(fn,mols[:10000])
t2=time.time()
print('Parallel: %.2f'%(t2-t1))

Parallel: 5.24

The above runtimes were collected with a cluster of 4 workers, so it looks like the parallization is working.

The test harness

In [6]:

def subTest(mol,queries,qfps,fpf,verify,silent):
    nFound=0
    nScreened=0
    nTot=0
    nFailed = 0
    molfp = fpf(mol,False)
    for j,queryfp in enumerate(qfps):
        nTot += 1 
        if DataStructs.AllProbeBitsMatch(queryfp,molfp):
            nScreened += 1
            if mol.HasSubstructMatch(queries[j]):
                nFound += 1
        elif verify:
            if mol.HasSubstructMatch(queries[j]):
                nFailed += 1
                if not silent:
                    print('Failure: %s %s'%(Chem.MolToSmiles(mol,True),Chem.MolToSmiles(queries[j],True)))
    return nFound,nScreened,nTot,nFailed

def testScreenout(mols,queries,fpf,dview,verify=False,silent=False):
    nFound=0
    nScreened=0
    nTot=0
    nFailed = 0
    #if not silent: print("Building query fingerprints")
    qfps = [fpf(m,True) for m in queries]
    #if not silent: print("Running Queries")
    t = lambda x,qfps=qfps,queries=queries,fpf=fpf,verify=verify,silent=silent,subTest=subTest:subTest(x,queries,qfps,fpf,verify,silent)
    res=dview.map_async(t,mols)
    for entry in res:
        nFound+=entry[0]
        nScreened+=entry[1]
        nTot+=entry[2]
        nFailed+=entry[3]
    if not silent:
        accuracy = float(nFound)/nScreened
        print("Found %d matches in %d searches with %d failures. Accuracy: %.3f"%(nFound,nScreened,nFailed,accuracy))
    return nTot,nScreened,nFound,nFailed

Tests

In [11]:

methods = {
   'Avalon-2K':lambda x,y: pyAvalonTools.GetAvalonFP(x,nBits=2048,isQuery=y,bitFlags=pyAvalonTools.avalonSSSBits),

   'Pattern-1K':lambda x,y: Chem.PatternFingerprint(x,fpSize=1024),
   'Pattern-2K':lambda x,y: Chem.PatternFingerprint(x,fpSize=2048),
   'Pattern-4K':lambda x,y: Chem.PatternFingerprint(x,fpSize=4096),

   'Layered': lambda x,y:Chem.LayeredFingerprint(x,layerFlags=Chem.LayeredFingerprint_substructLayers),
   }

In [12]:

leadResults={}
fragResults={}
pieceResults={}
for method,func in methods.iteritems():
    print("----------------------------")
    print("Doing %s"%method)
    print("Leads")
    leadResults[method]=testScreenout(mols,leads,func,dview)
    print("Frags")
    fragResults[method]=testScreenout(mols,frags,func,dview)
    print("Pieces")
    pieceResults[method]=testScreenout(mols,pieces,func,dview)

----------------------------
Doing Avalon-2K
Leads
Found 886 matches in 2044 searches with 0 failures. Accuracy: 0.433
Frags
Found 4479 matches in 13656 searches with 0 failures. Accuracy: 0.328

Summarize that:

In [14]:

mns = sorted(methods.keys())
for mn in mns:
    print(mn,'Pieces','%.3f'%(float(pieceResults[mn][2])/pieceResults[mn][1]),'Fragments','%.3f'%(float(fragResults[mn][2])/fragResults[mn][1]),
          'Leads','%.3f'%(float(leadResults[mn][2])/leadResults[mn][1]))

Avalon-2K Pieces 0.465 Fragments 0.328 Leads 0.433
Layered Pieces 0.417 Fragments 0.105 Leads 0.187
Pattern-1K Pieces 0.500 Fragments 0.285 Leads 0.169
Pattern-2K Pieces 0.572 Fragments 0.590 Leads 0.715
Pattern-4K Pieces 0.635 Fragments 0.602 Leads 0.729

A caveat

The results from the Avalon fingerprint shown above are not directly comparable to the others since structures are being filtered out that shouldn't be.
To demonstrate this, we need a serial form of the screenout test:

In [8]:

def testScreenoutSerial(mols,queries,fpf,verify=False,silent=False):
    nFound=0
    nScreened=0
    nTot=0
    nFailed = 0
    #if not silent: print("Building query fingerprints")
    qfps = [fpf(m,True) for m in queries]
    #if not silent: print("Running Queries")
    t = lambda x,qfps=qfps,queries=queries,fpf=fpf,verify=verify,silent=silent,subTest=subTest:subTest(x,queries,qfps,fpf,verify,silent)
    res=map(t,mols)
    for entry in res:
        nFound+=entry[0]
        nScreened+=entry[1]
        nTot+=entry[2]
        nFailed+=entry[3]
    if not silent:
        accuracy = float(nFound)/nScreened
        print("Found %d matches in %d searches with %d failures. Accuracy: %.3f"%(nFound,nScreened,nFailed,accuracy))
    return nTot,nScreened,nFound,nFailed

And now we can get some examples:

In [16]:

testScreenoutSerial(mols[:500],leads,methods['Avalon-2K'],verify=True)

Failure: Cc1ccc(C)c(S(=O)(=O)c2nnn3c4ccsc4c(Nc4ccc(C)c(C)c4)nc23)c1 c1cn[nH]n1
Failure: COc1ccc(OCC#Cc2cn([C@H](C)C[C@@H]3CC[C@H]([C@@H](C)C(=O)N(C)Cc4ccccc4)O3)nn2)cc1 c1cn[nH]n1
Failure: Cc1c(C(=O)Nc2ccccc2)nnn1Cc1ccccc1O c1cn[nH]n1

Out[16]:

(250000, 22, 10, 5)

Look at a specific example:

In [17]:

mfp = pyAvalonTools.GetAvalonFP('Cc1ccc(C)c(S(=O)(=O)c2nnn3c4ccsc4c(Nc4ccc(C)c(C)c4)nc23)c1',True,2048,isQuery=False)
qfp = pyAvalonTools.GetAvalonFP('c1cn[nH]n1',True,2048,isQuery=True)
DataStructs.AllProbeBitsMatch(qfp,mfp)

Out[17]:

False

In [18]:

Chem.MolFromSmiles('Cc1ccc(C)c(S(=O)(=O)c2nnn3c4ccsc4c(Nc4ccc(C)c(C)c4)nc23)c1')

Out[18]:

The normal explanation for this would be differences in the aromaticity model. In this case, however, that turns out not be exactly it.
Here's the molecule we're querying with:

In [19]:

Chem.MolFromSmiles('c1cn[nH]n1')

Out[19]:

In [20]:

qfp2 = pyAvalonTools.GetAvalonFP('c1cnn[nH]1',True,2048,isQuery=True)
qfp==qfp2

Out[20]:

False

The deciding factor is the tautomer. The Avalon fingerprinter considers this significant, and that difference explains the different screenout.
If we use the fingerprint from the second tautomer we see the expected screenout behavior:

In [102]:

DataStructs.AllProbeBitsMatch(qfp2,mfp)

Out[102]:

True

This ends up mattering because the RDKit's substructure matcher ignores the H completely:

In [103]:

Chem.MolFromSmiles('c1cnn[nH]1').HasSubstructMatch(Chem.MolFromSmiles('c1cn[nH]n1'))

Out[103]:

True

In [104]:

Chem.MolFromSmiles('c1cnnn1C').HasSubstructMatch(Chem.MolFromSmiles('c1cn[nH]n1'))

Out[104]:

True

Unless, of course, you construct the query from SMARTS:

In [105]:

Chem.MolFromSmiles('c1cnnn1C').HasSubstructMatch(Chem.MolFromSmarts('c1cn[nH]n1'))

Out[105]:

False

Other fingerprints

The RDKit fingerprint

In [27]:

fn = lambda x,v:Chem.RDKFingerprint(x,minPath=1,maxPath=5,nBitsPerHash=1,useHs=False)
testScreenoutSerial(mols,leads,fn,verify=True)

Failure: COc1ccc(-n2c(SC(C)=O)nc3sc4c(c3c2=O)CCCC4)cc1 CSC(C)=O
Failure: CCCCCC[C@H]1SC(=O)c2ccccc2[C@@H]1C(=O)O CSC(C)=O
Failure: COc1ccccc1SC(=O)c1cccc(C=O)n1 CSC(C)=O

Out[27]:

(25000000, 1399, 1240, 34)

The failures happen here because the RDKit fingerprinter uses information about aromaticity in the calculation of atom invariants and while hashing bonds. That extra information is a big part of why the accuracy is so high.
Turning off all bond order information (finer grained control is unfortunately not possible) and using atomic number as the atom invariants gives no errors, but a much lower screenout rate:

In [53]:

def func(mol,v):
    invs = [x.GetAtomicNum() for x in mol.GetAtoms()]
    return Chem.RDKFingerprint(mol,minPath=1,maxPath=5,nBitsPerHash=1,useHs=False,useBondOrder=False,atomInvariants=invs)
testScreenoutSerial(mols,leads,func,verify=True)

Found 1274 matches in 125651 searches with 0 failures. Accuracy: 0.010

Out[53]:

(25000000, 125651, 1274, 0)

Using a shorter maxPath length also hurts accuracy:

In [35]:

fn = lambda x,v:Chem.RDKFingerprint(x,minPath=1,maxPath=4,nBitsPerHash=1,useHs=False)
testScreenout(mols,leads,fn,dview)

Found 1240 matches in 2331 searches with 0 failures. Accuracy: 0.532

Out[35]:

(25000000, 2331, 1240, 0)

It is most likely luck that there are no failures here.

Chemfp

Andrew Dalke's chemfp package (http://chemfp.com/) includes a pattern-based fingerprinter which is based on the PubChem/CACTVS substructure keys:

In [31]:

from chemfp import rdkit_patterns
dalke_fp=rdkit_patterns.SubstructRDKitFingerprinter_v1.make_fingerprinter()
fn = lambda x,v,fp=dalke_fp:DataStructs.CreateFromBinaryText(fp(x))
testScreenoutSerial(mols[:500],leads,fn,verify=True)

Failure: CC(C)(C)c1cc2c(cc1SC1=C(O)OC(CCc3ccccc3)(c3ccccc3)CC1=O)CCCC2 C1=COCCC1
Failure: CC(C)(C)c1cc2c(cc1SC1=C(O)OC(CCc3ccccc3)(c3ccccc3)CC1=O)CCCC2 O=C1C=CO[C@H](c2ccccc2)C1
Failure: Cc1ccc(C)c(S(=O)(=O)c2nnn3c4ccsc4c(Nc4ccc(C)c(C)c4)nc23)c1 c1cn[nH]n1
Failure: CCOC(=O)Cc1c(C)n(CC2CCCCC2)c2ccc(OC)cc12 Cc1cc2cc(O)ccc2[nH]1

Out[31]:

(250000, 108, 3, 12)

It's not particularly effective as a screening fingerprint and, as seen above, creates some holes.

MACCS keys

Another key-based fingerprint that isn't particularly effective and results in some misses:

In [9]:

fn = lambda x,v:rdMolDescriptors.GetMACCSKeysFingerprint(x)
testScreenoutSerial(mols[:500],leads,fn,verify=True)

Failure: CC(C)(C)c1cc2c(cc1SC1=C(O)OC(CCc3ccccc3)(c3ccccc3)CC1=O)CCCC2 C1=COCCC1
Failure: Cc1ccc(C)c(S(=O)(=O)c2nnn3c4ccsc4c(Nc4ccc(C)c(C)c4)nc23)c1 c1cn[nH]n1
Failure: CCOC(=O)C1=COC(C)(CCc2ccccc2)CC1=O C1=COCCC1
Failure: CCOC(=O)Cc1c(C)n(CC2CCCCC2)c2ccc(OC)cc12 Cc1cc2cc(O)ccc2[nH]1

Out[9]:

(250000, 134, 3, 12)

Improving things

A first pass at improving is to combine some selected patterns with an algorithmic fingerprinter like the pattern fingerprinter. Here's a thread describing an experiment:
http://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg02078.html
And a reproduction of that with this test harness:

In [15]:

qs="""0 2 times O largest 55458
1 2 times Ccc largest 29602
2 1 times CCN largest 16829
3 1 times cnc largest 11439
4 1 times cN largest 8998
5 1 times C=O largest 7358
6 1 times CCC largest 6250
7 1 times S largest 4760
8 1 times c1ccccc1 largest 4524
9 2 times N largest 2854
10 1 times C=C largest 2162
11 1 times nn largest 1840
12 2 times CO largest 1248
13 1 times Ccn largest 964
14 1 times CCCCC largest 857
15 1 times cc(c)c largest 653
16 3 times O largest 653
17 1 times O largest 466
18 2 times CNC largest 464
19 1 times s largest 457
20 1 times CC(C)C largest 335
21 1 times o largest 334
22 1 times cncnc largest 334
23 1 times C=N largest 321
24 2 times CC=O largest 238
25 4 times Ccc largest 238
26 1 times Cl largest 230
27 4 times O largest 149
28 2 times ccncc largest 149
29 6 times CCCCCC largest 76
30 2 times c1ccccc1 largest 76
31 1 times F largest 75
32 3 times CCOC largest 44
33 3 times N largest 44
34 1 times c(cn)n largest 44
35 1 times N largest 41
36 9 times C largest 41
37 1 times CC=C(C)C largest 33
38 1 times c1ccncc1 largest 26
39 1 times CC(C)N largest 26
40 1 times CC largest 26
41 4 times CCC(C)O largest 25
42 2 times ccc(cc)n largest 21
43 6 times C largest 21
44 1 times C1CCCC1 largest 18
45 1 times C largest 18
46 5 times O largest 18
47 2 times Ccn largest 14
48 1 times CNCN largest 13
49 3 times cncn largest 13
50 1 times CSC largest 13
51 3 times CC=O largest 11
52 1 times CCNCCCN largest 11
53 1 times CccC largest 11
54 3 times ccccc(c)c largest 10"""

def _initPatts():
    ssPatts=[]
    for q in qs.split('\n'):
        q = q.split(' ')
        count = int(q[1])
        patt = Chem.MolFromSmiles(q[3],sanitize=False)
        patt.UpdatePropertyCache(strict=False) # <- github #149, without this we cannot pickle the queries
        ssPatts.append((patt,count))
    return ssPatts
    
_ssPatts=None
def GetCombinedSubstructFP(m,ssPatts=None,fpSize=1024,verbose=False):
    if ssPatts is None:
        ssPatts = _initPatts()
    sz = len(ssPatts)
    lfp=Chem.PatternFingerprint(m,fpSize=fpSize)
    res = DataStructs.ExplicitBitVect(fpSize+sz)
    obls = [x+sz for x in lfp.GetOnBits()]
    res.SetBitsFromList(obls)
    for i,(p,count) in enumerate(ssPatts):
        matches = m.GetSubstructMatches(p,uniquify=True)
        if len(matches)>=count:
            res.SetBit(i)
    return res