This is a guest post from Tim Dudgeon.
This time he picked on (pun!) the MaxMinPicker algorithm in RDKit. This also has some history, as it was the subject of a blog post by Greg from 2014 (and also this one).
The most obvious approach to picking a diverse subset might be to start by clustering the dataset, e.g. using Butina clustering, and then picking from each cluster. Indeed that's exactly what I've done here. But as Roger pointed out, this does not scale, as the full distance matrix is needed, which, assuming a symmetric distance function, has (n^2 - n)/2 entries. Something smarter is needed for large datasets.
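To see why it breaks down, here is a minimal sketch of the clustering route using RDKit's Butina implementation (a few toy molecules stand in for a real dataset): the whole flat list of (n^2 - n)/2 distances has to be built before any picking can happen.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

# toy molecules for illustration; a real run would read a full dataset
mols = [Chem.MolFromSmiles(s) for s in ('c1ccccc1O', 'c1ccccc1N', 'CCO', 'CCN')]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 1024) for m in mols]

# the full lower-triangle distance matrix: (n^2 - n)/2 entries
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend(1.0 - s for s in sims)

clusters = Butina.ClusterData(dists, len(fps), 0.35, isDistData=True)
picks = [cluster[0] for cluster in clusters]  # e.g. take each cluster centroid
```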
In fact the MaxMinPicker (see here for the C++ or here for the Python docs) also has the option to pass in the full distance matrix, which has the same limitation. But the key to MaxMinPicker (described in Ashton, M. et al., Quant. Struct.-Act. Relat., 21 (2002), 598-604) is that the full distance matrix does not need to be generated, only a subset of it, and it can be generated as needed in a lazy manner.
The algorithm works as follows (a rough code sketch follows the list):
1. Generate descriptors (e.g. Morgan fingerprints) for all the molecules, both any initial seeds and those to pick from (the candidate pool). For large sets this will be slow, but it cannot be avoided.
2. If there are no initial seeds, select an initial molecule at random from the candidate pool; it becomes the sole member of the picked set.
3. From the molecules in the candidate pool, find the one that has the maximum value for its minimum distance to the molecules in the picked set (hence the MaxMin name), calculating and recording the distances as required. This molecule is the most distant from those already picked, so it is transferred to the picked set.
4. Iterate back to step 3 until you are complete (e.g. you have picked the required number of molecules).
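As a rough illustration, the steps above amount to something like this (a naive sketch in plain Python, not the RDKit implementation; `dist` can be any distance function over the descriptors):

```python
import random

def maxmin_pick(dist, pool, n_total, seeds=None):
    """dist(i, j) -> distance; pool: candidate indexes; n_total includes seeds."""
    picked = list(seeds) if seeds else [random.choice(pool)]   # steps 1-2
    remaining = [i for i in pool if i not in picked]
    while len(picked) < n_total and remaining:
        # step 3: the candidate whose minimum distance to the picked set is maximal
        best = max(remaining, key=lambda c: min(dist(c, p) for p in picked))
        picked.append(best)
        remaining.remove(best)
    return picked
```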
The key is to think of the partial distance matrix as a table, with a column for each candidate molecule and a row for each picked molecule, and to traverse it column by column starting from the left. You move all the way down the first column to find its minimum value, and this becomes your current maximum. Then for each remaining column you move down the values until you reach a value that is less than or equal to your current maximum. If you hit such a value you know that molecule is more similar to the picked set than your current candidate, so it can be discarded and you can immediately move on to the next column. If you reach the bottom of a column and the minimum value for that column is greater than that of your current candidate, then that compound becomes your current candidate, as it is more distant than the previous one. As you traverse the columns you calculate and record the distances if they have not already been calculated. This lazy evaluation of the distances is key to the efficiency of the algorithm: a distance is only calculated if it is needed, and it is never calculated twice. This is the aspect that Roger spotted was not implemented in the old version, and which he improved as described in his talk.
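A sketch of that lazy, early-terminating column scan might look like this (again illustrative Python with a toy distance function, not Roger's C++ code):

```python
from functools import lru_cache

def dist(i, j):
    # toy distance for illustration; in practice 1 - Tanimoto similarity
    return abs(i - j) / 10.0

@lru_cache(maxsize=None)            # lazy: each pair is computed at most once
def _dist(i, j):
    return dist(i, j)

def cached_dist(i, j):
    return _dist(i, j) if i < j else _dist(j, i)   # symmetric: normalise the key

def next_pick(pool, picked):
    """Return the candidate whose minimum distance to `picked` is largest."""
    best, best_min = None, -1.0
    for c in pool:                   # one "column" of the table per candidate
        col_min = float('inf')
        for p in picked:             # move down the column
            d = cached_dist(c, p)
            if d <= best_min:        # cannot beat the current best: discard c
                col_min = d
                break
            col_min = min(col_min, d)
        if col_min > best_min:       # reached the bottom with a larger minimum
            best, best_min = c, col_min
    return best
```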
So what does this all mean? Roger's new version is present in the 2017_09_01 release of RDKit (now out), and the tests we show here use a beta version of it, comparing timings against the 2017_03_03 version, whose old implementation performs many more unnecessary distance calculations.
The results are shown in this Jupyter notebook.
First we look at the LazyPick method:
picks = mmp.LazyPick(fn, len(benzodiazepines), start_with + how_many_to_pick, list(range(start_with)))
The arguments are:
- `fn`: the distance function
- `len(benzodiazepines)`: the total number of descriptors, the initial seeds plus those to pick from
- `start_with + how_many_to_pick`: the number of indexes to return (including the initial seeds)
- `list(range(start_with))`: a list of the indexes of those in the initial pick list, typically [0, 1, 2, ...], but can be None
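Putting it together, a complete call might look like this (the input file name is made up for the example); the ability to supply an arbitrary Python distance function `fn` is what distinguishes LazyPick from LazyBitVectorPick below:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

# hypothetical input file; seeds are assumed to come first in the list
supplier = Chem.SmilesMolSupplier('benzodiazepines.smi', titleLine=False)
mols = [m for m in supplier if m is not None]
benzodiazepines = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 1024) for m in mols]

def fn(i, j):
    # Tanimoto distance between fingerprints i and j, computed on demand
    return 1.0 - DataStructs.TanimotoSimilarity(benzodiazepines[i], benzodiazepines[j])

start_with, how_many_to_pick = 100, 1000
mmp = MaxMinPicker()
picks = mmp.LazyPick(fn, len(benzodiazepines), start_with + how_many_to_pick,
                     list(range(start_with)))
```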
We compare this to the LazyBitVectorPick function. This does not allow a distance function to be defined, but instead uses the Tanimoto distance between the bit vectors. Whilst this is a restriction, it's exactly what we are using, and what most people will want in most cases, so it's not too much of a limitation. The plus side is that it allows less to and fro between the C++ and Python layers, so it performs much better.
picks = mmp.LazyBitVectorPick(benzodiazepines, len(benzodiazepines), start_with + how_many_to_pick, list(range(start_with)))
| Method | Seeds | Picks | Old time (s) | New time (s) | Improvement (×) |
| --- | --- | --- | --- | --- | --- |
| LazyPick | 0 | 100 | 17.5 | 0.67 | 26 |
| LazyPick | 0 | 1000 | 1523.9 | 13.94 | 109 |
| LazyPick | 100 | 100 | 81.6 | 1.41 | 58 |
| LazyPick | 1000 | 1000 | ND | 27.22 | |
| LazyBitVectorPick | 0 | 100 | 12.9 | 0.09 | 143 |
| LazyBitVectorPick | 0 | 1000 | 1466.7 | 1.39 | 1055 |
| LazyBitVectorPick | 100 | 100 | 72.1 | 0.18 | 401 |
| LazyBitVectorPick | 1000 | 100 | 761.3 | 0.54 | 1410 |
| LazyBitVectorPick | 100 | 1000 | 3105.8 | 1.38 | 2251 |
| LazyBitVectorPick | 1000 | 1000 | ND | 3.03 | |
| LazyBitVectorPick | 5000 | 1000 | ND | 7.37 | |
| LazyBitVectorPick | 5000 | 5000 | ND | 7.37 | |
Times are in seconds and are averages of 3 replicates (ND: not determined, as the old implementation was too slow).

Firstly, you'll notice that the LazyBitVectorPick function is significantly faster than LazyPick, especially with the new algorithm. As a result the additional measurements were done with LazyBitVectorPick only.
The new algorithm is most remarkable in the way it scales, and the bigger the problem to solve the bigger the improvement. Large studies that are impractical with the old algorithm now run in seconds.
So how can this be used for some more realistically scaled studies? Compound collections will often be on the order of several hundred thousand compounds, and those of pharma companies in the range of a few million. Vendor catalogs that can be picked from are often in the range of 100,000 compounds. So we set up a study that approaches this sort of scale. We chose the NCI250 dataset (comprising 247,477 SMILES that RDKit can handle) as the compounds that we already possess, and pick from the benzodiazepine set of 12,386 compounds. This can realistically only be done with the new algorithm.
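In outline, a seeded selection like this can be set up as follows (a sketch with made-up file names; the notebook linked below has the real details). The trick is that the seeds occupy the first indexes, so anything picked beyond them is a new compound:

```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

def load_fps(path):
    supplier = Chem.SmilesMolSupplier(path, titleLine=False)
    return [AllChem.GetMorganFingerprintAsBitVect(m, 2, 1024)
            for m in supplier if m is not None]

seeds = load_fps('nci250.smi')                 # compounds we already possess
candidates = load_fps('benzodiazepines.smi')   # the pool to pick from
fps = seeds + candidates                       # seeds occupy indexes 0..len(seeds)-1

mmp = MaxMinPicker()
how_many_to_pick = 100
picks = mmp.LazyBitVectorPick(fps, len(fps), len(seeds) + how_many_to_pick,
                              list(range(len(seeds))))
new_picks = [i - len(seeds) for i in picks if i >= len(seeds)]  # candidate indexes
```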
The Jupyter notebook shows how this can be done. Timings are as follows:
| Picks | Time (s), 3 replicates |
| --- | --- |
| 1 | 8.4, 8.5, 8.7 |
| 2 | 137, 133, 128 |
| 3 | 222, 218, 223 |
| 10 | 1029, 1006, 1030 |
| 100 | 1201, 1181, 1199 |
| 1000 | 1256, 1187, 1245 |
Impressive. Large scale compound selection done on a modest laptop in minutes.

Are there no limits? Well, for very large sets holding all the fingerprints and picking data in memory will become limiting, and generating the descriptors in the first place will take time. Roger informs me of a study he has done selecting 2000 diverse compounds from the ~90M in PubChem Compound, which took 12.6 hours to read the molecules and generate fingerprints, and then 2.5 hours for the diversity selection. Memory used was ~25 gigabytes. So here we are a little beyond a trivial application on a laptop, but not massively so. Impressive!
Finally, when running some of these longer jobs it was noticeable that only one core of the machine was being used. Clearly the MaxMinPicker is not written to be multi-threaded. In principle it seems it could be sped up considerably by utilising all the cores, though this would not be trivial. So there is still scope for further improvement. Maybe another challenge for Roger!
1 comment:
Hi,
I actually wrote a MinMax picker when I was at Schrödinger, for use in an internal project – though actually, I don't think it was ever used. I was sure someone must have done it before, so I asked around, and someone pointed me to the RDKit implementation. I suspected that there must be a more efficient computation, and thanks to Roger for creating one.
I just want to comment on what I think is a difference between the above and what RDKit implements. IIRC, RDKit uses the compound closest to the centroid of the distribution as the initial selected compound. Please correct me if I am wrong.
In contrast, I selected as the initial compound the one whose nearest neighbor is the most distant. That is, selection was initiated starting with the globally most extreme outlier. From there on, both algorithms select, iteratively, the remotest outlier from the most recent point. Though we think of the process as diversity selection, it's really outlier selection. Yet other reasonable choices are possible; for instance, one might use some known ligand as the initial selection, in order to find candidates as different from it as possible.
(As a side comment, the centroid of a distribution isn't always well defined; for example, you could carry out the outlier selection on the distance matrix of RMSDs among 3D conformations, but you can't find a centroid, except possibly in a higher-dimensional conformation space.)
-Peter Shenkin.