Thursday, August 4, 2016

A question: RDKit performance on Windows

Updated 6 August 2016 to fix an incomplete sentence.

This one is a request for advice/expertise on performance tuning/compiler flag tweaking on Windows. The short story is that when the RDKit is built using Visual Studio on Windows it ends up being substantially slower than when it's built with g++ and run using the Windows Subsystem for Linux. This doesn't seem like it should be true, but I'm not an expert with either Visual Studio or Windows, so I'm asking for help.

Some more details:

When I've used the RDKit on Windows machines it has always seemed slower than it should be. I've never really quantified that, so I've always just kind of shrugged and moved on. Now I've measured a real difference and I'd like to try to do something about it.

Some experiments that I did with Docker on Windows convinced me that the effect was real, and with the advent of Bash on Windows 10 (https://msdn.microsoft.com/en-us/commandline/wsl/install_guide) - an awesome thing, by the way - I now have some real numbers.

The RDKit includes some code that I've used over the years to track the performance of some basic tasks. This script - https://github.com/rdkit/rdkit/blob/master/Regress/Scripts/timings.py - looks at a broad subset of RDKit functionality; a rough sketch of the kind of timing loop it uses follows the list of tests below.

The tests are:
  1. construct 1000 molecules from sdf
  2. construct 1000 molecules from smiles
  3. construct 823 fragment molecules from SMARTS (smiles really)
  4. 1000 x 100 HasSubstructMatch (100 from t3)
  5. 1000 x 100 GetSubstructMatches (100 from t3)
  6. construct 428 queries from RLewis_smarts.txt
  7. 1000 x 428 HasSubstructMatch
  8. 1000 x 428 GetSubstructMatches
  9. Generate canonical SMILES for 1000 molecules
  10. Generate mol blocks for 1000 molecules
  11. RECAP decomposition of the 1000 molecules
  12. Generate 2D coordinates for the 1000 molecules
  13. Generate 3D coordinates for the 1000 molecules
  14. Optimize those conformations using UFF
  15. Generate unique subgraphs of length 6 for the 1000 molecules
  16. Generate RDK fingerprints for the 1000 molecules
  17. Optimize the conformations above (test 13) using MMFF
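
To give a flavor of what the script does, here's a minimal sketch of the first couple of tests (timing molecule construction and canonical SMILES generation). This is not the actual benchmark code, and the filename mols.smi is just a placeholder for the test data the real script reads:

    # minimal sketch of a timings.py-style timing loop (not the real script)
    import time
    from rdkit import Chem

    # "mols.smi" is a placeholder; the real script reads its own test data
    smiles = [line.split()[0] for line in open('mols.smi') if line.strip()]

    t1 = time.time()
    mols = [Chem.MolFromSmiles(smi) for smi in smiles]
    print('construct molecules from SMILES: %.1fs' % (time.time() - t1))

    t1 = time.time()
    canonical = [Chem.MolToSmiles(m) for m in mols if m is not None]
    print('generate canonical SMILES: %.1fs' % (time.time() - t1))
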
Here are the results using the 2016.03 release conda builds (available from the rdkit channel in conda). The tests were run one right after the other on the same laptop (a Dell XPS13 running Win10 Anniversary Edition); the columns correspond to tests 1-17 above and the times are in seconds:
Windows: 0.8 || 0.4 || 0.1 || 1.2 || 1.3 || 0.0 || 4.2 || 4.2 || 0.2 || 0.3 || 7.4 || 0.3 || 7.5 || 18.5 || 2.2 || 1.4 || 41.7
Linux:   0.6 || 0.3 || 0.1 || 0.9 || 1.0 || 0.0 || 3.1 || 3.2 || 0.1 || 0.2 || 6.3 || 0.3 || 6.2 || 15.2 || 2.1 || 1.0 || 29.8
That's a real difference.

The Windows build is done using build files generated by cmake. It's a release mode build with Visual Studio using the flags: "/MD /O2 /Ob2 /D NDEBUG" (those are the defaults that cmake creates).

It doesn't seem right to me that the code generated by Visual Studio and running under Windows should be so much slower than the code generated by g++ and running under the Windows Subsystem for Linux. I'm hoping, for the good of all of the users of the RDKit on Windows, to find a tweak for the Visual C++ command-line options that produces faster compiled code.

For what it's worth, here's a different set of benchmarks, run on a larger set of molecules. The script is here (https://github.com/rdkit/rdkit/blob/master/Regress/Scripts/new_timings.py), and a rough sketch of a couple of the heavier steps follows the list:

  1. construct 50K molecules from SMILES
  2. generate canonical SMILES for those
  3. construct 10K molecules from SDF
  4. construct 823 fragment molecules from SMARTS (smiles really)
  5. 60K x 100 HasSubstructMatch
  6. 60K x 100 GetSubstructMatches
  7. construct 428 queries from RLewis_smarts.txt
  8. 60K x 428 HasSubstructMatch
  9. 60K x 428 GetSubstructMatches
  10. Generate 60K mol blocks
  11. BRICS decomposition of the 60K molecules
  12. Generate 2D coords for the 60K molecules
  13. Generate RDKit fingerprints for the 60K molecules
  14. Generate Morgan (radius=2) fingerprints for the 60K molecules.
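
The substructure-search and fingerprint steps are where the differences show up most clearly, so here's a rough sketch of what tests 8 and 14 look like. Again, this is not the real script: mols.smi is a placeholder for the 60K-molecule input, and the way RLewis_smarts.txt is parsed here is an assumption:

    # rough sketch of the HasSubstructMatch and Morgan fingerprint timings
    import time
    from rdkit import Chem
    from rdkit.Chem import AllChem

    # "mols.smi" is a placeholder for the 60K-molecule input file
    mols = [m for m in (Chem.MolFromSmiles(line.split()[0])
                        for line in open('mols.smi') if line.strip())
            if m is not None]
    # assumes one SMARTS pattern per line in the query file
    queries = [q for q in (Chem.MolFromSmarts(line.split()[0])
                           for line in open('RLewis_smarts.txt') if line.strip())
               if q is not None]

    t1 = time.time()
    for mol in mols:
        for q in queries:
            mol.HasSubstructMatch(q)
    print('HasSubstructMatch: %.1fs' % (time.time() - t1))

    t1 = time.time()
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
    print('Morgan fingerprints: %.1fs' % (time.time() - t1))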

The timings show the same, at times dramatic, performance differences (again in seconds, with the Windows build on the first line and the Linux build on the second):
Windows: 18.8 || 8.5 || 6.8 || 0.1 || 85.8 || 106.2 || 0.0 || 264.2 || 268.6 || 14.0 || 77.2 || 20.9 || 104.7 || 13.0
Linux:   17.5 || 9.8 || 6.7 || 0.1 || 68.0 ||  74.2 || 0.0 || 204.6 || 208.2 ||  9.6 || 56.5 || 20.6 ||  89.0 ||  6.6

If you have thoughts about what's going on here, please comment here, reach out on Twitter, Google+, or LinkedIn, or post to the mailing list.
Thanks!

4 comments:

Brian Cole said...

A question to make sure we're not comparing apples to oranges: are they both 64-bit (x64) builds? I think MSVC still uses x87 floating point when compiling in 32-bit mode, while it will use SSE vector instructions for x64 builds.

Along those lines, what are the GCC flags being compared to? -ffast-math? Then /fp:fast should be used with MSVC.

What versions of Visual Studio and GCC are being compared? That will help to know what kind of features the compilers have available to them; for example, given the heavy use of the STL in RDKit, C++11 move constructors that elide copy construction could make a big difference.

greg landrum said...

The Windows builds are done with MS Visual Studio 2015 Community Edition and it is a 64-bit build.

The Linux build is using gcc 4.x (I believe 4.4 for the conda builds) and the standard cmake release-mode flags: "-O3 -DNDEBUG".

Brian Cole said...

If the conda builds work on Red Hat 5, it's likely GCC 4.1. In that comparison Visual Studio should actually be compiling with C++11, while gcc would still be doing more explicit copies.

So my next stab-in-the-dark guess at why GCC is faster is that GCC 3.4 through 5.1 uses a copy-on-write std::string: http://info.prelert.com/blog/cpp-stdstring-implementations That implementation can pseudo-emulate std::string move semantics well before C++11.

Though honestly it may be time to choose one of the most egregious cases and throw it in a profiler. Oftentimes optimizing for one platform yields performance improvements everywhere (though probably to a lesser degree). Happy to stare at Visual Studio profiling output if you post it.

samadamsthedog said...

You mentioned bash in your comment. Are you using bash or other *X-originated procedures and practices in the run-time stack? Process creation is *extremely* slow on Windows (or at least it used to be not so long ago), and you'd be amazed how many processes deep some of these trees are when compiling from Makefiles or from bash scripts, where each (parenthesized command) and `backtick-quoted command` launches a new shell. Calls to system() and the like also do it, as does use of Windows ports of *X fork()/exec(). Even if you can't think of any of the above that you used, straightforward ports of a working *X stack to Windows can have a lot of these that, as I said, you'd be amazed to discover.

There's a utility that can show the tree on Windows; I forget the name, though. Microsoft bought it from a third party who had developed it and freely distributed it. (I think it remains free at MS.) We saw impressive improvements in speed and robustness at Schrödinger when we eliminated the above practices on Windows. Big job though. We had a dedicated (in both senses of the word) and super-competent buildmeister who rebuilt our build system to avoid this. Our infrastructure code might have been fixed, too, but some of the runtime issues may remain.

-P.