Wednesday, October 30, 2019

Sharing conda environments

This isn't directly an RDKit topic, but it came up in a couple of recent conversations and certainly is relevant to RDKit users so I figured I'd do a short blog post.
I always encourage people to use the Anaconda Python distribution since conda does such a great job of managing binary dependencies and handling separate environments (well, and because then they won't have to build the RDKit themselves... that makes a huge difference for most folks). By this point I think/hope most people know this. What's perhaps less well known is how straightforward it is to share a conda environment with someone else. That's what this post is about.
Suppose you've created a script or Jupyter notebook that you would like to share with others. Perhaps you've written a paper and want to allow people to reproduce what you did or you've uploaded a package to GitHub and you want to make it straightforward for others to work with it. As you probably know already, manually capturing and communicating the dependencies for your code can be "a bit" of a pain. If you work inside of a conda environment it's easy.
Here's an example. I start by creating an environment named rdkit_demo that has a set of packages that I know I'm going to be using for my project:
(base) glandrum@otter:~/RDKit_blog$ conda create -n rdkit_demo python=3.7 jupyter psycopg2 matplotlib pandas scikit-learn rdkit::rdkit
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

   ...snip...

#
# To activate this environment, use
#
#     $ conda activate rdkit_demo
#
# To deactivate an active environment, use
#
#     $ conda deactivate
A quick note on that command: I specified that the rdkit package should be installed from the rdkit channel (that's the rdkit::rdkit syntax) so that I can be sure I'm getting the rdkit build I did instead of the conda-forge build. (There's nothing wrong with the conda-forge build! I'm just showing how to control what gets installed.)
I work with this environment for a while and then realize that I also need dask, so I add that to my environment (notice from the prompt that I'm working in the rdkit_demo environment):
(rdkit_demo) glandrum@otter:~/RDKit_blog$ conda install dask
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

   ...snip...

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(rdkit_demo) glandrum@otter:~/RDKit_blog$
After some more work I am ready to share the notebook I was working on with others. I need to capture what I have installed so that they can also set up an environment that works.
The easiest and least error-prone way for me to do this is to export the contents of the environment:
(rdkit_demo) glandrum@otter:~/RDKit_blog$ conda env export --from-history > rdkit_demo_env.yml
(rdkit_demo) glandrum@otter:~/RDKit_blog$ cat rdkit_demo_env.yml 
name: rdkit_demo
channels:
  - defaults
  - https://conda.anaconda.org/rdkit
  - conda-forge
dependencies:
  - scikit-learn
  - jupyter
  - rdkit::rdkit
  - pandas
  - python=3.7
  - psycopg2
  - matplotlib
  - dask
prefix: /home/glandrum/anaconda3/envs/rdkit_demo

(rdkit_demo) glandrum@otter:~/RDKit_blog$ 
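One detail in that file is specific to my machine: the prefix line records the path where the environment lives on my disk, which is of no use to whoever receives the file. A minimal sketch of stripping it before sharing is below. The file names example_env.yml and example_env_shared.yml are hypothetical, and the heredoc only stands in for a real export so that the snippet runs without conda installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for the output of `conda env export --from-history`
# (hypothetical file name, contents copied from the listing above).
cat > example_env.yml <<'EOF'
name: rdkit_demo
channels:
  - defaults
  - https://conda.anaconda.org/rdkit
  - conda-forge
dependencies:
  - scikit-learn
  - jupyter
  - rdkit::rdkit
  - pandas
  - python=3.7
  - psycopg2
  - matplotlib
  - dask
prefix: /home/glandrum/anaconda3/envs/rdkit_demo
EOF

# Drop the machine-specific prefix line; every other line is kept as-is.
grep -v '^prefix:' example_env.yml > example_env_shared.yml

# Show what would be shared.
cat example_env_shared.yml
```

The shared file still carries the environment name, the channels, and the dependency list, which is everything the steps in this post rely on.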
I can create a new environment from the rdkit_demo_env.yml file like this:
(base) glandrum@otter:~/RDKit_blog$ conda env create -f rdkit_demo_env.yml 
Collecting package metadata (repodata.json): done
Solving environment: done
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate rdkit_demo
#
# To deactivate an active environment, use
#
#     $ conda deactivate

(base) glandrum@otter:~/RDKit_blog$ conda activate rdkit_demo
(rdkit_demo) glandrum@otter:~/RDKit_blog$ conda list | grep rdkit
# packages in environment at /home/glandrum/anaconda3/envs/rdkit_demo:
rdkit                     2019.09.1.0      py37hc20afe1_1    rdkit
Note that the environment file I created only includes the packages that I explicitly installed and only mentions package versions that I explicitly requested (in this case only the Python version was specified). If I want to include more detail, I have at least two options.
The first option is to edit the yml file and include version information for some/all of the packages I am installing. This is easy to do following the pattern seen above for the Python version. For example, if I want to specify that the most recent patch release from the previous RDKit release cycle should be used I would include this in the yml file:
  - rdkit::rdkit=2019.03.*
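Editing the file by hand is the obvious way to do this, but if the change needs to be scripted, something like the following sketch should work. It assumes the dependency line reads exactly "  - rdkit::rdkit", as in the export above. The file names example_env.yml and example_env_pinned.yml are hypothetical, and the heredoc is a shortened stand-in for the real export.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Shortened stand-in for the exported environment file (hypothetical name).
cat > example_env.yml <<'EOF'
name: rdkit_demo
channels:
  - defaults
  - https://conda.anaconda.org/rdkit
  - conda-forge
dependencies:
  - rdkit::rdkit
  - pandas
  - python=3.7
EOF

# Rewrite only the unpinned rdkit line; other dependencies are untouched.
# The '.' and '*' in the replacement text are literal characters.
sed 's|^  - rdkit::rdkit$|  - rdkit::rdkit=2019.03.*|' example_env.yml > example_env_pinned.yml

# Show the pinned line.
grep 'rdkit::rdkit' example_env_pinned.yml
```

Writing to a new file instead of editing in place keeps the original export around for comparison.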
The second option is to have conda list full version information about all installed packages, including the dependencies of the manually installed packages:
(rdkit_demo) glandrum@otter:~/RDKit_blog$ conda env export
name: rdkit_demo
channels:
  - rdkit
  - defaults
  - https://conda.anaconda.org/rdkit
  - conda-forge
dependencies:
  - _libgcc_mutex=0.1=main
  - attrs=19.3.0=py_0
  - backcall=0.1.0=py37_0
  - blas=1.0=mkl
  - bleach=3.1.0=py37_0
  - bokeh=1.3.4=py37_0
  - bzip2=1.0.8=h7b6447c_0
  - ca-certificates=2019.10.16=0
  - cairo=1.14.12=h8948797_3
  - certifi=2019.9.11=py37_0

   ...snip...

  - rdkit=2019.09.1.0=py37hc20afe1_1
  - readline=7.0=h7b6447c_5
  - scikit-learn=0.21.3=py37hd81dba3_0

   ...snip...

  - zlib=1.2.11=h7b6447c_3
  - zstd=1.3.7=h0b5b093_0
prefix: /home/glandrum/anaconda3/envs/rdkit_demo

(rdkit_demo) glandrum@otter:~/RDKit_blog$
This includes full version and build information, which is helpful if you need to reproduce exactly the same environment. One significant limitation of this approach is that the environment is now almost certainly tied to the operating system on which it was created.
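A middle ground is to keep the version numbers but drop the build strings, which are the most platform-specific part of each entry. conda can do this directly with conda env export --no-builds. The sketch below does the same thing after the fact to a full export that already exists. The file names are hypothetical and the heredoc is a shortened stand-in for the listing above. Note that this only loosens the tie to the operating system: the dependency list itself can still contain packages that exist on one platform only.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Shortened stand-in for the output of `conda env export`
# (hypothetical file name, entries copied from the listing above).
cat > example_env_full.yml <<'EOF'
name: rdkit_demo
channels:
  - rdkit
  - defaults
dependencies:
  - attrs=19.3.0=py_0
  - rdkit=2019.09.1.0=py37hc20afe1_1
  - zlib=1.2.11=h7b6447c_3
EOF

# Turn "  - name=version=build" into "  - name=version".
# Lines without two '=' separators (name, channels) pass through unchanged.
sed -E 's/^(  - [^=]+=[^=]+)=[^=]+$/\1/' example_env_full.yml > example_env_nobuilds.yml

cat example_env_nobuilds.yml
```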
You can find more information on managing conda environments in the conda documentation: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html
As an aside: if I'm updating or installing on a new machine, I almost always install Miniconda instead of the full Anaconda distribution. Anaconda is useful if you want to have almost everything installed at once, but you pay the price of a huge install. Miniconda is, as the name indicates, minimal: you get just the stuff you need to set up an environment and can then manually install the packages you are interested in using. There's more on how to decide between the two here.

1 comment:

Plopman said...

Cool to read a summed up post on this. Just a word to confirm that indeed if you export every detail to the yml you end up with cross-OS unusable environments....but at the same time it's still the only way to reproduce EXACTLY the same environment.

If the purpose is not development, but a production deployment of rdkit then I definitely would recommend setting up a docker image reproducing your exact environments.