Docker is a great tool for virtualizing environments: it makes deploying code to almost any machine straightforward. In research, it also helps make projects repeatable and shareable through pre-defined environments.

In research, it's important to keep the following parts of your project separate:

  • Environment - libraries, packages and datasets that your research depends on
  • Code - controls what your research project is going to do
  • Configuration - parameters and design choices you control for experiments
  • Results - the logs, outcomes, pre-trained models and performance metrics of your experiments
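
As a purely illustrative example, a project organised along these lines might be laid out like this, with the environment captured by the Dockerfile and requirements.txt, code under src and scripts, configuration under configs, and results and cached artefacts kept outside the image:

project/
├── Dockerfile          # environment: base image, system packages
├── requirements.txt    # environment: Python dependencies
├── src/                # code
├── scripts/            # code: entry points for experiments
├── configs/            # configuration: experiment parameters
├── data/               # datasets (mounted, never baked into the image)
├── cache/              # pre-trained models and partial results (mounted)
└── work/               # results: logs, outputs, trained models (mounted)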

This tutorial will focus on preparation of an environment that easily enables development, debugging and large scale experimentation.

Docker behaves like a lightweight virtual machine that contains a single application. Unlike tools like VirtualBox and VMware, these machines (containers) are defined through simple configuration scripts called Dockerfiles that describe which files, libraries and software are present. At runtime, a container can bind files and ports from your local machine or network storage so that it can read and write data outside the container.

Installing Docker is easy and can be done by following the official installation instructions. NVIDIA also provides extensions (the NVIDIA Container Toolkit, formerly nvidia-docker) that enable GPU passthrough.
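
Once everything is installed, a quick smoke test is worthwhile. The first command below checks the Docker daemon itself; the second (which assumes the NVIDIA Container Toolkit is set up and uses a CUDA base image from Docker Hub, so pick a tag that matches your driver) checks GPU passthrough:

docker run --rm hello-world
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi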

On HPC clusters, Singularity may be installed instead of Docker. It can run pre-built Docker images (almost) seamlessly, so your code can be scaled up and deployed without much hassle.

This tutorial will focus on getting your project running in a Docker environment, focusing on 5 key tips:

  1. Build a common base image if you have lots of projects
  2. Keep data separate
  3. Don't forget to mount paths for your cache directories
  4. Keep your images debuggable
  5. Keep your images archived and repeatable

Base environment

Depending on the project, it's useful to build your code on top of a predefined image. This could be the AllenNLP Docker image, the TensorFlow image, or a Miniconda image.

In my research, I use the Miniconda image as a base; it's a bare-bones Python 3 environment that I install all my packages and resources on top of.

Create a new directory and add something similar to this in a file named Dockerfile:

FROM continuumio/miniconda3

ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility

ENV TORCH_HOME=/local/cache
ENV ALLENNLP_CACHE_ROOT=/local/cache

RUN apt-get update
RUN apt-get install -y --no-install-recommends \
    zip \
    gzip \
    make \
    automake \
    gcc \
    build-essential \
    g++ \
    cpp \
    libc6-dev \
    unzip \
    nano \
    rsync

RUN conda update -q conda

ADD requirements.txt /tmp
RUN pip install -r /tmp/requirements.txt

My base image contains some common programs and Python packages that I use in every project. The top line, beginning with FROM, imports the Miniconda image published on Docker Hub.

Before installing my Python packages, I also use apt to install some utilities for zipping files, managing files and building software. These are useful for debugging and exporting results from your projects, and they save some headaches down the line.

My requirements.txt contains packages that I use regularly and that take a long time to install, such as PyTorch and AllenNLP. You could adapt this for your own projects. The key thing is that these packages are common to pretty much all my experiments, so I set this up once. Every project will still have its own requirements file and Dockerfile too!

tqdm
torch
torchvision
allennlp>=0.9

The next step is to build your base image. This is a one-liner that you can run from your project folder; it builds the Docker image with the name my_base_image, ready for use:

docker build -t my_base_image .

For versioning, it's recommended to use some kind of version control such as Git or SVN. GitHub offers a number of free private repositories to help get you started.

Rather than building the Docker image manually every time, you can auto-build it on every push using a continuous integration service like Travis or CircleCI. You can also push the newly built image to Docker Hub so that you can pull it on your HPC cluster, cloud server or local machine.

After setting up an account on Travis, add the following file, called .travis.yml, to your project to enable builds. This will push your base image to Docker Hub whenever a commit is made to the master branch on GitHub. Docker Hub provides hosting for one free private image; I'd recommend spending it on your project image, not the base image (which is basically just a blank template).

language: python
services:
  - docker
python:
  - "3.6"
stages:
    - before_install
    - install
    - name: after_success
      if: branch = master
before_install:
  - docker login -u $DOCKER_USER -p $DOCKER_PASS
script:
  - echo "No script"
after_success:
  - docker build -t $DOCKER_ACCT/docker-base-image .
  - docker tag $DOCKER_ACCT/docker-base-image $DOCKER_ACCT/docker-base-image:build-$TRAVIS_BUILD_NUMBER
  - docker push $DOCKER_ACCT/docker-base-image:build-$TRAVIS_BUILD_NUMBER
  - docker tag $DOCKER_ACCT/docker-base-image $DOCKER_ACCT/docker-base-image:latest
  - docker push $DOCKER_ACCT/docker-base-image:latest
  - echo "Done"

Set the following secret environment variables in the project settings and you're ready to build!

DOCKER_USER=<name of docker user for login>
DOCKER_PASS=<password for login on docker>
DOCKER_ACCT=<name of docker account>

DOCKER_ACCT should be the same as your username unless you are using an organization or team on Docker Hub.

Project Image

Your project image can now extend the base image you've just published to Docker Hub using the FROM directive in its Dockerfile.

The Dockerfile does four things: first we create directories for the source, scripts and configs to live in; we then create volumes for work (where output from models will go) and cache (where pre-trained models and partially computed results will go); then we install our requirements; and finally we copy the files from our project into these directories. Some of these steps are specific to my projects, such as the spaCy download.

FROM mydockerhubaccount/docker-base-image

RUN mkdir -pv /local/src
RUN mkdir -pv /local/configs
RUN mkdir -pv /local/scripts

VOLUME /local/work
VOLUME /local/cache

WORKDIR /local/
ADD requirements.txt /local/
RUN pip install -r requirements.txt
RUN python -m spacy download en

ADD src /local/src
ADD configs /local/configs
ADD scripts /local/scripts

ENV PYTHONPATH=/local/src

It's important to order the steps in the Dockerfile correctly. Docker caches each build step, and changing an earlier step invalidates the cache for everything after it; if you're not careful, updating your scripts will re-install the dependencies and download spaCy all over again. Setting the PYTHONPATH is an easy way to add the directory containing the project source to the list of directories Python searches for code. It's a lot simpler than a local pip install, as it doesn't require you to maintain a setup.py script.
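
As a tiny illustration (the module name is hypothetical): if your repository contains src/data_utils.py, then inside the container it is importable directly, with no setup.py or pip install -e step:

python -c "import data_utils"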

We add the src directory to the Docker image so that everything necessary to run your experiments is archived in the image and experiments can be repeated at a later date. For debugging, though, this can become a hassle, so in practice we also mount this directory locally through the Docker command line, which lets us make small changes without rebuilding the entire image.

The same .travis.yml from the previous step can be reused here too; just change the image name to the one you're using for this project.

Running your project locally

The project can be run locally using the docker command; to get started, you can launch an interactive bash shell like this. The -it flag gives you an interactive terminal and the --rm flag removes the container when you exit. The --rm flag is important for experimental reproducibility: we always start from the same template state, and we don't clog up our filesystem with containers that hold no useful information. GPUs can be made available by following the instructions for NVIDIA Docker.

docker run --rm -it <myimagename> bash

While this allows us to poke around and check whether your scripts work, we also need to mount the cache and work directories. This lets us share the files and logs generated by your script once it has finished running. These directories can be mounted by adding flags like -v $(pwd)/cache:/local/cache, which map local directories containing cache files, datasets and outputs into the container.

In the following example we set up aliases that map the local data, work and cache directories to those in the Docker image. It is important not to include these in the image itself, as large files quickly eat space and make docker push/pull unacceptably slow.
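
One way to guarantee these directories are never baked into the image (or even sent to the Docker daemon as build context) is a .dockerignore file next to your Dockerfile. A minimal sketch, assuming the directory names used in this tutorial:

data/
work/
cache/
.git/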

To speed up development, we create an additional alias that also mounts the src, configs and scripts directories, so live changes to your code, configs and scripts are picked up without rebuilding the image.

Typing out these long commands can be tedious: you can set up bash aliases to simplify things. The ones I use are (note the --gpus flag which will have to be removed if you don't have GPUs):

alias drun="docker run --rm -it -v $(pwd)/data:/local/data -v $(pwd)/cache:/local/cache -v $(pwd)/work:/local/work --gpus all"
alias dtest="docker run --rm -it -v $(pwd)/data:/local/data -v $(pwd)/cache:/local/cache -v $(pwd)/work:/local/work -v $(pwd)/src:/local/src -v $(pwd)/scripts:/local/scripts -v $(pwd)/configs:/local/configs --gpus all"

The drun and dtest commands differ only in whether the src, configs and scripts directories are mounted from the local filesystem. I typically use dtest when developing and drun when I'm using a frozen version of my Docker image: for papers I will tag a build with a specific label so I can replicate results.
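
Freezing a build for a paper is just a matter of re-tagging the image and pushing it to Docker Hub; the account, image and tag names here are placeholders:

docker tag mydockerimagename mydockerhubaccount/mydockerimagename:paper-v1
docker push mydockerhubaccount/mydockerimagename:paper-v1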

Now we can just run the following command:

dtest mydockerimagename command

You can set up bash scripts for experiments or invoke python directly here.
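
For example, a training run might look something like this; the script, entry point and config names are placeholders for whatever your project uses:

dtest mydockerimagename bash scripts/train.sh configs/experiment1.json
dtest mydockerimagename python src/train.py --config configs/experiment1.json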

Running your project on an HPC cluster

Some HPC clusters have Singularity installed, which can run pre-built Docker images, allowing you to use the same environment on both local development machines and large-scale compute clusters. The command takes slightly different parameters to the docker command-line interface, and it must use a Docker image that has been converted into a Singularity image.

For the run command, the -v flags are replaced with -B, --gpus is replaced with --nv, and we must also manually specify the working directory. Furthermore, Singularity will mount parts of the local filesystem, so you may have to check that your choice of working directory in the Docker image doesn't clash (e.g. if the cluster has scratch storage mounted at /local).

singularity run --nv --pwd /work -B $(pwd)/data:/local/data -B $(pwd)/cache:/local/cache -B $(pwd)/work:/local/work -B $(pwd)/src:/local/src -B $(pwd)/scripts:/local/scripts -B $(pwd)/configs:/local/configs myimage.simg command

The command is the same as above. The Docker image must first be converted into a Singularity image with the following command, replacing the Docker Hub username and image name with your own:

singularity pull myimage.simg docker://dockerhubuser/imagename

When running large-scale jobs, it is advisable to have a small configuration that runs on the head node and downloads any pre-trained models and embeddings before the job is submitted to the queue. This helps prevent files from clobbering each other when your scripts all start at once and try to download models to the same place without proper file locking.
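
A sketch of what this warm-up step could look like, assuming a hypothetical scripts/download_models.py in your project that triggers all of the pre-trained model and embedding downloads into the shared cache:

# run once on the head node before submitting jobs to the queue
singularity run -B $(pwd)/cache:/local/cache myimage.simg python /local/scripts/download_models.py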

Debugging your project

The first step to debugging your project is to use the bash -x flag when running scripts, which echoes each command back as it runs. This is especially useful when you have lots of environment variables and script parameters generated by other scripts.
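
For example, running an experiment script with tracing enabled (using the placeholder names from earlier):

dtest mydockerimagename bash -x scripts/train.sh configs/experiment1.json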

For debugging Python, the Professional versions of modern IDEs like PyCharm support using a Docker image as a remote interpreter, so you can debug inside the container environment: see the PyCharm documentation for details.

If you use pdb, you can simply invoke it from the terminal as the entrypoint to your application inside the container, rather than running the application locally.
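
For example, using the hypothetical training script from earlier, you can drop into the debugger inside the container like this:

dtest mydockerimagename python -m pdb src/train.py --config configs/experiment1.json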