Use Devcontainers for safe and replicable research

Visual Studio Code
Reproducibility
Tutorial
Author

Vincent Grégoire, PhD, CFA

Published

June 14, 2024

Introduction

In empirical finance research, the tools and methodologies we employ play a crucial role in ensuring the integrity and reproducibility of our findings. One such powerful tool is containerization, which allows you to encapsulate your code and its dependencies into a standardized unit. Development containers, or devcontainers, provide a convenient way to create isolated environments for your data analysis tasks, ensuring that your code runs consistently across different systems.

Containers are essentially lightweight, portable environments that package up code and all its dependencies, ensuring that the software runs consistently across different computing environments. This consistency is particularly valuable in research settings where the reproducibility of results is paramount. By containerizing their code, researchers can avoid the “it works on my machine” problem, ensuring that their analyses can be replicated by others, regardless of the underlying system configurations.

You might be familiar with the use of tools like poetry or pyenv for managing Python environments to separate dependencies for different projects. Containers take this concept a step further by encapsulating the entire environment, including the operating system, runtime, libraries, and configurations. This ensures that the code runs identically on any machine, making it easier for researchers to share their work and for others to verify their findings.

This encapsulation also provides an added level of safety and security that is essential in research settings. By isolating the code and its dependencies from the host system, containers protect the environment from potential security vulnerabilities or malware disguised as Python libraries. This isolation ensures that the code runs in a controlled environment, reducing the risk of unintended interactions with the host system.

Finally, containers facilitate collaboration among researchers. When a project is containerized, collaborators can easily set up their environment by simply running the container, eliminating the often cumbersome process of manually installing and configuring dependencies. This ease of setup promotes a more efficient workflow and reduces the likelihood of errors, making collaborative research more streamlined and productive. Even if you work solo, containers improve the resiliency of your research workflows, making it easy to recover from a broken or stolen computer by reinstalling your project on a new computer without worrying about compatibility issues. You can even run your code in the cloud with services like GitHub Codespaces, ensuring that your analyses are not tied to a specific machine or operating system.

In this post, I provide a step-by-step guide on setting up devcontainers in VS Code, with a focus on supporting Python Poetry and mounting local directories for file storage.

All the code and configurations used in this tutorial are available in the GitHub repository.

Video Tutorial

This post is also available as a video tutorial on YouTube.

Setting Up Devcontainers in VS Code

Development containers are a feature in VS Code that leverages Docker[1] to create and manage containers specifically for development purposes. This allows developers and researchers to work within a consistent environment, which is crucial for complex data analysis tasks. Setting up devcontainers in VS Code is straightforward, but there are a few prerequisites you need to have in place: Docker, Visual Studio Code, and the Dev Containers extension. If you’re on macOS, you can install Docker using Homebrew:

brew install --cask docker

To begin, you’ll need to create a .devcontainer folder in your project directory. Inside this folder, you should create a devcontainer.json file, which will define the configuration for your development container. This file specifies the base image for the container, any additional tools or libraries that need to be installed, and other settings related to the development environment. By configuring this file, you can tailor the container to meet the specific needs of your data analysis tasks. This can be done using the Dev Containers extension in VS Code, which provides a user-friendly interface for creating and managing devcontainers. To get started, simply invoke Dev Containers: Add Development Container Configuration Files... from the command palette in VS Code and follow the prompts to create your devcontainer.json file.

This will prompt you to select a base image for your container, configure any additional tools or libraries, and set up other environment settings. As default options, I like to use the following:

  • Setup location: in workspace
  • Base image: Latest Python 3.x image from Microsoft
  • Features: Poetry with pipx (more on this later)

Once you have configured your devcontainer.json file, you should get a prompt to reopen the project in the container. This will build the container based on the specified configuration and open your project within the containerized environment. You can verify that the container is running by checking the status bar in VS Code, which should indicate that you are working in a containerized environment.
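Beyond the status bar, you can sanity-check from the integrated terminal that you are inside the container rather than on the host. The `REMOTE_CONTAINERS` environment variable check below is an assumption based on how VS Code typically marks its devcontainer terminals; if it prints empty, rely on `uname` and the status bar instead:

```shell
# Run these in the VS Code integrated terminal after reopening in the container.
uname -a                         # should report a Linux kernel, not your host OS
echo "${REMOTE_CONTAINERS:-}"    # VS Code usually sets this to "true" in a devcontainer
```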

You can always rebuild the container by invoking the Dev Containers: Rebuild and Reopen in Container command from the command palette. This will recreate the container based on the latest configuration settings, ensuring that your development environment is up-to-date and consistent with your project requirements.

Here is what a simple devcontainer.json file (minus the comments) looks like:

{
  "name": "Python 3",
  "image": "mcr.microsoft.com/devcontainers/python:1-3.12-bullseye",
  "features": {
    "ghcr.io/devcontainers-contrib/features/poetry:2": {}
  }
}

Images

Images are a crucial aspect of containerization, as they define the base environment for your development container. When setting up a devcontainer in VS Code, you can choose from a variety of pre-built images that provide different programming languages, tools, and libraries. These images serve as the foundation for your development environment, ensuring that the necessary dependencies are available for your data analysis tasks. By tying your devcontainer to a specific image, you can guarantee that your code runs consistently across different systems, making it easier to share and replicate your work.

Note: Most images are based on Linux distributions, so you may need to adjust your code or configurations if you are used to working on Windows or macOS. However, the differences are usually minimal and can be easily managed within the container.

Note 2: Most images specify the environment (i.e. Linux version, Python version, etc.), but not the architecture, which makes them compatible with most systems. For example, most PCs and older Macs use Intel or AMD CPUs with the x86_64 architecture, while newer Macs with Apple Silicon have the arm64 architecture. The images are usually compatible with both architectures, but that means that the environment will not be 100% identical if you run the container on different architectures. This is usually not a problem for data analysis tasks, but it’s something to keep in mind if you are having issues with your code running differently on different systems.

Features

Features are additional tools or libraries that can be installed in your development container to enhance its functionality. These features can include language-specific tools, package managers, or development environments that are tailored to your project requirements. By specifying features in your devcontainer.json file, you can extend the capabilities of your container and ensure that it is well-suited for your data analysis tasks. You can find a list of available features at containers.dev/features. In the example above, we are using the poetry feature to install the Python dependency manager Poetry in our development container.

Poetry

Poetry is a powerful tool for managing Python dependencies and project configurations. It simplifies the process of creating, managing, and sharing Python projects by providing a unified interface for dependency management, packaging, and publishing. Poetry uses a pyproject.toml file to define project dependencies, scripts, and configurations, making it easy to manage project settings and requirements. By integrating Poetry into your devcontainer setup, you can streamline your development workflow and ensure that your Python projects are well-organized and reproducible.

For example here is a pyproject.toml file that defines the dependencies for a Python project:

[tool.poetry]
name = "vcf-sample"
version = "0.1.0"
description = "Sample code for Vincent Codes Finance"
authors = ["Vincent Codes Finance <vincent@codes.finance>"]
license = "MIT"
readme = "README.md"
package-mode = false

[tool.poetry.dependencies]
python = "^3.12"
jupyter = "^1.0.0"
pandas = "^2.2.2"


[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

The package-mode = false line tells Poetry that the project is not a package itself, so it should only install the dependencies rather than try to build and install the project.

Typically, we would run poetry install to install the dependencies in the pyproject.toml file and then activate that environment with poetry shell (this is what I do in the video tutorial). However, since we only have one project in a devcontainer, we can simplify this by installing the dependencies directly in the base environment by setting the following poetry config:

poetry config virtualenvs.create false

We can automate this process by adding a postCreateCommand to our devcontainer.json file to ensure that all dependencies are installed when the container is created:

  "postCreateCommand": "poetry config virtualenvs.create false; poetry install"
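Putting these pieces together, a minimal devcontainer.json for a Poetry-based project might look like the sketch below. Here the two commands are chained with `&&` rather than `;`, so that `poetry install` only runs if the config step succeeds:

```json
{
  "name": "Python 3",
  "image": "mcr.microsoft.com/devcontainers/python:1-3.12-bullseye",
  "features": {
    "ghcr.io/devcontainers-contrib/features/poetry:2": {}
  },
  "postCreateCommand": "poetry config virtualenvs.create false && poetry install"
}
```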

Mounting Local Directories

By default, the development container is isolated from the host system, except for the workspace directory, which is mounted into the container. This ensures that your project files are accessible within the container, allowing you to work on your code seamlessly. However, there are cases where you may need to access files or directories outside the workspace, such as large data files. To achieve this, you can mount local directories into the development container, making them available within the container environment.

To mount a local directory, you can add the mounts property to your devcontainer.json file, specifying the source path on the host and the target path in the container.

Here is an example configuration for mounting a local directory:

  "mounts": ["source=/path/to/local/directory,target=/workspace/data,type=bind,consistency=cached"]

This setup ensures that the directory /path/to/local/directory on your host machine is accessible within the container at /workspace/data. This approach provides the flexibility to work with local files while benefiting from the isolated environment of the container, except for the mounted directories.

Note: This mounting process limits the portability of the container, as the source directory must exist on the host system. If you are working on a shared project or need to access files from different locations, you may need to consider alternative approaches, such as using a networked file system or cloud storage. One way to avoid hard-coding a personal path is VS Code’s variable substitution in the source, e.g. source=${localEnv:HOME}/data, which resolves each collaborator’s home directory at container creation time; otherwise, each collaborator needs to update the devcontainer.json file with their local path.

VS Code Extensions

You can also use the devcontainer.json file to configure default VS Code settings and install VS Code extensions in your development container. When working with devcontainers, you will notice that not all the extensions installed on your host system are available in the container environment. Extensions that run mostly on the host side (think UI), such as themes or language packs, remain available, but extensions that need access to the container environment, such as language servers or debuggers, must be installed in the container. For example, the python image comes with the Python extension, but you may need additional extensions for specific tasks, such as the Jupyter extension for working with Jupyter notebooks or with the interactive window. It can also be useful to ensure that all collaborators use the same formatting tools, such as Ruff.

To address this, you can specify the extensions and settings you want to install in the devcontainer.json file. For example, this will install Jupyter and Ruff, and configure Ruff as the default formatter:

    "customizations": {
        "vscode": {
            "extensions": [
                "ms-toolsai.jupyter",
                "charliermarsh.ruff"
            ],
            "settings": {
                "[python]": {
                    "editor.defaultFormatter": "charliermarsh.ruff",
                    "editor.codeActionsOnSave": {
                        "source.fixAll": "explicit",
                        "source.organizeImports": "explicit"
                    }
                },
                "python.analysis.fixAll": [
                    "source.unusedImports"
                ],
                "editor.formatOnSave": true
            }
        }
    }

You can find the extension IDs by looking at the extension in VS Code, then clicking on the gear icon and selecting “Copy Extension ID”.

Limitations

Devcontainers are a powerful tool for creating isolated development environments, but they do have some limitations. One of the main drawbacks is the overhead associated with running containers, which can slow down the development process, especially for large projects or resource-intensive tasks. It’s not an issue that I have found to be significant, but it’s something to keep in mind if you are working on a particularly demanding project or have limited system resources.

Additionally, the isolation provided by containers can make it challenging to use some resources from the host system, such as GPUs or hardware peripherals. While it is possible to pass through devices to the container using Docker, this process can be complex and may not be suitable for all use cases. For example, on Apple Silicon Macs, even if the Linux container could access the GPU, there are no Linux drivers available for the GPU, so it would not be able to use it.

Footnotes

  1. Other alternatives such as Podman and Colima can be used, but the official documentation is centered around Docker. See the VS Code documentation for supported alternatives.↩︎
