Git and GitHub are powerful version control tools for managing your research projects and collaborating with others while keeping a detailed record of your work. In this tutorial, we will cover the basics of Git and GitHub, and how to use them to manage your research projects.
Over a decade ago, I started using a version control tool called Subversion with hosting on Bitbucket. I then switched to Git and GitHub, and I have been using them ever since. I use them to manage all my research projects, including my code, data, and even my academic papers. Not only does it help me keep track of my work, but it also makes it easy to collaborate with others.
My main motivation for starting to use a version control system was to make my research more reproducible. I wanted to be able to share my code and data with others so they could replicate my results. It turns out that not only was that desirable, but it is now a requirement for many journals and funding agencies. Git and GitHub make this process much easier.
Video tutorial
Part of this post is also available as a video tutorial on YouTube.
What is version control?
Version control, also known as source control, is a system that records changes to a file or set of files over time so that you can recall specific versions later. It’s one of the most important tools in the toolkit of any developer or data scientist. It’s also very useful for researchers, especially those working with code, but in practice, it is underused in academia. The idea behind version control is quite simple: it allows you to track and manage changes to your projects. Think of it as the “track changes” feature in Microsoft Word, but for all your files and turbocharged with features that make it easy to collaborate with others.
Imagine you’re working on a research paper and decide to delete a section. A few days later, you realize that section was crucial. Without version control, you’d have to rewrite that entire section. With version control, you can simply look at your previous versions, find the one that includes the section you need, and restore it. Most of us have some kind of version control in our lives. For example, when you write a paper, you might save different versions of the document as you work on it. This way, if you make a mistake or delete something important, you can go back to a previous version. However, this approach has limitations, and there are better ways to manage versions of your work.
In the context of coding, version control is even more important. As you add new features to your code or fix bugs, it’s essential to be able to track these changes. If something breaks, you need to know what was changed so you can figure out what went wrong and how to fix it. Additionally, version control systems allow multiple people to work on the same project simultaneously, making collaboration easier and more efficient while keeping a detailed record of who made what changes and when.
What is Git?
Git is the most widely used version control system in the world. It was created in 2005 by Linus Torvalds, the creator of the Linux operating system. Torvalds wanted a version control system that was fast, efficient, and capable of handling small to very large projects with ease. Unlike its predecessors, Git was designed to be decentralized, allowing multiple developers to work on the same project simultaneously without stepping on each other’s toes. Like Linux, Git is free and distributed under an open-source license.
At its core, Git keeps a complete history of a project, recording every change made to every tracked file. This is akin to a detailed logbook that captures the evolution of a project over time. With Git, users can branch off from the main project to experiment or work on new features without disrupting the core project. Later, these branches can be merged back into the main project. This ability to branch and merge is particularly powerful: it lets people work in parallel and limits how often their changes collide, while protecting the integrity of the original project. Git also makes it possible to revert to previous versions, offering a safety net against errors or unintended consequences of new changes.
Git is a great tool for version control of any kind of file, especially text files. It turns out that if you mainly use LaTeX or Markdown for writing and presentations, you can use Git to track changes in your documents and collaborate with others. Gone are the days of sending around files with names like paper_v1_final_final_really_final.tex and paper_v1_final_final_really_final_revised.tex!
What is GitHub?
GitHub, launched in 2008 and acquired by Microsoft in 2018, quickly rose to become the de facto online platform for code management and collaboration. While Git is the engine, GitHub can be thought of as the sleek, user-friendly vehicle that houses this engine. It takes the core functionalities of Git and provides a web-based graphical interface that is intuitive and accessible. GitHub’s rise is not just due to its user-friendly nature but also because it functions like a social network for developers and researchers. Users can host their Git repositories, share their work with others, collaborate on projects, and even contribute to others’ projects.
Why use Git and GitHub for research?
For finance researchers, Git and GitHub offer a multitude of benefits. Git is an excellent tool for managing complex research projects. It allows researchers to track changes in their data analysis scripts, models, and even research papers, ensuring a clear audit trail of how the analysis was conducted and conclusions were reached. This level of transparency is crucial not just for personal record-keeping but also for collaborative projects where multiple researchers contribute to a single body of work. In a field where reputation is everything, Git can help researchers maintain a high level of integrity and accountability. The pull request system of GitHub is particularly beneficial for collaborative projects. It enables researchers to propose, discuss, and review changes before they are integrated into the main project. This not only ensures that every change is scrutinized for accuracy and relevance but also fosters a culture of peer review and collective improvement among collaborators as the project progresses. Furthermore, GitHub’s issue-tracking and project management features help researchers organize their tasks, track bugs, and manage project progress transparently.
Setup and installation
If you followed my tutorial on installing Python then you already have Git installed on your computer.
Git is available for Windows, Mac, and Linux operating systems. Git will be installed by default on most Linux distributions. If you are using a Mac, you can install Git using Homebrew. If you are using Windows, or are on Mac and prefer not to use Homebrew, you can download Git from the Git website.
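A quick way to check whether Git is already available is to ask for its version in the terminal. This is only a minimal check; the Homebrew line is left as a comment because it applies to Mac users who have Homebrew installed.

```shell
# Print the path of the Git executable, if one is on the PATH
command -v git

# Print the installed Git version
git --version

# On a Mac with Homebrew, Git can be installed with:
#   brew install git
```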
You will also need to create a GitHub account. You can do so by visiting the GitHub website. GitHub offers free accounts for individuals and paid plans for teams and organizations. For most researchers, the free plan will be sufficient. If you are a student or educator, I recommend that you sign up for the GitHub Education program, which offers additional benefits including free access to GitHub Copilot, an AI-powered coding assistant.
Git is a command-line tool that you use in the terminal, but there are many graphical user interfaces available that make it easier to use, including GitHub Desktop. GitHub Desktop makes it easy to clone repositories, create branches, commit changes, and push your changes to GitHub. In this tutorial, I will use the built-in Git and GitHub integration in Visual Studio Code, but I still keep GitHub Desktop installed because it lets you easily clone repositories from GitHub in one click.
VS Code’s Git integration is available in the Source Control tab on the left sidebar. To enable GitHub integration, you will need to sign in to your GitHub account. You should also install the GitHub Pull Requests and Issues extension to make it easier to work with pull requests and issues.
Additional GitHub features such as Copilot, Codespaces, and Actions are available in VS Code by installing the relevant extensions. You can also install the GitLens extension to get additional features for working with Git.
Git workflow
Understanding the Core Concepts of Git
Git’s power lies in its ability to manage and track changes in your projects, and this is achieved through a set of core functionalities. Let’s demystify these key terms:
1. Repository (Repo): The heart of any Git project, a repository is like a project folder but with superpowers. It contains all of your project files along with each file’s revision history. You can have local repositories on your computer and synchronize them with remote repositories on GitHub to share and collaborate.
2. Staging: Think of staging as a prep area. When you make changes to files, they don’t automatically get saved into your repository. Instead, you selectively add these changes to the staging area, indicating that you’ve marked these modifications for your next commit.
3. Commits: Committing is the act of saving your staged changes to the project’s history. A commit is like a snapshot of your repository at a particular point in time. Each commit has a unique ID and includes a message describing the changes, aiding future you or collaborators in understanding what was modified and why.
4. Push and Pull: These are the methods by which you interact with a remote repository. When you push, you are sending your committed changes to a remote repo. Conversely, when you pull, you are fetching the latest changes from the remote repo to your local machine.
5. Branching: Branching allows you to diverge from the main line of development and work independently without affecting the main project, often referred to as the main branch. It’s perfect for developing new features or experimenting.
6. Merging: After you’ve finished working in a branch, you merge those changes back into the main project. Merging combines the changes in your branch with those in the main branch, creating a single, unified history.
7. Conflicts: Sometimes, when merging branches, Git encounters conflicts - changes that contradict each other. This can happen when two people make changes to the same file. These conflicts need to be manually resolved before completing the merge process.
We will explore these concepts in more detail in the next sections.
Creating a repository
A repository (or repo) is where all the magic happens: it’s where your code, documentation, and all other project-related files reside. To create a new repo, simply log into your GitHub account, click on the + icon in the top right corner, and select New repository.
Naming and Describing Your Repository
Choose a name that succinctly reflects your project. Keep in mind that this name will be part of the URL for your repository, and that it will be used as the default name for the folder when you clone the repository (make a local copy of the repository on your computer). The description field is an opportunity to briefly outline your project’s objective. This helps others understand the purpose of your repo at a glance.
Selecting a License
You can also define the license for your project. It is not necessary if you don’t intend on sharing this code publicly, but it is a good practice to include a license. When it comes to research code, transparency and accessibility are key. I recommend opting for a permissive license, like the MIT License. This license allows others to freely use, modify, and distribute your work – perfect for fostering open-source collaboration in the research community. GitHub makes it easy to include a license; just select the MIT License from the dropdown menu when creating your repository. Other permissive licenses include the BSD License and the Apache License.
Adding a .gitignore file
Before you start adding files to your repo, consider setting up a .gitignore file. This file tells Git which files or folders to ignore in a project. Typically, you’ll want to exclude certain files from being tracked, like temporary files, local configuration files, files containing sensitive information, or large data files. GitHub offers templates for .gitignore files tailored to various programming languages and frameworks, which can be a great starting point. These templates are available as a dropdown menu option when creating the repository. gitignore.io is another useful resource for generating .gitignore files.
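To make this concrete, here is a small sketch of a .gitignore file, created in a throwaway folder so that it does not touch a real project. The entries are illustrative choices rather than a recommended template, and analysis.py is a hypothetical file name. The git check-ignore command reports which rule, if any, causes a path to be ignored.

```shell
# Work in a throwaway repository
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet

# Write an illustrative .gitignore (lines starting with # are comments)
cat > .gitignore <<'EOF'
# macOS Finder metadata
.DS_Store
# Python bytecode caches
__pycache__/
# Local secrets and machine-specific settings
.env
# Large or frequently changing data
data/
EOF

# Create one file that should be ignored and one that should be tracked
touch .env analysis.py

# Show the matching rule for each ignored path; analysis.py matches no rule
git check-ignore -v .env analysis.py
```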
Adding a README file
Finally, you’ll want to add a README.md file to your repository. This file is the first thing visitors will see when they visit your repository on GitHub. It’s an essential component of your project, acting as the introduction and guide. Use the README to explain what your project does, how to set it up, and how to use it. This is important even if your project is not public, as it will help you remember how to use your project in the future and facilitate onboarding new collaborators. This file can be written in plain text or formatted using Markdown, a lightweight markup language that is easy to learn and use. GitHub automatically renders Markdown files, making them easy to read and navigate. You can also include images, links, and code snippets in Markdown files. GitHub offers a handy guide to help you get started with Markdown.
Cloning a repository
Cloning a repository creates a local copy of the remote repository on your computer. This allows you to work on the project locally and push your changes to the remote repository when you’re ready to share them with others. To clone a repository, you’ll need the URL of the remote repository. You can find this by clicking on the green Code button on the repository’s homepage. If you are using GitHub Desktop, you can clone the repository by selecting Open with GitHub Desktop, which will open the repository in GitHub Desktop. You can then select the location where you want to store the repository on your computer and click Clone.
If you are not using GitHub Desktop, you can clone the repository using the command line. First, copy the URL of the repository from the repository’s homepage by clicking on the green Code button, then clicking on the clipboard icon next to the URL. To clone the repository, open the terminal and navigate to the directory where you want to store the repository. Then, run the following command:
git clone <url>
This will create a new directory with the same name as the repository and download all the files from the remote repository into this directory. You can then open this directory in VS Code and start working on the project. From GitHub Desktop, you can open the repository in VS Code by selecting Open in Visual Studio Code from the Repository menu.
Tracking changes
Once you have cloned the repository, you can start making changes to the files in the repository. You can create new files, edit existing files, or delete files. You can also move files around or rename them. You can see all your changes in the Source Control tab in VS Code. Files will be listed under Changes with a U if they are new (untracked), an M if they have been modified, or a D if they have been deleted. You can also see the changes you have made to each file by clicking on the file name.
When you create a new file, it will not be tracked by Git until you add it to the staging area. To add a file to the staging area, you use the Stage Changes button in the Source Control tab in VS Code (the little + sign next to a file when you hover over it). You need to do this not only for new files, but for all files that you have modified or deleted since the last commit. Files in the staging area will be included in the next commit.
Once you have added one or many changed files to the staging area, you can commit those changes to the repository. To commit changes, you need to enter a commit message describing the changes you have made and then click on the Commit button in the Source Control tab in VS Code (the checkmark icon). You can also use the keyboard shortcut (Command+Enter on Mac or Ctrl+Enter on Windows or Linux) to commit your changes. This will create a new commit, i.e., a new snapshot, with the changes you have staged.
You can see all your commits in the Source Control tab under Commits. You can click on a commit to see the changes that were made in that commit. You can also right-click on the commit to access the commit details, including the commit message, the author, and the date and time of the commit.
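For readers who prefer the terminal, the same stage-and-commit cycle can be sketched with plain Git commands. This is a minimal illustration in a throwaway folder; the file name, the identity, and the commit message are placeholders that I made up for the example.

```shell
# Work in a throwaway repository
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet

# Identity recorded in commits, set for this demo repository only (placeholder values)
git config user.name "Demo Researcher"
git config user.email "demo@example.com"

# Create a new file; Git sees it but does not track it yet
echo "print('hello')" > analysis.py
git status --short

# Stage the file, then commit the staged snapshot with a message
git add analysis.py
git commit --quiet -m "Add first analysis script"

# List the commits recorded so far, one per line
git log --oneline
```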
Syncing with the remote repository
After committing your changes locally in Visual Studio Code, the next step is to synchronize these changes with your remote repository on GitHub. This process involves two main actions: pulling changes from the remote repository and pushing your local changes to the remote.
Pulling changes from the remote repository
Before you push your changes, it’s a good practice to pull any updates that others might have made to the remote repository. This ensures that your local repository is up-to-date. In VS Code, you can pull changes by clicking on the ... (more actions) button in the Source Control tab and selecting Pull. Alternatively, you can open the command palette (Command+Shift+P on Mac or Ctrl+Shift+P on Windows/Linux) and type Git: Pull. Pulling changes will merge updates from the remote repository into your local branch. If there are no conflicts, the merge will happen automatically.
Pushing changes to the remote repository
Once your local branch is up-to-date and you’ve committed your changes, you’re ready to push these changes to the remote repository. In the Source Control tab, click on the ... button and select Push. This will upload your commits to the remote repository on GitHub. You can also open the command palette (Command+Shift+P on Mac or Ctrl+Shift+P on Windows/Linux) and type Git: Push. If you’re pushing a branch that doesn’t exist on the remote, VS Code will offer to publish it, which creates the branch in the remote repository.
Resolving merge conflicts
Occasionally, when you pull changes from the remote repository, you may encounter merge conflicts. These occur when changes in the remote repository overlap with your local changes in a way that Git can’t automatically resolve. VS Code provides tools to help resolve these conflicts. Conflicted files will be marked in the Source Control tab. You can open these files and choose which changes to keep. After resolving conflicts, you’ll need to stage and commit the merged files before pushing.
Regularly pulling and pushing changes will keep your local and remote repositories synchronized. This is crucial in collaborative projects to ensure everyone is working with the most current version of the project.
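The pull-then-push routine can also be tried out from the terminal without touching GitHub. In the sketch below, a local bare repository stands in for the GitHub remote, which is an assumption made so the example runs offline; with a real project, origin would point to your GitHub URL instead. The identity and file contents are placeholders.

```shell
# Work in a throwaway folder
work_dir="$(mktemp -d)"
cd "$work_dir"

# A bare repository plays the role of the remote on GitHub
git init --quiet --bare remote.git

# Clone it; Git warns that the repository is empty, which is expected here
git clone --quiet remote.git local
cd local
git config user.name "Demo Researcher"
git config user.email "demo@example.com"

# Make a first commit locally and name the branch main
echo "# Project" > README.md
git add README.md
git commit --quiet -m "Add README"
git branch -M main

# Push the commit and link the local branch to its remote counterpart
git push --quiet -u origin main

# Pull before the next push to pick up any changes made by collaborators
git pull --quiet

# Show the current branch and how it compares with the remote
git status --short --branch
```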
Branching and merging
Before using Git, whenever I wanted to try something new in my code, I would make a copy of the entire project folder and work on that copy. This was a tedious process, and it was easy to lose track of which version was the most recent. With Git, branching makes this process much easier. Branching allows you to create a copy of your project, called a branch, and work on that branch without affecting the main project. Once you’re satisfied with the changes you’ve made in your branch, you can merge those changes back into the main project. This process is much more efficient and less error-prone than manually copying and pasting files.
Creating a New Branch
In VS Code, you can create a new branch by clicking on the branch name in the bottom left corner, then selecting Create new branch.... Give your branch a descriptive name that reflects its purpose. You can switch between branches by clicking on the branch name in the bottom left corner and selecting the branch you want to work on.
After creating and switching to your new branch, any changes you make are confined to that branch. You can stage and commit changes in this branch as you would in the main branch.
You can also choose to publish your branch to the remote repository. This will create a copy of your branch on GitHub. This is useful if you want to collaborate with others on this branch, or to use GitHub to back up the branch. Note that once the branch is published, others who have access to the repository will be able to see that branch. To publish your branch, click on the ... button in the Source Control tab and select Publish Branch....
Merging Branches
Once you’ve completed the work in your branch and you’re satisfied with the changes, you’ll want to merge these changes back into the main branch. Before merging, ensure your branch is up-to-date with the main branch. You can do this by checking out the main branch and pulling the latest changes, then switching back to your branch and merging the main branch into it. After that, you are ready to merge your branch into the main branch. After merging, you can delete your branch if you no longer need it. This avoids cluttering the repository with branches that are no longer needed.
In the Source Control tab, click on the ... button, select Merge Branch..., and choose the branch you want to merge into your current branch. If there are no conflicts, VS Code will complete the merge. VS Code will also ask you if you want to delete the merged branch.
Merge conflicts happen when the same lines of code have been changed differently in both branches. VS Code will notify you if there are conflicts that need resolution. The first time you pull, Git may also ask you to configure how it should reconcile your local branch with the remote branch when the two have diverged. You do this by entering one of the following commands in the terminal:
- git config pull.rebase false: This command sets the pull behavior to merge. When you pull from a remote repository, Git will merge any incoming commits with your current branch. This is the one I usually use.
- git config pull.rebase true: This command sets the pull behavior to rebase. Instead of merging incoming commits, Git will reapply your local commits on top of the incoming commits, creating a linear commit history.
- git config pull.ff only: This command sets the pull behavior to fast-forward only. Git will only update your branch if it can fast-forward, meaning your local branch has no commits of its own that are missing from the remote branch. If the two have diverged, Git will not pull the changes and you’ll need to manually merge or rebase.
Conflicted files will be marked in the Source Control tab. Open these files, and VS Code will highlight the conflicting changes. Choose which changes to keep, then save the file, stage, and commit the resolved files. Once all conflicts are resolved and changes are committed, the branches are successfully merged. If you’ve merged into your local main branch, don’t forget to push these changes to the remote repository to keep everything synchronized.
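As a rough command-line counterpart to the branching and merging steps, the sketch below creates a branch, commits on it, merges it back into main, and deletes it. The branch name robustness-checks, the file name, and the identity are hypothetical, and the example uses a throwaway folder with no remote and no conflicts.

```shell
# Work in a throwaway repository with one commit on main
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet
git config user.name "Demo Researcher"
git config user.email "demo@example.com"
echo "baseline" > results.txt
git add results.txt
git commit --quiet -m "Add baseline results"
git branch -M main

# Create a new branch and switch to it
git switch --quiet -c robustness-checks

# Commit a change that exists only on the new branch
echo "robustness" >> results.txt
git commit --quiet -am "Add robustness checks"

# Switch back to main and merge the branch into it
git switch --quiet main
git merge --quiet robustness-checks

# Delete the merged branch to keep the branch list tidy
git branch --quiet -d robustness-checks

# Show the resulting history
git log --oneline
```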
Pull requests and code reviews
A pull request (PR) is a method in GitHub to propose changes from one branch to another, typically from a feature branch into the main branch. It’s a request to pull in your changes. The name is a bit misleading because it’s not related to the pull command in Git. You can think of it as a “merge request” instead. When you create a PR, you’re initiating a discussion about your proposed changes. Your collaborators can review the code, leave comments, request changes, or approve the PR.
Creating a Pull Request in GitHub
Once you have pushed your branch to the remote repository, you can create a PR. Navigate to the repository on GitHub.com. GitHub often shows a prompt to create a PR for recent branches. If not, go to the Pull Requests tab and click New pull request. Select your branch and the branch you want to merge into (usually the main branch). When creating a PR, include a clear title and a detailed description of the changes. This helps reviewers understand the context and purpose of the changes. You can also assign reviewers to the PR, add labels, and set a milestone. Once you’re satisfied with the PR, click Create pull request. Any assigned reviewers will be notified of the PR and can begin reviewing it.
Code Reviews
Collaborators can review the changes in a PR by navigating to the Files changed tab within the PR. Reviewers can leave comments on specific lines of code, general comments on the PR, and suggest changes. They can also pull the branch locally and test the changes themselves. Once the review is complete, the reviewer can approve the PR, request changes, or leave a comment. If changes are requested, the PR author can make additional commits and push them to the branch; the PR will be automatically updated with the new changes. This back-and-forth can continue until the changes are satisfactory. Once the PR is approved, it can be merged into the target branch.
Merging the Pull Request
Once the PR is approved and any conflicts are resolved, you can merge it into the target branch. This is typically done via the ‘Merge pull request’ button on GitHub. After merging, it’s a good practice to delete the feature branch from the remote repository to keep the branch list tidy.
Best Practices for Pull Requests and Code Reviews
- Small, Focused Changes: Aim for smaller, manageable PRs that focus on a specific feature or fix. This makes code reviews more efficient and less overwhelming.
- Clear Communication: Use clear, descriptive messages in both your PRs and commits. This helps reviewers understand your thought process and the changes made.
- Constructive Feedback: When reviewing, offer constructive and respectful feedback. Code reviews are not just about finding mistakes but also about sharing knowledge and improving the codebase collaboratively.
Pull requests and code reviews are vital for maintaining high-quality code and fostering collaboration in your finance research projects. While not yet commonly used in academia, I have found them the perfect tools for collaborating on research projects. They ensure that every change is scrutinized and understood by all collaborators, and they foster a culture of peer review and collective improvement as the project progresses.
GitHub offers many other features that can be useful for research projects. I list them at the end of this post and will cover them in a future post.
GitHub for research code
In empirical finance research, the ability to reproduce results is more important now than ever, especially now that most top journals require authors to share their code and data. In this section, I will discuss some best practices I have adopted for using GitHub to manage research code, with an emphasis on reproducibility, documentation, and effective use of GitHub’s features.
The first thing to consider after creating a new repository is the structure of your project. A well-organized project is easier to navigate and understand, and it makes it easier for others to reproduce your work. There is no one-size-fits-all approach to organizing a project, but the following project structure is a good starting point:
project
├── data/
├── docs/
├── output/
│   ├── figures/
│   └── tables/
├── src/
├── .gitignore
├── .env
├── .env-example
├── conf.yaml
├── LICENSE
├── poetry.lock
├── pyproject.toml
├── README.md
└── requirements.txt
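If you want to start from this layout, the folders and empty placeholder files can be created in one go from the terminal. This is just a scaffolding sketch in a temporary location; poetry.lock is left out on purpose because Poetry generates it, and the files are created empty for you to fill in.

```shell
# Create the project folder in a temporary location
project_dir="$(mktemp -d)/project"
mkdir -p "$project_dir"
cd "$project_dir"

# Create the directories, including the two subfolders of output/
mkdir -p data docs output/figures output/tables src

# Create empty placeholder files at the root of the project
touch .gitignore .env .env-example conf.yaml LICENSE README.md pyproject.toml requirements.txt

# List what was created, two levels deep
find . -maxdepth 2 | sort
```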
So, which files should you commit to your repository? Here are some guidelines:
You should include:
- Configuration files: Files like .json, .yml, or .ini are crucial for ensuring that your project can be set up and run by others with the exact same parameters you used. In my example, the conf.yaml file contains the configuration parameters for the project and should be included in the repository.
- Source code: Include all scripts and code files that are essential for your analysis or model. In my example, the src directory contains all the Python scripts used in the project.
- Documentation: Any files that help explain your project, especially Markdown files with notes. In my example, the docs directory contains the documentation for the project and should be included in the repository. If your documentation is generated from source files, such as Markdown or LaTeX, then you should include the source files in the repository, not the generated files. Your repository should also include a README file at the root of the repository that provides an overview of your project, its purpose, and how to use it.
- Dependencies: For projects in languages like Python, a file listing the dependencies is essential. This file lists all the external libraries and their specific versions needed for your project. This ensures that anyone cloning your repository can easily install the necessary dependencies and run your code in an environment identical to yours. In my example, I use Poetry to manage dependencies, so I include the pyproject.toml file. The Poetry documentation recommends also including the poetry.lock file in the repository, unless you are writing a library for distribution. I also include a requirements.txt file for users who prefer to install dependencies using pip instead of Poetry.
- .gitignore file: This file tells Git which files or folders to ignore in a project. Typically, you’ll want to exclude certain files from being tracked, like temporary files, local configuration files, files containing sensitive information, or large data files. GitHub offers templates, but they seem to be missing a few things. For example, if you are on Mac you will want to add .DS_Store to your ignore file. gitignore.io is another useful resource for generating .gitignore files that are much more comprehensive. Make sure to also add the .gitignore file to your repository.
Finally, make sure that you include in the .gitignore file all the files and directories that you should not include in the repository.
You should not include:
- Data files: While large datasets might not be feasible to store on GitHub, even small datasets can be problematic if they are updated often. Instead, consider including sample datasets or scripts that automatically fetch or generate data, or sharing your data among collaborators using a cloud storage service like Dropbox or Google Drive. There exist tools like DVC that can help you manage large datasets with version control, but I have not used them myself.
- Sensitive data and local configuration files: Do not include sensitive information like passwords or API keys, or computer-specific configuration parameters such as local paths. Instead, you should include an example file with the expected parameters that need to be set in the configuration file. In my example, I use a .env file to store sensitive and local information, and I include a .env-example file that contains the names of the environment variables that need to be set in the .env file. I then include the .env-example file in the repository, but not the .env file. I also list the .env file in the .gitignore file so that it is not included in the repository.
- Output: You should not include output files in the repository. Instead, you should include the code that generates the output files. Every collaborator should be able to generate the results in their own environment. In my example, the output directory contains the figures and tables generated by the code in the src directory.
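The .env and .env-example pattern can be sketched as follows. The variable names API_KEY and DATA_DIR are hypothetical, chosen only for illustration, and the example runs in a throwaway folder. The final command lists what Git would actually track, which should leave out both the real .env file and everything under output/.

```shell
# Work in a throwaway repository
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet

# Template that is committed: variable names only, no real values
# (API_KEY and DATA_DIR are hypothetical names)
cat > .env-example <<'EOF'
API_KEY=
DATA_DIR=
EOF

# Each collaborator copies the template and fills in their own values locally
cp .env-example .env

# Keep the real .env file and generated output out of version control
printf '.env\noutput/\n' > .gitignore

# Simulate a generated figure
mkdir -p output/figures
touch output/figures/plot.png

# Stage everything that is not ignored, then list the staged files
git add -A
git ls-files
```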
Git and Jupyter notebooks
Jupyter Notebooks are a popular tool for data analysis and visualization. They allow users to combine code, text, and visuals in one document, making it easy to share and collaborate on data science projects. However, Jupyter Notebooks have many shortcomings when it comes to replicability, and using them with Git can be challenging.
Challenges with Git and Jupyter notebooks
Jupyter Notebooks, while an excellent tool for data analysis and visualization, present unique challenges when used with Git. The core issue lies in their format: Notebooks save both the input (code) and the output (results, graphs, etc.) in a single JSON file. This means that even small changes in the code can lead to large changes in the file, making it difficult for Git to handle diffs and merges effectively. The output sections, especially those with visual content, can create “noise” in version control. When different users run the same notebook, slight differences in output can appear, leading to unnecessary conflicts. Because Git keeps track of the full history of the notebook, the size of the repository can grow quickly, especially if the notebook contains large outputs such as images. This can make it difficult to share and collaborate on notebooks.
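The "noise" problem is easy to reproduce with a toy example. The sketch below commits a minimal, hand-written notebook file, then simulates a re-run in which the code is unchanged but the stored output and execution count differ. The notebook content is made up for illustration (random_draw is a hypothetical function), and a real notebook would carry far more metadata and output.

```shell
# Work in a throwaway repository
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet
git config user.name "Demo Researcher"
git config user.email "demo@example.com"

# Minimal notebook: one code cell with its stored output
cat > analysis.ipynb <<'EOF'
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {"name": "stdout", "output_type": "stream", "text": ["0.51\n"]}
   ],
   "source": ["print(random_draw())"]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}
EOF
git add analysis.ipynb
git commit --quiet -m "Add notebook"

# Simulate a re-run: same code, different stored output and execution count
sed -e 's/0\.51/0.47/' -e 's/"execution_count": 1/"execution_count": 2/' \
    analysis.ipynb > rerun.ipynb
mv rerun.ipynb analysis.ipynb

# Git reports the file as changed even though the source line is identical
git diff --stat
```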
VS Code Notebook Diff Viewer
Recognizing these challenges, tools like Visual Studio Code have introduced features to help. The VS Code diff viewer (the tool that shows differences in files due to changes) supports Jupyter notebooks, allowing users to compare and understand changes between notebook versions more easily. This tool provides a clearer visualization of differences in the code, reducing the complexity involved in tracking changes in notebooks in Git.
Replicability concerns with notebooks
Replicability in research is crucial, and Jupyter Notebooks can sometimes hinder it. Notebooks look linear, but cells share state and can be run in any order, so code may run successfully in one session but not in another because of differences in execution order or environment. Ensuring that notebooks are run from a clean state and in the right order is essential for replicability. Another concern is that code in notebooks is harder to test and debug than code in scripts. This can lead to errors that are difficult to detect and fix, especially in large notebooks. Finally, notebooks are not ideal for large-scale projects. As projects grow in size and complexity, notebooks can become unwieldy and difficult to manage. In these cases, it is better to use scripts or modules instead.
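As a rough illustration of the "clean state, right order" point, the sketch below checks whether the code cells of a saved notebook carry execution counts 1, 2, 3, and so on from top to bottom. It assumes the version 4 notebook JSON layout and treats empty code cells as never executed. The function names are hypothetical, and the check is only a heuristic: a cell edited after it was run would still pass.

```python
def has_source(cell):
    """True if the cell contains any non-blank source text."""
    source = cell.get("source", "")
    # The notebook format allows either a string or a list of strings.
    text = source if isinstance(source, str) else "".join(source)
    return bool(text.strip())


def executed_in_order(notebook):
    """True if non-empty code cells are numbered 1, 2, 3, ... top to bottom."""
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code" and has_source(cell)
    ]
    return counts == list(range(1, len(counts) + 1))
```

A notebook that fails this check was probably saved after cells were re-run or run out of order, which is a hint to restart the kernel and run everything again before committing.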
Using Notebooks with Online Platforms
Despite these challenges, Jupyter Notebooks remain a popular and powerful tool for data analysis and research. Their interactive nature and the ability to combine code, text, and visuals in one document make them invaluable.
Platforms like Binder and Google Colab integrate well with Jupyter Notebooks hosted on GitHub. These platforms can automatically create interactive, shareable environments from notebooks, making them more accessible for collaborative work and education. By using these platforms, researchers can share their notebooks in a more user-friendly and interactive format, ensuring that others can easily replicate and experiment with their findings.
GitHub for writing
GitHub is not just for code; it’s also an excellent platform for tracking your writing, especially if you are using formats based on plain-text files such as Markdown (like the Quarto publishing system) and LaTeX.
The same principles for organizing code projects apply to writing projects. You should include all the files that are essential to generate the output of your project, such as the source files (e.g. .md, .tex, and .bib), configuration files, and tables and figures. You should avoid including output files such as PDFs or HTML files.
Tagging releases for milestones
There are times when you want to create a snapshot of your project, including the output, at a specific point in time. For example, when you submit a paper to a journal, you want to create a snapshot of the project at that point in time. This allows you to keep track of the changes made in between revisions. GitHub provides a way to do this using tags and releases.
When you reach a significant milestone in your writing – such as the completion of a draft, submission to a journal, or final revisions – you can create a tag and a corresponding release.
To create a tag and release, head to the repository on GitHub.com and click on Create a new release under Releases in the right sidebar. Enter a tag version number and a title for the release. You can also add release notes summarizing the changes or updates in this version. Finally, attach the output files (e.g. PDFs) to the release. Click Publish release to create the release. You can then download the release files or share the release link with others.
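If you prefer the command line, the tag itself can also be created locally and pushed. The sketch below only builds and prints the two Git commands rather than running them. The tag name, the message, and the remote name origin are assumptions for illustration, and the helper tag_commands is hypothetical.

```python
import shlex


def tag_commands(version, message, remote="origin"):
    """Build the Git commands that create an annotated tag and push it."""
    return [
        ["git", "tag", "-a", version, "-m", message],
        ["git", "push", remote, version],
    ]


# Print the commands so they can be reviewed before being run in a terminal.
for command in tag_commands("v1.0-submission", "Version submitted to the journal"):
    print(shlex.join(command))
```

Pushing a tag this way does not attach any files. To share the PDF alongside the tag, you still create the release on GitHub as described above and select the existing tag.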
Publishing your code on GitHub
In empirical finance academic research, sharing your code has become increasingly important. Publishing your code enhances the transparency and reproducibility of your research. It allows peers to review, replicate, and build upon your work, contributing to the collective knowledge of the field. Making your code available can also increase the citation and impact of your research, as it provides tangible artifacts that others can use and reference. Finally, it is also a requirement for publishing in many journals, including the top ones.
Journals will publish your code alongside your paper, so why should you also publish it on GitHub?
For me, the main reason is control: with your code on GitHub, you can keep making changes and updating it as needed. Other researchers who visit your GitHub repository can also be exposed to your other work, increasing the visibility of your research. Finally, GitHub offers a platform for collaboration and feedback, allowing others to flag issues, contribute to your work, and build upon it.
To publish your code on GitHub, all you need to do is set the visibility of your repository to public. If you don’t want to share the full history of your code, you can create a new repository and upload the latest version of your code.
Make sure to include a README file that explains what your project does, how to set it up, and how to use it. Documentation is key to making your code accessible to others (and to reducing the number of questions you get about your code). You can also include instructions on how to cite your code in the README file. Finally, you should also include a LICENSE file to clearly state how others can use your code.
Once you have completed these steps, your code is published! If you want a DOI for your repository, you can use Zenodo, which allows you to mint a DOI for your GitHub repository.
Other GitHub features for academic researchers
In addition to the core features of GitHub, many other tools and functionalities can be useful for academic researchers. I plan on covering most of them in future posts. Here are the ones that I use the most:
Project Management Tools
GitHub offers several tools to help you manage your projects, including Projects, Issues, Discussions, and Wikis. These tools can be used to organize your work, track tasks, and collaborate with others.
GitHub Copilot
GitHub Copilot is an AI coding assistant. It can do code completion, suggest functions, and even generate code based on comments. There is also a Copilot Chat powered by GPT-4 that can answer questions about code while being aware of the context. When you allow it, it can consult your private repositories to provide more relevant suggestions.
Seriously, if you haven’t tried it yet, you should. It’s a game-changer, and new features are being added all the time. And it will work for text too if you write your Markdown or LaTeX files in VS Code.
GitHub Pages
GitHub Pages is a free service that allows you to host static websites directly from GitHub. This can be useful for hosting project websites, blogs, or personal websites. My personal website and this blog are both hosted on GitHub Pages.
Here are some tools that can be useful for generating static websites suitable for GitHub Pages:
- Quarto: Quarto is an open-source scientific and technical publishing system built on Pandoc. It is excellent for publishing computational notebooks, including Jupyter and Markdown notebooks. This blog and my personal website are built using Quarto.
- Academic Pages Template (Jekyll): This Jekyll-based template is designed for academic personal websites and can be hosted on GitHub Pages. It’s a great way to create a professional online presence showcasing your research work. It’s what I used for my personal website before switching to Quarto.
- Nikola and Pelican: If you want something all Python-based, Nikola and Pelican are two good options. I have used Nikola in the past and found it to be a great tool for generating static websites, but I found it simpler to consolidate everything in Quarto.
GitHub Classroom
GitHub Classroom simplifies the use of GitHub in classroom settings. It’s a toolset that automates the repetitive tasks involved in grading and feedback, making it easier to use GitHub for coursework and assignments in a research or academic context. I have been using it for three years and it has been a game-changer for me. While it’s not bug-free, it has saved me countless hours of grading and feedback. Automated grading has a monthly limit after which you need to pay, but the cost is minimal and well worth it.
GitHub Actions
GitHub Actions is a powerful tool that allows you to automate workflows. You can set up CI/CD pipelines[2] to automate testing, building, and deploying your applications or research code. GitHub Actions workflows are small scripts that run in response to events in your repository, such as commits or pull requests. For researchers, they can be used to automate the testing of code or even routine data processing tasks.
GitHub Codespaces
GitHub Codespaces provides a fully featured cloud development environment accessible directly from GitHub. Your code lives on a remote server, and you get a complete VS Code environment in your browser. This can be particularly useful for researchers who want to quickly experiment with code or collaborate without the need to set up a local development environment. It is also a great way to ensure maximum replicability of the code you distribute, as the environment is identical for everyone.
Footnotes
[1] Historically, this branch was called master, but GitHub has recently changed the default branch name to main to avoid the racially charged connotations of the word master.
[2] Continuous integration and continuous delivery (or deployment).