This lesson is still being designed and assembled (Pre-Alpha version)

Odd things to know about Files

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Why are (large) binary files a problem in Git?

  • What is Git LFS?

  • What are the problems with Git LFS?

Objectives
  • Understanding that Git is not intended for (large) binary files

  • Learning about the git lfs commands

  • Understanding the disadvantages of git lfs

Sometimes, you might want to add non-textual data to your Git repositories. One example of such a use case in a software project is documentation assets, such as a PDF report.

However, such data is stored in binary formats most of the time. Git’s line-based approach to showing and merging changes is not suited for this type of data. Git will store binary data without any errors, but it cannot produce meaningful diffs or merges for it. In addition, many binary formats are compressed, so even a small edit can change most of the file’s bytes, which leaves Git little similarity between versions to exploit when it compresses its history. Consequently, if you change such a file, the new commit tends to add roughly the full size of the file to the repository. As Git does not “forget” previous versions of a file, doing this repeatedly and/or with very large files will quickly make your repository grow in size. At some point, this can severely impact the performance of your Git operations, from git clone to even git status. It is therefore generally discouraged to use Git to track (large) binary files.
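The growth described above can be observed in a small experiment. The following is a minimal sketch, assuming a Unix-like shell with Git installed. The file name asset.bin, the throwaway directory, and the demo identity are made up for illustration, and random bytes stand in for an incompressible binary format.

```shell
# Sketch: watch a repository grow when a binary file is committed repeatedly.
# Everything happens in a throwaway directory; all names are illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

for i in 1 2 3; do
    # 1 MiB of random bytes stands in for a new version of a binary asset.
    head -c 1048576 /dev/urandom > asset.bin
    git add asset.bin
    git -c user.name="Demo" -c user.email="demo@example.com" \
        commit -q -m "Update asset (version $i)"
done

# "size" is the disk space used by loose objects, reported in KiB.
size_kib=$(git count-objects -v | awk '/^size:/ {print $2}')
echo "Objects occupy ${size_kib} KiB after 3 commits of a 1 MiB file"
```

Although the working directory only ever contains a single 1 MiB file, the object store should report roughly three times that amount, because every committed version is kept.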

However, the problem of binary files in Git repositories cannot be ignored entirely. There is a lot of value for a software project in keeping things together that belong together: documentation assets belong to the documentation they are part of. Therefore, we will now explore some options for integrating large file handling into Git.

The git lfs subcommand is part of an extension to Git. LFS stands for Large File Storage. It allows you to mark individual files as being large. Git does not apply its normal, line-based change tracking to these files. Instead, their content is stored separately, and the Git data model only holds a reference to it. During push and pull operations, large files are transmitted separately, which requires the server to support this operation.
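To make the idea of a reference more concrete: in place of the large file, Git stores a small text "pointer" that records the LFS spec version, a SHA-256 hash of the content, and the size in bytes. The sketch below rebuilds such a pointer by hand, for illustration only, since Git LFS normally writes it for you. The file name example.bin is hypothetical, and sha256sum is assumed to be available (as on most Linux systems).

```shell
# Sketch: rebuild by hand the kind of pointer that stands in for an LFS file.
# Git LFS normally generates this itself; the manual steps only illustrate it.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'hello\n' > example.bin                  # stand-in for a large file

oid=$(sha256sum example.bin | cut -d ' ' -f 1)  # SHA-256 of the file content
size=$(wc -c < example.bin | tr -d ' ')         # size in bytes

pointer=$(printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' "$oid" "$size")
echo "$pointer"
```

If Git LFS is installed, git lfs pointer --file=example.bin should print an equivalent pointer, which is a convenient way to compare against this hand-built version.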

For the sake of demonstration, we create a file called report.pdf. We assume that it is a large, binary file in order to show how to handle it with git lfs:

echo "This is a very large report." > report.pdf

Next, we tell Git that this file should be handled by LFS:

git lfs track report.pdf
Tracking "report.pdf"

Having done so, we can inspect the repository and see that a new file, .gitattributes, was created in the working directory.

git status
On branch main

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.gitattributes
	report.pdf

cat .gitattributes
report.pdf filter=lfs diff=lfs merge=lfs -text

Similar to .gitignore, this file is part of the repository itself so that it is shared with all your collaborators on this project. We therefore craft a commit that contains it:

git add .gitattributes
git commit -m "Setup LFS tracking"
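To double-check which paths a .gitattributes entry applies to, you can ask Git directly with git check-attr. The following is a minimal sketch that works without Git LFS installed: it writes the same attribute line shown above into a throwaway repository and queries two paths. The file notes.txt is a hypothetical counterexample.

```shell
# Sketch: ask Git which filter applies to a path, using a throwaway repository.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# The same line that appears in .gitattributes after `git lfs track report.pdf`.
echo "report.pdf filter=lfs diff=lfs merge=lfs -text" > .gitattributes

# check-attr evaluates the attribute rules; the files do not need to exist.
tracked=$(git check-attr filter -- report.pdf)
untracked=$(git check-attr filter -- notes.txt)
echo "$tracked"
echo "$untracked"
```

A path covered by the rule reports the lfs filter, while any other path reports the attribute as unspecified.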

Now, we are ready to add the large file to the repository the same way we would with any other file:

git add report.pdf
git commit -m "Add final report to the repository"

When we push our commits to the remote repository, the console output shows that our LFS data was transferred to the remote server separately.

git push origin main
Uploading LFS objects: 100% (1/1), 17 B | 0 B/s, done.

Tracking with wildcard patterns

LFS tracking is not limited to explicitly spelled-out filenames. Instead, wildcard patterns can be passed to git lfs track. However, you should be careful to quote these patterns, as your shell might otherwise expand them to the names of existing files. For example, tracking all PDFs with LFS could be achieved with the following command:

git lfs track "*.pdf"
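The effect of quoting can be seen without involving Git at all. The sketch below uses a hypothetical helper function, count_args, that only reports how many arguments it received. It shows what a command such as git lfs track would be handed in each case.

```shell
# Sketch: compare what a command receives for a quoted vs. unquoted pattern.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a.pdf b.pdf

# Hypothetical helper: prints the number of arguments it was called with.
count_args() { echo "$#"; }

unquoted=$(count_args *.pdf)     # the shell expands the pattern to a.pdf b.pdf
quoted=$(count_args "*.pdf")     # the literal pattern is passed through as-is
echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
```

With the unquoted form, only the files that exist right now would be tracked individually. The quoted pattern also covers PDFs added later.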

Disadvantages of Git LFS

Although git lfs by design solves the problem of storing large files in Git repositories, there are some practical hurdles that you should consider before introducing LFS into your project:

  • The git lfs command is a separately maintained extension to the Git core. It is therefore not part of most Git distributions, but needs to be installed separately. Using it in your project will require you to educate your users about LFS and how to install it. Depending on your target audience, you should carefully consider whether the benefits outweigh this disadvantage.
  • Users who do not have git lfs installed will not be notified by Git. They will see the files, but each file will contain a small text pointer (LFS metadata) instead of the actual content. Trying to work with those files will typically produce cryptic error messages.
  • Some hosting providers, most notably GitHub, apply restrictive quotas to LFS storage. At the time of writing, GitHub's free plan allows 1 GB of storage and 1 GB of bandwidth per month. Because every single clone by a user counts against the bandwidth quota, LFS should be considered unusable on the GitHub free plan under these limits.
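Regarding the second point above, one way to recognize an unresolved LFS file is to look at its first line, which in a pointer is the LFS spec version line. The following is a minimal sketch. The helper name is_lfs_pointer and both sample files are made up for illustration, and the pointer is hand-written with a placeholder hash.

```shell
# Sketch: detect files that still contain an LFS pointer instead of real content.
# Hypothetical helper: succeeds if the first line is the LFS version line.
is_lfs_pointer() {
    head -c 200 "$1" | head -n 1 |
        grep -qxF 'version https://git-lfs.github.com/spec/v1'
}

workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A hand-written pointer (placeholder oid of 64 zeros) and an ordinary file.
printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%064d\nsize 12345\n' 0 > pointer.pdf
echo "This is real content." > real.pdf

for f in pointer.pdf real.pdf; do
    if is_lfs_pointer "$f"; then
        echo "$f: unresolved LFS pointer"
    else
        echo "$f: regular content"
    fi
done
```

Once Git LFS is installed, running git lfs install followed by git lfs pull in the affected repository is the usual way to replace such pointers with the actual file content.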

Key Points

  • (Large) binary files can grow the repository size immensely and make it unusable

  • git lfs is an extension that stores large files outside the Git data model

  • Use of Git LFS is discouraged in many scenarios