Git Tricks for Working with Large Repositories

doi:https://doi.org/10.59350/8exd0-b3y53

Tuesday, August 6, 2024 From rOpenSci (https://ropensci.org/blog/2024/08/06/git-tricks/). Except where otherwise noted, content on this site is licensed under the CC-BY license.


By Mauro Lepore – Edited by Steffi LaZerte

Recently Yanina Bellini Saibene reminded us to update our Slack profiles:

Friendly reminder: Let’s increase the value of our rOpenSci Slack community. Please add details to your profile, e.g., your photo, your favorite social media handle, what you do, your pronouns, and how to pronounce your name.

After doing that I went on to update my profile photos on the rOpenSci website, which ended up teaching me a few Git tricks I would like to share here. Thanks to Maëlle Salmon for the encouragement, and to Steffi LaZerte for reviewing this post.

🔗 Cloning as usual

When I tried to clone the source code of rOpenSci’s website I realized the repo was large and it would take me several minutes.

git clone https://github.com/ropensci/roweb3.git

I decided to stop the process and research how to pull just the latest version of the specific files I needed.

🔗 Pulling the latest version of specific files

First I forked the rOpenSci website repository (roweb3). I used the gh CLI from the terminal, but I could also have forked it manually on GitHub.

# if not using `gh`, fork ropensci/roweb3 from GitHub
gh repo fork ropensci/roweb3

Then I created a local empty roweb3 directory and linked it to the fork.

git init roweb3
cd roweb3
git remote add origin git@github.com:maurolepore/roweb3.git

Now for the tricks! I avoided having to download the whole repository by first finding the specific files I needed with GitHub’s “Go to file” box, then:

Trick 1: Configured a sparse checkout matching just those files.

git config core.sparseCheckout true
echo "themes/ropensci/static/img/team/mauro*" >> .git/info/sparse-checkout

Trick 2: Pulled with --depth 1 to get only the latest version of those files.

git pull --depth=1 origin main
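As a side note, one way to confirm that a shallow pull really brings in a single commit is to count commits afterwards. The sketch below is only an illustration, not part of the original workflow: it builds a throwaway stand-in repository in a temporary directory (all names are hypothetical) and pulls from it over a file:// URL, since Git ignores --depth for plain local paths.

```shell
#!/bin/sh
set -eu

# Throwaway workspace; all names below are hypothetical demo data.
work=$(mktemp -d)

# Build a stand-in upstream repository with three commits.
git init --quiet "$work/upstream"
cd "$work/upstream"
git config user.email "demo@example.com"
git config user.name "Demo"
for n in 1 2 3; do
  echo "version $n" > file.txt
  git add file.txt
  git commit --quiet -m "commit $n"
done
branch=$(git symbolic-ref --short HEAD)
full_count=$(git rev-list --count HEAD)

# Pull only the latest commit into a fresh, empty repository.
# The file:// prefix matters: for plain local paths Git ignores --depth.
git init --quiet "$work/shallow"
cd "$work/shallow"
git remote add origin "file://$work/upstream"
git pull --quiet --depth=1 origin "$branch"

# Count the commits that arrived and ask Git whether history is truncated.
shallow_count=$(git rev-list --count HEAD)
is_shallow=$(git rev-parse --is-shallow-repository)

echo "upstream commits: $full_count"
echo "local commits:    $shallow_count"
echo "shallow:          $is_shallow"
```

In the real roweb3 directory, the same two checks (git rev-list --count HEAD and git rev-parse --is-shallow-repository) can be run right after the pull.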

I explored the result with tree and it was just what I needed to modify:

tree
.
└── themes
    └── ropensci
        └── static
            └── img
                └── team
                    ├── mauro-lepore.jpg
                    └── mauro-lepore-mentor.jpg
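
To try both tricks without touching a real remote, the steps above can be replayed against a small local stand-in. This is a minimal sketch under stated assumptions: the upstream repository, its file names, and the demo identity are all made up for illustration, and the pattern is shortened to img/team/mauro*.

```shell
#!/bin/sh
set -eu

# Throwaway workspace; every path below is hypothetical demo data.
work=$(mktemp -d)

# 1. Build a small stand-in for a large upstream repository.
git init --quiet "$work/upstream"
cd "$work/upstream"
git config user.email "demo@example.com"
git config user.name "Demo"
mkdir -p img/team content
echo "photo" > img/team/mauro-lepore.jpg
echo "other" > img/team/someone-else.jpg
echo "post" > content/post.md
git add .
git commit --quiet -m "initial commit"
branch=$(git symbolic-ref --short HEAD)

# 2. Repeat the steps from the post against that stand-in.
git init --quiet "$work/sparse"
cd "$work/sparse"
git remote add origin "file://$work/upstream"
git config core.sparseCheckout true
mkdir -p .git/info   # usually created by git init; kept here for safety
echo "img/team/mauro*" >> .git/info/sparse-checkout
git pull --quiet --depth=1 origin "$branch"

# 3. List what landed in the working tree, ignoring Git's own files.
files=$(find . -type f ! -path './.git/*' | sort)
echo "$files"
```

Only the file matching the sparse pattern should show up in the working tree; the other photo and the post stay out of it.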

🔗 But how large is it?

While those tricks were useful, I was still curious about the size of the repo, so I did clone it all and explored disk usage with du:

du --human-readable --max-depth=1 .
219M    ./themes
164K    ./.Rproj.user
56K     ./archetypes
628K    ./resources
168K    ./data
376M    ./.git
20K     ./static
12K     ./.github
40K     ./scripts
161M    ./content
24K     ./layouts
475M    ./public
1.3G    .
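
The du listing above comes out in directory order, which makes the biggest entries harder to spot. A possible refinement, assuming GNU coreutils (the long options below are GNU-specific), is to pipe it through sort. The sketch uses a throwaway directory with hypothetical contents rather than the real repository.

```shell
#!/bin/sh
set -eu

# Throwaway directory with hypothetical contents of different sizes.
work=$(mktemp -d)
mkdir "$work/small" "$work/large"
head -c 1024 /dev/urandom > "$work/small/a.bin"       # about 1 KiB
head -c 2097152 /dev/urandom > "$work/large/b.bin"    # about 2 MiB

cd "$work"
# Same du call as in the post, sorted so larger entries come first.
# --human-numeric-sort understands suffixes such as K, M, and G.
report=$(du --human-readable --max-depth=1 . | sort --human-numeric-sort --reverse)
echo "$report"
```

Run inside a cloned repository, the same pipeline would push entries such as ./public and ./.git toward the top of the list.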

Indeed this is much larger than the source code I typically handle. But now I know a few more Git tricks (and even more about blogging on rOpenSci :-) ).

🔗 Conclusion

If all you have is a hammer, everything looks like a nail. — Abraham Maslow

Sometimes git clone is not the right tool for the job. A sparse checkout and a shallow pull can help you get just what you need.

If you enjoy learning from videos you may search “git” on my YouTube channel or explore the playlists git, git-from-the-terminal, and git-con-la-terminal (in Spanish).
