Git users sometimes commit binary files to source control, for various reasons and often by mistake. Once a binary has been pushed, it stays in the history forever, and the repository grows large as a result.

Today we will talk about

  • how to clean binaries out of a repository
  • how to prevent new binaries from being added (using a Bitbucket hook)

Clean binary files

Find problematic files

First, we analyze the Git history to find the biggest files (not only binaries).

This post can tell you how to find binaries, or you can just use my scripts from https://github.com/ivantikal/git-tools:

# clone repo with needed scripts:
git clone https://github.com/ivantikal/git-tools.git  

# clone demo repo with binaries
git clone https://github.com/ivantikal/repo_with_binaries.git

# find 50 biggest files in the local repo
./git-tools/clean-binaries/get_biggest_files_in_history.sh -r ./repo_with_binaries/ -n 50

# print result file
cat ./git-tools/clean-binaries/get_biggest_files_in_history.sh.tmp/bigtosmall.txt

Output

01d3243047ff74f03c6ea86ed49eccec25e6f149 1457312 procexp64.exe
9e7084fd1c11dc9e305f9af81da99a11d4ff8616 479832 ADExplorer.exe
c3025e51f242dfeff9bf741f8ce88d478487ae8c 249327 AdExplorer.zip
80eed2bec3db49219feb808dd456a8353ebcecb3 117 test.txt
7c4a4f398dcfcd2bee60984136a758f2d2a2e957 53 README.md
ddc566bf5b69f362b59fc816fd73d7edbbd6fd1a 27 test.txt

Now you know which files make your repository so big.
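If you prefer not to depend on the helper scripts, plain Git can produce a similar list. The sketch below is not the script from the git-tools repository: the function name rank_blobs is made up for illustration, and it assumes its input comes from 'git cat-file --batch-check' with the format shown in the usage comment.

```shell
# Hypothetical helper, not part of git-tools.
# Reads lines of the form "<type> <sha> <size> <path>" on stdin and prints
# "<sha> <size> <path>" for the N largest blobs, biggest first.
rank_blobs() {
  count="${1:-50}"
  # Keep only blobs and drop the leading "blob " (5 characters).
  awk '$1 == "blob" { print substr($0, 6) }' |
    sort -k2,2nr |
    head -n "$count"
}

# Usage, from inside a clone:
#   git rev-list --objects --all |
#     git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
#     rank_blobs 50
```

The sizes reported this way are uncompressed object sizes in bytes, so they show which files are heavy rather than exactly how much disk space each one occupies in the pack.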

Delete files from git history

We will use a great tool, BFG Repo-Cleaner (many thanks to Roberto Tyley).

My demo repository has some binary files, and a number of them exist only in the history (these should be deleted too).

wget https://repo1.maven.org/maven2/com/madgag/bfg/1.12.15/bfg-1.12.15.jar
git clone --mirror https://github.com/ivantikal/repo_with_binaries.git
cd repo_with_binaries.git
git remote remove origin
java -jar ../bfg-1.12.15.jar --no-blob-protection --delete-files '*.{exe,zip}'
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git remote add origin https://github.com/ivantikal/repo_without_binaries

You can use this clean binaries script as a template: it clones the repository, calls the BFG tool and replaces the URL of the remote "origin".

Notice: BFG Repo-Cleaner deletes files from the Git history, which makes the new repository incompatible with the old one. The same commit may now have a different SHA-1 (ID) in the old and the new repository: every commit that contained binaries before the cleanup, and every descendant of such a commit, gets a new SHA-1.

Now you can push it to the new repository on your server.
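Before you actually push, it may be worth confirming that nothing matching the deleted patterns is still reachable. The following is a minimal sketch, assuming only *.exe and *.zip files were targeted, as in the BFG call above. The function name leftover_binaries is hypothetical, and the match is deliberately case-insensitive so that it is a little stricter than the glob.

```shell
# Hypothetical helper: reads `git rev-list --objects --all` output on stdin
# ("<sha> <path>" per line) and prints every entry whose path still ends in
# .exe or .zip. Returns non-zero when nothing matches.
leftover_binaries() {
  grep -Ei '\.(exe|zip)$'
}

# Usage, from inside the cleaned mirror clone:
#   if git rev-list --objects --all | leftover_binaries; then
#     echo "binaries are still reachable, do not push yet" >&2
#   else
#     git push --mirror origin   # one way to publish all refs to the new remote
#   fi
```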


Prevent new binaries from being added

To find binaries in a specific commit, we will use the 'git log' command with the '--numstat' option. It shows the number of added and deleted lines for each file; for a binary file it shows '-' instead:

  git log -1 --numstat --pretty="" a12a83d76a06b426bb7858070f6f4ec14cbcfb32

Output:

  -       -       AdExplorer.zip
  3       1       test.txt
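The '-' markers are easy to pick out mechanically, because numstat separates its three columns with tabs. A small sketch of such a filter follows. The name binary_paths is chosen here for illustration and is not taken from the author's scripts.

```shell
# Hypothetical filter: reads `git log --numstat` output on stdin and prints
# the paths whose added and deleted counts are both "-" (binary files).
binary_paths() {
  awk -F'\t' '$1 == "-" && $2 == "-" { print $3 }'
}

# Usage:
#   git log -1 --numstat --pretty="" <commit> | binary_paths
```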

But checking only the latest commit is not enough. We should check every new commit. How do we do this?

When the 'pre-receive' hook runs, the new commits have already been uploaded to the repository, but they are not yet part of any branch. We use this fact in our recursion:

  # a commit that no branch contains yet has just been pushed, so it is new
  contain_branches=$(git branch --contains "$sha1") || true
  [[ -z "${contain_branches}" ]] && echo "We should check this commit for binaries!"
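To make the idea concrete, here is a rough sketch of a complete pre-receive check. It is not the author's hook: instead of recursing with 'git branch', it asks 'git rev-list <new> --not --all' for the commits that no existing ref can reach. This is a swapped-in but common way to obtain the same set of new commits. All function names are hypothetical.

```shell
# Sketch only; function names are hypothetical and not from the original scripts.

# Print paths that numstat marks as binary ("-<TAB>-<TAB>path").
binary_paths() {
  awk -F'\t' '$1 == "-" && $2 == "-" { print $3 }'
}

# True when the SHA consists only of zeros, which is how Git marks a ref deletion.
is_zero_sha() {
  case "$1" in
    "" | *[!0]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Check one ref update; returns 1 if any newly pushed commit touches a binary.
check_ref_update() {
  new_sha="$1"
  ref_name="$2"
  result=0
  is_zero_sha "$new_sha" && return 0   # deletion: nothing new to inspect
  # Commits reachable from the pushed tip but from no existing ref are the new ones.
  for commit in $(git rev-list "$new_sha" --not --all); do
    found=$(git log -1 --numstat --pretty="" "$commit" | binary_paths)
    if [ -n "$found" ]; then
      echo "Binary files in commit $commit ($ref_name):" >&2
      echo "$found" >&2
      result=1
    fi
  done
  return "$result"
}

# Git feeds the hook one "<old-sha> <new-sha> <ref-name>" line per updated ref.
run_hook() {
  hook_status=0
  while read -r _old new ref; do
    check_ref_update "$new" "$ref" || hook_status=1
  done
  return "$hook_status"
}
```

A hook script built on this would end with 'run_hook' followed by 'exit $?', since a non-zero exit status from pre-receive rejects the whole push. Note that numstat also prints '-' when a binary is modified or deleted, so a stricter version might add '--diff-filter=A' to the 'git log' call to look only at added files.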

You are welcome to use the final version of the hook (two scripts are involved):

  • Find the "external-hooks" directory on your Bitbucket server

  • Add the script find_binaries.sh to the "external-hooks/lib" folder

  • Add the script pre-receive-hook.sh to the "external-hooks" folder

  • In your repository, go to Settings -> Hooks -> External Pre Receive Hook (the External Hooks Plugin must be installed), set "Executable" to "pre-receive-hook.sh" and tick the "Safe mode" checkbox

Have Fun!