1.  ChangeLog

  • Apr. 19, 2013: Minor error message improvement.
  • Mar. 20, 2013: Version 1.0.1 with format change to handle non-ASCII filenames.
  • February, 2013: Initial public release of version 1.0.

2.  Download

NOTE: If your filesystem contains non-ASCII filenames that are not encoded in UTF-8, directory checksums might not match across platforms unless you use the Python 3 version of the script (run the source code with python3). There is no binary distribution of the Python 3 version due to packaging difficulties.

3.  Rationale

When we transfer or back up a large number of files, it is difficult to verify that the files have been copied correctly. Although it is possible to compare files and directories with the original source using dedicated file/directory comparison tools, or commands such as rsync -acv, the source data is not always available and comparing two large directories can be very slow.

This script tries to address this problem by extending the standard md5sum program so that it handles directories and computes a partial checksum for large files. The MD5 checksum of a directory is generated by calculating the MD5 checksums of all its files and subdirectories, and then generating a checksum from a manifest of these values. Entries in the manifest are sorted so that the order in which files are processed does not affect the directory checksum. Because the manifest contains file size information, the choice to calculate the MD5 checksum of a large file from only 1G of its data (not necessarily the first 1G) should be safe.
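
The idea described above can be sketched in a few lines of Python. This is only an illustration, not the code used by md5sumd: the manifest layout, the function names (file_md5, sampled_md5, dir_md5), and the way pieces of a large file are selected are all assumptions made for this example.

```python
import hashlib
import os

CHUNK = 1 << 20  # read files in 1 MiB blocks


def file_md5(path):
    """MD5 of a whole file, read in blocks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest()


def sampled_md5(path, limit=1 << 30, pieces=64):
    """MD5 of at most `limit` bytes taken as evenly spaced pieces of a file.

    Assumption: the sampling scheme here is hypothetical; the source only
    says that 1G of data, not necessarily the first 1G, is used.
    """
    size = os.path.getsize(path)
    if size <= limit:
        return file_md5(path)
    digest = hashlib.md5()
    piece = limit // pieces   # bytes read at each sampling point
    stride = size // pieces   # distance between sampling points
    with open(path, "rb") as handle:
        for index in range(pieces):
            handle.seek(index * stride)
            digest.update(handle.read(piece))
    return digest.hexdigest()


def dir_md5(path):
    """MD5 of a directory, derived from a sorted manifest of its entries."""
    lines = []
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isdir(full):
            lines.append("%s\td\t%s" % (dir_md5(full), name))
        else:
            # The file size is recorded next to the (possibly partial) checksum.
            size = os.path.getsize(full)
            lines.append("%s\t%d\t%s" % (sampled_md5(full), size, name))
    # Sorting makes the result independent of the directory listing order.
    manifest = "\n".join(sorted(lines))
    return hashlib.md5(manifest.encode("utf-8", "surrogateescape")).hexdigest()
```

Recording the size alongside a sampled checksum means that truncating or appending to a large file still changes the directory checksum, even if the sampled pieces happen to be untouched. The sketch follows symbolic links, which a real tool would need to handle more carefully.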

If you are extremely impatient, you can skip the rest of this page and use the command

% md5sumd * -v | gzip > .manifest.md5.gz

to generate a fingerprint for all files under a directory and save it to the file .manifest.md5.gz, and use the command

% md5sumd -c .manifest.md5.gz

to check whether the content of the directory has been altered by file transfer, system failure, or unintentional changes.

4.  Usage

% md5sumd -h
usage: md5sumd [-h] [--version] [-c [CHECKSUM]] [-v]
               [FILE_OR_DIR [FILE_OR_DIR ...]]

A tool that calculates the MD5 checksum of files and directories, and use it
to check the integrity of these files and directories. It has a interface that
is similar to the md5sum command, with support for checksum of directories.

positional arguments:
  FILE_OR_DIR           Calculate MD5 signature of one or more files and
                        directories and print MD5 checksums to the standard
                        output.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -c [CHECKSUM], --check [CHECKSUM]
                        Check the content of one or more files and directories
                        using a file that contains the checksum of these files
                        and directories. Gippped checksum file is acceptable.
                        If a file is unspecified or is -, read from the
                        standard input.
  -v, --verbose         If specified, this program will output checksum for
                        all files when the checksum of a directory is
                        calculated. Such information will help the --check
                        command to figure out what files have been changed if
                        a directory checksum mismatch happens. This option
                        will also enable a progress bar for file scanning.

5.  Details

5.1  Comparing two directories

Let us calculate the MD5 of a directory:

% md5sumd vtools
b93f839744cd53fb87981c8254cc7511  vtools

If we copy the directory somewhere else, the signature stays the same:

% cp -r vtools ~/Temp
% md5sumd ~/Temp/vtools
b93f839744cd53fb87981c8254cc7511  ~/Temp/vtools/

If we change anything in that directory, the signature will be different:

% rm ~/Temp/vtools/*.pyc
% md5sumd ~/Temp/vtools
c71b7236b19feb1682f1c7039e5df8f2  ~/Temp/vtools/

5.2  Use directory md5 checksum to validate directory content

Now let us save the md5 checksum to a file,

% md5sumd vtools > vtools.md5
% md5sumd -c vtools.md5
vtools: OK

When we transfer the directory to another place, we can still use this command to validate its content as long as the directory name is not changed. Now, if we change the directory, and check again,

% rm -rf vtools/cache
% md5sumd --check vtools.md5
vtools: FAILED

5.3  Getting detailed information about directory changes

It can be frustrating when a directory checksum mismatch happens but you have no idea what has changed. A useful feature of the md5sumd command is that it can output detailed file-level MD5 information and later use it to determine exactly what has changed in a directory.

% md5sumd vtools -v > vtools.md5
Scanning 34366 files: 100%[====================================] 2,095,623,045 48.6M/s in 00:00:430

As you can see, the --verbose option also enables a progress bar, which can be helpful for directories that contain a large number of files. The output of this command contains much more information. For example, the total_num_files and total_filesize columns of the listing below report 1,843,428 files with a total size of 28,615,195,835 under this directory.

% head -5 vtools.md5
2efce10e113804fc8a6b4e81ffd54f2e  vtools
## MD5	type	num_files	num_dirs	filesize	total_num_files	total_filesize	name
#2efce10e113804fc8a6b4e81ffd54f2e	d	34366	2760	2095623045	1843428	28615195835	vtools
#d75dc7768044f85895001913ae2a19b1	-	1	0	191368	1	191368	vtools/MANIFEST
#80e0735f4483d04b6cd28cff95b9b28c	-	1	0	4260	1	4260	vtools/MANIFEST.in
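
The tab-separated detail lines shown above are easy to load into a program. The following is a minimal sketch that assumes the eight columns named in the ## header line; the Entry type and the parse_manifest function are hypothetical names introduced for this example and are not part of md5sumd.

```python
from collections import namedtuple

# Field names follow the "##" header line of the verbose output.
Entry = namedtuple(
    "Entry",
    "md5 type num_files num_dirs filesize total_num_files total_filesize name",
)


def parse_manifest(lines):
    """Return {name: Entry} for the '#'-prefixed detail lines of a listing."""
    entries = {}
    for line in lines:
        line = line.rstrip("\n")
        # Skip the summary line ("<md5>  <name>") and the "##" column header.
        if not line.startswith("#") or line.startswith("##"):
            continue
        # Split at most 7 times so a tab inside the name is preserved.
        fields = line[1:].split("\t", 7)
        if len(fields) != 8:
            continue
        counts = [int(value) for value in fields[2:7]]
        entries[fields[7]] = Entry(fields[0], fields[1], *(counts + [fields[7]]))
    return entries
```

With the five lines shown above as input, parse_manifest returns three entries keyed by vtools, vtools/MANIFEST, and vtools/MANIFEST.in.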

Then, if we change the directory a little bit and check it with the --check option,

% rm -f vtools/*pyc
% rm vtools/source/*temp
% md5sumd -c vtools.md5
vtools/source: directory modified.
vtools/source/cgatools_wrap_py3.cpp_temp: file removed.
vtools/source/cgatools_py3.py_temp: file removed.
vtools/source/assoTests_wrap_py3.cpp_temp: file removed.
vtools/source/vt_sqlite3_py3.py_temp: file removed.
vtools/setup.pyc: file removed.
vtools/source/assoTests_py3.py_temp: file removed.
vtools: FAILED
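
A report like the one above can be thought of as a comparison between the saved file-level checksums and the checksums of the current directory content. The sketch below illustrates that comparison on two plain dictionaries. The function name report_changes is hypothetical, and the "file added." and "file modified." messages are assumed wording; only "file removed." and "directory modified." appear in the output above.

```python
def report_changes(saved, current):
    """Compare two {name: md5} mappings and describe every difference."""
    messages = []
    for name in sorted(set(saved) | set(current)):
        if name not in current:
            messages.append("%s: file removed." % name)
        elif name not in saved:
            messages.append("%s: file added." % name)       # assumed wording
        elif saved[name] != current[name]:
            messages.append("%s: file modified." % name)    # assumed wording
    return messages


saved = {"vtools/setup.py": "aaa", "vtools/setup.pyc": "bbb"}
current = {"vtools/setup.py": "aaa"}
print(report_changes(saved, current))
# prints ['vtools/setup.pyc: file removed.']
```

This is why the --verbose manifest matters: without the file-level entries there is nothing to compare against, and only the overall FAILED status can be reported.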

5.4  Working with multiple files and directories

Although the main strength of md5sumd is its ability to calculate directory MD5 checksums, it works well with files too. For example, we can generate MD5 checksums for all files and directories under a directory using the command:

% cd vtools
% md5sumd -v * > vtools.md5
Scanning 65 files under annotation: 100%[============================] 194,919 17.0M/s in 00:00:000
Scanning 32123 files under boost_1_49_0: 100%[===================] 281,585,238 16.4M/s in 00:00:170
Scanning 700 files under build: 100%[============================] 180,236,646 97.0M/s in 00:00:010
Scanning 38 files under cgatools: 100%[===============================] 205,326 7.0M/s in 00:00:000
Scanning 10 files under dist: 100%[=============================] 103,912,176 265.9M/s in 00:00:000
Scanning 15 files under format: 100%[==================================] 33,932 5.0M/s in 00:00:000
Scanning 485 files under gsl: 100%[=================================] 1,758,005 7.2M/s in 00:00:000
Scanning 28 files under libplinkio: 100%[=============================] 138,531 9.9M/s in 00:00:000
Scanning 659 files under pyinstaller: 100%[========================] 9,993,226 33.3M/s in 00:00:000
Scanning 39 files under source: 100%[=============================] 2,637,686 156.2M/s in 00:00:000
Scanning 43 files under sqlite: 100%[=============================] 5,511,571 188.0M/s in 00:00:000
Scanning 121 files under test: 100%[==========================] 1,507,339,718 307.4M/s in 00:00:040

If anything has been changed, we can find out what changed using the command

% rm test/*DB*
% md5sumd --check vtools.md5
MANIFEST: OK
MANIFEST.in: OK
MANIFEST_local.txt: OK
README: OK
annotation: OK
boost_1_49_0: OK
build: OK
build_executable.py: OK
call_variants.py: OK
cgatools: OK
code_style.cfg: OK
dist: OK
format: OK
gsl: OK
libplinkio: OK
manage_resource.py: OK
pyinstaller: OK
release.py: OK
setup.py: OK
source: OK
sqlite: OK
test/dbSNP.DB: file removed.
test/gwasCatalog.DB: file removed.
test/evs.DB: file removed.
test/testNSFP-1.1_0.DB.gz: file removed.
test/testThousandGenomes.DB: file removed.
test/evs-hg19_20111107.DB.gz: file removed.
test/dbSNP.DB-journal: file removed.
test/evs-hg19_20111107.DB: file removed.
test/testNSFP.DB: file removed.
test: FAILED
vtools: OK
vtools.md5: FAILED
vtools.spec: OK
vtools_report: OK
vtools_report.log: OK
vtools_report.spec: OK

5.5  Support for gzipped checksum file

The md5sumd command can read directly from a gzipped checksum file. This is useful because the checksum file can get large when the --verbose option is used to list checksums of all files and directories under a large directory. For example, you can generate a checksum file using the command

% md5sumd vtools gsl -v | gzip > checksum.gz

and check it directly using the command

% md5sumd --check checksum.gz
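
One way a program can accept both plain and gzipped checksum files is to look at the first two bytes of the file, which are 1f 8b for gzip data. The sketch below uses Python 3 and only shows the technique; whether md5sumd detects gzipped files this way is not stated here, and open_checksum_file is a hypothetical name.

```python
import gzip


def open_checksum_file(path):
    """Open a checksum file as text, decompressing it if it is gzipped."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

Checking the content rather than the file extension means that a gzipped file named checksum.md5, or a plain file named checksum.gz, is still read correctly.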

5.6  Check md5 checksum read from standard input

The md5sumd --check command reads from standard input if no filename is specified (this requires Python 2.7 or higher) or if the filename is -. For example,

% md5sumd vtools gsl  | md5sumd -c -
vtools: OK
gsl: OK

Note that md5sumd -c - does not accept a gzipped stream directly, so if you have a gzipped manifest, you will need to pipe it through gzip -d before sending it to the md5sumd -c - command.
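
The behavior of --check with a missing filename or - can be modelled with the standard argparse module, where nargs="?" makes the option's argument optional and const supplies the value used when it is omitted. The parser below is a sketch that mirrors the usage message in section 4; it is an assumption about how such an interface could be declared, not the script's actual source, and build_parser is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the md5sumd usage message (sketch only)."""
    parser = argparse.ArgumentParser(prog="md5sumd")
    # nargs="?" makes CHECKSUM optional; const="-" is used when it is omitted,
    # so "-c" and "-c -" both mean "read from standard input".
    parser.add_argument("-c", "--check", nargs="?", const="-", metavar="CHECKSUM")
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("items", nargs="*", metavar="FILE_OR_DIR")
    return parser


args = build_parser().parse_args(["-c"])
print(args.check)
# prints -
```

A caller would then read checksum lines from sys.stdin when args.check is "-", and from the named file otherwise.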