- 1. ChangeLog
- 2. Download
- 3. Rationale
- 4. Usage
- 5. Details
- Apr. 19, 2013: Minor error message improvement.
- Mar. 20, 2013: Version 1.0.1 with format change to handle non-ASCII filenames.
- February, 2013: Initial public release of version 1.0.
- Source code: md5sumd (use python or python3 to run it)
- Windows executable: md5sumd.exe
- MacOSX executable: md5sumd_macosx.zip (decompress to get a binary file)
NOTE: If your filesystem contains non-ASCII filenames that are not encoded in UTF-8, the directory checksum might not match across platforms unless you use the Python 3 version of the script (run the source code with python3). There is no binary distribution of the Python 3 version because of packaging difficulties.
When we transfer or back up a large number of files, it is difficult to verify that the files have been copied correctly. Although it is possible to compare files and directories with the original source using dedicated file/directory comparison tools, or commands such as
rsync -acv, the source data is not always available, and comparing two large directories can be very slow.
This script tries to address this problem by extending the standard
md5sum program to handle directories and to produce partial checksums when handling large files. The MD5 checksum of a directory is generated by calculating the MD5 checksums of all its files and subdirectories, and then computing a checksum of a manifest built from these values. Entries in the manifest are sorted so that the order in which files are processed does not affect the directory checksum. Because the manifest also contains file size information, calculating the MD5 checksum of a large file from only 1G of its data (not necessarily the first 1G) should be safe.
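The idea described above can be illustrated with a minimal Python sketch. It is not the md5sumd source code: the manifest layout, the function names file_md5 and directory_md5, and the way large files are sampled (16 evenly spaced pieces) are all assumptions made for illustration, so the checksums it produces will not match those printed by md5sumd. It also follows symbolic links without any loop protection.

```python
import hashlib
import os

CHUNK = 1 << 20          # read files 1 MiB at a time
SAMPLE_LIMIT = 1 << 30   # hash at most about 1 GiB of any single file


def file_md5(path, limit=SAMPLE_LIMIT, pieces=16):
    """Return (md5, size); files larger than `limit` are only sampled.

    The sampling scheme (evenly spaced pieces, pieces >= 2) is an
    assumption, not the scheme used by md5sumd.
    """
    size = os.path.getsize(path)
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        if size <= limit:
            for block in iter(lambda: f.read(CHUNK), b""):
                md5.update(block)
        else:
            piece = limit // pieces
            for i in range(pieces):
                # Spread the pieces from the start to the end of the file.
                f.seek(i * (size - piece) // (pieces - 1))
                remaining = piece
                while remaining > 0:
                    block = f.read(min(CHUNK, remaining))
                    if not block:
                        break
                    md5.update(block)
                    remaining -= len(block)
    return md5.hexdigest(), size


def directory_md5(path):
    """MD5 of a directory, computed from a sorted manifest of its entries."""
    lines = []
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isdir(full):
            lines.append("%s d %s" % (directory_md5(full), name))
        else:
            digest, size = file_md5(full)
            # The file size is recorded so a sampled checksum is less risky.
            lines.append("%s - %d %s" % (digest, size, name))
    # Sorting makes the result independent of the listing order.
    manifest = "\n".join(sorted(lines))
    data = manifest.encode("utf-8", errors="surrogateescape")
    return hashlib.md5(data).hexdigest()
```

Because the manifest in this sketch records only the names inside the directory, not the name of the directory itself, a copy placed under a different parent path yields the same checksum.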
If you are extremely impatient, you can skip the rest of this page and use the command
% md5sumd * -v | gzip > .manifest.md5.gz
to generate a fingerprint for all files under a directory and save it to the file .manifest.md5.gz, and use the command
% md5sumd -c .manifest.md5.gz
to check whether the content of the directory has been altered by file transfer, system failure, or unintentional changes.
% md5sumd -h
usage: md5sumd [-h] [--version] [-c [CHECKSUM]] [-v]
               [FILE_OR_DIR [FILE_OR_DIR ...]]

A tool that calculates the MD5 checksum of files and directories, and use it
to check the integrity of these files and directories. It has a interface that
is similar to the md5sum command, with support for checksum of directories.

positional arguments:
  FILE_OR_DIR           Calculate MD5 signature of one or more files and
                        directories and print MD5 checksums to the standard
                        output.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -c [CHECKSUM], --check [CHECKSUM]
                        Check the content of one or more files and directories
                        using a file that contains the checksum of these files
                        and directories. Gippped checksum file is acceptable.
                        If a file is unspecified or is -, read from the
                        standard input.
  -v, --verbose         If specified, this program will output checksum for
                        all files when the checksum of a directory is
                        calculated. Such information will help the --check
                        command to figure out what files have been changed if
                        a directory checksum mismatch happens. This option
                        will also enable a progress bar for file scanning.
Let us calculate the MD5 of a directory:
% md5sumd vtools
If we copy the directory to somewhere else, we can see that the signature is still the same:
% cp -r vtools ~/Temp
% md5sumd ~/Temp/vtools
If we change anything in that directory, the signature will be different:
% rm ~/Temp/vtools/*.pyc
% md5sumd ~/Temp/vtools
Now let us save the MD5 checksum to a file and check the directory against it:
% md5sumd vtools > vtools.md5
% md5sumd -c vtools.md5
When we transfer the directory to another place, we can still use this command to validate its content as long as the directory name is not changed. Now, if we change the directory, and check again,
% rm -rf vtools/cache
% md5sumd --check vtools.md5
It can be frustrating when a directory checksum mismatch happens but you have no idea what has changed. A useful feature of the
md5sumd command is that it can output detailed file-level MD5 information and later use it to figure out exactly what has changed in a directory.
% md5sumd vtools -v > vtools.md5
Scanning 34366 files: 100%[====================================] 2,095,623,045 48.6M/s in 00:00:430
As you can see, the
--verbose option also enables a progress bar, which can be helpful for directories that contain a large number of files. The output of this command has a lot more information, and it is interesting to see that there are 1,843,428 files with a total size of 28,615,195,835 bytes under this directory.
% head -5 vtools.md5
2efce10e113804fc8a6b4e81ffd54f2e vtools
## MD5 type num_files num_dirs filesize total_num_files total_filesize name
#2efce10e113804fc8a6b4e81ffd54f2e d 34366 2760 2095623045 1843428 28615195835 vtools
#d75dc7768044f85895001913ae2a19b1 - 1 0 191368 1 191368 vtools/MANIFEST
#80e0735f4483d04b6cd28cff95b9b28c - 1 0 4260 1 4260 vtools/MANIFEST.in
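The detail lines of a verbose checksum file can be read back with a few lines of Python. The sketch below only assumes what the header line shown above states: each detail line starts with a single # and carries eight whitespace-separated fields, with the name last. The names Entry and parse_detail_line are hypothetical and are not part of md5sumd.

```python
from collections import namedtuple

# Field names are taken from the "## MD5 type ..." header line.
Entry = namedtuple(
    "Entry",
    "md5 type num_files num_dirs filesize total_num_files total_filesize name",
)


def parse_detail_line(line):
    """Parse one '#'-prefixed detail line; return None for any other line."""
    if not line.startswith("#") or line.startswith("##"):
        return None
    # Split at most 7 times so that a name containing spaces stays whole.
    fields = line[1:].rstrip("\n").split(None, 7)
    if len(fields) != 8:
        return None
    numbers = [int(x) for x in fields[2:7]]
    return Entry(fields[0], fields[1], *numbers, name=fields[7])
```

For example, parsing the vtools/MANIFEST line from the listing gives an entry whose type is - and whose filesize is 191368.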
Then, if we change the directory a little bit and check it with the -c option,
% rm -f vtools/*pyc
% rm vtools/source/*temp
% md5sumd -c vtools.md5
vtools/source: directory modified.
vtools/source/cgatools_wrap_py3.cpp_temp: file removed.
vtools/source/cgatools_py3.py_temp: file removed.
vtools/source/assoTests_wrap_py3.cpp_temp: file removed.
vtools/source/vt_sqlite3_py3.py_temp: file removed.
vtools/setup.pyc: file removed.
vtools/source/assoTests_py3.py_temp: file removed.
vtools: FAILED
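A report like the one above can be produced by comparing the saved file-level checksums with freshly computed ones. The following Python sketch shows the general idea on two plain dictionaries that map names to MD5 strings. The function compare_manifests is hypothetical, and only the "file removed." wording comes from the output above; the "file added." and "file modified." messages are assumptions.

```python
def compare_manifests(saved, current):
    """Compare two {name: md5} mappings and describe each difference."""
    messages = []
    for name in sorted(set(saved) | set(current)):
        if name not in current:
            messages.append("%s: file removed." % name)
        elif name not in saved:
            # Hypothetical message, not taken from md5sumd output.
            messages.append("%s: file added." % name)
        elif saved[name] != current[name]:
            # Hypothetical message, not taken from md5sumd output.
            messages.append("%s: file modified." % name)
    return messages
```

Without the file-level entries written by --verbose, only the directory checksum is available, so a mismatch could be detected but not explained.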
Although the main strength of
md5sumd is its ability to calculate the MD5 checksum of a directory, it works well with files too. For example, we can generate an MD5 checksum for each file and directory under a directory using the command:
% cd vtools
% md5sumd -v * > vtools.md5
Scanning 65 files under annotation: 100%[============================] 194,919 17.0M/s in 00:00:000
Scanning 32123 files under boost_1_49_0: 100%[===================] 281,585,238 16.4M/s in 00:00:170
Scanning 700 files under build: 100%[============================] 180,236,646 97.0M/s in 00:00:010
Scanning 38 files under cgatools: 100%[===============================] 205,326 7.0M/s in 00:00:000
Scanning 10 files under dist: 100%[=============================] 103,912,176 265.9M/s in 00:00:000
Scanning 15 files under format: 100%[==================================] 33,932 5.0M/s in 00:00:000
Scanning 485 files under gsl: 100%[=================================] 1,758,005 7.2M/s in 00:00:000
Scanning 28 files under libplinkio: 100%[=============================] 138,531 9.9M/s in 00:00:000
Scanning 659 files under pyinstaller: 100%[========================] 9,993,226 33.3M/s in 00:00:000
Scanning 39 files under source: 100%[=============================] 2,637,686 156.2M/s in 00:00:000
Scanning 43 files under sqlite: 100%[=============================] 5,511,571 188.0M/s in 00:00:000
Scanning 121 files under test: 100%[==========================] 1,507,339,718 307.4M/s in 00:00:040
If anything has been changed, we can find out what changed using the command
% rm test/*DB*
% md5sumd --check vtools.md5
MANIFEST: OK
MANIFEST.in: OK
MANIFEST_local.txt: OK
README: OK
annotation: OK
boost_1_49_0: OK
build: OK
build_executable.py: OK
call_variants.py: OK
cgatools: OK
code_style.cfg: OK
dist: OK
format: OK
gsl: OK
libplinkio: OK
manage_resource.py: OK
pyinstaller: OK
release.py: OK
setup.py: OK
source: OK
sqlite: OK
test/dbSNP.DB: file removed.
test/gwasCatalog.DB: file removed.
test/evs.DB: file removed.
test/testNSFP-1.1_0.DB.gz: file removed.
test/testThousandGenomes.DB: file removed.
test/evs-hg19_20111107.DB.gz: file removed.
test/dbSNP.DB-journal: file removed.
test/evs-hg19_20111107.DB: file removed.
test/testNSFP.DB: file removed.
test: FAILED
vtools: OK
vtools.md5: FAILED
vtools.spec: OK
vtools_report: OK
vtools_report.log: OK
vtools_report.spec: OK
The md5sumd command can read directly from a gzipped checksum file. This is useful because the checksum file can get large when the
--verbose option is used to list checksums of all files and directories under a large directory. For example, you can generate a checksum file using the command
% md5sumd vtools gsl -v | gzip > checksum.gz
and check it directly using the command
% md5sumd --check checksum.gz
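One common way for a program to accept both plain and gzipped files is to look at the first two bytes of the file, which are 0x1f 0x8b for gzip data. The Python sketch below shows that technique. It is an illustration under assumptions: md5sumd may detect gzipped files differently (for example by file extension), and the function name open_checksum_file and the UTF-8 encoding are hypothetical choices.

```python
import gzip


def open_checksum_file(path):
    """Open a checksum file as text, decompressing it if it is gzipped."""
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

With this approach the caller does not need to care whether the checksum file was saved as checksum.gz or as plain text.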
The md5sumd --check command reads from standard input if no filename is specified (Python 2.7 or higher) or if the filename
- is specified.
% md5sumd vtools gsl | md5sumd -c -
vtools: OK
gsl: OK
md5sumd -c - does not accept a gzipped stream directly, so if you have a gzipped manifest, you will need to pipe it through
gzip -d before sending it to the
md5sumd -c - command.
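The behavior of the -c [CHECKSUM] option, where a missing filename or - means standard input, can be expressed with argparse using an optional argument value. The sketch below is an assumption about how such an option could be declared, not a copy of the md5sumd source; build_parser, checksum_lines, and the program name md5sumd-sketch are hypothetical.

```python
import argparse
import sys


def build_parser():
    parser = argparse.ArgumentParser(prog="md5sumd-sketch")
    # nargs="?" makes the filename optional; const is used for a bare "-c".
    parser.add_argument("-c", "--check", nargs="?", const="-",
                        metavar="CHECKSUM")
    return parser


def checksum_lines(check, stdin=None):
    """Return the lines of the checksum source named by the --check value."""
    if check == "-":
        stream = sys.stdin if stdin is None else stdin
        return stream.read().splitlines()
    with open(check) as f:
        return f.read().splitlines()
```

In this sketch, both a bare -c and an explicit -c - produce the value "-", so the rest of the program only has to handle one case for standard input.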