ExactFile: File Verification for Backups and Archives
Studies show that 70% of businesses that suffer a major data loss never recover and go out of business. For families storing precious memories in photos and videos, along with confidential records, the loss of irreplaceable files causes real emotional distress.
A comprehensive data backup plan includes file verification to detect problems early so they can be corrected before data loss or use of incorrect data occurs.
One method of file verification is to compare each byte of the master file with the backup copy. With a large amount of backup data, a complete byte-by-byte compare can take a long time. It may be part of the verification process during the initial backup, but more efficient methods are often used afterward. Always use backup software that verifies files after the backup is made, so errors are detected at the time of the initial backup.
Some data archive procedures only check data integrity by randomly inspecting a small sample of stored files. But manual inspection of a few files will not detect all changes in data, such as a single character changed in a text document, or one or more pixels changed in a photo file. Sampling is useful for basic testing of file and device operation, but it does not verify the integrity of all data, in all files, in the archive.
File Checksum or Hash Value
Rather than doing a full file compare or a sample manual inspection, a file may be read and a checksum or hash value calculated and saved for it. When the file is copied to the backup location, a hash value may be calculated and saved in the destination directory, or elsewhere. Then the hash of the source file is compared against the hash of the destination file. If even a single bit or byte of data is incorrect, the error will be detected, and the system operator can investigate and correct the problem. This detects errors during data transfers (data in motion) from source to destination.
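The source-versus-destination hash comparison described above can be sketched in a few lines of Python using the standard-library `hashlib` module. The function names `file_sha256` and `verify_copy` are illustrative, not part of ExactFile:

```python
import hashlib

def file_sha256(path):
    """Compute the SHA-256 hash of a file, reading it in 1 MiB chunks
    so large backup files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_copy(source, destination):
    """Return True if the source and destination files hash identically."""
    return file_sha256(source) == file_sha256(destination)
```

Any single changed bit in the destination file changes its hash, so `verify_copy` returns False for even the smallest corruption.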
Once a data transfer is successful, stored data (data at rest) must be tested periodically for data integrity. Data loss might occur due to equipment, component, or media error or failure, virus, malware, hacker tampering, power surge, noise, radiation, magnetic field, ultraviolet light, static electricity, bit rot, organic dye decay on optical disc (CD, DVD, Blu-ray), scratches or fingerprints on optical disc, flash memory storage cell discharge, moisture, spills, temperature, shock, bad connection, accidental file editing, overwriting, or deletion.
Many errors are soft errors (temporary errors). Once they are detected, the file may be retrieved from a known good copy and rewritten to the destination and verified. If the error is corrected by the rewrite, it was a soft error. This cleaning process is known as data scrubbing. If the error remains, then it is a hard error (permanent error), and the storage device or media needs to be repaired or replaced.
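The detect, rewrite, and re-verify cycle can be expressed as a short sketch. This is an illustration of the data scrubbing idea, not ExactFile's implementation; the `sha256` and `scrub` names are hypothetical:

```python
import hashlib, shutil

def sha256(path):
    """Hash a file in chunks with SHA-256."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def scrub(known_good, suspect, expected_hash):
    """Check a stored copy; if corrupt, rewrite it from a known-good
    copy and re-verify. Returns "ok", "soft" (corrected by the
    rewrite), or "hard" (error persists; replace the media)."""
    if sha256(suspect) == expected_hash:
        return "ok"
    shutil.copyfile(known_good, suspect)   # rewrite from the good copy
    if sha256(suspect) == expected_hash:
        return "soft"                      # temporary error, now corrected
    return "hard"                          # permanent error on this media
```

In practice the "hard" result is the signal to stop trusting the storage device and migrate the archive.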
Due to the avalanche effect in the algorithm, a hash value calculated on a file that has had even a single bit or byte changed will show a major change across many digits of the hash value, so minor changes in data do not go undetected. Hash values are one-way: a given data file will produce a specific hash value, or message digest, but a hash value cannot be used to re-create the original data file. The hash is only for error detection, not error correction.
Security vulnerabilities found in some older, weaker hash algorithms have shown that different data files or messages can sometimes generate the same hash value, known as a collision. Newer, stronger hash algorithms have high collision resistance, making the calculation of a colliding message infeasible with current computing technology.
ExactFile can generate and verify hash values on files, folders and sub-folders. Choose Create Digest from the menu and then select the target directory to generate a hash value for each file and store the results in a readable text hash digest file in a specified directory.
Once the hash digest file is created, choose Test Digest to scan each file listed in the digest file, calculate a hash value, and compare the calculated hash value with the stored expected hash value for each file.
In verbose mode, it generates a report listing the number of files tested, plus the hash value, file name, and test result (success or failure) for each file. With verbose mode turned off, only detected errors are reported.
The file scanning and hash verification process is quick and can take advantage of multi-core processors. With a dual-core, quad-core, or other multi-core processor, parallel processing speeds up the job.
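Parallel hashing is straightforward to sketch with the standard library. This illustration (names `hash_one` and `hash_folder` are hypothetical, not ExactFile's) uses a thread pool; Python's `hashlib` releases the GIL while hashing large buffers, so threads can keep multiple cores busy:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def hash_one(path):
    """Hash one file; returns (path, hex digest)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)  # hashlib releases the GIL on large updates
    return str(path), h.hexdigest()

def hash_folder(folder, workers=4):
    """Hash every file under folder in parallel; returns {path: hash}."""
    files = [p for p in Path(folder).rglob("*") if p.is_file()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(hash_one, files))
```

On a spinning disk the bottleneck is usually I/O rather than CPU, so the speedup from extra workers is largest on SSDs and fast arrays.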
Fixity in Data Backups and Archives
In a data backup or archive, it is important to pay attention to the total number of files, the file names, file locations, file sizes, and the data content of the files. For permanent data storage, this is known as the fixity of a file or archive. Make sure no files or folders have been added, deleted, modified, moved, or renamed.
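A fixity check can be sketched as a manifest recording name, size, and hash for every file, compared between two points in time. This is an illustrative sketch with hypothetical names (`fixity_manifest`, `compare_fixity`), not a feature of ExactFile:

```python
import hashlib
from pathlib import Path

def fixity_manifest(folder):
    """Record name, size, and SHA-256 for every file under folder."""
    folder = Path(folder)
    entries = {}
    for p in sorted(folder.rglob("*")):
        if p.is_file():
            entries[p.relative_to(folder).as_posix()] = {
                "size": p.stat().st_size,
                "sha256": hashlib.sha256(p.read_bytes()).hexdigest(),
            }
    return {"file_count": len(entries), "files": entries}

def compare_fixity(old, new):
    """Report files added, deleted, or modified between two manifests."""
    old_f, new_f = old["files"], new["files"]
    added = sorted(set(new_f) - set(old_f))
    deleted = sorted(set(old_f) - set(new_f))
    modified = sorted(n for n in old_f.keys() & new_f.keys()
                      if old_f[n] != new_f[n])
    return added, deleted, modified
```

A renamed or moved file shows up as one deletion plus one addition, which is exactly the kind of silent change a fixity check is meant to surface.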
Choose a Strong Hash Algorithm
ExactFile supports many types of hash values. MD5 (Merkle–Damgård Message Digest algorithm 5, created in 1991 by Professor Ronald Rivest of MIT) is a popular, fast, 128-bit hash value, but has known security vulnerabilities. It was succeeded by SHA-1 (Secure Hash Algorithm 1, 160 bits), developed by the NSA in 1995, but SHA-1 also has security weaknesses and is being retired from most government uses.
SHA-2, published by NIST in 2002, includes SHA-256 (256 bit) and SHA-512 (512 bit). Since 2014, federal agencies have been required to use hash algorithms with a minimum security strength of 112 bits, per the Secure Hash Standard found in NIST Federal Information Processing Standard (FIPS) 180-4. For very strong data integrity, businesses can choose to use the same SHA-2 hash algorithms used by federal agencies.
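The digest sizes of the algorithms discussed above can be compared directly with Python's `hashlib`, which implements all of them (the sample input string is arbitrary):

```python
import hashlib

data = b"backup verification test"
for name in ("md5", "sha1", "sha256", "sha512"):
    digest = hashlib.new(name, data).hexdigest()
    # each hex digit encodes 4 bits, so length * 4 = digest size in bits
    print(f"{name:>6}: {len(digest) * 4} bits  {digest}")
```

A longer digest alone does not make an algorithm secure, but among these, SHA-256 and SHA-512 are the ones with no known practical collision attacks.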
ExactFile is freeware, written by Brandon Staggs. Download a copy from StudyLamp Software LLC at http://www.exactfile.com/downloads/
To learn how to use it, create a test folder with a few text files in it. Run ExactFile on the test folder and study the generated hash digest file. Then run a test using the digest file to scan the test files and check the report. There should be no errors. It is easy to use and there is built-in help information.
Scheduling File Verification
Running ExactFile after copying files to backup storage detects errors and gives confidence that the backup was successful. ExactFile does not include an automatic scheduler, so set a calendar reminder to run the file verification process manually on a schedule, such as once per month, quarter, or year. A single click in the Test Digest window quickly tests an entire folder and its sub-folders.
Benefits of an Active Data Archive
Testing backups periodically over the long term is a best practice for an archivist. It creates an active archive with verified, reliable data integrity. If soft errors begin to increase, it is time to replace the storage drive or media before it fails permanently with hard errors. An active archive prevents the surprise of discovering corrupt data or missing files during a crisis, and allows the system operator to find and correct problems during a scheduled archive maintenance check.
A passive archive, with no periodic checking, allows undetected data errors to remain dormant and accumulate, and may result in the retrieval, copying and use of bad data, or the total loss of important files.
ExactFile is a free, easy-to-use, and valuable file verification tool.
Update: Another free choice for file verification, with an automatic scheduler, email notification, and fixity checking, is Fixity software from AudioVisual Preservation Solutions, Inc.