Data corruption checking - Flag clarification

To check that my file backups are free from data corruption (caused by bad RAM, for example), what is the best practice for backup flags and verification flags to make sure the files are correct and the checksums are consistent?
For example, should I use --force-hash 100, or is 0 faster but with less integrity checking?

	kopia snapshot create --force-hash 100

I run the verify command with the flag --verify-files-percent 100

	kopia snapshot verify --verify-files-percent 100 --all-sources

You should set --force-hash to some small value (say 2), which means that, statistically, each source file gets rehashed about once every 50 snapshots on average.
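To put rough numbers on that, here is a small back-of-the-envelope sketch. It assumes each snapshot picks files independently at random with the given percentage, which is an assumption about the intended behavior and not a statement about how kopia implements it. The function names are made up for illustration.

```python
def miss_probability(percent: float, snapshots: int) -> float:
    """Chance that one particular file is never picked for rehashing
    across `snapshots` runs, assuming independent random selection."""
    p = percent / 100.0
    return (1.0 - p) ** snapshots


def expected_snapshots_until_rehash(percent: float) -> float:
    """Mean number of snapshots until a given file is picked
    (mean of a geometric distribution, 1/p)."""
    return 100.0 / percent


print("expected snapshots until a file is rehashed:",
      expected_snapshots_until_rehash(2))
for n in (50, 100, 200):
    # Probability that a given file has still not been rehashed after n runs.
    print(f"after {n} snapshots: {miss_probability(2, n):.1%} never rehashed")
```

So under that assumption "every 50 snapshots" is an average: after exactly 50 runs at 2 percent, a noticeable fraction of files will still not have been rehashed, and the fraction keeps shrinking with more runs.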

Verify-files-percent is similar: set it to some small value (say 2), and each time you run verify, it will re-download a small portion of the repository.

Alternatively you can run snapshot verify infrequently and use 100 percent, but that’s more bursty and could be quite taxing on your internet pipe.

Now that I’ve said that, I think there’s a bug with --force-hash: it’s unnecessarily deterministic and will only work correctly when it’s set to 100. Setting it to a lower value (say 50) will re-hash 50% of files, but it will be the same 50% on each run, which will leave the other 50% permanently unverified.
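To illustrate why a deterministic selection is a problem, here is a toy simulation. It is not kopia's code: pick_by_path is a hypothetical stand-in for any rule where the decision depends only on the file path, and the file names are invented.

```python
import hashlib
import random


def pick_by_path(path: str, percent: float) -> bool:
    """Hypothetical deterministic rule: the decision depends only on the path,
    so the same files are picked on every run."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100 < percent


def make_random_picker(seed: int):
    """Independent random rule: every run draws a fresh decision per file."""
    rng = random.Random(seed)

    def pick(path: str, percent: float) -> bool:
        return rng.random() * 100 < percent

    return pick


def coverage(paths, percent, runs, pick) -> float:
    """Fraction of files rehashed at least once over `runs` snapshots."""
    seen = set()
    for _ in range(runs):
        seen.update(p for p in paths if pick(p, percent))
    return len(seen) / len(paths)


paths = [f"/data/file{i}.bin" for i in range(1000)]
print("path-based after 20 runs:", coverage(paths, 50, 20, pick_by_path))
print("random after 20 runs:    ", coverage(paths, 50, 20, make_random_picker(seed=1)))
```

With the path-based rule, coverage stays at about half the files no matter how many runs you do. With the random rule it approaches all of them.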

I filed a GitHub issue.

Okay, that seems very proficient. I have previously tried borg and restic which only have 100% hash checking on the backup side although they do have partial checking on the verification side. Even then the data corruption I found was not fixed so I gave up.

@ted To clarify a bit, snapshot create --force-hash <pct> and snapshot verify --verify-files-percent <pct> have slightly different purposes and behavior.

kopia snapshot verify --verify-files-percent 100 --all-sources does what you’d expect. It verifies a portion of the snapshot contents in the repository, or all in this case.

kopia snapshot create --force-hash 100 is NOT about verifying the contents that got backed up. Instead, it controls whether kopia uses file contents or just metadata (modification time, size, …) to determine whether a file has changed since the previous snapshot. Assume a file f has the same metadata attributes as a file f' with the same path name in the prior snapshot. When --force-hash 100 is specified, the contents of f are hashed and compared against the hash (checksum) of f' in the prior snapshot. When --force-hash 0 is specified, kopia assumes that the contents have not changed, since f's metadata has not changed. Choosing some other value, for example 50, means that the probability of checking whether the actual contents of f have changed is 0.5. In other words, --force-hash 100 is similar to setting --checksum in rsync, and --force-hash 0 is similar to not specifying --checksum.
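The decision described above can be sketched roughly like this. It is a simplified model and not kopia's actual implementation: PreviousEntry and has_changed are hypothetical names, SHA-256 stands in for whatever hash the repository uses, and only size and modification time are compared as metadata.

```python
import hashlib
import random
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PreviousEntry:
    """What the prior snapshot recorded for a path (hypothetical structure)."""
    size: int
    mtime: float
    content_hash: str


def has_changed(size: int, mtime: float, read_bytes: Callable[[], bytes],
                prev: Optional[PreviousEntry], force_hash_percent: float,
                rng: Callable[[], float] = random.random) -> bool:
    """Decide whether file f differs from f' in the prior snapshot."""
    if prev is None:
        return True  # no f' in the prior snapshot: treat as new
    if (size, mtime) != (prev.size, prev.mtime):
        return True  # metadata differs: treat as changed
    # Metadata matches. With probability force_hash_percent / 100,
    # read and hash the contents anyway; otherwise trust the metadata.
    if rng() * 100 >= force_hash_percent:
        return False
    return hashlib.sha256(read_bytes()).hexdigest() != prev.content_hash


prev = PreviousEntry(size=5, mtime=1.0,
                     content_hash=hashlib.sha256(b"hello").hexdigest())
# Same size and mtime, but the bytes on disk are now different.
print(has_changed(5, 1.0, lambda: b"jello", prev, force_hash_percent=0))
print(has_changed(5, 1.0, lambda: b"jello", prev, force_hash_percent=100))
```

In this model, a file whose bytes differ but whose size and modification time are the same goes unnoticed at 0 and is always caught at 100.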

I hope this helps.

Yes, thanks, that helps. I’m somewhat relieved that you actually allow the user to hash in this way (i.e. kopia snapshot create --force-hash 100) and not just rely on metadata. Other packages do not have this feature, and that opens up much discussion on whether a backup has actually done its job in detecting changes when relying only on metadata. So +1 to kopia.

Although, this comes with performance tradeoffs. --force-hash 100 requires (re-)reading all the files and hashing them. So, it incurs higher I/O and CPU costs.

Suppose I run snapshots regularly and at some point run kopia snapshot verify --verify-files-percent 100 --all-sources to completion. If the verify command detected a checksum mismatch in the file data between some source path and repo path (due to failing RAM, say), how would it alert me to the issue? Would it be shown on stdout, in the log file, or both?

Then, to correct this, if I ran snapshot create --force-hash 100 (or some other % value), should I expect the snapshot error to be corrected (assuming the RAM or other root cause was fixed)?