I’m evaluating some backup solutions for long-term preservation of audio files.
Some files are derived from others (corrupted files with repackaging), and while investigating the final storage sizes I noticed that Kopia produces a bigger repo. The estimated duplicated size is circa 2.00 GiB.
[Chart: Difference to Original]
Given the incompressible nature of the files, I guess the compression method doesn’t matter.
Any ideas how to improve deduplication, or is that within an acceptable margin for the software?
Out of curiosity, did you test whether the restore was successful after each of these backups, specifically comparing whether the restored files are exact duplicates of the original source files? I use WinMerge for comparisons when doing test restores, but I’m not sure what can be used on Linux.
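On Linux, a recursive diff covers the same use case as WinMerge for test restores. A minimal sketch (the paths and the sample file are illustrative, simulating a source tree and its restored copy):

```shell
# Hypothetical paths: simulate a source tree and its restored copy
mkdir -p /tmp/restore_check/src /tmp/restore_check/restored
printf 'audio data' > /tmp/restore_check/src/track01.flac
cp /tmp/restore_check/src/track01.flac /tmp/restore_check/restored/

# diff -r exits 0 only when both trees are byte-identical
diff -r /tmp/restore_check/src /tmp/restore_check/restored && echo "restore verified"
```

Any differing, missing, or extra file makes `diff -r` exit non-zero and report the path.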
The reason I ask is that lower is not always better for deduplication, in the sense that some software could erroneously be marking different files as being the same.
Also, please make sure compression is disabled. The files may be largely incompressible, but compression could still reduce the file size by small amounts. You say the compression method doesn’t matter, but it may when you are dealing with such small size differences. Kopia by default does not use compression, so you may be comparing file sizes of uncompressed backups (Kopia) to compressed backups (such as Borg). The compression algorithm likewise may matter.
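To check that compression is actually off, Kopia’s policy commands can be used. A sketch, assuming the global policy applies to the snapshot in question (this is environment-dependent and not meant to be run verbatim):

```shell
# Inspect the effective global policy; look at the compression section
kopia policy show --global

# Explicitly disable compression for the global policy
kopia policy set --global --compression none
```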
Assuming all is well, then it looks like Borg has a better deduplication algorithm than Kopia. The others, I am not sure the difference is that large. It may even vary between source files.
Due to time constraints (the complete repo costs me circa 20 min per operation), I made a test directory with only the guaranteed-similar files (original and after repacking). All files restored without byte errors when compared with the originals (using diff).
Still, Kopia underperforms compared with Restic and Borg.
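For larger test sets, a checksum manifest is an alternative to a plain diff: record hashes of the source once, then verify any restore against them. A sketch with illustrative paths and a simulated restore:

```shell
# Hypothetical paths: simulate a source tree and its restored copy
mkdir -p /tmp/dedup_test/src /tmp/dedup_test/restored
printf 'audio payload' > /tmp/dedup_test/src/album01.flac
cp /tmp/dedup_test/src/album01.flac /tmp/dedup_test/restored/

# Record checksums of the source once...
(cd /tmp/dedup_test/src && find . -type f -exec sha256sum {} + > /tmp/dedup_test/manifest.sha256)

# ...then verify the restored tree against the manifest (non-zero exit on any mismatch)
(cd /tmp/dedup_test/restored && sha256sum -c /tmp/dedup_test/manifest.sha256)
```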
Backup - Original Size
$ kopia blob stats
Total: 3.3 GB
Average: 21.4 MB
0 between 0 B and 10 B (total 0 B)
1 between 10 B and 100 B (total 30 B)
3 between 100 B and 1 KB (total 1.5 KB)
2 between 1 KB and 10 KB (total 5.4 KB)
6 between 10 KB and 100 KB (total 261.1 KB)
0 between 100 KB and 1 MB (total 0 B)
0 between 1 MB and 10 MB (total 0 B)
143 between 10 MB and 100 MB (total 3.3 GB)
$ kopia content stats
Total Bytes: 3.3 GB
Total Packed: 3.3 GB (compression 0.0%)
(uncompressed) count: 1293 size: 3.3 GB
zstd-fastest count: 185 size: 288.6 KB packed: 88.3 KB compression: 69.4%
Average: 2.2 MB
0 between 0 B and 10 B (total 0 B)
0 between 10 B and 100 B (total 0 B)
302 between 100 B and 1 KB (total 64.6 KB)
1 between 1 KB and 10 KB (total 3.2 KB)
5 between 10 KB and 100 KB (total 186.6 KB)
367 between 100 KB and 1 MB (total 193.8 MB)
803 between 1 MB and 10 MB (total 3.1 GB)
0 between 10 MB and 100 MB (total 0 B)
Splitter algorithms aren’t well documented in Kopia and I haven’t experimented with them myself, but the documentation indicates that you can set a specific splitter algorithm when creating a repository by using the --object-splitter flag.
According to the kopia benchmark splitter output, there are 14 splitter algorithms available:
You could test some splitter algorithms and see if it improves deduplication in your case.
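As a sketch of how that might look (the filesystem backend and path here are illustrative; check `kopia repository create --help` for your backend):

```shell
# The splitter is chosen when the repository is created
kopia repository create filesystem \
  --path /mnt/backup/kopia-repo \
  --object-splitter DYNAMIC-1M-BUZHASH
```

Since the splitter is set at creation time, comparing splitters means creating one test repository per algorithm and snapshotting the same source into each.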
Why would you want to use different splitters across different repos? The default in Kopia is DYNAMIC-4M-BUZHASH but DYNAMIC-1M-BUZHASH will almost always give better deduplication, regardless of repo, because it uses smaller blocks. The tradeoff is that DYNAMIC-1M-BUZHASH will be a bit slower. Here is what kopia benchmark splitter says:
kopia benchmark splitter benchmarks splitting speed, doesn’t it? For now, I’m trying to understand what the optimal parameters for storage size would be, and later trying to correlate them with time and memory usage. At minimum, I guess one storage with FIXED for VMs and one or more storages with different parameters, according to a balance of memory vs. time vs. deduplication.
Very interesting! So there’s no valid reason not to use the 1M variant of buzhash? I wonder why this setting is not the default. BTW, from what I understand, Duplicati uses 100 KiB data chunks for deduplication purposes, but this causes big growth of its database, roughly 10 GiB of database per 1 TiB of data, so in fact the recommended setting for a 2 TiB backup is a 2 MiB deduplication block size. If these things are at all comparable, maybe Kopia’s 1M creates somewhat bigger indexes or something compared to 4M?
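The index-size concern can be put in rough numbers: with an average chunk size four times smaller, a 1M splitter produces roughly four times as many chunks, and thus roughly four times as many index entries. A back-of-envelope sketch (the 2 TiB figure is illustrative, and metadata overhead is ignored):

```shell
# Rough chunk counts for a 2 TiB repo at different average block sizes (all in MiB)
DATA_MIB=$((2 * 1024 * 1024))   # 2 TiB = 2097152 MiB
echo "DYNAMIC-4M-BUZHASH: ~$((DATA_MIB / 4)) chunks"   # ~524288
echo "DYNAMIC-1M-BUZHASH: ~$((DATA_MIB / 1)) chunks"   # ~2097152
```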
FWIW, just for kicks and giggles, I tested 775 GB of source files with DYNAMIC-4M-BUZHASH and DYNAMIC-1M-BUZHASH. All settings were kept consistent between the two snapshots (compression = zstd). I did not track CPU, RAM, disk usage, or backup speed.
With DYNAMIC-4M-BUZHASH, total repo size = 90.9 GB
With DYNAMIC-1M-BUZHASH, total repo size = 94.2 GB
This is what I get for trying to disagree with Kopia’s recommended splitter.