I’m evaluating some backup solutions for long term preservation of audio files.
Some files are derived from others (corrupted files that were repackaged), so there is duplicate content; the estimated duplicated size is circa 2,00 GiB. While comparing final storage sizes, I noticed that Kopia produces the biggest repository.
| | Size (bytes) | Size (human) | Difference to Original |
| --- | --- | --- | --- |
| Original | 49609954433 | 46,20 GiB | – |
| Kopia | 48541809930 | 45,21 GiB | 0,99 GiB |
| Duplicacy | 48162947709 | 44,86 GiB | 1,35 GiB |
| Restic | 47959911802 | 44,67 GiB | 1,54 GiB |
| Borg | 47362916819 | 44,11 GiB | 2,09 GiB |
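For reference, the "Difference to Original" column is just the byte counts converted to GiB (1 GiB = 1024³ bytes); a quick sanity check on the table:

```python
GIB = 1024 ** 3
sizes = {
    "Original": 49609954433,
    "Kopia": 48541809930,
    "Duplicacy": 48162947709,
    "Restic": 47959911802,
    "Borg": 47362916819,
}
orig = sizes["Original"]
for name, nbytes in sizes.items():
    # Repo size in GiB, and how much smaller it is than the source data
    print(f"{name:<9} {nbytes / GIB:6.2f} GiB  saved {(orig - nbytes) / GIB:.2f} GiB")
```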
Given the incompressible nature of the files, I guess the compression method doesn’t matter.
Any ideas on how to improve deduplication, or is this within an acceptable margin for this kind of software?
Out of curiosity, did you test whether restore was successful after each of these backups, specifically checking that the restored files are exact duplicates of the original source files? I use WinMerge for comparisons when doing test restores, but I'm not sure what to use on Linux.
The reason I ask is that lower is not always better for deduplication: some software could erroneously be marking different files as identical.
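On Linux, `diff -r original/ restored/` does a byte-level tree comparison; per-file checksums work too. A small sketch of the checksum approach (the throwaway demo directories below stand in for the real source and restore paths):

```python
import hashlib
import tempfile
from pathlib import Path

def tree_hashes(root: Path) -> dict:
    """Map each file path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

# Demo: two identical throwaway trees standing in for source and restore.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for d in (src, dst):
        (Path(d) / "track01.flac").write_bytes(b"\x00" * 4096)
    match = tree_hashes(Path(src)) == tree_hashes(Path(dst))
    print("restored tree identical:", match)
```

Comparing the two dicts catches missing files, extra files, and corrupted content in one go.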
Also, please make sure compression is disabled. The files may be largely incompressible, but compression could still shave off small amounts; you say the compression method doesn't matter, but it may when the size differences are this small. Kopia does not use compression by default, so you may be comparing uncompressed backups (Kopia) against compressed ones (such as Borg). The choice of compression algorithm may likewise matter.
Assuming all is well, it looks like Borg has a better deduplication algorithm than Kopia. As for the others, I'm not sure the differences are significant; they may even vary with the source files.
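Already-encoded audio is essentially random bytes to a general-purpose compressor, which is easy to demonstrate (zlib here is just a stand-in for whatever algorithm each tool uses):

```python
import os
import zlib

incompressible = os.urandom(1 << 20)   # 1 MiB of random bytes, like encoded audio
redundant = b"A" * (1 << 20)           # 1 MiB of highly repetitive data

for label, data in (("incompressible", incompressible), ("redundant", redundant)):
    ratio = len(zlib.compress(data, 6)) / len(data)
    print(f"{label}: compressed to {ratio:.3f} of original size")
```

The random buffer comes out essentially unchanged (often slightly larger), while the repetitive one collapses to almost nothing, so on audio payloads only the small metadata regions benefit.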
Due to time constraints (the complete repo takes circa 20 min per operation), I made a test directory with only the guaranteed-similar files (original and repacked). All files restored without byte errors when compared with the originals using the diff tool.
Kopia still underperforms compared with Restic and Borg.
| | Size (bytes) | Size (human) | Compression | Difference to Original |
| --- | --- | --- | --- | --- |
| Original | 4265474934 | 3,97 GiB | – | – |
| Duplicacy | 4212839340 | 3,92 GiB | LZ4 (forced) | 50,20 MiB |
| Kopia | 3322265078 | 3,09 GiB | No | 899,52 MiB |
| Bupstash | 2946638743 | 2,74 GiB | zstd (forced) | 1,23 GiB |
| Borg | 2785953655 | 2,59 GiB | No | 1,38 GiB |
| Restic | 2632905110 | 2,45 GiB | No | 1,52 GiB |
$ kopia blob stats
Count: 155
Total: 3.3 GB
Average: 21.4 MB
Histogram:
0 between 0 B and 10 B (total 0 B)
1 between 10 B and 100 B (total 30 B)
3 between 100 B and 1 KB (total 1.5 KB)
2 between 1 KB and 10 KB (total 5.4 KB)
6 between 10 KB and 100 KB (total 261.1 KB)
0 between 100 KB and 1 MB (total 0 B)
0 between 1 MB and 10 MB (total 0 B)
143 between 10 MB and 100 MB (total 3.3 GB)
$ kopia content stats
Count: 1478
Total Bytes: 3.3 GB
Total Packed: 3.3 GB (compression 0.0%)
By Method:
(uncompressed) count: 1293 size: 3.3 GB
zstd-fastest count: 185 size: 288.6 KB packed: 88.3 KB compression: 69.4%
Average: 2.2 MB
Histogram:
0 between 0 B and 10 B (total 0 B)
0 between 10 B and 100 B (total 0 B)
302 between 100 B and 1 KB (total 64.6 KB)
1 between 1 KB and 10 KB (total 3.2 KB)
5 between 10 KB and 100 KB (total 186.6 KB)
367 between 100 KB and 1 MB (total 193.8 MB)
803 between 1 MB and 10 MB (total 3.1 GB)
0 between 10 MB and 100 MB (total 0 B)
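The "compression 69.4%" figure above is the saving on just the 185 compressible contents; the other 1293 contents (the audio chunks) were stored uncompressed, which is why the overall packed total is unchanged:

```python
size_kb, packed_kb = 288.6, 88.3   # zstd-fastest contents, from the stats above
saving = 1 - packed_kb / size_kb
print(f"saving on compressible contents: {saving:.1%}")   # matches the reported 69.4%
```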
Looks like there is some room to improve Kopia's deduplication algorithm. I wonder whether this is across the board, or whether Kopia performs better with different file types.
Do you know the size of the original files and the size of the similar files? Just so we have an appropriate benchmark.
Splitter algorithms aren’t well documented in Kopia and I haven’t experimented with it myself, but the documentation indicates that you can set a specific splitter algorithm when creating a repository by using the --object-splitter flag.
According to the kopia benchmark splitter output, there are 14 splitter algorithms available:
FIXED-4M
FIXED-1M
FIXED-8M
FIXED-2M
FIXED
DYNAMIC
DYNAMIC-1M-BUZHASH
DYNAMIC-4M-BUZHASH
DYNAMIC-2M-BUZHASH
DYNAMIC-8M-BUZHASH
DYNAMIC-8M-RABINKARP
DYNAMIC-2M-RABINKARP
DYNAMIC-4M-RABINKARP
DYNAMIC-1M-RABINKARP
You could test some splitter algorithms and see if it improves deduplication in your case.
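For intuition on why the DYNAMIC splitters can deduplicate shifted data where FIXED cannot: they cut chunks wherever a rolling hash over the content hits a magic value, so chunk boundaries realign shortly after an insertion. A toy-scale sketch (a gear rolling hash instead of Kopia's buzhash/Rabin-Karp, and a ~4 KiB target chunk instead of 1M–8M; every parameter here is invented for illustration):

```python
import random

rng = random.Random(42)
GEAR = [rng.getrandbits(64) for _ in range(256)]
MASK = (1 << 12) - 1   # boundary when the hash ends in 12 zero bits -> ~4 KiB chunks

def cdc_chunks(data: bytes) -> list:
    """Split at content-defined boundaries using a rolling gear hash."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        if h & MASK == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Two "files": the second is the first with 100 bytes inserted at the front,
# like a repackaged file whose header changed.
data_rng = random.Random(7)
original = bytes(data_rng.getrandbits(8) for _ in range(200_000))
modified = bytes(data_rng.getrandbits(8) for _ in range(100)) + original

shared = sum(len(c) for c in set(cdc_chunks(original)) & set(cdc_chunks(modified)))
print(f"bytes in shared chunks after a 100-byte insert: {shared / len(original):.0%}")
```

With FIXED chunking, the same 100-byte insert would shift every block boundary and almost nothing would deduplicate; with content-defined splitting, only the chunks around the edit differ.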
I guess the recommended approach would be to back up to different repositories in different storage locations, each with a different splitter, via some config-file hackery?
Why would you want to use different splitters across different repos? The default in Kopia is DYNAMIC-4M-BUZHASH but DYNAMIC-1M-BUZHASH will almost always give better deduplication, regardless of repo, because it uses smaller blocks. The tradeoff is that DYNAMIC-1M-BUZHASH will be a bit slower. Here is what kopia benchmark splitter says:
kopia benchmark splitter benchmarks splitting speed, doesn't it? For now I'm trying to work out the optimal parameters for storage size, and later to correlate them with time and memory usage. At minimum, I guess one repository with FIXED for VMs, and one or more repositories with parameters chosen to balance memory vs. time vs. deduplication.
Very interesting! So there's no valid reason not to use the 1M variant of buzhash? I wonder why that setting isn't the default. BTW, from what I understand, Duplicati uses 100 KiB data chunks for deduplication, but this makes its database grow large, roughly 10 GiB of database per 1 TiB of data, so the recommended setting for a 2 TiB backup is in fact a 2 MiB deduplication block size. If these things are at all comparable, maybe Kopia's 1M creates somewhat bigger indexes than 4M?
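The index-size concern scales linearly with chunk count. A back-of-the-envelope sketch, using an assumed (not measured) per-chunk index overhead; real per-entry costs differ per tool, which is presumably why Duplicati's 100 KiB blocks cost it ~10 GiB of database per TiB:

```python
KIB, MIB, TIB = 1024, 1024 ** 2, 1024 ** 4

data = 1 * TIB
entry_bytes = 100   # assumed index/metadata bytes per chunk; purely illustrative

for chunk in (100 * KIB, 1 * MIB, 4 * MIB):
    n = data // chunk
    print(f"{chunk // KIB:>5} KiB chunks -> {n:>10,} entries, "
          f"~{n * entry_bytes / MIB:,.0f} MiB of index")
```

Going from 4M to 1M chunks quadruples the number of entries to track, and 100 KiB chunks cost another order of magnitude on top of that.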
Thanks for the link. So 4M seems like a good overall compromise, leaning towards bigger backups (TiBs rather than GiBs). For smaller ones (a few GiBs), 1M is probably the better choice.
FWIW, just for kicks and giggles, I tested 775 GB of source files with DYNAMIC-4M-BUZHASH and DYNAMIC-1M-BUZHASH. All settings were kept consistent between the two snapshots (compression = zstd). I did not track CPU, RAM, disk usage, or backup speed.
With DYNAMIC-4M-BUZHASH, total repo size = 90.9 GB
With DYNAMIC-1M-BUZHASH, total repo size = 94.2 GB
This is what I get for trying to disagree with Kopia's recommended splitter.