Deduplication efficiency

I’m evaluating some backup solutions for long term preservation of audio files.
Some files are derived from others (corrupted files with repackaging), and while investigating the final storage sizes I noticed that Kopia produces the biggest repo. The estimated duplicated size is circa 2,00 GiB.

| | Size (bytes) | Size (human) | Difference to Original |
| --- | --- | --- | --- |
| Original | 49609954433 | 46,20 GiB | |
| Kopia | 48541809930 | 45,21 GiB | 0,99 GiB |
| Duplicacy | 48162947709 | 44,86 GiB | 1,35 GiB |
| Restic | 47959911802 | 44,67 GiB | 1,54 GiB |
| Borg | 47362916819 | 44,11 GiB | 2,09 GiB |
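For clarity, the difference column is just each repo's savings relative to the original source size; a quick arithmetic check of the numbers above:

```python
# Space saved by each tool relative to the original data set, using the
# byte counts from the table above.
ORIGINAL = 49609954433  # 46,20 GiB

repos = {
    "Kopia": 48541809930,
    "Duplicacy": 48162947709,
    "Restic": 47959911802,
    "Borg": 47362916819,
}

GiB = 1024 ** 3
for name, size in repos.items():
    saved = ORIGINAL - size
    print(f"{name:9s} saved {saved / GiB:.2f} GiB ({saved / ORIGINAL:.2%})")
```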

Given the incompressible nature of the files, I guess the compression method doesn’t matter.

Any ideas how to improve deduplication, or is that within an acceptable margin for the software?

Out of curiosity, did you test whether the restore was successful after each of these backups, specifically checking that the restored files are exact duplicates of the original source files? I use WinMerge for comparisons when doing test restores, but I'm not sure what can be used on Linux.

The reason I ask is that lower is not always better for deduplication, in the sense that some software could erroneously mark different files as being the same.

Also, please make sure compression is disabled. The files may be largely incompressible, but compression could still reduce the file size by small amounts. You say the compression method doesn’t matter, but it may when you are dealing with such small size differences. Kopia by default does not use compression, so you may be comparing file sizes of uncompressed backups (Kopia) to compressed backups (such as Borg). The compression algorithm likewise may matter.

Assuming all is well, it looks like Borg has a better deduplication algorithm than Kopia. For the others, I am not sure the difference is that large; it may even vary with the source files.

Due to time constraints (each operation on the complete repo costs me circa 20 min), I made a test directory with only the guaranteed similar files (original and after repacking). All files restored without byte errors when compared with the originals (using the diff tool).
Still, Kopia underperforms compared with Restic and Borg.
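On Linux, besides diff -r, another way to do this comparison is to hash both trees and compare digests; a minimal sketch (paths and helper names are hypothetical):

```python
# Minimal sketch: verify a restore by comparing SHA-256 digests of every
# file in the source tree against its counterpart in the restored tree.
import hashlib
from pathlib import Path

def digest(path: Path) -> str:
    """Stream a file through SHA-256 in 1 MiB blocks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def compare_trees(src: Path, restored: Path) -> list[str]:
    """Return relative paths that differ or are missing in the restore."""
    bad = []
    for f in src.rglob("*"):
        if f.is_file():
            rel = f.relative_to(src)
            other = restored / rel
            if not other.is_file() or digest(f) != digest(other):
                bad.append(str(rel))
    return bad
```

An empty result means every source file has a byte-identical copy in the restore.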

| | Size (bytes) | Size (human) | Compression | Difference to Original |
| --- | --- | --- | --- | --- |
| Original | 4265474934 | 3,97 GiB | | |
| Duplicacy | 4212839340 | 3,92 GiB | LZ4 (forced) | 50,20 MiB |
| Kopia | 3322265078 | 3,09 GiB | No | 899,52 MiB |
| Bupstash | 2946638743 | 2,74 GiB | zstd (forced) | 1,23 GiB |
| Borg | 2785953655 | 2,59 GiB | No | 1,38 GiB |
| Restic | 2632905110 | 2,45 GiB | No | 1,52 GiB |
$ kopia blob stats                                                                                                                                                                                                   
Count: 155
Total: 3.3 GB
Average: 21.4 MB

        0 between 0 B and 10 B (total 0 B)
        1 between 10 B and 100 B (total 30 B)
        3 between 100 B and 1 KB (total 1.5 KB)
        2 between 1 KB and 10 KB (total 5.4 KB)
        6 between 10 KB and 100 KB (total 261.1 KB)
        0 between 100 KB and 1 MB (total 0 B)
        0 between 1 MB and 10 MB (total 0 B)
      143 between 10 MB and 100 MB (total 3.3 GB)
$ kopia content stats                                                                                                                                                                                                
Count: 1478
Total Bytes: 3.3 GB
Total Packed: 3.3 GB (compression 0.0%)
By Method:
  (uncompressed)         count: 1293 size: 3.3 GB
  zstd-fastest           count: 185 size: 288.6 KB packed: 88.3 KB compression: 69.4%
Average: 2.2 MB

        0 between 0 B and 10 B (total 0 B)
        0 between 10 B and 100 B (total 0 B)
      302 between 100 B and 1 KB (total 64.6 KB)
        1 between 1 KB and 10 KB (total 3.2 KB)
        5 between 10 KB and 100 KB (total 186.6 KB)
      367 between 100 KB and 1 MB (total 193.8 MB)
      803 between 1 MB and 10 MB (total 3.1 GB)
        0 between 10 MB and 100 MB (total 0 B)

Looks like there is some room to improve Kopia's deduplication algorithm. I wonder if this is across the board, or whether Kopia performs better with other file types.

Do you know the size of the original files versus the size of the similar files? Just so we have an appropriate benchmark.

Splitter algorithms aren't well documented in Kopia, and I haven't experimented with them myself, but the documentation indicates that you can set a specific splitter algorithm when creating a repository by using the --object-splitter flag.

According to the kopia benchmark splitter output, there are 14 splitter algorithms available:

  • DYNAMIC
  • DYNAMIC-1M-BUZHASH
  • DYNAMIC-1M-RABINKARP
  • DYNAMIC-2M-BUZHASH
  • DYNAMIC-2M-RABINKARP
  • DYNAMIC-4M-BUZHASH
  • DYNAMIC-4M-RABINKARP
  • DYNAMIC-8M-BUZHASH
  • DYNAMIC-8M-RABINKARP
  • FIXED
  • FIXED-1M
  • FIXED-2M
  • FIXED-4M
  • FIXED-8M

You could test some splitter algorithms and see if it improves deduplication in your case.
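For intuition about why the DYNAMIC splitters deduplicate shifted or repacked data better than the FIXED ones, here is a toy content-defined chunker. It only sketches the general buzhash idea; the window size, cut thresholds, and hash table are made up for illustration and are not Kopia's actual implementation:

```python
# Toy content-defined chunker: a rolling hash over the last `window` bytes
# picks cut points from the content itself, so bytes inserted early in a
# file only disturb nearby chunk boundaries instead of shifting every
# fixed-size block. All parameters are illustrative.
import random

_rng = random.Random(0)
T = [_rng.getrandbits(32) for _ in range(256)]  # per-byte hash table

def rotl(x: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def chunk(data: bytes, min_size=1024, avg_size=4096, max_size=16384, window=48):
    mask = avg_size - 1  # avg_size must be a power of two
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = rotl(h, 1) ^ T[b]                      # byte enters the window
        if i >= window:
            h ^= rotl(T[data[i - window]], window % 32)  # byte leaves it
        length = i - start + 1
        if length >= max_size or (length >= min_size and (h & mask) == 0):
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because the hash depends only on the last few dozen bytes, the cut points resynchronize shortly after an insertion, and most downstream chunks hash to the same values, which is exactly what deduplication exploits.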

The target files would be m4a files of circa 30 MiB each, but from the test data we have this distribution (using this calc):

128k:     16
256k:    170
512k:    148
  1M:     82
  2M:     90
  4M:     94
  8M:    183
 16M:     51
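The calc itself isn't shown here, but a bucketed distribution like the one above can be produced in a few lines; a hypothetical version (the labels mirror the list, the input sizes are placeholders):

```python
# Count chunks by the smallest power-of-two bucket that holds them.
# Sizes above the last bucket are clamped into it.
import bisect

BUCKETS = [s << 10 for s in (128, 256, 512)] + [s << 20 for s in (1, 2, 4, 8, 16)]
LABELS = ["128k", "256k", "512k", "1M", "2M", "4M", "8M", "16M"]

def histogram(sizes):
    counts = dict.fromkeys(LABELS, 0)
    for s in sizes:
        idx = min(bisect.bisect_left(BUCKETS, s), len(BUCKETS) - 1)
        counts[LABELS[idx]] += 1
    return counts
```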

By splitter algorithm comparison:

| Object-Splitter | Size (bytes) | Size (GiB) | Difference (GiB) |
| --- | --- | --- | --- |
| Original | 4265474934 | 3,97 | |
| DYNAMIC-1M-BUZHASH | 2610210764 | 2,43 | 1,54 |
| DYNAMIC-1M-RABINKARP | 2636681550 | 2,46 | 1,52 |
| DYNAMIC-2M-BUZHASH | 2986004890 | 2,78 | 1,19 |
| DYNAMIC-2M-RABINKARP | 2993709563 | 2,79 | 1,18 |
| DYNAMIC-4M-BUZHASH | 3322285587 | 3,09 | 0,88 |
| DYNAMIC | 3322306237 | 3,09 | 0,88 |
| DYNAMIC-4M-RABINKARP | 3363882420 | 3,13 | 0,84 |
| DYNAMIC-8M-BUZHASH | 3713374550 | 3,46 | 0,51 |
| DYNAMIC-8M-RABINKARP | 3763816647 | 3,51 | 0,47 |
| FIXED-1M | 4262464973 | 3,97 | 0,00 |
| FIXED-8M | 4266992279 | 3,97 | 0,00 |
| FIXED-4M | 4267145272 | 3,97 | 0,00 |
| FIXED | 4267157262 | 3,97 | 0,00 |
| FIXED-2M | 4267354503 | 3,97 | 0,00 |

I guess the recommended approach would be to back up to different repositories on different storage locations, each with a different splitter, with some config-file hackery?

Why would you want to use different splitters across different repos? The default in Kopia is DYNAMIC-4M-BUZHASH, but DYNAMIC-1M-BUZHASH will almost always give better deduplication, regardless of repo, because it uses smaller blocks. The tradeoff is that DYNAMIC-1M-BUZHASH will be a bit slower. Here is what kopia benchmark splitter says:
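Besides speed, the other tradeoff with smaller blocks is index size: quartering the average chunk size roughly quadruples the number of contents to hash, look up, and index. A back-of-envelope sketch (the per-entry byte count is an assumption for illustration, not a measured Kopia figure):

```python
# Estimate chunk counts and index overhead for 1 TiB of data under a 1M
# vs. 4M average chunk size. ENTRY_BYTES is a hypothetical per-entry cost.
TiB = 1024 ** 4
ENTRY_BYTES = 64  # assumed index overhead per chunk (not a Kopia number)

for avg in (1 << 20, 4 << 20):  # 1M vs 4M average chunk size
    n = TiB // avg
    mib = n * ENTRY_BYTES / 1024 ** 2
    print(f"{avg >> 20}M splitter: ~{n:,} chunks, ~{mib:.0f} MiB of index")
```

Whether that extra index overhead matters depends on total repo size; for a few GiB of audio it is negligible, for many TiB it adds up.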

kopia benchmark splitter benchmarks splitting speed, doesn't it? For now, I'm trying to understand the optimal parameters for storage size, and later to correlate them with time and memory usage. At minimum, I guess one storage with FIXED for VMs and one or more storages with different parameters, balancing memory vs. time vs. deduplication.

Like the memory usage:

Interestingly, I would expect DYNAMIC-1M-BUZHASH to consume more, not less memory.


Very interesting! So, there's no valid reason not to use the 1M variant of buzhash? I wonder why this setting is not the default. BTW, from what I understand, Duplicati uses 100 KiB data chunks for deduplication, but this makes its database grow considerably, roughly 10 GiB of database per 1 TiB of data, so the recommended setting for a 2 TiB backup is in fact a 2 MiB deduplication block size. If these things are at all comparable, maybe Kopia's 1M creates somewhat bigger indexes or something, compared to 4M?
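Those Duplicati figures imply roughly 1 KiB of database per chunk, which is easy to check with plain arithmetic (the per-chunk overhead is inferred from the quoted numbers, not documented):

```python
# Sanity check: 1 TiB of data in 100 KiB chunks with a ~10 GiB database
# works out to about 1000 bytes of database per chunk.
TiB, GiB, KiB = 1024 ** 4, 1024 ** 3, 1024

chunks = TiB // (100 * KiB)       # number of 100 KiB chunks in 1 TiB
per_chunk = 10 * GiB / chunks     # database bytes per chunk
print(f"~{chunks:,} chunks, ~{per_chunk:.0f} B of database each")
```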

Someone made some calculations comparing various chunk sizes and the time vs. delta tradeoffs.

Thanks for the link. So, 4M seems like a good overall compromise when going towards bigger (TiBs rather than GiBs) backups. For smaller ones (a few GiBs), 1M is probably a better choice.

FWIW, just for kicks and giggles, I tested 775 GB of source files with DYNAMIC-4M-BUZHASH and DYNAMIC-1M-BUZHASH. All settings were kept consistent between the two snapshots (compression = zstd). I did not track CPU, RAM, disk usage, or backup speed.

  • With DYNAMIC-4M-BUZHASH, total repo size = 90.9 GB
  • With DYNAMIC-1M-BUZHASH, total repo size = 94.2 GB

This is what I get for trying to disagree with the Kopia recommended splitter :smiley: