Deduplication efficiency

I’m evaluating some backup solutions for long term preservation of audio files.
Some files are derived from others (corrupted files with repackaging), and while investigating the final storage sizes I noticed that Kopia produces the biggest repo. The estimated duplicated size is circa 2,00 GiB.

| | Size (bytes) | Size (human) | Difference to Original |
| --- | --- | --- | --- |
| Original | 49609954433 | 46,20 GiB | |
| Kopia | 48541809930 | 45,21 GiB | 0,99 GiB |
| Duplicacy | 48162947709 | 44,86 GiB | 1,35 GiB |
| Restic | 47959911802 | 44,67 GiB | 1,54 GiB |
| Borg | 47362916819 | 44,11 GiB | 2,09 GiB |
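For clarity, the difference column is just each repo's savings relative to the original source size; a quick arithmetic check of the numbers above:

```python
# Space saved by each tool relative to the original data set, using the
# byte counts from the table above.
ORIGINAL = 49609954433  # 46,20 GiB

repos = {
    "Kopia": 48541809930,
    "Duplicacy": 48162947709,
    "Restic": 47959911802,
    "Borg": 47362916819,
}

GiB = 1024 ** 3
for name, size in repos.items():
    saved = ORIGINAL - size
    print(f"{name:9s} saved {saved / GiB:.2f} GiB ({saved / ORIGINAL:.2%})")
```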

Given the incompressible nature of the files, I guess the compression method doesn’t matter.

Any ideas how to improve deduplication, or is that within an acceptable margin for the software?

Out of curiosity, did you test whether the restore was successful after each of these backups, specifically checking that the restored files are exact duplicates of the original source files? I use WinMerge for comparisons when doing test restores, but I'm not sure what can be used on Linux.

The reason I ask is that lower is not always better for deduplication, in the sense that some software could erroneously mark different files as being the same.

Also, please make sure compression is disabled. The files may be largely incompressible, but compression could still reduce the file size by small amounts. You say the compression method doesn’t matter, but it may when you are dealing with such small size differences. Kopia by default does not use compression, so you may be comparing file sizes of uncompressed backups (Kopia) to compressed backups (such as Borg). The compression algorithm likewise may matter.

Assuming all is well, it looks like Borg has a better deduplication algorithm than Kopia. For the others, I am not sure the difference is that large; it may even vary with the source files.

Due to time constraints (each operation on the complete repo costs me circa 20 min), I made a test directory with only the guaranteed similar files (original and after repacking). All files restored without byte errors when compared with the originals (using the diff tool).
Still, Kopia underperforms compared with Restic and Borg.
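On Linux, besides diff -r, another way to do this comparison is to hash both trees and compare digests; a minimal sketch (paths and helper names are hypothetical):

```python
# Minimal sketch: verify a restore by comparing SHA-256 digests of every
# file in the source tree against its counterpart in the restored tree.
import hashlib
from pathlib import Path

def digest(path: Path) -> str:
    """Stream a file through SHA-256 in 1 MiB blocks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def compare_trees(src: Path, restored: Path) -> list[str]:
    """Return relative paths that differ or are missing in the restore."""
    bad = []
    for f in src.rglob("*"):
        if f.is_file():
            rel = f.relative_to(src)
            other = restored / rel
            if not other.is_file() or digest(f) != digest(other):
                bad.append(str(rel))
    return bad
```

An empty result means every source file has a byte-identical copy in the restore.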

| | Size (bytes) | Size (human) | Compression | Difference to Original |
| --- | --- | --- | --- | --- |
| Original | 4265474934 | 3,97 GiB | | |
| Duplicacy | 4212839340 | 3,92 GiB | LZ4 (forced) | 50,20 MiB |
| Kopia | 3322265078 | 3,09 GiB | No | 899,52 MiB |
| Bupstash | 2946638743 | 2,74 GiB | zstd (forced) | 1,23 GiB |
| Borg | 2785953655 | 2,59 GiB | No | 1,38 GiB |
| Restic | 2632905110 | 2,45 GiB | No | 1,52 GiB |
$ kopia blob stats                                                                                                                                                                                                   
Count: 155
Total: 3.3 GB
Average: 21.4 MB

        0 between 0 B and 10 B (total 0 B)
        1 between 10 B and 100 B (total 30 B)
        3 between 100 B and 1 KB (total 1.5 KB)
        2 between 1 KB and 10 KB (total 5.4 KB)
        6 between 10 KB and 100 KB (total 261.1 KB)
        0 between 100 KB and 1 MB (total 0 B)
        0 between 1 MB and 10 MB (total 0 B)
      143 between 10 MB and 100 MB (total 3.3 GB)
$ kopia content stats                                                                                                                                                                                                
Count: 1478
Total Bytes: 3.3 GB
Total Packed: 3.3 GB (compression 0.0%)
By Method:
  (uncompressed)         count: 1293 size: 3.3 GB
  zstd-fastest           count: 185 size: 288.6 KB packed: 88.3 KB compression: 69.4%
Average: 2.2 MB

        0 between 0 B and 10 B (total 0 B)
        0 between 10 B and 100 B (total 0 B)
      302 between 100 B and 1 KB (total 64.6 KB)
        1 between 1 KB and 10 KB (total 3.2 KB)
        5 between 10 KB and 100 KB (total 186.6 KB)
      367 between 100 KB and 1 MB (total 193.8 MB)
      803 between 1 MB and 10 MB (total 3.1 GB)
        0 between 10 MB and 100 MB (total 0 B)

Looks like there is some room to improve Kopia's deduplication algorithm. I wonder if this is across the board, or whether Kopia performs better with other file types.

Do you know the size of the original files versus the size of the similar files? Just so we have an appropriate benchmark.

Splitter algorithms aren't well documented in Kopia, and I haven't experimented with them myself, but the documentation indicates that you can set a specific splitter algorithm when creating a repository by using the --object-splitter flag.

According to the kopia benchmark splitter output, there are 14 splitter algorithms available:

  • DYNAMIC
  • DYNAMIC-1M-BUZHASH
  • DYNAMIC-1M-RABINKARP
  • DYNAMIC-2M-BUZHASH
  • DYNAMIC-2M-RABINKARP
  • DYNAMIC-4M-BUZHASH
  • DYNAMIC-4M-RABINKARP
  • DYNAMIC-8M-BUZHASH
  • DYNAMIC-8M-RABINKARP
  • FIXED
  • FIXED-1M
  • FIXED-2M
  • FIXED-4M
  • FIXED-8M

You could test some splitter algorithms and see if it improves deduplication in your case.
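For intuition about why the DYNAMIC splitters deduplicate shifted or repacked data better than the FIXED ones, here is a toy content-defined chunker. It only sketches the general buzhash idea; the window size, cut thresholds, and hash table are made up for illustration and are not Kopia's actual implementation:

```python
# Toy content-defined chunker: a rolling hash over the last `window` bytes
# picks cut points from the content itself, so bytes inserted early in a
# file only disturb nearby chunk boundaries instead of shifting every
# fixed-size block. All parameters are illustrative.
import random

_rng = random.Random(0)
T = [_rng.getrandbits(32) for _ in range(256)]  # per-byte hash table

def rotl(x: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def chunk(data: bytes, min_size=1024, avg_size=4096, max_size=16384, window=48):
    mask = avg_size - 1  # avg_size must be a power of two
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = rotl(h, 1) ^ T[b]                      # byte enters the window
        if i >= window:
            h ^= rotl(T[data[i - window]], window % 32)  # byte leaves it
        length = i - start + 1
        if length >= max_size or (length >= min_size and (h & mask) == 0):
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because the hash depends only on the last few dozen bytes, the cut points resynchronize shortly after an insertion, and most downstream chunks hash to the same values, which is exactly what deduplication exploits.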

The target files would be m4a files of circa 30 MiB each, but from the test data we have this distribution (using this calc):

128k:     16
256k:    170
512k:    148
  1M:     82
  2M:     90
  4M:     94
  8M:    183
 16M:     51
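The calc itself isn't shown here, but a bucketed distribution like the one above can be produced in a few lines; a hypothetical version (the labels mirror the list, the input sizes are placeholders):

```python
# Count chunks by the smallest power-of-two bucket that holds them.
# Sizes above the last bucket are clamped into it.
import bisect

BUCKETS = [s << 10 for s in (128, 256, 512)] + [s << 20 for s in (1, 2, 4, 8, 16)]
LABELS = ["128k", "256k", "512k", "1M", "2M", "4M", "8M", "16M"]

def histogram(sizes):
    counts = dict.fromkeys(LABELS, 0)
    for s in sizes:
        idx = min(bisect.bisect_left(BUCKETS, s), len(BUCKETS) - 1)
        counts[LABELS[idx]] += 1
    return counts
```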

By splitter algorithm comparison:

| Object-Splitter | Size (bytes) | Size (GiB) | Difference (GiB) |
| --- | --- | --- | --- |
| Original | 4265474934 | 3,97 | |
| DYNAMIC-1M-BUZHASH | 2610210764 | 2,43 | 1,54 |
| DYNAMIC-1M-RABINKARP | 2636681550 | 2,46 | 1,52 |
| DYNAMIC-2M-BUZHASH | 2986004890 | 2,78 | 1,19 |
| DYNAMIC-2M-RABINKARP | 2993709563 | 2,79 | 1,18 |
| DYNAMIC-4M-BUZHASH | 3322285587 | 3,09 | 0,88 |
| DYNAMIC | 3322306237 | 3,09 | 0,88 |
| DYNAMIC-4M-RABINKARP | 3363882420 | 3,13 | 0,84 |
| DYNAMIC-8M-BUZHASH | 3713374550 | 3,46 | 0,51 |
| DYNAMIC-8M-RABINKARP | 3763816647 | 3,51 | 0,47 |
| FIXED-1M | 4262464973 | 3,97 | 0,00 |
| FIXED-8M | 4266992279 | 3,97 | 0,00 |
| FIXED-4M | 4267145272 | 3,97 | 0,00 |
| FIXED | 4267157262 | 3,97 | 0,00 |
| FIXED-2M | 4267354503 | 3,97 | 0,00 |

I guess the recommended approach would be to back up to different repositories on different storage locations, each with a different splitter, with some config-file hackery?

Why would you want to use different splitters across different repos? The default in Kopia is DYNAMIC-4M-BUZHASH, but DYNAMIC-1M-BUZHASH will almost always give better deduplication, regardless of repo, because it uses smaller blocks. The tradeoff is that DYNAMIC-1M-BUZHASH will be a bit slower. Here is what kopia benchmark splitter says:
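Besides speed, the other tradeoff with smaller blocks is index size: quartering the average chunk size roughly quadruples the number of contents to hash, look up, and index. A back-of-envelope sketch (the per-entry byte count is an assumption for illustration, not a measured Kopia figure):

```python
# Estimate chunk counts and index overhead for 1 TiB of data under a 1M
# vs. 4M average chunk size. ENTRY_BYTES is a hypothetical per-entry cost.
TiB = 1024 ** 4
ENTRY_BYTES = 64  # assumed index overhead per chunk (not a Kopia number)

for avg in (1 << 20, 4 << 20):  # 1M vs 4M average chunk size
    n = TiB // avg
    mib = n * ENTRY_BYTES / 1024 ** 2
    print(f"{avg >> 20}M splitter: ~{n:,} chunks, ~{mib:.0f} MiB of index")
```

Whether that extra index overhead matters depends on total repo size; for a few GiB of audio it is negligible, for many TiB it adds up.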

kopia benchmark splitter benchmarks splitting speed, doesn't it? For now, I'm trying to understand the optimal parameters for storage size, and later to correlate them with time and memory usage. At minimum, I guess one storage with FIXED for VMs and one or more storages with different parameters, balancing memory vs. time vs. deduplication.

Like the memory usage:

Interestingly, I would expect DYNAMIC-1M-BUZHASH to consume more, not less memory.


Very interesting! So, there's no valid reason not to use the 1M variant of buzhash? I wonder why this setting is not the default. BTW, from what I understand, Duplicati uses 100 KiB data chunks for deduplication, but this makes its database grow considerably, roughly 10 GiB of database per 1 TiB of data, so the recommended setting for a 2 TiB backup is in fact a 2 MiB deduplication block size. If these things are at all comparable, maybe Kopia's 1M creates somewhat bigger indexes or something, compared to 4M?
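Those Duplicati figures imply roughly 1 KiB of database per chunk, which is easy to check with plain arithmetic (the per-chunk overhead is inferred from the quoted numbers, not documented):

```python
# Sanity check: 1 TiB of data in 100 KiB chunks with a ~10 GiB database
# works out to about 1000 bytes of database per chunk.
TiB, GiB, KiB = 1024 ** 4, 1024 ** 3, 1024

chunks = TiB // (100 * KiB)       # number of 100 KiB chunks in 1 TiB
per_chunk = 10 * GiB / chunks     # database bytes per chunk
print(f"~{chunks:,} chunks, ~{per_chunk:.0f} B of database each")
```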

Someone made some calculations comparing various chunk sizes and the time vs. delta tradeoffs.

Thanks for the link. So, 4M seems like a good overall compromise when going towards bigger (TiBs rather than GiBs) backups. For smaller ones (a few GiBs), 1M is probably a better choice.

FWIW, just for kicks and giggles, I tested 775 GB of source files with DYNAMIC-4M-BUZHASH and DYNAMIC-1M-BUZHASH. All settings were kept consistent between the two snapshots (compression = zstd). I did not track CPU, RAM, disk usage, or backup speed.

  • With DYNAMIC-4M-BUZHASH, total repo size = 90.9 GB
  • With DYNAMIC-1M-BUZHASH, total repo size = 94.2 GB

This is what I get for trying to disagree with the Kopia recommended splitter :smiley: