I saw a post on lobster saying that restic used content defined chunking. Does kopia use something similar?
Not only rolling hashes, but also content defined chunking (GitHub - restic/chunker: Implementation of Content Defined Chunking (CDC) in Go), which is just magic really. Deduplicating segments not at block boundaries.
Yes, Kopia does indeed use something similar. Restic calls it chunking; Kopia calls its chunkers ‘Splitters’. Both tools use a so-called rolling hash. I am not exactly sure which one Restic uses, but I think it is the Rabin-Karp algorithm. Kopia also provides a variety of splitters, though this is not well documented. If you run
kopia benchmark splitter, you might see something like this:
splitting 16 blocks of 32MiB each, parallelism 1
DYNAMIC 307.8 MB/s count:107 min:9467 10th:2277562 25th:2971794 50th:4747177 75th:7603998 90th:8388608 max:8388608
DYNAMIC-1M-BUZHASH 319 MB/s count:428 min:9467 10th:612999 25th:766808 50th:1158068 75th:1744194 90th:2097152 max:2097152
DYNAMIC-1M-RABINKARP 96.2 MB/s count:391 min:213638 10th:623062 25th:813953 50th:1328673 75th:2097152 90th:2097152 max:2097152
DYNAMIC-2M-BUZHASH 301.5 MB/s count:204 min:64697 10th:1210184 25th:1638276 50th:2585985 75th:3944217 90th:4194304 max:4194304
DYNAMIC-2M-RABINKARP 96.7 MB/s count:204 min:535925 10th:1235674 25th:1675441 50th:2525341 75th:3658905 90th:4194304 max:4194304
DYNAMIC-4M-BUZHASH 296.4 MB/s count:107 min:9467 10th:2277562 25th:2971794 50th:4747177 75th:7603998 90th:8388608 max:8388608
DYNAMIC-4M-RABINKARP 98 MB/s count:110 min:535925 10th:2242307 25th:2767610 50th:4400962 75th:6813401 90th:8388608 max:8388608
DYNAMIC-8M-BUZHASH 294.2 MB/s count:49 min:677680 10th:5260579 25th:6528562 50th:11102775 75th:16777216 90th:16777216 max:16777216
DYNAMIC-8M-RABINKARP 100.2 MB/s count:60 min:1446246 10th:4337385 25th:5293196 50th:8419217 75th:12334953 90th:16777216 max:16777216
FIXED 94 TB/s count:128 min:4194304 10th:4194304 25th:4194304 50th:4194304 75th:4194304 90th:4194304 max:4194304
FIXED-1M 56.9 TB/s count:512 min:1048576 10th:1048576 25th:1048576 50th:1048576 75th:1048576 90th:1048576 max:1048576
FIXED-2M 112.8 TB/s count:256 min:2097152 10th:2097152 25th:2097152 50th:2097152 75th:2097152 90th:2097152 max:2097152
FIXED-4M 159 TB/s count:128 min:4194304 10th:4194304 25th:4194304 50th:4194304 75th:4194304 90th:4194304 max:4194304
FIXED-8M 320.9 TB/s count:64 min:8388608 10th:8388608 25th:8388608 50th:8388608 75th:8388608 90th:8388608 max:8388608
Notice the names of the splitters: BUZHASH and RABINKARP. Both of these are rolling hashes (see the linked Wiki article above). Kopia lets you select the target size of each split, from 1M to 8M, and also how the split points are computed: FIXED, BUZHASH, or RABINKARP. In the output above, the FIXED splitters always produce chunks of exactly the named size, while the dynamic ones vary around it and top out at twice that size. Based on benchmarks against your own data, you can select the one that suits you best.
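To make the difference between fixed and dynamic splitting concrete, here is a minimal Python sketch of a buzhash-style content-defined splitter. It is not Kopia's or Restic's actual code: the window length, mask, minimum and maximum sizes, the random byte table, and the function names are all assumptions chosen to keep the example small.

```python
import hashlib
import random

# All parameters below are illustrative assumptions, not Kopia's real values.
WINDOW = 48            # rolling-window length in bytes
MASK = (1 << 10) - 1   # cut when the low 10 hash bits are zero (~1 KiB apart)
MIN_SIZE = 256         # never cut a chunk shorter than this
MAX_SIZE = 4096        # always cut once a chunk reaches this size

# Buzhash needs one fixed pseudo-random 32-bit value per possible byte.
_rng = random.Random(0)
TABLE = [_rng.getrandbits(32) for _ in range(256)]


def rotl32(x, n):
    """Rotate a 32-bit integer left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def split_dynamic(data):
    """Split data where the rolling hash of the last WINDOW bytes hits MASK."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Slide the window: rotate, mix in the new byte, drop the oldest one.
        h = rotl32(h, 1) ^ TABLE[byte]
        if i - start >= WINDOW:
            h ^= rotl32(TABLE[data[i - WINDOW]], WINDOW)
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def split_fixed(data, size=1024):
    """Split data every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared(chunks_a, chunks_b):
    """Count distinct chunks (by SHA-256) that appear in both lists."""
    digests_a = {hashlib.sha256(c).digest() for c in chunks_a}
    digests_b = {hashlib.sha256(c).digest() for c in chunks_b}
    return len(digests_a & digests_b)


if __name__ == "__main__":
    rng = random.Random(1)
    original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
    edited = b"!" + original  # insert a single byte at the front
    print("fixed  :", shared(split_fixed(original), split_fixed(edited)))
    print("dynamic:", shared(split_dynamic(original), split_dynamic(edited)))
```

Inserting one byte at the front shifts every fixed-size chunk, so none of them can be reused. The dynamic splitter picks its cut points from the content itself, so after the first chunk or so the boundaries should line up again and most chunks deduplicate. That is the "deduplicating segments not at block boundaries" part of the question.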