Questions I cannot find an answer to

qwuy1290 · April 29, 2021, 8:29am

Hi.
I have recently discovered Kopia and it looks extremely promising!
I am currently using Duplicacy. It has worked mostly fine, but it lacks some important features that Kopia has (mounting snapshots, zstd compression, better CLI).
However I cannot find any information about the following four things that keep me from switching:

1. Will RSA encryption ever be supported? Duplicacy allows using both password and RSA key encryption. Though, I am not really sure it is needed, since in some years RSA will surely be broken by quantum computers.
2. Does it handle partial snapshots? Let’s say that the PC turns off during a snapshot. I can think of three things possibly happening: the repository is ruined, the partial snapshot is deleted on the next run, or the partial snapshot can be resumed on the next run. Which one applies to Kopia?
3. Does it support lock-free backups? This is the main selling point of Duplicacy and seems quite revolutionary to me, since other backup tools either don’t support it, need a server running, or use exclusive locking.
4. Speed of pruning. I used Restic before Duplicacy, but switched over mainly because pruning was extremely slow on Restic. I cannot find any benchmark comparing Kopia with the other two. If anyone has used them, how do they compare with Kopia?

jkowalski · April 29, 2021, 1:22pm

1. Kopia uses symmetric encryption today (AES256-GCM and CHACHA20-POLY1305), but any authenticated encryption scheme is easily pluggable.
2. Yes, Kopia is designed to handle all kinds of crashes and ideally not redo work that has been done. During long snapshots Kopia will write checkpoints every 45 minutes or so, which will be reused on the next snapshot attempt to not only avoid uploading data again but in many cases also avoid hashing. The partial snapshot is transparently merged with the last full snapshot from the history to get good incremental performance.
3. Yes. Kopia runs lock-free in all situations. It has an optional server mode, but that does not introduce locks and is primarily for better access control and to avoid storing low-level repository credentials on client machines. Instead of locks, Kopia relies on the passage of time for the safety of its maintenance operations, so it requires reasonably synchronized clocks (drift of several seconds to minutes is fine, but hours not so much).
4. Kopia uses a multi-stage maintenance routine to perform what prune does in Restic and others. It’s described in “Details of maintenance command” (post #8 by jkowalski) and I think it’s quite fast:

On my main personal repository of 730GB and 1.5M contents (file chunks), I’m running full maintenance every 4 hours and it currently takes less than 40 seconds to complete the full cycle (on my home internet which is 500Mbps symmetrical). Full maintenance performs a walk of the snapshot tree and deletes unreachable contents. This is possible through efficient index structures and separate cache for metadata and data and lots of heavy parallelization and sharding to efficiently use all local machine and network resources.

The following stats show that Kopia maintenance repackages virtually all data into pack blobs of around 22.4 MB each.

$ kopia blob stats
Count: 32514
Total: 729.9 GB
Average: 22.4 MB
Histogram:

       70 between 100 B and 1 KB (total 12.6 KB)
      160 between 1 KB and 10 KB (total 689.2 KB)
        1 between 10 KB and 100 KB (total 49.5 KB)
       39 between 100 KB and 1 MB (total 14.2 MB)
        2 between 1 MB and 10 MB (total 12.4 MB)
    32242 between 10 MB and 100 MB (total 729.8 GB)

This shows that the total size of live data is very close to the physical storage size: 729.9 GB of blobs (physical) vs 722.8 GB of contents (logical).

$ kopia content stats
Count: 1493817
Total: 722.8 GB
Average: 483.8 KB
Histogram:

    83096 between 10 B and 100 B (total 4.3 MB)
   618051 between 100 B and 1 KB (total 267 MB)
   352019 between 1 KB and 10 KB (total 1.1 GB)
   111774 between 10 KB and 100 KB (total 3.6 GB)
    78833 between 100 KB and 1 MB (total 35 GB)
   250044 between 1 MB and 10 MB (total 682.7 GB)

Note that content sizes are not related to source file sizes.
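
As a quick check of the physical-versus-logical comparison, the totals reported by the two stats commands above can be compared directly. This is only arithmetic on the numbers shown in this post; the variable names are made up for illustration.

```python
# Totals copied from the `kopia blob stats` and `kopia content stats` output above.
blob_total_gb = 729.9      # physical size of pack blobs
content_total_gb = 722.8   # logical size of live contents
blob_count = 32514
content_count = 1493817

# Difference between what is stored and what is live, in GB and as a percentage.
difference_gb = blob_total_gb - content_total_gb
difference_pct = 100 * difference_gb / content_total_gb

# How many contents end up in one pack blob on average.
contents_per_blob = content_count / blob_count

print(f"difference: {difference_gb:.1f} GB ({difference_pct:.2f}% of live data)")
print(f"average contents per blob: {contents_per_blob:.1f}")
```

In other words, only about 1% of the stored bytes are not live content, and each pack blob holds a few dozen contents on average.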

I know folks use Kopia for repositories of tens of TBs; I’d be curious to know their stats as well.

jkowalski · April 29, 2021, 1:28pm

One more caveat to the “lock-free” statement:

While snapshots are performed completely lock-free, Kopia does currently require certain parts of maintenance to be performed by a single, dedicated user. Running maintenance in parallel to ongoing snapshots is ok and supported.

Kopia manages a maintenance lock file on the local machine to ensure no two local sessions can run maintenance in parallel, and ensures that only the designated username@hostname is allowed to run maintenance across all machines.

qwuy1290 · April 29, 2021, 1:39pm

First of all, thanks a lot for the very detailed answer and for all the work you have done, it’s seriously incredible!
Yeah, Duplicacy allows doing “maintenance” completely lock-free. Though, for now I don’t really need it, and Kopia is way more feature-complete.
I’m very happy to hear that snapshots can be recovered, awesome!
Will switch to Kopia asap!

Regarding encryption… of the two methods available, what is the preferred one?

qwuy1290 · April 29, 2021, 1:45pm

Though, what does repackaging mean exactly? I mean, does Kopia download blobs and repackage them at each maintenance?
If yes, that would become expensive with object storage like S3… IIRC Duplicacy packs everything in 1 MB chunks, so that they can easily be deleted on snapshot “pruning”.

jkowalski · April 29, 2021, 1:55pm

The default is AES256-GCM encryption with BLAKE2B-256-128 hash, which tends to be faster on modern 64-bit Intel/AMD CPUs, but this is very machine-dependent. You can run the benchmark yourself using:

$ kopia benchmark crypto

On my 2020 Mac Laptop AES256-GCM wins over CHACHA20-POLY1305 by a lot:

     Hash                 Encryption           Throughput
-----------------------------------------------------------------
  0. BLAKE3-256           AES256-GCM-HMAC-SHA256 1.9 GiB / second
  1. BLAKE3-256-128       AES256-GCM-HMAC-SHA256 1.8 GiB / second
  2. BLAKE3-256-128       CHACHA20-POLY1305-HMAC-SHA256 1.1 GiB / second
  3. BLAKE3-256           CHACHA20-POLY1305-HMAC-SHA256 867.9 MiB / second
...

On my Apple Silicon (ARM) Mac Mini the results are different: AES256 still wins hands down over CHACHA20-POLY1305, but BLAKE is slower and SHA hashes are much faster than on Intel:

  0. HMAC-SHA256-128      AES256-GCM-HMAC-SHA256 1.6 GiB / second
  1. HMAC-SHA256          AES256-GCM-HMAC-SHA256 1.6 GiB / second
  2. HMAC-SHA224          AES256-GCM-HMAC-SHA256 1.6 GiB / second
  3. BLAKE2B-256-128      AES256-GCM-HMAC-SHA256 776.6 MiB / second
  4. BLAKE2B-256          AES256-GCM-HMAC-SHA256 684.2 MiB / second
  5. HMAC-SHA256          CHACHA20-POLY1305-HMAC-SHA256 624 MiB / second
  6. HMAC-SHA224          CHACHA20-POLY1305-HMAC-SHA256 622.7 MiB / second
  7. HMAC-SHA256-128      CHACHA20-POLY1305-HMAC-SHA256 622.4 MiB / second

On a Raspberry Pi 4 (low-end ARM64) the story is completely different, with CHACHA20 winning by a lot:

  0. BLAKE2B-256-128      CHACHA20-POLY1305-HMAC-SHA256 73.1 MiB / second
  1. BLAKE2B-256          CHACHA20-POLY1305-HMAC-SHA256 72.6 MiB / second
  2. BLAKE3-256           CHACHA20-POLY1305-HMAC-SHA256 70.3 MiB / second
  3. BLAKE3-256-128       CHACHA20-POLY1305-HMAC-SHA256 70.2 MiB / second
  4. HMAC-SHA3-224        CHACHA20-POLY1305-HMAC-SHA256 57.7 MiB / second
  5. HMAC-SHA3-256        CHACHA20-POLY1305-HMAC-SHA256 56.1 MiB / second
  6. BLAKE2S-256          CHACHA20-POLY1305-HMAC-SHA256 53.5 MiB / second
  7. BLAKE2S-128          CHACHA20-POLY1305-HMAC-SHA256 53.4 MiB / second
  8. HMAC-SHA256-128      CHACHA20-POLY1305-HMAC-SHA256 30.2 MiB / second
  9. HMAC-SHA256          CHACHA20-POLY1305-HMAC-SHA256 30.2 MiB / second
 10. HMAC-SHA224          CHACHA20-POLY1305-HMAC-SHA256 30.2 MiB / second
 11. BLAKE2B-256-128      AES256-GCM-HMAC-SHA256 19.3 MiB / second
 12. BLAKE3-256           AES256-GCM-HMAC-SHA256 19.1 MiB / second
 13. BLAKE3-256-128       AES256-GCM-HMAC-SHA256 19.1 MiB / second
 14. BLAKE2B-256          AES256-GCM-HMAC-SHA256 19 MiB / second
 15. HMAC-SHA3-224        AES256-GCM-HMAC-SHA256 18.1 MiB / second
 16. HMAC-SHA3-256        AES256-GCM-HMAC-SHA256 17.9 MiB / second
 17. BLAKE2S-256          AES256-GCM-HMAC-SHA256 17.6 MiB / second
 18. BLAKE2S-128          AES256-GCM-HMAC-SHA256 17.6 MiB / second
 19. HMAC-SHA256-128      AES256-GCM-HMAC-SHA256 14 MiB / second
 20. HMAC-SHA224          AES256-GCM-HMAC-SHA256 14 MiB / second
 21. HMAC-SHA256          AES256-GCM-HMAC-SHA256 14 MiB / second

On the same Raspberry Pi hardware but in 32-bit mode (ARMHF) the results are different still:

  0. BLAKE3-256           CHACHA20-POLY1305-HMAC-SHA256 25.6 MiB / second
  1. BLAKE3-256-128       CHACHA20-POLY1305-HMAC-SHA256 25.5 MiB / second
  2. BLAKE2S-256          CHACHA20-POLY1305-HMAC-SHA256 22.6 MiB / second
  3. BLAKE2S-128          CHACHA20-POLY1305-HMAC-SHA256 22.5 MiB / second
  4. BLAKE2B-256-128      CHACHA20-POLY1305-HMAC-SHA256 19.6 MiB / second
  5. BLAKE2B-256          CHACHA20-POLY1305-HMAC-SHA256 19.5 MiB / second
  6. HMAC-SHA256-128      CHACHA20-POLY1305-HMAC-SHA256 19.2 MiB / second
  7. HMAC-SHA256          CHACHA20-POLY1305-HMAC-SHA256 19.2 MiB / second
  8. HMAC-SHA224          CHACHA20-POLY1305-HMAC-SHA256 19.2 MiB / second
  9. HMAC-SHA3-224        CHACHA20-POLY1305-HMAC-SHA256 17 MiB / second
 10. HMAC-SHA3-256        CHACHA20-POLY1305-HMAC-SHA256 16.6 MiB / second
 11. BLAKE3-256           AES256-GCM-HMAC-SHA256 14.6 MiB / second
 12. BLAKE3-256-128       AES256-GCM-HMAC-SHA256 14.6 MiB / second
 13. BLAKE2S-256          AES256-GCM-HMAC-SHA256 13.6 MiB / second
 14. BLAKE2S-128          AES256-GCM-HMAC-SHA256 13.6 MiB / second
 15. BLAKE2B-256          AES256-GCM-HMAC-SHA256 12.4 MiB / second
 16. BLAKE2B-256-128      AES256-GCM-HMAC-SHA256 12.4 MiB / second
 17. HMAC-SHA256-128      AES256-GCM-HMAC-SHA256 12.3 MiB / second
 18. HMAC-SHA256          AES256-GCM-HMAC-SHA256 12.3 MiB / second
 19. HMAC-SHA224          AES256-GCM-HMAC-SHA256 12.2 MiB / second
 20. HMAC-SHA3-224        AES256-GCM-HMAC-SHA256 11.4 MiB / second
 21. HMAC-SHA3-256        AES256-GCM-HMAC-SHA256 11.1 MiB / second

jkowalski · April 29, 2021, 1:59pm

Kopia will only repackage blobs that are <80% full, so once a blob gets rewritten it stays that way unless enough “holes” inside it appear due to deletes to warrant another repackaging.

It would probably make sense to make some of those parameters (like this percentage threshold) tweakable in the future. Today you can only tweak the frequency of maintenance.
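
The repackaging rule can be sketched as follows. This only illustrates the 80% threshold described above and is not Kopia's actual code; the `Blob` type, its fields, and `needs_repacking` are hypothetical names.

```python
from dataclasses import dataclass

# Threshold taken from the post above; Kopia's real implementation may differ.
FULLNESS_THRESHOLD = 0.80


@dataclass
class Blob:
    """Hypothetical view of a pack blob: total bytes and bytes still referenced."""
    blob_id: str
    total_bytes: int
    live_bytes: int


def needs_repacking(blob: Blob, threshold: float = FULLNESS_THRESHOLD) -> bool:
    """Return True when less than `threshold` of the blob is live data."""
    if blob.total_bytes == 0:
        return False
    return blob.live_bytes / blob.total_bytes < threshold


blobs = [
    Blob("p1", total_bytes=22_000_000, live_bytes=21_500_000),  # almost full
    Blob("p2", total_bytes=22_000_000, live_bytes=12_000_000),  # many holes
]
to_rewrite = [b.blob_id for b in blobs if needs_repacking(b)]
print(to_rewrite)
```

A blob that has just been rewritten is close to 100% full, so it is skipped on later runs until deletes push it back under the threshold.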

qwuy1290 · April 29, 2021, 2:09pm

Ok now it makes sense!
Thanks a lot for all the info

bbccdd · April 30, 2021, 9:39am

Indeed, thanks for such fast and complete responses!

My stats are not that impressive (a disappointing 350 GB thanks to the deduplication). Deduplication works great, e.g. in our company we all have the same files in MS OneDrive, which only get stored once.

qwuy1290 · April 30, 2021, 4:23pm

Just in case someone finds this comparison useful…

The following describes the main differences in chunk (or blob, in Kopia terminology) management between Duplicacy and Kopia.

Duplicacy:

- Chunks have a default size of 4 MB. This size is highly configurable, and chunks can be configured with variable or fixed sizes; the latter are very efficient for backing up large files such as VMs, TrueCrypt volumes, etc.
- For every snapshot, it stores the list of chunks used by it.
- If a file is changed on disk, only the chunks related to the new version of the file are uploaded; the remaining chunks, related to the other, unmodified files, are reused from storage and referenced in the new revision (the term for snapshot in Duplicacy).
- When pruning, if a chunk is no longer required by any snapshot, the chunk is deleted.
- Duplicacy does not need a central index, since each chunk tracks which files (or parts of files) it stores.
- Due to the simple create/delete approach to chunks, no maintenance is needed.

Kopia:

- Chunks have a default size of 22 MB.
- Since chunks are larger here, it is more likely that several files end up inside one chunk than that one file is split into several chunks (as in Duplicacy).
- Kopia has an index which tracks in which chunk (here called blob) each file is stored.
- If a file is changed on disk, a new snapshot will create a new chunk that includes just that file (and any new files, obviously), and the original chunk is kept.
- Kopia needs maintenance of both metadata (index) and data, so it has to be run periodically.
- Maintenance of metadata reorganizes the index to ensure high performance when parsing it.
- Maintenance of data consists of recreating chunks which are “less than 80% full”, meaning less than 80% of their content is files actually referenced by some snapshot.

Hope this is accurate enough.

This came out of a PM with @TowerBR … Thanks for the corrections!

After all this, I think I much prefer the Kopia approach, seems more resilient!

jkowalski · April 30, 2021, 7:37pm

I think it’s important to be precise here. What you call chunks are two different things, actually.

We have objects, contents and blobs.

Objects are basically source files and directories of unlimited length.

Contents are parts of files after splitting; they are generally 4-16 MB in size and are individually compressed and encrypted.

Blobs are the files you see stored in the storage, usually around 20-30 MB each. Multiple contents are packed into a single blob to avoid millions of tiny files in the repository and make it easier to manage.
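
A rough way to picture the three layers is to split an object into contents and then pack contents into blobs. The sketch below is a simplification under stated assumptions: it uses fixed-size splitting and a greedy packer, skips compression and encryption, and all names and sizes (`split_object`, `pack_contents`, `CONTENT_SIZE`, `BLOB_TARGET`) are hypothetical rather than taken from Kopia.

```python
from typing import List

CONTENT_SIZE = 4 * 1024 * 1024   # assumed split size (low end of the 4-16 MB range)
BLOB_TARGET = 20 * 1024 * 1024   # assumed pack size (low end of the 20-30 MB range)


def split_object(size: int, content_size: int = CONTENT_SIZE) -> List[int]:
    """Split an object of `size` bytes into a list of content sizes."""
    full, rest = divmod(size, content_size)
    return [content_size] * full + ([rest] if rest else [])


def pack_contents(contents: List[int], target: int = BLOB_TARGET) -> List[List[int]]:
    """Greedily group contents into blobs, closing a blob once it reaches `target`."""
    blobs: List[List[int]] = []
    current: List[int] = []
    for size in contents:
        current.append(size)
        if sum(current) >= target:
            blobs.append(current)
            current = []
    if current:
        blobs.append(current)  # last, partially filled blob
    return blobs


# One 50 MiB object plus three small 1 KiB files, each small file being one content.
contents = split_object(50 * 1024 * 1024) + [1024] * 3
blobs = pack_contents(contents)
print(len(contents), "contents packed into", len(blobs), "blobs")
```

The point of the packing step is visible in the last blob: the three tiny contents share a blob with the tail of the large object instead of becoming three tiny files in storage.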

TowerBR · May 1, 2021, 11:54am

Exactly. See if this comparison is correct:

                     Kopia      Duplicacy
files                objects    ----
split files          contents   ----
split files stored   blobs      chunks

jkowalski · May 1, 2021, 3:31pm

I can’t comment since I never used duplicacy.

jkowalski · May 1, 2021, 3:49pm

I guess from the description you can say that the primary difference is that Kopia does packing of contents into blobs.

BTW, I haven’t read the Duplicacy paper yet, but I’m really curious how it deals with the inherent race condition in systems like this: somebody does a “purge” to get rid of a dead chunk (one that is no longer referenced by any snapshot) while another snapshot is being created that makes the exact same chunk alive. Kopia handles it through a very non-trivial protocol which relies on the passage of time.
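
One generic way to avoid this kind of race is to delete a chunk only after it has been unreferenced for longer than some safety margin, on the assumption that any snapshot in flight finishes within that margin. The sketch below illustrates that general idea only. It is not Kopia's actual protocol (nor Duplicacy's), and the margin value and all names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set

# Hypothetical safety margin; a real system derives this from its own protocol.
SAFETY_MARGIN = timedelta(hours=24)


def safe_to_delete(
    unreferenced_since: Dict[str, datetime],
    live_now: Set[str],
    now: datetime,
    margin: timedelta = SAFETY_MARGIN,
) -> List[str]:
    """Return chunk IDs that are not live and have been dead for longer than `margin`."""
    return sorted(
        chunk_id
        for chunk_id, since in unreferenced_since.items()
        if chunk_id not in live_now and now - since > margin
    )


now = datetime(2021, 5, 1, 12, 0)
unreferenced_since = {
    "a": now - timedelta(days=3),     # dead for days: deletable
    "b": now - timedelta(minutes=5),  # only just became dead: keep for now
    "c": now - timedelta(days=3),     # re-referenced by a new snapshot: keep
}
print(safe_to_delete(unreferenced_since, live_now={"c"}, now=now))
```

A scheme like this only works if the machines involved roughly agree on the current time, which fits the requirement for reasonably synchronized clocks mentioned earlier in the thread.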

qwuy1290 · May 1, 2021, 3:55pm

Yeah that paper is quite illuminating, you will for sure find it interesting!
Maybe, someday, the same principles could be used in Kopia itself

jkowalski · May 1, 2021, 4:25pm

Have you done any comparison in terms of repository size, upload times, purge/maintenance time between Kopia and Duplicacy? I’m curious how they scale with large repository sizes and with lots of snapshots.

From a cursory glance at the paper I see a huge difference in how manifests are stored: in Kopia, manifests are trivial and are just roots of snapshots, so it’s easy to have even millions of snapshots and they easily fit in RAM. It looks like in Duplicacy they can be quite large and thus could be difficult to manipulate at scale.

I did a quick google search for “duplicacy vs kopia” trying to see if somebody already did the comparison and found this post:

Which says something scary:

Just don’t let prunes run while you’re backing up (and vice-versa??).

This is supported in Kopia, but this comment seems to be indicating it’s not supported in Duplicacy. I’d be curious to learn more.

qwuy1290 · May 1, 2021, 5:26pm

I still haven’t tried Kopia, so can’t comment for now.
Regarding Duplicacy, pruning while snapshotting is surely supported. In fact, what they advertise for lock-free is concurrent snapshot and pruning.
Apparently the performance isn’t that great with a large number of snapshots.
Though, this is understandable given how it works: chunks are marked as fossils or deleted (two-step fossil collection) instantly when pruning.
If I understood correctly, in Kopia pruning a snapshot just marks it as deleted, while a subsequent “maintenance” run actually deletes the unused blobs from that snapshot.

TowerBR · May 1, 2021, 5:49pm

Let me give you a real example: I have a 134 GB folder.

Backups are performed to a bucket in B2, which today is 198 GB in size.

This bucket has exactly 432 snapshots (which in Duplicacy nomenclature are “revisions”), referring to backups made since 2019.

The last backup, which sent 17 MB of new files, took just over 1 minute (check the new files → split these files into chunks → check if these chunks are already in B2 → send the chunks):

INFO BACKUP_STATS 17,158K bytes uploaded
INFO BACKUP_STATS Total running time: 00:01:04

xxxliqu1dxxx · May 3, 2021, 4:53pm

Thank you for this great information. I am also trying out Duplicacy and wanted to try Kopia, and I am interested in knowing more about this product. I like what I have read about it so far!

Alessandro_Zarrilli · May 3, 2021, 10:16pm

As a former Duplicacy user, I’d like to add my tiny bit to the discussion. The reason why I began looking around and finally found Kopia had nothing to do with the algorithm efficiency, but with a much more basic usability issue: you can’t trust Duplicacy 0 exit status. See here:

github.com/gilbertchen/duplicacy

The 0 exit code is a little misleading

opened 09:43AM - 13 Feb 19 UTC by drsound

As it is implemented right now, the 0 exit code is a little misleading: as it happens for every other program I know, a 0 exit code should be returned if absolutely no problem occurred, not even a warning, so that, for example, you can use it in a script to send an email saying "backup successful".

I just executed a backup, a file was not backed up, but I still got a 0 exit code! Some questions I asked myself:

1. What if it was an important file?
2. What if 99% of the files for some reasons were not backed up? Would I still get a 0 exit code?
3. Should I always "grep" the backup output looking for signs of trouble?
4. What if I grep for some specific text but in some future release the warning message changes? I will think my backups are fine but I could miss some files!

I think exit codes were invented to avoid this kind of problems.

My proposal: why not to add another exit code in case of any warning or problem during the backup?

Here is my backup output that gave me 0 exit code, look at the last line:

```
Backup for / at revision 3 completed
Files: 1278499 total, 573,580M bytes; 183 new, 70,170K bytes
File chunks: 116957 total, 573,807M bytes; 11 new, 62,841K bytes, 19,645K bytes uploaded
Metadata chunks: 92 total, 482,955K bytes; 16 new, 99,754K bytes, 28,975K bytes uploaded
All chunks: 117049 total, 574,278M bytes; 27 new, 162,596K bytes, 48,620K bytes uploaded
Total running time: 00:02:30
1 file was not included due to access errors
```

Basically you can’t trust Duplicacy when it says “backup was fine: relax, your data is safe!”
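
When a tool does not report warnings through its exit status, a wrapper script has to look at both the exit code and the output, which is exactly the fragile approach the issue complains about. Here is a minimal sketch of such a check; `backup_succeeded` and `WARNING_PATTERNS` are hypothetical names, and the only message taken from the issue is the "not included due to access errors" line.

```python
import re
from typing import Iterable

# Output patterns treated as failures. The wording comes from the issue quoted
# above; a future release could change it, which is why this approach is fragile.
WARNING_PATTERNS = [
    re.compile(r"not included due to access errors"),
]


def backup_succeeded(exit_code: int, output: str,
                     patterns: Iterable[re.Pattern] = WARNING_PATTERNS) -> bool:
    """Treat a run as successful only if it exited 0 and printed no known warning."""
    if exit_code != 0:
        return False
    return not any(p.search(output) for p in patterns)


log = """Backup for / at revision 3 completed
Total running time: 00:02:30
1 file was not included due to access errors"""

# Exit code 0, but the log contains a known warning.
print(backup_succeeded(0, log))
```

A script sending a "backup successful" email would call this check instead of trusting the exit code alone, at the cost of having to keep the pattern list in sync with the tool's messages.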
