I will likely update the report a couple times as the tools improve over time.
For bupstash, I’ve implemented (but not yet benchmarked) a WIP change that parallelizes stat() calls per directory, since stat()ing is its current bottleneck for my use case.
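For illustration only (bupstash itself is Rust, and this is not its code), here is a minimal Go sketch of the idea: a bounded worker pool issues the lstat() calls for a directory's entries concurrently, so the per-file round-trip latency overlaps instead of being paid serially, which matters most on high-latency networked file systems:

```go
// Illustration of per-directory parallel stat()ing; not bupstash's code.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

func statDirParallel(dir string, workers int) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	names := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range names {
				// Each worker stat()s independently, so latencies overlap.
				if info, err := os.Lstat(filepath.Join(dir, name)); err == nil {
					fmt.Println(info.Name(), info.Size(), info.ModTime())
				}
			}
		}()
	}
	for _, e := range entries {
		names <- e.Name()
	}
	close(names)
	wg.Wait()
	return nil
}

func main() {
	if err := statDirParallel(".", 16); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```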
For kopia, my main request is to make the number of threads it uses configurable at runtime (instead of it being hardcoded to 16), as my networked file system would benefit a lot from that. There are also a couple of issues I found (and linked).
If you find answers to any of the open questions in there (e.g. why my kopia run didn’t deduplicate the data within the first run on the “4 GB, small files” dataset), I would also appreciate it if you answered them here or filed an issue in my report’s repo.
Interesting report. Kopia generally deduplicates across “contents”, which are sections of large files between 1 MB and 8 MB - this is to keep the number of entries in the index low. It also makes deduplication across small files ineffective, but those files are typically compressible (esp. log/source files).
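To make that concrete, here is a rough sketch (in Go, and not kopia's actual splitter code) of content-defined chunking with a 1 MiB minimum and an 8 MiB maximum chunk size: a large file splits into multi-MiB chunks that can deduplicate across files and snapshots, while a file smaller than the minimum becomes a single chunk that only deduplicates against identical content.

```go
// Rough sketch of content-defined chunking with 1 MiB / 8 MiB bounds.
// NOT kopia's implementation; it only illustrates why small files end up
// as single chunks while large files split into multi-MiB "contents".
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

const (
	minChunk = 1 << 20       // 1 MiB lower bound: no boundary before this
	maxChunk = 8 << 20       // 8 MiB upper bound: force a boundary here
	boundary = (1 << 21) - 1 // boundary test fires on average every ~2 MiB
)

// splitChunks scans r and calls emit for each chunk. A crude rolling value
// decides boundaries; real splitters (e.g. buzhash) are more robust.
func splitChunks(r io.Reader, emit func([]byte)) error {
	buf := make([]byte, 0, maxChunk)
	in := make([]byte, 64*1024)
	var roll uint32
	for {
		n, err := r.Read(in)
		for _, b := range in[:n] {
			buf = append(buf, b)
			roll = (roll << 1) + uint32(b)
			if (len(buf) >= minChunk && roll&boundary == 0) || len(buf) >= maxChunk {
				emit(buf)
				buf, roll = buf[:0], 0
			}
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
	}
	if len(buf) > 0 {
		emit(buf) // a file smaller than minChunk is a single chunk
	}
	return nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: chunks <file>")
		os.Exit(1)
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()
	if err := splitChunks(f, func(chunk []byte) {
		// A dedup index would key each chunk by its hash; identical chunks
		// across files and snapshots collapse to one stored blob.
		fmt.Printf("%x  %d bytes\n", sha256.Sum256(chunk), len(chunk))
	}); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```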
If you can, try with:
$ kopia policy set --global --compression=zstd-fastest
and pick the compression method that is fastest on your machine.
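If you're not sure which one wins on your hardware, kopia can benchmark the built-in compression methods against a sample file (the path below is just a placeholder, and the exact flags may differ between versions):

$ kopia benchmark compression --data-file=/path/to/representative-file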
You have some excellent points about parallelism, memory consumption, etc. Those are all things we should improve over time, and I’d be happy to review PRs for them.
Definitely please file individual GH issues for proposed improvements. That would be highly appreciated.