Low read performance per distinct hashing process

I was wondering why my test backup of a 9 TB KVM disk storage, which hosts approx. 250 virtual disks ranging from 12 GB up to 285 GB, would never complete and kept getting restarted by the 45 min checkpoints. I looked at the issues on GitHub, and this one seems to describe the symptom:

May never complete snapshot of a large file #463

I was actually reading the vdisks from a CephFS, and the aggregate throughput was approx. 380 MB/s. But now that most of the hashing processes have finished, I found that a single hashing process only seems to be able to read at about 35 MB/s from the storage device. That is way less than what I achieve with dd and a block size of 4k, which amounts to 150 MB/s and would be enough to fit the largest 285 GB file into the window.

However, reading at only 35 MB/s, this file would take 138 minutes and thus will never be successfully read within the 45 min checkpoint timeframe.

So my question is really straightforward: why does a single hashing process only achieve 35 MB/s and can anything be done about this?

“Kopia has an upper limit for the snapshot time and when exceeding that limit, it “checkpoints” its current progress and “restarts” the snapshot… On restart, (whole) files that have already been backed up, are skipped.”

Try to use: kopia snapshot create --checkpoint-interval=180m …
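To get a feel for how long that interval needs to be: the largest single file has to be read and hashed within one checkpoint window. A back-of-envelope sketch using the numbers from this thread (285 GB largest vdisk, ~35 MB/s observed single-stream rate); this is plain arithmetic, not anything kopia computes for you:

```go
package main

import "fmt"

// Back-of-envelope only: the checkpoint interval must cover the time needed to
// read and hash the largest single file at the observed single-stream rate.
// Figures are taken from this thread (285 GB largest vdisk, ~35 MB/s).
func main() {
	const largestGiB, mbPerSec = 285.0, 35.0
	minutes := largestGiB * 1024 / mbPerSec / 60
	fmt.Printf("worst case: ~%.0f min, so --checkpoint-interval=180m leaves some headroom\n", minutes)
}
```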

Yes, currently kopia will checkpoint and restart the upload after ~45 minutes. This is needed to ensure that no repository client ever goes for more than 1 hour without flushing all written files (so that parallel garbage collection can rely on all files older than 1 h having been properly anchored in one of the snapshot manifests).

After restarting the upload most files will be very quickly skipped because they have already been uploaded and their metadata matches what was uploaded. The problem is for very large files which may take a long time to re-read and re-hash. Upload is not needed because of deduplication, but re-hashing does take a long time.

Read/hashing parallelization is done on a file-by-file basis and very large files are necessarily single-threaded because of how splitting works (file must be read sequentially).

I have 2 ideas for improving this:

  1. When checkpointing after 45 minutes, do not restart the entire upload; instead, checkpoint the partial file as-is and keep going. This is unfortunately easier said than done because of the code structure, which is why we have the current solution until we do a bigger refactoring.

  2. When snapshotting a very large file (say, over 1 GB) it may make sense to do striped reads and parallel hashing/uploads:

Basically, say we have a 20 GB file: we can read it as if it were 20 × 1 GB files (reads and uploads can go in parallel as we work through sections of the file). After upload we can write an intermediate object to concatenate them, which will be super quick. This introduces non-ideal split points at multiples of 1 GB, but given that split points are O(2-4 MB), that should not matter much in terms of deduplication efficiency.
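A minimal sketch of that striped-read idea, assuming a plain file path and using SHA-256 as a stand-in hash (this is not kopia's actual uploader code, just an illustration of how sections of one file can be read and hashed concurrently and then concatenated in order):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"sync"
)

const stripeSize = int64(1 << 30) // 1 GiB sections, as suggested above

func main() {
	// Hypothetical input path; only a sketch of striped reads, not kopia code.
	f, err := os.Open("/path/to/large-file")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	fi, _ := f.Stat()
	size := fi.Size()
	stripes := (size + stripeSize - 1) / stripeSize

	sums := make([][32]byte, stripes)
	var wg sync.WaitGroup

	// Read and hash each 1 GiB stripe concurrently. io.SectionReader gives each
	// goroutine an independent view of its byte range, so reads can overlap.
	for i := int64(0); i < stripes; i++ {
		wg.Add(1)
		go func(i int64) {
			defer wg.Done()
			sec := io.NewSectionReader(f, i*stripeSize, stripeSize)
			h := sha256.New()
			io.Copy(h, sec)
			copy(sums[i][:], h.Sum(nil))
		}(i)
	}
	wg.Wait()

	// This is where an intermediate object would concatenate the per-stripe
	// objects in order; here we just print the stripe digests.
	for i, s := range sums {
		fmt.Printf("stripe %d: %x\n", i, s)
	}
}
```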

If anybody is interested in working on either of these features, please let me know, I’ll be happy to guide you through the codebase - those could be some exciting improvements.

Hmm… maybe this is just a use case that is not suitable for Kopia and its checkpointing. If a checkpoint occurs, all the files in progress will be re-hashed, right? So in my case there are 391 files to be backed up, and 25 of them take longer than the 45 min checkpointing timespan.

Now, I can enlarge the timespan for the checkpoint, but to what extent? There’s no way of telling what timespan would suffice other than setting it to the maximum amount of time the whole backup could possibly need - or rather disabling it completely in such cases.

However, this might then clash with the parallel garbage collection, which may have to be tweaked as well in such a case.

And you still haven’t answered the question about the ingress performance. Is it the hashing that limits each process/thread to 35 MB/s regardless of the throughput available from the underlying storage? Would choosing another hash algorithm be an option?

35 MB/s is not expected. kopia benchmark crypto measures the throughput of hashing and encryption, which should be upwards of 200-300 MB/s single-threaded on modern CPUs.

On top of that there’s also compression (if you use it) and splitting (kopia benchmark splitter), which on my machine is about 150 MB/s single-threaded.
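To put those two numbers together: assuming hashing/encryption and splitting run back-to-back on the same stream (an assumption made for illustration, not a statement about kopia's internals), the per-stream ceiling is the harmonic combination of the stage throughputs, which already sits well below either benchmark figure:

```go
package main

import "fmt"

// combined returns the throughput of several stages applied serially to the
// same stream: the per-byte costs add up, so the rates combine harmonically.
func combined(rates ...float64) float64 {
	var perByte float64
	for _, r := range rates {
		perByte += 1 / r
	}
	return 1 / perByte
}

func main() {
	// ~400 MB/s hash+encrypt and ~150 MB/s splitter, the single-threaded
	// benchmark figures mentioned in this thread.
	fmt.Printf("serial pipeline ceiling: ~%.0f MB/s\n", combined(400, 150)) // ≈ 109 MB/s
}
```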

Kopia can achieve best throughput when the upload is parallelized, which is not implemented yet for individual large files but can and will be in the future.

Well… kopia benchmark crypto actually shows reasonable results:

     Hash              Encryption                      Throughput
  1. BLAKE2B-256-128   AES256-GCM-HMAC-SHA256          413.8 MiB / second
  2. BLAKE2B-256       AES256-GCM-HMAC-SHA256          399.8 MiB / second
  3. BLAKE2B-256-128   CHACHA20-POLY1305-HMAC-SHA256   392.5 MiB / second
  4. BLAKE2B-256       CHACHA20-POLY1305-HMAC-SHA256   386.8 MiB / second

However, I don’t know what to make of the kopia benchmark splitter test:

  0. FIXED-1M                    2527 ms count:512 min:1048576 10th:1048576 25th:1048576 50th:1048576 75th:1048576 90th:1048576 max:1048576
  1. FIXED-4M                    2534 ms count:128 min:4194304 10th:4194304 25th:4194304 50th:4194304 75th:4194304 90th:4194304 max:4194304
  2. FIXED-2M                    2537 ms count:256 min:2097152 10th:2097152 25th:2097152 50th:2097152 75th:2097152 90th:2097152 max:2097152
  3. FIXED-8M                    2537 ms count:64 min:8388608 10th:8388608 25th:8388608 50th:8388608 75th:8388608 90th:8388608 max:8388608
  4. FIXED                       2544 ms count:128 min:4194304 10th:4194304 25th:4194304 50th:4194304 75th:4194304 90th:4194304 max:4194304

What are these values supposed to show?

In my opinion the current behaviour limits the use cases for Kopia. I had a 20 GB+ file whose backup was never completing. I gzipped it and it is now 13 GB. It still never finishes.

 ! Ignored error when processing "cz_old_servers/plato/pgdata.tar.gz": canceled
 - 0 hashing, 293891 hashed (54.9 GB), 0 cached (0 B), 2897 uploaded (60.2 GB), 2 errors 26.0%
 ! Saving a checkpoint...

13GB sounds like a file that is small enough that one would not expect the process to fail.

Both of the suggested approaches sound like an improvement, but also like complex work that will take some time to implement. I wonder if there are any short-term measures that could be taken, such as:

  • Doing a snapshot before starting any file over 5 GB. This would improve the chances that files which take, say, 20 to 40 minutes to back up actually get done, since the file would be its own process.
  • If a file doesn’t finish in 45 minutes, then fail it. In my opinion it is better to have some files fail than to have a backup that never finishes and no backup at all.
  • For now I would strongly suggest advising users with large files not to use kopia. If the process can’t handle a single 13 GB file, then a backup with many such files has little chance of success.

Well, it seems to depend on the ability to process the ingress data fast enough. As I stated earlier, on my system each single job seems to be limited to 35 MB/s, but having multiple files being hashed actually drives the source storage towards its performance limit.

Can you determine what your ingress speed is for a single file?

I have set the checkpoint interval to 360m, which works just fine for my use case, where 10% of all files are larger than 92 GB and thus don’t fit into the default 45 m checkpoint timespan.

From a previous post on this thread I thought that there was some inherent need to have snapshots happen within an hour. Did you check that the maintenance and full-maintenance windows are longer than the 360-minute checkpoint interval?

Yes, I know. However, I haven’t yet figured out what @jkowalski meant when he talked about anchoring files in a snapshot manifest. I still assume that this would also work with larger intervals. If not, please tell me… :wink:

@francisco As @jkowalski suggested… have you checked the two kopia benchmarks? I was quite astonished that the high-powered Xeons in my VM nodes actually perform way worse than the i7 in my current MBP - at least when it comes to hashing/splitting.

Maybe I will recreate the repo and use another splitter algorithm that suits my CPU better.

I’ll prioritize the fix for checkpointing into the 0.7 release. In hindsight, restarting the entire upload process and leveraging caching to make it fast(er), while simple in code, was not the right thing to do.

Just ran splitter benchmark. What is the default?

>     -----------------------------------------------------------------
>       0. FIXED-8M                    2727 ms count:64 min:8388608 10th:8388608 25th:8388608 50th:8388608 75th:8388608 90th:8388608 max:8388608
>       1. FIXED-1M                    2740 ms count:512 min:1048576 10th:1048576 25th:1048576 50th:1048576 75th:1048576 90th:1048576 max:1048576
>       2. FIXED-2M                    2756 ms count:256 min:2097152 10th:2097152 25th:2097152 50th:2097152 75th:2097152 90th:2097152 max:2097152
>       3. FIXED-4M                    2854 ms count:128 min:4194304 10th:4194304 25th:4194304 50th:4194304 75th:4194304 90th:4194304 max:4194304
>       4. FIXED                       2865 ms count:128 min:4194304 10th:4194304 25th:4194304 50th:4194304 75th:4194304 90th:4194304 max:4194304
>       5. DYNAMIC-8M-BUZHASH          6719 ms count:49 min:677680 10th:5260579 25th:6528562 50th:11102775 75th:16777216 90th:16777216 max:16777216
>       6. DYNAMIC-2M-BUZHASH          6786 ms count:204 min:64697 10th:1210184 25th:1638276 50th:2585985 75th:3944217 90th:4194304 max:4194304
>       7. DYNAMIC-4M-BUZHASH          6833 ms count:107 min:9467 10th:2277562 25th:2971794 50th:4747177 75th:7603998 90th:8388608 max:8388608
>       8. DYNAMIC-1M-BUZHASH          6967 ms count:428 min:9467 10th:612999 25th:766808 50th:1158068 75th:1744194 90th:2097152 max:2097152
>       9. DYNAMIC                     7234 ms count:107 min:9467 10th:2277562 25th:2971794 50th:4747177 75th:7603998 90th:8388608 max:8388608
>      10. DYNAMIC-8M-RABINKARP       13523 ms count:60 min:1446246 10th:4337385 25th:5293196 50th:8419217 75th:12334953 90th:16777216 max:16777216
>      11. DYNAMIC-4M-RABINKARP       13543 ms count:110 min:535925 10th:2242307 25th:2767610 50th:4400962 75th:6813401 90th:8388608 max:8388608
>      12. DYNAMIC-2M-RABINKARP       13765 ms count:204 min:535925 10th:1235674 25th:1675441 50th:2525341 75th:3658905 90th:4194304 max:4194304
>      13. DYNAMIC-1M-RABINKARP       14074 ms count:391 min:213638 10th:623062 25th:813953 50th:1328673 75th:2097152 90th:2097152 max:2097152

[edit] Looking at that output I can’t tell which one is best. Do we want the lowest time in ms or the highest count? Is time = time to split? Is count = the number of blocks split per unit time?

The best seems to be the one with the lowest time. The default seems to be DYNAMIC-4M-BUZHASH, but on my node I was actually able to double the throughput of a single process from 35 MB/s to 70 MB/s by selecting FIXED-2M.
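One way to read that table, assuming every splitter run processes the same amount of data (count × block size works out to roughly 512 MiB in the outputs above, an inference from the output rather than documented behaviour): the time column converts directly into throughput, and the resulting ratio roughly matches the 35 MB/s → 70 MB/s jump reported here:

```go
package main

import "fmt"

// Interpretation of the splitter benchmark: every algorithm appears to process
// the same ~512 MiB of data (count × block size), so lower time means higher
// throughput. The 512 MiB figure is inferred from the output, not documented.
func main() {
	const totalMiB = 512.0
	results := []struct {
		name string
		ms   float64
	}{
		{"FIXED-8M", 2727},
		{"DYNAMIC-4M-BUZHASH", 6833},
		{"DYNAMIC-4M-RABINKARP", 13543},
	}
	for _, r := range results {
		fmt.Printf("%-22s ~%.0f MiB/s\n", r.name, totalMiB/(r.ms/1000))
	}
}
```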

I do think, though, that this will double the number of content fragments, so there’s a tradeoff here as well, since maintenance will now have to handle twice the fragments - at least this is what I believe. However, you should also get a better dedupe rate, at some expense.
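For a rough sense of scale, here is the chunk-count estimate for the 9 TB source discussed in this thread, assuming the average chunk size is close to the splitter's nominal size (an approximation; dynamic splitters vary around their target):

```go
package main

import "fmt"

// Rough chunk-count estimate for a 9 TB source, assuming the average chunk size
// is close to the splitter's nominal size. Only an approximation.
func main() {
	const totalMiB = 9.0 * 1024 * 1024 // 9 TiB expressed in MiB
	for _, avgMiB := range []float64{4, 2} {
		fmt.Printf("~%.1f million chunks at %.0f MiB average\n", totalMiB/avgMiB/1e6, avgMiB)
	}
}
```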

@budy what command did you use to specify a different hash?
kopia snapshot create --help
only shows:
--force-hash=0 Force hashing of source files for a given percentage of files [0..100]

Don’t see anything else related to splitter.

Use…

--block-hash=BLAKE2B-256-128 --object-splitter=FIXED-2M

when creating the repo, and make sure to run the benchmark tests on the node which runs the backup. In my case, the repo is located on a newer storage system whose CPU is as fast as the i7 in my MBP, but since I had been using this repo via kopia server, the actual client doing the work had an older CPU, which is far from that performance. You also can’t change this afterwards - or at least, I haven’t found a way to do it, so I had to re-create my repo from scratch.

So, not possible to change after creation?

[edit]. Just saw

Can’t find a way either. I guess I will have to create / test / migrate the repo.

FWIW fixed splitters are not ideal if you have files that are partially modified as they don’t perform content-based splitting.

So if you have a large video file (say 10 GB) and add one byte somewhere at the beginning (for example by changing embedded text metadata), the fixed splitter will need to reupload 10 GB. Content-based splitters (buzhash and Rabin-Karp) will detect the change and typically only need to upload one or two chunks (<10 MB total).

Fixed splitter will be much faster, though.
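A toy illustration of that behaviour (this is not kopia's buzhash/Rabin-Karp implementation, just a simple rolling-hash chunker, to show why content-defined boundaries realign after a one-byte insertion while fixed boundaries do not):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/rand"
)

// Toy content-defined chunker: cut when the low bits of a Rabin-Karp-style
// rolling hash over the last 48 bytes hit zero. Real splitters differ in
// detail, but the key property is the same: cut points depend on content.
func cdcChunks(data []byte) [][]byte {
	const (
		window = 48
		mask   = (1 << 12) - 1 // ~4 KiB average chunks in this toy
		prime  = 1099511628211
	)
	pow := uint64(1) // prime^window, to drop the byte leaving the window
	for i := 0; i < window; i++ {
		pow *= prime
	}
	var chunks [][]byte
	var h uint64
	start := 0
	for i, b := range data {
		h = h*prime + uint64(b)
		if i >= window {
			h -= uint64(data[i-window]) * pow
		}
		if h&mask == 0 && i+1-start >= 512 {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	return append(chunks, data[start:])
}

// Fixed chunker: cut every `size` bytes regardless of content.
func fixedChunks(data []byte, size int) [][]byte {
	var chunks [][]byte
	for len(data) > size {
		chunks = append(chunks, data[:size])
		data = data[size:]
	}
	return append(chunks, data)
}

// shared counts how many chunks of b already exist among the chunks of a.
func shared(a, b [][]byte) int {
	seen := map[[32]byte]bool{}
	for _, c := range a {
		seen[sha256.Sum256(c)] = true
	}
	n := 0
	for _, c := range b {
		if seen[sha256.Sum256(c)] {
			n++
		}
	}
	return n
}

func main() {
	data := make([]byte, 1<<20)
	rand.New(rand.NewSource(1)).Read(data)

	// Insert a single byte near the start of the "file".
	modified := append(append(append([]byte{}, data[:100]...), 'X'), data[100:]...)

	// Fixed: every boundary after the insertion shifts, so almost nothing
	// deduplicates. CDC: boundaries realign shortly after the insertion.
	fmt.Println("fixed chunks reused:", shared(fixedChunks(data, 4096), fixedChunks(modified, 4096)), "/", len(fixedChunks(modified, 4096)))
	fmt.Println("cdc chunks reused:  ", shared(cdcChunks(data), cdcChunks(modified)), "/", len(cdcChunks(modified)))
}
```

With the fixed chunker essentially no chunk hashes match after the insertion, while the content-defined chunker reuses all but a couple of chunks around the modified region.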

BTW, I finally have a fix for the painfully slow checkpointing. The fix avoids having to do restarts completely.

Quick demo:
https://asciinema.org/a/eSHQUHBduFAZBmTCAyknx2Nkr

Pull Request:

Yeah, but this really will depend on the kind of data, won’t it? Now I am just starting to think about setting up a dedicated kopia host for this use case: a CephFS client that excels at splitting performance… :wink: