I’ve recently started looking at Kopia as a potential replacement for our restic backup/restore workload. I am running in Azure, with an Azure storage account/container as the backup target and very fast NFS storage (Azure NetApp Files) as both the backup source and the restore target.
Backup is very fast, much, much faster than restic. Very happy with that (probably 300% faster for backup vs. restic).
Restore is a different issue. While it’s faster than restic (probably twice as fast), it’s still nowhere near as fast as I’d like it to be. I’m trying to figure out where the bottlenecks are. I’ve taken measurements in the Azure portal to ensure that I am not:
- CPU bound
- memory bound
- network bound
Based on my measurements, all of these have plenty of headroom. As a baseline I compare against azcopy, which can drive nearly 1 GB/s of traffic through to the ANF volume and has no issue with the network, etc. I understand this is not a completely equal comparison, since Kopia backup/restore does quite a bit more than copy bits from one place to another (essentially what azcopy does), but I’d still like to understand where the bottlenecks are with Kopia restore.
I have tried varying the number of parallel requests, using monster AKS worker node sizes, putting the cache on NVMe drives to speed that up, etc. At the end of the day, I cannot speed up the restore at all. I get roughly 150 MB/s of writes to the ANF volume, with low CPU and memory utilization on the AKS cluster, relatively low network throughput, and a low transaction rate/blob GETs.
Without profiling Kopia to see where it’s spending its time, I’m at a bit of a loss about the relatively weak restore performance. What else should I be looking at?
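One way to separate the restore target from Kopia itself would be to measure how fast many files can be written to the ANF volume at different levels of parallelism, and compare that with the roughly 150 MB/s seen during restore. Below is a minimal Python sketch of such a check. It is not Kopia-specific: the function name, file sizes, and the fsync-per-file behaviour are assumptions chosen for illustration, and they may not match how Kopia actually writes restored files.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def write_throughput_mb_s(target_dir, num_files, file_size, workers):
    """Write num_files files of file_size bytes each into target_dir using
    `workers` threads, and return the aggregate write rate in MB/s.

    Hypothetical helper for illustration only; it does not mimic Kopia's
    actual restore write pattern.
    """
    payload = os.urandom(file_size)  # same random payload reused for every file

    def write_one(index):
        path = os.path.join(target_dir, f"bench_{index:06d}.bin")
        with open(path, "wb") as handle:
            handle.write(payload)
            handle.flush()
            # Assumption: force data to the volume so the page cache
            # does not inflate the measured rate.
            os.fsync(handle.fileno())
        return file_size

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total_bytes = sum(pool.map(write_one, range(num_files)))
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e6
```

For example, calling `write_throughput_mb_s("/mnt/anf/bench", 2000, 1 << 20, workers)` for `workers` in 1, 8, and 32 (the mount path is a placeholder) shows whether the volume scales with concurrent writers. If the rate plateaus near the restore figure, the limit is more likely per-file write latency on the NFS mount than anything inside Kopia; if it scales well past it, the restore pipeline itself is the next place to look.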