Kopia sync-to rclone using webdav

I noticed that when I do kopia sync-to rclone, the spawned rclone command actually acts as a WebDAV server, and files are copied one by one over HTTP. As a result, it is very slow, and it always hangs forever after all files are copied.

I did some searching here and found Jarek’s comment: How does kopia repo sync-to works? - #2 by jkowalski . If all Kopia does in that command is copy missing files, why go through the complicated process of creating certificates and keys to set up WebDAV, when rclone sync exists and has the same effect? What can the existing method do that rclone sync can’t?

Originally I thought kopia sync-to was capable of selectively copying some blobs. For example, imagine it had a --latest switch that only copies the latest snapshot over. Also, Kopia’s switches such as --parallel, --dry-run, and --delete have direct counterparts in rclone.

After switching to a pure rclone command, I see a clear performance difference and no hang at the end.

For the record,

The spawned parent rclone process from kopia sync-to:

rclone serve webdav <rclone_dest> --addr 127.0.0.1:0 --cert C:\Users\CrendKing\AppData\Local\Temp\\webdav.cert --key C:\Users\CrendKing\AppData\Local\Temp\kopia-rclone069218207\webdav.key --htpasswd C:\Users\CrendKing\AppData\Local\Temp\kopia-rclone069218207\htpasswd

Child process:

rclone serve webdav <rclone_dest> --addr 127.0.0.1:0 --cert C:\Users\CrendKing\AppData\Local\Temp\kopia-rclone069218207\webdav.cert --key C:\Users\CrendKing\AppData\Local\Temp\kopia-rclone069218207\webdav.key --htpasswd C:\Users\CrendKing\AppData\Local\Temp\kopia-rclone069218207\htpasswd

My replacement command for dry-run check:

rclone check <kopia_repo> <rclone_dest> --size-only --one-way

For actual syncing:

rclone sync <kopia_repo> <rclone_dest>

Excellent question, this needs a longer explanation.

When you use the rclone backend today, Kopia does indeed launch the rclone binary as a server which runs on localhost and serves files to Kopia over WebDAV. Kopia is not really using any of rclone’s advanced capabilities this way - it relies only on simple WebDAV primitives (get/list/put/delete), which rclone adapts to whatever storage you’re connected to. This allows Kopia to avoid copying/linking all of rclone’s code while still getting access to all its features.
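For a concrete picture of what those primitives amount to, the interface boils down to something like the following (a Python sketch of the general shape only - Kopia’s real interface is written in Go, and the class and method names here are invented for illustration):

```python
# Illustrative get/list/put/delete blob interface, here backed by
# a plain local directory. Any storage that can implement these
# four operations can serve as a backend in this style.
import os

class DirBlobStore:
    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, blob_id, data):
        with open(os.path.join(self.root, blob_id), "wb") as f:
            f.write(data)

    def get(self, blob_id):
        with open(os.path.join(self.root, blob_id), "rb") as f:
            return f.read()

    def list(self, prefix=""):
        return sorted(n for n in os.listdir(self.root) if n.startswith(prefix))

    def delete(self, blob_id):
        os.remove(os.path.join(self.root, blob_id))
```

Because the surface area is this small, rclone’s WebDAV server can translate it to dozens of storage providers without Kopia knowing anything about them.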

So why does Kopia do this weird thing?

Kopia uses an abstract backend interface called BLOB Storage which provides certain guarantees, such as atomic operations and horizontal scalability (the ability to have millions of files in a single bucket/folder). There are multiple implementations of BLOB storage: cloud providers (like GCS, S3, Azure, B2) and others with less certain behaviors (filesystem, SFTP, WebDAV, rclone).

Cloud storage backends are generally well-behaved: they natively support atomic operations and use “flat” storage organization, because they are really good at scaling, so having a bucket with tens of millions of files is not a problem.

Other backends (like filesystem, SFTP, WebDAV) generally sit on a POSIX-like filesystem, so Kopia must emulate atomicity and use sharding to spread the load over multiple directories; otherwise file operations can become really slow if directories get really large.
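The two emulation techniques mentioned above - sharding and atomic writes - can be sketched roughly like this (illustrative Python; the shard widths and layout are assumptions for the example, not Kopia’s actual on-disk format):

```python
# Sketch of sharding + atomic-write emulation on a POSIX-like
# filesystem. Shard widths (3, 3) are an assumption for
# illustration, not necessarily what Kopia uses.
import os
import tempfile

def sharded_path(root, blob_id, widths=(3, 3)):
    """Spread blobs across nested directories derived from the
    blob name, so no single directory grows into the millions."""
    parts, pos = [], 0
    for w in widths:
        parts.append(blob_id[pos:pos + w])
        pos += w
    return os.path.join(root, *parts, blob_id)

def atomic_write(path, data):
    """Emulate an atomic PUT: write to a temp file, then rename.
    On POSIX, rename within one filesystem is atomic, so readers
    never observe a half-written blob."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, path)  # atomic rename
```

With this scheme, a blob named n639351c72e5260f3423a20ce0085a563 would land under n63/935/, keeping any single directory from accumulating an unbounded number of entries.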

Obviously this emulation of cloud semantics comes at some cost - usually quite small, but it really depends on the provider rclone is talking to.

If your use case requires frequent, efficient sync and you’re syncing between cloud providers, definitely use rclone directly - it will always be faster.

Having said all of that, I’m actually surprised that the performance difference is that significant - after all, rclone is serving data over localhost, so it should not add much latency, and throughput should be the same, assuming rclone can parallelize.

Can you file an issue on GitHub to track this and provide the following information:

  • the type of rclone backend?
  • rclone version
  • OS (I presume Windows)
  • the speed of your network link?
  • what kinds of speeds you’re seeing using kopia and rclone
  • how big is your repository (can be measured using kopia blob stats and kopia content stats)

Also, rclone hanging at the end of sync looks like a bug - can you please file a separate issue for this on GitHub?

I think the main difference is, as you mentioned in the other post, the initial check. The direct rclone command does not do that check. Also, I specifically used --size-only to speed up the checks, assuming the filenames already capture the hash of their content.
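To make that assumption explicit: a --size-only check is only safe if blob names are content-addressed, i.e. the name already commits to the content, so comparing names plus sizes is enough without re-reading bytes. A toy sketch of that naming idea (the “n” prefix and SHA-256 here are illustrative choices, not necessarily Kopia’s exact scheme):

```python
# Toy content-addressed naming: the blob name embeds a hash of
# the contents, so two blobs with the same name (and size) can be
# assumed identical without comparing their bytes.
# (The "n" prefix and SHA-256 are illustrative assumptions.)
import hashlib

def blob_name(data: bytes, prefix: str = "n") -> str:
    return prefix + hashlib.sha256(data).hexdigest()
```

If the destination could hold a same-named file with different content (e.g. a corrupted partial upload of the same length), the size-only check would miss it, which is the trade-off being accepted here.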

Another reason might be that sync-to rclone defaults to --parallel=1, while rclone does not (I usually see 4 files being processed simultaneously). This difference is amplified when the user’s network bandwidth can carry the full workload (I have a 1Gb fiber connection).
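The effect of the parallelism setting can be sketched generically: with N workers, N transfers are in flight at once, which matters most on a fast link where a single stream can’t saturate the pipe (illustrative Python; upload() is a made-up stand-in for an actual transfer):

```python
# Generic parallel-transfer sketch: with max_workers=N, up to N
# uploads run concurrently, so per-transfer latency overlaps and
# aggregate throughput rises until the link or remote saturates.
from concurrent.futures import ThreadPoolExecutor

def upload(blob_id):  # stand-in for one actual network transfer
    return f"uploaded {blob_id}"

def sync_all(blob_ids, parallel=4):
    with ThreadPoolExecutor(max_workers=parallel) as ex:
        return list(ex.map(upload, blob_ids))
```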

So I just tried uploading my local repo to a new remote directory on Microsoft OneDrive. Since it is new, the initial check finished quickly. I used --parallel=4 and saw a great improvement in speed. However, on the first run, the upload quickly terminated with a weird error: "kopia.exe: error: error copying blobs: error copying n639351c72e5260f3423a20ce0085a563: error writing blob 'n639351c72e5260f3423a20ce0085a563' to destination: BLOB not found, try --help". I retried exactly the same command. It went further and then stopped again with the same error on a different blob.

I also remember there used to be a different but similar error. It happens randomly on one file. After a retry it gets past that file, then errors on another random file. Rinse and repeat.

So I think my concern about using WebDAV is actually not performance, but rather usability. I don’t know why kopia sync-to rclone constantly gives me weird errors (there’s also the hanging at the end mentioned in the OP). rclone itself never has problems like that. So could it be some bug in Kopia?

FYI:
the type of rclone backend?
Microsoft OneDrive

rclone version
rclone v1.54.0
os/arch: windows/amd64
go version: go1.15.7

OS (I presume Windows)
Windows 10

the speed of your network link?
1Gb/s up and down

what kinds of speeds you’re seeing using kopia and rclone
With the default --parallel=1, 20 Mbit/s. With --parallel=4, I’m seeing 51 Mbit/s and still climbing before the error

how big is your repository (can be measured using kopia blob stats and kopia content stats)
I’m testing with a small 3GB repo

This “BLOB not found” behavior could happen because the underlying storage (OneDrive) may not be strongly consistent.
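On eventually consistent storage, a just-written blob can briefly be invisible to a subsequent read or list, which matches the transient “BLOB not found” pattern above. A generic retry-with-backoff sketch of how callers typically cope with this (illustrative only, not Kopia’s actual logic):

```python
# Generic retry-with-backoff for operations against eventually
# consistent storage, where a just-written object can briefly
# appear missing. Illustrative sketch, not Kopia code.
import time

def with_retries(op, attempts=5, base_delay=0.5):
    for i in range(attempts):
        try:
            return op()
        except FileNotFoundError:  # stand-in for "BLOB not found"
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))  # exponential backoff
```

The key property is that transient invisibility resolves itself after a short wait, so retrying the same operation succeeds, which is consistent with the observation that re-running the sync gets past the previously failing blob.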

Definitely let’s investigate this. Can you share Kopia logs from this session (scrubbing any sensitive data)?

Sorry, I deleted the logs (the whole log directory can grow huge if I keep a server running for a long time - maybe worth another issue ticket?) and can’t reproduce it right now. I’ll let you know.

Dear @CrendKing, you have raised a very interesting set of questions. I was a bit disappointed to see that the thread ended prematurely. Of course we all have lives and other matters to attend to. Yet, hereby a kind request to revive this topic, for example with a test on your behalf. I’d be very interested to follow this discussion!

I switched to rclone sync, which does not cause any problems for me. Unless kopia sync-to rclone does something better/more efficiently, I don’t see why anyone would ever use it instead of rclone sync.

Maybe the approach from restic of integrating rclone would help speed up file transfer; see restic · Using rclone as a restic Backend

I think there are two use cases for using rclone. And we are talking about the different ones. Allow me to elaborate.

Case 1 is that we use kopia/restic to back up to LOCATION_1, and use rclone to sync to LOCATION_2. In this case, kopia/restic needs to implement the backend for LOCATION_1 (e.g. local file system, or Amazon S3). LOCATION_2 would be something kopia/restic does not support, such as OneDrive. This is what interests me: I back up to a hard drive, then sync to OneDrive.

Case 2 is that we use kopia/restic to back up directly to LOCATION_3, where LOCATION_3 can be anything rclone supports. By integrating rclone into the system, kopia/restic does not need to implement any backend. All work is delegated to rclone. This is not what I was talking about.

So if Kopia chose to go with case 2, I’d be completely OK with it. It could simplify the code and alleviate the maintenance burden. However, 1) Kopia does not do this at the moment - it does have direct support for several backends; and 2) the sync-to rclone command is clearly a “Case 1” instance.

Now, if we are talking about case 1, unless I’m missing something, I really don’t see any difference between setting up rclone as an HTTP server and feeding it existing blob data, versus directly invoking rclone sync.

So, in conclusion, I still wonder 1) why kopia sync-to rclone exists at all. And to scale, it’s worth investigating the link Chriz brought up.