Question on incremental snapshot

Hi,
Have been using Kopia for a week for one of my projects. Going through the docs, it says all snapshots are incremental, which means if we take a backup of the same path again, subsequent snapshots only contain the diff from the previous snapshot.
Along these lines, when I delete the parent snapshot (this was a full snapshot), I see that Kopia can still restore from the incremental snapshot. How does Kopia get the data of the parent snapshot, which is no longer there? Some insight into how this is handled would be helpful.

-Prashanth

All snapshots are full backups. Because Kopia uses deduplication, it does not back up unchanged files.
So when restoring, you don't have to restore the "full" backup and then all incrementals on top of it.
That is a great feature in my opinion, and it saves backup storage space.


Thanks for the reply. Just curious about the statement below from the Kopia docs: what does "incremental snapshot" mean here?

“All snapshots in Kopia are always incremental - they will only upload files that are not in the repository yet, which saves storage space and upload time. This even applies to files that were moved or renamed. In fact if two computers have exactly the same file, it will still be stored only once.”

If snapshots are full and Kopia doesn't back up unchanged files, would a restore still be possible if I delete the very first snapshot? The very first backup had all files backed up, and the second one only took the files that changed, so as I understand it, there should be some reference to the old snapshot.

"Incremental" means that Kopia will only consider changed files for the next snapshot run. Since all the unchanged files are already in the repository, there's no need to re-scan those, if they haven't changed.

Nonetheless, all snapshots are always complete, so you can always restore your source in one pass. If you delete the first snapshot, you will lose those files which were present then but have been deleted later on; Kopia's pruning process takes care of that. See, a snapshot is more of a representation of the state of the source at that specific time. Blobs only get deleted when they are no longer referenced by any snapshot. So you won't notice a huge drop in your repo size when you delete the first "full" snapshot. It was only "full" in the sense that at that time every file had to be scanned. So rather call it "full-work"…

That’s exactly right.

Let me provide more details here, because what makes this particularly interesting is packing: to keep the repository structure manageable and avoid having millions of files in the repository, Kopia will concatenate chunks of original files ("contents") into larger files stored in the repository ("blobs"). Only fully unreferenced blobs can be deleted, obviously.

As part of maintenance, Kopia will detect blobs that are only partially full (they contain a mix of live and non-live contents) and will periodically rewrite the live contents into separate blobs, making the partial blobs subject to garbage collection.

For example, imagine the first snapshot had files/contents 1,2,3,4,5,6,7,8,9,10,11 and the second snapshot had 1,3,5,7,9,… They would typically be packaged into pack blobs like so:

blob1: [1,2,3,4]
blob2: [5,6,7,8]
blob3: [9,10,11]

(note: second snapshot did not write any blobs due to deduplication)
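The packing and deduplication behavior described above can be sketched in a few lines of Python. This is a toy model, not Kopia's actual on-disk format: content IDs stand in for chunk hashes, and packs are capped at four entries to match the example.

```python
# Toy model of content packing and deduplication (not Kopia's real format).

def pack_contents(contents, known=None, pack_size=4):
    """Group new content IDs into pack blobs of up to pack_size entries."""
    known = set() if known is None else known
    blobs, current = [], []
    for c in contents:
        if c in known:        # deduplication: already in the repository
            continue
        known.add(c)
        current.append(c)
        if len(current) == pack_size:
            blobs.append(current)
            current = []
    if current:
        blobs.append(current)
    return blobs, known

# First snapshot writes three pack blobs:
blobs, known = pack_contents(range(1, 12))
print(blobs)       # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11]]

# Second snapshot writes nothing: every content is already known.
new_blobs, known = pack_contents([1, 3, 5, 7, 9, 11], known)
print(new_blobs)   # []
```

The second call returning an empty list is the forum's point exactly: the second snapshot references existing contents and uploads no new blobs.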

When the first snapshot gets deleted, contents 2,4,6,8 & 10 become unreferenced, but none of blob1, blob2, blob3 can be deleted.

As part of the next maintenance, Kopia will compact the live contents into new blobs and write:

blob4: [1,3,5,7]
blob5: [9,11]

Now some contents have two copies in multiple blobs, and blob1, blob2, blob3 can now be deleted.

(Note: this compaction only happens when blobs have enough "holes" in them, so deleting one or two contents may not always cause the remaining ones to be rewritten.)
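The compaction step can be sketched the same way. Again a toy model rather than Kopia's real algorithm: fully-live blobs are kept, while blobs with holes have their surviving contents rewritten into fresh packs, leaving the old blobs unreferenced.

```python
# Toy model of maintenance compaction (not Kopia's real algorithm).

def compact(blobs, live, pack_size=4):
    """Rewrite live contents out of partially-dead blobs.

    Returns (kept_blobs, new_blobs, deletable_blobs)."""
    kept, deletable, survivors = [], [], []
    for blob in blobs:
        if all(c in live for c in blob):
            kept.append(blob)        # fully live: leave as-is
        else:
            deletable.append(blob)   # has holes: rewrite its live contents
            survivors.extend(c for c in blob if c in live)
    new_blobs = [survivors[i:i + pack_size]
                 for i in range(0, len(survivors), pack_size)]
    return kept, new_blobs, deletable

blobs = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11]]
live = {1, 3, 5, 7, 9, 11}            # contents of the surviving snapshot
kept, new_blobs, deletable = compact(blobs, live)
print(new_blobs)    # [[1, 3, 5, 7], [9, 11]]
print(deletable)    # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11]]
```

Running it reproduces the walkthrough: blob4 `[1,3,5,7]` and blob5 `[9,11]` get written, after which all three original blobs are deletable.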

This deletion of unreferenced blobs does not happen immediately, because Kopia also supports concurrent operation: another Kopia instance can be running on the other side of the world at exactly the same time, and it may need blob1, blob2, blob3 and hold on to them for a while due to caching.

Each client will invalidate its cache after 15 minutes or so, so the deletion of blob1, blob2, blob3 will wait around 1h to be safe and ensure all other clients have switched to blob4 and blob5 as the authoritative source.
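That safety window can be expressed as a simple rule. This is a sketch of the idea only; the 15-minute cache TTL and roughly one-hour margin are the illustrative numbers from the post above, not named Kopia constants.

```python
import datetime

# Illustrative numbers from the discussion, not Kopia internals:
# clients refresh their caches every ~15 minutes, so waiting several
# multiples of that comfortably covers all of them.
CACHE_TTL = datetime.timedelta(minutes=15)
SAFETY_MARGIN = 4 * CACHE_TTL  # ~1 hour

def safe_to_delete(unreferenced_since, now):
    """A blob may be dropped only after the safety window has elapsed."""
    return now - unreferenced_since >= SAFETY_MARGIN

now = datetime.datetime(2024, 1, 1, 12, 0)
print(safe_to_delete(now - datetime.timedelta(minutes=20), now))  # False
print(safe_to_delete(now - datetime.timedelta(hours=2), now))     # True
```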


These details are quite good. One question: even when we use the --delete option, are you saying the actual blobs are not deleted? Does this mean that if my repository is on S3, those files are still there (still consuming space), and they are only deleted during maintenance? Am I wrong here?

That's correct. It's recommended to run maintenance frequently; it will keep the repository nice and compact as it changes.

Another question: I assume that when maintenance runs, Kopia has the intelligence to go and delete the blobs from S3?

That’s correct. It will do it automatically as part of each maintenance.

Hi, I was thinking through this with my use case in mind, where I am writing a wrapper over the CLI for taking snapshots, restoring, and deleting, and I have a few questions.

  1. When I take backups of multiple repos to S3 (each having its own credentials), how does Kopia maintenance know the required credentials for each repo?
    [As far as I know, when we create a repo it creates a config file under /root/.config/kopia, but what I see is that this config file is overwritten every time a new repo is created.]

  2. I take a backup of a repo; after some time, let's say maintenance is triggered, and at the same time a user triggers another backup of the repo. In this scenario, I assume Kopia will skip deleting those blobs from the repo, since another backup is in progress, right?

Hi,
Can anyone throw some light on this?

There's currently no way to pass dynamic credentials to the S3 provider, or any other provider for that matter. You could probably extend the S3 provider for your use case, but I think you're dangerously close to a solution that will diverge from where Kopia is heading and thus be difficult to maintain.

Regarding your other question: yes, it is safe to take snapshots while maintenance is running.

When I am running such setups, I always create several Kopia server instances running simultaneously, such that there is one Kopia process per S3 bucket, each run from a script which either contains the credentials or points to a specific config and password file for the respective S3 bucket.
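From a wrapper, one way to script that per-bucket separation is to give every repository its own config file via Kopia's global `--config-file` flag and pass the password through the `KOPIA_PASSWORD` environment variable (both exist in Kopia's CLI; the file paths and passwords below are made-up placeholders):

```python
import os
import subprocess

def kopia_cmd(config_file, args):
    """Build a kopia invocation bound to one repository's config file."""
    return ["kopia", "--config-file", config_file] + list(args)

def run_maintenance(config_file, password):
    """Run full maintenance for one repo; each repo keeps its own config."""
    env = dict(os.environ, KOPIA_PASSWORD=password)  # keep password off argv
    cmd = kopia_cmd(config_file, ["maintenance", "run", "--full"])
    return subprocess.run(cmd, env=env, check=True)

# Hypothetical layout: one config file per bucket.
# run_maintenance("/root/.config/kopia/bucket-a.config", "secret-a")
# run_maintenance("/root/.config/kopia/bucket-b.config", "secret-b")
print(kopia_cmd("/tmp/bucket-a.config", ["maintenance", "run", "--full"]))
```

Because each invocation names its own config file, the configs are never overwritten by one another, and maintenance for each repo always runs with that repo's own credentials.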