I am using kopia to create backups to several Backblaze buckets. On all repositories I have enabled Object Lock with a retention period of 90 days. For the backups I am using the default retention policy (10 latest, 48 hourly, 7 daily, 4 weekly, 24 monthly, 3 annual); snapshots are made once an hour.
My problem is that some buckets are now growing very large. In one example, kopia content stats reports a content size of 380.9 GB (231.2 GB compressed), but the corresponding Backblaze bucket has grown to 903.3 GB after 3 months of backups. Notably, some other buckets are not growing much, so I am trying to investigate whether there is anything I can do about those that do.
When I run rclone size on the bucket, the result is 244.3 GB. My guess is that most of the bucket size comes from either old file revisions or from files that have been deleted/hidden but are still retained by the object lock. After superficially browsing through the bucket, it seems that there are no old file revisions there (I still had it set up to keep them), but many files are marked for deletion.
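As a rough back-of-the-envelope check of that guess, the two figures above can be subtracted to see how much of the bucket is not live data. This is only a sketch: it assumes the 903.3 GB reported by Backblaze and the 244.3 GB reported by rclone size are measured in comparable units, and the function name non_live_share is made up for illustration.

```bash
#!/usr/bin/env bash
# Estimate how much of a bucket is taken up by non-current data
# (hidden/deleted files or old revisions still held by the object lock).
# Usage: non_live_share <bucket_total_gb> <live_gb>
non_live_share() {
  awk -v total="$1" -v live="$2" 'BEGIN {
    hidden = total - live
    printf "%.1f GB not live (%.0f%% of bucket)\n", hidden, 100 * hidden / total
  }'
}

# Figures from this post: bucket size vs. rclone size
non_live_share 903.3 244.3
```

With the numbers from this post, roughly three quarters of the bucket would be data that is no longer current.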
After thinking about this, my guess is that the large bucket size is the result of a combination of two factors:
1. Having a retention period of 90 days basically means that the data of all snapshots from the last 90 days is stored in my bucket and takes up space there. In my case that is one snapshot per hour, so at least 2160 snapshots.
2. The reason why some buckets grow so large while others don’t must be that some backups contain large amounts of data that changes frequently.
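The snapshot count in the first factor is just the lock period multiplied by the snapshot frequency. A minimal sketch of that arithmetic (the function name snapshots_in_window is hypothetical, not a kopia command):

```bash
#!/usr/bin/env bash
# Number of snapshots created during the object-lock window.
# Usage: snapshots_in_window <lock_days> <snapshots_per_day>
snapshots_in_window() {
  echo $(( $1 * $2 ))
}

# Hourly snapshots over a 90-day lock period
snapshots_in_window 90 24
```

Taking one snapshot per day instead would put only 90 snapshots inside the same window.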
As potential solutions, I am wondering:
1. Is it possible to enable object lock only for some snapshots? If I could enable it for only one snapshot per day or maybe two per week, the bucket size should be greatly reduced. If this is not possible yet, is this something that would technically even be imaginable with the way that kopia creates snapshots, so would it be worth opening a feature request?
2. Is there a way to investigate which files cause how much disk usage in a snapshot? I’m thinking about a command that would list the files that have changed in a snapshot along with their file sizes. When a file changes, does kopia store its entire new version, or does it only store a diff? If the latter is the case, is there a way to view the sizes of all diffs of a specific snapshot?
With object lock enabled, your retention policy has no effect on occupied space until the lock period has expired. In your case, any data you save to your repository will be stored for at least 90 days.
Nope. It is all or nothing.
But what you could do is have two buckets and two repositories, one with and one without object lock: take the frequent snapshots in the one without lock, and back up to the other one (or copy snapshots to it from the first one) less frequently.
On the other hand, if you do not care about all these snapshots, then why do you take them in the first place? Maybe it is worth rethinking the whole backup strategy, e.g. using some local medium for very frequent snapshots and the cloud only for the longer-term ones.
Thank you for the clarification about object locks. I have created a feature request to allow for dynamic object locks based on the configured snapshot retention.
I have made a little bit of progress trying to find out which files are causing my backups to grow so much on some instances. In the folder /app/logs/cli-logs, there are some files named kopia-*-server-start.*.log. These contain logs of the automatic snapshots that were created. Some of these log entries contain info about the directories that were saved as part of a snapshot and their sizes. With this little bash command, I extract all directories larger than 1 MB and show them sorted by size:
while read -r line; do if [[ "$line" =~ \"size\":([0-9]+) ]] && [ "${BASH_REMATCH[1]}" -gt 1000000 ]; then echo "$(numfmt --to=iec "${BASH_REMATCH[1]}") $line"; fi; done < kopia-20251021-225713-1-server-start.52.log | sort -h
Now, I believe this does not show the size change, but the final size of the whole directory. Still, I believe that the directories that appear in the output at all had some file changes inside. This means that the output will only be useful in some situations. In my case, the output contains a mysql data directory, which I know only contains a few large files that change frequently. Here, this is a strong indicator that those large files are the culprit. In other cases, where a changed directory contains many files of various sizes, it would be impossible to tell which of those files changed and whether that added significantly to the snapshot size.
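For repeated use across several log files, the same filter can be wrapped in a small function with a configurable threshold. This is a sketch, not something kopia provides: the name large_entries is made up, and it relies on the same assumption as the command above, namely that the relevant log lines contain a "size":<bytes> field (only the first such field on a line is looked at). It needs numfmt from GNU coreutils.

```bash
#!/usr/bin/env bash
# List log lines whose "size" field exceeds a threshold, smallest first,
# prefixed with a human-readable size.
# Usage: large_entries <min_bytes> <logfile>...
large_entries() {
  min_bytes="$1"
  shift
  awk -v min="$min_bytes" '
    match($0, /"size":[0-9]+/) {
      # RSTART/RLENGTH delimit the match; skip the 7 characters of "size":
      size = substr($0, RSTART + 7, RLENGTH - 7)
      if (size + 0 > min + 0) print size, $0
    }
  ' "$@" | sort -n | numfmt --to=iec   # numfmt converts the first field only
}
```

It could then be called as large_entries 1000000 kopia-*-server-start.*.log to scan all server logs at once. Like the one-liner, it still reports total directory sizes, not the size of the change.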
I have created a feature request to add an option that shows the actual file size changes.