I have a repo with backups of the user folders from my PC: Desktop, Downloads, Documents, Videos and so on.
I’ve moved some files from my Desktop folder to my Videos folder. Afterwards, I made a snapshot of the Videos directory. I have not run snapshots of any other folder.
After running kopia snapshot ls --storage-stats, I get new-data:490.1 MB for the Videos snapshot.
490.1 MB is the size of the videos. There’s now duplicate data shown across two snapshots for these videos, one in Desktop and one in Videos.
That must be the right behavior, as it’s new data for the specific snapshot, but is there any way to get the actual amount of new disk storage taken up by a snapshot?
I think the closest would be to see how much data is transferred for the snapshot, but given that Kopia is deduplicating all data, the values are not very useful.
If a snapshot shares file A with yesterday’s snapshot X and file B with snapshot Y from another computer, it looks as if it takes up very little space, because X and Y already “paid” for those files when they uploaded them first. But when X or Y expires and maintenance removes it, your snapshot suddenly “owns” the corresponding file if it is the last owner of it. Once both are gone, your snapshot is larger by the stored size of files A and B, even though nothing in it has changed since yesterday.
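To make the “ownership” shift concrete, here is a minimal toy model in Python. It is not how Kopia computes anything; the snapshot names, content IDs and sizes are made up, and each snapshot is simply treated as a set of content IDs.

```python
# Toy model (not Kopia's implementation): a snapshot is a set of content IDs.
# All names and sizes (in MB) below are hypothetical.
sizes = {"A": 100, "B": 50, "C": 5}
snapshots = {
    "X": {"A"},               # yesterday's snapshot
    "Y": {"B"},               # snapshot from another computer
    "mine": {"A", "B", "C"},  # the snapshot we want to "price"
}

def exclusive_size(name, snapshots, sizes):
    """Size of the content referenced by `name` and by no other snapshot."""
    others = set().union(*(c for n, c in snapshots.items() if n != name))
    return sum(sizes[c] for c in snapshots[name] - others)

# While X and Y exist, only file C is exclusive to "mine".
before = exclusive_size("mine", snapshots, sizes)

# After X and Y expire, "mine" is the last owner of A and B as well.
remaining = {n: c for n, c in snapshots.items() if n not in ("X", "Y")}
after = exclusive_size("mine", remaining, sizes)

print(before, after)
```

The same snapshot goes from 5 MB to 155 MB of exclusively owned data without a single byte of it changing, which is why a per-snapshot “real size” is a moving target.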
There are even more subtle issues. Suppose two clients start backing up at the same millisecond. Both check their file lists, both ask the repo whether file C exists, and both get the answer “no”, so both plan to transfer it. One client is much faster (or has less unique data), so it sends file C right away. The second client hashes, splits, compresses and encrypts file A and sends it, then sends file B, and finally sends file C, by which time file C already exists in the repo. So file C was sent over the wire by both clients (which matters if the repo is remote), but the duplicate will be cleaned out at some point, either when the collision is detected or during some later maintenance. The transfer in itself is therefore no guarantee that the data will be pinned to this particular snapshot.
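The race can be replayed deterministically with a small Python sketch. This is only an illustration under simplifying assumptions: the `repo` dict, the client names and the sizes are hypothetical, and the duplicate is dropped on arrival here, whereas the real cleanup may only happen at a later maintenance run.

```python
# Hypothetical simulation of the race; not Kopia's actual protocol.
repo = {}                      # content ID -> stored size (MB)
sent = {"fast": 0, "slow": 0}  # MB each client pushed over the wire
sizes = {"A": 100, "B": 50, "C": 5}

def plan(files):
    """Decide what to upload based on what the repo holds right now."""
    return [f for f in files if f not in repo]

def upload(client, name):
    sent[client] += sizes[name]      # the transfer happens regardless
    repo.setdefault(name, sizes[name])  # but the content is stored only once

# Both clients check at the same moment, so both see an empty repo.
fast_plan = plan(["C"])
slow_plan = plan(["A", "B", "C"])

for f in fast_plan:
    upload("fast", f)
for f in slow_plan:   # by the time C is reached, the repo already has it
    upload("slow", f)

stored = sum(repo.values())
print(sent, stored)
```

In total 160 MB crossed the wire but only 155 MB ended up stored, so summing transfer sizes overstates what the snapshots actually occupy.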
My take on this is to monitor the total space used on the repo so that it doesn’t fill up, and then use the “incorrect” listing of “how much data does this snapshot protect” to split costs between clients (or for whatever reason you need each snapshot’s true size). A client that protects 10x the data of other clients would then pay 10x more, even if the snapshots for this big client are small on the repo disks because they dedupe against data already sent by 10-20 small 1x clients.
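As a rough sketch of that cost split, assuming you have already collected the protected (logical) size per client yourself, the arithmetic could look like this in Python. The function name, client names and numbers are all hypothetical.

```python
def split_cost(total_cost, protected):
    """Split `total_cost` proportionally to the data each client protects."""
    total = sum(protected.values())
    return {client: total_cost * size / total for client, size in protected.items()}

# Hypothetical protected sizes in GB: one big client, two small ones.
protected = {"big": 1000, "small1": 100, "small2": 100}
shares = split_cost(120.0, protected)
print(shares)
```

Here the big client protects 10x the data of each small client and pays 10x as much, no matter how well its data happened to dedupe against what the others uploaded first.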