Repo folder structure suggestion | default .shards option suggestion

The default sharding options of kopia and KopiaUI differ: kopia uses [1, 3] while KopiaUI uses [3, 3].

The problem is that this creates lots of folders for s-files and x-files (session files and index files), and nearly every action lists these folders, while listing many folders on an HDD or network drive is very slow. (Using the --flat option puts all files in one folder, which is slow too.)

I think the default .shards can be:

{
    "default": [1, 2],
    "maxNonShardedLength": 20,
    "overrides": [
        { "prefix": "q", "shards": [1, 1] },
        { "prefix": "s", "shards": [1] },
        { "prefix": "x", "shards": [1] }
    ]
}

The folder structure will be:

p\        # data blocks
    04\
    25\
    a6\
    ...\
q\        # metadata blocks
    1\
    3\
    a\
    ...\
s\        # session files
    ****.f
x\        # index blocks
    n23_****.f
    n24_****.f
    r0_7_****.f
    r8_15_****.f
    s0_****.f
    s1_****.f
    ...
    s22_****.f
kopia_****.f
xw****.f
xe23.f
xe24.f
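
To make the mapping from blob ID to folder concrete, here is a minimal Python sketch of how I understand the .shards file to be applied. It is inferred from the layout above, not taken from kopia's source: the function name `shard_path`, the "first matching override wins" rule, and the exact length boundary for unsharded blobs are all my assumptions. It prints forward slashes rather than the backslashes shown above.

```python
from pathlib import PurePosixPath

# The .shards content suggested above.
SHARDS = {
    "default": [1, 2],
    "maxNonShardedLength": 20,
    "overrides": [
        {"prefix": "q", "shards": [1, 1]},
        {"prefix": "s", "shards": [1]},
        {"prefix": "x", "shards": [1]},
    ],
}


def shard_path(blob_id: str, cfg: dict = SHARDS) -> str:
    """Return the assumed relative path of a blob (hypothetical helper)."""
    # Assumption: short IDs stay unsharded in the repository root.
    if len(blob_id) <= cfg["maxNonShardedLength"]:
        return blob_id + ".f"

    # Assumption: the first override whose prefix matches is used.
    shards = cfg["default"]
    for override in cfg["overrides"]:
        if blob_id.startswith(override["prefix"]):
            shards = override["shards"]
            break

    # Peel off one folder level per shard size; the remainder is the file name.
    parts = []
    rest = blob_id
    for size in shards:
        if len(rest) <= size:
            break
        parts.append(rest[:size])
        rest = rest[size:]
    return str(PurePosixPath(*parts, rest + ".f"))


if __name__ == "__main__":
    for blob in ("p04" + "a" * 30, "q1" + "b" * 30, "xn23_" + "c" * 30, "xe23"):
        print(blob, "->", shard_path(blob))
```

The blob IDs in the usage loop are made up for illustration; only their prefixes and lengths matter.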

The sharding strategy is based on some assumptions:

  1. The more files and folders a folder contains, the slower it is to read files from it and to create new files in it.
  2. Files are faster to list in a single folder than across many different folders.
  3. P-files and q-files are only listed when running maintenance; otherwise they are created or read directly (with the help of indexes), so these files can be grouped into folders to boost speed.
  4. X-files are listed on every operation, so they are kept in a single folder.
  5. S-files are only listed when running maintenance, but they are created and deleted on every operation and there are very few of them. So it is better to place them in a single folder than to create many folders and then empty them; that way these folders are never created or listed.
  6. Maintenance runs regularly, so the numbers of x-files and q-files stay small.
  7. A ~400GB repository contains about 40000 p-files and 200 q-files, so grouping these into 256 (folders) * 150 (files) and 16 (folders) * 13 (files) is better than using 4096 folders.
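
The arithmetic behind assumption 7 can be checked with a few lines of Python. This is only a rough sketch: it assumes the sharded characters are hexadecimal and spread evenly, and `files_per_folder` is a helper name I made up.

```python
import math


def files_per_folder(total_files: int, shard_chars: int) -> int:
    """Average files per leaf folder, rounded up (hypothetical helper).

    shard_chars is the number of hex characters used for the leaf folder
    name, so there are 16 ** shard_chars possible folders.
    """
    folders = 16 ** shard_chars
    return math.ceil(total_files / folders)


print(files_per_folder(40000, 2))  # p-files with [1, 2]: 256 folders -> 157
print(files_per_folder(200, 1))    # q-files with [1, 1]: 16 folders -> 13
print(files_per_folder(40000, 3))  # p-files with [1, 3]: 4096 folders -> 10
```

So with [1, 3] each of the 4096 folders holds only about 10 p-files, while [1, 2] gives roughly 150 files in each of 256 folders.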

Here is the same issue I created on GitHub:


After exploring the forum, I found a hidden command that converts a local repository to the shards format I mentioned:

kopia blob shards modify --path=/path/to/repo --i-am-sure-kopia-is-not-running --default-shards=1,2 --override=q=1,1 --override=x=1 --override=s=1 --unsharded-length 20

It moves the files to their new locations. This may take a long time, and you cannot use the repo until it is done.
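
Since the flags mirror the fields of the .shards file one to one, the command line can be derived from the JSON. Below is a small Python sketch of that mapping; `modify_args` is a hypothetical helper of mine, it only uses the flags shown in the command above, and it only builds the argument list without running anything. I am assuming the order of the --override flags does not matter.

```python
import json


def modify_args(shards_json: str, repo_path: str) -> list[str]:
    """Build the 'kopia blob shards modify' arguments from .shards content."""
    cfg = json.loads(shards_json)

    def join(sizes):
        # [1, 2] -> "1,2"
        return ",".join(str(n) for n in sizes)

    args = [
        "kopia", "blob", "shards", "modify",
        f"--path={repo_path}",
        "--i-am-sure-kopia-is-not-running",
        f"--default-shards={join(cfg['default'])}",
    ]
    # One --override flag per entry, in the order they appear in the file.
    for override in cfg.get("overrides", []):
        args.append(f"--override={override['prefix']}={join(override['shards'])}")
    args += ["--unsharded-length", str(cfg["maxNonShardedLength"])]
    return args


if __name__ == "__main__":
    example = """
    {
        "default": [1, 2],
        "maxNonShardedLength": 20,
        "overrides": [
            { "prefix": "q", "shards": [1, 1] },
            { "prefix": "s", "shards": [1] },
            { "prefix": "x", "shards": [1] }
        ]
    }
    """
    print(" ".join(modify_args(example, "/path/to/repo")))
```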


This .shards file groups the _log_ files in a single folder (_log_\***.f instead of _\lo\g_***.f):

{
    "default": [1, 2],
    "maxNonShardedLength": 20,
    "overrides": [
        { "prefix": "_log_", "shards": [5] },
        { "prefix": "q", "shards": [1, 1] },
        { "prefix": "s", "shards": [1] },
        { "prefix": "x", "shards": [1] }
    ]
}

Modify command:

kopia blob shards modify --path=/path/to/repo --i-am-sure-kopia-is-not-running --default-shards=1,2 --override=_log_=5 --override=q=1,1 --override=x=1 --override=s=1 --unsharded-length 20

Prefix  Description                         Listed frequently?  Large number of files?  Strategy
P       Data pack                           No                  Yes                     Group by first two chars
Q       Metadata pack                       No                  No                      Group by first char
S       Session file (see design document)  No                  No                      All in one folder
X       Index file                          Yes                 No                      All in one folder
_log_   Log file                            No                  No                      All in one folder