File Index

The *file-index* is a part of the Mercurial store that tracks all file-paths used within a repository store.

WARNING: The "file-index" is an experimental feature still in development. While the file format ('fileindex-v1') is frozen and you can thus expect it to be supported, the feature itself may still have bugs or need API/config changes.

The file index supersedes the fncache (see the "Difference with the Fncache" section) and is independent of the working copy, only tracking file-paths relevant for the history.

The file index is used by operations that needs to iterate over all the information in the store, like local and stream cloning, repository upgrades or various debug commands computing statistics.

The file index also provides a bidirectional mapping between "file-path" and "file-token". The "file-token" are an internal opaque fixed size identifiers that can be used by algorithms for efficiency.

The file index is designed for efficient querying and updating.

The file index supersede the "fncache" feature, see the 'Difference with FnCache' section for details. It is also currently incompatible with the experimental 'tree manifest' feature.

This feature is guarded by the 'fileindex-v1' requirement. See 'hg help internals.requirements' for details.

High level design

Functionality

The file-index features revolve around two main concepts:

The feature envelop of the file-index is:

As the feature set and format is not frozen yet, this feature set is subject to change.

Implementation summary

There are four main component to the 'file-index'.

Three are related to the data themselves:

The last one is the "docket". It stores various metadata and provides transactionality, mmap safety, and other lower level details. See the "Internal filesystem representation" section for details on the docket.

The "file-path-data" contains all the "file-path" stored in the repository. They are each stored in full, to be able access them directly without processing or memory allocation. This content is "append only" and it's size complexity is 'O(N)'.

The "token-index" is a linear index that can resolve each "file-token" to a "file-path" in 'O(1)' time. It also contains extra metadata to efficiently split the "basename" of "file-path" from the directories containing it. This content is append only and it's size complexity is 'O(N)'.

The "path-index" is a prefix tree that allow to map a "file-path" into a "file-token" (if the "file-path" is known to the 'file-index'). The prefix tree rely on the content of the "file-path-data" to encode its prefix. This allow it to use fixed size node. The complexity of search or adding a "file-path" is 'O(log(N))' time. To preserve the append-only property on disk, this prefix tree is stored on disk as "persistent" tree, only adding new nodes pointing to existing one when inserting data, never overwriting existing nodes. As we will eventually vacuum the "dead" nodes, its size complexity remains 'O(N)'.

Difference with FnCache

The 'fncache' and the 'file-index' store similar but different information. While the 'file-index' store plain 'file-path', the 'fncache' store the path of filelog related files (stored on disk in '.hg/store/data/').

So when committing a "Foo/Bar/luz.txt" file:

Note: that the path stored in the 'fncache' are "unencoded", so they are the path before any "path encoding happens", so in practice, the actual file system file will likely be "data/_foo/_bar/luz.txt.i", and some of the filelogs may live in '.hg/store/dh/'.

See 'hg help internals.revlogs' for details on the revlog related files and path encoding.

By leaking revlog's implementation details, the 'fncache' offer less flexibility to the revlog, and the storage layer in general. A repository using 'fncache' must use one filelog per "file-path" and its revlogs cannot use flexible filenames. On the other hand, a 'file-index' repository store an higher level information from which the filelog on can be computed. So it offers the same features while offering more flexibility.

The only "drawback" is that the 'fncache' provides a way to know that a given filelog is not inlined (even after lock release), while with the 'file-index' we need to open the filelog to get that information. However, the 'fncache' doesn't provide a way to know that a given filelog is inlined, once the lock is released. In addition, revlog's inlining is overall a quite flawed feature that we are slowly moving away from, so this is not expected to be a major problem.

Another significant difference is that the 'fncache' doesn't provides 'file-token'.

Unlike the 'file-index', the 'fncache' is currently not visible during "pending" operations.

Finally, but not least, the 'fncache' file format is not tailored for performance. Stored as flat, line based listing of path, the 'fncache' needs to load all the paths from disk to be able to find if a path exist within itself, slowing down searches and updates. This affects the performance of higher level operation like 'commit' or 'pull'.

Internal filesystem representation

File organization

The file index storage consists of four files:

The docket is a small file that functions similarly to the 'dirstate-v2' docket. It stores metadata about the other files, including their active sizes and IDs. The files with ID suffixes are append-only and never truncated. This provide use with "transactionality" and "mmap safety". When we ever need to rewrite their content, we write to a file with a new ID and update the docket to point to it.

The docket file format

The purpose of the docket file is to provide a consistent view of the file index. Changes to the other files are only visible to other process when the docket is updated on disk, hence only the docket needs to be rolled back when a transaction is aborted.

The docket file contains the following fields at fixed byte offsets counting from the start of the file:

The following five "used size" fields are stored as 32-bit big-endian integers. The actual size of the respective files may be larger (if another Mercurial process is appending but has not updated the docket yet). That extra data at the end of the files must be ignored.

The following four "ID" fields are stored as 8-byte strings. They indicate where the corresponding data file is stored. For example, if the list file ID is "ab19b7c0", then the list file is stored at '.hg/store/fileindex-list.ab19b7c0'.

The list file format ---------------------

The purpose of the list file is to allow the meta file and tree file to reference paths from fixed-length structures. It also enables code to construct file paths without allocation, assuming the list file is mapped in memory.

Because the list file is append-only, to remove paths from it, we must write a new list file from scratch. This is necessary in rare cases such as when narrowing a repository or stripping changesets.

The list file contains all the file paths terminated by null bytes, one after the other. The order of the paths is arbitrary and should not be relied on. The presence of the null bytes is optional and should not be relied on.

The meta file format

The purpose of the meta file is to store information for each token (in particular, its file path) with constant time lookup.

Because the meta file is append-only, to remove elements from it, we must write a new meta file from scratch. This is necessary in rare cases such as when narrowing a repository or stripping changesets.

The meta file contains an array of 8-byte elements. The element for token T is found at byte offset T*8. Each element has the following layout:

The tree file format

The purpose of the tree file is to store information for each file path (in particular, its token) and support fast lookup.

The tree file contains nodes that form a prefix tree. The docket file indicates which node is the root node. Because the tree file is append-only, every time we insert a path into the tree, we must create new internal nodes all the way up to the root. This results in a growing number of unreachable nodes over time. When this unused space exceeds a threshold, we rebuild the tree file from scratch. Note that we don't do this process for each individual insertion, but only when we flush the current inserted batch to disk.

Each node has a string *label*, and a *prefix* obtained by concatenating labels from the root to the node. The node does not store its label directly, only its first character, its length, and a file token. The label is a substring of the file path associated with the token, where the starting index is obtained by adding label lengths from the root to the node.

There must not be a common prefix between any pair of sibling node labels. This limits the number of children to 256 based on initial bytes 0x00 to 0xff, and further to 253 since "\n", "\r", and "\0" are forbidden in file paths. This limit allows us to store the number of children in an 8-bit integer.

Each node has the following 6-byte header:

The root node has token 0 and label length 0. For non-root nodes, the token and the label length must be nonzero.

The header is followed by two arrays: