WIP -- New index api design #10

Closed
wants to merge 8 commits into from
122 changes: 122 additions & 0 deletions Documentation/technical/ng-index-api-design.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,122 @@
# ng-index-api-design.txt

This document describes a series of changes to the existing index/cache APIs.
My goal here is to hide the in-memory organization of the index/cache from
most of the code base and allow it to be more easily changed to handle larger
repositories and in the future possibly different on-disk index file formats.

Much like the on-going series of patches converting from "char[20] sha" to
"struct object_id oid" which have been gradually introduced to the code base,
I envision a similar sequence of small conversion steps.


## 1. NG Index API Goals


### Index Macros

Unless "NO_THE_INDEX_COMPATIBILITY_MACROS" is defined, cache.h defines a set
of "casual API macros" for use by most of the source in the tree. These macros
hide some "struct index_state" fields and define macro functions that always
pass the global variable "the_index" to the actual index functions. For example:

    #define active_nr (the_index.cache_nr)
    #define read_cache() read_index(&the_index)

Currently only 10 source files do not use these macros.

**Goal 1:** gradually phase out the use of these macros in the rest of the
source.
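
As a rough illustration of what one conversion step could look like, the
sketch below contrasts a macro-style caller with an explicit one. The
"struct index_state" and "read_index()" here are toy stand-ins (assumptions,
not the real definitions), and the two "count_entries_*" helpers are
hypothetical names used only for this example:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for git's much larger "struct index_state". */
struct index_state {
	unsigned int cache_nr;	/* number of entries currently loaded */
};

/* Stand-in for the global that the compatibility macros capture. */
static struct index_state the_index;

/* Toy "read_index": pretends to load three entries, returns the count. */
static int read_index(struct index_state *istate)
{
	assert(istate != NULL);
	istate->cache_nr = 3;
	return (int)istate->cache_nr;
}

/* Macro style: the index being used is invisible at the call site. */
#define active_nr (the_index.cache_nr)
#define read_cache() read_index(&the_index)

/* Hypothetical caller before conversion. */
static unsigned int count_entries_with_macros(void)
{
	read_cache();
	return active_nr;
}

/* Hypothetical caller after conversion: the index is named explicitly. */
static unsigned int count_entries_explicit(struct index_state *istate)
{
	read_index(istate);
	return istate->cache_nr;
}
```

The converted form does the same work, but a reader can see which index is
touched, and the function can be pointed at an index other than "the_index".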


### "the_index" Global Variable

Currently, the main in-memory index is stored in a global variable called
"the_index" and is implicitly referenced throughout the source. This causes
various problems especially if we want to support submodules in the same
process.

**Goal 2:** gradually phase out "the_index" and replace it with a passed
parameter. The signature of all routines that need to access the index will
be widened to include it. This meshes nicely with the currently in-progress
"struct repository" work. Work on this goal may need to be staged behind the
in-progress repository changes. That is, if most functions are modified to
take a "struct repository", they can just use the "struct index_state" pointer
within it.
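
A minimal sketch of the widened signature, assuming a "struct repository"
that carries a pointer to its index. Both structs are reduced to toy subsets,
and "count_tracked()" is a hypothetical function used only for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Toy subset of "struct index_state". */
struct index_state {
	unsigned int cache_nr;
};

/* Toy subset of "struct repository": assumed to point at its own index. */
struct repository {
	struct index_state *index;
};

/*
 * Hypothetical routine after widening. A pre-conversion version would take
 * no arguments and read the global "the_index" instead.
 */
static unsigned int count_tracked(const struct repository *repo)
{
	assert(repo != NULL && repo->index != NULL);
	return repo->index->cache_nr;
}
```

With the index reached through a parameter, a superproject and a submodule
can each have their own "struct repository" in the same process, and the same
routine works on either one.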


### Hide "cache_entry" Array

Currently, the main in-memory index is stored using an array of pointers to
"struct cache_entry" objects. Very little attempt is made to hide this array
from the entire code base.

These are ordered by full relative pathname and then by stage. Code
throughout the tree knows this and directly operates on the array. There are
some helper routines to find, insert, and delete entries, but most access is
direct.

Since the index is a linear table of pathnames, any iteration of the index is
implicitly a depth-first walk of the tracked files. But it is not possible
to efficiently ask for hierarchy-related iterations.

Additionally, since files with multiple stages are stored in adjacent entries,
some iterations need to be adjusted to process or skip them.

**Goal 3:** introduce a set of iterators to initially hide the array and later
to allow alternative data structures to be considered.

**Goal 4:** hide the stage-array details for unmerged entries.
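
One possible shape for such iterators is sketched below. Nothing here is a
proposed final API: the "ngi_iter" type and its functions are hypothetical
names, and "struct cache_entry" and "struct index_state" are cut down to the
fields the example needs. The first stepping function visits every entry; the
second visits each path once and skips the extra stages of an unmerged path,
so callers no longer need to know that stages sit in adjacent slots.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy subset of "struct cache_entry". */
struct cache_entry {
	const char *name;	/* full relative pathname */
	int stage;		/* 0 = merged, 1..3 = unmerged stages */
};

/* Toy subset of "struct index_state": sorted by name, then by stage. */
struct index_state {
	struct cache_entry **cache;
	unsigned int cache_nr;
};

/* Hypothetical iterator; callers treat it as opaque. */
struct ngi_iter {
	const struct index_state *istate;
	unsigned int pos;
};

static void ngi_iter_begin(struct ngi_iter *it, const struct index_state *istate)
{
	assert(it != NULL && istate != NULL);
	it->istate = istate;
	it->pos = 0;
}

/* Return the next entry (every stage), or NULL at the end. */
static const struct cache_entry *ngi_iter_next(struct ngi_iter *it)
{
	if (it->pos >= it->istate->cache_nr)
		return NULL;
	return it->istate->cache[it->pos++];
}

/*
 * Return the first entry of the next path, or NULL at the end.
 * Any further stages of the same path are skipped.
 */
static const struct cache_entry *ngi_iter_next_path(struct ngi_iter *it)
{
	const struct cache_entry *first = ngi_iter_next(it);

	if (!first)
		return NULL;
	while (it->pos < it->istate->cache_nr &&
	       !strcmp(it->istate->cache[it->pos]->name, first->name))
		it->pos++;
	return first;
}
```

Because callers only hold an "ngi_iter", the array behind it could later be
replaced by another data structure without touching them.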


### Hide or Eliminate Name/Dir Hash

Currently, the main in-memory index also contains 2 hash tables, the "name"
and "dir" hashes. These allow for efficient case-insensitive lookups for
files and each unique directory prefix. These are used on case-insensitive
platforms (like Windows and maybe Mac) to help correct case-sloppy command
line pathnames from the user.

These hash tables are very expensive to compute. They only exist because of
the existing array of index entries. That is, if we could change the in-memory
layout, we would not need the hash tables.

**Goal 5:** eliminate these hash tables.


### Memory Allocation of "cache_entry"

Currently, the main in-memory index consists of an array of pointers to
individually-allocated "struct cache_entry" objects. For index files with a
very large number of entries, there can be significant malloc overhead when
reading the index.

**Goal 6:** consider a memory pool to block-allocate them. The pool should
be thread-aware to allow threaded operations to create new cache entries
with minimal thread contention.
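
To make the idea concrete, here is a minimal bump-allocator sketch. All names
("mem_pool", "pool_alloc", "pool_discard") and the 64 KiB block size are
assumptions for illustration. The sketch takes no locks; one simple way to
keep contention low would be to give each thread its own pool, but that
choice is also an assumption rather than part of this design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_BLOCK_SIZE (64 * 1024)	/* assumed block size */

/* Forces every allocation to be suitably aligned for common types. */
union pool_align {
	long long ll;
	long double ld;
	void *p;
};

struct pool_block {
	struct pool_block *next;
	size_t used;		/* bytes handed out from data[] */
	union pool_align data[POOL_BLOCK_SIZE / sizeof(union pool_align)];
};

struct mem_pool {
	struct pool_block *head;	/* block currently being filled */
	size_t nr_blocks;		/* number of malloc() calls so far */
};

/* Carve "len" bytes out of the current block; grab a new block if full. */
static void *pool_alloc(struct mem_pool *pool, size_t len)
{
	const size_t align = sizeof(union pool_align);
	struct pool_block *b;
	void *p;

	assert(pool != NULL);
	b = pool->head;
	len = (len + align - 1) / align * align;	/* round up */
	if (len > sizeof(b->data))
		return NULL;	/* oversized requests are not handled here */
	if (!b || b->used + len > sizeof(b->data)) {
		b = malloc(sizeof(*b));
		if (!b)
			return NULL;
		b->next = pool->head;
		b->used = 0;
		pool->head = b;
		pool->nr_blocks++;
	}
	p = (char *)b->data + b->used;
	b->used += len;
	return p;
}

/* Release every block at once; individual entries are never freed. */
static void pool_discard(struct mem_pool *pool)
{
	while (pool->head) {
		struct pool_block *next = pool->head->next;
		free(pool->head);
		pool->head = next;
	}
	pool->nr_blocks = 0;
}
```

Reading an index with many entries would then cost one malloc() per block
instead of one per entry, and discarding the index frees the blocks in a
single pass.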


## 2. NG Index API

### Iterator Types





## A1. Appendix 1: Future Work

* New on-disk index formats.
* Incremental update of on-disk index.

## A2. Appendix 2: TODO

* TODO Describe other common patterns like the following in
blame.c:fake_working_tree_commit()

    discard_cache();
    read_cache();

* TODO Define which source files are considered inside the "core" and
able to use the private fields that we are trying to hide.

2 changes: 2 additions & 0 deletions Makefile
Original file line number Diff line number Diff line change
@@ -667,6 +667,7 @@ TEST_PROGRAMS_NEED_X += test-line-buffer
TEST_PROGRAMS_NEED_X += test-match-trees
TEST_PROGRAMS_NEED_X += test-mergesort
TEST_PROGRAMS_NEED_X += test-mktemp
TEST_PROGRAMS_NEED_X += test-ng-index-api
TEST_PROGRAMS_NEED_X += test-online-cpus
TEST_PROGRAMS_NEED_X += test-parse-options
TEST_PROGRAMS_NEED_X += test-path-utils
@@ -834,6 +835,7 @@ LIB_OBJS += merge-recursive.o
LIB_OBJS += mergesort.o
LIB_OBJS += mru.o
LIB_OBJS += name-hash.o
LIB_OBJS += ng-index-api.o
LIB_OBJS += notes.o
LIB_OBJS += notes-cache.o
LIB_OBJS += notes-merge.o