Description
I have a large enough ipfs repo (over 300 million blocks) that it takes a very large amount of memory to perform a gc; computing the marked set is expensive.
I'm thinking that using something like a bloom filter could make this process use much less memory, at the expense of not cleaning out every block. The difficulty here is that false positives while enumerating the set of pinned objects could drastically reduce how much actually gets collected (we could accidentally think a block that points to everything is pinned and end up cleaning out nothing), so selecting parameters to avoid this is important.
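For reference, here is a minimal sketch of what a bloom filter mark set could look like; the type names, the double-hashing scheme, and the parameter choices are illustrative assumptions, not existing go-ipfs code. The sizing formulas make the tradeoff above concrete: the false-positive rate has to be driven very low, which is what determines the memory savings.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

// bloomMarkSet is a minimal Bloom filter used as a GC mark set.
// It is sized from an expected number of pinned blocks (n) and a
// target false-positive rate (p); both are assumptions supplied by
// the caller, not values the current gc code knows about.
type bloomMarkSet struct {
	bits []uint64
	m    uint64 // number of bits in the filter
	k    uint64 // number of hash functions
}

func newBloomMarkSet(n uint64, p float64) *bloomMarkSet {
	// Standard sizing: m = -n*ln(p)/(ln 2)^2, k = (m/n)*ln 2.
	m := uint64(math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2)))
	k := uint64(math.Max(1, math.Round(float64(m)/float64(n)*math.Ln2)))
	return &bloomMarkSet{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// indexes derives k bit positions from two FNV hashes of the key
// (double hashing), avoiding k separate hash passes.
func (b *bloomMarkSet) indexes(key []byte) []uint64 {
	h1 := fnv.New64a()
	h1.Write(key)
	h2 := fnv.New64()
	h2.Write(key)
	base, step := h1.Sum64(), h2.Sum64()|1
	idx := make([]uint64, b.k)
	for i := uint64(0); i < b.k; i++ {
		idx[i] = (base + i*step) % b.m
	}
	return idx
}

// Mark records a pinned block's key in the filter.
func (b *bloomMarkSet) Mark(key []byte) {
	for _, i := range b.indexes(key) {
		b.bits[i/64] |= 1 << (i % 64)
	}
}

// IsMarked reports whether a key may be pinned; false positives are
// possible, false negatives are not.
func (b *bloomMarkSet) IsMarked(key []byte) bool {
	for _, i := range b.indexes(key) {
		if b.bits[i/64]&(1<<(i%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	// Sized small here so the example runs cheaply. At 300 million pinned
	// blocks and p = 1e-6 the filter needs roughly 1 GiB of bits, versus
	// tens of GiB for an exact in-memory set of CIDs.
	ms := newBloomMarkSet(1_000_000, 1e-6)
	ms.Mark([]byte("QmExamplePinnedCID"))
	fmt.Println(ms.IsMarked([]byte("QmExamplePinnedCID"))) // true
	fmt.Println(ms.IsMarked([]byte("QmSomeOtherCID")))     // almost certainly false
}
```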
Another (potentially more complicated but more accurate) option is to use a disk-backed prefix tree to store the enumerated pinset (with heavy caching up to some memory limit to keep performance tolerable). This just offloads the memory cost of storing the sets to disk, which is generally acceptable, but it would prevent people from running a GC when their disk is full, which is generally considered a bad thing.
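As a strawman for the disk-backed direction, here is a deliberately naive sketch that records each pinned key as an empty marker file in a sharded directory and lets the OS page cache do the caching. It is not a prefix tree and not existing code; a real implementation would presumably use an on-disk trie or a key-value store with an explicit memory-bounded cache, but the shape of the interface would be similar.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// diskMarkSet stores one empty marker file per pinned key under a root
// directory, sharded by a suffix of the key so no single directory grows
// too large (loosely similar to how flatfs shards blocks). The OS page
// cache stands in for the "heavy caching" layer here.
type diskMarkSet struct {
	root string
}

func newDiskMarkSet(root string) (*diskMarkSet, error) {
	if err := os.MkdirAll(root, 0o755); err != nil {
		return nil, err
	}
	return &diskMarkSet{root: root}, nil
}

// path buckets keys by their last two characters.
func (d *diskMarkSet) path(key string) string {
	shard := key
	if len(key) > 2 {
		shard = key[len(key)-2:]
	}
	return filepath.Join(d.root, shard, key)
}

// Mark records a pinned key by creating its marker file.
func (d *diskMarkSet) Mark(key string) error {
	p := d.path(key)
	if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
		return err
	}
	f, err := os.OpenFile(p, os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	return f.Close()
}

// IsMarked checks for the marker file's existence.
func (d *diskMarkSet) IsMarked(key string) (bool, error) {
	_, err := os.Stat(d.path(key))
	if err == nil {
		return true, nil
	}
	if os.IsNotExist(err) {
		return false, nil
	}
	return false, err
}

func main() {
	ms, err := newDiskMarkSet(filepath.Join(os.TempDir(), "gc-markset-example"))
	if err != nil {
		panic(err)
	}
	if err := ms.Mark("QmExamplePinnedCID"); err != nil {
		panic(err)
	}
	marked, err := ms.IsMarked("QmExamplePinnedCID")
	fmt.Println(marked, err) // true <nil>
}
```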
I'm also interested in strategies that can do "a little gc": something that allows us to quickly free a smaller subset of the blocks without the overhead of performing an entire gc scan. Implementing something like this may require rethinking how pinsets and objects are stored.
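One cheap version of this would be to bound the sweep by a reclamation budget. The sketch below is hypothetical (the `Blockstore` and `MarkSet` interfaces and `partialGC` are not existing go-ipfs APIs): it stops deleting once a byte budget is hit, so a "little gc" run stays short and can be repeated later. It still needs a populated mark set, so only the sweep is bounded; avoiding the full marking pass is where the pinset/object storage rethink would come in.

```go
package main

import (
	"context"
	"fmt"
)

// Blockstore and MarkSet are hypothetical stand-ins for the real
// interfaces; only the methods this sketch needs are included.
type Blockstore interface {
	AllKeys(ctx context.Context) ([]string, error)
	GetSize(ctx context.Context, key string) (int, error)
	Delete(ctx context.Context, key string) error
}

type MarkSet interface {
	IsMarked(key string) bool
}

// partialGC sweeps the blockstore but stops once roughly maxBytes of
// garbage has been reclaimed. Marking cost is unchanged; only the sweep
// is bounded.
func partialGC(ctx context.Context, bs Blockstore, marked MarkSet, maxBytes int64) (int64, error) {
	keys, err := bs.AllKeys(ctx)
	if err != nil {
		return 0, err
	}
	var freed int64
	for _, key := range keys {
		if freed >= maxBytes {
			break
		}
		if marked.IsMarked(key) {
			continue
		}
		size, err := bs.GetSize(ctx, key)
		if err != nil {
			return freed, err
		}
		if err := bs.Delete(ctx, key); err != nil {
			return freed, err
		}
		freed += int64(size)
	}
	return freed, nil
}

// memStore and memMarks are toy in-memory implementations used to
// exercise partialGC in main.
type memStore map[string]int

func (m memStore) AllKeys(context.Context) ([]string, error) {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	return keys, nil
}
func (m memStore) GetSize(_ context.Context, key string) (int, error) { return m[key], nil }
func (m memStore) Delete(_ context.Context, key string) error         { delete(m, key); return nil }

type memMarks map[string]bool

func (m memMarks) IsMarked(key string) bool { return m[key] }

func main() {
	bs := memStore{"pinned": 100, "garbage-a": 250_000, "garbage-b": 250_000}
	marked := memMarks{"pinned": true}
	// Budget is smaller than the total garbage, so only part of it is
	// swept this run; the rest can be reclaimed by a later run.
	freed, err := partialGC(context.Background(), bs, marked, 200_000)
	if err != nil {
		panic(err)
	}
	fmt.Printf("freed %d bytes, %d blocks remain\n", freed, len(bs))
}
```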