Reverse dependencies: Exploring a better data model and storage design #960

Kleidukos · 2021-08-02T20:22:12Z

Context: #40

@gbaz and I talked about exploring the data model and storage design space to store reverse-dependencies.

From what I can read in packdeps' source code (which does an admirable job at this task), a binary file produced from processing the Hackage index is used to store the necessary information.

@phadej @snoyberg Your experience is more than welcome on this thorny issue. :)

gbaz · 2021-08-02T20:48:24Z

The unmerged PR for revdeps lives here: #723

There may be useful UI code, etc. in there. The obstacle to merging was the memory footprint.

My overall feeling is the problem is we construct a complex structure that tries to answer every possible question. Instead we should enumerate the few key questions we want to answer. In my mind, that's as follows:

Immediate rev-deps
Total count of transitive rev-deps
Packages pinned to no more than a certain version.

My suggestion was that a simpler architecture could be to use a standard relational database to encode the dependency constraints -- for each package/version we simply record each other package name it depends on, as well as the min and max version bounds. (approximating when the bounds look weirder). This lets us query directly for immediate rev-deps and cache/store them in memory, and also lets us do the more restrictive query to answer #3.

Periodically, we could run "rollup" queries to calculate the total set of transitive rev-deps, and store just the count in memory to cut down on footprint.

I suspect this would give a much reduced memory footprint compared to the existing code -- although I will point out that the existing code is pretty neat in that it has like navigable javascript graphs and soforth. So maybe we could just move to constructing the existing datastructure for it out of band as well and storing the data for those graphs in blob storage or something, if we think they're sufficiently cool. That said, I haven't played with the existing implementation and explored how neat/usable/useful the graphs really are in quite some time, so I'd encourage someone interested to explore a bit...

Kleidukos · 2021-08-02T20:51:39Z

Periodically, we could run "rollup" queries to calculate the total set of transitive rev-deps, and store just the count in memory to cut down on footprint.

Yep, this is definitely something I would go with.

kozross · 2021-08-04T21:36:28Z

I agree with @gbaz for the most part. However, I am a little confused about the 'memory footprint' being referred to. If we're using a standard relational schema to encode dependency information (which is a graph), the memory cost would be on-disk (since I don't think there's a good in-memory database solution that would do what we want and is open-source). There is a quadratic cost in size involved (since you're basically storing something like an adjacency matrix), but this is disk, not RAM, so I don't think it's that big a deal.

I may be misunderstanding the scope of the issue though - clarification is welcome. My current 10 cents more or less amounts to:

Encode the dependency graph of everything properly, using standard relational methods;
Write a (parameterized) query to do this kind of lookup.

This should be both reasonably fast, and not that difficult to do (or at least, I know how I would do it). Am I missing something?

gbaz · 2021-09-04T04:05:38Z

The memory footprint as discussed referred solely to resident memory. The prior design was entirely in memory, and that was (partly) the issue.

gbaz · 2024-11-02T18:20:43Z

revdeps has been completed and merged.

gbaz closed this as completed Nov 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reverse dependencies: Exploring a better data model and storage design #960

Reverse dependencies: Exploring a better data model and storage design #960

Kleidukos commented Aug 2, 2021

gbaz commented Aug 2, 2021

Kleidukos commented Aug 2, 2021

kozross commented Aug 4, 2021

gbaz commented Sep 4, 2021 •

edited

Loading

gbaz commented Nov 2, 2024

Reverse dependencies: Exploring a better data model and storage design #960

Reverse dependencies: Exploring a better data model and storage design #960

Comments

Kleidukos commented Aug 2, 2021

gbaz commented Aug 2, 2021

Kleidukos commented Aug 2, 2021

kozross commented Aug 4, 2021

gbaz commented Sep 4, 2021 • edited Loading

gbaz commented Nov 2, 2024

gbaz commented Sep 4, 2021 •

edited

Loading