Skip to content

Reverse dependencies: Exploring a better data model and storage design #960

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Kleidukos opened this issue Aug 2, 2021 · 5 comments
Closed

Comments

@Kleidukos
Copy link
Member

Context: #40

@gbaz and I talked about exploring the data model and storage design space to store reverse-dependencies.

From what I can read in packdeps' source code (which does an admirable job at this task), a binary file produced from processing the Hackage index is used to store the necessary information.

@phadej @snoyberg Your experience is more than welcome on this thorny issue. :)

@gbaz
Copy link
Contributor

gbaz commented Aug 2, 2021

The unmerged PR for revdeps lives here: #723

There may be useful UI code, etc. in there. The obstacle to merging was the memory footprint.

My overall feeling is the problem is we construct a complex structure that tries to answer every possible question. Instead we should enumerate the few key questions we want to answer. In my mind, that's as follows:

  1. Immediate rev-deps
  2. Total count of transitive rev-deps
  3. Packages pinned to no more than a certain version.

My suggestion was that a simpler architecture could be to use a standard relational database to encode the dependency constraints -- for each package/version we simply record each other package name it depends on, as well as the min and max version bounds. (approximating when the bounds look weirder). This lets us query directly for immediate rev-deps and cache/store them in memory, and also lets us do the more restrictive query to answer #3.

Periodically, we could run "rollup" queries to calculate the total set of transitive rev-deps, and store just the count in memory to cut down on footprint.

I suspect this would give a much reduced memory footprint compared to the existing code -- although I will point out that the existing code is pretty neat in that it has like navigable javascript graphs and soforth. So maybe we could just move to constructing the existing datastructure for it out of band as well and storing the data for those graphs in blob storage or something, if we think they're sufficiently cool. That said, I haven't played with the existing implementation and explored how neat/usable/useful the graphs really are in quite some time, so I'd encourage someone interested to explore a bit...

@Kleidukos
Copy link
Member Author

Periodically, we could run "rollup" queries to calculate the total set of transitive rev-deps, and store just the count in memory to cut down on footprint.

Yep, this is definitely something I would go with.

@kozross
Copy link

kozross commented Aug 4, 2021

I agree with @gbaz for the most part. However, I am a little confused about the 'memory footprint' being referred to. If we're using a standard relational schema to encode dependency information (which is a graph), the memory cost would be on-disk (since I don't think there's a good in-memory database solution that would do what we want and is open-source). There is a quadratic cost in size involved (since you're basically storing something like an adjacency matrix), but this is disk, not RAM, so I don't think it's that big a deal.

I may be misunderstanding the scope of the issue though - clarification is welcome. My current 10 cents more or less amounts to:

  • Encode the dependency graph of everything properly, using standard relational methods;
  • Write a (parameterized) query to do this kind of lookup.

This should be both reasonably fast, and not that difficult to do (or at least, I know how I would do it). Am I missing something?

@gbaz
Copy link
Contributor

gbaz commented Sep 4, 2021

The memory footprint as discussed referred solely to resident memory. The prior design was entirely in memory, and that was (partly) the issue.

@gbaz
Copy link
Contributor

gbaz commented Nov 2, 2024

revdeps has been completed and merged.

@gbaz gbaz closed this as completed Nov 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants