-
Notifications
You must be signed in to change notification settings - Fork 200
Reverse dependencies: Exploring a better data model and storage design #960
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
The unmerged PR for revdeps lives here: #723 There may be useful UI code, etc. in there. The obstacle to merging was the memory footprint. My overall feeling is the problem is we construct a complex structure that tries to answer every possible question. Instead we should enumerate the few key questions we want to answer. In my mind, that's as follows:
My suggestion was that a simpler architecture could be to use a standard relational database to encode the dependency constraints -- for each package/version we simply record each other package name it depends on, as well as the min and max version bounds. (approximating when the bounds look weirder). This lets us query directly for immediate rev-deps and cache/store them in memory, and also lets us do the more restrictive query to answer #3. Periodically, we could run "rollup" queries to calculate the total set of transitive rev-deps, and store just the count in memory to cut down on footprint. I suspect this would give a much reduced memory footprint compared to the existing code -- although I will point out that the existing code is pretty neat in that it has like navigable javascript graphs and soforth. So maybe we could just move to constructing the existing datastructure for it out of band as well and storing the data for those graphs in blob storage or something, if we think they're sufficiently cool. That said, I haven't played with the existing implementation and explored how neat/usable/useful the graphs really are in quite some time, so I'd encourage someone interested to explore a bit... |
Yep, this is definitely something I would go with. |
I agree with @gbaz for the most part. However, I am a little confused about the 'memory footprint' being referred to. If we're using a standard relational schema to encode dependency information (which is a graph), the memory cost would be on-disk (since I don't think there's a good in-memory database solution that would do what we want and is open-source). There is a quadratic cost in size involved (since you're basically storing something like an adjacency matrix), but this is disk, not RAM, so I don't think it's that big a deal. I may be misunderstanding the scope of the issue though - clarification is welcome. My current 10 cents more or less amounts to:
This should be both reasonably fast, and not that difficult to do (or at least, I know how I would do it). Am I missing something? |
The memory footprint as discussed referred solely to resident memory. The prior design was entirely in memory, and that was (partly) the issue. |
revdeps has been completed and merged. |
Context: #40
@gbaz and I talked about exploring the data model and storage design space to store reverse-dependencies.
From what I can read in packdeps' source code (which does an admirable job at this task), a binary file produced from processing the Hackage index is used to store the necessary information.
@phadej @snoyberg Your experience is more than welcome on this thorny issue. :)
The text was updated successfully, but these errors were encountered: