Public Dataset for distribution metadata #7403
Comments
One problem with distributing via BigQuery is that it adds additional barriers to accessing the data, although it is sometimes useful to run a quick query without setting anything up. What will the size of the metadata dump be? I think it should only be a few GBs. Can't it be distributed via alternate channels?
@ChillarAnand you bring up a great point about barrier to entry, but as far as I'm aware there aren't really any good "requester pays" options for online queryable datasets aside from BigQuery. We could publish it as a single file, but I'm not sure how much less of a barrier that is. In addition, when combined with the data we already have in BigQuery, this metadata would provide all kinds of interesting options for analyzing downloads.
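To make that concrete, something along these lines would become possible. This is only a sketch: both table names below are placeholders/assumptions (the final names haven't been decided), and the download-statistics table layout is assumed.

```python
# Sketch only: table names and layouts here are assumptions, not final.
from google.cloud import bigquery

client = bigquery.Client()  # queries are billed to your own GCP project

query = """
SELECT
  meta.name,
  meta.version,
  COUNT(*) AS downloads_last_30_days
FROM `the-psf.pypi.distributions` AS meta               -- hypothetical metadata table
JOIN `bigquery-public-data.pypi.file_downloads` AS dl   -- assumed downloads table layout
  ON dl.file.filename = meta.filename
WHERE dl.timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY meta.name, meta.version
ORDER BY downloads_last_30_days DESC
LIMIT 20
"""

for row in client.query(query).result():
    print(row.name, row.version, row.downloads_last_30_days)
```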
Another thought is how we handle it when releases and such are deleted: should they be removed from the public dataset? If the public dataset matches PyPI's DB 1:1, it would really be a headache for people doing retrospective analysis.
IMO, they should not. |
@ewdurbin I agree. I was wondering: if the dump size is small, we could also distribute it via Google Drive, Dropbox, or other channels. That would make it easy to play with the data offline.
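For example, with a dump in hand, offline analysis needs nothing more than pandas. This is a minimal sketch assuming a hypothetical newline-delimited JSON dump; no such export exists yet, so the file name is made up.

```python
import pandas as pd

# Hypothetical newline-delimited JSON dump; no such export is published yet.
df = pd.read_json("pypi-distribution-metadata.jsonl", lines=True)

# Example offline question: how many files exist per packaging type?
print(df.groupby("packagetype").size().sort_values(ascending=False))
```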
@ChillarAnand Ultimately I'm not sure that limited volunteer admin time can be spent maintaining two sources, but the dataset is permissively licensed under a Creative Commons Attribution 4.0 International License, so redistributing dumps of the metadata discussed here would be 100% OK. For the time being I'm going to work under the assumption that we'll be outputting to BigQuery.
Actually, this brings up a possible concern with licensing. We'd need to be careful to ensure that what is published in these tables can be licensed under Creative Commons. This may exclude some fields like description/description_html.
Leaving this open until the dataset has been fully backfilled and is ready for use. We should probably update our documentation about the datasets as well. |
Quick update here: this has been enabled for PyPI and TestPyPI. The backfilling has been completed for TestPyPI and is in progress for PyPI. Once the backfilling is complete, we can merge #8240 (documentation updates) and this issue should be complete. |
Could it be exported to Wikidata?
Please do so, as it actually is much less of a barrier. BigQuery is not an option at all, since it makes the data unavailable to people without a Google account.
As requested in pypa/packaging-problems#323, we should explore publishing the metadata for each released distribution in a public dataset via BigQuery.
I'm imagining that each row would contain all the core metadata fields included in each release, as well as the filename, digests, file size, upload time, URL to the distribution, etc. Essentially everything in the "Release" JSON API, with the per-release info field included for every individual distribution. Once we're publishing to the dataset on upload, we'd also need to backfill prior distributions as well.
Not entirely sure what we'd name it; does the-psf:pypi.distributions make sense?