Public Dataset for distribution metadata #7403


Closed
di opened this issue Feb 18, 2020 · 11 comments · Fixed by #8093 or #8240

Comments

@di
Member

di commented Feb 18, 2020

As requested in pypa/packaging-problems#323, we should explore publishing the metadata for each released distribution in a public dataset via BigQuery.

I'm imagining that each row would contain all the core metadata fields included in each release, as well as filename, digests, file size, upload time, URL to the distribution, etc. Essentially everything in the "Release" JSON API, with the per-release info field included for every individual distribution.
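
Roughly, each row might look like what you'd get by flattening the JSON API, one row per file. A minimal Python sketch, using the existing JSON API fields (the flattened row layout itself is just illustrative):

```python
# Sketch of the proposed one-row-per-distribution shape, built from the
# existing public JSON API. The API fields used here (releases, urls,
# filename, digests, size, upload_time, url) are real; the flattened
# row layout is only an illustration of the proposal.
import requests

def distribution_rows(project: str):
    # The project-level response lists every released version.
    data = requests.get(f"https://pypi.org/pypi/{project}/json").json()
    for version in data["releases"]:
        # The per-release endpoint carries the info block as it was for
        # *that* release, matching the "per-release info field" above.
        rel = requests.get(
            f"https://pypi.org/pypi/{project}/{version}/json"
        ).json()
        for f in rel["urls"]:
            yield {
                "name": rel["info"]["name"],
                "version": version,
                "summary": rel["info"]["summary"],
                "filename": f["filename"],
                "size": f["size"],
                "upload_time": f["upload_time"],
                "url": f["url"],
                "sha256_digest": f["digests"]["sha256"],
            }

for row in distribution_rows("sampleproject"):
    print(row["filename"], row["size"])
```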

Once we're publishing to the dataset on upload, we'd also need to backfill prior distributions as well.

I'm not entirely sure what we'd name it; does the-psf:pypi.distributions make sense?

@ChillarAnand

ChillarAnand commented Feb 26, 2020

One problem with distributing via BigQuery is that it adds an additional barrier to accessing the data, although it is sometimes useful to be able to run a quick query on the data without setting anything up.

What will be the size of the metadata dump? I'd expect it to be only a few GB. Could it also be distributed via alternate channels?

@ewdurbin
Member

ewdurbin commented Apr 3, 2020

@ChillarAnand you bring up a great point about barrier to entry, but as far as I'm aware there aren't really any good "requester pays" options for online queryable datasets aside from BigQuery. We could publish it as a single file, but I'm not sure how much less of a barrier that is.

In addition, when combined with the download data we already have in BigQuery, this metadata would open up all kinds of interesting options for analyzing downloads.
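
For example, a hypothetical join against the download data (a sketch only; both table names and the column layout here are assumptions based on names floated in this thread, not a published schema):

```python
# Illustrative only: join the proposed metadata table with the existing
# download data. `the-psf.pypi.distributions` is the name proposed
# above; `the-psf.pypi.downloads` and the file.project/file.version
# columns are likewise assumptions, not a real schema.
from google.cloud import bigquery

client = bigquery.Client()  # needs a GCP project with billing enabled

query = """
SELECT meta.name, meta.version, COUNT(*) AS downloads
FROM `the-psf.pypi.distributions` AS meta  -- proposed metadata table
JOIN `the-psf.pypi.downloads` AS dl        -- existing downloads data
  ON dl.file.project = meta.name
 AND dl.file.version = meta.version
GROUP BY meta.name, meta.version
ORDER BY downloads DESC
LIMIT 10
"""

for row in client.query(query).result():
    print(row.name, row.version, row.downloads)
```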

@ewdurbin
Member

ewdurbin commented Apr 3, 2020

Another thought: how do we handle it when releases and such are deleted? Should they be removed from the public dataset? If the public dataset matched PyPI's database 1:1, deletions would really be a headache for people doing retrospective analysis.

@di
Member Author

di commented Apr 3, 2020

Another thought: how do we handle it when releases and such are deleted? Should they be removed from the public dataset?

IMO, they should not.

@ChillarAnand

@ewdurbin I agree. I was wondering whether, if the dump is small enough, we could also distribute it via Google Drive, Dropbox, or other channels. That would make it easy to play with the data offline.

@ewdurbin
Member

ewdurbin commented Apr 4, 2020

@ChillarAnand Ultimately I'm not sure the limited volunteer admin time can be spent maintaining two sources, but the dataset is permissively licensed under a Creative Commons Attribution 4.0 International License, so redistributing dumps of the metadata discussed here would be 100% OK.

For the time being I'm going to work under the assumption that we'll be outputting to BigQuery.

@ewdurbin
Member

ewdurbin commented Apr 4, 2020

Actually, this brings up a possible concern with licensing. We'd need to be careful to ensure that what is published in these tables can be licensed under Creative Commons.

This may exclude some fields, like description/description_html.

@di
Member Author

di commented Jul 7, 2020

Leaving this open until the dataset has been fully backfilled and is ready for use. We should probably update our documentation about the datasets as well.

@di
Member Author

di commented Jul 23, 2020

Quick update here: this has been enabled for PyPI and TestPyPI. The backfilling has been completed for TestPyPI and is in progress for PyPI.

Once the backfilling is complete, we can merge #8240 (documentation updates) and this issue should be complete.
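
Once it's live, a quick sanity check on backfill coverage might look like this sketch (the table name here is just the one proposed earlier in this thread; the final published name may differ, see the documentation in #8240):

```python
# Count distributions per upload month to eyeball backfill coverage.
# `the-psf.pypi.distributions` and the `upload_time` TIMESTAMP column
# are the names proposed in this thread, not a confirmed schema.
from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT FORMAT_TIMESTAMP('%Y-%m', upload_time) AS month,
       COUNT(*) AS distributions
FROM `the-psf.pypi.distributions`
GROUP BY month
ORDER BY month
"""
for row in client.query(query).result():
    print(row.month, row.distributions)
```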

@Mic92

Mic92 commented Jul 27, 2020

Could it be exported to Wikidata?

di closed this as completed in #8240 Aug 6, 2020
@AMDmi3

AMDmi3 commented Nov 6, 2020

We could publish it as a single file, but I'm not sure how much less of a barrier that is.

Please do so, for it actually is much less of a barrier. BigQuery is not an option at all, as it makes the data unavailable to people without a Google account.
