Public Dataset for distribution metadata #7403


Closed
di opened this issue Feb 18, 2020 · 11 comments · Fixed by #8093 or #8240

Comments

@di
Member

di commented Feb 18, 2020

As requested in pypa/packaging-problems#323, we should explore publishing the metadata for each released distribution in a public dataset via BigQuery.

I'm imagining that each row would contain all the core metadata fields included in each release, as well as filename, digests, file size, upload time, URL to the distribution, etc. Essentially everything in the "Release" JSON API, with the per-release info field included for every individual distribution.
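
Roughly, each row might look like what you'd get by flattening the JSON API, one row per file. A minimal Python sketch, using the existing JSON API fields (the flattened row layout itself is just illustrative):

```python
# Sketch of the proposed one-row-per-distribution shape, built from the
# existing public JSON API. The API fields used here (releases, urls,
# filename, digests, size, upload_time, url) are real; the flattened
# row layout is only an illustration of the proposal.
import requests

def distribution_rows(project: str):
    # The project-level response lists every released version.
    data = requests.get(f"https://pypi.org/pypi/{project}/json").json()
    for version in data["releases"]:
        # The per-release endpoint carries the info block as it was for
        # *that* release, matching the "per-release info field" above.
        rel = requests.get(
            f"https://pypi.org/pypi/{project}/{version}/json"
        ).json()
        for f in rel["urls"]:
            yield {
                "name": rel["info"]["name"],
                "version": version,
                "summary": rel["info"]["summary"],
                "filename": f["filename"],
                "size": f["size"],
                "upload_time": f["upload_time"],
                "url": f["url"],
                "sha256_digest": f["digests"]["sha256"],
            }

for row in distribution_rows("sampleproject"):
    print(row["filename"], row["size"])
```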

Once we're publishing to the dataset on upload, we'd also need to backfill prior distributions as well.

I'm not entirely sure what we'd name it; does the-psf:pypi.distributions make sense?

@ChillarAnand

ChillarAnand commented Feb 26, 2020

One problem with distributing via BigQuery is that it adds an additional barrier to accessing the data, although it is sometimes useful to be able to run a quick query on the data without setting anything up.

What will be the size of the metadata dump? I'd expect it to be only a few GB. Could it also be distributed via alternate channels?

@ewdurbin
Member

ewdurbin commented Apr 3, 2020

@ChillarAnand you bring up a great point about barrier to entry, but as far as I'm aware there aren't really any good "requester pays" options for online queryable datasets aside from BigQuery. We could publish it as a single file, but I'm not sure how much less of a barrier that is.

In addition, when combined with the download data we already have in BigQuery, this metadata would open up all kinds of interesting options for analyzing downloads.
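
For example, a hypothetical join against the download data (a sketch only; both table names and the column layout here are assumptions based on names floated in this thread, not a published schema):

```python
# Illustrative only: join the proposed metadata table with the existing
# download data. `the-psf.pypi.distributions` is the name proposed
# above; `the-psf.pypi.downloads` and the file.project/file.version
# columns are likewise assumptions, not a real schema.
from google.cloud import bigquery

client = bigquery.Client()  # needs a GCP project with billing enabled

query = """
SELECT meta.name, meta.version, COUNT(*) AS downloads
FROM `the-psf.pypi.distributions` AS meta  -- proposed metadata table
JOIN `the-psf.pypi.downloads` AS dl        -- existing downloads data
  ON dl.file.project = meta.name
 AND dl.file.version = meta.version
GROUP BY meta.name, meta.version
ORDER BY downloads DESC
LIMIT 10
"""

for row in client.query(query).result():
    print(row.name, row.version, row.downloads)
```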

@ewdurbin
Member

ewdurbin commented Apr 3, 2020

Another thought: how do we handle it when releases and such are deleted? Should they be removed from the public dataset? If the public dataset matched PyPI's database 1:1, deletions would really be a headache for people doing retrospective analysis.

@di
Member Author

di commented Apr 3, 2020

Another thought: how do we handle it when releases and such are deleted? Should they be removed from the public dataset?

IMO, they should not.

@ChillarAnand

@ewdurbin I agree. I was wondering whether, if the dump is small enough, we could also distribute it via Google Drive, Dropbox, or other channels. That would make it easy to play with the data offline.

@ewdurbin
Member

ewdurbin commented Apr 4, 2020

@ChillarAnand Ultimately I'm not sure the limited volunteer admin time can be spent maintaining two sources, but the dataset is permissively licensed under a Creative Commons Attribution 4.0 International License, so redistributing dumps of the metadata discussed here would be 100% OK.

For the time being I'm going to work under the assumption that we'll be outputting to BigQuery.

@ewdurbin
Member

ewdurbin commented Apr 4, 2020

Actually, this brings up a possible concern with licensing. We'd need to be careful to ensure that what is published in these tables can be licensed under Creative Commons.

This may exclude some fields, like description/description_html.

@di
Member Author

di commented Jul 7, 2020

Leaving this open until the dataset has been fully backfilled and is ready for use. We should probably update our documentation about the datasets as well.

@di
Member Author

di commented Jul 23, 2020

Quick update here: this has been enabled for PyPI and TestPyPI. The backfilling has been completed for TestPyPI and is in progress for PyPI.

Once the backfilling is complete, we can merge #8240 (documentation updates) and this issue should be complete.
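
Once it's live, a quick sanity check on backfill coverage might look like this sketch (the table name here is just the one proposed earlier in this thread; the final published name may differ, see the documentation in #8240):

```python
# Count distributions per upload month to eyeball backfill coverage.
# `the-psf.pypi.distributions` and the `upload_time` TIMESTAMP column
# are the names proposed in this thread, not a confirmed schema.
from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT FORMAT_TIMESTAMP('%Y-%m', upload_time) AS month,
       COUNT(*) AS distributions
FROM `the-psf.pypi.distributions`
GROUP BY month
ORDER BY month
"""
for row in client.query(query).result():
    print(row.month, row.distributions)
```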

@Mic92

Mic92 commented Jul 27, 2020

Could it be exported to Wikidata?

di closed this as completed in #8240 Aug 6, 2020
@AMDmi3

AMDmi3 commented Nov 6, 2020

We could publish it as a single file, but I'm not sure how much less of a barrier that is.

Please do so, for it actually is much less of a barrier. BigQuery is not an option at all, as it makes the data unavailable to people without a Google account.
