Public database of general project metadata #323

Closed
Melykuti opened this issue Feb 17, 2020 · 4 comments

@Melykuti

Similarly to how the Linehaul project streams PyPI download data into a Google BigQuery public dataset (into tables called `the-psf:pypi.downloadsYYYYMMDD`), I wish to be able to query a public dataset that contains the project metadata from PyPI, for example how it is presented by Warehouse JSON API calls.

I would like to analyse project licenses, and I would not be the only one interested in using this data for a study. This data is technically simpler to collect than the downloads data and would be similarly informative about the totality of Python packages. I'd be thankful if such a database were created.
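
For context, the per-project lookup that such a dataset would replace can be sketched as below. This is a minimal sketch, not an official client: it assumes the JSON API endpoint `https://pypi.org/pypi/<project>/json` and that the response carries an `info` object with `license` and `classifiers` fields. The helper names (`metadata_url`, `fetch_metadata`, `license_info`) are made up for illustration.

```python
import json
from urllib.request import urlopen


def metadata_url(project: str) -> str:
    """Build the JSON API URL for a project (assumed endpoint layout)."""
    return f"https://pypi.org/pypi/{project}/json"


def fetch_metadata(project: str) -> dict:
    """Download and parse the metadata of one project (needs network access)."""
    with urlopen(metadata_url(project)) as response:
        return json.load(response)


def license_info(metadata: dict) -> dict:
    """Extract the free-text license field and any 'License ::' classifiers."""
    info = metadata.get("info", {})
    declared = (info.get("license") or "").strip()
    prefix = "License :: "
    classifiers = [
        c[len(prefix):]
        for c in (info.get("classifiers") or [])
        if c.startswith(prefix)
    ]
    return {
        "name": info.get("name"),
        "license": declared,
        "license_classifiers": classifiers,
    }


if __name__ == "__main__":
    # Illustrative project names only; one HTTP request is made per project.
    for project in ["requests", "numpy"]:
        print(license_info(fetch_metadata(project)))
```

Doing this for every project means one request per package, which is the cost a public bulk dataset would avoid.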

@pradyunsg
Member

pradyunsg commented Feb 18, 2020

Relevant here: https://dustingram.com/articles/2018/03/05/why-pypi-doesnt-know-dependencies/

Note that license information is part of the same "metadata" as dependency information.

@pradyunsg pradyunsg transferred this issue from pypi/linehaul Feb 18, 2020
@KOLANICH

KOLANICH commented Feb 18, 2020

> I would like to analyse project licenses.

I don't think that copyright trolls (including GPL trolls) should be gifted a tool for their needs. Of course they can create their own tool, but I don't think they should be helped.

> Implementing new specifications like PEP 517

They have forgotten to notice setup.cfg.

Also, IMHO, using wheels should be encouraged. How about banning all uploads to PyPI that are not wheels or their signatures?

@ncoghlan
Member

Tidelift provide such a database at libraries.io: https://libraries.io/languages/Python

The relevant libraries.io page is linked from the "Statistics" section on each PyPI project page.

There aren't any plans to build a Python specific version of this dataset (since a language independent one is generally more useful).

@ChillarAnand

Does the Tidelift dump have all the metadata provided by PyPI, or does it provide only a subset of the metadata?
