Repurpose 'malware' checks / YARA scanning for secret scanning #12412
Comments
I think there is limited value in rolling our own secret scanning solution for PyPI. We could of course scan for specific patterns we derive ourselves (like I did for AWS keys) and work out how to validate each one, but that is really re-implementing the GitHub secret scanning infrastructure, which already has a lot of partners: https://docs.github.com/en/code-security/secret-scanning/secret-scanning-patterns

I've been toying with a stupid-sounding idea that I can't seem to let go of. Why don't we extract all text files from each uploaded release and commit them to a unique orphaned branch in a GitHub repository? We could expire the branches after a period of time to keep the repo size down, and I expect the total size of all unique text content in PyPI, compressed, is significantly lower than the 12 TB total storage size. This would trigger a scan against all supported providers, without us needing to do much.
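To make the "extract all text files from each uploaded release" step concrete, here is a minimal sketch, assuming the release is a tar-based sdist already held in memory. The function name `iter_text_files`, the 1 MB size cap, and the "no NUL bytes and valid UTF-8" heuristic for deciding what counts as text are all illustrative assumptions, not anything PyPI actually does.

```python
import io
import tarfile
from typing import Iterator, Tuple


def iter_text_files(
    archive_bytes: bytes, max_size: int = 1_000_000
) -> Iterator[Tuple[str, str]]:
    """Yield (member name, decoded text) for each UTF-8 text file in a tar archive."""
    # "r:*" lets tarfile detect the compression (gzip, bz2, xz, or none).
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as archive:
        for member in archive:
            # Skip directories, links, and files above the (assumed) size cap.
            if not member.isfile() or member.size > max_size:
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            data = handle.read()
            # Heuristic: a NUL byte or invalid UTF-8 means "binary", so skip it.
            if b"\x00" in data:
                continue
            try:
                yield member.name, data.decode("utf-8")
            except UnicodeDecodeError:
                continue
```

Wheels are zip files, so a real version would need a second branch using `zipfile`; the filtering logic would stay the same.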
An interesting idea, but IMO that sort of feels like an abuse of GitHub, or at least not a way in which GitHub is intended to be used, and might be confusing to our users and the secret scanning partners:
I want to reiterate that the scanning infrastructure already exists on PyPI; we just need to write some YARA patterns for the specific partners we want to support, and add functionality for hitting a webhook with the result to notify them.
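As a rough illustration of the "pattern plus webhook" flow described above, the sketch below uses Python's `re` module in place of a YARA rule and only builds the JSON body, without sending it. The pattern is the commonly used shape for AWS access key IDs (`AKIA` followed by 16 uppercase alphanumerics); the `Finding` class, the function names, and the payload layout are hypothetical and do not reflect any real partner API.

```python
import json
import re
from dataclasses import asdict, dataclass
from typing import List

# Commonly used shape of a long-term AWS access key ID: "AKIA" + 16 characters.
AWS_ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


@dataclass
class Finding:
    provider: str
    filename: str
    line: int
    token: str


def scan_text(filename: str, text: str) -> List[Finding]:
    """Return one Finding per pattern match, with 1-based line numbers."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in AWS_ACCESS_KEY_ID.finditer(line):
            findings.append(Finding("aws", filename, line_number, match.group(0)))
    return findings


def build_webhook_payload(project: str, findings: List[Finding]) -> str:
    """Serialize findings into a JSON body (hypothetical layout) for a partner webhook."""
    body = {"project": project, "findings": [asdict(f) for f in findings]}
    return json.dumps(body, sort_keys=True)
```

A match on shape alone says nothing about whether the key is live, which is why the validation and notification steps still matter.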
These are all valid points, but I'm assuming in all of this that we're talking about scanning for more than just PyPI tokens. Those can (and should) be handled in house where convenient. My argument is that while it's simple to scan for and revoke PyPI tokens, the moment we want "reciprocal scanning of GitHub tokens" or "other platforms/ecosystems", things seem to get a lot more complex. And I'm not sure of the value of scanning for PyPI tokens alone.

GitHub does seem to care about these kinds of issues, and my overall point is that working with them might be better than working alone. That might come in different forms, from hacking an MVP together with git to seeing if there's a chance an external secret scanning API for PyPI and other registries might be put on their roadmap.
I've started work on committing the text content of every file uploaded to PyPI in November 2022, with the aim of seeing how many secrets GitHub would detect. It's being added in chunks here: https://github.com/orf/pypi-import

I think we (or maybe just me?) have massively underestimated the number and variety of credentials that have been published to PyPI. It has only completed about 15% of November, but it has already found hundreds of keys from many different services, including 9 distinct PyPI API tokens so far. Spot checking them reveals even more, including straight-up valid, raw database credentials.
I finally got around to finishing the side project I mentioned above, and the results are quite scary. I began mirroring uploads to GitHub (for example this repo contains code uploaded to PyPI between 2023-08-24 and 2023-08-29), with the intention of both allowing easier analysis of the code and leveraging GitHub's existing secret scanning infrastructure. In #13647 it was said by @ewdurbin:
This sounds like a great idea and I know gitguardian.com is interested in this. They support a wider range of credential types than GitHub and I've been working with them to scan all the historic code. Both the volume and types of credentials that are present are very scary, and a non-trivial portion of them are live and valid.

The stats page on the project website details the high-level breakdown of secrets detected by GitHub. An analysis using the data from GitGuardian on only valid credentials (i.e. ones their service is able to detect as valid) shows that the number of individual projects with valid credentials published is growing over time.

However, the issue is both detection and remediation. It's OK to get an alert from PyPI saying there are some credentials in your release, but unless you see it and act on it, you're in for a bad time. The part that (unfortunately) benefits from centralization here is notifying the providers as well. For example, when an AWS key is added to GitHub, AWS will be pinged and will:
Other providers do similar things. This part is tough to crack unless GitHub themselves allow limited third parties to use this portion of their secret scanning infrastructure without necessarily committing code to GitHub.
Here is a breakdown of all the live credential types found: Credentials
Azure Active Directory API Keys are a bit scary, but the point is that there is a very large variety of key types and it's impractical to build something to detect them or provide a way to notify the providers themselves.
What's the problem this feature will solve?
While our existing 'malware' checks are not being used due to noise and false positives, the mechanics available via the checks are perfect for token/credential scanning.
Describe the solution you'd like
We should be able to write a class of checks that:
Once this is implemented, we should support scanning for specific tokens:
pypi- and testpypi- prefixed tokens, which should automatically be revoked when found
Additional context
Some other potential ecosystems that might want to integrate: https://docs.github.com/en/code-security/secret-scanning/secret-scanning-patterns
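The pypi- and testpypi- token check proposed above could look roughly like the following. This is a minimal sketch, assuming a token is one of those two prefixes followed by a long run of URL-safe base64 characters; the 32-character minimum, the function names, and the `revoke` callback are assumptions for illustration and are not taken from Warehouse.

```python
import re
from typing import Callable, List

# Assumption: prefix followed by at least 32 URL-safe base64 characters.
TOKEN_PATTERN = re.compile(r"\b(?:pypi|testpypi)-[A-Za-z0-9_-]{32,}")


def find_tokens(text: str) -> List[str]:
    """Return each distinct candidate token, in order of first appearance."""
    # dict.fromkeys de-duplicates while preserving insertion order.
    return list(dict.fromkeys(TOKEN_PATTERN.findall(text)))


def revoke_found_tokens(text: str, revoke: Callable[[str], None]) -> int:
    """Call the (hypothetical) revoke callback once per distinct token found."""
    tokens = find_tokens(text)
    for token in tokens:
        revoke(token)
    return len(tokens)
```

Passing the revocation step in as a callback keeps the matching logic testable on its own, whatever the real revocation mechanism turns out to be.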