Repurpose 'malware' checks / YARA scanning for secret scanning #12412

Closed
3 tasks
di opened this issue Oct 24, 2022 · 7 comments · Fixed by #13647
Labels
feature request malware-detection Issues related to automated malware detection.

Comments

@di
Member

di commented Oct 24, 2022

What's the problem this feature will solve?
While our existing 'malware' checks are not being used due to noise and false positives, the mechanics available via the checks are perfect for token/credential scanning.

Describe the solution you'd like
We should be able to write a class of checks that:

  • scans every file in a distribution for a given pattern
  • calls an internal/external webhook with the result, in some specified format
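
A minimal sketch of the first half of such a check, assuming the distribution is a wheel (a zip archive) held in memory. The names `scan_wheel`, `build_report`, and `EXAMPLE_PATTERN` are hypothetical and are not part of Warehouse's existing check API; the token format is invented for illustration.

```python
import io
import re
import zipfile

# Hypothetical token format, for illustration only; real formats differ per provider.
EXAMPLE_PATTERN = re.compile(rb"example-token-[A-Za-z0-9]{16}")


def scan_wheel(archive_bytes, pattern):
    """Return (filename, matched bytes) for every match in every file of a zip archive."""
    findings = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            data = archive.read(info)
            for match in pattern.finditer(data):
                findings.append((info.filename, match.group(0)))
    return findings


def build_report(project, findings):
    """Shape the findings into a JSON-serialisable payload (schema is an assumption)."""
    return {
        "project": project,
        "findings": [
            {"file": name, "token": token.decode("ascii", "replace")}
            for name, token in findings
        ],
    }
```

The report produced by `build_report` is what would then be handed to an internal or external webhook.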

Once this is implemented, we should support scanning for specific tokens:

Additional context
Some other potential ecosystems that might want to integrate: https://docs.github.com/en/code-security/secret-scanning/secret-scanning-patterns

@di di added feature request malware-detection Issues related to automated malware detection. labels Oct 24, 2022
@di
Member Author

di commented Jan 6, 2023

@orf

orf commented Jan 10, 2023

I think there is limited value in rolling our own secret scanning solution for PyPI. We could of course scan for specific patterns we derive ourselves (like I did for AWS keys) and work out how to validate each one, but this is really re-implementing the GitHub secret scanning infrastructure, which already has a lot of partners: https://docs.github.com/en/code-security/secret-scanning/secret-scanning-patterns

I've been toying with a stupid-sounding idea that I can't seem to let go of. Why don't we extract all text files from each uploaded release and commit them to a unique orphaned branch in a GitHub repository? We could expire the branches after a period of time to keep the repo size down, and I expect the total size of all unique text content in PyPI, compressed, is significantly lower than the 12 TB total storage size.

This would trigger a scan against all supported providers, without us needing to really do much.
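
The extraction step of this idea could be sketched roughly as follows, assuming an sdist tarball held in memory. The "no NUL bytes and valid UTF-8" test for what counts as a text file is an assumption, and `looks_like_text` and `extract_text_files` are hypothetical names; committing the result to a branch is left out.

```python
import io
import tarfile


def looks_like_text(data):
    """Heuristic (an assumption): no NUL bytes and valid UTF-8 means text."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def extract_text_files(sdist_bytes):
    """Map member name -> decoded text for every text file in a tar archive."""
    texts = {}
    # "r:*" lets tarfile detect the compression (gzip, bz2, xz, or none).
    with tarfile.open(fileobj=io.BytesIO(sdist_bytes), mode="r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            data = archive.extractfile(member).read()
            if looks_like_text(data):
                texts[member.name] = data.decode("utf-8")
    return texts
```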

@di
Member Author

di commented Jan 10, 2023

An interesting idea, but IMO that sort of feels like an abuse of GitHub, or at least not a way in which GitHub is intended to be used, and might be confusing to our users and the secret scanning partners:

  • While PyPI's terms of use permit redistribution on PyPI and "any mirroring facility", I think our users would be a bit surprised to find that things they publish to PyPI also are getting published to GitHub (even if it's eventually removed).
  • I'm not sure if GitHub notifies the user when they find a secret, but I would imagine that they do and that we would in turn want to notify our users as well. Trying to generate this notification from GitHub's notification seems challenging, and it would be much easier for us to generate this notification if we're scanning for secrets 'in-house'.
  • When GitHub notifies on a found secret, they indicate the source to the secret provider. In this instance, the report they would be giving to the partner would indicate that the source is our 'mirror' GitHub repository, not PyPI itself.
  • While many of GitHub's partners have open & public means to integrate with their secret-reporting APIs, not all do (for example Google Cloud requires specific onboarding steps). While it would be nice to immediately support all the partners GitHub supports, I think their partners might be surprised to be getting reports from PyPI.
  • Finally, round-tripping to GitHub just to revoke our own API tokens seems unnecessarily circuitous. These revocations should happen as quickly as possible.

I want to reiterate that the scanning infrastructure already exists on PyPI; we just need to write some YARA patterns for the specific partners we want to support, and add functionality for hitting a webhook with the result to notify them.
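
As a rough sketch of the webhook half, assuming a partner accepts a JSON list of findings over HTTPS POST: the payload fields, the `example_api_token` type name, and the function names below are all hypothetical, and a real partner would define its own schema and authentication.

```python
import json
import urllib.request


def build_webhook_request(webhook_url, token, source_url):
    """Prepare (but do not send) a POST request reporting one leaked token."""
    # Payload shape is an assumption, not any partner's documented format.
    payload = [{"token": token, "type": "example_api_token", "url": source_url}]
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify(request, timeout=5):
    """Send the prepared request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping request construction separate from sending makes the payload easy to inspect and test without touching the network.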

@orf

orf commented Jan 10, 2023

These are all valid points; however, I'm assuming in all of this that we're talking about scanning more than just PyPI tokens. These can (and should) be done in-house where convenient.

My argument is that while it's simple to scan for and revoke PyPI tokens, the moment we want "reciprocal scanning of GitHub tokens" or "other platforms/ecosystems", things seem to get a lot more complex. And I'm not sure of the value of just scanning for PyPI tokens.

GitHub does seem to care about these kinds of issues, and my overall point is that working with them might be better than working alone.

That might come in different forms, from hacking an MVP together with git to seeing if there's a chance an external secret scanning API for PyPI and other registries might be put on their roadmap.

@orf

orf commented Jan 12, 2023

I've started work on committing the text content of every file uploaded to PyPI in November 2022, with the aim of seeing how many secrets GitHub would detect. It's being added in chunks here: https://github.com/orf/pypi-import

I think we (or maybe just me?) have massively underestimated the number and variety of credentials that have been published to PyPI. It has only completed about 15% of November but it's already found hundreds of keys from many different services, including 9 distinct PyPI API tokens so far.

Spot checking them reveals even more, including just straight up, valid, raw database credentials.

@orf

orf commented Sep 6, 2023

I finally got around to finishing the side project I mentioned above, and the results are quite scary.

I began mirroring uploads to GitHub (for example this repo contains code uploaded to PyPI between 2023-08-24 and 2023-08-29), with the intention to both allow easier analysis of the code and leverage GitHub's existing secrets scanning infrastructure.

In #13647 it was said by @ewdurbin:

I propose #13596 combined with some form of "what's new!" feed to build out a more generic interface for internal and external "third parties" to perform these kinds of scans and report them to PyPI.

This sounds like a great idea and I know gitguardian.com is interested in this. They support a wider range of credential types than GitHub and I've been working with them to scan all the historic code. Both the volume and types of credentials that are present are very scary, and a non-trivial portion of them are live and valid.

The stats page on the project website details the high-level breakdown of secrets detected by GitHub:

[Screenshot: breakdown of secrets detected by GitHub]

And this analysis, using the data from GitGuardian on only valid credentials (i.e. ones their service is able to detect as valid), shows that the number of individual projects with valid credentials published is growing over time:

[Screenshot: projects with valid published credentials over time]

However, the issue is both detection and remediation. It's OK to get an alert from PyPI saying there are some credentials in your release, but unless you see it and act on it, you're in for a bad time.

The part that (unfortunately) benefits from centralization here is notifying the providers as well. For example, when an AWS key is added to GitHub, AWS will be pinged and will:

  1. Apply a "quarantine policy" to the key to prevent obviously destructive/expensive operations
  2. Email the account holder and the user associated with the key

Other providers do similar things. This part is tough to crack unless GitHub themselves allow limited third parties to use this portion of their secrets scanning infrastructure without necessarily committing code to GitHub.

@orf

orf commented Sep 6, 2023

Here is a breakdown of all the live credential types found:

Credentials
  • AMQP Credentials
  • Dropbox Key
  • GitLab Token
  • Airtable API Key
  • Line Token
  • Facebook App Keys
  • Pushbullet API key
  • Pusher Channels Keys
  • Coveralls Repository Token
  • Clarifai Key
  • reCAPTCHA Key
  • New Relic API Key
  • Datadog API Credentials
  • Zoom API JWT Keys
  • FTP Credentials
  • Twilio Master Credentials
  • Dropbox App Credentials
  • Hunter API Key
  • Telegram Bot Token
  • Clearbit Key
  • Mapbox Token
  • Cloudflare API Credentials
  • Coveralls Personal Token
  • CircleCI Personal Token
  • IBM Cloud Key
  • Yelp API key
  • Alchemy API Key
  • Trello Keys
  • Shodan Key
  • PayPal OAuth2 Keys
  • DigitalOcean Token
  • Yousign API Key
  • Auth0 Keys
  • Google API Key
  • New Relic Key
  • IBM COS HMAC Credentials
  • Cloudinary API keys
  • Sentry Token
  • GitHub OAuth App Keys
  • AWS Keys
  • Algolia Keys
  • PubNub Publish and Subscription Keys
  • Mailgun Primary Key
  • Stripe Keys
  • MySQL Credentials
  • Redis Credentials
  • MongoDB Credentials
  • Coinbase Keys
  • PagerDuty Authorization Token
  • Firebase Cloud Messaging API Key
  • Pinecone API Key and environment
  • Twitter Access Keys
  • Etsy Developer Key
  • WeChat App Keys
  • Stream Keys
  • Discord Webhook URL
  • Notion Integration Token
  • OpenWeatherMap Token
  • Mailjet Keys
  • GitHub App Keys
  • AWS SES Keys
  • PostgreSQL Credentials
  • Spotify Keys
  • SSH Credentials
  • SparkPost Key
  • Riot Games API Key
  • Azure Active Directory API Keys
  • Discord Oauth2 Keys
  • Codacy API Credentials
  • Google Cloud Keys
  • Tencent Cloud Keys

Azure Active Directory API Keys are a bit scary, but the point is that there is a very large variety of key types and it's impractical to build something to detect them or provide a way to notify the providers themselves.
