Repurpose 'malware' checks / YARA scanning for secret scanning #12412

di · 2022-10-24T15:47:20Z

What's the problem this feature will solve?
While our existing 'malware' checks are not being used due to noise and false positives, the mechanics available via the checks are perfect for token/credential scanning.

Describe the solution you'd like
We should be able to write a class of checks that:

scans every file in a distribution for a given pattern
calls an internal/external webhook with the result, in some specified format
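
A minimal sketch of such a check, assuming the distribution is a `.tar.gz` sdist and the pattern is a regular expression rather than a YARA rule. The names `Finding`, `scan_sdist`, and `build_webhook_payload`, and the JSON shape of the payload, are hypothetical and not part of the existing checks:

```python
import io
import json
import re
import tarfile
from dataclasses import asdict, dataclass


@dataclass
class Finding:
    filename: str  # path of the member inside the archive
    line: int      # 1-based line number of the match
    match: str     # the matched text


def scan_sdist(archive_bytes: bytes, pattern: re.Pattern) -> list[Finding]:
    """Scan every regular file in a .tar.gz archive for `pattern`."""
    findings = []
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            handle = tar.extractfile(member)
            if handle is None:
                continue
            # Undecodable bytes are replaced so binary files do not abort the scan.
            text = handle.read().decode("utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                for m in pattern.finditer(line):
                    findings.append(Finding(member.name, lineno, m.group(0)))
    return findings


def build_webhook_payload(project: str, version: str, findings: list[Finding]) -> str:
    """Serialize findings into a JSON body (hypothetical format) for a webhook."""
    return json.dumps(
        {
            "project": project,
            "version": version,
            "findings": [asdict(f) for f in findings],
        }
    )
```

Sending the payload (for example with an HTTP POST) is left out here, since the result format and the receiving endpoint are still unspecified.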

Once this is implemented, we should support scanning for specific tokens:

pypi- and testpypi- prefixed tokens, which should automatically be revoked when found
'reciprocal' scanning for GitHub tokens: https://docs.github.com/en/code-security/secret-scanning/about-secret-scanning
other platforms/ecosystems?
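
As an illustration of the first item, a candidate pattern for these tokens could look like the following. Only the `pypi-` and `testpypi-` prefixes come from the list above; the body alphabet and the minimum length of 16 are assumptions for this sketch, not PyPI's real token grammar:

```python
import re

# Prefixes are taken from the issue text. The character class and the
# minimum body length are assumptions chosen to avoid trivial matches.
TOKEN_RE = re.compile(r"\b(?:testpypi|pypi)-[A-Za-z0-9_-]{16,}")


def find_candidate_tokens(text: str) -> list[str]:
    """Return every substring of `text` that looks like a prefixed token."""
    return TOKEN_RE.findall(text)
```

A match here is only a candidate: it would still need to be validated against the token store before any automatic revocation.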

Additional context
Some other potential ecosystems that might want to integrate: https://docs.github.com/en/code-security/secret-scanning/secret-scanning-patterns


di · 2023-01-06T22:22:10Z

Related: https://tomforb.es/i-scanned-every-package-on-pypi-and-found-57-live-aws-keys/

orf · 2023-01-10T14:46:39Z

I think there is limited value in rolling our own secret scanning solution for PyPI. We could of course scan for specific patterns we derive ourselves (like I did for AWS keys) and work out how to validate each one, but this is really re-implementing the GitHub secret scanning infrastructure, which already has a lot of partners: https://docs.github.com/en/code-security/secret-scanning/secret-scanning-patterns

I've been toying with a stupid-sounding idea that I can't seem to let go of. Why don't we extract all text files from each uploaded release and commit them to a unique orphaned branch in a GitHub repository? We could expire the branches after a period of time to keep the repo size down, and I expect the total size of all unique text content on PyPI, compressed, is significantly lower than the 12 TB total storage size.

This would trigger a scan against all supported providers, without us needing to really do much.
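
The "extract all text files" step could be sketched roughly as below. The heuristic (no NUL bytes, decodes as UTF-8) and the name `iter_text_files` are assumptions for illustration; committing the result to an orphaned branch is not shown:

```python
import io
import tarfile


def iter_text_files(archive_bytes: bytes):
    """Yield (name, text) for each archive member that looks like UTF-8 text."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            handle = tar.extractfile(member)
            if handle is None:
                continue
            raw = handle.read()
            # Treat anything containing a NUL byte as binary and skip it.
            if b"\x00" in raw:
                continue
            try:
                yield member.name, raw.decode("utf-8")
            except UnicodeDecodeError:
                # Not valid UTF-8, so not considered text in this sketch.
                continue
```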

di · 2023-01-10T16:20:36Z

An interesting idea, but IMO that sort of feels like an abuse of GitHub, or at least not a way in which GitHub is intended to be used, and might be confusing to our users and the secret scanning partners:

While PyPI's terms of use permit redistribution on PyPI and "any mirroring facility", I think our users would be a bit surprised to find that things they publish to PyPI also are getting published to GitHub (even if it's eventually removed).
I'm not sure if GitHub notifies the user when they find a secret, but I would imagine that they do and that we would in turn want to notify our users as well. Trying to generate this notification from GitHub's notification seems challenging, and it would be much easier for us to generate this notification if we're scanning for secrets 'in-house'.
When GitHub notifies on a found secret, they indicate the source to the secret provider. In this instance, the report they would be giving to the partner would indicate that the source is our 'mirror' GitHub repository, not PyPI itself.
While many of GitHub's partners have open & public means to integrate with their secret-reporting APIs, not all do (for example, Google Cloud requires specific onboarding steps). While it would be nice to immediately support all the partners GitHub supports, I think their partners might be surprised to be getting reports from PyPI.
Finally, round-tripping to GitHub just to revoke our own API tokens seems unnecessarily circuitous. These revocations should happen as quickly as possible.

I want to reiterate that the scanning infrastructure already exists on PyPI; we just need to write some YARA patterns for the specific partners we want to support, and add functionality for hitting a webhook with the result to notify them.

orf · 2023-01-10T16:47:40Z

These are all valid points; however, I'm assuming in all of this that we're talking about scanning more than just PyPI tokens. Those can (and should) be handled in-house where convenient.

My argument is that while it's simple to scan for and revoke PyPI tokens, the moment we want "reciprocal scanning of GitHub tokens" or "other platforms/ecosystems" things seem to get a lot more complex. And I'm not sure of the value of just scanning for PyPI tokens.

GitHub does seem to care about these kinds of issues, and my overall point is that working with them might be better than working alone.

That might come in different forms, from hacking an MVP together with git to seeing if there's a chance an external secret scanning API for PyPI and other registries might be put on their roadmap.

orf · 2023-01-12T09:46:13Z

I've started work on committing the text content of every file uploaded to PyPI in November 2022, with the aim of seeing how many secrets GitHub would detect. It's being added in chunks here: https://github.com/orf/pypi-import

I think we (or maybe just me?) have massively underestimated the number and variety of credentials that have been published to PyPI. It has only completed about 15% of November but it's already found hundreds of keys from many different services, including 9 distinct PyPI API tokens so far.

Spot checking them reveals even more, including just straight up, valid, raw database credentials.

orf · 2023-09-06T09:35:09Z

I finally got around to finishing the side project I mentioned above, and the results are quite scary.

I began mirroring uploads to GitHub (for example this repo contains code uploaded to PyPI between 2023-08-24 and 2023-08-29), with the intention to both allow easier analysis of the code and leverage GitHub's existing secrets scanning infrastructure.

In #13647 it was said by @ewdurbin:

I propose #13596 combined with some form of "what's new!" feed to build out a more generic interface for internal and external "third parties" to perform these kinds of scans and report them to PyPI.

This sounds like a great idea and I know gitguardian.com is interested in this. They support a wider range of credential types than GitHub and I've been working with them to scan all the historic code. Both the volume and types of credentials that are present are very scary, and a non-trivial portion of them are live and valid.

The stats page on the project website details the high-level breakdown of secrets detected by GitHub:

And this analysis using the data from GitGuardian on only valid credentials (i.e. ones their service is able to detect as valid) shows the number of individual projects with valid credentials published is growing over time:

However, the issue is both detection and remediation. It's OK to get an alert from PyPI saying there are some credentials in your release, but unless you see it and act on it, you're in for a bad time.

The part that (unfortunately) benefits from centralization here is notifying the providers as well. For example, when an AWS key is added to GitHub, AWS will be pinged and will:

Apply a "quarantine policy" to the key to prevent obviously destructive/expensive operations
Email the account holder and the user associated with the key

Other providers do similar things. This part is tough to crack unless GitHub themselves allow limited third parties to use this portion of their secrets scanning infrastructure without necessarily committing code to GitHub.

orf · 2023-09-06T09:38:05Z

Here is a breakdown of all the live credential types found:

Credentials

AMQP Credentials
Dropbox Key
GitLab Token
Airtable API Key
Line Token
Facebook App Keys
Pushbullet API key
Pusher Channels Keys
Coveralls Repository Token
Clarifai Key
reCAPTCHA Key
New Relic API Key
Datadog API Credentials
Zoom API JWT Keys
FTP Credentials
Twilio Master Credentials
Dropbox App Credentials
Hunter API Key
Telegram Bot Token
Clearbit Key
Mapbox Token
Cloudflare API Credentials
Coveralls Personal Token
CircleCI Personal Token
IBM Cloud Key
Yelp API key
Alchemy API Key
Trello Keys
Shodan Key
PayPal OAuth2 Keys
DigitalOcean Token
Yousign API Key
Auth0 Keys
Google API Key
New Relic Key
IBM COS HMAC Credentials
Cloudinary API keys
Sentry Token
GitHub OAuth App Keys
AWS Keys
Algolia Keys
PubNub Publish and Subscription Keys
Mailgun Primary Key
Stripe Keys
MySQL Credentials
Redis Credentials
MongoDB Credentials
Coinbase Keys
PagerDuty Authorization Token
Firebase Cloud Messaging API Key
Pinecone API Key and environment
Twitter Access Keys
Etsy Developer Key
WeChat App Keys
Stream Keys
Discord Webhook URL
Notion Integration Token
OpenWeatherMap Token
Mailjet Keys
GitHub App Keys
AWS SES Keys
PostgreSQL Credentials
Spotify Keys
SSH Credentials
SparkPost Key
Riot Games API Key
Azure Active Directory API Keys
Discord Oauth2 Keys
Codacy API Credentials
Google Cloud Keys
Tencent Cloud Keys

Azure Active Directory API Keys are a bit scary, but the point is that there is a very large variety of key types and it's impractical to build something to detect them or provide a way to notify the providers themselves.

di added the feature request and malware-detection labels Oct 24, 2022

di mentioned this issue May 22, 2023

Remove internal malware infrastructure/checks #13647

Merged

ewdurbin closed this as completed in #13647 May 23, 2023
