Ultranormalization encourages name squatting #11139


Open
orsinium opened this issue Apr 7, 2022 · 9 comments
Labels
bug 🐛 squatting Issues related to preventing any kinds of namesquatting, typosquatting, dependency confusion

Comments

@orsinium
Contributor

orsinium commented Apr 7, 2022

Describe the bug

#10498 introduced "ultranormalization" to prevent name squatting of package names similar to ones already registered:

requests.exceptions.HTTPError: 400 Client Error: The name 'l10n' is too similar to an existing project. See https://pypi.org/help/#project-name for more information. for url: https://upload.pypi.org/legacy/

While the initiative, in general, is something of major concern for PyPI (and any other big package registry), the implementation has a few painful drawbacks:

  1. It simplifies name squatting. For example, registering the name lili (a French name) also squats many similar names, such as 1111, i111, i11l (which could be a good name for an internationalization package), i-11-l, iiii (4 in Roman numerals), and so on. In total, that is a huge number of combinations; the exact number depends on the maximum package name length and on whether you count names such as l-------ll--------l.
  2. It complicates name registration. A few months ago, I started my work on an l10n+i18n package. I always start by picking a name. A quick search on PyPI showed that the name l10n is free. A few months later, I have the package ready but PyPI rejects my upload. How could I know that the name is "taken"? Should I register the name before I have any code ready? Then again, that encourages name squatting.
  3. The name rejection reason isn't clear. It just says that the name is similar to another one. Which one? If I knew, I could claim it as per PEP 541. But the problem can't be completely solved by just showing the name. Before the change was introduced, multiple packages were registered that wouldn't pass the check today. That means that if the name l10n is rejected by PyPI because there is a package i10n, claiming the i10n name would reveal that there is also a package lion, which the user would need to claim as well. How many times can one claim names to register a single package? And if PyPI showed all similar names, would it be reasonable to allow mass name claiming? Then again, that's not much different from mass squatting. And if I could claim just one name without claiming all of its collisions, wouldn't that defeat the point of the change altogether?
  4. It reduces the pool of available names. This one is similar to the previous points but worth covering. As the Python ecosystem grows, so does the list of registered names on PyPI, and so the pool of names available for registration shrinks. Fewer free names mean more frustration, harder name picking, and more awful names. You might know this frustration from trying to register a new account on, let's say, Reddit: all the usernames you have ever used are already taken, and in frustration you just start slamming the keyboard, trying to find any random combination that works. Now imagine that, instead of nice readable package names, you have such randomness in your dependency file, imports, and tracebacks. It's important to keep as many nice package names available as possible, and the change works against that goal. Fewer good names available, again, means more effective name squatting.
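
To make the size of the collision set in point 1 concrete, here is a minimal sketch. The normalization rule below is an assumption inferred only from the examples in this issue (separators dropped, `i`/`l`/`1` treated as one character, `o`/`0` as another); it is not copied from the PyPI code base, and the helper names are made up for illustration.

```python
import re
from itertools import product

# Assumed look-alike classes, inferred from the examples in this issue.
_CLASSES = {"1": "1il", "0": "0o"}


def ultranormalize(name: str) -> str:
    """Collapse separators and look-alike characters (illustrative rule only)."""
    name = re.sub(r"[._-]", "", name.lower())  # drop '.', '_' and '-'
    name = re.sub(r"[il]", "1", name)          # i, l and 1 become '1'
    return name.replace("o", "0")              # o and 0 become '0'


def separator_free_variants(name: str) -> set[str]:
    """All lowercase names without separators that collapse to the same key."""
    key = ultranormalize(name)
    options = [_CLASSES.get(ch, ch) for ch in key]
    return {"".join(combo) for combo in product(*options)}


print(ultranormalize("lili") == ultranormalize("i-11-l"))  # True
print(len(separator_free_variants("lili")))                # 81
```

Under this assumed rule, a four-character name made only of `i`, `l`, and `1` already blocks 3^4 = 81 spellings before any separators or uppercase letters are counted.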

Expected behavior

"What I see is what I get". If there is a package with this name, the name is already taken. You might claim it as per PEP 541 or pick another one. If there is no package with such name (and it's not in the stdlib), you can use it.

To Reproduce

Try to register l10n package. Or run test_fails_with_ultranormalized_names from the PyPI test suite.

My Platform

Irrelevant.

Additional context

Irrelevant.

Possible solutions

I understand the motivation behind the change but find that it brings more harm than good. So as not to be that person who only complains about things, here are some possible solutions to the problem I see:

  1. The obvious one: just revert the change. It would solve all the issues I outlined, but the name squatting problem would remain open for further discussion (wouldn't it always stay unsolved anyway?).
  2. Reduce the scope of the change by checking collisions only against popular names. It makes sense to prevent registering names such as djang0, but at the same time there is no harm in letting some not very popular or nearly abandoned packages collide. However, PyPI doesn't have a reliable metric of package popularity just yet. The download count is stored separately in BigQuery (and querying it for each name registration could be costly), and even then the metric is pretty unreliable. The GitHub star count is an even worse indicator of popularity and is not available for all packages.
  3. Allow registering colliding names but provide a warning in the Web UI that there are packages with a similar name.
  4. Write warnings about registering colliding names into an audit log. IDK if PyPI has any internal audits for this purpose but having one regularly might be a good idea if the registry security matters.
  5. Avoid implementing heuristics on the PyPI side and leave PyPI scanning to the bored (or paid) pentest companies.
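
Options 2 and 3 can be combined into one decision, roughly sketched below. Everything here is hypothetical: the function names, the `popular` set (the issue notes that PyPI has no reliable popularity metric), the sample project names, and the look-alike rule, which is inferred from the examples in this issue. The exact-match step uses the PEP 503 name normalization.

```python
import re


def pep503_normalize(name: str) -> str:
    """Name normalization as defined in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def ultranormalize(name: str) -> str:
    """Assumed look-alike rule: drop separators, i/l -> 1, o -> 0."""
    name = re.sub(r"[._-]", "", name.lower())
    name = re.sub(r"[il]", "1", name)
    return name.replace("o", "0")


def classify(candidate: str, existing: set[str], popular: set[str]) -> str:
    """Return 'taken', 'reject', 'warn' or 'ok' for a requested name."""
    if pep503_normalize(candidate) in {pep503_normalize(n) for n in existing}:
        return "taken"  # "what I see is what I get": the name itself exists
    key = ultranormalize(candidate)
    colliding = [n for n in existing if ultranormalize(n) == key]
    if any(n in popular for n in colliding):
        return "reject"  # looks like a popular project (option 2)
    if colliding:
        return "warn"    # allow, but flag in the UI or audit log (options 3, 4)
    return "ok"


# Made-up registry contents, for illustration only.
existing = {"django", "i10n"}
popular = {"django"}
print(classify("djang0", existing, popular))  # reject
print(classify("l10n", existing, popular))    # warn
```

The point of the split is that only the "reject" branch costs the uploader anything; the "warn" branch keeps the signal without taking the name out of circulation.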

Sorry for the wall of text. I don't want to fight against your vision of what the project should look like, but I find this particular change harmful for both security and user experience.

@orsinium
Contributor Author

orsinium commented Apr 7, 2022

Here I use "name squatting" to indicate two slightly different attacks:

  1. Squatting of names similar to existing projects. The end goal usually is to distribute malware.
  2. Squatting of nice-looking names. Usually, dictionary words or popular brands. The end goal usually is to sell the name later.

If there is a term to distinguish these two, it's not known to me. The difference, however, is quite small, and the goals pursued may mix. Both are somewhat of a concern for a package registry, and both should be approached carefully without sacrificing one for the other.

@di
Member

di commented Apr 7, 2022

Hey @orsinium, thanks for the issue. I think we're unlikely to reverse this policy: this may not be apparent to PyPI users but this has significantly cut down on the creation of malicious packages attempting to similar-squat legitimate project names. It's generally made PyPI safer to use but also means we (PyPI maintainers) can spend less time dealing with these types of packages.

I think there's a few things we can do to make this policy easier to deal with, though:

  • Add the ability to reserve a name #2082, so you can acquire a project name before you're ready to publish
  • Manage projects with namespaces #2589, so there's less contention for PyPI's single global namespace
  • Streamline the process of making a request for a prohibited or 'too similar' name, since these don't really need to happen in our public support issue tracker.
  • Make it more clear that a project name rejected for being "too similar" is available to request via PEP 541, and should pretty much always be approved (this is just a matter of updating https://pypi.org/help/#project-name)

@orsinium
Contributor Author

orsinium commented Apr 8, 2022

I like the last 2 points. Even if the presence of the feature isn't something that can be discussed, there are still ways to improve it:

  1. Show with which names exactly the requested name conflicts. I covered in the issue description why it might be a good idea.
  2. Set a threshold for the allowed text distance. For example, 1111 and l111 (distance 1) are similar but 1111 and lili (distance 4) are completely different names.
  3. Make the threshold adaptive based on the project name. For example, python-dateutll is too similar to python-dateutil, while ll and li are two completely different names. Both cases have the same distance of 1, but their lengths (and so the relative differences) are different.
  4. Do not apply the rule to some abandoned or unpopular projects. The original change is targeted against name squatting for the purpose of malware distribution and so makes sense only for packages that have names similar to one that people often use (and so may mistype).
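
Points 2 and 3 can be sketched as an edit distance divided by the length of the longer name. This is only an illustration of the idea: the function names and the `max_ratio` value of 0.1 are arbitrary assumptions, not a proposal for the actual cutoff.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def too_similar(a: str, b: str, max_ratio: float = 0.1) -> bool:
    """Adaptive threshold: distance relative to the longer name (assumed cutoff)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return levenshtein(a, b) / longest <= max_ratio


print(levenshtein("1111", "l111"))  # 1
print(levenshtein("1111", "lili"))  # 4
print(too_similar("python-dateutll", "python-dateutil"))  # True  (1/15)
print(too_similar("ll", "li"))                            # False (1/2)
```

With a relative cutoff, the same absolute distance of 1 flags the long name pair and lets the two-letter pair through, which is the behavior point 3 asks for.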

When I was working on dephell, I had an idea to warn users when they try to install a package whose name looks like a mistyped version of a more popular project (dephell/dephell#133). And to this day, I still think that allowing packages to have similar names but warning users about it could be a good idea, if only because it allows for an even more aggressive similarity search than the currently implemented ultranormalization.

@domdfcoding
Contributor

Having hit the error message myself I am at a loss as to which name my chosen name is too similar to, despite searching the list of all project names. I could spend all day playing "guess a valid name", but I'd rather not.
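
For anyone in the same position, a rough way to narrow down the conflicting name is to apply a look-alike normalization to a list of project names and compare keys. The sketch below assumes the rule suggested by the examples earlier in this issue (separators dropped, `i`/`l`/`1` merged, `o`/`0` merged); the real server-side check may differ, and `find_collisions` and the sample names are made up for illustration.

```python
import re


def ultranormalize(name: str) -> str:
    """Assumed look-alike rule: drop separators, i/l -> 1, o -> 0."""
    name = re.sub(r"[._-]", "", name.lower())
    name = re.sub(r"[il]", "1", name)
    return name.replace("o", "0")


def find_collisions(candidate: str, all_names: list[str]) -> list[str]:
    """Names from all_names that collapse to the same key as candidate."""
    key = ultranormalize(candidate)
    return sorted(
        n for n in all_names
        if n != candidate and ultranormalize(n) == key
    )


# all_names would come from a full listing of project names.
print(find_collisions("l10n", ["lion", "requests", "i10n", "l10n"]))
# ['i10n', 'lion']
```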

@jedie

jedie commented Aug 26, 2022

What about using the Levenshtein distance?

EDIT: Oh, there are a few issues about "Levenshtein distance": https://github.com/pypi/warehouse/issues?q=Levenshtein+distance ;)

@di
Member

di commented Aug 26, 2022

Yes, we tried that in #5001. Unfortunately, it was far too noisy to actually be useful.

@thatch

thatch commented Oct 27, 2022

Which one? If I knew, I could claim it as per PEP 541.

I agree. I tried to register "checkreqs" or so, and it was considered too similar to an unknown existing project.

I don't see anyone against giving the squatted project name, either here or in #11872, so I should probably just send a PR?

@SnoopJ

SnoopJ commented Jan 12, 2023

Which one? If I knew, I could claim it as per PEP 541.

I agree. I tried to register "checkreqs" or so, and it was considered too similar to an unknown existing project.

I don't see anyone against giving the squatted project name, either here or in #11872, so I should probably just send a PR?

My opinion carries no organizational weight, but I think it would be a nice improvement if PyPI could issue a more specific error message than the current one, and a PR would give the maintainers something very concrete to decide on, so +1 from me. This may be easier to track if the other issue is re-opened or if a new issue with a suitably narrow scope is opened, since this issue has other things going on.

(For the sake of context: I ended up on this issue after helping a user in #python on Libera.chat navigate the existing error message, which left them perplexed about what they collided with and what to do about it)

8 participants