Ultranormalization encourages name squatting #11139

orsinium · 2022-04-07T12:48:51Z

Describe the bug

#10498 introduced "ultranormalization" to prevent name squatting of package names similar to ones already registered:

requests.exceptions.HTTPError: 400 Client Error: The name 'l10n' is too similar to an existing project. See https://pypi.org/help/#project-name for more information. for url: https://upload.pypi.org/legacy/

While the problem this initiative addresses is, in general, of major concern for PyPI (and any other big package registry), the implementation has a few painful drawbacks:

It simplifies name squatting. For example, registering the name lili (a French name) additionally squats many other similar names, such as 1111, i111, i11l (which could be a good name for an internationalization package), i-11-l, iiii (4 in Roman numerals), and so on. In total, that is a huge number of combinations; the exact number depends on the maximum package name length and on whether you count names such as l-------ll--------l.
It complicates name registration. A few months ago, I started work on an l10n+i18n package. I always start by picking a name. A quick search on PyPI showed that the name l10n was free. A few months later, I have the package ready but PyPI rejects my upload. How could I have known that the name was "taken"? Should I register the name before I have any code ready? Then again, that encourages name squatting.
The name rejection reason isn't clear. It just says that the name is similar to another one. Which one? If I knew, I could claim it as per PEP 541. But the problem can't be completely solved by just showing the name. Before the change was introduced, multiple packages were registered that wouldn't pass the check today. That means that if the name l10n is rejected by PyPI because there is a package i10n, claiming the i10n name would reveal that there is also a package lion, which the user would need to claim as well. How many times can one claim names to register a single package? And if PyPI showed all similar names, would it be reasonable to allow mass name claiming? Then again, that's not much different from mass squatting. And if I could claim a name by itself without claiming all its collisions, wouldn't that defeat the point of the change altogether?
It reduces the scope of available names. This one is similar to the previous points but worth covering. As the Python ecosystem grows, so grows the list of registered names on PyPI, and so shrinks the scope of names available for registration. Fewer free names mean more frustration, harder name picking, and more awful names. You might know this frustration from trying to register a new account on, let's say, Reddit, when all the usernames you have ever used are already taken and you just start slamming the keyboard trying to find any random combination that works. Now imagine that instead of nice readable package names you have such randomness in your dependency file, imports, and tracebacks. It's important to keep as many nice package names available as possible, and the change works against that goal. Fewer good names available again means more effective name squatting.
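
To make the first drawback concrete, here is a minimal sketch of the kind of collapsing the examples above imply. The rules are an assumption inferred from this issue (separators `-`, `_`, `.` are dropped, `l`/`i`/`1` are treated as one character, and `o`/`0` as another); the function name `ultranormalize` is hypothetical and this is not presented as PyPI's actual implementation.

```python
import itertools
import re

# Assumed rules, inferred from the examples in this issue.
_SEPARATORS = re.compile(r"[-_.]")
_LOOKALIKES = str.maketrans({"l": "1", "i": "1", "o": "0"})


def ultranormalize(name: str) -> str:
    """Drop separators, lowercase, and merge look-alike characters."""
    return _SEPARATORS.sub("", name).lower().translate(_LOOKALIKES)


# Every name listed in the first drawback collapses to the same key.
names = ["lili", "1111", "i111", "i11l", "i-11-l", "iiii"]
keys = {ultranormalize(n) for n in names}

# Count the separator-free four-character names that "lili" would block.
variants = ["".join(chars) for chars in itertools.product("li1", repeat=4)]
blocked = [v for v in variants if ultranormalize(v) == ultranormalize("lili")]

print(keys, len(blocked))
```

Under these assumed rules all six names share one key, and a single four-character registration blocks 3**4 = 81 spellings before separators are even considered. The same rules also put l10n, i10n, and lion in one bucket.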

Expected behavior

"What I see is what I get". If there is a package with this name, the name is already taken. You might claim it as per PEP 541 or pick another one. If there is no package with such name (and it's not in the stdlib), you can use it.

To Reproduce

Try to register the l10n package. Or run test_fails_with_ultranormalized_names from the PyPI test suite.

My Platform

Irrelevant.

Additional context

Irrelevant.

Possible solutions

I understand the motivation behind the change but find that it does more harm than good. So as not to be that person who only complains about things, here are some solutions I see for the problem:

The obvious one: just revert the change. It would solve all the issues I outlined, but the name squatting problem would remain open for further discussion (wouldn't it always stay unsolved anyway?).
Reduce the scope of the change by checking collisions with popular names only. It makes sense to prevent registering names such as djang0, but at the same time there is no harm in letting some not very popular or nearly abandoned packages collide. However, PyPI doesn't have a reliable metric of package popularity just yet. The download count is stored separately in BigQuery (and querying it for each name registration could be costly), and even then the metric is pretty unreliable. The GitHub star count is an even worse indication of popularity and is not available for all packages.
Allow registering colliding names but provide a warning in the Web UI that there are packages with a similar name.
Write warnings about registering colliding names into an audit log. IDK if PyPI has any internal audits for this purpose but having one regularly might be a good idea if the registry security matters.
Avoid implementing heuristics on the PyPI side and leave PyPI scanning to the bored (or paid) pentest companies.

Sorry for a lot of text. I don't want to fight against your vision of what the project should look like, but I find this particular change harmful for both security and user experience.

orsinium · 2022-04-07T12:56:06Z

Here I use "name squatting" to indicate two slightly different attacks:

Squatting of names similar to existing projects. The end goal usually is to distribute malware.
Squatting of nice-looking names. Usually, dictionary words or popular brands. The end goal usually is to sell the name later.

If there is a term to distinguish these two, it's not known to me. The difference, however, is quite small, and the goals pursued may mix. Both are somewhat of a concern for a package registry and both should be approached carefully without sacrificing one for the other.

di · 2022-04-07T13:43:29Z

Hey @orsinium, thanks for the issue. I think we're unlikely to reverse this policy: this may not be apparent to PyPI users but this has significantly cut down on the creation of malicious packages attempting to similar-squat legitimate project names. It's generally made PyPI safer to use but also means we (PyPI maintainers) can spend less time dealing with these types of packages.

I think there are a few things we can do to make this policy easier to deal with, though:

Add the ability to reserve a name #2082, so you can acquire a project name before you're ready to publish
Manage projects with namespaces #2589, so there's less contention for PyPI's single global namespace
Streamline the process of making a request for a prohibited or 'too similar' name, since these don't really need to happen in our public support issue tracker.
Make it more clear that a project name rejected for being "too similar" is available to request via PEP 541, and should pretty much always be approved (this is just a matter of updating https://pypi.org/help/#project-name)

orsinium · 2022-04-08T07:15:50Z

I like the last 2 points. Even if the presence of the feature isn't something that can be discussed, there are still ways to improve it:

Show exactly which names the requested name conflicts with. I covered in the issue description why it might be a good idea.
Set a threshold for the allowed text distance. For example, 1111 and l111 (distance 1) are similar but 1111 and lili (distance 4) are completely different names.
Make the threshold adaptive based on the project name. For example, python-dateutll is too similar to python-dateutil while ll and li are two completely different names. Both cases have the same distance of 1, but their lengths (and so the relative difference) are different.
Do not apply the rule to some abandoned or unpopular projects. The original change targets name squatting for the purpose of malware distribution, and so makes sense only for packages whose names are similar to ones that people often use (and so may mistype).
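
The distance and adaptive-threshold suggestions above can be sketched as follows. This is only an illustration: the function `too_similar` and the `max_ratio` cutoff of 0.2 are hypothetical values picked for the example, not something PyPI implements or that has been tuned against real package names.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic dynamic-programming rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute
            ))
        previous = current
    return previous[-1]


def too_similar(candidate: str, existing: str, max_ratio: float = 0.2) -> bool:
    """Hypothetical adaptive rule: distance relative to the longer name."""
    longest = max(len(candidate), len(existing), 1)
    return levenshtein(candidate, existing) / longest <= max_ratio


print(levenshtein("1111", "l111"), levenshtein("1111", "lili"))
print(too_similar("python-dateutll", "python-dateutil"))
print(too_similar("ll", "li"))
```

With the relative cutoff, python-dateutll is flagged against python-dateutil (1 edit over 15 characters) while ll and li are not (1 edit over 2 characters), even though the absolute distance is the same in both cases.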

When I was working on dephell, I had an idea to warn users if they try to install a package whose name looks like a mistyped version of a more popular project (dephell/dephell#133). To this day, I still think that allowing packages to have similar names but warning users about it could be a good idea, if only because it allows for an even more aggressive similarity search than the currently implemented ultranormalization.

domdfcoding · 2022-05-29T22:07:44Z

Having hit the error message myself, I am at a loss as to which name my chosen name is too similar to, despite searching the list of all project names. I could spend all day playing "guess a valid name", but I'd rather not.
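
For anyone attempting the same search, a rough sketch of how it could be automated is below. It assumes the collapsing rules implied by the examples in the issue description (separators dropped, `l`/`i`/`1` merged, `o`/`0` merged) and assumes the colliding project actually appears in the list being searched; the helper names and the sample list are hypothetical.

```python
import re
from collections import defaultdict

# Assumed rules, inferred from the examples in the issue description.
_SEPARATORS = re.compile(r"[-_.]")
_LOOKALIKES = str.maketrans({"l": "1", "i": "1", "o": "0"})


def ultranormalize(name: str) -> str:
    """Drop separators, lowercase, and merge look-alike characters."""
    return _SEPARATORS.sub("", name).lower().translate(_LOOKALIKES)


def build_index(existing_names):
    """Group known project names by their collapsed form."""
    index = defaultdict(list)
    for name in existing_names:
        index[ultranormalize(name)].append(name)
    return index


def colliding_names(candidate, index):
    """Return known names that collapse to the same key as the candidate."""
    return list(index.get(ultranormalize(candidate), []))


# Hypothetical sample; a real search would load the full project list.
index = build_index(["i10n", "requests", "lion"])
print(colliding_names("l10n", index))
```

If a lookup like this comes back empty for a rejected name, the collision is with something outside the searched list or the assumed rules are incomplete, which is exactly the situation a more specific error message would resolve.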

jedie · 2022-08-26T05:56:50Z

What about using the Levenshtein distance?

EDIT: Oh, there are a few issues about "Levenshtein distance": https://github.com/pypi/warehouse/issues?q=Levenshtein+distance ;)

di · 2022-08-26T12:33:12Z

Yes, we tried that in #5001; unfortunately, it was far too noisy to be actually useful.

thatch · 2022-10-27T19:14:49Z

Which one? If I knew, I could claim it as per PEP 541.

I agree. I tried to register "checkreqs" or so, and it was considered too similar to an unknown existing project.

I don't see anyone against giving the squatted project name, either here or in #11872, so I should probably just send a PR?

SnoopJ · 2023-01-12T02:33:45Z

Which one? If I knew, I could claim it as per PEP 541.

I agree. I tried to register "checkreqs" or so, and it was considered too similar to an unknown existing project.

I don't see anyone against giving the squatted project name, either here or in #11872, so I should probably just send a PR?

My opinion carries no organizational weight, but I think it would be a nice improvement if PyPI could issue a more specific error message than the current one, and a PR would give the maintainers a very actionable decision to make, so +1 from me. This may be easier to track if the other issue is re-opened or if a new issue with a suitably narrow scope is opened, since this issue has other things going on.

(For the sake of context: I ended up on this issue after helping a user in #python on Libera.chat navigate the existing error message, which left them perplexed about what they collided with and what to do about it)

orsinium added the bug 🐛 label Apr 7, 2022

orsinium mentioned this issue Apr 8, 2022

PEP 541 Request: l10n pypi/support#1822

Closed

1 task

domdfcoding mentioned this issue Jul 19, 2022

PyPI does not always give a specific reason for why a name is not allowed #11872

Closed

martindemello mentioned this issue Sep 12, 2023

Add a check for similar names #14535

Open

Mark-Lowell mentioned this issue Mar 23, 2024

PEP 541 Request: torchcast pypi/support#3782

Closed

1 task

twm mentioned this issue Jul 7, 2024

Trusted publishing: pending publisher should warn about ultranormalized name collision #16226

Closed

miketheman added the squatting Issues related to preventing any kinds of namesquatting, typosquatting, dependency confusion label Feb 13, 2025

ehmatthes mentioned this issue Mar 25, 2025

PEP 541 Request: py-bugger pypi/support#5976

Closed

1 task

MakarovDi mentioned this issue Apr 18, 2025

Make a pip-installable package MakarovDi/uplt#1

Closed
