os.path should use AnyStr, not unicode #50

Closed
matthiaskramm opened this issue Jan 21, 2016 · 10 comments · Fixed by #1150
Comments

@matthiaskramm
Contributor

os.path currently defines everything as using unicode. But most functions accept both str and unicode.
E.g. os.path.isdir uses os.stat() underneath, which does PyArg_ParseTuple(args, "et"). Hence it supports str and unicode. (And character buffer objects, for that matter)

os.path should use AnyStr.

@JukkaL
Contributor

JukkaL commented Jan 21, 2016

The Python 3 stub uses AnyStr:

def commonprefix(list: List[AnyStr]) -> Any: ...
def dirname(path: AnyStr) -> AnyStr: ...
def exists(path: AnyStr) -> bool: ...
def lexists(path: AnyStr) -> bool: ...
def expanduser(path: AnyStr) -> AnyStr: ...
def expandvars(path: AnyStr) -> AnyStr: ...

The Python 2 stub uses unicode with the understanding that str is a subtype of unicode, so AnyStr is often not necessary. See discussion at python/mypy#1135 for more context. This is how mypy works currently, but it's not necessarily the best way to do it. We should agree on a general approach used by all PEP 484 compliant tools (that support Python 2) and then update stubs as needed.

@matthiaskramm
Contributor Author

I don't have a problem with mypy injecting the artificial subtype relationship str <: unicode, but I do feel that typeshed should model the real world.

In mypy, is there any practical difference between AnyStr (a.k.a. Union[str, unicode]) and unicode (a.k.a. unicode and all its subclasses, including str) in Python 2 mode, outside of invariant type parameters? I.e., will changing unicode to AnyStr break anything, on your side?

@JukkaL
Contributor

JukkaL commented Jan 21, 2016

If/when we add some language about Python 2 to PEP 484, we may consider making the str <: unicode relationship official. The special subtype relationship bytearray <: bytes is already specified by PEP 484 and typeshed stubs (mostly) reflect that.

AnyStr is quite different from Union[str, unicode], as it's a type variable, i.e. different instances of AnyStr in a scope can't vary independently. So a signature like (AnyStr, AnyStr) -> AnyStr is similar to overloaded signatures (str, str) -> str and (unicode, unicode) -> unicode (but no (str, unicode) -> unicode), but with unions there would just be a single signature and the return type would always be Union[str, unicode].
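
As an illustration of that difference, here is a minimal Python 3 sketch. It uses str and bytes (the constraints of typing.AnyStr in Python 3) in place of the Python 2 pair str and unicode, and the function names join_same and join_union are hypothetical. The comments describe what a PEP 484 checker would be expected to infer, not anything the code verifies at runtime.

```python
from typing import AnyStr, Union


def join_same(a: AnyStr, b: AnyStr) -> AnyStr:
    # AnyStr is a type variable: within one call, both arguments and the
    # result are bound to the same type (all str or all bytes).
    return a + b


def join_union(a: Union[str, bytes], b: Union[str, bytes]) -> Union[str, bytes]:
    # Each Union varies independently, so a checker must allow (str, bytes)
    # here and can only infer Union[str, bytes] for the result.
    if isinstance(a, str) and isinstance(b, str):
        return a + b
    if isinstance(a, bytes) and isinstance(b, bytes):
        return a + b
    raise TypeError("cannot mix str and bytes")


text = join_same("foo", "bar")      # expected inferred type: str
data = join_same(b"foo", b"bar")    # expected inferred type: bytes
either = join_union("foo", "bar")   # expected inferred type: Union[str, bytes]
```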

Using AnyStr without the special subtyping relationship would not be correct for all functions in os.path. Consider this:

def relpath(path: AnyStr, start: AnyStr = ...) -> AnyStr: ...

relpath('foo', u'foo') should be valid but without the subtyping relationship it would be an error, since there is no way to map the signature to (str, unicode) -> ... by substitution.

@JukkaL
Contributor

JukkaL commented Jan 21, 2016

I created a mypy issue for discussing how mypy should deal with this: python/mypy#1141. We could also move this to typehinting or python-ideas.

@matthiaskramm
Contributor Author

relpath is an interesting example, since it always returns str, even if both arguments are unicode. I wouldn't mind modelling that as Union[str, unicode] for that specific signature. (We can add a type alias to the top of the module, for readability)

Remembering our email discussion about simplifying types, it seems to me that replacing unicode with Union[str, unicode] should always be safe in the mypy context?

I'm still making up my mind about the str <: unicode hack. It first sounded to me like it might hide legitimate programming errors.
However, I just went looking through the standard library for half an hour, trying to find a good example for a function that would only accept unicode. The best I could come up with is

array.array('u', u'x')[0] = 'y'

This is obviously rather contrived.

So the only remaining thing I wonder about is whether some user-space code would want to limit types to unicode as a preparatory measure for Python 2->3 conversion. If so, it wouldn't have a way to do that with the implicit subclassing in place.

@JukkaL
Contributor

JukkaL commented Jan 21, 2016

relpath may return a unicode object:

>>> relpath(u'/foo', '/')
u'foo'

Also, it's important that the return type is str if the argument types are all str, since unicode should not be a subtype of str. That's why AnyStr is defined in a somewhat peculiar way. The signature (Union[str, unicode], Union[str, unicode]) -> Union[str, unicode] for relpath is not precise, because the return type is too general.
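
To make the imprecision concrete, a minimal Python 3 sketch follows, again with str and bytes standing in for Python 2 str and unicode. The helpers strip_slash_union and strip_slash_anystr are hypothetical, not os.path functions, and the checker behaviour described in the comments is what a tool such as mypy would be expected to report.

```python
from typing import AnyStr, Union


def strip_slash_union(path: Union[str, bytes]) -> Union[str, bytes]:
    # Too general: the declared result forgets which type went in.
    if isinstance(path, str):
        return path.rstrip("/")
    return path.rstrip(b"/")


def strip_slash_anystr(path: AnyStr) -> AnyStr:
    # Precise: a str argument yields str, a bytes argument yields bytes.
    if isinstance(path, str):
        return path.rstrip("/")
    return path.rstrip(b"/")


precise = strip_slash_anystr("usr/")
joined = precise + "/bin"  # fine: the result is known to be str

vague = strip_slash_union("usr/")
# `vague + "/bin"` would be expected to fail type checking, because the
# declared type still allows bytes; the caller has to narrow it first.
if isinstance(vague, str):
    joined_again = vague + "/bin"
```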

Limiting str/unicode compatibility is a reasonable use case, but a likely better way to do it is to run a type checker in Python 3 mode. It will also catch other Python 2/3 errors.

@gvanrossum
Member

> relpath is an interesting example, since it always returns str, even if
> both arguments are unicode

I think you missed something; it seems to follow the first arg.

>>> os.path.relpath(u'/etc', '/')
u'etc'

But reading the code that's hard to figure out (it calls commonprefix() on
the args, and also join()).

But in the case of most things in os and os.path, I have a feeling we
should try to decide what we want to support rather than trying to figure
out what the code actually supports -- because in Python 3 most of these
only accept uniform argument types. So even if relpath() accepts a str and
a unicode in PY2, maybe we should frown upon that, and use AnyStr in the
type regardless. It might be more annoying, but it will also catch
potential bugs in PY2 (e.g. "str in unicode" succeeds often but fails if
str has non-ASCII bytes), and it will encourage code that's more easily
ported to PY3.
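
The Python 3 behaviour referred to above can be checked with a short sketch. It assumes Python 3, where the text/binary pair is str and bytes rather than str and unicode, and it calls posixpath directly (the POSIX implementation behind os.path) so that the results do not depend on the platform. The variable names are illustrative only.

```python
import posixpath

# Uniform argument types work, and the result follows the argument type.
as_text = posixpath.relpath("/etc", "/")
as_bytes = posixpath.relpath(b"/etc", b"/")

# Mixing str and bytes is rejected at runtime in Python 3.
try:
    posixpath.relpath("/etc", b"/")
except TypeError:
    mixed_rejected = True
else:
    mixed_rejected = False
```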

@gvanrossum
Member

I just read Jukka's reply (even though it came before mine).

I would like to have a clear proposal. My proposal is to use AnyStr everywhere for os and os.path, and just explain when users complain that their code is suspect at best. (If using a type variable in one position is a problem, we can use basestring for those.)

Can we vote on this?

@JukkaL
Contributor

JukkaL commented Jan 21, 2016

Note that AnyStr combined with the str <: unicode subtyping rule accepts mixed str / unicode arguments -- the value of AnyStr will be unicode if any argument is unicode.

Anyway, I believe that experimentation with production code is the right way to understand this issue better. The problem is that for each experiment we may have to update all relevant stubs :-/

@JukkaL
Contributor

JukkaL commented Jan 21, 2016

Regarding Guido's previous comment, using AnyStr consistently is okay in os.path, though this answer doesn't fully address the underlying problem. See python/mypy#1141 for an exploration of a subset of the design space.

euresti pushed a commit to euresti/typeshed that referenced this issue Apr 7, 2017
To be renamed into stdlib/2and3/os/path.pyi later.

Also fixes python#50
JelleZijlstra pushed a commit that referenced this issue Apr 10, 2017
* Merge stdlib/{2,3}/os/path.pyi

To be renamed into stdlib/2and3/os/path.pyi later.

Also fixes #50

* CR fixes