Skip to content

Parametrized values containing non-ascii bytes break cache #1031

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Sep 23, 2015

Conversation

nicoddemus
Copy link
Member

Just added the test to see what was going on. This is the value of the lastfailed attribute when it fails:

{'test_cache_param.py::test_fail[\xac\x10\x02G]': True}  

I thought nodeids would always be a "well formed" string, was I wrong or perhaps this is a bug? @RonnyPfannschmidt or @hpk42?

Committing directly to pytest repository to get feedback from others on how to proceed (and contribute directly to this branch).

Related to #1030

@The-Compiler
Copy link
Member

By default, string IDs just get passed through:

# _pytest/python.py
def _idval(val, argname, idx, idfn):
    # ...
    if isinstance(val, (float, int, str, bool, NoneType)):
        return str(val)

It seems Python 2's json can't handle that, while Python 3 can:

$ python3 -c 'import json; print(json.dumps({"": "\xac\x10\x02G"}))'
{"": "\u00ac\u0010\u0002G"}

$ python2 -c 'import json; print(json.dumps({"": "\xac\x10\x02G"}))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python2.7/json/__init__.py", line 243, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xac in position 0: invalid start byte

I think the issue is that with Python 2, normal strings are bytestrings, while in Python 3, they're unicode strings.

I haven't looked at the cache code yet, but I think one solution would be to encode the data as utf-8 before passing it to json - the question is what should be done with unencodable data.

Another solution would be to sanitize invalid UTF-8 when generating the test IDs already - but I don't know enough about encodings in Python 2 to do that.

I think that'd be preferrable as it caused trouble elsewhere already.

@RonnyPfannschmidt
Copy link
Member

we should simplyencode python2 strings that dont fit into ascii as string-escape,
and to be consistent encode python3 byre strings the same way

@RonnyPfannschmidt
Copy link
Member

while we are at it we should have a unittest that ensures all kinds of test id string are encodable on all supported python versions,

@nicoddemus nicoddemus force-pushed the unmarshable-parametrize branch from c9b0370 to 716fa97 Compare September 23, 2015 02:21
@nicoddemus
Copy link
Member Author

Thanks @RonnyPfannschmidt for the suggestion! 😄

Now bytes are escaped if they are not directly convertible to ascii.

Python 2: I also took the opportunity to apply the same logic for unicode strings, they are returned as plain bytes/str if they are ascii. This is specially useful for python 2 shops which always use from __future__ import unicode_literals. Also I think this is acceptable solution for #656.

Remove two xfail marks related to #714 which have been passing since #908, I presume.

@nicoddemus nicoddemus changed the title Write failing test for parametrized tests with unmarshable parameters Parametrized values containing non-ascii bytes break cache Sep 23, 2015
@RonnyPfannschmidt
Copy link
Member

well done

RonnyPfannschmidt added a commit that referenced this pull request Sep 23, 2015
Parametrized values containing non-ascii bytes break cache
@RonnyPfannschmidt RonnyPfannschmidt merged commit a3fdcd9 into master Sep 23, 2015
@nicoddemus nicoddemus deleted the unmarshable-parametrize branch September 23, 2015 07:56
is a full ascii string, otherwise escape it into its binary form.
"""
try:
return val.encode('ascii')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicoddemus encode/decode mismatch

also the kind of code that begs for unit-tests to run across all code paths

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, thanks for tracking this down.

Some unit-tests were written for this (take a look at test_idmaker_native_strings), you think adding more values in there that demonstrate this problem is enough, or some other tests should be written?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

directly for _escape_bytes, the idea is to properly run trough all code-paths of that function
so we don't have to exercise the code paths via the idmaker

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I reached the same conclusion after I posted my comment and thought about it a little more. I will start working on this now. 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants