unicode encoding error callbacks #34615
This patch adds unicode error handling callbacks to the …
For example replacing unencodable characters with XML …
u"aäoöuüß".encode( …
Thanks for the patch -- it looks very impressive! I'll give it a try later this week. Some first cosmetic tidbits:
One thing which I don't like about your API change is that …
Please separate the errors.c patch from this patch -- it …
Thanks.
OK, done!
Fixed!
Fixed!
encode one-to-one, it implements both ASCII and latin-1
Which ones? I introduced a new function for every old one,
I look through the code and found no situation where the
PyCodec_RaiseEncodeErrors uses this … the … have a \Uxxxx with …
I'll upload a revised patch as soon as it's done.
Another problem is that the callback requires a Python …
About the Py_UNICODE *data, int size APIs: …
In general, I think we ought to keep the callback feature as …
BTW, could you summarize how the callback works in a few …
About _Encode121: I'd name this _EncodeUCS1 since that's …
About the new functions: I was referring to the new static …
How the callbacks work:
A PyObject * named errors is passed in. This may be NULL, …
The implementation of the loop through the string is done in …
(I hope that's enough explanation of the API and implementation.)
I have renamed the static ...121 function to all lowercase …
BTW, I guess PyUnicode_EncodeUnicodeEscape could be …
PyCodec_RaiseEncodeErrors, PyCodec_IgnoreEncodeErrors, …
I have not touched PyUnicode_TranslateCharmap yet, …
A remaining problem is how to implement decoding error …
One additional note: It is vital that errors is an …
Consider the XML example: For writing an XML DOM tree one …
BTW, should we continue the discussion in the i18n SIG …
Nice.
Very elegant solution !
Could you add these docs to the Misc/unicode.txt file? I …
Ok.
Hmm, wouldn't that result in a slowdown? If so, I'd rather …
I think that codecs.c is the right place for these APIs.
One thing I noted about the callbacks: they assume that they …
I think it would be worthwhile to rename the callbacks to …
I'd suggest adding another set of PyCodec_UnicodeDecode...()
It is already !
Sure.
I'd rather keep the discussions on this patch here -- …
But the special casing of U+FFFD makes the interface …

def FFFDreplace(enc, uni, pos):
    if uni[pos] == u"\ufffd":
        return u"?"
    else:
        raise UnicodeError(...)
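For comparison, a runnable version of the FFFDreplace idea under the handler API that Python eventually adopted (PEP 293): the handler receives the exception object and returns a (replacement, resume position) tuple. The handler name "fffdreplace" is made up for this sketch.

```python
import codecs

def fffdreplace(exc):
    # Replace a run consisting only of U+FFFD with "?" marks;
    # re-raise for any other unencodable character.
    if isinstance(exc, UnicodeEncodeError) and \
            set(exc.object[exc.start:exc.end]) == {u"\ufffd"}:
        return (u"?" * (exc.end - exc.start), exc.end)
    raise exc

codecs.register_error("fffdreplace", fffdreplace)
print(u"a\ufffdb".encode("ascii", "fffdreplace"))  # b'a?b'
```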
I'll put it as a comment in the source.
I could, but first we should work out how the decoding …
It would be a slowdown. But callbacks open many …
For example: Why can't I print u"gürk"? is probably one of the most frequently asked questions in …
OK, done (and PyCodec_XMLCharRefReplaceUnicodeEncodeErrors …
OK, do we want TranslateCharmap to work exactly like …
BTW, when I implement it I can implement patch bpo-403100 …
Should the old TranslateCharmap map to the new …
Sounds good. Now what is the decoding callback supposed to …
Maybe the same should be added to the encoding callbacks too?
I know, but IMHO it should be documented that an assignable …
Misc/unicode.txt is not clear on that:
Guido van Rossum wrote in python-dev:
I noticed that too. Asserting that errors=='strict' would …
On your comment about the non-Unicode codecs: let's keep …
Don't have much time today. I'll comment on the other things …
Sorry to keep you waiting, Walter. I will look into this …
Ok, here we go...
True.
Go for it.
Ok. BTW, Barry Warsaw already did the work of converting the …
True, but in this case I believe that we should stick with …
There already is a print callback in Python (forgot the name of the …
It's better to take the second approach (copy the callback …
I suppose this will also simplify the implementation somewhat.
I've seen it; will comment on it later.
If possible, please also add the multichar replacement …
[Decoding error callbacks]
I like the idea of having an optional state object (basically …
About the return value: I'd suggest to always use the same tuple interface, e.g. …
(I think it's better to use absolute values for the position …
Perhaps the encoding callbacks should use the same …
Good point. I'll add that to PEP 100.
OK, done!
OK. I guess it would be best to do this when everything …
OK, done, now there's a …
True: sys.displayhook
OK! I will try to find the time to implement that in the …
state) ->
This would make the callback feature hypergeneric and a …
I implemented this and changed the encoders to only …
Do we want to enforce new_input_position > input_position, …
OK. Here is the current todo list:
I'm thinking about a different strategy for implementing …
We could have an error handler registry, which maps names …
But with an error handler registry this function would …

def xmlreplace(encoding, uni, pos, state):
    return (u"&#%d;" % ord(uni[pos]), pos+1)

import codecs
codecs.registerError("xmlreplace", xmlreplace)

and then the following call can be made: …
But for special one-shot error handlers, it might still be …
Great !
Good.
That's the point. Note that I don't think the tuple creation …
No; moving backwards should be allowed (this may be useful …
Good idea!
One minor nit: codecs.registerError() should be named …
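The registry discussed here did land in Python under the suggested naming, as codecs.register_error() and codecs.lookup_error(), and the XML character reference handler itself shipped built-in under the name "xmlcharrefreplace". A quick check with today's API:

```python
import codecs

# Registered handlers are retrieved by name via codecs.lookup_error;
# "xmlcharrefreplace" is the built-in descendant of the xmlreplace
# handler sketched in this thread.
handler = codecs.lookup_error("xmlcharrefreplace")
print(u"a\xe4o\xf6u\xfc\xdf".encode("ascii", "xmlcharrefreplace"))
```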
New version of the patch with the error handling callback …
Now PyCodec_EscapeReplaceUnicodeEncodeErrors uses \x …
OK, but these functions are specific to unicode encoding, …
Now all callbacks (including the new …
Changing the decoding API is done now. There …
There may be many reasons for decoding errors …
>>> "\\U1111111".decode("unicode_escape")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeError: encoding 'unicodeescape' can't decode byte
0x31 in position 8: truncated \UXXXXXXXX escape
>>> "\\U11111111".decode("unicode_escape")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeError: encoding 'unicodeescape' can't decode byte
0x31 in position 9: illegal Unicode character
For symmetry I added this to the encoding API too:
>>> u"\xff".encode("ascii")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeError: encoding 'ascii' can't decode byte 0xff in
position 0: ordinal not in range(128)
The parameters passed to the callbacks now are: …
The encoding and decoding API for strings has been …
>>> unicode("a\xffb\xffc", "ascii",
... lambda enc, uni, pos, rea, sta: (u"<?>", pos+1))
u'a<?>b<?>c'
>>> "a\xffb\xffc".decode("ascii",
... lambda enc, uni, pos, rea, sta: (u"<?>",
pos+1))
u'a<?>b<?>c'
I had a problem with the decoding API: all the …
I changed all the old functions to call the new …
There are still a few spots that call the old API: …
Should we switch to the new API everywhere even …
The size of this patch begins to scare me. I …
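In the API as it finally landed, custom handlers are passed by name rather than as bare callables, and the handler takes the exception object instead of a positional argument list. A sketch reproducing the example above under today's rules; the handler name "angle" is made up for the demo:

```python
import codecs

# Register a handler that replaces each undecodable byte with "<?>".
codecs.register_error("angle", lambda exc: (u"<?>", exc.end))
print(b"a\xffb\xffc".decode("ascii", "angle"))  # a<?>b<?>c
```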
I think we ought to summarize these changes in a PEP to get some more feedback and testing from others as …
I'll look into this after I'm back from vacation on the 10.09.
Given the release schedule I am not sure whether this feature will make it into 2.2. The size of the patch is huge …
I am postponing this patch until the PEP process has started. This feature won't make it into Python 2.2.
Walter, you may want to reference this patch in the PEP.
Walter, are you making any progress on the new scheme …
I started from scratch, and the current state is this:
Encoding mostly works (except that I haven't changed …
For encoding the callback helper knows how to …
The patch so far didn't require any changes to …
I'm thinking about extending the API a little bit. Consider the following example:
>>> "\\u1".decode("unicode-escape")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeError: encoding 'unicodeescape'
can't decode byte 0x31
in position 2: truncated \uXXXX escape
The error message is a lie: Not the '1' …
For encoding this would be useful too:
Suppose I want to have an encoder that
colors the unencodable character via an
ANSI escape sequences. Then I could do
the following:
>>> import codecs
>>> def color(enc, uni, pos, why, sta):
... return (u"\033[1m<%d>\033[0m" % ord(uni[pos]), pos+1)
...
>>> codecs.register_unicodeencodeerrorhandler("color",
color)
>>> u"aäüöo".encode("ascii", "color")
'a\x1b[1m<228>\x1b[0m\x1b[1m<252>\x1b[0m\x1b[1m<246>\x1b
[0mo'
But here the sequences "\x1b[0m\x1b[1m" are not needed.
To fix this problem the encoder could collect as many …
This fixes the above problem and reduces the number of …
What do you think?
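The run-collecting behaviour proposed here is what shipped: today's encoders hand the handler a whole run of consecutive unencodable characters via exc.start and exc.end, so one escape sequence can wrap the run. A sketch of the highlighting handler on that basis; the handler name "color" follows the example above but is registered here only for the demo:

```python
import codecs

def color(exc):
    # Wrap the entire unencodable run in a single ANSI bold sequence,
    # avoiding the redundant "\x1b[0m\x1b[1m" pairs noted above.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    run = exc.object[exc.start:exc.end]
    marked = u"\033[1m%s\033[0m" % u"".join(u"<%d>" % ord(c) for c in run)
    return (marked, exc.end)

codecs.register_error("color", color)
print(u"a\xe4\xfc\xf6o".encode("ascii", "color"))
```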
Sounds like a good idea. Please keep the encoder and …
I like the highlighting feature!
What should replace do: return u"?" or (end-start)*u"?"?
Hmm, whatever it takes to maintain backwards …
For encoding it's always (end-start)*u"?":
>>> u"ää".encode("ascii", "replace")
'??'
But for decoding, it is neither nor:
>>> "\\Ux\\U".decode("unicode-escape", "replace")
u'\ufffd\ufffd'
i.e. a sequence of 5 illegal characters was replaced by two.
(It seems that this patch would be much, much simpler, if …
So this means that the encoder can collect illegal …
Decoders don't collect all illegal byte sequences, but call …
Does this make sense?
Another note: the patch will change the meaning of charmap …
With the patch the above example will raise an exception.
Of course with the patch many more replace characters can …
Is this semantic change OK? (I guess all of the existing …
Sorry for the late response.
About the difference between encoding and decoding: you shouldn't …
Error handling has to be flexible enough to handle all these …
For the existing codecs, backward compatibility should be …
Raising an exception for the charmap codec is the right …
For new codecs, I think we should suggest that replace …
About the codec error handling registry: …
Does that make sense?
BTW, the patch which uses the callback registry does not seem …
Note that the highlighting codec would make a nice example …
Thanks.
unicode.encode encodes to str and …
>>> u"gürk".encode("rot13")
't\xfcex'
>>> "gürk".decode("rot13")
u't\xfcex'
>>> u"gürk".decode("rot13")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
AttributeError: 'unicode' object has no attribute 'decode'
>>> "gürk".encode("rot13")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/home/walter/Python-current-
readonly/dist/src/Lib/encodings/rot_13.py", line 18, in
encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeError: ASCII decoding error: ordinal not in range
(128)
Here the str is converted to unicode …
Is there an example where something …
OK, but we should suggest that for encoding …
OK, this is implemented in PyUnicode_EncodeCharmap now, …
I completely changed the implementation, …
OK for encoders, for decoders see …
The handlers in the registry are all Unicode …
I renamed the function because of your …
We could require that unique names …
But I think two unicode specific …
OK, I'll upload a preliminary version …
As PyUnicode_EncodeDecimal is only used …
This could be part of the codec callback test …
Another idea: we could have as an example …

def relaxedutf8(enc, uni, startpos, endpos, reason, data):
    if uni[startpos:startpos+2] == u"\xc0\x80":
        return (u"\x00", startpos+2)
    else:
        raise UnicodeError(...)
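The relaxed-UTF-8 idea can be made runnable with the exception-based handler API that was finally adopted: accept the overlong sequence C0 80 as U+0000 and reject everything else. In the final API the handler peeks at the input bytes through exc.object rather than receiving them positionally:

```python
import codecs

def relaxedutf8(exc):
    # Treat the overlong two-byte sequence C0 80 as U+0000;
    # re-raise for any other malformed input.
    if isinstance(exc, UnicodeDecodeError) and \
            exc.object[exc.start:exc.start + 2] == b"\xc0\x80":
        return (u"\u0000", exc.start + 2)
    raise exc

codecs.register_error("relaxedutf8", relaxedutf8)
print(repr(b"a\xc0\x80b".decode("utf-8", "relaxedutf8")))  # 'a\x00b'
```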
OK, here is the current version of the patch (diff7.txt).
And here is the test script (test_codeccallbacks.py).
A new idea for the interface between the …
Maybe we could have new exception classes …
There is no data object, because when a codec …
It might be better to move these attributes …
With this method we really can have one global …

def replace(exc):
    if isinstance(exc, UnicodeDecodeError):
        return ("?", exc.end)
    else:
        return (u"?"*(exc.end-exc.start), exc.end)

Another possibility would be to do the communication …

def replace(exc):
    if isinstance(exc, UnicodeDecodeError):
        exc.replacement = "?"
    else:
        exc.replacement = u"?"*(exc.end-exc.start)

As many of the assignments can now be done on …
Does this make sense, or is this too fancy?
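Of the two variants sketched here, the first one (returning a tuple from the handler) is what PEP 293 adopted: a single registered handler can serve both directions by inspecting the exception's type and its start/end/object attributes. A runnable sketch; the handler name "replace-any" is made up for the demo:

```python
import codecs

def replace_any(exc):
    if isinstance(exc, UnicodeDecodeError):
        return (u"?", exc.end)                        # one marker per bad run
    elif isinstance(exc, UnicodeEncodeError):
        return (u"?" * (exc.end - exc.start), exc.end)  # one per character
    raise exc

codecs.register_error("replace-any", replace_any)
print(b"a\xffb".decode("ascii", "replace-any"))      # a?b
print(u"a\xe4\xfcb".encode("ascii", "replace-any"))  # b'a??b'
```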
OK, PyUnicode_EncodeDecimal is done (diff8.txt), but as the …
OK, PyUnicode_TranslateCharmap is finished too. As the …
This new version diff10.txt fixes a memory …
diff11.txt fixes two refcounting bugs in codecs.c.
diff12.txt finally implements the PEP-293 specification (i.e. …
Attached is a new version of the test script. But we need …
The attached new version of the test script adds tests for wrong …
UTF-7 decoding still has a flaw inherited from the current …
>>> "+xxx".decode("utf-7")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf7' codec can't decode bytes in
position 0-3: unterminated shift sequence
>>> "+xxx".decode("utf-7", "ignore")
u'\uc71c'
The decoder should consider the whole sequence "+xxx" as …
This new version diff13.txt moves the initialization of …
The error logic for the accessor function is inverted (now …
Updated the prototypes to use the new PyAPI_FUNC macro.
Enhanced the docstrings for str.(de|en)code and unicode.encode.
There seems to be a new string decoding function …
Checked in as: Doc/lib/libcodecs.tex 1.11
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.