unicode encoding error callbacks #34615


Closed
doerwalter opened this issue Jun 12, 2001 · 42 comments
Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs)

@doerwalter (Contributor)

BPO 432401
Nosy @malemburg, @doerwalter
Files
  • diff7.txt
  • test_codeccallbacks.py: test script
  • diff8.txt: PyUnicode_EncodeDecimal done
  • diff9.txt
  • diff10.txt
  • test_codeccallbacks.py: test script which catches the bugs fixed in diff10.txt
  • diff11.txt
  • speedtest.py: test speed for encoding
  • diff12.txt
  • test_codeccallbacks.py: test script for diff12.txt
  • test_codeccallbacks.py: Adds parameter and result test for the callbacks
  • diff13.txt
  • diff.txt
  • diff.txt: Revised patch
  • diff.txt: Patch V3: Renamed the encode function to include "unicode". A few fixes in Lib/encodings
  • diff.txt: Patch V4: (enc, uni, pos, state) -> (out, newpos) communication and speedups
  • diff.txt: Patch V5: Unicode encoding error handling callback registry
  • diff.txt: Patch V6: Encoding and decoding for string and unicode done
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/malemburg'
    closed_at = <Date 2002-09-02.13:19:15.000>
    created_at = <Date 2001-06-12.13:43:03.000>
    labels = ['interpreter-core']
    title = 'unicode encoding error callbacks'
    updated_at = <Date 2002-09-02.13:19:15.000>
    user = 'https://github.com/doerwalter'

    bugs.python.org fields:

    activity = <Date 2002-09-02.13:19:15.000>
    actor = 'doerwalter'
    assignee = 'lemburg'
    closed = True
    closed_date = None
    closer = None
    components = ['Interpreter Core']
    creation = <Date 2001-06-12.13:43:03.000>
    creator = 'doerwalter'
    dependencies = []
    files = ['3365', '3366', '3367', '3368', '3369', '3370', '3371', '3372', '3373', '3374', '3375', '3376', '3377', '3378', '3379', '3380', '3381', '3382']
    hgrepos = []
    issue_num = 432401
    keywords = ['patch']
    message_count = 42.0
    messages = ['36773', '36774', '36775', '36776', '36777', '36778', '36779', '36780', '36781', '36782', '36783', '36784', '36785', '36786', '36787', '36788', '36789', '36790', '36791', '36792', '36793', '36794', '36795', '36796', '36797', '36798', '36799', '36800', '36801', '36802', '36803', '36804', '36805', '36806', '36807', '36808', '36809', '36810', '36811', '36812', '36813', '36814']
    nosy_count = 2.0
    nosy_names = ['lemburg', 'doerwalter']
    pr_nums = []
    priority = 'high'
    resolution = 'accepted'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue432401'
    versions = []

@doerwalter (Contributor Author)

This patch adds unicode error handling callbacks to the
encode functionality. With this patch it's possible to
pass not only 'strict', 'ignore' or 'replace' as the
errors argument to encode, but also a callable that will
be called with the encoding name, the original unicode
object and the position of the unencodable character. The
callback must return a replacement unicode object that
will be encoded instead of the original character.

For example, replacing unencodable characters with XML
character references can be done in the following way:

    u"aäoöuüß".encode(
        "ascii",
        lambda enc, uni, pos: u"&#x%x;" % ord(uni[pos])
    )
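
(Assuming the patch behaves as described, the call above would be expected to produce an ASCII string along these lines:)

    >>> u"aäoöuüß".encode(
    ...     "ascii",
    ...     lambda enc, uni, pos: u"&#x%x;" % ord(uni[pos])
    ... )
    'a&#xe4;o&#xf6;u&#xfc;&#xdf;'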

@doerwalter doerwalter added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Jun 12, 2001
@malemburg (Member)

Thanks for the patch -- it looks very impressive!

    I'll give it a try later this week.

    Some first cosmetic tidbits:

• please don't place more than one C statement on one line,
  as in:
  """
  + unicode = unicode2; unicodepos = unicode2pos;
  + unicode2 = NULL; unicode2pos = 0;
  """

• Comments should start with a capital letter and be
  prepended to the section they apply to

    • There should be spaces between arguments in compares
      (a == b) not (a==b)

• Where does the name "...Encode121" originate?

• Module-internal APIs should use lower case names (you
  converted some of these to PyUnicode_...() -- this is
  normally reserved for APIs which are either marked as
  potential candidates for the public API or are very
  prominent in the code)

One thing which I don't like about your API change is that
you removed the Py_UNICODE *data, int size style arguments --
this makes it impossible to use the new APIs on non-Python
data or data which is not available as a Unicode object.

    Please separate the errors.c patch from this patch -- it
    seems totally unrelated to Unicode.

    Thanks.

@doerwalter (Contributor Author)

> please don't place more than one C statement on one line,
> as in:
> """
> + unicode = unicode2; unicodepos = unicode2pos;
> + unicode2 = NULL; unicode2pos = 0;
> """

OK, done!

> Comments should start with a capital letter and be
> prepended to the section they apply to

Fixed!

> There should be spaces between arguments in compares
> (a == b) not (a==b)

Fixed!

> Where does the name "...Encode121" originate?

It encodes one-to-one; it implements both ASCII and latin-1
encoding.

> Module-internal APIs should use lower case names (you
> converted some of these to PyUnicode_...() -- this is
> normally reserved for APIs which are either marked as
> potential candidates for the public API or are very
> prominent in the code)

Which ones? I introduced a new function for every old one
that had a "const char *errors" argument, and a few new ones
in codecs.h. Of those, PyCodec_EncodeHandlerForObject is
vital, because it is used to map old string arguments to
the new function objects. PyCodec_RaiseEncodeErrors can be
used in the encoder implementation to raise an encode error,
but it could be made static in unicodeobject.h so only those
encoders implemented there have access to it.

> One thing which I don't like about your API change is that
> you removed the Py_UNICODE *data, int size style arguments --
> this makes it impossible to use the new APIs on non-Python
> data or data which is not available as a Unicode object.

I looked through the code and found no situation where the
Py_UNICODE*/int version is really used, and having two
(PyObject *)s (the original and the replacement string)
instead of Py_UNICODE*/int and PyObject * made the
implementation a little easier, but I can fix that.

> Please separate the errors.c patch from this patch -- it
> seems totally unrelated to Unicode.

PyCodec_RaiseEncodeErrors uses this to have a \Uxxxx with
four hex digits. I removed it.

    I'll upload a revised patch as soon as it's done.

@doerwalter (Contributor Author)

> One thing which I don't like about your API change is that
> you removed the Py_UNICODE *data, int size style arguments --
> this makes it impossible to use the new APIs on non-Python
> data or data which is not available as a Unicode object.

Another problem is that the callback requires a Python
object, so in the PyObject * version, the refcount is
incref'd and the object is passed to the callback. The
Py_UNICODE*/int version would have to create a new Unicode
object from the data.

@malemburg (Member)

About the Py_UNICODE *data, int size APIs: OK, point taken.

    In general, I think we ought to keep the callback feature as
    open as possible, so passing in pointers and sizes would not
    be very useful.

BTW, could you summarize how the callback works in a few
lines?

    About _Encode121: I'd name this _EncodeUCS1 since that's
    what it is ;-)

About the new functions: I was referring to the new static
functions which you gave PyUnicode_... names. If these are
not supposed to turn into non-static functions, I'd rather
have them use lower case names (since that's how the Python
internals work too -- most of the time).

@doerwalter (Contributor Author)

How the callbacks work:

A PyObject * named errors is passed in. This may be NULL,
Py_None, 'strict', u'strict', 'ignore', u'ignore',
'replace', u'replace' or a callable object.
PyCodec_EncodeHandlerForObject maps all of these objects to
one of the three builtin error callbacks:
PyCodec_RaiseEncodeErrors (raises an exception),
PyCodec_IgnoreEncodeErrors (returns an empty replacement
string, in effect ignoring the error) or
PyCodec_ReplaceEncodeErrors (returns U+FFFD, the Unicode
replacement character, to signify to the encoder that it
should choose a suitable replacement character); or it
directly returns errors if it is a callable object. When an
unencodable character is encountered the error handling
callback will be called with the encoding name, the original
unicode object and the error position, and must return a
unicode object that will be encoded instead of the offending
character (or the callback may of course raise an
exception). U+FFFD characters in the replacement string will
be replaced with a character that the encoder chooses ('?'
in all cases).
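
(For illustration, a rough Python equivalent of that mapping; the function names are hypothetical stand-ins for the C callbacks named above:)

    def raise_encode_errors(enc, uni, pos):
        raise UnicodeError("can't encode character in position %d" % pos)

    def ignore_encode_errors(enc, uni, pos):
        return u""                  # drop the offending character

    def replace_encode_errors(enc, uni, pos):
        return u"\ufffd"            # let the encoder pick a replacement

    def encode_handler_for_object(errors):
        if errors is None or errors in ("strict", u"strict"):
            return raise_encode_errors
        elif errors in ("ignore", u"ignore"):
            return ignore_encode_errors
        elif errors in ("replace", u"replace"):
            return replace_encode_errors
        elif callable(errors):
            return errors
        raise TypeError("errors must be a known string or a callable")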

The implementation of the loop through the string is done in
the following way: a stack with two strings is kept and the
loop always encodes a character from the string at the
stack top. If an error is encountered and the stack has only
one entry (during encoding of the original string), the
callback is called and the unicode object returned is pushed
on the stack, so the encoding continues with the replacement
string. If the stack has two entries when an error is
encountered, the replacement string itself has an
unencodable character and a normal exception is raised. When
the encoder has reached the end of its current string there
are two possibilities: if the stack contains two entries,
this was the replacement string, so the replacement string
will be popped from the stack and encoding continues with
the next character from the original string. If the stack
had only one entry, encoding is finished.
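
(For illustration, here is a rough Python model of that two-entry stack loop; encode_char and the function name are hypothetical, and the real encoder of course works on buffers in C:)

    def encode_with_callback(encoding, uni, encode_char, callback):
        # Stack of (string, next position) pairs; at most two entries:
        # the original string and, while handling an error, the replacement.
        out = []
        stack = [(uni, 0)]
        while stack:
            s, pos = stack[-1]
            if pos >= len(s):        # end of the current string reached
                stack.pop()          # replacement finished, or encoding done
                continue
            stack[-1] = (s, pos + 1)
            try:
                out.append(encode_char(s[pos]))
            except UnicodeError:
                if len(stack) == 2:  # error inside the replacement itself
                    raise
                repl = callback(encoding, s, pos)
                stack.append((repl, 0))
        return "".join(out)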

    (I hope that's enough explanation of the API and implementation)

    I have renamed the static ...121 function to all lowercase
    names.

    BTW, I guess PyUnicode_EncodeUnicodeEscape could be
    reimplemented as PyUnicode_EncodeASCII with a \uxxxx
    replacement callback.

    PyCodec_RaiseEncodeErrors, PyCodec_IgnoreEncodeErrors,
    PyCodec_ReplaceEncodeErrors are globally visible because
    they have to be available in _codecsmodule.c to wrap them as
    Python function objects, but they can't be implemented in
    _codecsmodule, because they need to be available to the
    encoders in unicodeobject.c (through
    PyCodec_EncodeHandlerForObject), but importing the codecs
    module might result in an endless recursion, because
    importing a module requires unpickling of the bytecode,
    which might require decoding utf8, which ... (but this will
only happen if we implement the same mechanism for the
    decoding API)

I have not touched PyUnicode_TranslateCharmap yet --
should this function also support error callbacks? Why would
one want to insert None into the mapping to call the callback?

    A remaining problem is how to implement decoding error
    callbacks. In Python 2.1 encoding and decoding errors are
    handled in the same way with a string value. But with
    callbacks it doesn't make sense to use the same callback for
    encoding and decoding (like codecs.StreamReaderWriter and
    codecs.StreamRecoder do). Decoding callbacks have a
    different API. Which arguments should be passed to the
    decoding callback, and what is the decoding callback
    supposed to do?

@doerwalter (Contributor Author)

    One additional note: It is vital that errors is an
    assignable attribute of the StreamWriter.

    Consider the XML example: For writing an XML DOM tree one
    StreamWriter object is used. When a text node is written,
    the error handling has to be set to
    codecs.xmlreplace_encode_errors, but inside a comment or
    processing instruction replacing unencodable characters with
    charrefs is not possible, so here codecs.raise_encode_errors
    should be used (or better a custom error handler that raises
    an error that says "sorry, you can't have unencodable
    characters inside a comment")

BTW, should we continue the discussion on the i18n SIG
mailing list? An email program is much more comfortable than
an HTML textarea! ;)

@malemburg (Member)

> How the callbacks work:
>
> A PyObject * named errors is passed in. This may be NULL,
> Py_None, 'strict', u'strict', 'ignore', u'ignore',
> 'replace', u'replace' or a callable object.
> PyCodec_EncodeHandlerForObject maps all of these objects to
> one of the three builtin error callbacks
> PyCodec_RaiseEncodeErrors (raises an exception),
> PyCodec_IgnoreEncodeErrors (returns an empty replacement
> string, in effect ignoring the error),
> PyCodec_ReplaceEncodeErrors (returns U+FFFD, the Unicode
> replacement character to signify to the encoder that it
> should choose a suitable replacement character) or directly
> returns errors if it is a callable object. When an
> unencodable character is encountered the error handling
> callback will be called with the encoding name, the original
> unicode object and the error position and must return a
> unicode object that will be encoded instead of the offending
> character (or the callback may of course raise an
> exception). U+FFFD characters in the replacement string will
> be replaced with a character that the encoder chooses ('?'
> in all cases).

Nice.

> The implementation of the loop through the string is done in
> the following way: a stack with two strings is kept and the
> loop always encodes a character from the string at the
> stack top. If an error is encountered and the stack has only
> one entry (during encoding of the original string), the
> callback is called and the unicode object returned is pushed
> on the stack, so the encoding continues with the replacement
> string. If the stack has two entries when an error is
> encountered, the replacement string itself has an
> unencodable character and a normal exception is raised. When
> the encoder has reached the end of its current string there
> are two possibilities: if the stack contains two entries,
> this was the replacement string, so the replacement string
> will be popped from the stack and encoding continues with
> the next character from the original string. If the stack
> had only one entry, encoding is finished.

Very elegant solution!

> (I hope that's enough explanation of the API and
> implementation)

Could you add these docs to the Misc/unicode.txt file? I
will eventually take that file and turn it into a PEP which
will then serve as general documentation for these things.

> I have renamed the static ...121 function to all lowercase
> names.

Ok.

> BTW, I guess PyUnicode_EncodeUnicodeEscape could be
> reimplemented as PyUnicode_EncodeASCII with a \uxxxx
> replacement callback.

Hmm, wouldn't that result in a slowdown? If so, I'd rather
leave the special encoder in place, since it is being used a
lot in Python and probably some applications too.

> PyCodec_RaiseEncodeErrors, PyCodec_IgnoreEncodeErrors,
> PyCodec_ReplaceEncodeErrors are globally visible because
> they have to be available in _codecsmodule.c to wrap them as
> Python function objects, but they can't be implemented in
> _codecsmodule, because they need to be available to the
> encoders in unicodeobject.c (through
> PyCodec_EncodeHandlerForObject), but importing the codecs
> module might result in an endless recursion, because
> importing a module requires unpickling of the bytecode,
> which might require decoding utf8, which ... (but this will
> only happen if we implement the same mechanism for the
> decoding API)

I think that codecs.c is the right place for these APIs.
_codecsmodule.c is only meant as a Python access wrapper for
the internal codecs and nothing more.

    One thing I noted about the callbacks: they assume that they
    will always get Unicode objects as input. This is certainly
    not true in the general case (it is for the codecs you touch
    in the patch).

    I think it would be worthwhile to rename the callbacks to
    include "Unicode" somewhere, e.g.
    PyCodec_UnicodeReplaceEncodeErrors(). It's a long name, but
    then it points out the application field of the callback
    rather well. Same for the callbacks exposed through the
    _codecsmodule.

> I have not touched PyUnicode_TranslateCharmap yet --
> should this function also support error callbacks? Why
> would one want to insert None into the mapping to call the
> callback?

1. Yes.
2. The user may want to e.g. restrict usage of certain
   character ranges. In this case the codec would be used to
   verify the input and an exception would indeed be useful
   (e.g. say you want to restrict input to Hangul + ASCII).

> A remaining problem is how to implement decoding error
> callbacks. In Python 2.1 encoding and decoding errors are
> handled in the same way with a string value. But with
> callbacks it doesn't make sense to use the same callback for
> encoding and decoding (like codecs.StreamReaderWriter and
> codecs.StreamRecoder do). Decoding callbacks have a
> different API. Which arguments should be passed to the
> decoding callback, and what is the decoding callback
> supposed to do?

I'd suggest adding another set of PyCodec_UnicodeDecode...()
APIs for this. We'd then have to augment the base classes of
the StreamCodecs to provide two attributes for .errors with
a fallback solution for the string case (i.e. "strict" can
still be used for both directions).

> One additional note: It is vital that errors is an
> assignable attribute of the StreamWriter.

It is already!

> Consider the XML example: For writing an XML DOM tree one
> StreamWriter object is used. When a text node is written,
> the error handling has to be set to
> codecs.xmlreplace_encode_errors, but inside a comment or
> processing instruction replacing unencodable characters
> with charrefs is not possible, so here
> codecs.raise_encode_errors should be used (or better a
> custom error handler that raises an error that says "sorry,
> you can't have unencodable characters inside a comment")

Sure.

> BTW, should we continue the discussion on the i18n SIG
> mailing list? An email program is much more comfortable
> than an HTML textarea! ;)

I'd rather keep the discussions on this patch here --
forking it off to the i18n sig will make it very hard to
follow up on it. (This HTML area is indeed damn small ;-)

@doerwalter (Contributor Author)

> > [...]
> > raise an exception). U+FFFD characters in the replacement
> > string will be replaced with a character that the encoder
> > chooses ('?' in all cases).
>
> Nice.

But the special casing of U+FFFD makes the interface somewhat
less clean than it could be. It was only done to be 100%
backwards compatible. With the original "replace" error
handling the codec chose the replacement character. But as
far as I can tell none of the codecs uses anything other
than '?', so I guess we could change the replace handler
to always return u'?'. This would make the implementation a
little bit simpler, but the explanation of the callback
feature *a lot* simpler. And if you still want to handle
an unencodable U+FFFD, you can write a special callback for
that, e.g.

    def FFFDreplace(enc, uni, pos):
        if uni[pos] == u"\ufffd":
            return u"?"
        else:
            raise UnicodeError(...)

> > The implementation of the loop through the string is done
> > in the following way: a stack with two strings is kept
> > and the loop always encodes a character from the string
> > at the stack top. If an error is encountered and the stack
> > has only one entry (during encoding of the original string),
> > the callback is called and the unicode object returned is
> > pushed on the stack, so the encoding continues with the
> > replacement string. If the stack has two entries when an
> > error is encountered, the replacement string itself has
> > an unencodable character and a normal exception is raised.
> > When the encoder has reached the end of its current string
> > there are two possibilities: if the stack contains two
> > entries, this was the replacement string, so the replacement
> > string will be popped from the stack and encoding continues
> > with the next character from the original string. If the
> > stack had only one entry, encoding is finished.
>
> Very elegant solution!

I'll put it as a comment in the source.

> > (I hope that's enough explanation of the API and
> > implementation)
>
> Could you add these docs to the Misc/unicode.txt file? I
> will eventually take that file and turn it into a PEP which
> will then serve as general documentation for these things.

I could, but first we should work out how the decoding
callback API will work.

> > I have renamed the static ...121 function to all
> > lowercase names.
>
> Ok.

> > BTW, I guess PyUnicode_EncodeUnicodeEscape could be
> > reimplemented as PyUnicode_EncodeASCII with a \uxxxx
> > replacement callback.
>
> Hmm, wouldn't that result in a slowdown? If so, I'd rather
> leave the special encoder in place, since it is being used
> a lot in Python and probably some applications too.

It would be a slowdown. But callbacks open many
possibilities.

For example:

    Why can't I print u"gürk"?

is probably one of the most frequently asked questions in
comp.lang.python. For printing Unicode stuff, print could be
extended to use an error handling callback for Unicode
strings (or objects where __str__ or tp_str returns a
Unicode object) instead of using str(), which always returns
an 8bit string and uses strict encoding. There might even be a
sys.setprintencodehandler()/sys.getprintencodehandler()

> [...]
> I think it would be worthwhile to rename the callbacks to
> include "Unicode" somewhere, e.g.
> PyCodec_UnicodeReplaceEncodeErrors(). It's a long name, but
> then it points out the application field of the callback
> rather well. Same for the callbacks exposed through the
> _codecsmodule.

OK, done (and PyCodec_XMLCharRefReplaceUnicodeEncodeErrors
really is a long name ;))

> > I have not touched PyUnicode_TranslateCharmap yet --
> > should this function also support error callbacks? Why
> > would one want to insert None into the mapping to call
> > the callback?
>
> 1. Yes.
> 2. The user may want to e.g. restrict usage of certain
>    character ranges. In this case the codec would be used to
>    verify the input and an exception would indeed be useful
>    (e.g. say you want to restrict input to Hangul + ASCII).

OK, do we want TranslateCharmap to work exactly like
encoding, i.e. in case of an error should the returned
replacement string again be mapped through the translation
mapping, or should it be copied to the output directly? The
former would be more in line with encoding, but IMHO the
latter would be much more useful.

BTW, when I implement it I can implement patch bpo-403100
("Multicharacter replacements in
PyUnicode_TranslateCharmap") along the way.

Should the old TranslateCharmap map to the new
TranslateCharmapEx and inherit the "multicharacter
replacement" feature, or should I leave it as it is?

> > A remaining problem is how to implement decoding error
> > callbacks. In Python 2.1 encoding and decoding errors are
> > handled in the same way with a string value. But with
> > callbacks it doesn't make sense to use the same callback
> > for encoding and decoding (like codecs.StreamReaderWriter
> > and codecs.StreamRecoder do). Decoding callbacks have a
> > different API. Which arguments should be passed to the
> > decoding callback, and what is the decoding callback
> > supposed to do?
>
> I'd suggest adding another set of PyCodec_UnicodeDecode...()
> APIs for this. We'd then have to augment the base classes
> of the StreamCodecs to provide two attributes for .errors
> with a fallback solution for the string case (i.e. "strict"
> can still be used for both directions).

Sounds good. Now what is the decoding callback supposed to
do? I guess it will be called in the same way as the encoding
callback, i.e. with encoding name, original string and
position of the error. It might return a Unicode string
(i.e. an object of the decoding target type) that will be
emitted from the codec instead of the one offending byte. Or
it might return a tuple with a replacement Unicode object and
a resynchronisation offset, i.e. returning (u"?", 1) means
emit a '?' and skip the offending character. But to make
the offset really useful the callback has to know something
about the encoding; perhaps the codec should be allowed to
pass an additional state object to the callback?

Maybe the same should be added to the encoding callbacks too?
Maybe the encoding callback should be able to tell the
encoder whether the replacement returned should be reencoded
(in which case it's a Unicode object), or directly emitted
(in which case it's an 8bit string)?

> > One additional note: It is vital that errors is an
> > assignable attribute of the StreamWriter.
>
> It is already!

I know, but IMHO it should be documented that an assignable
errors attribute must be supported as part of the official
codec API.

    Misc/unicode.txt is not clear on that:
    """
    It is not required by the Unicode implementation to use
    these base classes, only the interfaces must match; this
    allows writing Codecs as extension types.
    """

@doerwalter (Contributor Author)

    Guido van Rossum wrote in python-dev:

> True, the "codec" pattern can be used for other
> encodings than Unicode. But it seems to me that the
> entire codecs architecture is rather strongly geared
> towards en/decoding Unicode, and it's not clear
> how well other codecs fit in this pattern (e.g. I
> noticed that all the non-Unicode codecs ignore the
> error handling parameter or assert that
> it is set to 'strict').

I noticed that too. Asserting that errors == 'strict' would
mean that the encoder is not able to deal with unencodable
stuff in any other way than by raising an error. But that
is not the problem here, because for zlib, base64, quopri,
hex and uu encoding there can be no unencodable characters.
The encoders can simply ignore the errors parameter. Should
I remove the asserts from those codecs and change the
docstrings accordingly, or will this be done separately?

@malemburg (Member)

    On your comment about the non-Unicode codecs: let's keep
    this separated from the current patch.

    Don't have much time today. I'll comment on the other things
    tomorrow.

@malemburg (Member)

    Sorry to keep you waiting, Walter. I will look into this
    again next week -- this week was way too busy...

@malemburg (Member)

    Ok, here we go...

> > > raise an exception). U+FFFD characters in the replacement
> > > string will be replaced with a character that the encoder
> > > chooses ('?' in all cases).
> >
> > Nice.
>
> But the special casing of U+FFFD makes the interface somewhat
> less clean than it could be. It was only done to be 100%
> backwards compatible. With the original "replace" error
> handling the codec chose the replacement character. But as
> far as I can tell none of the codecs uses anything other
> than '?',

True.

> so I guess we could change the replace handler
> to always return u'?'. This would make the implementation a
> little bit simpler, but the explanation of the callback
> feature *a lot* simpler.

Go for it.

> And if you still want to handle
> an unencodable U+FFFD, you can write a special callback for
> that, e.g.
>
>     def FFFDreplace(enc, uni, pos):
>         if uni[pos] == u"\ufffd":
>             return u"?"
>         else:
>             raise UnicodeError(...)

> > ...docs...
> >
> > Could you add these docs to the Misc/unicode.txt file? I
> > will eventually take that file and turn it into a PEP
> > which will then serve as general documentation for these
> > things.
>
> I could, but first we should work out how the decoding
> callback API will work.

Ok. BTW, Barry Warsaw already did the work of converting the
unicode.txt to PEP-100, so the docs should eventually go there.

> > > BTW, I guess PyUnicode_EncodeUnicodeEscape could be
> > > reimplemented as PyUnicode_EncodeASCII with a \uxxxx
> > > replacement callback.
> >
> > Hmm, wouldn't that result in a slowdown? If so, I'd rather
> > leave the special encoder in place, since it is being used
> > a lot in Python and probably some applications too.
>
> It would be a slowdown. But callbacks open many
> possibilities.

True, but in this case I believe that we should stick with
the native implementation for "unicode-escape". Having
a standard callback error handler which does the \uXXXX
replacement would be nice to have though, since this would
also be usable with lots of other codecs (e.g. all the code
page ones).

> For example:
>
>     Why can't I print u"gürk"?
>
> is probably one of the most frequently asked questions in
> comp.lang.python. For printing Unicode stuff, print could be
> extended to use an error handling callback for Unicode
> strings (or objects where __str__ or tp_str returns a
> Unicode object) instead of using str(), which always returns
> an 8bit string and uses strict encoding. There might even
> be a sys.setprintencodehandler()/sys.getprintencodehandler()

There already is a print callback in Python (forgot the name
of the hook though), so this should be possible by providing
the encoding logic in the hook.

> > > I have not touched PyUnicode_TranslateCharmap yet --
> > > should this function also support error callbacks? Why
> > > would one want to insert None into the mapping to call
> > > the callback?
> >
> > 1. Yes.
> > 2. The user may want to e.g. restrict usage of certain
> >    character ranges. In this case the codec would be used to
> >    verify the input and an exception would indeed be useful
> >    (e.g. say you want to restrict input to Hangul + ASCII).
>
> OK, do we want TranslateCharmap to work exactly like
> encoding, i.e. in case of an error should the returned
> replacement string again be mapped through the translation
> mapping, or should it be copied to the output directly? The
> former would be more in line with encoding, but IMHO the
> latter would be much more useful.

It's better to take the second approach (copy the callback
output directly to the output string) to avoid endless
recursion and other pitfalls.

I suppose this will also simplify the implementation somewhat.

> BTW, when I implement it I can implement patch bpo-403100
> ("Multicharacter replacements in
> PyUnicode_TranslateCharmap") along the way.

I've seen it; will comment on it later.

> Should the old TranslateCharmap map to the new
> TranslateCharmapEx and inherit the "multicharacter
> replacement" feature, or should I leave it as it is?

If possible, please also add the multichar replacement
to the old API. I think it is very useful and since the
old APIs work on raw buffers it would be a benefit to have
the functionality in the old implementation too.

[Decoding error callbacks]

> > > A remaining problem is how to implement decoding error
> > > callbacks. In Python 2.1 encoding and decoding errors
> > > are handled in the same way with a string value. But
> > > with callbacks it doesn't make sense to use the same
> > > callback for encoding and decoding (like
> > > codecs.StreamReaderWriter and codecs.StreamRecoder do).
> > > Decoding callbacks have a different API. Which arguments
> > > should be passed to the decoding callback, and what is
> > > the decoding callback supposed to do?
> >
> > I'd suggest adding another set of
> > PyCodec_UnicodeDecode...() APIs for this. We'd then have
> > to augment the base classes of the StreamCodecs to provide
> > two attributes for .errors with a fallback solution for
> > the string case (i.e. "strict" can still be used for both
> > directions).
>
> Sounds good. Now what is the decoding callback supposed to
> do? I guess it will be called in the same way as the
> encoding callback, i.e. with encoding name, original string
> and position of the error. It might return a Unicode string
> (i.e. an object of the decoding target type) that will be
> emitted from the codec instead of the one offending byte.
> Or it might return a tuple with a replacement Unicode
> object and a resynchronisation offset, i.e. returning
> (u"?", 1) means emit a '?' and skip the offending
> character. But to make the offset really useful the
> callback has to know something about the encoding; perhaps
> the codec should be allowed to pass an additional state
> object to the callback?
>
> Maybe the same should be added to the encoding callbacks
> too? Maybe the encoding callback should be able to tell the
> encoder whether the replacement returned should be
> reencoded (in which case it's a Unicode object), or
> directly emitted (in which case it's an 8bit string)?

I like the idea of having an optional state object (basically
this should be a codec-defined arbitrary Python object)
which would then allow the callback to apply additional
tricks. The object should be documented to be modifiable in
place (simplifies the interface).

About the return value:

I'd suggest to always use the same tuple interface, e.g.

    callback(encoding, input_data, input_position, state) ->
        (output_to_be_appended, new_input_position)

(I think it's better to use absolute values for the position
rather than offsets.)

Perhaps the encoding callbacks should use the same
interface... what do you think?
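
(For illustration, a handler written against this proposed tuple interface might look like the following sketch; the name and parameter names just mirror the pseudo-signature above:)

    def xmlcharref_handler(encoding, input_data, input_position, state):
        # Emit an XML character reference for the offending character
        # and resume at the following absolute position.
        char = input_data[input_position]
        return (u"&#%d;" % ord(char), input_position + 1)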

> > > One additional note: It is vital that errors is an
> > > assignable attribute of the StreamWriter.
> >
> > It is already!
>
> I know, but IMHO it should be documented that an assignable
> errors attribute must be supported as part of the official
> codec API.
>
> Misc/unicode.txt is not clear on that:
> """
> It is not required by the Unicode implementation to use
> these base classes, only the interfaces must match; this
> allows writing Codecs as extension types.
> """

    Good point. I'll add that to the PEP-100.

@doerwalter (Contributor Author)

> > [...]
> > so I guess we could change the replace handler
> > to always return u'?'. This would make the
> > implementation a little bit simpler, but the
> > explanation of the callback feature *a lot*
> > simpler.
>
> Go for it.

OK, done!

> > > Could you add these docs to the Misc/unicode.txt
> > > file? I will eventually take that file and turn
> > > it into a PEP which will then serve as general
> > > documentation for these things.
> >
> > I could, but first we should work out how the
> > decoding callback API will work.
>
> Ok. BTW, Barry Warsaw already did the work of converting
> the unicode.txt to PEP-100, so the docs should eventually
> go there.

OK. I guess it would be best to do this when everything
is finished.

> > > > BTW, I guess PyUnicode_EncodeUnicodeEscape
> > > > could be reimplemented as PyUnicode_EncodeASCII
> > > > with a \uxxxx replacement callback.
> > >
> > > Hmm, wouldn't that result in a slowdown? If so,
> > > I'd rather leave the special encoder in place,
> > > since it is being used a lot in Python and
> > > probably some applications too.
> >
> > It would be a slowdown. But callbacks open many
> > possibilities.
>
> True, but in this case I believe that we should stick with
> the native implementation for "unicode-escape". Having
> a standard callback error handler which does the \uXXXX
> replacement would be nice to have though, since this would
> also be usable with lots of other codecs (e.g. all the
> code page ones).

OK, done, now there's a
PyCodec_EscapeReplaceUnicodeEncodeErrors/
codecs.escapereplace_unicodeencode_errors
that uses \u (or \U if x > 0xffff (with a wide build
of Python)).
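
(If I understand the new handler correctly, usage would look roughly like this; the exact output is my expectation from the description, not verified against the patch:)

    >>> import codecs
    >>> u"g\xfcrk".encode("ascii", codecs.escapereplace_unicodeencode_errors)
    'g\\u00fcrk'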

> > For example:
> >
> >     Why can't I print u"gürk"?
> >
> > is probably one of the most frequently asked
> > questions in comp.lang.python. For printing
> > Unicode stuff, print could be extended to use an
> > error handling callback for Unicode strings (or
> > objects where __str__ or tp_str returns a Unicode
> > object) instead of using str(), which always
> > returns an 8bit string and uses strict encoding.
> > There might even be a
> > sys.setprintencodehandler()/sys.getprintencodehandler()
>
> There already is a print callback in Python (forgot the
> name of the hook though), so this should be possible by
> providing the encoding logic in the hook.

True: sys.displayhook

> > [...]
> > Should the old TranslateCharmap map to the new
> > TranslateCharmapEx and inherit the
> > "multicharacter replacement" feature,
> > or should I leave it as it is?
>
> If possible, please also add the multichar replacement
> to the old API. I think it is very useful and since the
> old APIs work on raw buffers it would be a benefit to have
> the functionality in the old implementation too.

OK! I will try to find the time to implement that in the
next days.

> [Decoding error callbacks]
>
> About the return value:
>
> I'd suggest to always use the same tuple interface, e.g.
>
>     callback(encoding, input_data, input_position, state) ->
>         (output_to_be_appended, new_input_position)
>
> (I think it's better to use absolute values for the
> position rather than offsets.)
>
> Perhaps the encoding callbacks should use the same
> interface... what do you think?

This would make the callback feature hypergeneric and a
little slower, because tuples have to be created, but it
(almost) unifies the encoding and decoding API ("almost"
because for the encoder output_to_be_appended will be
reencoded, for the decoder it will simply be appended), so
I'm for it.

I implemented this and changed the encoders to only look up
the error handler on the first error. The UCS1 encoder now
no longer uses the two-item stack strategy. (This strategy
only makes sense for those encoders where the encoding
itself is much more complicated than the looping/callback
etc.) So memory overflow tests are only done when an
unencodable error occurs, and the UCS1 encoder should now
be as fast as it was without error callbacks.

Do we want to enforce new_input_position > input_position,
or should jumping back be allowed?

> > > > One additional note: It is vital that errors
> > > > is an assignable attribute of the StreamWriter.
> > >
> > > It is already!
> >
> > I know, but IMHO it should be documented that an
> > assignable errors attribute must be supported
> > as part of the official codec API.
> >
> > Misc/unicode.txt is not clear on that:
> > """
> > It is not required by the Unicode implementation
> > to use these base classes, only the interfaces must
> > match; this allows writing Codecs as extension types.
> > """
>
> Good point. I'll add that to the PEP-100.

OK.

Here is the current todo list:

1. Implement a new TranslateCharmap and fix the old one.
2. New encoding API for string objects too.
3. Decoding.
4. Documentation.
5. Test cases.

I'm thinking about a different strategy for implementing
callbacks (see
http://mail.python.org/pipermail/i18n-sig/2001-July/001262.html)

We could have an error handler registry, which maps names
to error handlers; then it would be possible to keep the
errors argument as "const char *" instead of "PyObject *".
Currently PyCodec_UnicodeEncodeHandlerForObject is a
backwards compatibility hack that will never go away,
because it's always more convenient to type

    u"...".encode("...", "strict")

instead of

    import codecs
    u"...".encode("...", codecs.raise_encode_errors)

But with an error handler registry this function would
become the official lookup method for error handlers
(PyCodec_LookupUnicodeEncodeErrorHandler?).
Python code would look like this:

    def xmlreplace(encoding, uni, pos, state):
        return (u"&#%d;" % ord(uni[pos]), pos + 1)

    import codecs

    codecs.registerError("xmlreplace", xmlreplace)

and then the following call can be made:

    u"äöü".encode("ascii", "xmlreplace")

As soon as the first error is encountered, the encoder uses
its builtin error handling method if it recognizes the name
("strict", "replace" or "ignore") or looks up the error
handling function in the registry if it doesn't. In this way
the speed for the backwards compatible features is the same
as before and "const char *errors" can be kept as the
parameter to all encoding functions. For speed, common error
handling names could even be implemented in the encoder
itself.

But for special one-shot error handlers, it might still be
useful to pass the error handler directly, so maybe we
should leave errors as PyObject *, but implement the
registry anyway?
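
(A minimal pure-Python model of the proposed registry, with hypothetical names, might look like this:)

    _handlers = {}

    def register_error_handler(name, handler):
        # Custom handlers are stored under a name, so the C-level
        # errors argument can stay a "const char *".
        _handlers[name] = handler

    def lookup_error_handler(name):
        # "strict", "replace" and "ignore" would be recognized by
        # the encoder itself and never reach the registry.
        try:
            return _handlers[name]
        except KeyError:
            raise LookupError("unknown error handler name %r" % (name,))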

@malemburg (Member)

> > > > > BTW, I guess PyUnicode_EncodeUnicodeEscape
> > > > > could be reimplemented as PyUnicode_EncodeASCII
> > > > > with a \uxxxx replacement callback.
> > > >
> > > > Hmm, wouldn't that result in a slowdown? If so,
> > > > I'd rather leave the special encoder in place,
> > > > since it is being used a lot in Python and
> > > > probably some applications too.
> > >
> > > It would be a slowdown. But callbacks open many
> > > possibilities.
> >
> > True, but in this case I believe that we should stick with
> > the native implementation for "unicode-escape". Having
> > a standard callback error handler which does the \uXXXX
> > replacement would be nice to have though, since this would
> > also be usable with lots of other codecs (e.g. all the
> > code page ones).
>
> OK, done, now there's a
> PyCodec_EscapeReplaceUnicodeEncodeErrors/
> codecs.escapereplace_unicodeencode_errors
> that uses \u (or \U if x > 0xffff (with a wide build
> of Python)).

Great!

> > > [...]
> > > Should the old TranslateCharmap map to the new
> > > TranslateCharmapEx and inherit the
> > > "multicharacter replacement" feature,
> > > or should I leave it as it is?
> >
> > If possible, please also add the multichar replacement
> > to the old API. I think it is very useful and since the
> > old APIs work on raw buffers it would be a benefit to have
> > the functionality in the old implementation too.
>
> OK! I will try to find the time to implement that in the
> next days.

Good.

> > [Decoding error callbacks]
> >
> > About the return value:
> >
> > I'd suggest to always use the same tuple interface, e.g.
> >
> >     callback(encoding, input_data, input_position, state) ->
> >         (output_to_be_appended, new_input_position)
> >
> > (I think it's better to use absolute values for the
> > position rather than offsets.)
> >
> > Perhaps the encoding callbacks should use the same
> > interface... what do you think?
>
> This would make the callback feature hypergeneric and a
> little slower, because tuples have to be created, but it
> (almost) unifies the encoding and decoding API ("almost"
> because for the encoder output_to_be_appended will be
> reencoded, for the decoder it will simply be appended), so
> I'm for it.

That's the point.

Note that I don't think the tuple creation
will hurt much (see the make_tuple() API in codecs.c)
since small tuples are cached by Python internally.

> I implemented this and changed the encoders to only look up
> the error handler on the first error. The UCS1 encoder now
> no longer uses the two-item stack strategy. (This strategy
> only makes sense for those encoders where the encoding
> itself is much more complicated than the looping/callback
> etc.) So memory overflow tests are only done when an
> unencodable error occurs, and the UCS1 encoder should now
> be as fast as it was without error callbacks.
>
> Do we want to enforce new_input_position > input_position,
> or should jumping back be allowed?

No; moving backwards should be allowed (this may be useful
in order to resynchronize with the input data).

> Here is the current todo list:
>
> 1. Implement a new TranslateCharmap and fix the old one.
> 2. New encoding API for string objects too.
> 3. Decoding.
> 4. Documentation.
> 5. Test cases.
>
> I'm thinking about a different strategy for implementing
> callbacks (see
> http://mail.python.org/pipermail/i18n-sig/2001-July/001262.html)
>
> We could have an error handler registry, which maps names
> to error handlers; then it would be possible to keep the
> errors argument as "const char *" instead of "PyObject *".
> Currently PyCodec_UnicodeEncodeHandlerForObject is a
> backwards compatibility hack that will never go away,
> because it's always more convenient to type
>     u"...".encode("...", "strict")
> instead of
>     import codecs
>     u"...".encode("...", codecs.raise_encode_errors)
>
> But with an error handler registry this function would
> become the official lookup method for error handlers
> (PyCodec_LookupUnicodeEncodeErrorHandler?).
> Python code would look like this:
>
>     def xmlreplace(encoding, uni, pos, state):
>         return (u"&#%d;" % ord(uni[pos]), pos + 1)
>
>     import codecs
>
>     codecs.registerError("xmlreplace", xmlreplace)
>
> and then the following call can be made:
>
>     u"äöü".encode("ascii", "xmlreplace")
>
> As soon as the first error is encountered, the encoder uses
> its builtin error handling method if it recognizes the name
> ("strict", "replace" or "ignore") or looks up the error
> handling function in the registry if it doesn't. In this way
> the speed for the backwards compatible features is the same
> as before and "const char *errors" can be kept as the
> parameter to all encoding functions. For speed, common error
> handling names could even be implemented in the encoder
> itself.
>
> But for special one-shot error handlers, it might still be
> useful to pass the error handler directly, so maybe we
> should leave errors as PyObject *, but implement the
> registry anyway?

Good idea!

One minor nit: codecs.registerError() should be named
codecs.register_errorhandler() to be more in line with
the Python coding style guide.

@doerwalter (Contributor Author)

    New version of the patch with the error handling callback
    registry.

> > OK, done, now there's a
> > PyCodec_EscapeReplaceUnicodeEncodeErrors/
> > codecs.escapereplace_unicodeencode_errors
> > that uses \u (or \U if x > 0xffff (with a wide build
> > of Python)).
>
> Great!

Now PyCodec_EscapeReplaceUnicodeEncodeErrors uses \x
in addition to \u and \U where appropriate.

> > [...]
> > But for special one-shot error handlers, it might still
> > be useful to pass the error handler directly, so maybe we
> > should leave errors as PyObject *, but implement the
> > registry anyway?
>
> Good idea!
>
> One minor nit: codecs.registerError() should be named
> codecs.register_errorhandler() to be more in line with
> the Python coding style guide.

OK, but these functions are specific to unicode encoding,
so now the functions are called:

    codecs.register_unicodeencodeerrorhandler
    codecs.lookup_unicodeencodeerrorhandler

Now all callbacks (including the new ones:
"xmlcharrefreplace" and "escapereplace") are registered in
codecs.c/_PyCodecRegistry_Init, so using them is really
simple: u"gürk".encode("ascii", "xmlcharrefreplace")
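
(With the handler preregistered under that name, the expected result would be something like:)

    >>> u"gürk".encode("ascii", "xmlcharrefreplace")
    'g&#252;rk'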

@doerwalter (Contributor Author)

Changing the decoding API is done now. There are new
functions codecs.register_unicodedecodeerrorhandler and
codecs.lookup_unicodedecodeerrorhandler.
Only the standard handlers for 'strict', 'ignore' and
'replace' are preregistered.

There may be many reasons for decoding errors
in the byte string, so I added an additional
argument to the decoding API: reason, which
gives the reason for the failure, e.g.:

>>> "\\U1111111".decode("unicode_escape")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: encoding 'unicodeescape' can't decode byte 0x31 in position 8: truncated \UXXXXXXXX escape
>>> "\\U11111111".decode("unicode_escape")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: encoding 'unicodeescape' can't decode byte 0x31 in position 9: illegal Unicode character

For symmetry I added this to the encoding API too:

>>> u"\xff".encode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: encoding 'ascii' can't decode byte 0xff in position 0: ordinal not in range(128)

    The parameters passed to the callbacks now are:
    encoding, unicode, position, reason, state.

    The encoding and decoding API for strings has been
    adapted too, so now the new API should be usable
    everywhere:

>>> unicode("a\xffb\xffc", "ascii",
...    lambda enc, uni, pos, rea, sta: (u"<?>", pos+1))
u'a<?>b<?>c'
>>> "a\xffb\xffc".decode("ascii",
...    lambda enc, uni, pos, rea, sta: (u"<?>", pos+1))
u'a<?>b<?>c'

I had a problem with the decoding API: all the
functions in _codecsmodule.c used the t# format
specifier. I changed that to O! with
&PyString_Type, because otherwise we would have
the problem that the decoding API would have to
pass buffer objects around instead of strings, and
the callback would have to call str() on the
buffer anyway to access a specific character, so
this wouldn't be any faster than calling str()
on the buffer before decoding. It seems that
buffers aren't used anyway.

I changed all the old functions to call the new
ones so bugfixes don't have to be done in two
places. There are two exceptions: I didn't
change PyString_AsEncodedString and
PyString_AsDecodedString because they are
documented as deprecated anyway (although they
are called in a few spots). This means that I
duplicated part of their functionality in
PyString_AsEncodedObjectEx and
PyString_AsDecodedObjectEx.

There are still a few spots that call the old API:
e.g. PyString_Format still calls PyUnicode_Decode
(but with strict decoding) because it passes the
rest of the format string to PyUnicode_Format
when it encounters a Unicode object.

    Should we switch to the new API everywhere even
    if strict encoding/decoding is used?

The size of this patch begins to scare me. I
guess we need an extensive test script for all the
new features and documentation. I hope you have time
to do that, as I'll be busy with other projects in
the next weeks. (BTW, I haven't touched
PyUnicode_TranslateCharmap yet.)

@malemburg (Member)

    I think we ought to summarize these changes in a PEP to get some more feedback and testing from others as
    well.

    I'll look into this after I'm back from vacation on the 10.09.

    Given the release schedule I am not sure whether this feature will make it into 2.2. The size of the patch is huge
    and probably needs a lot of testing first.

@malemburg (Member)

    I am postponing this patch until the PEP process has started. This feature won't make it into Python 2.2.

    Walter, you may want to reference this patch in the PEP.

@malemburg (Member)

Walter, are you making any progress on the new scheme
we discussed on the mailing list (adding an error handler
registry much like the codec registry itself, instead of
trying to redo the complete codec API)?

@doerwalter (Contributor Author)

    I started from scratch, and the current state is this:

    Encoding mostly works (except that I haven't changed
    TranslateCharmap and EncodeDecimal yet) and most of the
    decoding stuff works (DecodeASCII and DecodeCharmap are
    still unchanged) and the decoding callback helper isn't
    optimized for the "builtin" names yet (i.e. it still calls
    the handler).

    For encoding the callback helper knows how to
    handle "strict", "replace", "ignore"
    and "xmlcharrefreplace" itself and won't call the callback.
    This should make the encoder fast enough. As callback name
    string comparison results are cached it might even be
    faster than the original.

    The patch so far didn't require any changes to
    unicodeobject.h, stringobject.h or stringobject.c

@doerwalter (Contributor Author)

I'm thinking about extending the API a little bit.

Consider the following example:

>>> "\\u1".decode("unicode-escape")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: encoding 'unicodeescape' can't decode byte 0x31 in position 2: truncated \uXXXX escape

The error message is a lie: the problem is not the '1' in
position 2, but the complete truncated sequence '\\u1'.
For this the decoder should pass a start and an end
position to the handler.

    For encoding this would be useful too: 
    Suppose I want to have an encoder that 
colors the unencodable character via
    ANSI escape sequences. Then I could do 
    the following:
    >>> import codecs
    >>> def color(enc, uni, pos, why, sta):
    ...    return (u"\033[1m<%d>\033[0m" % ord(uni[pos]), pos+1)
    ... 
    >>> codecs.register_unicodeencodeerrorhandler("color", 
    color)
    >>> u"aäüöo".encode("ascii", "color")
    'a\x1b[1m<228>\x1b[0m\x1b[1m<252>\x1b[0m\x1b[1m<246>\x1b
    [0mo'

    But here the sequences "\x1b[0m\x1b[1m" are not needed.

    To fix this problem the encoder could collect as many
    unencodable characters as possible and pass those to
    the error callback in one go (passing a start and
    end+1 position).
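
A range-aware version of the color handler might then look like this
(a sketch: the signature just replaces pos with the startpos/endpos
pair, and register_unicodeencodeerrorhandler is the registration
function proposed above, not a shipped API):

import codecs

def color(enc, uni, startpos, endpos, why, sta):
    # wrap the whole run of unencodable characters in one ANSI bold span
    decimals = u"".join([u"<%d>" % ord(c) for c in uni[startpos:endpos]])
    return (u"\033[1m%s\033[0m" % decimals, endpos)

codecs.register_unicodeencodeerrorhandler("color", color)
u"aäüöo".encode("ascii", "color")
# -> 'a\x1b[1m<228><252><246>\x1b[0mo'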

    This fixes the above problem and reduces the number of
    calls to the callback, so it should speed up the
algorithms in case of custom error handler names.
    (And it makes the implementation very interesting ;))

    What do you think?

    @malemburg
Member

    Sounds like a good idea. Please keep the encoder and
decoder APIs symmetric, though, i.e. add the slice
    information to both APIs. The slice should use the
    same format as Python's standard slices, that is
    left inclusive, right exclusive.

I like the highlighting feature!

    @doerwalter
Contributor Author

    What should replace do: Return u"?" or (end-start)*u"?"

    @malemburg
Member

    Hmm, whatever it takes to maintain backwards
compatibility. Do you have an example?

    @doerwalter
Contributor Author

    For encoding it's always (end-start)*u"?":
    >>> u"ää".encode("ascii", "replace")
    '??'
    
But for decoding, it is neither of the two:
    >>> "\\Ux\\U".decode("unicode-escape", "replace")
    u'\ufffd\ufffd'

i.e. a sequence of 5 illegal characters was replaced by two
    replacement characters. This might mean that decoders can't
    collect all the illegal characters and call the callback
    once. They might have to call the callback for every single
    illegal byte sequence to get the old behaviour.

    (It seems that this patch would be much, much simpler, if
    we only change the encoders)

    @doerwalter
Contributor Author

So this means that the encoder can collect illegal
characters and pass them to the callback. "replace" will
replace them with (end-start)*u"?".

    Decoders don't collect all illegal byte sequences, but call
    the callback once for every byte sequence that has been
    found illegal and "replace" will replace it with u"?".

    Does this make sense?

    @doerwalter
Contributor Author

    Another note: the patch will change the meaning of charmap
    encoding slightly: currently "replace" will put a ? into
    the output, even if ? is not in the mapping, i.e.
    codecs.charmap_encode(u"c", "replace", {ord("a"): ord
    ("b")}) will return ('?', 1).

    With the patch the above example will raise an exception.

Of course with the patch many more replacement characters
can appear, so it is vital that the mapping is applied to
the replacement string.

    Is this semantic change OK? (I guess all of the existing
    codecs have a mapping ord("?")->ord("?"))

    @malemburg
Member

    Sorry for the late response.

    About the difference between encoding and decoding: you shouldn't
    just look at the case where you work with Unicode and strings, e.g.
    take the rot-13 codec which works on strings only or other codecs
    which translate objects into strings and vice-versa.

    Error handling has to be flexible enough to handle all these
    situations. Since the codecs know best how to handle the situations,
    I'd make this an implementation detail of the codec and leave the
    behaviour undefined in the general case.

    For the existing codecs, backward compatibility should be
    maintained, if at all possible. If the patch gets overly complicated
    because of this, we may have to provide a downgrade solution
    for this particular problem (I don't think replace is used in any
    computational context, though, since you can never be sure
how many replacement characters get inserted, so the case
    may not be that realistic).

    Raising an exception for the charmap codec is the right
    way to go, IMHO. I would consider the current behaviour
    a bug.

    For new codecs, I think we should suggest that replace
    tries to collect as much illegal data as possible before
    invoking the error handler. The handler should be aware
    of the fact that it won't necessarily get all the broken data
    in one call.

    About the codec error handling registry:
    You seem to be using a Unicode specific approach
    here. I'd rather like to see a generic approach which uses
    the API we discussed earlier. Would that be possible ?
    In that case, the codec API should probably be called
    codecs.register_error('myhandler', myhandler).

Does that make sense?

    BTW, the patch which uses the callback registry does not seem
    to be available on this SF page (the last patch still converts
    the errors argument to a PyObject, which shouldn't be needed
    anymore with the new approach). Can you please upload your
    latest version ?

    Note that the highlighting codec would make a nice example
    for the new feature.

    Thanks.

    @doerwalter
Contributor Author

> About the difference between encoding
> and decoding: you shouldn't just look
> at the case where you work with Unicode
> and strings, e.g. take the rot-13 codec
> which works on strings only or other
> codecs which translate objects into
> strings and vice-versa.

    unicode.encode encodes to str and
    str.decode decodes to unicode,
    even for rot-13:

    >>> u"gürk".encode("rot13")
    't\xfcex'
    >>> "gürk".decode("rot13")
    u't\xfcex'
    >>> u"gürk".decode("rot13")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    AttributeError: 'unicode' object has no attribute 'decode'
    >>> "gürk".encode("rot13")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/home/walter/Python-current-
    readonly/dist/src/Lib/encodings/rot_13.py", line 18, in 
    encode
        return codecs.charmap_encode(input,errors,encoding_map)
    UnicodeError: ASCII decoding error: ordinal not in range
    (128)

    Here the str is converted to unicode
    first, before encode is called, but the
    conversion to unicode fails.

    Is there an example where something
    else happens?

> Error handling has to be flexible enough
> to handle all these situations. Since
> the codecs know best how to handle the
> situations, I'd make this an implementation
> detail of the codec and leave the
> behaviour undefined in the general case.

OK, but we should suggest that for encoding,
unencodable characters are collected, and that for
decoding, separate byte sequences that the codec
considers broken are passed to the callback; i.e.
for decoding the handler will never get all the
broken data in one call. E.g. for
"\\u30\\Uffffffff".decode("unicode-escape")
the handler will be called twice (once with
"\\u30" and "truncated \\u escape" as the
reason, and once with "\\Uffffffff" and
"illegal character" as the reason).

> For the existing codecs, backward
> compatibility should be maintained,
> if at all possible. If the patch gets
> overly complicated because of this,
> we may have to provide a downgrade solution
> for this particular problem (I don't think
> replace is used in any computational context,
> though, since you can never be sure how
> many replacement characters get inserted,
> so the case may not be that realistic).
>
> Raising an exception for the charmap codec
> is the right way to go, IMHO. I would
> consider the current behaviour a bug.

    OK, this is implemented in PyUnicode_EncodeCharmap now,
    and collecting unencodable characters works too.

    I completely changed the implementation,
    because the stack approach would have
    gotten much more complicated when
    unencodable characters are collected.

> For new codecs, I think we should
> suggest that replace tries to collect
> as much illegal data as possible before
> invoking the error handler. The handler
> should be aware of the fact that it
> won't necessarily get all the broken
> data in one call.

    OK for encoders, for decoders see
    above.

> About the codec error handling
> registry: You seem to be using a
> Unicode specific approach here.
> I'd rather like to see a generic
> approach which uses the API
> we discussed earlier. Would that be possible?

    The handlers in the registry are all Unicode
specific, and they are different for encoding
    and for decoding.

    I renamed the function because of your
    comment from 2001-06-13 10:05 (which
    becomes exceedingly difficult to find on
    this long page! ;)).

> In that case, the codec API should
> probably be called
> codecs.register_error('myhandler', myhandler).
>
> Does that make sense?

    We could require that unique names
    are used for custom handlers, but
    for the standard handlers we do have
    name collisions. To prevent them, we
    could either remove them from the registry
    and require that the codec implements
    the error handling for those itself,
or we could do some fiddling, so that
    u"üöä".encode("ascii", "replace")
    becomes
    u"üöä".encode("ascii", "unicodeencodereplace")
    behind the scenes.

    But I think two unicode specific
    registries are much simpler to handle.

> BTW, the patch which uses the callback
> registry does not seem to be available
> on this SF page (the last patch still
> converts the errors argument to a
> PyObject, which shouldn't be needed
> anymore with the new approach).
> Can you please upload your
> latest version?

    OK, I'll upload a preliminary version
    tomorrow. PyUnicode_EncodeDecimal and
    PyUnicode_TranslateCharmap are still
    missing, but otherwise the patch seems
    to be finished. All decoders work and
    the encoders collect unencodable characters
    and implement the handling of known
    callback handler names themselves.

    As PyUnicode_EncodeDecimal is only used
    by the int, long, float, and complex constructors,
    I'd love to get rid of the errors argument,
but for completeness' sake, I'll implement
    the callback functionality.

> Note that the highlighting codec
> would make a nice example
> for the new feature.

    This could be part of the codec callback test
    script, which I've started to write. We could
    kill two birds with one stone here:

    1. Test the implementation.
    2. Document and advocate what is
      possible with the patch.

    Another idea: we could have as an example
    a decoding handler that relaxes the
    UTF-8 minimal encoding restriction, e.g.

def relaxedutf8(enc, input, startpos, endpos, reason, data):
    # "\xc0\x80" is the overlong two-byte encoding of U+0000; the
    # comparison must be against a byte string, since the decoder
    # passes in the undecoded bytes
    if input[startpos:startpos+2] == "\xc0\x80":
        return (u"\x00", startpos+2)
    else:
        raise UnicodeError(...)
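
Registering and using it might then look like this (assuming a
decoding counterpart to the registration function used above; both
names are part of the proposal, not of any released Python):

import codecs

codecs.register_unicodedecodeerrorhandler("relaxedutf8", relaxedutf8)
"\xc0\x80".decode("utf-8", "relaxedutf8")
# -> u'\x00' instead of an error for the overlong encoding of NUL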

    @doerwalter
Contributor Author

    OK, here is the current version of the patch (diff7.txt).
    PyUnicode_EncodeDecimal and PyUnicode_TranslateCharmap are
    still missing.

    @doerwalter
Contributor Author

    And here is the test script (test_codeccallbacks.py)

    @doerwalter
Contributor Author

    A new idea for the interface between the
    codec and the callback:

    Maybe we could have new exception classes
    UnicodeEncodeError, UnicodeDecodeError
    and UnicodeTranslateError derived from
    UnicodeError. They have all the attributes
    that are passed as an argument
    tuple in the current version:
string: the original string
start: the start position of the unencodable characters/undecodable bytes
end: the end position+1 of the unencodable characters/undecodable bytes
reason: a string that explains why the encoding/decoding doesn't work

    There is no data object, because when a codec
    wants to pass extended information to the
    callback it can do this via a derived
    class.

    It might be better to move these attributes
    to the base class UnicodeError, but this
    might have backwards compatibility
    problems.
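
A Python-level sketch of what such a class could look like (the real
classes would live in C; attribute names as listed above):

class UnicodeEncodeError(UnicodeError):
    def __init__(self, string, start, end, reason):
        UnicodeError.__init__(self, reason)
        self.string = string  # the original unicode string
        self.start = start    # first unencodable position
        self.end = end        # one past the last unencodable position
        self.reason = reason  # why the encoding failed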

    With this method we really can have one global
    registry for all callbacks, because for callback
    names that must work with encoding *and* decoding
    *and* translating (i.e. "strict", "replace" and
    "ignore"), the callback can check which type
    of exception was passed, so "replace" can
    e.g. look like this:

def replace(exc):
    if isinstance(exc, UnicodeDecodeError):
        return (u"?", exc.end)
    else:
        return (u"?"*(exc.end-exc.start), exc.end)

Another possibility would be to do the communication
callback->codec by assigning to attributes
of the exception object. The resynchronisation
    position could even be preassigned to end, so
    the callback only needs to specify the
    replacement in most cases:

def replace(exc):
    if isinstance(exc, UnicodeDecodeError):
        exc.replacement = u"?"
    else:
        exc.replacement = u"?"*(exc.end-exc.start)

    As many of the assignments can now be done on
    the C level without having to allocate Python
    objects (except for the replacement string
    and the reason), this version might even be
    faster, especially if we allow the codec to
    reuse the exception object for the next call
    to the callback.

Does this make sense, or is this too fancy?
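
For illustration, registering such a handler through the single
registry suggested earlier (the handler name "myreplace" is made up):

import codecs

def replace(exc):
    if isinstance(exc, UnicodeDecodeError):
        return (u"?", exc.end)
    return (u"?" * (exc.end - exc.start), exc.end)

codecs.register_error("myreplace", replace)

u"g\xfcrk".encode("ascii", "myreplace")       # -> 'g?rk'
"\\u1".decode("unicode-escape", "myreplace")  # -> u'?'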

    @doerwalter
Contributor Author

    OK, PyUnicode_EncodeDecimal is done (diff8.txt), but as the
    errors argument can't be accessed from Python code, there's
    not much testing for this.

    @doerwalter
Contributor Author

    OK, PyUnicode_TranslateCharmap is finished too. As the
errors argument is again not exposed to Python, it can't
    really be tested. Should we add errors as an optional
    argument to unicode.translate?

    @doerwalter
Contributor Author

    This new version diff10.txt fixes a memory
    overwrite/reallocation bug in PyUnicode_EncodeCharmap and
    moves the error handling out of PyUnicode_EncodeCharmap.
    A new version of the test script is included too.

    @doerwalter
Contributor Author

    diff11.txt fixes two refcounting bugs in codecs.c.
speedtest.py is a little test script that checks the speed
of various string/encoding/error combinations.

    @doerwalter
Contributor Author

    diff12.txt finally implements the PEP-293 specification (i.e.
    using exceptions for the communication between codec and
handler).

    @doerwalter
Contributor Author

    Attached is a new version of the test script. But we need
more tests. UTF-7 is completely untested, and codecs that
pass wrong arguments to the handler, as well as handlers that
return wrong or out-of-bounds results, are untested too.

    @doerwalter
Contributor Author

The attached new version of the test script adds tests for
wrong parameters passed to the callbacks and for wrong results
returned from the callback. It also adds copies of the builtin
error handlers to the long string tests, so that the codec does
not recognize the name and goes through the general callback
machinery.

    UTF-7 decoding still has a flaw inherited from the current
    implementation:

    >>> "+xxx".decode("utf-7")                    
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-3: unterminated shift sequence
>>> "+xxx".decode("utf-7", "ignore")
    u'\uc71c'

    The decoder should consider the whole sequence "+xxx" as
    undecodable, so "Ignore" should return an empty string.
    Currently the correct sequence will be passed to the
    callback, but the faulty sequence has already been emitted
    to the result string.
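
For reference, the desired behaviour (a sketch of expected results,
not current output):

"+xxx".decode("utf-7", "ignore")   # should give u''
"+xxx".decode("utf-7", "replace")  # presumably u'\ufffd'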

    @doerwalter
Contributor Author

    This new version diff13.txt moves the initialization of
codecs.strict_errors etc. from Modules/_codecsmodule.c to
    Lib/codecs.py.

    The error logic for the accessor function is inverted (now
it's 0 for success and -1 for error).

    Updated the prototypes to use the new PyAPI_FUNC macro.

    Enhanced the docstrings for str.(de|en)code and unicode.encode.

    There seems to be a new string decoding function
    PyString_DecodeEscape in current CVS. This function has to
    be updated too.

    @doerwalter
Contributor Author

    Checked in as:
(this is diff13.txt + the test script + documentation in two TeX files)

    Doc/lib/libcodecs.tex 1.11
    Doc/lib/libexcs.tex 1.49
    Include/codecs.h 2.5
    Include/pyerrors.h 2.58
    Lib/codecs.py 1.27
    Lib/test/test_codeccallbacks.py 1.1
    Misc/NEWS 1.476
    Modules/_codecsmodule.c 2.15
    Objects/stringobject.c 2.186
    Objects/unicodeobject.c 2.167
    Python/codecs.c 2.15
    Python/exceptions.c 1.35
