gh-92536: PEP 623: Remove wstr from unicode #92537

Merged
merged 21 commits into from
May 12, 2022
47 changes: 5 additions & 42 deletions Doc/c-api/arg.rst
@@ -136,48 +136,6 @@ which disallows mutable objects such as :class:`bytearray`.
attempting any conversion. Raises :exc:`TypeError` if the object is not
a :class:`bytearray` object. The C variable may also be declared as :c:type:`PyObject*`.

``u`` (:class:`str`) [const Py_UNICODE \*]
Convert a Python Unicode object to a C pointer to a NUL-terminated buffer of
Unicode characters. You must pass the address of a :c:type:`Py_UNICODE`
pointer variable, which will be filled with the pointer to an existing
Unicode buffer. Please note that the width of a :c:type:`Py_UNICODE`
character depends on compilation options (it is either 16 or 32 bits).
The Python string must not contain embedded null code points; if it does,
a :exc:`ValueError` exception is raised.

.. versionchanged:: 3.5
Previously, :exc:`TypeError` was raised when embedded null code points
were encountered in the Python string.

.. deprecated-removed:: 3.3 3.12
Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
:c:func:`PyUnicode_AsWideCharString`.

``u#`` (:class:`str`) [const Py_UNICODE \*, :c:type:`Py_ssize_t`]
This variant on ``u`` stores into two C variables, the first one a pointer to a
Unicode data buffer, the second one its length. This variant allows
null code points.

.. deprecated-removed:: 3.3 3.12
Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
:c:func:`PyUnicode_AsWideCharString`.

``Z`` (:class:`str` or ``None``) [const Py_UNICODE \*]
Like ``u``, but the Python object may also be ``None``, in which case the
:c:type:`Py_UNICODE` pointer is set to ``NULL``.

.. deprecated-removed:: 3.3 3.12
Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
:c:func:`PyUnicode_AsWideCharString`.

``Z#`` (:class:`str` or ``None``) [const Py_UNICODE \*, :c:type:`Py_ssize_t`]
Like ``u#``, but the Python object may also be ``None``, in which case the
:c:type:`Py_UNICODE` pointer is set to ``NULL``.

.. deprecated-removed:: 3.3 3.12
Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
:c:func:`PyUnicode_AsWideCharString`.

``U`` (:class:`str`) [PyObject \*]
Requires that the Python object is a Unicode object, without attempting
any conversion. Raises :exc:`TypeError` if the object is not a Unicode
@@ -247,6 +205,11 @@ which disallows mutable objects such as :class:`bytearray`.
them. Instead, the implementation assumes that the byte string object uses the
encoding passed in as parameter.

.. versionchanged:: 3.12
``u``, ``u#``, ``Z``, and ``Z#`` are removed because they used legacy ``Py_UNICODE*``
Contributor

... because they used a/the legacy ...

Member Author

Thank you. I made follow-up PR: #92756

representation.


Numbers
-------

177 changes: 21 additions & 156 deletions Doc/c-api/unicode.rst
@@ -17,26 +17,12 @@ of Unicode characters while staying memory efficient. There are special cases
for strings where all code points are below 128, 256, or 65536; otherwise, code
points must be below 1114112 (which is the full Unicode range).

:c:type:`Py_UNICODE*` and UTF-8 representations are created on demand and cached
in the Unicode object. The :c:type:`Py_UNICODE*` representation is deprecated
and inefficient.

Due to the transition between the old APIs and the new APIs, Unicode objects
can internally be in two states depending on how they were created:

* "canonical" Unicode objects are all objects created by a non-deprecated
Unicode API. They use the most efficient representation allowed by the
implementation.

* "legacy" Unicode objects have been created through one of the deprecated
APIs (typically :c:func:`PyUnicode_FromUnicode`) and only bear the
:c:type:`Py_UNICODE*` representation; you will have to call
:c:func:`PyUnicode_READY` on them before calling any other API.
UTF-8 representation is created on demand and cached in the Unicode object.

.. note::
The "legacy" Unicode object will be removed in Python 3.12 with deprecated
APIs. All Unicode objects will be "canonical" since then. See :pep:`623`
for more information.
The :c:type:`Py_UNICODE` representation was removed in Python 3.12,
along with the deprecated APIs that used it.
See :pep:`623` for more information.


Unicode Type
@@ -101,18 +87,12 @@ access to internal read-only data of Unicode objects:

.. c:function:: int PyUnicode_READY(PyObject *o)

Ensure the string object *o* is in the "canonical" representation. This is
required before using any of the access macros described below.

.. XXX expand on when it is not required

Returns ``0`` on success and ``-1`` with an exception set on failure, which in
particular happens if memory allocation fails.
Returns ``0``. This API is kept only for backward compatibility.

.. versionadded:: 3.3

.. deprecated-removed:: 3.10 3.12
This API will be removed with :c:func:`PyUnicode_FromUnicode`.
.. deprecated:: 3.10
This API do nothing since Python 3.12. Please remove code using this function.
Contributor

This API does nothing..., and I think the 'Please remove ...' can be omitted



.. c:function:: Py_ssize_t PyUnicode_GET_LENGTH(PyObject *o)
@@ -130,23 +110,21 @@ access to internal read-only data of Unicode objects:
Return a pointer to the canonical representation cast to UCS1, UCS2 or UCS4
integer types for direct character access. No checks are performed if the
canonical representation has the correct character size; use
:c:func:`PyUnicode_KIND` to select the right function. Make sure
:c:func:`PyUnicode_READY` has been called before accessing this.
:c:func:`PyUnicode_KIND` to select the right function.

.. versionadded:: 3.3


.. c:macro:: PyUnicode_WCHAR_KIND
PyUnicode_1BYTE_KIND
.. c:macro:: PyUnicode_1BYTE_KIND
PyUnicode_2BYTE_KIND
PyUnicode_4BYTE_KIND

Return values of the :c:func:`PyUnicode_KIND` macro.

.. versionadded:: 3.3

.. deprecated-removed:: 3.10 3.12
``PyUnicode_WCHAR_KIND`` is deprecated.
.. versionchanged:: 3.12
``PyUnicode_WCHAR_KIND`` has been removed.


.. c:function:: int PyUnicode_KIND(PyObject *o)
@@ -155,8 +133,6 @@ access to internal read-only data of Unicode objects:
bytes per character this Unicode object uses to store its data. *o* has to
be a Unicode object in the "canonical" representation (not checked).

.. XXX document "0" return value?

.. versionadded:: 3.3


@@ -208,49 +184,6 @@ access to internal read-only data of Unicode objects:
.. versionadded:: 3.3


.. c:function:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
code units (this includes surrogate pairs as 2 units). *o* has to be a
Unicode object (not checked).

.. deprecated-removed:: 3.3 3.12
Part of the old-style Unicode API, please migrate to using
:c:func:`PyUnicode_GET_LENGTH`.


.. c:function:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

Return the size of the deprecated :c:type:`Py_UNICODE` representation in
bytes. *o* has to be a Unicode object (not checked).

.. deprecated-removed:: 3.3 3.12
Part of the old-style Unicode API, please migrate to using
:c:func:`PyUnicode_GET_LENGTH`.


.. c:function:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
const char* PyUnicode_AS_DATA(PyObject *o)

Return a pointer to a :c:type:`Py_UNICODE` representation of the object. The
returned buffer is always terminated with an extra null code point. It
may also contain embedded null code points, which would cause the string
to be truncated when used in most C functions. The ``AS_DATA`` form
casts the pointer to :c:type:`const char *`. The *o* argument has to be
a Unicode object (not checked).

.. versionchanged:: 3.3
This function is now inefficient -- because in many cases the
:c:type:`Py_UNICODE` representation does not exist and needs to be created
-- and can fail (return ``NULL`` with an exception set). Try to port the
code to use the new :c:func:`PyUnicode_nBYTE_DATA` macros or use
:c:func:`PyUnicode_WRITE` or :c:func:`PyUnicode_READ`.

.. deprecated-removed:: 3.3 3.12
Part of the old-style Unicode API, please migrate to using the
:c:func:`PyUnicode_nBYTE_DATA` family of macros.


.. c:function:: int PyUnicode_IsIdentifier(PyObject *o)

Return ``1`` if the string is a valid identifier according to the language
@@ -436,12 +369,17 @@ APIs:

Create a Unicode object from the char buffer *u*. The bytes will be
interpreted as being UTF-8 encoded. The buffer is copied into the new
object. If the buffer is not ``NULL``, the return value might be a shared
object, i.e. modification of the data is not allowed.
object.
The return value might be a shared object, i.e. modification of the data is
not allowed.

If *u* is ``NULL``, this function behaves like :c:func:`PyUnicode_FromUnicode`
with the buffer set to ``NULL``. This usage is deprecated in favor of
:c:func:`PyUnicode_New`, and will be removed in Python 3.12.
This function raises :exc:`SystemError` when:

* *size* < 0,
* *u* is ``NULL`` and *size* > 0

.. versionchanged:: 3.12
*u* == ``NULL`` with *size* > 0 is not allowed anymore.


.. c:function:: PyObject *PyUnicode_FromString(const char *u)
@@ -680,79 +618,6 @@ APIs:
.. versionadded:: 3.3


Deprecated Py_UNICODE APIs
""""""""""""""""""""""""""

.. deprecated-removed:: 3.3 3.12

These API functions are deprecated with the implementation of :pep:`393`.
Extension modules can continue using them, as they will not be removed in Python
3.x, but need to be aware that their use can now cause performance and memory hits.


.. c:function:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

Create a Unicode object from the Py_UNICODE buffer *u* of the given size. *u*
may be ``NULL`` which causes the contents to be undefined. It is the user's
responsibility to fill in the needed data. The buffer is copied into the new
object.

If the buffer is not ``NULL``, the return value might be a shared object.
Therefore, modification of the resulting Unicode object is only allowed when
*u* is ``NULL``.

If the buffer is ``NULL``, :c:func:`PyUnicode_READY` must be called once the
string content has been filled before using any of the access macros such as
:c:func:`PyUnicode_KIND`.

.. deprecated-removed:: 3.3 3.12
Part of the old-style Unicode API, please migrate to using
:c:func:`PyUnicode_FromKindAndData`, :c:func:`PyUnicode_FromWideChar`, or
:c:func:`PyUnicode_New`.


.. c:function:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

Return a read-only pointer to the Unicode object's internal
:c:type:`Py_UNICODE` buffer, or ``NULL`` on error. This will create the
:c:type:`Py_UNICODE*` representation of the object if it is not yet
available. The buffer is always terminated with an extra null code point.
Note that the resulting :c:type:`Py_UNICODE` string may also contain
embedded null code points, which would cause the string to be truncated when
used in most C functions.

.. deprecated-removed:: 3.3 3.12
Part of the old-style Unicode API, please migrate to using
:c:func:`PyUnicode_AsUCS4`, :c:func:`PyUnicode_AsWideChar`,
:c:func:`PyUnicode_ReadChar` or similar new APIs.


.. c:function:: Py_UNICODE* PyUnicode_AsUnicodeAndSize(PyObject *unicode, Py_ssize_t *size)

Like :c:func:`PyUnicode_AsUnicode`, but also saves the :c:func:`Py_UNICODE`
array length (excluding the extra null terminator) in *size*.
Note that the resulting :c:type:`Py_UNICODE*` string
may contain embedded null code points, which would cause the string to be
truncated when used in most C functions.

.. versionadded:: 3.3

.. deprecated-removed:: 3.3 3.12
Part of the old-style Unicode API, please migrate to using
:c:func:`PyUnicode_AsUCS4`, :c:func:`PyUnicode_AsWideChar`,
:c:func:`PyUnicode_ReadChar` or similar new APIs.


.. c:function:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
code units (this includes surrogate pairs as 2 units).

.. deprecated-removed:: 3.3 3.12
Part of the old-style Unicode API, please migrate to using
:c:func:`PyUnicode_GET_LENGTH`.


.. c:function:: PyObject* PyUnicode_FromObject(PyObject *obj)

Copy an instance of a Unicode subtype to a new true Unicode object if
1 change: 0 additions & 1 deletion Doc/data/stable_abi.dat

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

8 changes: 4 additions & 4 deletions Doc/howto/clinic.rst
@@ -848,15 +848,15 @@ on the right is the text you'd replace it with.
``'s#'`` ``str(zeroes=True)``
``'s*'`` ``Py_buffer(accept={buffer, str})``
``'U'`` ``unicode``
``'u'`` ``Py_UNICODE``
``'u#'`` ``Py_UNICODE(zeroes=True)``
``'u'`` ``wchar_t``
``'u#'`` ``wchar_t(zeroes=True)``
``'w*'`` ``Py_buffer(accept={rwbuffer})``
``'Y'`` ``PyByteArrayObject``
``'y'`` ``str(accept={bytes})``
``'y#'`` ``str(accept={robuffer}, zeroes=True)``
``'y*'`` ``Py_buffer``
``'Z'`` ``Py_UNICODE(accept={str, NoneType})``
``'Z#'`` ``Py_UNICODE(accept={str, NoneType}, zeroes=True)``
``'Z'`` ``wchar_t(accept={str, NoneType})``
``'Z#'`` ``wchar_t(accept={str, NoneType}, zeroes=True)``
``'z'`` ``str(accept={str, NoneType})``
``'z#'`` ``str(accept={str, NoneType}, zeroes=True)``
``'z*'`` ``Py_buffer(accept={buffer, str, NoneType})``
26 changes: 25 additions & 1 deletion Doc/whatsnew/3.12.rst
@@ -66,6 +66,9 @@ Summary -- Release highlights

.. PEP-sized items next.

Important deprecations, removals or restrictions:

* :pep:`623`, Remove wstr from Unicode


New Features
@@ -91,7 +94,9 @@ Improved Modules
Optimizations
=============


* Removed ``wstr`` and ``wstr_length`` members from Unicode objects.
This reduces the object size by 8 or 16 bytes on 64-bit platforms. (:pep:`623`)
(Contributed by Inada Naoki in :gh:`92536`.)


Deprecated
@@ -140,6 +145,13 @@ New Features
Porting to Python 3.12
----------------------

* Legacy Unicode APIs based on the ``Py_UNICODE*`` representation have been removed.
Please migrate to APIs based on UTF-8 or ``wchar_t*``.

* Argument parsing functions like :c:func:`PyArg_ParseTuple` no longer support
``Py_UNICODE*`` based formats (e.g. ``u``, ``Z``). Please migrate
to other formats for Unicode like ``s``, ``z``, ``es``, and ``U``.

Deprecated
----------

@@ -150,3 +162,15 @@ Removed
API. The ``token.h`` header file was only designed to be used by Python
internals.
(Contributed by Victor Stinner in :gh:`92651`.)

* Legacy Unicode APIs have been removed. See :pep:`623` for details.

* :c:macro:`PyUnicode_WCHAR_KIND`
* :c:func:`PyUnicode_AS_UNICODE`
* :c:func:`PyUnicode_AsUnicode`
* :c:func:`PyUnicode_AsUnicodeAndSize`
* :c:func:`PyUnicode_AS_DATA`
* :c:func:`PyUnicode_FromUnicode`
* :c:func:`PyUnicode_GET_SIZE`
* :c:func:`PyUnicode_GetSize`
* :c:func:`PyUnicode_GET_DATA_SIZE`