datetime.strptime(dt.strftime("%c"), "%c") fails when year is <1000. #124529

Open
pganssle opened this issue Sep 25, 2024 · 16 comments
Labels
type-bug An unexpected behavior, bug, or error

Comments

@pganssle
Member

pganssle commented Sep 25, 2024

Bug report

Bug description:

>>> from datetime import datetime
>>> datetime.strptime(datetime(1000, 1, 1).strftime("%c"), "%c")
datetime.datetime(1000, 1, 1, 0, 0)
>>> datetime.strptime(datetime(999, 1, 1).strftime("%c"), "%c")
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    datetime.strptime(datetime(999, 1, 1).strftime("%c"), "%c")
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nlx5/Documents/Programming/Python/cpython/Lib/_strptime.py", line 573, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
                                    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/home/nlx5/Documents/Programming/Python/cpython/Lib/_strptime.py", line 352, in _strptime
    raise ValueError("time data %r does not match format %r" %
                     (data_string, format))
ValueError: time data 'Tue Jan  1 00:00:00 999' does not match format '%c'

Discovered this when adding some hypothesis tests for strptime/strftime. I doubt this is a real problem anyone is going to have in the real world, but maybe.

I do not know if this is locale-specific or OS specific.
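One way to see whether the failure depends on the platform or locale is to run the round trip over a handful of years and collect the ones that fail. The following is a minimal standard-library sketch, not the hypothesis test mentioned above; `failing_years` is a hypothetical helper name, and the result will differ between platforms.

```python
from datetime import datetime


def failing_years(years, fmt="%c"):
    """Return the years for which strptime() cannot round-trip strftime()."""
    failures = []
    for year in years:
        dt = datetime(year, 1, 1)
        text = dt.strftime(fmt)
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            # strptime() rejected the string produced by strftime()
            failures.append(year)
            continue
        if parsed != dt:
            # Parsed, but to a different value than the original datetime
            failures.append(year)
    return failures


if __name__ == "__main__":
    # Output is platform-dependent; on an affected system the years
    # below 1000 are expected to show up here.
    print(failing_years([1, 9, 99, 999, 1000, 2024]))
```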

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux

@pganssle pganssle added the type-bug An unexpected behavior, bug, or error label Sep 25, 2024
@terryjreedy
Member

The year for datetime.datetime must be and is allowed to be anything in range MINYEAR <= year <= MAXYEAR, which is 1 <= year <= 9999. I expect that the format functions should handle any legal date.

@zuo
Contributor

zuo commented Sep 26, 2024

Considering these results:

>>> datetime(999, 1, 1).strftime("%c")
'Tue Jan  1 00:00:00 999'

>>> datetime.strptime("Tue Jan  1 00:00:00 999", "%c")  # as from strftime() above => the error described above
[snip]
ValueError: time data 'Tue Jan  1 00:00:00 999' does not match format '%c'

>>> datetime.strptime("Tue Jan  1 00:00:00 0999", "%c")  # adding 0 before 999 to have 4-digit width year => success
datetime.datetime(999, 1, 1, 0, 0)

...and the following fragment of the docs (https://docs.python.org/3/library/datetime.html#technical-detail):

  1. The strptime() method can parse years in the full [1, 9999] range, but years < 1000 must be zero-filled to 4-digit width.

...I am not sure if the proviso that years < 1000 must be zero-filled to 4-digit width intentionally covers also this case.

One could argue that it does, and there is nothing to fix here.

Another person, however, could argue that:

  • (1) it does not, as here we deal with a locale-specific way of formatting;
  • (2) successful round-trip behavior is an expected property, and dropping it would be surprising.

What do you think?

[EDIT] The quoted note refers to the %Y format code, not to the %c one. So I believe that that imaginary Another person would be right. :)
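For the %Y case that the quoted note does cover, the zero-fill requirement can be met by formatting the year explicitly rather than relying on the platform's strftime(). A small sketch, where `format_padded` is a hypothetical helper and not part of the datetime API:

```python
from datetime import datetime


def format_padded(dt: datetime) -> str:
    """Format as YYYY-MM-DD with the year always zero-filled to 4 digits."""
    return f"{dt.year:04d}-{dt.month:02d}-{dt.day:02d}"


text = format_padded(datetime(999, 1, 1))
print(text)                                 # 0999-01-01
print(datetime.strptime(text, "%Y-%m-%d"))  # 0999-01-01 00:00:00
```

This only helps when the format string is under your control; it does not apply to %c, whose layout comes from the platform and locale.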

@zuo
Contributor

zuo commented Sep 26, 2024

PS It seems that for time.{strftime,strptime}() the behavior is the same (as, apparently, time.strptime() uses the same implementation from _strptime):

$ ./python
Python 3.14.0a0 (heads/main:a4d1fdfb15, Sep 26 2024, 22:47:21) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
>>> t_tuple = time.strptime("Tue Jan  1 00:00:00 0999", '%c')
>>> t_tuple
time.struct_time(tm_year=999, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=1, tm_yday=1, tm_isdst=-1)
>>> time.strftime('%c', t_tuple)
'Tue Jan  1 00:00:00 999'
>>> time.strptime(_, '%c')
Traceback (most recent call last):
  File "<python-input-4>", line 1, in <module>
    time.strptime(_, '%c')
    ~~~~~~~~~~~~~^^^^^^^^^
  File "/home/zuo/cpython/Lib/_strptime.py", line 567, in _strptime_time
    tt = _strptime(data_string, format)[0]
         ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/home/zuo/cpython/Lib/_strptime.py", line 352, in _strptime
    raise ValueError("time data %r does not match format %r" %
                     (data_string, format))
ValueError: time data 'Tue Jan  1 00:00:00 999' does not match format '%c'

@zuo
Contributor

zuo commented Sep 26, 2024

Hypothesis

It seems that the source of the problem is the following (at least for the C.UTF-8 locale and some others, e.g. pl_PL.UTF-8, and apparently for other locales as well):

  • datetime.datetime.strftime() – when %c is used to format a date+time – does not use datetime's own way of formatting %Y (which would result in a 4-digit year, with leading zeros for year < 1000), but returns a string that contains the year number with the minimum number of digits needed to represent it (i.e., fewer than 4 for year < 1000).

...whereas...

  • datetime.datetime.strptime() – when %c is used to parse a date+time – uses, to parse the year fragment, an %Y-specific regex (see the _strptime module...) which requires that the year number has exactly 4 digits.

Observation

I checked that:

(1) When formatting that example year 999, the results are:

Function/Method                 For "%c"    For "%Y"
time.strftime()                 "999"       "999"
datetime.datetime.strftime()    "999"       "0999" [sic!]

Conclusion: datetime.datetime.strftime()'s %c formatting behaves like time.strftime(), therefore it is not based on datetime.datetime.strftime()'s formatting of %Y.

(2) When parsing that example year 999 (as well as, e.g., 9) – both as a part of full date (%c) and alone (%Y) – only the 4-digit year format is accepted. Smaller numbers of digits always cause the same ValueError from _strptime (whose machinery, as noted above, even for %c uses the %Y-specific stuff...).
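The formatting comparison in observation (1) can be reproduced with a short script. This is only a sketch: `year_formatting` is a hypothetical helper, and the strings it returns depend on the platform, the C library and the active locale, so they may not match the table above.

```python
import time
from datetime import datetime


def year_formatting(year=999):
    """Collect how each strftime() flavor renders the given year."""
    dt = datetime(year, 1, 1)
    tt = dt.timetuple()  # struct_time for the time module functions
    return {
        ("time.strftime()", "%c"): time.strftime("%c", tt),
        ("time.strftime()", "%Y"): time.strftime("%Y", tt),
        ("datetime.datetime.strftime()", "%c"): dt.strftime("%c"),
        ("datetime.datetime.strftime()", "%Y"): dt.strftime("%Y"),
    }


if __name__ == "__main__":
    for (func, code), text in year_formatting().items():
        print(f"{func:<30} {code}  {text!r}")
```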

Possible fix

In the _strptime module's machinery (which is used by datetime.datetime.strptime() and time.strptime()): decouple the %c's parsing regex from the %Y's one, making the former more liberal (accepting also 1, 2 or 3 digits in the year number).
[The fix would be implemented in the _strptime module, probably in TimeRE's __init__() and pattern().]

(Another theoretically possible variant: just make the %Y's regex more liberal – however that seems too disruptive...)
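The difference between the strict and the more liberal year rule can be illustrated with two hand-written regexes. This is a sketch only: `PREFIX` is a simplified stand-in for the C-locale %c layout, not the pattern that _strptime actually builds, and `parse_year` is a hypothetical helper.

```python
import re

# Simplified stand-in for the C-locale "%c" layout (weekday, month, day,
# time); NOT the regex generated by the _strptime module.
PREFIX = (
    r"(?P<a>\w{3}) (?P<b>\w{3}) +(?P<d>\d{1,2}) "
    r"(?P<H>\d\d):(?P<M>\d\d):(?P<S>\d\d) "
)
# Year reuses a strict 4-digit rule, as %Y does.
STRICT_C = re.compile(PREFIX + r"(?P<Y>\d\d\d\d)")
# Year accepts 1 to 4 digits, as in the possible fix described above.
LAX_C = re.compile(PREFIX + r"(?P<Y>\d{1,4})")


def parse_year(regex, text):
    """Return the year as int if the whole string matches, else None."""
    match = regex.fullmatch(text)
    return int(match.group("Y")) if match else None


print(parse_year(STRICT_C, "Tue Jan  1 00:00:00 999"))   # None
print(parse_year(LAX_C, "Tue Jan  1 00:00:00 999"))      # 999
print(parse_year(STRICT_C, "Tue Jan  1 00:00:00 0999"))  # 999
```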

@zuo
Contributor

zuo commented Sep 26, 2024

@pganssle @terryjreedy

I'd be happy to implement the fix – if you decide that this should be fixed.

@Mariatta
Member

Mariatta commented Sep 26, 2024

No issue on my Macbook laptop

Python 3.14.0a0 (heads/main:162d152146a, Sep 25 2024, 10:45:28) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from datetime import datetime
>>> datetime.strptime(datetime(1000, 1, 1).strftime("%c"), "%c")
datetime.datetime(1000, 1, 1, 0, 0)
>>> datetime.strptime(datetime(999, 1, 1).strftime("%c"), "%c")
datetime.datetime(999, 1, 1, 0, 0)
>>> 

@zuo
Contributor

zuo commented Sep 27, 2024

@Mariatta

Could you please check what string is returned on your system from the following call?

>>> datetime(999, 1, 1).strftime("%c")

Thanx :)

PS My guess is that, for your locale, a %c-formatted date+time includes a 2-digit year variant (instead of the 4-digit one).

@Mariatta
Member

Mariatta commented Sep 27, 2024

@zuo I just tried it:

Python 3.14.0a0 (heads/main:162d152146a, Sep 25 2024, 10:45:28) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from datetime import datetime
>>> datetime(999, 1, 1).strftime("%c")
'Tue Jan  1 00:00:00 0999'

@zuo
Contributor

zuo commented Sep 27, 2024

@Mariatta

Thank you!

Yeah, the leading zero that your platform/locale provides makes strftime's %c output digestible by strptime on your system. Apparently, that's not the case for the Linux family [EDIT: or, probably, more generally – for glibc]. :-/

Anyway, now it's quite clear to me what the fix should be.
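Until strptime() itself is changed, a user-level workaround on affected platforms is to zero-fill the year before parsing. The sketch below assumes the year is the last field of the string, as in the C-locale output shown above; `pad_trailing_year` is a hypothetical helper, and other locales may need a different rule.

```python
import re


def pad_trailing_year(text: str) -> str:
    """Zero-fill a 1- to 3-digit year at the end of the string to 4 digits."""
    # The lookbehind ensures the digits form a whole field, so an already
    # 4-digit year such as "1000" or "0999" is left untouched.
    return re.sub(r"(?<=\s)\d{1,3}$", lambda m: m.group(0).zfill(4), text)


print(pad_trailing_year("Tue Jan  1 00:00:00 999"))   # Tue Jan  1 00:00:00 0999
print(pad_trailing_year("Sat Jan  1 00:00:00 1000"))  # Sat Jan  1 00:00:00 1000
```

The padded string can then be passed to datetime.strptime(..., "%c") in the same locale that produced it.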

@zuo
Contributor

zuo commented Sep 28, 2024

Proof of concept:

diff --git a/Lib/_strptime.py b/Lib/_strptime.py
index a3f8bb544d..6a2527b75c 100644
--- a/Lib/_strptime.py
+++ b/Lib/_strptime.py
@@ -213,8 +213,10 @@ def __init__(self, locale_time=None):
                                 'Z'),
             '%': '%'})
         base.__setitem__('W', base.__getitem__('U').replace('U', 'W'))
-        base.__setitem__('c', self.pattern(self.locale_time.LC_date_time))
-        base.__setitem__('x', self.pattern(self.locale_time.LC_date))
+        base.__setitem__(
+            'c', self.__pattern_with_lax_year(self.locale_time.LC_date_time))
+        base.__setitem__(
+            'x', self.__pattern_with_lax_year(self.locale_time.LC_date))
         base.__setitem__('X', self.pattern(self.locale_time.LC_time))

     def __seqToRE(self, to_convert, directive):
@@ -236,6 +238,21 @@ def __seqToRE(self, to_convert, directive):
         regex = '(?P<%s>%s' % (directive, regex)
         return '%s)' % regex

+    def __pattern_with_lax_year(self, format):
+        """Like pattern(), but making %y and %Y accept also fewer digits.
+
+        Necessary to ensure that strptime() is able to parse strftime()'s
+        output when the %c or %x format code is used -- considering that
+        for some locales/platforms (e.g., 'C.UTF-8' on Linux), formatting
+        with either %c or %x may cause year numbers, if a number is small,
+        to have fewer digits than usual (e.g., '999' instead of '0999', or
+        '9' instead of '0009' or '09').
+        """
+        pattern = self.pattern(format)
+        pattern = pattern.replace(self['y'], r"(?P<y>\d{1,2})")
+        pattern = pattern.replace(self['Y'], r"(?P<Y>\d{1,4})")
+        return pattern
+
     def pattern(self, format):
         """Return regex pattern for the format string.

[EDIT] After applying the above patch, the error does not occur anymore:

>>> import time
>>> t_tuple = time.strptime("Tue Jan  1 00:00:00 0999", '%c')
>>> t_tuple
time.struct_time(tm_year=999, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=1, tm_yday=1, tm_isdst=-1)
>>> time.strftime('%c', t_tuple)
'Tue Jan  1 00:00:00 999'
>>> time.strptime(_, '%c')
time.struct_time(tm_year=999, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=1, tm_yday=1, tm_isdst=-1)
>>> 
>>> from datetime import datetime
>>> datetime(999, 1, 1).strftime('%c')
'Tue Jan  1 00:00:00 999'
>>> datetime.strptime(_, '%c')
datetime.datetime(999, 1, 1, 0, 0)

@serhiy-storchaka
Member

See also gh-120713 and gh-122272. datetime.strftime() was fixed for locale-independent supported formats.

@zuo
Contributor

zuo commented Oct 2, 2024

@pganssle @terryjreedy @Mariatta @serhiy-storchaka

OK, I'd like to propose the fix, implemented in the linked PR #124778 – considering that:

  • The status quo – the error-causing discrepancy between strftime and strptime (with respect to %c and %x), concerning year numbers explicitly documented as valid (i.e., within the range 1..9999) – is suboptimal and, as @terryjreedy seems to suggest above, should be considered a bug. The fact that the presence of the problematic behavior is platform-dependent makes the state of affairs even less consistent.
  • Altering strftime to produce only zero-padded year representations (when it comes to the %c/%x format codes) does not seem like a good idea to me, because:
    • the platform-dependent behavior of strftime is a well-established (and documented) fact of life (contract?); it is not clear whether (and for whom, considering the wide range of Python uses/applications/platforms...) it would be acceptable or unacceptable to deviate from it – at least when it comes to these particular, locale-dependent, format codes %c and %x [EDIT: I admit that the changes made in gh-120713 and gh-122272 weaken this argument to some extent; but, on the other hand, the composite nature and the unpredictability caused by being locale-dependent make these two format codes – %c and %x – somewhat different beasts (in this context) than %Y, %G, %C, %F; the latter (%F) is also composite, indeed, but still locale-independent and easily explainable in terms of other format codes]
    • the implementation would be non-trivial (re-parse and alter the output obtained from the underlying platform routines? would it be worth the complexity? what about the performance penalty?).
  • Altering strptime, in the spirit of the robustness principle, to accept also non-zero-padded year representations – only regarding %c and %x (i.e., not touching %Y/%y/etc.) – seems to be the right solution, considering that:
    • generally, the platform-independent (Python-devs-only-steered) behavior of strptime is a well-established (and documented) fact, so this is the natural place to adjust the stuff;
    • the compatibility impact seems negligible; [EDIT: indeed, starting to accept less typical and previously unacceptable inputs does pose some risk of parsing garbage as valid data, but – given the composite nature of %c and %x – isn't that risk minimized enough?]
    • the implementation is very simple (< 10 SLOC, excluding tests/docstrings/comments – see the PR).

@serhiy-storchaka
Member

strptime() should be able to parse strings created by other programs, not only Python, so this is a bug, and the only solution is to alter strptime().

@serhiy-storchaka
Member

I do not think that #124778 is a right solution. We should fix %Y, %G and maybe %y. This will automatically fix %c and %x. And I consider this a bugfix which should be backported.

@zuo
Contributor

zuo commented Oct 3, 2024

> I do not think that #124778 is a right solution. We should fix %Y, %G and maybe %y. This will automatically fix %c and %x. And I consider this a bugfix which should be backported.

But wouldn't making %Y/%G/%y accept even a 1-digit year be too aggressive a change? I have no hard data, but I can quite easily imagine that a non-negligible number of users may rely, e.g., on %Y to distinguish valid dates (e.g., 2024/09/24) from garbage or highly ambiguous inputs (e.g. 24/09/24 or 1/10/1).

My proposal in #124778 is similar, just much more conservative (as %c/%x are always supposed to be used in the realm of a particular locale; that, I believe – together with the fact that internally they are always composed of several fields – makes the change much safer), and limited just to those cases in which the strftime/strptime parity cannot [or at least not easily] be gained by fixing the strftime's "side of the equation" (which has already been done for %Y/%G in gh-120713).
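The concern about a standalone lax %Y can be made concrete with two simplified regexes for a %Y/%m/%d layout. These are hand-written illustrations, not the patterns used by _strptime (in particular, they do not range-check month or day), and `accepted` is a hypothetical helper.

```python
import re

# Strict: the year must have exactly 4 digits.
STRICT_DATE = re.compile(r"(?P<Y>\d{4})/(?P<m>\d{1,2})/(?P<d>\d{1,2})")
# Lax: the year may have 1 to 4 digits.
LAX_DATE = re.compile(r"(?P<Y>\d{1,4})/(?P<m>\d{1,2})/(?P<d>\d{1,2})")


def accepted(regex, text):
    """Return True if the whole string matches the given date regex."""
    return regex.fullmatch(text) is not None


for sample in ["2024/09/24", "24/09/24", "1/10/1"]:
    print(sample, accepted(STRICT_DATE, sample), accepted(LAX_DATE, sample))
# 2024/09/24 True True
# 24/09/24 False True
# 1/10/1 False True
```

With the lax rule, the two ambiguous inputs would be accepted as years 24 and 1 instead of being rejected.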

@zuo
Contributor

zuo commented Oct 3, 2024

Let me also emphasize that strptime's %Y and %y are explicitly documented as requiring the leading zeros.

(And there are no such statements in the docs when it comes to %c and %x; only that we deal with a locale’s appropriate representation.)
