Skip to content

BUG: pd.offsets.MonthEnd() delevers wrong month #48779

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
pjousma opened this issue Sep 26, 2022 · 32 comments · Fixed by #49958
Closed
3 tasks done

BUG: pd.offsets.MonthEnd() delevers wrong month #48779

pjousma opened this issue Sep 26, 2022 · 32 comments · Fixed by #49958
Assignees

Comments

@pjousma
Copy link

pjousma commented Sep 26, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
ts = pd.Timestamp(2017, 1, 31)
ts + pd.offsets.MonthEnd()

Issue Description

I expect that the answer is: Timestamp('2017-01-31 00:00:00')
but I get: Timestamp('2017-02-28 00:00:00')

Expected Behavior

Timestamp('2017-01-31 00:00:00')

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.9.12.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Dutch_Netherlands.1252

pandas : 1.4.2
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 61.2.0
Cython : 0.29.28
pytest : 7.1.1
hypothesis : None
sphinx : 4.4.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.8.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 8.2.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.4
brotli :
fastparquet : None
fsspec : 2022.02.0
gcsfs : None
markupsafe : 2.0.1
matplotlib : 3.5.1
numba : 0.55.1
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.7.3
snappy :
sqlalchemy : 1.4.32
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.20.1
xlrd : 2.0.1
xlwt : None
zstandard : None

@pjousma pjousma added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 26, 2022
@MarcoGorelli
Copy link
Member

Thanks for the report

MonthEnd goes to the next date which is an end of the month, so this seems correct

You probably want MonthEnd(0) to get the end of the current month

Might be good to add this as an example to the docs as it seems to be a common source of confusion https://stackoverflow.com/questions/37354105/find-the-end-of-the-month-of-a-pandas-dataframe-series

@MarcoGorelli MarcoGorelli added good first issue Frequency DateOffsets and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 26, 2022
@ZDenizYStenhaug
Copy link

Hi! :) I'd love to help with the documentation if this issue is still relevant. If so, could you please assign it to me?
Thanks, have a good day.

@MarcoGorelli
Copy link
Member

thanks, go ahead! if you comment "take" the issue will be assigned to you, do reach out if you have questions

@ZDenizYStenhaug
Copy link

take

@ChhaSahil
Copy link

take

@jtimko16
Copy link

jtimko16 commented Oct 9, 2022

Hello, I am looking for my very first contribution. Is there still something I can help will (I would probably need a little bit of guidance). Thanks

@MarcoGorelli
Copy link
Member

@jtimko16
Copy link

take

@pjousma
Copy link
Author

pjousma commented Oct 11, 2022 via email

@jtimko16
Copy link

Thanks, I found a equivalent solution: ts.replace(day=1) + pd.offsets.MonthEnd(0) For me it is not logical to go to the next month when no parameters are given. For me it is logical that the default parameter should be '0' when no parameters are given by the user. But thanks for the quick response Op ma 26 sep. 2022 om 10:51 schreef Marco Edward Gorelli < @.>:

Thanks for the report MonthEnd goes to the next date which is an end of the month, so this seems correct You probably want MonthEnd(0) to get the end of the current month Might be good to add this as an example to the docs as it seems to be a common source of confusion https://stackoverflow.com/questions/37354105/find-the-end-of-the-month-of-a-pandas-dataframe-series — Reply to this email directly, view it on GitHub <#48779 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/A3IXQLDZWOMAEFHTFZOVUQDWAFPZDANCNFSM6AAAAAAQVRXV34 . You are receiving this because you authored the thread.Message ID: @.
>

Hello, pjousma

Thanks for your comment. I've been thinking about it, and also agree, that the default n parameter should be zero. And also the same for pandas.tseries.offsets.YearEnd() function (it has the same issue).

Who can decide whether we should change parameter to zero or keep it as is?

Thanks,
J.

@MarcoGorelli
Copy link
Member

this'd be a breaking change and would have to go through a deprecation cycle, not sure it'd be worth it. Would be more in favour of just clearly documenting the behaviour

@jtimko16
Copy link

this'd be a breaking change and would have to go through a deprecation cycle, not sure it'd be worth it. Would be more in favour of just clearly documenting the behaviour

Ok, ok. I am going to update the documentation in the next few days. I'll let you know in case of any questions/issues. Thanks.

@jtimko16
Copy link

Hi @MarcoGorelli , please may I ask you for a clarification.

Do I understand correctly, I should is update this page of documentation?: https://pandas.pydata.org/docs/reference/api/pandas.tseries.offsets.MonthEnd.html

Therefore I'll be modifying this part of code
https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/tslibs/offsets.pyx (class MonthEnd; starting from row 2746 - I am going to write more examples into the comment block below). Is it correct? Thanks

image

@MarcoGorelli
Copy link
Member

yeah looks fine

@jtimko16
Copy link

I created a Pull Request #49133

However as it is my very Pull Request, therefore for now, I marked it as a draft. Please can you look at it and give me some feedback? Thanks

@pjousma
Copy link
Author

pjousma commented Oct 19, 2022 via email

@jtimko16
Copy link

@pjousma Yes, I agree that the current behavior is little bit confusing, and what you suggested is definitely more logical.

However, I believe that as quick immediate solution we can document current logic (if everything goes well we can close this PR in next several days) and in longer term we can open discussion about changing functionality of the 'n' parameter. Not sure about processes here in Panda, but it make take some time to approve an implement the change.

@MarcoGorelli
Copy link
Member

Yeah changing behaviour shouldn't be taken lightly - for a start, would there be a way to communicate the change to users in such a way as to not flood them with warnings every time they use MonthEnd?

But certainly it should be clearly documented with examples so thanks @honza26 for taking this on!

@jtimko16 jtimko16 removed their assignment Oct 29, 2022
@jtimko16
Copy link

Hello, unfortunately I am facing technical issues with setting up environment and will not be able to finish (my laptop is quite old and slow, therefore cannot install a lot of tools that will slow it done further)

PR in progress for anyone who want to take it from there:
#49133

If any question, feel free to reach out. And @MarcoGorelli thanks a lot for help and guidance with it (although I didn't manage to cross the finishing line).

Cheers,
Jan

@MarcoGorelli
Copy link
Member

hey @honza26 - if you want to come to the next new contributors meeting, you could share you screen and we could resolve your technical issues with you?

@justin4480
Copy link

take

@justin4480
Copy link

justin4480 commented Oct 29, 2022

Hi. I'm a long time user of pandas, but this is my first contribution. Can I update both the docstring and the docs?

I plan to:

  1. make clear that normalize is an ignored parameter
  2. clarify the unintuitive logic when MonthEnd n > 0 and added to a Timestamp not at monthend

normalize is an ignored parameter

docstring states (class) Docstring shows MonthEnd(n: int = ..., normalize: bool = ...), however normalize doesn't appear to do anything:

for date in pd.date_range(start='2000-01-01', end='2001-01-01'):
    for n in [-1, 0, 1, 2]:
        a = date + pd.offsets.MonthEnd(n=n, normalize=True)
        b = date + pd.offsets.MonthEnd(n=n, normalize=False)
        assert a == b

unintuitive logic

for n in range(-2, 3):
    print(f"\n{n=}")
    print(f"2000-05-01 -> {(pd.Timestamp('2000-05-01') + pd.offsets.MonthEnd(n=n)).strftime('%Y-%m-%d')}")
    print(f"2000-05-15 -> {(pd.Timestamp('2000-05-15') + pd.offsets.MonthEnd(n=n)).strftime('%Y-%m-%d')}")
    print(f"2000-05-31 -> {(pd.Timestamp('2000-05-31') + pd.offsets.MonthEnd(n=n)).strftime('%Y-%m-%d')}")
# n=-2
+ 2000-05-01 -> 2000-03-31
+ 2000-05-15 -> 2000-03-31
+ 2000-05-31 -> 2000-03-31

# n=-1
+ 2000-05-01 -> 2000-04-30
+ 2000-05-15 -> 2000-04-30
+ 2000-05-31 -> 2000-04-30

# n=0
+ 2000-05-01 -> 2000-05-31
+ 2000-05-15 -> 2000-05-31
+ 2000-05-31 -> 2000-05-31

# n=1
! 2000-05-01 -> 2000-05-31
! 2000-05-15 -> 2000-05-31
+ 2000-05-31 -> 2000-06-30

# n=2
! 2000-05-01 -> 2000-06-30
! 2000-05-15 -> 2000-06-30
+ 2000-05-31 -> 2000-07-31

I appreciate the breaking changes impact, but this really does feel like a bug to me as I would expect MonthEnd() should behave like excels =EOMONTH function.

Regardless, I will at least document as clearly as possible with examples, but it's going to be hard to explain away given results are not even consistent, i.e. n = -1 Vs n = +1 (as shown above).

@jtimko16
Copy link

hey @honza26 - if you want to come to the next new contributors meeting, you could share you screen and we could resolve your technical issues with you?

Thanks for your offer. I will try to build the environment again, and potentially join the next contributors meeting.

@jtimko16
Copy link

@justin4480 Thanks for taking this further. A good initiative!

A good explanation of general offset logic is here: (in case it helps)
https://pandas.pydata.org/docs/user_guide/timeseries.html

image

@MarcoGorelli
Copy link
Member

That's a good reference, it's just not very easy to discover from the docstrings.

I'd be all for just adding a simple example to the docstrings of each of those offsets (MonthEnd, MonthBegin, WeekEnd)

@justin4480
Copy link

@justin4480 Thanks for taking this further. A good initiative!

A good explanation of general offset logic is here: (in case it helps) https://pandas.pydata.org/docs/user_guide/timeseries.html

image

@honza26 Thanks for sending this over. This was very helpful in understanding the logic & design intent!

However, one doubt for me remains regarding the normalize parameter. The docs claim:
image
source: https://pandas.pydata.org/docs/user_guide/timeseries.html#parametric-offsets

However the below runs perfectly fine.

for dt in pd.date_range(start='2000-01-01', end='2001-01-01'):
    for n in range(-2, 3):
        a = dt + pd.offsets.MonthEnd(n=n, normalize=True)
        b = dt + pd.offsets.MonthEnd(n=n, normalize=False)
        assert a == b
        c = dt - pd.offsets.MonthEnd(n=n, normalize=True)
        d = dt - pd.offsets.MonthEnd(n=n, normalize=False)
        assert c == d

@jtimko16
Copy link

jtimko16 commented Oct 30, 2022

@justin4480
If normalize is True, then datetime is normalized to midnight. If you dont have any 'time' part of in your date/datetime variables, it makes no difference.

My examples to show functionality:

`dt_now = datetime.datetime.now()
print(f'Datetime now: {dt_now}\n')

a = dt_now + pd.offsets.MonthEnd(n=1, normalize=True)
b = dt_now + pd.offsets.MonthEnd(n=1, normalize=False)

print(f'Offset to MonthEnd; normalize = True: {a}')
print(f'Offset to MonthEnd; normalize = False: {b}')
`

And output is:

`Datetime now: 2022-10-30 20:42:39.526462

Offset to MonthEnd; normalize = True: 2022-10-31 00:00:00
Offset to MonthEnd; normalize = False: 2022-10-31 20:42:39.526462`

@justin4480 justin4480 removed their assignment Nov 7, 2022
@justin4480
Copy link

Sorry I was determined to do this, but have spent far too long trying to get my environment setup. I tried both the recommended docker & mamba dev environment setup, but just couldn't get the pytest suite to run without tones of errors (before making any code edits). I will revisit when I have more time as I do want to contribute, but have unassigned this particular issue for now.

@zemnly
Copy link
Contributor

zemnly commented Nov 12, 2022

Hi, I'd like to take a shot at this. Gonna be my first contribution here. I've been using Pandas on and off for my job since the past 3 years (:

@zemnly
Copy link
Contributor

zemnly commented Nov 12, 2022

take

@jtimko16
Copy link

@zemnly Nice :) Good luck!!

@zemnly zemnly removed their assignment Nov 13, 2022
@natmokval
Copy link
Contributor

Hi, I am working on this issue as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants