Skip to content

Incorrect resampling from hourly to monthly values #2665

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Cd48 opened this issue Jan 9, 2013 · 11 comments
Closed

Incorrect resampling from hourly to monthly values #2665

Cd48 opened this issue Jan 9, 2013 · 11 comments
Assignees
Labels
Bug Datetime Datetime data dtype Resample resample method

Comments

@Cd48
Copy link

Cd48 commented Jan 9, 2013

EDIT: Modern reproducible example:

In [1]: import pandas as pd

In [2]: dates = pd.date_range('3/20/2000', '5/20/2000', freq='1h')

In [4]: ts = pd.Series(range(len(dates)), index=dates)

In [5]: def start(x):
   ...:     return x.index[0]
   ...:
   ...: def end(x):
   ...:     return x.index[-1]
   ...:

In [6]: ts.resample('1M', closed='right', label='right').apply(start)
Out[6]:
2000-03-31   2000-03-20
2000-04-30   2000-04-01
2000-05-31   2000-05-01
Freq: M, dtype: datetime64[ns]

In [7]: ts.resample('1M', closed='right', label='right').apply(end)
Out[7]:
2000-03-31   2000-03-31 23:00:00
2000-04-30   2000-04-30 23:00:00
2000-05-31   2000-05-20 00:00:00
Freq: M, dtype: datetime64[ns]

In [8]: ts.resample('1M', closed='left', label='right').apply(start)
Out[8]:
2000-03-31   2000-03-20
2000-04-30   2000-03-31
2000-05-31   2000-04-30
Freq: M, dtype: datetime64[ns]

In [9]: ts.resample('1M', closed='left', label='right').apply(end)
Out[9]:
2000-03-31   2000-03-30 23:00:00
2000-04-30   2000-04-29 23:00:00
2000-05-31   2000-05-20 00:00:00
Freq: M, dtype: datetime64[ns]

Expected behavior: #2665 (comment)

There seems to be something wrong when resampling hourly values to monthly values. The 'closed=' argument does not do what it should. I am using pandas 0.10.0.

import pandas as pd
from pandas import TimeSeries
import numpy as np

# Timeseries with hourly values
dates = pd.date_range('3/20/2000', '5/20/2000', freq='1h')
ts = TimeSeries(np.arange(len(dates)), index=dates)

# I want monthly values equal to the summed values of the hours in that month
# With "closed=left" I want the hourly value (1 value) at midnight to be 
# included in the month after that very midnight
data_monthly_1 = ts.resample('1M', how='sum', closed='left', label='right') 
sum_april1 = data_monthly_1.ix[1]

# If we check this, the summed values per month do not match : 
data_hourly_april2 = ts[ts.index.month == 4]
sum_april2 = sum(data_hourly_april2)

# Instead, "closed=left" above resulted in an inclusion of the entire last day of the
# previous month (25 values !) (and exclusion of the last day of the month)
# This is also strange as I am nowhere concerned with using 'days'.
data_hourly_april3 = ts[   ((ts.index.month == 3) & (ts.index.day == 31)) | 
                           ((ts.index.month == 4) & (ts.index.day < 30))      ]
sum_april3 = sum(data_hourly_april3)

# PS: "closed=right" results in the answer I would expect for "closed = left":
data_monthly_4 = ts.resample('1M', how='sum', closed='right', label='right') 
sum_april4 = data_monthly_4.ix[1]

When resampling these hourly to daily values, there was no problem.

@changhiskhan
Copy link
Contributor

If you leave out the closed/label arguments, pandas will choose the right one for you:

In [27]: ts.resample('M', how='sum')
Out[27]: 
2000-03-31     41328
2000-04-30    466200
2000-05-31    564852
Freq: M

edit: I thought it about it some more and am not certain this is a bug. If you're thinking about it terms of monthly periods, you should leave out the closed argument and let pandas choose the right one. That's the easiest way.

If you want explicit control, because this is timestamp format, the monthend datetime is midnight of the last day of the month. This is so it'll play nice with daily frequencies. It's possible to hack it so IF the input data is intraday frequency then we adjust the bin edges, but it is often the case that you might have intraday data without an explicit frequency. So the result would be inconsistent behavior.

@ghost ghost assigned changhiskhan Jan 20, 2013
@changhiskhan
Copy link
Contributor

I'm moving this to 0.10.2 in case more discussion is needed/desired

@filmackay
Copy link

Isn't the above example results wrong - the dates indicated are actually midnight of the start of the relevant date. Is not the correct range for March 2000:

2000-03-01 0:00 (left)
2000-04-01 0:00 (right)

I would have thought mathematically, saying that a monthly period finishes on 2000-03-31 is incorrect, since this represents the midnight between 2000-03-30 and 2000-03-31, not the midnight after 2000-03-31 (which is in fact 2000-04-01 0:00).

?

@wesm
Copy link
Member

wesm commented Apr 12, 2013

Well that's look at this here:

def start(x):
    return x.index[0]

def end(x):
    return x.index[-1]

In [18]: ts.resample('1M', how=[start, end], closed='right', label='right')
Out[18]: 
                         start                 end
2000-03-31 2000-03-20 00:00:00 2000-03-31 23:00:00
2000-04-30 2000-04-01 00:00:00 2000-04-30 23:00:00
2000-05-31 2000-05-01 00:00:00 2000-05-20 00:00:00

In [19]: ts.resample('1M', how=[start, end], closed='left', label='right')
Out[19]: 
                         start                 end
2000-03-31 2000-03-20 00:00:00 2000-03-30 23:00:00
2000-04-30 2000-03-31 00:00:00 2000-04-29 23:00:00
2000-05-31 2000-04-30 00:00:00 2000-05-20 00:00:00

So I think the closed='left' behavior is what you want. I think there's a tweak in the code to account for the convention of using the last day of the month at midnight (even though the month isn't "technically" over) as the date.

Closing this issue ...

@kdebrab
Copy link
Contributor

kdebrab commented Nov 24, 2013

This issue keeps on coming back... in my eyes this is not yet ready to be closed...

@changhiskhan, leaving out the closed argument may be ok when you are dealing with (intraday) source data that has start time stamp (or left) labels, but it is not giving correct results when your (intraday) source data has end time stamp (or right) labels.

@wesm, the closed='left' example you provide is not the desired behavior. On the contrary, it confirms the problem raised by @Cd48 and commented by @filmackay: the entire last day (24 values) is aggregated into the next month!

Instead, I believe the desired behavior should be:

In [18]: ts.resample('1M', how=[start, end], closed='right', label='right')
Out[18]: 
                         start                 end
2000-03-31 2000-03-20 00:00:00 2000-04-01 00:00:00
2000-04-30 2000-04-01 01:00:00 2000-05-01 00:00:00
2000-05-31 2000-05-01 01:00:00 2000-05-20 00:00:00

In [19]: ts.resample('1M', how=[start, end], closed='left', label='right')
Out[19]: 
                         start                 end
2000-03-31 2000-03-20 00:00:00 2000-03-31 23:00:00
2000-04-30 2000-04-01 00:00:00 2000-04-30 23:00:00
2000-05-31 2000-05-01 00:00:00 2000-05-20 00:00:00

I did some research into the code... Actually, the behavior for source data with start time stamp labels previously had a very similar problem (see http://stackoverflow.com/questions/11018120/bug-in-resampling-with-pandas-0-8). It was handled by introducing the following code (#1471):

    if self.freq != 'D' and is_superperiod(self.freq, 'D'):
        day_nanos = _delta_to_nanoseconds(timedelta(1))
        if self.closed == 'right':
            bin_edges = bin_edges + day_nanos - 1
        else:
            bin_edges = bin_edges + day_nanos

of which the else close was later removed (#1726).

Shouldn't the correct code be:

    if self.freq != 'D' and is_superperiod(self.freq, 'D'):
        day_nanos = _delta_to_nanoseconds(timedelta(1))
        if self.closed == 'right':
            bin_edges = bin_edges + day_nanos
        else:
            bin_edges = bin_edges + day_nanos - 1

That indeed results in the desired behavior shown above and also solves #5440. Though that raises again issue #1726... I think #1726 should be solved separately, maybe by adjusting the range edge first?

@jreback
Copy link
Contributor

jreback commented Nov 24, 2013

why don't u do a PR with the new code and some tests to validate?

@kdebrab
Copy link
Contributor

kdebrab commented Nov 24, 2013

It also involves changing the default behavior of closed.

Changing the default to closed='left' for all frequencies (as of now the default is already left for intraday frequencies), but keeping the default label as it is now (i.e. 'left' for intraday, 'right' otherwise) would result in the same default behaviour for most users.

@kdebrab
Copy link
Contributor

kdebrab commented Nov 24, 2013

I'm afraid a PR is too much for me...

I'm not accustomed with the PR process yet (and I also lack the time). Moreover, a solution for #1726 (and probably other issues) still needs to be found.

As a side note: the resample code looks to me quite complicated with already many hacks. Instead of adding another hack, maybe a profound (and time-consuming) code refactoring is desirable?

@jreback jreback reopened this Nov 24, 2013
@jreback
Copy link
Contributor

jreback commented Nov 24, 2013

ok will reopen this
not sure if this will be addressed in the near future so would appreciate a pr from you at some point as you appear pretty familiar

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Mar 9, 2014
@jreback jreback modified the milestones: 0.16.0, 0.17.0 Jan 26, 2015
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@MarcoGorelli
Copy link
Member

MarcoGorelli commented Feb 28, 2023

This looks correct to me

If we have

dates = pd.date_range('3/20/2000', '5/20/2000', freq='1h')
ts = pd.Series(range(len(dates)), index=dates)

and we do ts.resample('1M', closed='right', label='left'), then we have the bins:

  • '2000-02-29': ('2000-02-29', '2000-03-31']
  • '2000-03-31': ('2000-03-31', '2000-04-30']
  • '2000-04-30': ('2000-04-30', '2000-05-31']

If we do

dates = pd.date_range('3/20/2000', '5/20/2000', freq='1h')
ts = pd.Series(range(len(dates)), index=dates)

bins = [
    ('2000-02-29', '2000-03-31'),
    ('2000-03-31', '2000-04-30'),
    ('2000-04-30', '2000-05-31'),
]

# left, right
result = []
for left, right in bins:
    df = ts[(ts.index.normalize()>left)&(ts.index.normalize()<=right)].copy()
    result.append(df.sum())
print(result)
print(ts.resample('1M', label='left', closed='right').sum())

then result and ts.resample('1M', label='left', closed='right').sum() show the same numbers

I'm marking as 'closing candidate' for now then, but I'll double-check later

@MarcoGorelli
Copy link
Member

closing then, but please do let me know if I've misunderstood

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype Resample resample method
Projects
None yet
Development

No branches or pull requests

8 participants