Incorrect resampling from hourly to monthly values #2665

Cd48 · 2013-01-09T10:37:45Z

EDIT: Modern reproducible example:

In [1]: import pandas as pd

In [2]: dates = pd.date_range('3/20/2000', '5/20/2000', freq='1h')

In [4]: ts = pd.Series(range(len(dates)), index=dates)

In [5]: def start(x):
   ...:     return x.index[0]
   ...:
   ...: def end(x):
   ...:     return x.index[-1]
   ...:

In [6]: ts.resample('1M', closed='right', label='right').apply(start)
Out[6]:
2000-03-31   2000-03-20
2000-04-30   2000-04-01
2000-05-31   2000-05-01
Freq: M, dtype: datetime64[ns]

In [7]: ts.resample('1M', closed='right', label='right').apply(end)
Out[7]:
2000-03-31   2000-03-31 23:00:00
2000-04-30   2000-04-30 23:00:00
2000-05-31   2000-05-20 00:00:00
Freq: M, dtype: datetime64[ns]

In [8]: ts.resample('1M', closed='left', label='right').apply(start)
Out[8]:
2000-03-31   2000-03-20
2000-04-30   2000-03-31
2000-05-31   2000-04-30
Freq: M, dtype: datetime64[ns]

In [9]: ts.resample('1M', closed='left', label='right').apply(end)
Out[9]:
2000-03-31   2000-03-30 23:00:00
2000-04-30   2000-04-29 23:00:00
2000-05-31   2000-05-20 00:00:00
Freq: M, dtype: datetime64[ns]

Expected behavior: #2665 (comment)

There seems to be something wrong when resampling hourly values to monthly values. The 'closed=' argument does not do what it should. I am using pandas 0.10.0.

import pandas as pd
from pandas import TimeSeries
import numpy as np

# Timeseries with hourly values
dates = pd.date_range('3/20/2000', '5/20/2000', freq='1h')
ts = TimeSeries(np.arange(len(dates)), index=dates)

# I want monthly values equal to the summed values of the hours in that month
# With "closed=left" I want the hourly value (1 value) at midnight to be 
# included in the month after that very midnight
data_monthly_1 = ts.resample('1M', how='sum', closed='left', label='right') 
sum_april1 = data_monthly_1.ix[1]

# If we check this, the summed values per month do not match : 
data_hourly_april2 = ts[ts.index.month == 4]
sum_april2 = sum(data_hourly_april2)

# Instead, "closed=left" above resulted in an inclusion of the entire last day of the
# previous month (25 values !) (and exclusion of the last day of the month)
# This is also strange as I am nowhere concerned with using 'days'.
data_hourly_april3 = ts[   ((ts.index.month == 3) & (ts.index.day == 31)) | 
                           ((ts.index.month == 4) & (ts.index.day < 30))      ]
sum_april3 = sum(data_hourly_april3)

# PS: "closed=right" results in the answer I would expect for "closed = left":
data_monthly_4 = ts.resample('1M', how='sum', closed='right', label='right') 
sum_april4 = data_monthly_4.ix[1]

When resampling these hourly to daily values, there was no problem.

The text was updated successfully, but these errors were encountered:

changhiskhan · 2013-01-20T04:17:02Z

If you leave out the closed/label arguments, pandas will choose the right one for you:

In [27]: ts.resample('M', how='sum')
Out[27]: 
2000-03-31     41328
2000-04-30    466200
2000-05-31    564852
Freq: M

edit: I thought it about it some more and am not certain this is a bug. If you're thinking about it terms of monthly periods, you should leave out the closed argument and let pandas choose the right one. That's the easiest way.

If you want explicit control, because this is timestamp format, the monthend datetime is midnight of the last day of the month. This is so it'll play nice with daily frequencies. It's possible to hack it so IF the input data is intraday frequency then we adjust the bin edges, but it is often the case that you might have intraday data without an explicit frequency. So the result would be inconsistent behavior.

changhiskhan · 2013-01-21T20:28:47Z

I'm moving this to 0.10.2 in case more discussion is needed/desired

filmackay · 2013-03-08T03:33:53Z

Isn't the above example results wrong - the dates indicated are actually midnight of the start of the relevant date. Is not the correct range for March 2000:

2000-03-01 0:00 (left)
2000-04-01 0:00 (right)

I would have thought mathematically, saying that a monthly period finishes on 2000-03-31 is incorrect, since this represents the midnight between 2000-03-30 and 2000-03-31, not the midnight after 2000-03-31 (which is in fact 2000-04-01 0:00).

?

wesm · 2013-04-12T04:10:29Z

Well that's look at this here:

def start(x):
    return x.index[0]

def end(x):
    return x.index[-1]

In [18]: ts.resample('1M', how=[start, end], closed='right', label='right')
Out[18]: 
                         start                 end
2000-03-31 2000-03-20 00:00:00 2000-03-31 23:00:00
2000-04-30 2000-04-01 00:00:00 2000-04-30 23:00:00
2000-05-31 2000-05-01 00:00:00 2000-05-20 00:00:00

In [19]: ts.resample('1M', how=[start, end], closed='left', label='right')
Out[19]: 
                         start                 end
2000-03-31 2000-03-20 00:00:00 2000-03-30 23:00:00
2000-04-30 2000-03-31 00:00:00 2000-04-29 23:00:00
2000-05-31 2000-04-30 00:00:00 2000-05-20 00:00:00

So I think the closed='left' behavior is what you want. I think there's a tweak in the code to account for the convention of using the last day of the month at midnight (even though the month isn't "technically" over) as the date.

Closing this issue ...

kdebrab · 2013-11-24T13:44:10Z

This issue keeps on coming back... in my eyes this is not yet ready to be closed...

@changhiskhan, leaving out the closed argument may be ok when you are dealing with (intraday) source data that has start time stamp (or left) labels, but it is not giving correct results when your (intraday) source data has end time stamp (or right) labels.

@wesm, the closed='left' example you provide is not the desired behavior. On the contrary, it confirms the problem raised by @Cd48 and commented by @filmackay: the entire last day (24 values) is aggregated into the next month!

Instead, I believe the desired behavior should be:

In [18]: ts.resample('1M', how=[start, end], closed='right', label='right')
Out[18]: 
                         start                 end
2000-03-31 2000-03-20 00:00:00 2000-04-01 00:00:00
2000-04-30 2000-04-01 01:00:00 2000-05-01 00:00:00
2000-05-31 2000-05-01 01:00:00 2000-05-20 00:00:00

In [19]: ts.resample('1M', how=[start, end], closed='left', label='right')
Out[19]: 
                         start                 end
2000-03-31 2000-03-20 00:00:00 2000-03-31 23:00:00
2000-04-30 2000-04-01 00:00:00 2000-04-30 23:00:00
2000-05-31 2000-05-01 00:00:00 2000-05-20 00:00:00

I did some research into the code... Actually, the behavior for source data with start time stamp labels previously had a very similar problem (see http://stackoverflow.com/questions/11018120/bug-in-resampling-with-pandas-0-8). It was handled by introducing the following code (#1471):

    if self.freq != 'D' and is_superperiod(self.freq, 'D'):
        day_nanos = _delta_to_nanoseconds(timedelta(1))
        if self.closed == 'right':
            bin_edges = bin_edges + day_nanos - 1
        else:
            bin_edges = bin_edges + day_nanos

of which the else close was later removed (#1726).

Shouldn't the correct code be:

    if self.freq != 'D' and is_superperiod(self.freq, 'D'):
        day_nanos = _delta_to_nanoseconds(timedelta(1))
        if self.closed == 'right':
            bin_edges = bin_edges + day_nanos
        else:
            bin_edges = bin_edges + day_nanos - 1

That indeed results in the desired behavior shown above and also solves #5440. Though that raises again issue #1726... I think #1726 should be solved separately, maybe by adjusting the range edge first?

jreback · 2013-11-24T14:00:19Z

why don't u do a PR with the new code and some tests to validate?

kdebrab · 2013-11-24T14:15:19Z

It also involves changing the default behavior of closed.

Changing the default to closed='left' for all frequencies (as of now the default is already left for intraday frequencies), but keeping the default label as it is now (i.e. 'left' for intraday, 'right' otherwise) would result in the same default behaviour for most users.

kdebrab · 2013-11-24T15:51:01Z

I'm afraid a PR is too much for me...

I'm not accustomed with the PR process yet (and I also lack the time). Moreover, a solution for #1726 (and probably other issues) still needs to be found.

As a side note: the resample code looks to me quite complicated with already many hacks. Instead of adding another hack, maybe a profound (and time-consuming) code refactoring is desirable?

jreback · 2013-11-24T16:49:13Z

ok will reopen this
not sure if this will be addressed in the near future so would appreciate a pr from you at some point as you appear pretty familiar

MarcoGorelli · 2023-02-28T16:29:12Z

This looks correct to me

If we have

dates = pd.date_range('3/20/2000', '5/20/2000', freq='1h')
ts = pd.Series(range(len(dates)), index=dates)

and we do ts.resample('1M', closed='right', label='left'), then we have the bins:

'2000-02-29': ('2000-02-29', '2000-03-31']
'2000-03-31': ('2000-03-31', '2000-04-30']
'2000-04-30': ('2000-04-30', '2000-05-31']

If we do

dates = pd.date_range('3/20/2000', '5/20/2000', freq='1h')
ts = pd.Series(range(len(dates)), index=dates)

bins = [
    ('2000-02-29', '2000-03-31'),
    ('2000-03-31', '2000-04-30'),
    ('2000-04-30', '2000-05-31'),
]

# left, right
result = []
for left, right in bins:
    df = ts[(ts.index.normalize()>left)&(ts.index.normalize()<=right)].copy()
    result.append(df.sum())
print(result)
print(ts.resample('1M', label='left', closed='right').sum())

then result and ts.resample('1M', label='left', closed='right').sum() show the same numbers

I'm marking as 'closing candidate' for now then, but I'll double-check later

MarcoGorelli · 2023-02-28T17:38:05Z

closing then, but please do let me know if I've misunderstood

ghost assigned changhiskhan Jan 20, 2013

wesm closed this as completed Apr 12, 2013

Cd48 mentioned this issue Nov 5, 2013

resampling closed='left' incorrect ? #5440

Closed

jreback reopened this Nov 24, 2013

jreback modified the milestones: 0.15.0, 0.14.0 Mar 9, 2014

jreback added the Resample label Mar 9, 2014

jreback modified the milestones: 0.16.0, 0.17.0 Jan 26, 2015

shoyer mentioned this issue Mar 4, 2015

Resampling uses inconsistent labeling for sub-daily and super-daily frequencies #9586

Closed

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

MarcoGorelli closed this as completed Feb 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect resampling from hourly to monthly values #2665

Incorrect resampling from hourly to monthly values #2665

Cd48 commented Jan 9, 2013 •

edited by mroeschke

Loading

changhiskhan commented Jan 20, 2013

changhiskhan commented Jan 21, 2013

filmackay commented Mar 8, 2013

wesm commented Apr 12, 2013

kdebrab commented Nov 24, 2013

jreback commented Nov 24, 2013

kdebrab commented Nov 24, 2013

kdebrab commented Nov 24, 2013

jreback commented Nov 24, 2013

MarcoGorelli commented Feb 28, 2023 •

edited

Loading

MarcoGorelli commented Feb 28, 2023

Incorrect resampling from hourly to monthly values #2665

Incorrect resampling from hourly to monthly values #2665

Comments

Cd48 commented Jan 9, 2013 • edited by mroeschke Loading

changhiskhan commented Jan 20, 2013

changhiskhan commented Jan 21, 2013

filmackay commented Mar 8, 2013

wesm commented Apr 12, 2013

kdebrab commented Nov 24, 2013

jreback commented Nov 24, 2013

kdebrab commented Nov 24, 2013

kdebrab commented Nov 24, 2013

jreback commented Nov 24, 2013

MarcoGorelli commented Feb 28, 2023 • edited Loading

MarcoGorelli commented Feb 28, 2023

Cd48 commented Jan 9, 2013 •

edited by mroeschke

Loading

MarcoGorelli commented Feb 28, 2023 •

edited

Loading