Skip to content

ENH: Effecient resample/reindex by index #7267

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
shura-v opened this issue May 28, 2014 · 3 comments
Open

ENH: Effecient resample/reindex by index #7267

shura-v opened this issue May 28, 2014 · 3 comments

Comments

@shura-v
Copy link

shura-v commented May 28, 2014

Here is the idea. I have predefined index and I'm grouping by timestamps in it, aggregating preceding values of the interval:

index = DatetimeIndex([datetime(2014, 1, 11, 11, 30), datetime(2014, 1, 12, 15), datetime(2014, 1, 15, 22)])

s = Series({
    datetime(2014, 1, 10, 0): 1,
    datetime(2014, 1, 11, 10): 2,
    datetime(2014, 1, 11, 11): 3,
    datetime(2014, 1, 12, 12): 3,
    datetime(2014, 1, 15, 16): 4,
    datetime(2014, 1, 15, 22): 5
})

print s.groupby(lambda x: index[(index >= x).argmax()]).sum()

2014-01-11 11:30:00    6
2014-01-12 15:00:00    3
2014-01-15 22:00:00    9

I don't know how effecient this code, but I think it might be a very useful feature as long as "resample" accepts only offsets or deltas. And "reindex" method doesn't have 'how' argument.

@jreback
Copy link
Contributor

jreback commented May 28, 2014

so this is defacto doing this:

Essentially you are grouping by a Categorical give then time

In [75]: def ct(x):
   ....:     return x.hour*3600+x.minute*60+x.second
   ....: 

In [77]: pd.cut(map(ct,s.index.time),bins=[0] + map(ct,index.time))
Out[77]: 
            NaN
     (0, 41400]
     (0, 41400]
 (41400, 54000]
 (54000, 79200]
 (54000, 79200]
Levels (3): Index(['(0, 41400]', '(41400, 54000]', '(54000, 79200]'], dtype=object)

In [78]: s.groupby(pd.cut(map(ct,s.index.time),bins=[0] + map(ct,index.time))).sum()
Out[78]: 
(0, 41400]        5
(41400, 54000]    3
(54000, 79200]    9
dtype: int64

And if we had a properly constructed TimeIndex this would work nicely (ignore edge effects).
related: #4023

@jreback jreback added this to the 0.15.0 milestone May 28, 2014
@shura-v
Copy link
Author

shura-v commented May 30, 2014

Series meant to be something like this:

s = Series({
    datetime(2014, 1, 10, 0): 1,
    datetime(2014, 1, 11, 10): 2,
    datetime(2014, 1, 11, 11): 3,
    datetime(2014, 1, 12, 12): 3,
    datetime(2014, 1, 13, 1): 100,
    datetime(2014, 1, 15, 16): 4,
    datetime(2014, 1, 15, 22): 5
})

But I got the idea, thanks. Anyway, I think it would extremely useful feature that would resample by index (with presence of 'closed', 'label' and 'how' arguments) using some high-level method.

@shura-v
Copy link
Author

shura-v commented May 30, 2014

Here is the solution I came up with, based on your suggestion.
Think of index_to as some random (predefined) DatetimeIndex:

index_from = pd.date_range('2009-01-01', '2016-01-01', freq='T')
s = pd.Series(np.arange(len(index_from)), index_from)

index_to = pd.date_range('2010-01-01', '2015-05-30', freq='H')
bins = index_to.insert(0, datetime.fromtimestamp(0)).astype(np.long)
labels = pd.cut(s.index.astype(np.long), bins=bins).labels

agg = s.groupby(labels).last().ix[0:]
pd.Series(agg.values, index_to)

2010-01-01 00:00:00    525600
2010-01-01 01:00:00    525660
2010-01-01 02:00:00    525720
2010-01-01 03:00:00    525780
2010-01-01 04:00:00    525840
2010-01-01 05:00:00    525900
2010-01-01 06:00:00    525960
2010-01-01 07:00:00    526020
2010-01-01 08:00:00    526080
2010-01-01 09:00:00    526140
2010-01-01 10:00:00    526200
2010-01-01 11:00:00    526260
2010-01-01 12:00:00    526320
2010-01-01 13:00:00    526380
2010-01-01 14:00:00    526440
...
2015-05-29 10:00:00    3368760
2015-05-29 11:00:00    3368820
2015-05-29 12:00:00    3368880
2015-05-29 13:00:00    3368940
2015-05-29 14:00:00    3369000
2015-05-29 15:00:00    3369060
2015-05-29 16:00:00    3369120
2015-05-29 17:00:00    3369180
2015-05-29 18:00:00    3369240
2015-05-29 19:00:00    3369300
2015-05-29 20:00:00    3369360
2015-05-29 21:00:00    3369420
2015-05-29 22:00:00    3369480
2015-05-29 23:00:00    3369540
2015-05-30 00:00:00    3369600
Freq: H, Length: 47401

I think it's the same as s.reindex(index_to, method='pad'), but with ability to use any aggregate function instead of just picking last value.

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@datapythonista datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018
@mroeschke mroeschke added Enhancement Groupby and removed API Design Dtype Conversions Unexpected or buggy dtype conversions Indexing Related to indexing on series/frames, not to indexes themselves labels Apr 11, 2021
@mroeschke mroeschke removed this from the Someday milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants