Strange Behavior of mean() with timedelta64 #9442
Comments
@tjcrone please add an example of the output you see from a command and what exactly you expected. |
When the dataframe is very large, in the case of the first example, the output is:
I expect the mean in tdiff to mirror fdiff. When the dataframe is not large we get:
That is the expected result. The second column is calculated by taking the seconds and converting them to floats. I believe the timedelta64 values may be overflowing inside the mean function. |
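A minimal sketch of the suspected overflow, using plain NumPy rather than the original dataframe (which is not shown here). The data is hypothetical: 30,000 identical 5-day gaps held as int64 nanoseconds, which is how timedelta64[ns] values are stored. It only illustrates that an int64 sum of nanoseconds can wrap around while a float-based mean does not; it is not the pandas code path itself.

```python
import numpy as np

# Hypothetical data: 30,000 gaps of exactly 5 days, as int64 nanoseconds.
FIVE_DAYS_NS = 5 * 24 * 60 * 60 * 10**9
n = 30_000
ns = np.full(n, FIVE_DAYS_NS, dtype=np.int64)

# Exact sum with Python's arbitrary-precision integers.
exact_sum = n * FIVE_DAYS_NS

# The same sum accumulated in int64 wraps around silently.
wrapped_sum = int(ns.sum())

# A mean built on the wrapped sum is wrong (here it even turns negative).
naive_mean_ns = wrapped_sum // n

# Converting to float64 first avoids the wraparound, at the cost of
# float rounding for very large totals.
float_mean_ns = ns.astype(np.float64).mean()

print("exact sum exceeds int64 max:", exact_sum > np.iinfo(np.int64).max)
print("naive mean (ns):", naive_mean_ns)
print("float mean (ns):", float_mean_ns)
```

With a mean near 5 days, the int64 limit of roughly 106,751 days is reached after a little over 21,000 rows, which would fit the report that only very large dataframes are affected.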
Ah, yes. This does look like an overflow issue. In fact, we do not allow aggregations like mean for datetime64 objects for exactly this sort of reason:
In this case, it looks like we have only inconsistently disabled it for timedelta64. Compare:
So the immediate consistency fix would be to make aggregation on timedelta64 objects always raise an error or skip. In the long term, it would be nice to fully support appropriate aggregation operations for datetime and timedelta types. Let me know if you're interested in working on that... |
You showed that the mean of a datetime64 object is not returned. That seems appropriate. With timedelta64 objects, an incorrect value is returned when calling mean directly. I don't see any inconsistencies, just incorrect values:

In [2]: df.tdiff.mean()
Out[2]:
Timedelta('0 days 02:40:53.248336')

What is especially strange is that std() returns the correct value, because std() should require the correct mean value:

In [4]: df.tdiff.std()
Out[4]:
Timedelta('2 days 21:06:29.824063')

A correct calculation of the mean for timedeltas is a one-liner. Please feel free to incorporate:

In [12]: pd.to_timedelta(df.tdiff.apply(lambda x: float(x.item())).sum()/len(df.tdiff.index))
Out[12]:
Timedelta('4 days 23:33:47.520090') |
This is because the precision is at the ns level, and when summing these values it is pretty easy to overflow. As a workaround, you can change the precision and it will work.
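One way to read the precision workaround is sketched below with NumPy. The helper name `coarse_mean` and its `factor` parameter are hypothetical, not part of pandas or NumPy: the int64 nanosecond values are scaled down before summing and the result is scaled back up afterwards, so the intermediate sum stays well inside the int64 range. Precision below the chosen factor is lost.

```python
import numpy as np

def coarse_mean(td, factor=1_000_000):
    """Mean of a timedelta64 array, summed at a coarser resolution.

    Hypothetical helper for illustration. With the default factor the
    values are reduced from nanoseconds to milliseconds before summing,
    so sub-millisecond detail is dropped.
    """
    # View the timedeltas as their underlying int64 nanosecond counts.
    i8 = td.astype("timedelta64[ns]").astype(np.int64)
    # Scale down first so the running sum is far less likely to overflow.
    scaled = i8 // factor
    mean_scaled = int(scaled.sum()) // scaled.size
    # Scale back up to nanoseconds for the result.
    return np.timedelta64(mean_scaled * factor, "ns")

# Hypothetical data: 30,000 gaps of 5 days each, enough for the raw
# nanosecond sum to exceed the int64 range.
td = np.full(30_000, 5 * 24 * 60 * 60 * 10**9, dtype=np.int64).astype("timedelta64[ns]")
print(coarse_mean(td))
```

A larger factor buys more headroom against overflow but truncates more of each value, so the choice depends on how long the series is and how much resolution the mean needs.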
@tjcrone you generally do not want to use |
@tjcrone if you are interested in doing a pull request, I would be OK with doing something like the above (it's actually easy as you don't have to worry about conversions: just divide the i8 values by, say, 1e6 before the mean and multiply back by the same factor after it), in |
BUG: Bug in incorrection computation of .mean() on timedelta64[ns] because of overflow #9442
xref #6549
When a dataframe with a timedelta64 column is very large, mean() does not work as expected. In the following code, the mean is incorrect but all the other stats are fine:
After making the dataframe shorter by changing the start date in the above example:
the correct result is obtained. (The mean of the timedelta should be about 5 days.) Is this an open bug?
Version: