specify multiple aggregation functions #3
I guess we should close this. Reasons:

In terms of syntax, we're not benefiting a lot. It would always be possible to also write:

Our call syntax for

In terms of speed, it only makes sense to talk about the fast numba (and weave) options. If you're concerned about speed, you won't take the slower versions into consideration anyway. Let's also leave weave aside, as I'd consider it kind of deprecated anyway. So, talking about the numba implementation, what would have to be changed in order to actually do that:

Both things come with major speed concerns. While a plain sum, a plain mean etc. will already be compiled when run, a sum-mean combo or anything more complex usually won't. Having done some benchmarking lately, the compile time is orders of magnitude higher than running the function itself on a mid-sized array like in our benchmarks. Also, the loop over the given functions is a big speed hindrance, as every loop comes with its own overhead. Inlining might be possible; I'm not sure to what extent it happens automatically, but in the worst case you'd have to create separate functions for 2, 3, or however many given functions. And that would make the already not-so-simple code in the numba implementation even uglier. So, concluding, I guess the negatives outweigh the benefits. I'd also like to close any open API questions, aiming for a 1.0 version. Let's get this off the table!
Do any of the aggregates allow multiple funcs to be passed in a single call? Also, even if multiple reduction funcs were passed, would it even be faster than looping?
Thanks.
No, we had thought about implementing that, but in the end we decided to keep it as is, since for the majority of functions there would be no speed gain from this batching.
Discussed initially in this other issue.
Design of syntax
Suggested syntax for input:
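One hypothetical sketch of how such a multi-function spec could be normalised (the name `parse_multi_funcs` and the accepted forms are assumptions inferred from the `field_names` comparison further down, not the actual proposal):

```python
# Hypothetical helper: the name and the accepted spec forms are assumptions,
# not the actual proposed API. Accepts 'min max', ['min', 'max'], or callables.
def parse_multi_funcs(funcs):
    """Normalise a multi-function spec to a list of function names."""
    if isinstance(funcs, str):
        # space- or comma-separated string of function names
        return funcs.replace(',', ' ').split()
    # list of names and/or callables
    return [f if isinstance(f, str) else f.__name__ for f in funcs]

parse_multi_funcs('min max')       # -> ['min', 'max']
parse_multi_funcs(['min', max])    # -> ['min', 'max']
```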
For output, you could argue that a dict, class or `namedtuple` would be a safer solution, as the user is less likely to mix up the order. Actually, it seems that `namedtuple` is probably a pretty good solution because it will naturally unpack if the user wants it to; otherwise the user can treat it like a class/dict. Incidentally, the `field_names` arg for `namedtuple` supports basically the same set of input syntaxes described above for `aggregate`. I guess you would need to dynamically generate the `namedtuple` classes based on the requested `funcs` list, and then store the class definitions in a little cache - but that's easy enough to do.

If output is a named tuple, the following is possible:
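A minimal sketch of the generate-and-cache idea, and of the unpacking behaviour it enables (`make_result_class` and `_tuple_cache` are illustrative names, not part of any existing API):

```python
from collections import namedtuple

# Illustrative sketch of caching dynamically generated namedtuple classes;
# names here are assumptions, not an existing API.
_tuple_cache = {}

def make_result_class(funcs):
    """Return a namedtuple class for the requested funcs, reusing
    previously generated classes via a small cache."""
    key = tuple(funcs)
    if key not in _tuple_cache:
        _tuple_cache[key] = namedtuple('AggregateResult', key)
    return _tuple_cache[key]

MinMax = make_result_class(['min', 'max'])
result = MinMax(min=1, max=9)
mn, mx = result            # unpacks naturally, like a plain tuple
assert result.min == 1     # or access by name, like a class/dict
assert make_result_class(['min', 'max']) is MinMax   # cache hit
```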
As previously discussed, the advantage of accepting multiple functions is that the aggregation machinery will then have scope for a variety of optimizations (though it could do the simplest thing and loop over each function in turn, or simply throw `NotImplemented`). Any alternative syntax (e.g. wrapping the aggregation into an object and then providing methods on that object in the way that `pandas` does) is likely to require some degree of explicit caching and a bunch of complex logic to deal with it, whereas this syntax should keep any logic reasonably straightforward and permit optimal use of memory.

Some suggestions on possible optimizations
In JIT-compiled implementations there is the option of squashing everything into a single pass over the array, which hopefully would offer very nice speed-ups. To get the most out of the cache it would probably make sense to arrange each of the multiple outputs for a given group contiguously in memory, e.g. for `min max` the output would be `[min_0 max_0 min_1 max_1 .. min_n max_n]`. Whether or not the output is produced in this manner will not actually be visible to the user, as they will only get views onto it, which can be provided with basic numpy indexing: `return custom_tuple(min=combined[:,0], max=combined[:,1])`. I think that will be safe, because the views into the `combined` array do not overlap, so the user could never accidentally overwrite one variable and unexpectedly affect another. One thing to note, though, is that different outputs may need different `dtype`s - it is still possible to achieve this with numpy ndarrays, but it's an extra implementation detail to get right.

In the `numpy` implementation there are a few functions which go nicely together: `sum mean var std` all start with `sum`, so that can be reused, etc.; `min max` can be run using `argsort(a)` combined with the first/last trick; `any all` both need to work on `a.astype(bool)`; and `allnan anynan` both need to work on `isnan(a)`.

Of course, if `nan-` versions are used, then that overhead only needs to be paid once for multiple functions. The same is true of `multi_ravel_index`, and potentially all bounds-checking only needs to be done once.

edit: and if custom functions are involved, then it's obviously easier to loop over each of them for a given array rather than doing all the sorting etc. for each function... although admittedly that has got to be a pretty rare use case, and the custom function could have dealt with it internally anyway... though I guess this is relevant to the pure-python implementation, as everything is done in that manner.
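A minimal sketch of these reuse ideas under stated assumptions (`multi_aggregate`, `anynan_allnan` and `Result` are illustrative names, not the library's actual API): the group sums are computed once and reused for `mean`/`var`/`std`, a single `isnan` pass serves both `anynan` and `allnan`, and the outputs are packed into one `combined` array with non-overlapping views handed back as a namedtuple:

```python
import numpy as np
from collections import namedtuple

# Illustrative sketch, not the library's actual API.
Result = namedtuple('Result', ['sum', 'mean', 'var', 'std'])

def multi_aggregate(group_idx, a):
    """Compute sum/mean/var/std per group, reusing the group sums."""
    counts = np.bincount(group_idx)              # group sizes, computed once
    sums = np.bincount(group_idx, weights=a)     # reused by mean/var/std
    sumsq = np.bincount(group_idx, weights=a * a)
    means = sums / counts
    var = sumsq / counts - means ** 2            # E[x^2] - E[x]^2
    # Pack outputs contiguously per group, then return non-overlapping views.
    combined = np.column_stack([sums, means, var, np.sqrt(var)])
    return Result(*(combined[:, i] for i in range(4)))

def anynan_allnan(group_idx, a):
    """Derive both anynan and allnan from a single isnan(a) pass."""
    mask = np.isnan(a)                           # computed once, used twice
    nan_counts = np.bincount(group_idx, weights=mask)
    return nan_counts > 0, nan_counts == np.bincount(group_idx)

res = multi_aggregate(np.array([0, 0, 1, 1, 1]),
                      np.array([1.0, 3.0, 2.0, 4.0, 6.0]))
# res.sum -> [4. 12.], res.mean -> [2. 4.]
anynan, allnan = anynan_allnan(np.array([0, 0, 1, 1]),
                               np.array([np.nan, 1.0, np.nan, np.nan]))
# anynan -> [True, True], allnan -> [False, True]
```

Note that, as suggested above, differing `dtype`s (e.g. an integer `sum` next to a float `std`) would not fit into a single homogeneous `combined` array this simply; that detail is glossed over here.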