
Create a single API/test/benchmarking but with multiple independent implementations #2

Closed
@d1manson

Description


I've created a separate branch for working on this.

So far I've spent a few hours trying to write a fairly comprehensive readme, plus I've done some very basic work to split the implementations into separate files. Note, though, that I've made basically no attempt at actually enforcing the API yet, or even at checking whether the code runs ok... it doesn't!

Obviously there is still quite a bit of work to be done.

In terms of agreeing on a single API, I think the main questions are:

1. What to do with dtypes? It's always nice to give the user full control over this, but it's also great when the program can automatically choose the most sensible option. For example, summing int8s probably needs to be upgraded to int32 or int64, maybe even float. The program could guess what is best based on some heuristic, although this would add extra time; the simplest heuristic would be to assume the worst case, i.e. all values are already at the maximum of the current bitdepth and all will be added to the same group, thus if len(idx)=100 and vals.dtype=int8 then we assume the sum is going to need about log2(100*127) bits, plus a sign bit (see the first sketch after this list). This calculation is specific to sum; mean, var, and std have different requirements, and min, max, first, last, all, and any are easier to guess. The other question is what to do about the dtype of the fillvalue. I have chosen to enforce bools for the fillvalue of any and all, but the user may legitimately want some form of nan or pseudo-nan (I remember reading a discussion about efficient nans for non-floats somewhere in the pandas docs, or maybe it was to do with numpy's MaskedArray... anyway, there's no simple answer to that question). If the fillvalue does require using float in the output, it may still be more efficient to do the main computation with some form of int and only upgrade to float at the end. Perhaps the easiest thing is to upgrade all the ambiguous cases to at least float32 and then encourage the user to request a simpler dtype if they feel that would suffice... but then what about detecting overflows etc.? I'm not sure what error handling of that kind is built into numpy.

2. Support for non-flat vals, matched to idx - Although Matlab allows this, I would prefer not to. Assuming both idx and vals have the same memory layout, it costs the user nothing to flatten them, and it isn't too ugly syntactically. Enforcing flat vals also makes it simpler to identify when 2D idx is being used to request multidimensional output... this was always a bit confusing in Matlab... see the next point...

3. Support for multidimensional output, using 2D idx - This is available in Matlab and is a very helpful feature. I use it a lot for 2D binning, but it could be n-dimensional. I feel strongly that we should provide it, at least in the API (i.e. raise NotImplementedError if need be... but it's actually very easy to do once numpy is involved! See the second sketch after this list.)

4. Support for scalar vals - I wrote a small section on this in the readme. I think it's not really that useful in general, but it is helpful for sum with vals=1... you might argue that the user could just go and use bincount, but Matlab accepts scalars, and this version of accumarray can help with multidimensional bincounting, so it's probably a feature worth having. If need be, you could put a simple hack at the top of the function which expands the scalar out to a full vector... or raise NotImplementedError.

5. Support for broadcasting vals to idx - This is complicated to implement and causes ambiguities if you also want to support 2D idx for multidimensional output (discussed above). Matlab does not offer this, and I've never felt the need to use or implement it, so I strongly feel we should explicitly choose not to do it. It also means that the ufunc.at optimizations (if they ever happen) can be kept simpler... to be honest, I'm not even sure whether broadcasting makes any sense in the accumarray context?!

6. Support for contiguous/sorted mode - I haven't mentioned this in the readme and I haven't implemented it yet, but I agree that something along these lines would be nice. There are a variety of things you might want. For example, if the user says that idx is pre-sorted, various optimisations become possible (i.e. the code still treats the inputs the same way, but is now able to assume that groups are already laid out contiguously in memory). Alternatively, the user may want your contiguous concept, where idx is no longer an actual index but just a general label for contiguous sections. A variation on this would be to say that sections where idx=0 should be treated as belonging to a common "null" group, while all other contiguous blocks are labelled 1, 2, ..., n. This can work nicely if idx is a bool array: each True block is treated as a new label, but all the False blocks are grouped together (see the third sketch after this list). But sometimes the user will want two different groups adjacent to one another (i.e. with no False element between them), which you can't express with a simple bool array. Anyway, the point is that there's a lot you could do here to make the user happy, some of which would offer extra optimizations, and some of which would just allow the user to be lazier.

7. User-specified size and bounds checking - I think it's definitely worth allowing the user to specify the size of the output; Matlab does this. I often need it in my own analysis because I use accumarray twice, the second time with a subset of the original data, and I want the two outputs to be the same size. I'm not sure exactly how much control you can give the user over bounds checking... I guess with your weave stuff you could completely disable all checking if the user really wanted. If that's something you want to offer, we should encapsulate it in a proper kwarg; however, I think in general people will expect basic checking, as with all numpy code. edit 0: by the way, I don't like your choice of "incontiguous" as a default... the English is _non_contiguous, but either way it's a confusing choice of default. I also haven't looked into the details of what your downscaled mode offers... I'm not sure where it fits into the picture.
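
To make point 1 concrete, here is a minimal sketch of the worst-case heuristic for sum; the function name and the promotion ladder are my own placeholders, not a settled design:

```python
import numpy as np

def guess_sum_dtype(vals, idx):
    # Hypothetical heuristic: worst case is every value at the dtype's max
    # and everything summed into a single group.
    if not np.issubdtype(vals.dtype, np.integer):
        return vals.dtype                       # floats etc.: leave alone
    bits_needed = np.log2(len(idx) * float(np.iinfo(vals.dtype).max)) + 1
    for candidate in (np.int16, np.int32, np.int64):
        if bits_needed <= np.iinfo(candidate).bits:
            return np.dtype(candidate)
    return np.dtype(np.float64)                 # too big for any int

guess_sum_dtype(np.zeros(100, dtype=np.int8), np.arange(100))  # -> int16
```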
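For point 3, the numpy route really is easy: a 2D idx collapses to the flat case with np.ravel_multi_index, followed by a reshape at the end. A sketch, using bincount as a stand-in for the real accumulation:

```python
import numpy as np

idx = np.array([[0, 0, 1, 1],    # row coordinates
                [0, 1, 0, 1]])   # column coordinates
vals = np.array([1.0, 2.0, 3.0, 4.0])
shape = (2, 2)

flat_idx = np.ravel_multi_index(idx, shape)  # 2D coords -> flat labels
out = np.bincount(flat_idx, weights=vals, minlength=np.prod(shape))
out = out.reshape(shape)                     # back to n-d output
# out == [[1., 2.], [3., 4.]]
```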
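And for the bool-array variant of point 6, labelling the True runs takes only a couple of lines; a sketch, assuming the False-as-null-group convention described above:

```python
import numpy as np

idx = np.array([False, True, True, False, True, False, False, True])

starts = idx & ~np.concatenate(([False], idx[:-1]))  # True where a run begins
labels = np.cumsum(starts) * idx                     # number the runs; False -> 0
# labels == [0, 1, 1, 0, 2, 0, 0, 3]
```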


edit 1:

Some of the above probably warrants common scripts for numpy and weave, though pure-python solutions would have to be kept separate (and in many cases may be simpler, or not applicable at all). I'm not sure of the best way to deal with that, given that we want to keep things as separate as possible... there may need to be some kind of compromise.

One thing that should be fairly easy to agree on is the order of args: I vote for following Matlab exactly and then adding all the extra kwargs after that (order, dtype, mode).
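
For concreteness, a hypothetical signature following that ordering; the first five arguments mirror Matlab's accumarray(subs, val, sz, fun, fillval), and the names and defaults are placeholders rather than a settled API:

```python
def accumarray(idx, vals, sz=None, func='sum', fillvalue=0,
               order='C', dtype=None, mode=None):
    # Placeholder only: Matlab's argument order first,
    # then the extra kwargs discussed above.
    raise NotImplementedError
```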

Also, we should agree on the aliasing rules and on exactly what the nan- versions of the functions are supposed to do. I tried to describe that fairly explicitly in the readme... let me know if you disagree and/or want to add stuff.

Regarding the names of the two main arguments, I definitely prefer idx and vals, but if we permit a range of modes then idx is no longer necessarily such a great choice... I'll stick with it for now, though!

Regarding the weave stuff, is there a good place to point people to in order to get it installed quickly? So far I haven't said anything about this in the readme... it's probably worth giving people a one-sentence explanation of exactly why weave is so good.

I think you also mentioned a numba implementation... what did you say the status of that was? And what about a Cython one? Although I'm not sure there's much to be gained from Cython. I have a small amount of experience with Cython and a tiny amount with numba, but I don't feel I know enough to compare the pros and cons of each. If you are going to provide all these implementations, we should have a proper explanation of why you might want each of them... I guess one advantage worth mentioning is that some (all?) of these tools work nicely on the GPU... though I think the ease of getting onto the GPU varies greatly?

In terms of making the C versions more useful to people, might it be worth providing some binaries? I guess that's a pretty big job, but it might increase uptake.


edit 2:

Ok, I didn't realise that the non-numpy implementation did actually use numpy. I've now re-written it to be truly pure-python. It's relatively simple, though obviously not exactly performant.
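
Roughly, a dict-based sketch of what the truly pure-python version looks like (illustrative, not the exact code):

```python
def accum_py(idx, vals, func=sum, fillvalue=0):
    # Gather each group's values into a list, then reduce with func.
    groups = {}
    for i, v in zip(idx, vals):
        groups.setdefault(i, []).append(v)
    return [func(groups[i]) if i in groups else fillvalue
            for i in range(max(groups) + 1)]

accum_py([0, 2, 2, 0], [1, 2, 3, 4])  # -> [5, 0, 5]
```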

edit 3:

I think it may be worth trying to wrap pandas. It looks like pandas uses Cython under the hood for its groupby; see ..pandas/ib.pyx and ..core/groupby.py


edit 4:

Right, I've implemented a simple pandas version. I didn't spend any time trying to squeeze the best performance out of pandas, so it may be possible to do better. I've put benchmarking stats in the readme, but the basic story is that it doesn't beat the optimized numpy version, except where numpy has to rely on ufunc.at (i.e. min, max, prod).
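
For reference, the shape of the pandas wrapper is roughly this (a sketch, not necessarily the benchmarked code; names are illustrative):

```python
import numpy as np
import pandas as pd

def accum_pandas(idx, vals, func='sum', size=None, fillvalue=0):
    grouped = pd.Series(vals).groupby(idx).agg(func)  # Cython groupby under the hood
    size = idx.max() + 1 if size is None else size
    out = np.full(size, fillvalue, dtype=grouped.dtype)
    out[grouped.index] = grouped.values               # scatter group results
    return out

accum_pandas(np.array([0, 2, 2, 0]), np.array([1, 2, 3, 4]))  # -> [5, 0, 5]
```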

I've also implemented a ufunc.at version, for use in benchmarking.
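
That version is essentially this (a sketch; note that for min/max the fillvalue needs to be the ufunc's identity, e.g. np.inf for np.minimum):

```python
import numpy as np

def accum_ufunc_at(idx, vals, ufunc=np.add, size=None, fillvalue=0):
    size = idx.max() + 1 if size is None else size
    out = np.full(size, fillvalue, dtype=vals.dtype)
    ufunc.at(out, idx, vals)  # unbuffered in-place accumulation per index
    return out

accum_ufunc_at(np.array([0, 2, 2, 0]), np.array([1, 2, 3, 4]))  # -> [5, 0, 5]
```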
