
Create a single API/test/benchmarking but with multiple independent implementations #2

Closed
@d1manson

Description


I've created a separate branch for working on this.

So far I've spent a few hours trying to write a fairly comprehensive readme, plus I've done some very basic work to split the implementations into separate files. Note, though, that I've made basically no attempt at actually enforcing the API yet, or even at checking whether the code runs ok... it doesn't!

Obviously there is still quite a bit of work to be done.

In terms of agreeing on a single API, I think the main questions are:

1. What to do with dtypes? It's always nice to give the user full control over this, but it's also great when the program can automatically choose the most sensible option. For example, summing int8s probably needs to be upgraded to int32 or int64, maybe even float. The program could guess what is best based on some heuristic, although this would add extra time; the simplest heuristic would be to assume the worst case, i.e. all values are already at the maximum of the current bitdepth and all will be added to the same group, thus if len(idx)=100 and vals.dtype=int8 then we assume the sum is going to need about log2(100*127) bits, plus a sign bit (see the first sketch after this list). This calculation is specific to sum; mean, var, and std have different requirements, and min, max, first, last, all, and any are easier to guess. The other question is what to do about the dtype of the fillvalue. I have chosen to enforce bools for the fillvalue of any and all, but the user may legitimately want some form of nan or pseudo-nan (I remember reading a discussion about efficient nans for non-floats somewhere in the pandas docs, or maybe it was to do with numpy's MaskedArray... anyway, there's no simple answer to that question). If the fillvalue does require using float in the output, it may still be more efficient to do the main computation with some form of int and only upgrade to float at the end. Perhaps the easiest thing is to upgrade all the ambiguous cases to at least float32 and then encourage the user to request a simpler dtype if they feel that would suffice... but then what about detecting overflows etc.? I'm not sure what error handling of that kind is built into numpy.

2. Support for non-flat vals, matched to idx - Although Matlab allows this, I would prefer not to. Assuming both idx and vals have the same memory layout, it costs the user nothing to flatten them, and it isn't too ugly syntactically. Enforcing flat vals also makes it simpler to identify when 2D idx is being used to request multidimensional output... this was always a bit confusing in Matlab... see the next point...

3. Support for multidimensional output, using 2D idx - This is available in Matlab and is a very helpful feature. I use it a lot for 2D binning, but it could be n-dimensional. I feel strongly that we should provide it, at least in the API (i.e. raise NotImplementedError if need be... but it's actually very easy to do once numpy is involved! See the second sketch after this list.)

4. Support for scalar vals - I wrote a small section on this in the readme. I think it's not really that useful in general, but it is helpful for sum with vals=1... you might argue that the user could just go and use bincount, but Matlab accepts scalars, and this version of accumarray can help with multidimensional bincounting, so it's probably a feature worth having. If need be, you could put a simple hack at the top of the function which expands the scalar out to a full vector... or raise NotImplementedError.

5. Support for broadcasting vals to idx - This is complicated to implement and causes ambiguities if you also want to support 2D idx for multidimensional output (discussed above). Matlab does not offer this, and I've never felt the need to use or implement it, so I strongly feel we should explicitly choose not to do it. It also means that the ufunc.at optimizations (if they ever happen) can be kept simpler... to be honest, I'm not even sure whether broadcasting makes any sense in the accumarray context?!

6. Support for contiguous/sorted mode - I haven't mentioned this in the readme and I haven't implemented it yet, but I agree that something along these lines would be nice. There are a variety of things you might want. For example, if the user says that idx is pre-sorted, various optimisations become possible (i.e. the code still treats the inputs the same way, but is now able to assume that groups are already laid out contiguously in memory). Alternatively, the user may want your contiguous concept, where idx is no longer an actual index but just a general label for contiguous sections. A variation on this would be to say that sections where idx=0 should be treated as belonging to a common "null" group, while all other contiguous blocks are labelled 1, 2, ..., n. This can work nicely if idx is a bool array: each True block is treated as a new label, but all the False blocks are grouped together (see the third sketch after this list). But sometimes the user will want two different groups adjacent to one another (i.e. with no False element between them), which you can't express with a simple bool array. Anyway, the point is that there's a lot you could do here to make the user happy, some of which would offer extra optimizations, and some of which would just allow the user to be lazier.

7. User-specified size and bounds checking - I think it's definitely worth allowing the user to specify the size of the output; Matlab does this. I often need it in my own analysis because I use accumarray twice, the second time with a subset of the original data, and I want the two outputs to be the same size. I'm not sure exactly how much control you can give the user over bounds checking... I guess with your weave stuff you could completely disable all checking if the user really wanted. If that's something you want to offer, we should encapsulate it in a proper kwarg; however, I think in general people will expect basic checking, as with all numpy code. edit 0: by the way, I don't like your choice of "incontiguous" as a default... the English is _non_contiguous, but either way it's a confusing choice of default. I also haven't looked into the details of what your downscaled mode offers... I'm not sure where it fits into the picture.
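
To make point 1 concrete, here is a minimal sketch of the worst-case heuristic for sum; the function name and the promotion ladder are my own placeholders, not a settled design:

```python
import numpy as np

def guess_sum_dtype(vals, idx):
    # Hypothetical heuristic: worst case is every value at the dtype's max
    # and everything summed into a single group.
    if not np.issubdtype(vals.dtype, np.integer):
        return vals.dtype                       # floats etc.: leave alone
    bits_needed = np.log2(len(idx) * float(np.iinfo(vals.dtype).max)) + 1
    for candidate in (np.int16, np.int32, np.int64):
        if bits_needed <= np.iinfo(candidate).bits:
            return np.dtype(candidate)
    return np.dtype(np.float64)                 # too big for any int

guess_sum_dtype(np.zeros(100, dtype=np.int8), np.arange(100))  # -> int16
```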
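For point 3, the numpy route really is easy: a 2D idx collapses to the flat case with np.ravel_multi_index, followed by a reshape at the end. A sketch, using bincount as a stand-in for the real accumulation:

```python
import numpy as np

idx = np.array([[0, 0, 1, 1],    # row coordinates
                [0, 1, 0, 1]])   # column coordinates
vals = np.array([1.0, 2.0, 3.0, 4.0])
shape = (2, 2)

flat_idx = np.ravel_multi_index(idx, shape)  # 2D coords -> flat labels
out = np.bincount(flat_idx, weights=vals, minlength=np.prod(shape))
out = out.reshape(shape)                     # back to n-d output
# out == [[1., 2.], [3., 4.]]
```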
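And for the bool-array variant of point 6, labelling the True runs takes only a couple of lines; a sketch, assuming the False-as-null-group convention described above:

```python
import numpy as np

idx = np.array([False, True, True, False, True, False, False, True])

starts = idx & ~np.concatenate(([False], idx[:-1]))  # True where a run begins
labels = np.cumsum(starts) * idx                     # number the runs; False -> 0
# labels == [0, 1, 1, 0, 2, 0, 0, 3]
```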


edit 1:

Some of the above probably warrants common scripts for numpy and weave, though pure-python solutions would have to be kept separate (and in many cases may be simpler, or not applicable at all). I'm not sure of the best way to deal with that, given that we want to keep things as separate as possible... there may need to be some kind of compromise.

One thing that should be fairly easy to agree on is the order of args: I vote for following Matlab exactly and then adding all the extra kwargs after that (order, dtype, mode).
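
For concreteness, a hypothetical signature following that ordering; the first five arguments mirror Matlab's accumarray(subs, val, sz, fun, fillval), and the names and defaults are placeholders rather than a settled API:

```python
def accumarray(idx, vals, sz=None, func='sum', fillvalue=0,
               order='C', dtype=None, mode=None):
    # Placeholder only: Matlab's argument order first,
    # then the extra kwargs discussed above.
    raise NotImplementedError
```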

Also, we should agree on the aliasing rules and on exactly what the nan- versions of the functions are supposed to do. I tried to describe that fairly explicitly in the readme... let me know if you disagree and/or want to add stuff.

Regarding the names of the two main arguments, I definitely prefer idx and vals, but if we permit a range of modes then idx is no longer necessarily such a great choice... I'll stick with it for now, though!

Regarding the weave stuff, is there a good place to point people to in order to get it installed quickly? So far I haven't said anything about this in the readme... it's probably worth giving people a one-sentence explanation of exactly why weave is so good.

I think you also mentioned a numba implementation... what did you say the status of that was? And what about a Cython one? Although I'm not sure there's much to be gained from Cython. I have a small amount of experience with Cython and a tiny amount with numba, but I don't feel I know enough to compare the pros and cons of each. If you are going to provide all these implementations, we should have a proper explanation of why you might want each of them... I guess one advantage worth mentioning is that some (all?) of these tools work nicely on the GPU... though I think the ease of getting onto the GPU varies greatly?

In terms of making the C versions more useful to people, might it be worth providing some binaries? I guess that's a pretty big job, but it might increase uptake.


edit 2:

Ok, I didn't realise that the non-numpy implementation did actually use numpy. I've now re-written it to be truly pure-python. It's relatively simple, though obviously not exactly performant.
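
Roughly, a dict-based sketch of what the truly pure-python version looks like (illustrative, not the exact code):

```python
def accum_py(idx, vals, func=sum, fillvalue=0):
    # Gather each group's values into a list, then reduce with func.
    groups = {}
    for i, v in zip(idx, vals):
        groups.setdefault(i, []).append(v)
    return [func(groups[i]) if i in groups else fillvalue
            for i in range(max(groups) + 1)]

accum_py([0, 2, 2, 0], [1, 2, 3, 4])  # -> [5, 0, 5]
```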

edit 3:

I think it may be worth trying to wrap pandas. It looks like pandas uses Cython under the hood for its groupby; see ..pandas/ib.pyx and ..core/groupby.py


edit 4:

Right, I've implemented a simple pandas version. I didn't spend any time trying to squeeze the best performance out of pandas, so it may be possible to do better. I've put benchmarking stats in the readme, but the basic story is that it doesn't beat the optimized numpy version, except where numpy has to rely on ufunc.at (i.e. min, max, prod).
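
For reference, the shape of the pandas wrapper is roughly this (a sketch, not necessarily the benchmarked code; names are illustrative):

```python
import numpy as np
import pandas as pd

def accum_pandas(idx, vals, func='sum', size=None, fillvalue=0):
    grouped = pd.Series(vals).groupby(idx).agg(func)  # Cython groupby under the hood
    size = idx.max() + 1 if size is None else size
    out = np.full(size, fillvalue, dtype=grouped.dtype)
    out[grouped.index] = grouped.values               # scatter group results
    return out

accum_pandas(np.array([0, 2, 2, 0]), np.array([1, 2, 3, 4]))  # -> [5, 0, 5]
```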

I've also implemented a ufunc.at version, for use in benchmarking.
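
That version is essentially this (a sketch; note that for min/max the fillvalue needs to be the ufunc's identity, e.g. np.inf for np.minimum):

```python
import numpy as np

def accum_ufunc_at(idx, vals, ufunc=np.add, size=None, fillvalue=0):
    size = idx.max() + 1 if size is None else size
    out = np.full(size, fillvalue, dtype=vals.dtype)
    ufunc.at(out, idx, vals)  # unbuffered in-place accumulation per index
    return out

accum_ufunc_at(np.array([0, 2, 2, 0]), np.array([1, 2, 3, 4]))  # -> [5, 0, 5]
```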
