Create a single API/test/benchmarking but with multiple independent implementations #2
Wow, great job so far! Sorry for the delay, I was at a conference over the weekend. I'll go through step by step; hope I don't miss anything.

1. dtypes

For bool and int, I wouldn't bother with a separate nan type. If in doubt, the user can run an

2. Matching vals to idx

3. ndim

4. Scalar vals

5. Broadcasting

6. Contiguous mode

For the contiguous mode, this was my initial implementation, requiring an already sorted input. After hacking around a bit longer and improving the implementation for unsorted accmaps, it turned out the speed advantages were not as great as I had thought, and I could get comparable speeds for arbitrary accmaps. It's basically one pointer indirection that can get skipped when accessing the data. It seems the prefetching etc. is good enough that this is not a big cost anymore. As it is mainly a subcase of the unsorted map, it would make sense to throw it out completely. Renaming incontiguous to something better - I'm totally fine with that. I'm not a native speaker, so I beg your pardon for the naming. For your other suggestions on this, I'd say KISS principle again. If the user wants some 0-group, simply set some 0-values in the accmap before running

7. Bounds checking

So, I'm totally fine with adding the sz argument.

8. Input variable naming

It may be a good idea to change fillvalue to fill_value, as used in numpy e.g. for masked arrays.
For getting scipy weave to run, it's actually a single line on linux, not more than "apt-get install python-scipy", and everything works. I also know what a pain in the ass it can be to get toolchains or math libraries installed on windows. I haven't used windows for ages and can't help much with setting up scipy for it; the documentation there should cover that, I hope. What weave generally does is create C code on the fly, depending on the input data type, and store the compilation results for reuse. A working C compiler is necessary for all this to work, so there is no way to provide binaries that would run without a working toolchain. I also wouldn't worry at all about the purepy implementation. Any sane person trying to do some math with python will install numpy. If not - not my problem. No need to reinvent the wheel just for a demonstration version of accumarray. If some things don't work there, so be it. In terms of aliasing, you had the clearer approach. We should have a shared function for converting any func-keyword input into a unique function-name string, which every implementation then has to handle itself. Your implementation is a good start for that.
Your English is pretty much perfect (I actually spent a year in Moscow as an EFL teacher, so I should know... it's certainly a lot better than my Russian!). Ok... I think we need to flesh out the

1. dtypes

6. modes

So, are you suggesting that the only thing we should support is something like this:

    if <whatever_the_special_test_is>:
        idx_is_used = np.zeros(sz, dtype=bool)
        idx_is_used[idx] = True
        new_idx = np.cumsum(idx_is_used) - idx_is_used
        idx = new_idx[idx]
        sz = max(idx) + 1  # +1, since idx is 0-based after relabeling

If that's what you mean, then could we treat it as a special case of

Just to clarify, if the above is true, then there is no

8. variable naming

I really didn't like

I'm probably not going to do any more work on this in the next few days, so you can safely commit to the branch I was working on and it shouldn't be too confusing.
9. benchmarking and testing

I'm sure your stuff was good, but I was lazy and ended up basically starting from scratch in a separate file rather than trying to understand everything you were doing.
With "start from scratch" you mean the test_and_bench_simple module? I agree that testing and benchmarking should be separated. I'm also fine if the pytest testing part is up to me. I guess it could use some more comments; it's surely not too intuitive what's going on there. For basic understanding I'd recommend reading up on pytest fixtures and parametrization. accmap_all and accmap_compare are different fixtures. accmap_all simply returns the different accum implementations for the tests. Those are the first bunch of tests, testing basic functionality that has to be guaranteed by all implementations. This could also be achieved by parametrizing all these tests separately with all implementations; parametrizing the fixture just does this in a more convenient way. accmap_compare provides a whole namespace with a bunch of testing ingredients: data with and without nan (depending on what is to be tested), a reference function, etc. test_compare then runs all the functions listed in func_list against any combination of implementations provided by the accmap_compare fixture.
accmap = np.unique(accmap, return_inverse=True)[1]
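For reference, this is what that one-liner does: it relabels arbitrary group labels into dense 0-based indices (example values are mine):

```python
import numpy as np

# Arbitrary group labels are mapped to dense 0-based indices:
accmap = np.array([10, 3, 3, 10, 7])
dense = np.unique(accmap, return_inverse=True)[1]
# unique labels are [3, 7, 10], so 10 -> 2, 3 -> 0, 7 -> 1
sz = dense.max() + 1  # number of distinct groups
```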
Btw., let's please be more specific with the thrown errors in the future rework. It's really bad practice to only throw
That all sounds good to me. I stand by what I said above - I'm not going to touch anything for a couple of days at least, so it should be easy for you to commit straight onto this branch.
Ok, I'll see where I get with this over the next days. I guess the last open thing is a final decision on variable naming. I got the names initially from here: http://wiki.scipy.org/Cookbook/AccumarrayLike . So it's not really my invention. As it's probably the oldest public python accumarray implementation (dating back to 2010) and was promoted on scipy, it likely got some reach. I'd rather stick with that than invent something completely new without good reason.
For some unclear reason I had used fillvalue without the underscore, even though it's commonly written with it. This should be fixed into

About the
About the keyword order, I initially would have tended to agree to keep the matlab ordering. Thinking about it now, I guess this should be discussed again. I was once told, when I suggested some accumarray thing for numpy, that numpy is not matlab and does not need to repeat its quirks. func seems to me to be clearly the most used keyword argument, nearly worth being a mandatory, positional argument. If we could agree on all this, we would finally end up with:

    def accumarray(accmap, a, func='sum', size=None, fill_value=0, dtype=None):
        ...

These are the conventions we talked about already, just to sum it up:
Are you fine with this?
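For illustration, a minimal pure-numpy sketch of what that proposed signature could look like - sum only; everything beyond the signature itself is my assumption, not the project's actual implementation:

```python
import numpy as np

def accumarray(accmap, a, func='sum', size=None, fill_value=0, dtype=None):
    """Sum-only illustration of the proposed signature (not the real code)."""
    accmap = np.asarray(accmap)
    a = np.asarray(a)
    if size is None:
        size = accmap.max() + 1
    if func != 'sum':
        raise NotImplementedError(func)
    out = np.zeros(size, dtype=dtype if dtype is not None else a.dtype)
    np.add.at(out, accmap, a)          # unbuffered accumulation per group
    untouched = np.ones(size, dtype=bool)
    untouched[accmap] = False
    out[untouched] = fill_value        # only empty groups get fill_value
    return out

res = accumarray([0, 1, 0, 3], [1.0, 2.0, 3.0, 4.0])
# group 2 receives no values, so it keeps the fill_value
```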
In no particular order... I suspect that the cookbook
With the first two args, I guess you're right that

I'm not too fussed about whether or not we match up with Matlab beyond the first two args, as long as we provide basically the same functionality. Python's superior argument-passing syntax is basically doing a pretty good job already - my feeling is that different users are likely to use the other

I'm not sure what you mean by "a[accmap >= size] is treated as nan"... where

To clarify the

And what did you mean by the last point - "no implicit matching of a to accmap"?
C/F order is not a big deal to support in pure numpy operations, as numpy takes care of all the iteration issues related to it. Having to handle this in C is quite an annoyance that I wouldn't want to take on without good reason. Anyone who really cares about the order can easily change it afterwards, imho. I once had a bug in the C version, not correctly recognizing F arrays and producing wrong results; that's why I said autodetection, and acting upon it, must be guaranteed. Nothing to do for the numpy version here. Let's pick

What I meant with "treat like nan" was: ignore these values. If the size of the output is already given with

If dtype is specified, I guess we agree that specifying an invalid fill_value should simply raise a ValueError. If the user specifies dtype=int and fill_value=-inf, what else to do with it? For more vague situations - fill_value=-inf for a bool operation, no dtype specified - there are points for both sides. I don't like too much implicit behaviour. Who knows where the fill_value came from - some calculation that went wrong? A bool operation simply doesn't produce nan (or any other float) as an output. Neither should accumarray, unless explicitly told to do so. Same with sums/prods on int, and float values as fill_value. If the user insists on specifying +inf as a fill_value, I'd expect at least the output dtype to be properly specified, so it's clear what should happen. The analogy would be setting float values into an int array: if it's a more or less sane operation, the value is rounded and set; if it's nan/inf, there is an error. Anyway, this is probably too much detail; we should probably skip this point until there is some implementation for the disambiguation function.

No implicit matching: if the shapes don't fit together for a sane operation, there shouldn't be any fancy magic making it somehow work, but just a clear error.
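A sketch of the ValueError behaviour described here - the helper name `check_fill_value` and its exact rules are my assumption, not an agreed API:

```python
import numpy as np

def check_fill_value(fill_value, dtype):
    """Raise if fill_value cannot be represented in the requested dtype.
    (Hypothetical helper, illustrating the dtype=int + fill_value=-inf case.)"""
    if dtype is None:
        return
    dtype = np.dtype(dtype)
    # int, uint and bool outputs cannot hold nan/inf fill values
    if dtype.kind in 'iub' and isinstance(fill_value, float) and not np.isfinite(fill_value):
        raise ValueError("fill_value %r cannot be represented as %s" % (fill_value, dtype))

check_fill_value(0, np.int64)            # fine: representable
try:
    check_fill_value(-np.inf, np.int64)  # dtype=int, fill_value=-inf -> error
    raised = False
except ValueError:
    raised = True
```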
One more thing: I totally don't care about my name being presented anywhere in the code, readme, whatever, or about someone copying. This is such a small snippet, not worth claiming any authorship on, imho. So I'd suggest scratching any name mentions in the readme and the code. The git history is descriptive enough if someone really cares. If you'd prefer to have it somewhere, I'd rather suggest something explicit like a contributors.txt.
I don't like silently cropping out-of-range indices. It's likely to hide bugs in users' code. If they really want to do that, they can do a simple masking before they pass the values into accumarray - we could give an example of this in the docs. Alternatively you could provide a

I think I agree with everything you just said about dtypes; let's see what the code ends up looking like and discuss further if need be.

With the
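The docs example suggested here could be as simple as the following sketch, using `np.bincount` as a stand-in for the summing step (values and names are illustrative):

```python
import numpy as np

idx = np.array([0, 5, 2, 9, 1])
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
size = 4

keep = idx < size                      # explicit, visible masking by the user
result = np.bincount(idx[keep], weights=vals[keep], minlength=size)
# indices 5 and 9 are deliberately excluded, rather than silently cropped
```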
I'm not sure how you were planning on doing the multidimensional indexing, but if you use

Regarding authorship - I'm fine with removing all references in code, but it would be good to leave one line in the readme somewhere... not a big deal though.
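One plain-numpy possibility for the 2D idx case - my sketch, not necessarily what was planned - is to flatten with `np.ravel_multi_index` and reshape at the end:

```python
import numpy as np

# 2D group indices: row 0 gives the output row, row 1 the output column
idx_2d = np.array([[0, 0, 1],
                   [0, 2, 1]])
vals = np.array([1.0, 2.0, 3.0])
shape = (2, 3)

flat_idx = np.ravel_multi_index(idx_2d, shape)   # 2D indices -> flat indices
flat = np.bincount(flat_idx, weights=vals, minlength=shape[0] * shape[1])
result = flat.reshape(shape)
```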
Have only one single large value (maybe due to overflow) in

About ordering for ndim output, I would have defaulted to C and left it to the user to wrap it into
Cropping is easy to use and easy to understand, but I feel very strongly that silently dropping some data is a recipe for disaster... I don't know what kind of data you deal with normally, but in my field you often find weird data that needs to be carefully dealt with, rather than simply having "bad" values swept under the carpet... using
For bincount, I'm honestly not getting the point of the minlength argument at all. I haven't used matlab in a while and don't have a working installation here. If it simply throws an error for group indices out of range... why not. I just checked our codebase: within roughly 400 different calls to accum, not a single one used the size keyword. I had basically implemented it as it was in matlab, though the details aren't well described in their manual pages. I guess it makes more sense in an ndim context anyway. In terms of downscaling, getting the corresponding original indices is possible using

This may well fit together with some other little helpers we're currently using, like uaccum (accum and unpack in one go) and a bunch of other utils. Looks like we're pretty much through then. One last thing: I'd be a big fan of staying with
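As a hedged illustration of recovering the original labels after a downscaling relabel (example mine; the exact mechanism referred to above is not spelled out in the thread), `np.unique` with `return_index`/`return_inverse` gives both directions:

```python
import numpy as np

accmap = np.array([10, 3, 3, 10, 7])
uniq, first_idx, dense = np.unique(accmap, return_index=True, return_inverse=True)
# dense is the downscaled 0-based relabeling; uniq[dense] reconstructs
# the original labels, and first_idx points at one original position per group
```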
Regarding

I agree with the concept of a helper library. Potentially you could attach them as attributes to the main function, e.g.

Anyway, to address the question of
Hmm, I just feel like there will be a time after matlab, and probably it has already started. So it may be a good time to get rid of some old luggage. I don't really like compromising with matlab's shortcomings. Pandas goes with

It's fine if the module keeps accumarray as its name; this way someone will find it when looking for this functionality. Inside the module, I suppose we're free to pick another, more fitting name, more in line with good naming practice and numpy conventions.
After writing the previous post, I thought about it a bit more and decided that really I would also prefer

If we go with that, though, then I think we should try and drop all mention of "accumarray", except when referring explicitly to Matlab... so maybe rename the repo to
Agreed with removing all mentions of accumarray. I just wouldn't like to rename the repo right now. Google searches, and any existing accumarray links, point here. Maybe later.
I see there's also this other question that you've answered... looks like
When breaking with matlab traditions... something like
Hmm, I'm not sure you gain much syntactically by doing that. In terms of optimization, I guess there may be something to be said for creating an
I'd rather throw out the weave stuff asap ;) Nevertheless, yes, there may be more room for improvement. E.g. collecting max(group_id) would only need to be done once. Though no one guarantees that nobody messes with group_id in the meantime between different calls... I suppose we should stick with the syntax discussed so far, and see if we want to add something later.
Well, I do quite like the idea of offering tuple/list support for
I'm not against it, though I don't plan to implement any extensions to the C code besides the 2D indexing in the near future. It's quite a messy thing already anyway.
Most testing and benchmarking is working again now after the refactoring. I was a bit surprised by the bad pandas performance compared to the C implementation, as they claim to be fast. Initially I thought maybe the construction of the dataframe would be a big overhead, but doubling the array size nearly doubled the pandas time. So the initial setup time seems to be quite small, close to negligible, compared to the iteration time, and pandas is still a magnitude slower than the C version. Maybe they suffer from the same problems as ufunc.at, in having a generic solution. I also added the rudimentary start of my numba implementation. I don't remember exactly where I had left it. It was some numba versions ago, and I had stopped working on it, as there were some annoying issues with numba that I didn't want to bother with further. Chances are this is fixed by now; it would be worth giving it another try. My plan for the next days: I want to have the current features and testing working again. Most of the xfails should be gone first. After that, I'd go on with adding 2d group_idx for C etc., ideally in a test-driven way: designing tests first, implementation later.
It's all looking nice so far. I think I've actually got a working version of numba installed, so I should be able to play with that as and when it's ready. One thought... if people don't want the whole repo with all the implementations and testing and benching etc., is there going to be an easy way for them to just download a single file, or perhaps two files? I guess we could suggest copy-pasting the contents of

Getting the
The func-tuple would also have the benefit that it would be clear how long to keep temporary data: no problem with invalidating a cache, changes in the array while cached data wasn't invalidated, and similar hassle. If we really wanted to do that, all the functors should become classes, so that they could have attributes describing which temporary results they produce/require, and the dependencies could be calculated optimally. Though, it would be a major hassle to do that elegantly in C, at maximum performance, and in this generality. I won't do that now, nor in the near future, and afterwards the current numpy C-API will likely be completely dropped. For the other implementations, I see much less optimization potential, as most of the speedup would come from doing many things in one go, within one single iteration. This would only be possible with hand-crafted, auto-generated C code, and maybe, only maybe, with some numba fanciness. In the end, I don't think it is worth it right now. All in all, I don't think any further syntax change makes much sense right now - neither the one I had talked about, nor the tuple thing. So let's just treat that as a kind of thought collection, get the current stuff working again, fix the not-implemented things as well as possible, and make it the new master version.
Ok, I've put the multiple functions thing in a new issue... #3
Ok, done for today. Benchmarks and tests working again. Tomorrow: 2d for C funcs + tests. Current benchmarks:
Looking at

Ideally, we should have a range of different input values and shapes etc. in order to get a better sense of the benchmarks.
Indeed somewhat strange. Right now it's also not such a useful test at all for the
Nice to see what a little path optimization can do. Performance crown for all disciplines back to

The current implementation profits if the
Yes, lookin' good. Do you know whether you're getting full SIMD vectorisation? I know numpy doesn't always manage it - numpy/numpy#5144 - at least 1.9 doesn't. I'm trying to understand why min and max can't quite compete with
Are you still planning on doing the

btw... did you catch that comment I left on
Good morning Daniel! Yes, I saw that comment. I'd be fine with only catching well-known numpy functions. I used to use bottleneck for some operations; I'd probably add their functions to the detection as well, if they are importable, so as not to depend on them. I can imagine the pandas guys definitely want to iron that out; maybe we should just open a bug report there, either now or after we make this branch the new master. I saw already that you're very good at that... ;) I'm quite interested in getting the numba version done in some mid-term perspective. The C API for numpy, that

In terms of SIMD, I turned
The previous test wasn't that fair. I unfortunately had used
With SIMD, I guess it shouldn't really be a surprise that there's not much change - neither LLVM nor GCC claims to do any magic for multiple simultaneous reductions, only for single reductions. One thing you could try, if you were really serious, is the case where the size of the output is pretty small. In that case, you could duplicate each bin, giving one slot for each of the elements within a SIMD register, i.e. for sums your output might be:
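As a rough Python illustration of that idea - lanes simulated with strided slices rather than real SIMD registers, and all numbers/names are mine - each "lane" gets a private copy of the bins, which are combined only at the end:

```python
import numpy as np

idx = np.array([0, 1, 0, 2, 1, 0, 2, 1])
vals = np.ones_like(idx, dtype=np.int64)   # simple counting example
nbins, lanes = 3, 4

# One private accumulator per "lane"; element i goes to lane i % lanes,
# so the lanes never write to the same slot and could run in parallel.
sub = np.zeros((lanes, nbins), dtype=np.int64)
for lane in range(lanes):
    np.add.at(sub[lane], idx[lane::lanes], vals[lane::lanes])

result = sub.sum(axis=0)                   # combine the lanes at the end
```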
I meant to say in the previous comment that one obvious case where the above optimisation is probably used is in image libraries, where they produce histograms for int8 data on each channel... might be worth trying to find an example and seeing if they do indeed do that.
I'm also not really an expert at this bit fiddling, and I'm not sure it's worth digging further here. So far I'm more or less fine with the current speed. If it needs to be even faster, it would probably be better to throw it onto the graphics card, and to get it multithreaded first. Current benchmarks:
[I was about to post the following, before I saw your comment] Ok, well since I've dug myself a bit of a hole, here's a proper paper dealing with this rigorously:
numba lets you do multithreading apparently... dunno about weave
Btw., did you manage to get weave to run on windows?
Pushed some more changes now. Some test cases for multi-indexing are still missing. Besides that, we should be quite good to go for making this master.
Haven't had a chance to look at this properly yet, but it seems pretty tidy. I guess the readme and docstrings could do with a bit of a further going-over - much of what I put in the readme could now be removed or simplified, especially the ugly table with all the functions etc. - now that we have the benchmarking stuff shown, the table is mostly redundant. Also, can we concatenate the two utils files, with a

Once you've finished with all the major changes I'll have a "proper" look myself sometime later in the week, hopefully. Nice work!
What I didn't like about the
Pushed some updates for scalar values and ndim input. Please have a look especially at the testing there; maybe you have some more/better ideas. Besides that, I'd like to make this master soon now.
Was there a reason behind not accepting

I'm not sure about the ndim test - is it comparing the numpy group-loop version against the other versions? If so, it's not really a test, because they are all using essentially the same bit of code to deal with multiple dimensions. I think you would need to use some inputs with known outputs, i.e. either use a very small input dataset where you can manually write the correct output, or construct a specialised input with properties like np.all(result[2:3,:]==0), and such like (that specific example is too simple to be the only test, though).
This should work for

Agreed with a simple test sample. Would be nice to add that. Do you want to do that?
Ok, I haven't added the test sample, but I have done some work on the readme and docs - there's now one central version of the docstring, but each implementation is free to add its own extra bit before (or after) it. I don't know if the syntax I've used for setting the docstring is ok - I don't see why not? Regarding tests, do you know whether we check the nan-behavior of non-nan functions, i.e. if there are nans in the data but the user calls

...I'm not sure that those

However, there is definitely scope for putting more examples into the readme here, if that's what you want.
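For reference, plain numpy accumulation already shows the behaviour such tests should pin down: a nan in the input makes its group's sum nan, while other groups are unaffected (example values mine):

```python
import numpy as np

idx = np.array([0, 0, 1, 1])
vals = np.array([1.0, np.nan, 2.0, 3.0])

out = np.zeros(2)
np.add.at(out, idx, vals)
# group 0 contains a nan, which propagates into its sum; group 1 is clean
```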
The docstring was crying out to be moved into

The nan behaviour of the non-nan functions is indeed not well tested. I'm not expecting surprises here, but you're right, the tests should be there.
I think the repo's name should be changed one way or another. Some suggestions for going more general:

If you can see a better way of doing the docstring/imply-dict then definitely give it a shot. I'll leave the repo alone again for the next few days. The main thing I'd like to work on is the multiple-funcs thing, but that will require a fairly stable starting point. If you want to switch to the main branch and set a
What exactly do you mean by the multiple funcs thing? I didn't get to work on anything here yesterday, and probably won't today either, but I'm quite optimistic about having a neat solution for this.
I meant - #3. What did you think of "numpy-groupies" for the name of the repo? I thought it was a pretty good suggestion.
Sounds fine to me. Nevertheless, we should somehow make sure that it remains findable for any accumarray searches. I have added some tests for nans with non-nan functions and pushed everything to master now. Besides the renaming, everything in here should be solved now. If I forgot something, I suppose a separate issue would be appropriate.
I've created a separate branch for working on this.
So far I've spent a few hours trying to write a fairly comprehensive readme plus I've done some very basic work to split up the implementations into separate files. Note though that I've made basically no attempt at actually enforcing the API yet, or even checking if the code runs ok...it doesn't!
Obviously there is still quite a bit of work to be done.
In terms of agreeing on a single API, I think the main questions are:
1. What to do with `dtype`s? It's always nice to give the user full control over this, but it's also great when the program is able to automatically choose the most sensible option. For example, summing `int8`s probably needs to be upgraded to `int32` or `int64`, maybe even `float`... the program could guess what is best based on some heuristic, although this would add extra time. The simplest heuristic would be to assume the worst case, i.e. all values are already at the maximum for the current bitdepth and all will be added to the same group; thus if `len(idx)=100` and `vals.dtype=int8`, then we assume the `sum` is going to need log2(100*255) bits. This calculation is specific to `sum`; `mean var std` have different requirements, and `min max first last all any` are easier to guess. The other question is what to do about the dtype of the fillvalue - I have chosen to enforce `bool`s for the `fillvalue` of `any all`, but the user may legitimately want some form of `nan` or pseudo-nan (I remember reading a discussion about efficient `nan`s for non-floats somewhere in the pandas docs, or maybe it was to do with `numpy`'s `MaskedArray`... anyway, there's no simple answer to that question). If the fillvalue does require using `float` in the output, it may still be more efficient to do the main computation with some form of `int` and then only upgrade to `float` at the end. Perhaps the easiest thing to do is upgrade all the ambiguous stuff to at least `float32` and then encourage the user to request a simpler `dtype` if they feel that would suffice... but then what about detecting overflows etc.? I'm not sure what error handling of that kind is built into `numpy`.

2. Support for non-flat `vals`, matched to `idx`. Although matlab allows this, I would prefer not to. Assuming both `idx` and `vals` have the same memory layout, it costs nothing for the user to flatten them, and it isn't too ugly syntactically. Enforcing that also makes it simpler to identify when 2D `idx` is being used to request multidimensional output... this was always a bit confusing in matlab... see next point...

3. Support for multidimensional output, using 2D `idx`. This is available in matlab and is a very helpful feature. I use it a lot for 2d binning, but it could be n-dimensional. I feel strongly that we should provide it, at least in the API (i.e. throw `NotImplemented` if need be... but it's actually very easy to do once numpy is involved!).

4. Support for scalar `vals` - I wrote a small section on this in the readme. I think it's not really that useful in general, but helpful for `sum` of `vals=1`... you might argue that the user could just go and use `bincount`, but Matlab accepts scalars, and this version of `accumarray` can help with multidimensional bincounting, so it's probably a feature worth having. If need be, you could put a simple hack at the top of the function which expands the scalar out to a full vector... or throw `NotImplemented`.

5. Support broadcasting `vals` to `idx`. This is complicated to implement and causes ambiguities if you also want to support 2D `idx` for multidimensional output (discussed above). Matlab does not offer this, and I've never felt the need to try and use/implement it, so I strongly feel we should explicitly choose not to do it. It also means that the `ufunc.at` optimizations (if they ever happen) can be kept simpler... to be honest, I'm not even sure whether broadcasting makes any sense in the `accumarray` context?!

6. Support for contiguous/sorted mode. I haven't mentioned this in the readme and I haven't implemented it yet, but I agree that something along these lines would be nice. There are a variety of things you might want; for example, if the user says that `idx` is pre-sorted, various optimisations may be possible (i.e. the code still treats the inputs the same way, but is now able to assume that groups are already laid out contiguously in memory). Alternatively, the user may want your contiguous concept, where `idx` is no longer an actual index, but just a general label for contiguous sections. A variation on this would be to say that sections where `idx=0` should be treated as belonging to a common "null" group, but all other contiguous blocks should be treated as being labeled `1, 2, ..., n`. This can work nicely if `idx` is a bool array, as each `true` block is treated as a new label, but all the `false` blocks are grouped together... but sometimes the user will want a way to have two different groups adjacent to one another (i.e. no `false` element between them), which you can't do with a simple `bool` array. Anyway, the point is that there's a lot you could do here to make the user happy, some of which will offer extra optimizations, and some of which would just allow the user to be lazier.

7. User-specified size and bounds checking - I think it's definitely worth allowing the user to specify the size of the output; Matlab does this. I often need to do this in analysis because I use accumarray twice, the second time with a subset of the original data, and I want the outputs to be the same size. I'm not sure exactly how much control you can give the user over bounds-checking... I guess with your `weave` stuff you could completely disable all checking if the user really wanted. If that's something you want to offer, we should encapsulate it in a proper `kwarg`; however, I think in general people will expect basic checking, as I think all `numpy` code does.

edit 0: by the way, I don't like your choice of `"incontiguous"` as default... the English is _non_contiguous, but it's a bit of a confusing choice of default either way. I also haven't looked into the details of what your `downscaled` mode offers... I'm not sure where it fits into the picture.

edit 1:
Some of the above stuff probably warrants common scripts for `numpy` and `weave`, though pure python solutions would have to be kept separate (and in many cases may be simple/not applicable at all). I'm not sure of the best way to deal with that, given that we want to try and keep stuff as separate as possible... there may need to be some kind of compromise.

One thing that should be fairly easy to agree on is the order of args; I vote for following Matlab exactly and then adding all the extra `kwargs` after that (`order, dtype, mode`).

Also, we should agree on the aliasing rules and exactly what the nan- versions of functions are supposed to do. I tried to describe that fairly explicitly in the readme... let me know if you disagree and/or want to add stuff.

Regarding the names of the main two arguments, I definitely prefer `idx` and `vals`, but if we permit a range of modes then `idx` is no longer necessarily such a great choice... but I'll stick with it for now!

Regarding the `weave` stuff, is there a good place to point people to in order to get it installed quickly? So far I haven't said anything about this in the readme... it's probably worth giving people a one-sentence explanation as to exactly why `weave` is so good.

I think you also mentioned a numba implementation... what did you say the status of that was? And what about a Cython one?... although I'm not sure there's going to be much to be gained from Cython. I have a small amount of experience with Cython and a tiny amount of experience with numba, but I don't feel like I really know enough to compare the pros and cons of each... if you are going to provide all these implementations, we should have a proper explanation as to why you might want each of them... I guess one of the advantages worth mentioning is that (some/all? of these things) work nicely on the GPU... though I think the simplicity of going to the GPU varies greatly?

In terms of making the C versions more useful to people, might it be worth providing some binaries? I guess that's a pretty big job, but it might increase uptake.

edit 2:
Ok, I didn't realise that the non-numpy implementation did actually use numpy. I've now re-written it to be truly pure-python. It's relatively simple, though obviously not exactly performant.

edit 3:
I think it may be worth trying to wrap `pandas`. It looks like pandas uses cython under the hood for its groupby - see ..pandas/ib.pyx and ..core/groupby.py

edit 4:
Right, I've implemented a simple `pandas` version. I didn't spend any time trying to squeeze the best performance out of pandas, so it may be possible to do better. I've put benchmarking stats in the readme, but the basic story is that it doesn't beat the optimized numpy version, except where numpy has to rely on `ufunc.at` (i.e. `min max prod`).

I've also implemented a ufunc-at version, for use with benchmarking.
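Point 4 above (scalar `vals`) reduces, for `sum` of `vals=1` in the 1D case, to plain bincounting (example values mine):

```python
import numpy as np

idx = np.array([0, 2, 2, 1, 2])
# accumarray(idx, 1, func='sum') is just a count per group:
counts = np.bincount(idx, minlength=4)   # minlength plays the role of size
```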