-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Using groupby with custom index #1308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Can you share the shape and dask chunking for |
Hello Stephan, The shape of the full data, if I read from within xarray, is (time, level, lat, lon), with level=60, lat=41, lon=480. time is I am chunking only along longitude, using lon=100. I previously chunked along time, but that used too much memory (~45GB out of 128 GB) since the data is split into one file per month, and reading annual data would require reading many files into memory. Superficially, I would think that both of the above would take similar amounts of time. In fact, calculating a daily climatology also requires grouping the four 6 hourly data points into a single day as well, which seems to be more complicated. However, it seems to run faster! Thanks, |
Slightly OT observation: Performance issues are increasingly being raised here (see also #1301). Wouldn't it be great if we had shared space somewhere in the cloud to host these big-ish datasets and run performance benchmarks in a controlled environment? |
We currently do all the groupby handling ourselves, which means that when you group over smaller units the dask graph gets bigger and each of the tasks gets smaller. Given that each chunk in the grouped data is only about ~250,000 elements, it's not surprising that things get a bit slower -- that's near the point where Python overhead starts to get significant. It would be useful to benchmark graph creation and execution separately (especially using dask-distributed's profiling tools) to understand where the slow-down is. One thing that might help quite a bit in cases like this where the individual groups are small is to rewrite xarray's groupby to do some groupby operations inside dask, rather than in a loop outside of dask. That would allow executing tasks on bigger chunks of arrays at once, which could significantly reduce scheduler overhead. |
@shoyer If I increase the size of the longitude chunk anymore, it will almost like using no chunking at all. I guess this dataset is a corner case. I will try increasing doubling that value and see what happens. I hadn't realised that doing a groupby would also reduce the effective chunk size, thanks for pointing that out. I'm using dask without distributed as of now, is there still some way to do the benchmark? I would be more than happy to run it. @rabernat I would definitely favour a cloud based sandbox to try these things out. What would be the stumbling block towards actually setting it up? I have had some recent experience setting up jupyterhub, I can help set that up so that notebooks can be used easily in such an environment. |
I've had some troubles with 6-Hrly ERA-Interim data myself recently. I wonder if the fact that the data is highly compressed ( |
Memory consumption, yes, performance, not so much. Scale/offset (de)compression can be applied super fast, unlike zlib compression which can be 10x slower than reading from disk. |
Not sure if this helps, but I did a For the 6 hourly thing, It takes around 4x more time, which makes sense because there are 4x more groups. The ratio of user to system time is more or less constant, so nothing untoward seems to be happening in between the two runs. I think it is just good to remember that the time to use scales linearly with the number of groups. I guess this is what @shoyer was talking about when he mentioned that since grouping is done within xarray, the dask graph grows, making things slower. Thanks again! |
Hello,
I have 6 hourly data (ERA Interim) for around 10 years. I want to calculate the annual 6 hourly climatology, i.e, 366*4 values, with each value corresponding to a 6 hourly interval. I am chunking the data along longitude.
I'm using xarray 0.9.1 with Python 3.6 (Anaconda).
For a daily climatology on this data, I do the usual:
For the 6 hourly version, I am trying the following:
The first one (daily climatology) takes around 15 minutes for my data, whereas the second one ran for almost 30 minutes after which I gave up and killed the process.
Is there some obvious reason why the first is much faster than the second?
data
in both cases is the 6 hourly dataset. And is there an alternative way of expressing this computation which would make it faster?TIA,
Joy
The text was updated successfully, but these errors were encountered: