Skip to content

Add zonal stats user story for #428 #435

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 5, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions ci/docs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ dependencies:
- matplotlib-base
- myst-parser
- myst-nb
- sparse
- sphinx
- sphinx-remove-toctrees
- furo>=2024.08
Expand Down
1 change: 1 addition & 0 deletions docs/source/user-stories.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,4 +10,5 @@
user-stories/climatology-hourly-cubed.ipynb
user-stories/custom-aggregations.ipynb
user-stories/nD-bins.ipynb
user-stories/large-zonal-stats.ipynb
```
194 changes: 194 additions & 0 deletions docs/source/user-stories/large-zonal-stats.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,194 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "0",
"metadata": {},
"source": [
"# Large Raster Zonal Statistics\n",
"\n",
"\"Zonal statistics\" spans a large range of problems. \n",
"\n",
"This one is inspired by [this issue](https://github.com/xarray-contrib/flox/issues/428), where a cell areas raster is aggregated over 6 different groupers and summed. Each array involved has shape 560_000 x 1440_000 and chunk size 10_000 x 10_000. Three of the groupers `tcl_year`, `drivers`, and `tcd_thresholds` have a small number of group labels (23, 5, and 7). \n",
"\n",
"The last 3 groupers are [GADM](https://gadm.org/) level 0, 1, 2 administrative area polygons rasterized to this grid; with 248, 86, and 854 unique labels respectively (arrays `adm0`, `adm1`, and `adm2`). These correspond to country-level, state-level, and county-level administrative boundaries. "
]
},
{
"cell_type": "markdown",
"id": "1",
"metadata": {},
"source": [
"## Example dataset"
]
},
{
"cell_type": "markdown",
"id": "2",
"metadata": {},
"source": [
"Here is a representative version of the dataset (in terms of size and chunk sizes)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3",
"metadata": {},
"outputs": [],
"source": [
"import dask.array\n",
"import numpy as np\n",
"import xarray as xr\n",
"\n",
"from flox.xarray import xarray_reduce\n",
"\n",
"sizes = {\"y\": 560_000, \"x\": 1440_000}\n",
"chunksizes = {\"y\": 2_000, \"x\": 2_000}\n",
"dims = (\"y\", \"x\")\n",
"shape = tuple(sizes[d] for d in dims)\n",
"chunks = tuple(chunksizes[d] for d in dims)\n",
"\n",
"ds = xr.Dataset(\n",
" {\n",
" \"areas\": (dims, dask.array.ones(shape, chunks=chunks, dtype=np.float32)),\n",
" \"tcl_year\": (\n",
" dims,\n",
" 1 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32),\n",
" ),\n",
" \"drivers\": (dims, 2 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n",
" \"tcd_thresholds\": (\n",
" dims,\n",
" 3 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32),\n",
" ),\n",
" \"adm0\": (dims, 4 + dask.array.ones(shape, chunks=chunks, dtype=np.float32)),\n",
" \"adm1\": (dims, 5 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n",
" \"adm2\": (dims, 6 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n",
" }\n",
")\n",
"ds"
]
},
{
"cell_type": "markdown",
"id": "4",
"metadata": {},
"source": [
"## Zonal Statistics"
]
},
{
"cell_type": "markdown",
"id": "5",
"metadata": {},
"source": [
"Next define the grouper arrays and expected group labels"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6",
"metadata": {},
"outputs": [],
"source": [
"by = (ds.tcl_year, ds.drivers, ds.tcd_thresholds, ds.adm0, ds.adm1, ds.adm2)\n",
"expected_groups = (\n",
" np.arange(23),\n",
" np.arange(1, 6),\n",
" np.arange(1, 8),\n",
" np.arange(248),\n",
" np.arange(86),\n",
" np.arange(854),\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7",
"metadata": {},
"outputs": [],
"source": [
"result = xarray_reduce(\n",
" ds.areas,\n",
" *by,\n",
" expected_groups=expected_groups,\n",
" func=\"sum\",\n",
")\n",
"result"
]
},
{
"cell_type": "markdown",
"id": "8",
"metadata": {},
"source": [
"Formulating the three admin levels as orthogonal dimensions is quite wasteful --- not all countries have 86 states or 854 counties per state. \n",
"\n",
"We end up with one humoungous 56GB chunk, that is mostly empty.\n",
"\n",
"## We can do better using a sparse array\n",
"\n",
"Since the results are very sparse, we can instruct flox to constructing dense arrays of intermediate results on the full 23 x 5 x 7 x 248 x 86 x 854 output grid.\n",
"\n",
"```python\n",
"ReindexStrategy(\n",
" # do not reindex to the full output grid at the blockwise aggregation stage\n",
" blockwise=False,\n",
" # when combining intermediate results after blockwise aggregation, reindex to the\n",
" # common grid using a sparse.COO array type\n",
" array_type=ReindexArrayType.SPARSE_COO\n",
")\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9",
"metadata": {},
"outputs": [],
"source": [
"from flox import ReindexArrayType, ReindexStrategy\n",
"\n",
"result = xarray_reduce(\n",
" ds.areas,\n",
" *by,\n",
" expected_groups=expected_groups,\n",
" func=\"sum\",\n",
" reindex=ReindexStrategy(\n",
" blockwise=False,\n",
" array_type=ReindexArrayType.SPARSE_COO,\n",
" ),\n",
")\n",
"result"
]
},
{
"cell_type": "markdown",
"id": "10",
"metadata": {},
"source": [
"The output is a sparse array (see the **Data type** section)! Note that the size of this array cannot be estimated without computing it.\n",
"\n",
"The computation runs smoothly with low memory."
]
}
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Loading