diff --git a/ci/docs.yml b/ci/docs.yml index da9833133..572d0992d 100644 --- a/ci/docs.yml +++ b/ci/docs.yml @@ -15,6 +15,7 @@ dependencies: - matplotlib-base - myst-parser - myst-nb + - sparse - sphinx - sphinx-remove-toctrees - furo>=2024.08 diff --git a/docs/source/user-stories.md b/docs/source/user-stories.md index eb0cf3334..a3abe1efd 100644 --- a/docs/source/user-stories.md +++ b/docs/source/user-stories.md @@ -10,4 +10,5 @@ user-stories/climatology-hourly-cubed.ipynb user-stories/custom-aggregations.ipynb user-stories/nD-bins.ipynb + user-stories/large-zonal-stats.ipynb ``` diff --git a/docs/source/user-stories/large-zonal-stats.ipynb b/docs/source/user-stories/large-zonal-stats.ipynb new file mode 100644 index 000000000..f4a9acdc8 --- /dev/null +++ b/docs/source/user-stories/large-zonal-stats.ipynb @@ -0,0 +1,194 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0", + "metadata": {}, + "source": [ + "# Large Raster Zonal Statistics\n", + "\n", + "\"Zonal statistics\" spans a large range of problems. \n", + "\n", + "This one is inspired by [this issue](https://github.com/xarray-contrib/flox/issues/428), where a cell areas raster is aggregated over 6 different groupers and summed. Each array involved has shape 560_000 x 1440_000 and chunk size 10_000 x 10_000. Three of the groupers `tcl_year`, `drivers`, and `tcd_thresholds` have a small number of group labels (23, 5, and 7). \n", + "\n", + "The last 3 groupers are [GADM](https://gadm.org/) level 0, 1, 2 administrative area polygons rasterized to this grid; with 248, 86, and 854 unique labels respectively (arrays `adm0`, `adm1`, and `adm2`). These correspond to country-level, state-level, and county-level administrative boundaries. " + ] + }, + { + "cell_type": "markdown", + "id": "1", + "metadata": {}, + "source": [ + "## Example dataset" + ] + }, + { + "cell_type": "markdown", + "id": "2", + "metadata": {}, + "source": [ + "Here is a representative version of the dataset (in terms of size and chunk sizes)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3", + "metadata": {}, + "outputs": [], + "source": [ + "import dask.array\n", + "import numpy as np\n", + "import xarray as xr\n", + "\n", + "from flox.xarray import xarray_reduce\n", + "\n", + "sizes = {\"y\": 560_000, \"x\": 1440_000}\n", + "chunksizes = {\"y\": 2_000, \"x\": 2_000}\n", + "dims = (\"y\", \"x\")\n", + "shape = tuple(sizes[d] for d in dims)\n", + "chunks = tuple(chunksizes[d] for d in dims)\n", + "\n", + "ds = xr.Dataset(\n", + " {\n", + " \"areas\": (dims, dask.array.ones(shape, chunks=chunks, dtype=np.float32)),\n", + " \"tcl_year\": (\n", + " dims,\n", + " 1 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32),\n", + " ),\n", + " \"drivers\": (dims, 2 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n", + " \"tcd_thresholds\": (\n", + " dims,\n", + " 3 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32),\n", + " ),\n", + " \"adm0\": (dims, 4 + dask.array.ones(shape, chunks=chunks, dtype=np.float32)),\n", + " \"adm1\": (dims, 5 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n", + " \"adm2\": (dims, 6 + dask.array.zeros(shape, chunks=chunks, dtype=np.float32)),\n", + " }\n", + ")\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "id": "4", + "metadata": {}, + "source": [ + "## Zonal Statistics" + ] + }, + { + "cell_type": "markdown", + "id": "5", + "metadata": {}, + "source": [ + "Next define the grouper arrays and expected group labels" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6", + "metadata": {}, + "outputs": [], + "source": [ + "by = (ds.tcl_year, ds.drivers, ds.tcd_thresholds, ds.adm0, ds.adm1, ds.adm2)\n", + "expected_groups = (\n", + " np.arange(23),\n", + " np.arange(1, 6),\n", + " np.arange(1, 8),\n", + " np.arange(248),\n", + " np.arange(86),\n", + " np.arange(854),\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7", + "metadata": {}, + "outputs": [], + "source": [ + "result = xarray_reduce(\n", + " ds.areas,\n", + " *by,\n", + " expected_groups=expected_groups,\n", + " func=\"sum\",\n", + ")\n", + "result" + ] + }, + { + "cell_type": "markdown", + "id": "8", + "metadata": {}, + "source": [ + "Formulating the three admin levels as orthogonal dimensions is quite wasteful --- not all countries have 86 states or 854 counties per state. \n", + "\n", + "We end up with one humoungous 56GB chunk, that is mostly empty.\n", + "\n", + "## We can do better using a sparse array\n", + "\n", + "Since the results are very sparse, we can instruct flox to constructing dense arrays of intermediate results on the full 23 x 5 x 7 x 248 x 86 x 854 output grid.\n", + "\n", + "```python\n", + "ReindexStrategy(\n", + " # do not reindex to the full output grid at the blockwise aggregation stage\n", + " blockwise=False,\n", + " # when combining intermediate results after blockwise aggregation, reindex to the\n", + " # common grid using a sparse.COO array type\n", + " array_type=ReindexArrayType.SPARSE_COO\n", + ")\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9", + "metadata": {}, + "outputs": [], + "source": [ + "from flox import ReindexArrayType, ReindexStrategy\n", + "\n", + "result = xarray_reduce(\n", + " ds.areas,\n", + " *by,\n", + " expected_groups=expected_groups,\n", + " func=\"sum\",\n", + " reindex=ReindexStrategy(\n", + " blockwise=False,\n", + " array_type=ReindexArrayType.SPARSE_COO,\n", + " ),\n", + ")\n", + "result" + ] + }, + { + "cell_type": "markdown", + "id": "10", + "metadata": {}, + "source": [ + "The output is a sparse array (see the **Data type** section)! Note that the size of this array cannot be estimated without computing it.\n", + "\n", + "The computation runs smoothly with low memory." + ] + } + ], + "metadata": { + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}