Restructure DataArray internals to not use a Dataset? #117

shoyer · 2014-05-06T21:02:26Z

It would be nice to allow DataArray objects without named dimensions (#116). But it doesn't make much sense to put arrays without named dimensions into a Dataset.

This suggests that we should change the current model for the internals of DataArray, which currently works by applying operations to an internal Dataset and keeping track of the name of the array of interest.

An alternate representation would use a fixed-size, list-like attribute, coordinates, to keep track of coordinates. Putting a DataArray without named dimensions into a Dataset would raise an error.
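To make the proposal concrete, here is a rough sketch of what such a representation could look like. The names SketchArray and SketchDataset are hypothetical and are not part of the library; the sketch only assumes that an unnamed dimension is marked with None and that the coordinates list has exactly one entry per dimension.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Sequence


@dataclass
class SketchArray:
    """Hypothetical stand-in for a DataArray that owns its coordinates directly."""

    values: List[Any]                       # the data itself (shape is not validated here)
    dims: Sequence[Optional[str]]           # None marks an unnamed dimension (assumption)
    coordinates: List[Optional[List[Any]]]  # fixed size: one entry per dimension

    def __post_init__(self):
        # The coordinates list is tied to the number of dimensions.
        if len(self.coordinates) != len(self.dims):
            raise ValueError("need exactly one coordinates entry per dimension")


class SketchDataset:
    """Hypothetical stand-in for a Dataset that only accepts named dimensions."""

    def __init__(self):
        self.variables = {}

    def __setitem__(self, name, array):
        # Arrays without named dimensions are rejected, as proposed above.
        if any(dim is None for dim in array.dims):
            raise ValueError("cannot add an array with unnamed dimensions to a dataset")
        self.variables[name] = array


named = SketchArray([1, 2, 3], dims=("x",), coordinates=[[10, 20, 30]])
unnamed = SketchArray([1, 2, 3], dims=(None,), coordinates=[None])

ds = SketchDataset()
ds["foo"] = named  # accepted: every dimension has a name
```

In this sketch, assigning unnamed to ds raises a ValueError, while the array itself remains perfectly usable on its own.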

Positives:

This is a more transparent and obvious model for directly working with DataArray objects.
It will simplify making DataArrays without named dimensions.
It will make choices like when to drop other dataset variables in a data array operation more obvious: other variables will always be dropped, because we won't bother keeping track of a dataset anymore.
Related to the first point, this will have positive performance implications for array indexing, since it will be more obvious exactly which arrays you are indexing (currently indexing indexes every array in a dataset).
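The indexing point can be illustrated with a toy comparison. This is only a sketch using plain dicts and lists; index_via_dataset and index_directly are hypothetical helpers, not library functions, and the count they return is just a proxy for the work done.

```python
def index_via_dataset(dataset, name, key):
    """Dataset-backed model: index every variable, then pick the one of interest."""
    indexed = {var: values[key] for var, values in dataset.items()}
    return indexed[name], len(indexed)


def index_directly(values, key):
    """Proposed model: index only the array that was asked for."""
    return values[key], 1


dataset = {
    "foo": [1, 2, 3, 4],
    "bar": [10, 20, 30, 40],
    "baz": [5, 6, 7, 8],
}
key = slice(1, 3)

via_dataset, n_dataset = index_via_dataset(dataset, "foo", key)
direct, n_direct = index_directly(dataset["foo"], key)
```

Both calls return the same values for foo, but the dataset-backed version touches all three arrays while the direct version touches one.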

Negatives:

This will certainly add lines of code and complexity. Making an operation work for both Datasets and DataArrays will no longer be quite so simple.
It will no longer be as straightforward to access other related variables in a DataArray. In particular, it won't work to do ds['foo'].groupby('bar') if "bar" is not a dimension in ds['foo'], unless we keep around some sort of reference to the dataset in the array. Perhaps this tradeoff is worth it: ds['foo'].groupby(ds['bar']) isn't so terrible.
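The groupby tradeoff can be sketched without any reference back to a dataset. The helper group_by_labels below is hypothetical and stands in for what ds['foo'].groupby(ds['bar']) would need: the grouping values are passed in explicitly, so the array never has to look up "bar" by name.

```python
from collections import defaultdict


def group_by_labels(values, labels):
    """Group values by explicitly supplied labels (no dataset lookup needed)."""
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return dict(groups)


# A plain dict stands in for a dataset here (assumption for illustration).
ds = {"foo": [1.0, 2.0, 3.0, 4.0], "bar": ["a", "b", "a", "b"]}

# Analogous to ds['foo'].groupby(ds['bar']): the caller supplies both arrays.
groups = group_by_labels(ds["foo"], ds["bar"])
```

Supporting the string form, groupby('bar'), would require the array to keep a reference to the dataset it came from, which is exactly the coupling this proposal would remove.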

CC @mrocklin, I brought this up briefly in the context of #116 during PyData.

shoyer · 2014-09-02T03:46:15Z

Closing this as "won't fix". As of #197/#221, DataArray internals have been restructured to (a) not expose the underlying dataset publicly and (b) use the notion of "coordinates" (vs. variables) which clarifies things tremendously.

shoyer added the question label May 6, 2014

shoyer mentioned this issue Jun 9, 2014 in Data array constructor #149 (merged)

shoyer added the wontfix label Sep 2, 2014

shoyer closed this as completed Sep 2, 2014
