Add tensordot to dataarray class also add its test to test_dataarray #731

Merged
shoyer merged 1 commit into pydata:master from feature/tensordot_issue723
Mar 4, 2016

Conversation

deanpospisil

Resolving issue #723

@deanpospisil
Author

It seems that in both unsuccessful cases it was trying to import dask and failed. I can't quite figure out what I could have changed to prevent dask from importing:

>       da = da.chunk()
xarray/test/test_dataarray.py:1618: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
xarray/core/dataarray.py:569: in chunk
    ds = self._to_temp_dataset().chunk(chunks)
        try:
            from dask.base import tokenize
        except ImportError:
>           import dask  # raise the usual error if dask is entirely missing
E           ImportError: No module named 'dask'

@@ -1369,6 +1369,27 @@ def real(self):
    @property
    def imag(self):
        return self._replace(self.variable.imag)

    def tensordot( self, b, dims):
Member

I would call the second argument "other" rather than b

@shoyer
Member

shoyer commented Jan 28, 2016

This also needs docs, minimally including a note in what's new (for v0.7.1), a docstring and a reference in the API docs.

    def tensordot( self, b, dims):
        a = self
        if not (isinstance(a, DataArray) and isinstance(b, DataArray)):
            raise ValueError
Member

This error should be more descriptive -- and probably a TypeError instead

@shoyer
Member

shoyer commented Jan 29, 2016

I'll just note that one other logical way to do this would be to implement tensordot as an xarray.Variable method, and implement DataArray.tensordot in terms of it. This is a little closer to how we've written existing methods in xarray. It would probably help eventually if/when we implement this on Dataset (by consolidating code), but it's not really necessary now.

@shoyer
Member

shoyer commented Jan 29, 2016

The test is failing because we don't have dask installed in all builds -- because dask is an optional dependency, and we want to make things still work when you don't have it installed.

We do have tests set up to skip if dask is not installed, using the @requires_dask decorator. In this case, I would recommend moving the dask-specific tests to one of the existing test classes we've set up in test_dask.py. Take a look at this test case for an example of what this could look like:
https://github.com/pydata/xarray/blob/v0.7.0/xarray/test/test_dask.py#L264-L267
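
For illustration, a minimal sketch of how a skip-if-missing decorator of this kind could be built with the standard library. The helper name requires_module and the test class below are hypothetical, and the requires_dask defined here is only a stand-in, not necessarily how xarray's own decorator is implemented:

```python
import importlib.util
import unittest


def requires_module(name):
    """Return a decorator that skips a test unless `name` is importable.

    Hypothetical helper for illustration only.
    """
    available = importlib.util.find_spec(name) is not None
    return unittest.skipUnless(available, 'requires {}'.format(name))


# Stand-in for a decorator like @requires_dask (assumed behaviour).
requires_dask = requires_module('dask')


@requires_dask
class TestChunkedSum(unittest.TestCase):
    def test_sum_of_ones(self):
        # Imported inside the test so that merely defining the class
        # never needs dask to be installed.
        import dask.array as da
        lazy = da.ones((4, 3), chunks=2)
        self.assertEqual(float(lazy.sum().compute()), 12.0)
```

With this pattern the whole class is reported as skipped, rather than failing with ImportError, on builds where the optional dependency is absent.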

@deanpospisil
Author

I've implemented all your comments, except:

1. In the dask test I'm having a little trouble getting self.assertLazyAndAllClose(eager, lazy) to work with the DataArrays I make. Does it only take transformations of self.eager_array and self.lazy_array?

2. Your comment "Use the generic type(self) instead", which I didn't understand.

3. I need to figure out how to edit the docs.

And I was thinking:
The default (when no dims arg is given) should be to tensordot over all shared dims between DataArrays.
And it should prevent you from not summing over all shared dims, since that will return a DataArray with repeated labels. I think it would be rare for people to want to tensor dot over different dimensions.

Thanks for all your help getting me going!

@shoyer
Member

shoyer commented Jan 29, 2016

In the dask test I'm having a little trouble with: self.assertLazyAndAllClose(eager, lazy) working with the DataArrays I make. Does it only take transformations of:
self.eager_array
self.lazy_array

Nope, it can handle any eager (numpy) and lazy (dask) xarray objects. Something like this should work:

# depending on exactly what syntax we support
eager = self.eager_array.tensordot(self.eager_array[0])
lazy = self.lazy_array.tensordot(self.lazy_array[0])
self.assertLazyAndAllClose(eager, lazy) 

The default (when no dims arg is given) should be to tensordot over all shared dims between DataArrays.
And it should prevent you from not summing over all shared dims, since that will return a DataArray with repeated labels. I think it would be rare for people to want to tensor dot over different dimensions.

This is a good point! Arrays with redundant dimensions are not very useful.

The sane thing to do is to broadcast over dimensions that aren't being summed einsum style (e.g., like i in einsum('ij,ij->i', x, y)). This is similar to the way that @/np.matmul works.

That said, this is difficult to implement with numpy's dot/tensordot, so perhaps it's better to simply error or omit the dims argument entirely for now. Eventually, we might be able to do this using @/np.matmul (only in numpy 1.10 and newer).
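
To make the difference concrete, here is a small sketch in plain numpy (the array names and shapes are made up for illustration). einsum can keep the shared axis i while summing over j, whereas tensordot over j alone keeps the i axis of both inputs:

```python
import numpy as np

x = np.arange(6.0).reshape(2, 3)   # axes (i, j)
y = np.ones((2, 3))                # axes (i, j)

# einsum style: broadcast over i, sum over j -> one value per i.
batched = np.einsum('ij,ij->i', x, y)

# tensordot over j only: the i axis of x and the i axis of y both
# survive, giving a (2, 2) array with the "i" axis repeated.
pairwise = np.tensordot(x, y, axes=([1], [1]))

# The batched result is just the diagonal of the pairwise one.
assert np.allclose(np.diag(pairwise), batched)
```

In labeled terms, the tensordot result would carry the same dimension name twice, which is the redundancy mentioned above.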

I also wonder if perhaps we should rename this from tensordot to simply dot. I don't think we would want to use dot for anything else, and it might also be nice to support @ syntax as an alias for this in Python 3 (again, at some later point).

@deanpospisil
Author

That said, this is difficult to implement with numpy's dot/tensordot, so perhaps it's better to simply error or omit the dims argument entirely for now. Eventually, we might be able to do this using @/np.matmul (only in numpy 1.10 and newer).

Yeah, I'm going to go for omitting the dims entirely for now. I think it makes the function easy to call, and it really covers what I imagine most people would use the function for: getting the relationship between two DataArrays along their common dimensions. That's all I'm going to use it for.

I also wonder if perhaps we should rename this from tensordot to simply dot. I don't think we would want to use dot for anything else, and it might also be nice to support @ syntax as an alias for this in Python 3 (again, at some later point).

Yeah, that makes sense. tensordot in xarray seems like it would be the function that allows more flexibility in how a sum product is applied; dot can be thought of as a reduced version of tensordot that takes no dims argument.
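
A rough sketch of that "dot over all shared dims" behaviour, written against plain numpy arrays with dimension names passed separately. The helper dot_shared and its signature are hypothetical and are not the code in this PR:

```python
import numpy as np


def dot_shared(a, a_dims, b, b_dims):
    """Sum-product of `a` and `b` over every dimension name they share.

    Hypothetical helper: `a_dims` and `b_dims` are tuples of names,
    one per axis of the corresponding array.
    """
    shared = [d for d in a_dims if d in b_dims]
    if not shared:
        raise ValueError('no shared dimensions between {} and {}'
                         .format(a_dims, b_dims))
    a_axes = [a_dims.index(d) for d in shared]
    b_axes = [b_dims.index(d) for d in shared]
    result = np.tensordot(a, b, axes=(a_axes, b_axes))
    # tensordot keeps the leftover axes of `a` first, then those of `b`.
    out_dims = (tuple(d for d in a_dims if d not in shared) +
                tuple(d for d in b_dims if d not in shared))
    return result, out_dims


x = np.arange(24.0).reshape(2, 3, 4)   # dims ('x', 'y', 'z')
y = np.ones((3, 4, 5))                 # dims ('y', 'z', 'w')
out, dims = dot_shared(x, ('x', 'y', 'z'), y, ('y', 'z', 'w'))
```

Because every shared name is summed over, the result can never end up with a repeated dimension label.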


    def dot( self, other):
        """Perform sum product of two DataArrays along their shared dims.
        Equivalent to taking taking tensor dot over all shared dims'''
Member

Should end in ., not '''

Also, add a line break to separate it from the first line.

@shoyer
Member

shoyer commented Feb 2, 2016

Take a look at PEP8 guidelines on spaces -- you are inserting a few extras around your parentheses :)

@shoyer
Member

shoyer commented Feb 2, 2016

a few more comments forthcoming, generally looks very good, though

@deanpospisil
Author

Thanks! Good comments, should help with my next pull request. : )

            raise TypeError('dot only operates on DataArrays.')

        #sum over the common dims
        dims = list(set(s.dims) & set(other.dims) )
Member

no need to turn these into a list -- a set is a perfectly valid iterable for use below, as well

@shoyer
Member

shoyer commented Feb 2, 2016

I recommend running a tool like PEP8 to check the style here: https://pypi.python.org/pypi/pep8


Enhancements
~~~~~~~~~~~~
-xarray version of np.dot :py:meth:`~DataArray.dot`. Takes the sum product over the shared dimensions of two DataArrays. Can be useful for measuring correlation over common dimensions of two DataArrays.
Member

I don't think you need "Can be useful for..."

Dot products are pretty broadly useful in science :)

@shoyer
Member

shoyer commented Feb 2, 2016

You'll need to squash these into one commit (git rebase -i) and then rebase on master to fix what's new. See here for more details: http://pandas.pydata.org/pandas-docs/stable/contributing.html#combining-commits

@deanpospisil force-pushed the feature/tensordot_issue723 branch 2 times, most recently from 685a050 to cc0c24a on February 14, 2016 01:12
@deanpospisil
Author

Alright I think I got rebase to work, apologies for the delay.
There were some comments I wasn't sure how to integrate. I asked about them, but I figured the best way to keep moving forward was to push my best guess at what you meant. So we might need to go through one more round of comments. Thanks for being patient!

@@ -1369,6 +1369,83 @@ def real(self):
    @property
    def imag(self):
        return self._replace(self.variable.imag)

Member

You still have the tensordot method defined -- you should delete it now that it's replaced by dot :)

@shoyer
Member

shoyer commented Feb 15, 2016

I'm pleased to see how this is coming along. No need to apologize for the delay -- this stuff takes time and practice to figure out :).

I'd like to see more test cases so we can be confident this works properly. At the very least, we should test a dot product with multiple dimensions in common. Other edge cases worth checking:

  • all dimensions in common (e.g., dot product of an array with itself)
  • no dimensions in common (this should give a sensible error message -- does it?)
  • dot product with a scalar DataArray (should also error)
  • dot product with a Dataset (verify that this raises NotImplementedError using self.assertRaises or self.assertRaisesRegexp)
  • dot product with a numpy array (this should also error)

In general, it's hard to have too many tests. One indication that you may not have enough tests comes from the "coveralls" status check, which you can find if you click on "Show all checks" at the bottom of the PR. Ideally, each PR should only increase, not decrease code coverage -- the idea is that unit tests should run over every possible code pathway.
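
As a sketch of what tests for the error cases listed above might look like, the following uses stand-in classes rather than the real xarray objects. The Dataset and DataArray classes here are hypothetical and only model the operand checks; the error messages mirror the ones in this PR. On Python 3 the assertion method is spelled assertRaisesRegex:

```python
import unittest


class Dataset(object):
    """Hypothetical stand-in for xarray.Dataset."""


class DataArray(object):
    """Hypothetical stand-in that only models the operand checks."""

    def dot(self, other):
        if isinstance(other, Dataset):
            raise NotImplementedError('dot products are not yet supported '
                                      'with Dataset objects.')
        if not isinstance(other, DataArray):
            raise TypeError('dot only operates on DataArrays.')
        return 0.0  # the actual sum product is omitted in this sketch


class TestDotErrors(unittest.TestCase):
    def test_dataset_operand(self):
        with self.assertRaises(NotImplementedError):
            DataArray().dot(Dataset())

    def test_non_dataarray_operand(self):
        # A plain list stands in for a numpy array here.
        with self.assertRaisesRegex(TypeError, 'only operates on DataArrays'):
            DataArray().dot([1, 2, 3])
```

Each test exercises one error branch, which is the kind of pathway the coverage check is meant to track.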

@deanpospisil force-pushed the feature/tensordot_issue723 branch 4 times, most recently from 68b711b to 6539fcd on February 23, 2016 07:09
@deanpospisil
Author

Ready for review.

            raise NotImplementedError('dot products are not yet supported '
                                      'with Dataset objects.')
        if not isinstance(other, DataArray):
            raise TypeError('dot only operates on DataArrays.')
Member

I think it would be fine to remove lines 1413:1415, and change 1417 to something like:

if not isinstance(other, DataArray):
    raise TypeError('dot only operates on DataArrays, got {}'.format(type(other)))

Member

Yeah, could go either way here. I like raising NotImplementedError because it makes it more obvious to users that this behavior might change in the future. Besides, if we get lucky a user will notice the error and then implement the missing functionality themselves ;).

@shoyer
Member

shoyer commented Feb 26, 2016

I made two very minor suggestions for improvement inline, but I think the code here looks ready!

The only thing we need now is to hook up the documentation! Please add dot to api.rst and a release note in whats-new.rst.

@deanpospisil force-pushed the feature/tensordot_issue723 branch from 6539fcd to ebee516 on March 3, 2016 18:28
@deanpospisil
Author

Alright, I think it's ready!

shoyer added a commit that referenced this pull request Mar 4, 2016
Add tensordot to dataarray class also add its test to test_dataarray
@shoyer shoyer merged commit 4dba0f9 into pydata:master Mar 4, 2016
@shoyer
Member

shoyer commented Mar 4, 2016

OK, let's get this in. @deanpospisil thanks for your contribution!

@jhamman I'm thinking it's probably worth issuing a 0.7.2 release shortly so we can get dot and rolling out into the wild?

@jhamman
Member

jhamman commented Mar 4, 2016

+1 on releasing 0.7.2. If we can get #782 in there too, that would be great.

@deanpospisil deanpospisil deleted the feature/tensordot_issue723 branch March 4, 2016 06:39
4 participants