@HyukjinKwon
Member

This PR implements __array_ufunc__ (see https://docs.scipy.org/doc/numpy/reference/ufuncs.html#output-type-determination) so that some basic ufuncs can run against Koalas Series and Index. For now this covers the ufuncs that map onto existing dunder APIs.

>>> import databricks.koalas as ks
>>> import numpy as np
>>> kdf = ks.range(10)
>>> kdf = np.add(kdf.id, kdf.id)
>>> type(kdf)
<class 'databricks.koalas.series.Series'>
>>> kdf
0     0
1     2
2     4
3     6
4     8
5    10
6    12
7    14
8    16
9    18
Name: id, dtype: int64
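
As a rough illustration of the dispatch described above, here is a minimal sketch on a toy class. `ToySeries`, `_binary`, and the `_UFUNC_TO_DUNDER` mapping are hypothetical names made up for this example and are not the Koalas code; only the `__array_ufunc__(ufunc, method, *inputs, **kwargs)` protocol itself comes from NumPy.

```python
import operator

import numpy as np


class ToySeries:
    """Hypothetical stand-in for a Koalas Series; it just wraps a list."""

    # Assumed mapping from ufunc names to dunder methods (illustrative only).
    _UFUNC_TO_DUNDER = {"add": "__add__", "multiply": "__mul__"}

    def __init__(self, values):
        self.values = list(values)

    def _binary(self, other, op):
        # Broadcast a scalar operand to the length of this series.
        if isinstance(other, ToySeries):
            others = other.values
        else:
            others = [other] * len(self.values)
        return ToySeries(op(a, b) for a, b in zip(self.values, others))

    def __add__(self, other):
        return self._binary(other, operator.add)

    def __mul__(self, other):
        return self._binary(other, operator.mul)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Handle only plain calls like np.add(x, y) with self as first operand.
        dunder = self._UFUNC_TO_DUNDER.get(ufunc.__name__)
        if method != "__call__" or kwargs or dunder is None or inputs[0] is not self:
            return NotImplemented
        return getattr(self, dunder)(*inputs[1:])


doubled = np.add(ToySeries([0, 1, 2]), ToySeries([0, 1, 2]))
print(type(doubled).__name__, doubled.values)
```

Returning `NotImplemented` for anything unmapped lets NumPy raise a `TypeError` itself, so unsupported ufuncs such as `np.sqrt` fail loudly in this sketch instead of silently converting the data.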

@HyukjinKwon HyukjinKwon requested a review from ueshin December 3, 2019 02:29
@codecov-io

codecov-io commented Dec 3, 2019

Codecov Report

Merging #1096 into master will decrease coverage by 0.02%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master    #1096      +/-   ##
==========================================
- Coverage    95.2%   95.18%   -0.03%     
==========================================
  Files          34       34              
  Lines        6889     6913      +24     
==========================================
+ Hits         6559     6580      +21     
- Misses        330      333       +3
Impacted Files Coverage Δ
databricks/koalas/base.py 94.88% <85.71%> (-1.02%) ⬇️
databricks/koalas/series.py 96.44% <0%> (+0.01%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8a7a640...10c7845. Read the comment docs.

@softagram-bot

Softagram Impact Report for pull/1096 (head commit: 10c7845)

⚠️ Copy paste found

ℹ️ test_numpy_compat.py: Copy paste fragment on line 24 shared with ../test_dataframe.py:


    @property
    def pdf(self):
        return pd.DataFrame({
            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6, 3, 2, 1, ...(truncated 160 chars)

ℹ️ test_numpy_compat.py: Copy paste fragment on line 24 shared with ../test_dataframe.py, ../test_indexes.py:


    @property
    def pdf(self):
        return pd.DataFrame({
            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6, 3, 2, 1, ...(truncated 160 chars)

ℹ️ test_numpy_compat.py: Copy paste fragment on line 27 shared with ../test_dataframe.py, ../test_indexes.py, ../test_ops_on_diff_frames.py:

        return pd.DataFrame({
            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6, 3, 2, 1, 0, 0, 0],
        }, index=[0, 1, 3, 5, 6, 8, 9, 9, 9])

    @propert...(truncated 20 chars)

ℹ️ test_numpy_compat.py: Copy paste fragment on line 27 shared with ../test_dataframe.py, ../test_indexes.py, ../test_ops_on_diff_frames.py:

        return pd.DataFrame({
            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6, 3, 2, 1, 0, 0, 0],
        }, index=[0, 1, 3, 5, 6, 8, 9, 9, 9])

ℹ️ test_numpy_compat.py: Copy paste fragment on line 24 shared with ../test_dataframe.py, ../test_indexes.py, ../test_indexing.py:


    @property
    def pdf(self):
        return pd.DataFrame({
            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6, 3, 2, 1, ...(truncated 9 chars)

ℹ️ test_numpy_compat.py: Copy paste fragment on line 27 shared with ../test_dataframe.py, ../test_indexes.py, ../test_ops_on_diff_frames.py:

        return pd.DataFrame({
            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6, 3, 2, 1, 0, 0, 0],

ℹ️ test_numpy_compat.py: Copy paste fragment on line 28 shared with ../test_dataframe.py, ../test_indexes.py, ../test_ops_on_diff_frames.py:

            'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
            'b': [4, 5, 6, 3, 2, 1, 0, 0, 0],
        }, index=[0, 1, 3, 5, 6, 8, 9, 9, 9])

ℹ️ base.py: Copy paste fragment inside the same file on lines 709, 772:

        if axis != 0:
            raise ValueError('axis should be either 0 or \"index\" currently.')

        sdf = self._internal._sdf.select(self._scol)
        col...(truncated 380 chars)

Now that you are on the file, it would be easier to pay back some tech. debt.


@HyukjinKwon
Member Author

Tests passed

Collaborator

@ueshin ueshin left a comment

super cool!
LGTM.

if result is not NotImplemented:
    return result
else:
    # TODO: support more APIs?
Collaborator

Maybe we can delegate to pandas UDF?

Member Author

Oh, that's a nice suggestion. Let me investigate this a bit more. It will only work when the output is n-to-n, but I'm sure there will be such cases.

Collaborator

Yeah, if we can delegate to pandas UDF in a general way, we can take time to add more Spark native functions like np.sqrt or np.log.
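
To make the n-to-n constraint from this thread concrete, here is a small sketch of a per-batch function that such a fallback might wrap. `apply_ufunc_to_batch` is a hypothetical helper, and a NumPy array stands in for the pandas Series batch a pandas UDF would receive; the Spark wiring is deliberately left out.

```python
import numpy as np


def apply_ufunc_to_batch(ufunc, batch):
    """Hypothetical per-batch body for an n-to-n fallback."""
    out = np.asarray(ufunc(batch))
    # An n-to-n operation must return exactly one value per input row.
    if out.shape != batch.shape:
        raise ValueError("ufunc result is not n-to-n")
    return out


batch = np.array([0.0, 1.0, 4.0, 9.0])
roots = apply_ufunc_to_batch(np.sqrt, batch)
print(roots)
```

An elementwise call such as `np.sqrt` passes the shape check, while a reduction such as `np.add.reduce` collapses the batch to a scalar and is rejected, which is the case the fallback would not cover.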

    name = flipped.get(op_name, "__r{}__".format(op_name))
    return getattr(self, name, not_implemented)(inputs[0])
else:
    return NotImplemented
Collaborator

We will be able to add more functions supported in Spark natively.
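
For readers following the snippet under review, the reflected-name lookup can be sketched in isolation. The contents of `flipped` below are an assumption (the actual mapping is not shown in this conversation), and `reflected_name` is a hypothetical helper; the idea is that comparisons have no `__r*__` form and flip to their mirrored comparison instead.

```python
def reflected_name(op_name):
    """Return the dunder to call on the right-hand operand (sketch only)."""
    # Assumed mapping: comparisons mirror rather than take an "r" prefix.
    flipped = {
        "lt": "__gt__",
        "le": "__ge__",
        "gt": "__lt__",
        "ge": "__le__",
        "eq": "__eq__",
        "ne": "__ne__",
    }
    return flipped.get(op_name, "__r{}__".format(op_name))


# With plain ints: 10 - 3 evaluated from the right-hand operand's side.
difference = getattr(3, reflected_name("sub"))(10)
print(reflected_name("add"), reflected_name("lt"), difference)
```

This is the path taken when the object implementing `__array_ufunc__` is the second input, so the first input is passed as the argument of the reflected method.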

@ueshin
Collaborator

ueshin commented Dec 3, 2019

Thanks! I'll merge this for now as a basis for NumPy ufunc compatibility.

@ueshin ueshin merged commit be8d5b0 into databricks:master Dec 3, 2019
ueshin pushed a commit that referenced this pull request Dec 4, 2019
This PR completes NumPy's ufunc support (followup of #1096).

See also https://docs.scipy.org/doc/numpy/reference/arrays.classes.html#standard-array-subclasses

E.g.:

```python
>>> import databricks.koalas as ks
>>> import numpy as np
>>> kdf = ks.range(10)
>>> kser = np.sqrt(kdf.id)
>>> type(kser)
<class 'databricks.koalas.series.Series'>
>>> kser
0    0.000000
1    1.000000
2    1.414214
3    1.732051
4    2.000000
5    2.236068
6    2.449490
7    2.645751
8    2.828427
9    3.000000
```
@HyukjinKwon HyukjinKwon deleted the np-compat branch September 11, 2020 07:52
rising-star92 added a commit to rising-star92/databricks-koalas that referenced this pull request Jan 27, 2023