Contributor

@itholic commented Oct 13, 2019

This PR implements the where function for Series, following pandas Series.where (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.where.html).

>>> s1 = ks.Series([0, 1, 2, 3, 4])
>>> s2 = ks.Series([100, 200, 300, 400, 500])
>>> s1.where(s1 > 0)
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
Name: 0, dtype: float64


>>> s1.where(s1 > 1, 10)
0    10
1    10
2     2
3     3
4     4
Name: 0, dtype: int64

>>> s1.where(s1 > 1, s1 + 50)
0    50
1    51
2     2
3     3
4     4
Name: 0, dtype: int64


>>> s1.where(s1 > 1, s2)
0    100
1    200
2      2
3      3
4      4
Name: 0, dtype: int64

codecov-io commented Oct 13, 2019

Codecov Report

Merging #922 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #922      +/-   ##
==========================================
+ Coverage   94.52%   94.53%   +<.01%     
==========================================
  Files          34       34              
  Lines        6465     6476      +11     
==========================================
+ Hits         6111     6122      +11     
  Misses        354      354

Impacted Files                        Coverage Δ
databricks/koalas/missing/series.py   100% <ø> (ø) ⬆️
databricks/koalas/series.py           96.15% <100%> (+0.05%) ⬆️
databricks/koalas/internal.py         96.38% <0%> (ø) ⬆️
databricks/koalas/namespace.py        86.83% <0%> (ø) ⬆️
databricks/koalas/frame.py            96.02% <0%> (ø) ⬆️
databricks/koalas/indexes.py          96.44% <0%> (+0.02%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c8dcb64...b620849.

Collaborator

@ueshin left a comment

Shall we add more tests in test_series to check various patterns? e.g.,

>>> s1 = pd.Series([0, 1, 2, 3, 4])
>>> s2 = pd.Series([100, 200, 300, 400, 500])

>>> s1.where(s2 > 100)
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

and negative cases?
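
The patterns asked for here might look like the following sketch. It uses plain pandas only as the reference behavior; in the Koalas suite each expected value would be compared against the matching ks.Series call, which is omitted because it needs a Spark session. Reading "negative cases" as conditions that match no rows or every row is an assumption, and the variable names are illustrative.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3, 4])
s2 = pd.Series([100, 200, 300, 400, 500])

# Condition built from a different series: rows where s2 <= 100 become NaN.
by_other_series = s1.where(s2 > 100)

# Condition that matches no row: every value is taken from `other`.
none_match = s1.where(s1 > 10, s2)

# Condition that matches every row: the series comes back unchanged.
all_match = s1.where(s1 >= 0, s2)
```

Each of these results would then serve as the expected value for the corresponding Koalas expression.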


return self._with_new_scol(current)

def where(self, cond, other=np.nan):

Member

@itholic, it seems pandas shares the same implementation internally. After this PR is merged, could you move this into the _Frame class and implement DataFrame.where as well?
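
One way to picture the suggested refactoring is a base class that owns where while each subclass supplies the element-wise plumbing. The sketch below is pure Python and does not use Koalas internals: FrameBase, MiniSeries, MiniFrame, and _select are hypothetical stand-ins, not the actual _Frame API, and None stands in for NaN.

```python
class FrameBase:
    """Hypothetical stand-in for a shared base class such as _Frame."""

    def where(self, cond, other=None):
        # Shared entry point: validate once, then delegate to the subclass.
        if type(cond) is not type(self):
            raise TypeError("cond must have the same type as the caller")
        return self._select(cond, other)

    def _select(self, cond, other):
        raise NotImplementedError


class MiniSeries(FrameBase):
    def __init__(self, values):
        self.values = list(values)

    def _select(self, cond, other):
        # Keep each value where the condition holds, else use the scalar `other`.
        return MiniSeries(v if c else other for v, c in zip(self.values, cond.values))


class MiniFrame(FrameBase):
    def __init__(self, columns):
        self.columns = {name: list(vals) for name, vals in columns.items()}

    def _select(self, cond, other):
        # Apply the same rule column by column.
        return MiniFrame({
            name: [v if c else other for v, c in zip(vals, cond.columns[name])]
            for name, vals in self.columns.items()
        })


s = MiniSeries([0, 1, 2]).where(MiniSeries([False, True, True]), 10)
f = MiniFrame({"a": [0, 1]}).where(MiniFrame({"a": [True, False]}))
```

The sketch only handles a scalar other; supporting a series or frame as other would need the same kind of per-subclass hook.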

Contributor Author

Okay, I'll work on it right after this PR is merged.

@HyukjinKwon
Member

Seems fine to me otherwise.

@softagram-bot
Softagram Impact Report for pull/922 (head commit: b620849)

@HyukjinKwon merged commit 709b928 into databricks:master on Oct 28, 2019
@itholic deleted the s_where branch on November 6, 2019 05:32

# | 4| 4| true| 500|
# +-----------------+---+----------------+-----------------+
data_col_name = self._internal.column_name_for(self._internal.column_index[0])
index_column = self._internal.index_columns[0]

Member

@itholic, I think this doesn't support multi-level index cases. Can you fix this please?

Member

There can be multiple index_columns, so we cannot use only the first one.
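
To illustrate the concern, the sketch below aligns two sets of rows on every index column instead of only the first. It is plain Python over dictionaries, not the Spark join used in the PR; align_on_index and the column names lvl0, lvl1, value, and other are hypothetical.

```python
def align_on_index(left_rows, right_rows, index_columns):
    """Pair each left row with the right row that matches on all index columns."""
    def key(row):
        return tuple(row[col] for col in index_columns)

    right_by_key = {key(row): row for row in right_rows}
    return [(row, right_by_key.get(key(row))) for row in left_rows]


left = [
    {"lvl0": "a", "lvl1": 1, "value": 0},
    {"lvl0": "a", "lvl1": 2, "value": 1},
]
right = [
    {"lvl0": "a", "lvl1": 1, "other": 100},
    {"lvl0": "a", "lvl1": 2, "other": 200},
]

# Using both levels pairs each row with its true counterpart.
full = align_on_index(left, right, ["lvl0", "lvl1"])

# Using only the first level makes both right rows share the key ("a",).
# In this dictionary-based sketch the last one wins, so both left rows are
# paired with the same right row; a real join would mis-align differently.
first_only = align_on_index(left, right, ["lvl0"])
```

With a two-level index, matching on the first level alone cannot tell the rows apart, which is why every index column has to take part in the alignment.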

set_option("compute.ops_on_diff_frames", True)

@classmethod
def tearDownClass(cls):

Member

@itholic, please disable this. compute.ops_on_diff_frames is disabled by default because it is expensive. We should move these test cases into OpsOnDiffFramesEnabledTest.
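
The structure being asked for could be sketched as below: the option is switched on once for a dedicated test class and restored afterwards, so other test classes keep the default. To run without Spark, the option registry here is a hypothetical stand-in; set_option, reset_option, and get_option only mirror the names seen in the diff and are not the Koalas implementations.

```python
import unittest

# Hypothetical stand-in for the Koalas option registry.
_DEFAULTS = {"compute.ops_on_diff_frames": False}
_options = dict(_DEFAULTS)


def set_option(key, value):
    _options[key] = value


def reset_option(key):
    _options[key] = _DEFAULTS[key]


def get_option(key):
    return _options[key]


class OpsOnDiffFramesEnabledTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Enable the expensive option only for the tests that need it.
        set_option("compute.ops_on_diff_frames", True)

    @classmethod
    def tearDownClass(cls):
        # Restore the default so later test classes are unaffected.
        reset_option("compute.ops_on_diff_frames")

    def test_option_is_on_inside_class(self):
        self.assertTrue(get_option("compute.ops_on_diff_frames"))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(OpsOnDiffFramesEnabledTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests that combine two different series, such as s1.where(s1 > 1, s2), would live in this class, while the general series tests leave the option untouched.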

kser.drop_duplicates().sort_values())

def test_where(self):
pser1 = pd.Series([0, 1, 2, 3, 4], name=0)

Member

Can you add a test for when compute.ops_on_diff_frames is off? I think we can still use a scalar value, such as an int, for other.
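
As a reference for such a test, the pandas side could look like the sketch below. Both the condition and the replacement come from the series itself or from a scalar, so nothing here combines two different frames. The Koalas assertion is omitted because it needs a Spark session, and the names expected_scalar and expected_default are illustrative.

```python
import pandas as pd

pser1 = pd.Series([0, 1, 2, 3, 4], name=0)

# Scalar replacement: values failing the condition become -1.
expected_scalar = pser1.where(pser1 > 3, -1)

# Default replacement: values failing the condition become NaN.
expected_default = pser1.where(pser1 > 3)
```

A Koalas test run with the option left at its default would compare the same expressions on the ks.Series against these values.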

5 participants