Skip to content

Conversation

@ueshin
Copy link
Collaborator

@ueshin ueshin commented Jan 27, 2020

Koalas doesn't guarantee the row ordering, so head could return some rows from distributed partitions and the result is not deterministic, which might confuse users. (See #411, #1064, or #1219)

>>> import databricks.koalas as ks
>>> spark = ks.utils.default_session({'spark.master': 'local-cluster[4, 2, 1024]'})
>>> spark.conf.set('spark.sql.execution.arrow.enabled', True)
>>> kdf = ks.DataFrame({'a': range(10)})
>>> pdf = kdf.to_pandas().head(3)
>>> for _ in range(100):
...     assert kdf.head(3).to_pandas().equals(pdf), 'NOT equals.'
... else:
...     print('equals.')
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
AssertionError: NOT equals.

If compute.ordered_head is set to True, Koalas performs natural ordering beforehand, but it will cause a performance overhead.

>>> ks.options.compute.ordered_head = True
>>>
>>> for _ in range(100):
...     assert kdf.head(3).to_pandas().equals(pdf), 'NOT equals.'
... else:
...     print('equals.')
...
equals.

I don't think we should enable it by default due to the overhead, but it will help users debug.

@codecov-io
Copy link

codecov-io commented Jan 27, 2020

Codecov Report

Merging #1231 into master will increase coverage by 0.25%.
The diff coverage is 80%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1231      +/-   ##
==========================================
+ Coverage   95.18%   95.43%   +0.25%     
==========================================
  Files          35       35              
  Lines        7265     7740     +475     
==========================================
+ Hits         6915     7387     +472     
- Misses        350      353       +3
Impacted Files Coverage Δ
databricks/koalas/config.py 98.98% <ø> (ø) ⬆️
databricks/koalas/frame.py 97.48% <80%> (+0.51%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 77da5b9...76a7f7a. Read the comment docs.

@HyukjinKwon
Copy link
Member

Yeah, I think adding this configuration wouldn't hurt.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth enough to try.

@ueshin
Copy link
Collaborator Author

ueshin commented Jan 28, 2020

Thanks! I'd merge this for now. Please feel free to leave comments if any.

@ueshin ueshin merged commit dd41b7f into databricks:master Jan 28, 2020
@ueshin ueshin deleted the head_ordering branch January 28, 2020 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants