Add index_col for spark IO reads. #769

itholic · 2019-09-11T08:36:17Z

Related to #765: I've added index_col for Spark IO reads.

If we already know the index column, we can prevent the creation of a default index by passing the index column explicitly as a function argument.

For example, we can now use 'read_table' as shown below.

>>> ks.read_table('test_table1', index_col='i32')
     i64     f  bhello
i32
0      0   0.0  people
2      1  11.0   hello
0      2  12.0  people
1      0  10.0   hello
0      0  15.0   hello
2      0   5.0   hello
0      3   3.0      yo
1      4   4.0  people
0      1   6.0  people
1      2   7.0      yo
0      3  18.0      yo
1      4  19.0   hello
1      1   1.0   hello
2      2   2.0      yo
1      1  16.0   hello
2      2  17.0      yo
2      3   8.0      yo
0      4   9.0   hello
1      3  13.0      yo
2      4  14.0   hello
>>> ks.read_table('test_table1', index_col=['i32', 'i64'])
            f  bhello
i32 i64
0   0     0.0  people
2   1    11.0   hello
0   2    12.0  people
1   0    10.0   hello
0   0    15.0   hello
2   0     5.0   hello
0   3     3.0      yo
1   4     4.0  people
0   1     6.0  people
1   2     7.0      yo
0   3    18.0      yo
1   4    19.0   hello
    1     1.0   hello
2   2     2.0      yo
1   1    16.0   hello
2   2    17.0      yo
    3     8.0      yo
0   4     9.0   hello
1   3    13.0      yo
2   4    14.0   hello
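
The split between index columns and data columns seen in the two outputs above can be sketched in plain Python. This is only an illustrative sketch, not the code in databricks/koalas/namespace.py: the helper names `normalize_index_col` and `split_columns` are hypothetical, and the sketch assumes index_col may be None, a single column name, or a list of names.

```python
from typing import List, Optional, Tuple, Union


def normalize_index_col(index_col: Optional[Union[str, List[str]]]) -> List[str]:
    """Return index_col as a list of names (hypothetical helper).

    An empty list stands for "no index column given", i.e. the case
    where a default index would still have to be created.
    """
    if index_col is None:
        return []
    if isinstance(index_col, str):
        return [index_col]
    return list(index_col)


def split_columns(
    columns: List[str], index_col: Optional[Union[str, List[str]]] = None
) -> Tuple[List[str], List[str]]:
    """Split column names into (index columns, data columns), keeping order."""
    index_columns = normalize_index_col(index_col)
    missing = [name for name in index_columns if name not in columns]
    if missing:
        raise KeyError("index_col not found in columns: {}".format(missing))
    data_columns = [name for name in columns if name not in index_columns]
    return index_columns, data_columns


# Column names taken from the 'test_table1' example above.
columns = ["i32", "i64", "f", "bhello"]
print(split_columns(columns, index_col="i32"))
# (['i32'], ['i64', 'f', 'bhello'])
print(split_columns(columns, index_col=["i32", "i64"]))
# (['i32', 'i64'], ['f', 'bhello'])
```

Accepting both a string and a list mirrors the two calls shown above; how the real function validates its argument may differ.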

Currently this is only added to the 'read_table' function.

If you think this approach is okay, I'm going to create a PR that covers all the other functions.

itholic · 2019-09-11T08:40:44Z

@ueshin, @HyukjinKwon Could you take a look at this if you are available? :)

codecov-io · 2019-09-11T09:01:27Z

Codecov Report

Merging #769 into master will decrease coverage by <.01%.
The diff coverage is 92.3%.

@@            Coverage Diff             @@
##           master     #769      +/-   ##
==========================================
- Coverage   93.83%   93.82%   -0.01%     
==========================================
  Files          32       32              
  Lines        5744     5753       +9     
==========================================
+ Hits         5390     5398       +8     
- Misses        354      355       +1

Impacted Files	Coverage Δ
databricks/koalas/namespace.py	`81.22% <92.3%> (+0.22%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a1125f9...0c05081.

softagram-bot · 2019-09-13T05:44:48Z

Softagram Impact Report for pull/769 (head commit: `0c05081`)

ueshin

LGTM, pending tests.

ueshin · 2019-09-13T06:30:21Z

Thanks! merging.

Resolves #765. I applied the same logic (as in #769) to all of the functions mentioned in the above issue. So when we work with Spark IO reads and already know the index column name, we can now use these functions with index_col as shown below and avoid the creation of a default index:

```python
>>> ks.read_parquet(path, index_col=['i32', 'i64'])
            f  bhello
i32 i64
0   1     6.0  people
1   2     7.0      yo
```

itholic added 2 commits September 11, 2019 16:36

Add index_col for spark IO reads

8e2ab70

Fix for support index_col as list

b66658e

ueshin reviewed Sep 12, 2019

databricks/koalas/namespace.py Outdated

To keep existing structure of DataFrame

ae68826

ueshin reviewed Sep 13, 2019

databricks/koalas/namespace.py Outdated

Fix build failure related to typing

c2ca3b4

itholic force-pushed the add_index_col branch from 4b7e673 to c2ca3b4 Compare September 13, 2019 04:01

Fix hint Typing

5eec6b5

itholic force-pushed the add_index_col branch from 1179fe3 to 5eec6b5 Compare September 13, 2019 05:13

ueshin reviewed Sep 13, 2019

databricks/koalas/namespace.py Outdated

Add comments to index_map to avoid lint fail

0c05081

ueshin approved these changes Sep 13, 2019

ueshin merged commit b2cfd3f into databricks:master Sep 13, 2019

itholic deleted the add_index_col branch September 13, 2019 06:59

itholic mentioned this pull request Sep 13, 2019

Add index_col for spark IO reads #775

Merged
