Handle int64 columns with missing data in SQL Lab #8226

betodealmeida · 2019-09-14T22:18:55Z

SUMMARY

When a column has int64 integers and missing data, Pandas will cast it to float64, resulting in loss of precision and possibly returning incorrect numbers.
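A minimal reproduction of the cast, as a sketch: the value `2**53 + 1` is picked for illustration because it is the smallest positive integer that float64 cannot represent exactly. It is not taken from the linked issue.

```python
import pandas as pd

# Illustrative value: 2**53 + 1 has no exact float64 representation.
big = 9007199254740993

# Without missing data, pandas keeps int64 and the value stays exact.
exact = pd.Series([big, 1])

# With a missing value, pandas falls back to float64 to hold NaN.
lossy = pd.Series([big, None])

print(exact.dtype)                 # int64
print(lossy.dtype)                 # float64
print(int(lossy.iloc[0]) == big)   # False: the value was rounded
```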

This PR fixes the bug by adding a method to the DB engine specs that returns a dtype based on the cursor description, currently implemented in Presto only. With the dtype, we can create a Pandas Series for each column, and create a DataFrame that has the proper types.

Note that in order to represent the column correctly we need to use a nullable data type, introduced in Pandas 0.24.0. Unfortunately, PyArrow is unable to serialize the resulting data frame, so msgpack has to be disabled.
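A rough sketch of the approach described above, not the code in this PR: `TYPE_MAP`, `dtypes_from_description` and `build_dataframe` are hypothetical names, and the sketch assumes the driver reports the type name as a string in the second field of each DB-API `cursor.description` entry.

```python
import pandas as pd

# Hypothetical mapping from database type names to pandas dtypes.
# "Int64" (capital I) is the nullable integer dtype.
TYPE_MAP = {"bigint": "Int64", "double": "float64", "varchar": "object"}


def dtypes_from_description(description):
    """Map each column name to a pandas dtype, defaulting to object."""
    # Assumption: col[0] is the column name and col[1] is a type-name string.
    return {col[0]: TYPE_MAP.get(col[1], "object") for col in description}


def build_dataframe(rows, description):
    """Build one typed Series per column, then assemble the DataFrame."""
    names = [col[0] for col in description]
    dtypes = dtypes_from_description(description)
    # Transpose row tuples into per-column sequences; handle empty results.
    columns = list(zip(*rows)) if rows else [() for _ in names]
    data = {
        name: pd.Series(list(values), dtype=dtypes[name])
        for name, values in zip(names, columns)
    }
    return pd.DataFrame(data, columns=names)


description = [
    ("id", "bigint", None, None, None, None, None),
    ("name", "varchar", None, None, None, None, None),
]
rows = [(9007199254740993, "a"), (None, "b")]
df = build_dataframe(rows, description)
print(df.dtypes)
```

Building each Series with an explicit dtype avoids letting pandas infer float64 for integer columns that contain nulls.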

TEST PLAN

Added unit test.

ADDITIONAL INFORMATION

Has associated issue: Pandas casting int64 to float64, misrepresenting value #8225
Changes UI
Requires DB Migration.
Confirm DB Migration upgrade and downgrade tested.
Introduces new feature or API
Removes existing feature or API

REVIEWERS

@villebro @robdiciuccio

betodealmeida · 2019-09-15T03:40:52Z

superset/dataframe.py

@@ -183,7 +205,7 @@ def agg_func(cls, dtype, column_name):
        if (
            hasattr(dtype, "type")
            and issubclass(dtype.type, np.generic)
-            and np.issubdtype(dtype, np.number)
+            and dtype._is_numeric


Actually, this method is Pandas specific. I'll have to use this and the previous one together.
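One way the two checks might be combined, as a sketch only: `is_numeric` is a hypothetical helper, not the code that was merged. It assumes NumPy dtypes go through `np.issubdtype` while Pandas extension dtypes expose the private `_is_numeric` attribute shown in the diff.

```python
import numpy as np
import pandas as pd


def is_numeric(dtype):
    """Return True for numeric NumPy dtypes and numeric Pandas extension dtypes."""
    if isinstance(dtype, np.dtype):
        # Plain NumPy dtype: the original check applies.
        return bool(np.issubdtype(dtype, np.number))
    # Pandas extension dtype (e.g. Int64): rely on its private flag,
    # treating anything without the attribute as non-numeric.
    return bool(getattr(dtype, "_is_numeric", False))


print(is_numeric(np.dtype("int64")))   # True
print(is_numeric(pd.Int64Dtype()))     # True
print(is_numeric(np.dtype("object")))  # False
```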

mistercrunch · 2019-09-16T04:03:18Z

@robdiciuccio ^^^

robdiciuccio · 2019-09-16T19:05:03Z

The dataframe implementation looks good, as does the Presto engine dtype fix, but does this fully address #8225 if only Presto is handled? Are there other databases this fix should be implemented for (even in a separate PR)?

Also curious about your thoughts on the feasibility of the PyArrow workaround here.

betodealmeida · 2019-09-17T06:30:38Z

> The dataframe implementation looks good, as does the Presto engine dtype fix, but does this fully address #8225 if only Presto is handled? Are there other databases this fix should be implemented for (even in a separate PR)?

You're right, this only fixes Presto. I'll do a separate PR addressing the other DBs.

> Also curious about your thoughts on the feasibility of the PyArrow workaround here.

Looks like it would solve our problem, but I don't know if it would be better to monkey patch PyArrow (or if it can be done), or if we should create a light wrapper around it.

betodealmeida added 4 commits September 14, 2019 14:38

Handle int64 columns with missing data in SQL Lab

412b3ab

Fix docstring

8a196a1

Add unit test

b7735cc

Small fix

2798767

pull-request-size bot added the size/M label Sep 14, 2019

betodealmeida added 5 commits September 16, 2019 23:43

Small fixes

7eaec7c

Fix cursor description update

cf3c69d

Better fix

4ef9256

Fix unit test, black

f7e7466

Fix nan comparison in unit test

14c4d84

betodealmeida merged commit b9be01f into apache:master Sep 17, 2019

betodealmeida mentioned this pull request Sep 18, 2019

Fix array casting #8253

Merged

12 tasks

betodealmeida mentioned this pull request Oct 1, 2019

Add improved typed casting to BigQuery #8331

Merged

12 tasks

mistercrunch added the 🏷️ bot and 🚢 0.35.0 labels Feb 28, 2024
