Skip to content

Handle int64 columns with missing data in SQL Lab #8226

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Sep 17, 2019

Conversation

betodealmeida
Copy link
Member

CATEGORY

Choose one

  • Bug Fix
  • Enhancement (new features, refinement)
  • Refactor
  • Add tests
  • Build / Development Environment
  • Documentation

This PR fixes #8225.

SUMMARY

When a column has int64 integers and missing data, Pandas will cast it to float64, resulting in loss of precision and possibly returning incorrect numbers.

This PR fixes the bug by adding a method to the DB engine specs that returns a dtype based on the cursor description, currently implemented in Presto only. With the dtype, we can create a Pandas Series for each column, and create a DataFrame that has the proper types.

Note that in order to represent the column correctly we need to use a nullable data type, introduced in Pandas 0.240. Unfortunately, PyArrow is unable to serialize the resulting data frame, so msgpack has to be disabled.

TEST PLAN

Added unit test.

ADDITIONAL INFORMATION

REVIEWERS

@villebro @robdiciuccio

@@ -183,7 +205,7 @@ def agg_func(cls, dtype, column_name):
if (
hasattr(dtype, "type")
and issubclass(dtype.type, np.generic)
and np.issubdtype(dtype, np.number)
and dtype._is_numeric
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this method is Pandas specific. I'll have to use this and the previous one together.

@mistercrunch
Copy link
Member

@robdiciuccio ^^^

@robdiciuccio
Copy link
Member

The dataframe implementation looks good, as does the Presto engine dtype fix, but does this fully address #8225 if only Presto is handled? Are there other databases this fix should be implemented for (even in a separate PR)?

Also curious about your thoughts on the feasibility of the PyArrow workaround here.

@betodealmeida
Copy link
Member Author

The dataframe implementation looks good, as does the Presto engine dtype fix, but does this fully address #8225 if only Presto is handled? Are there other databases this fix should be implemented for (even in a separate PR)?

You're right, this only fixes Presto. I'll do a separate PR addressing the other DBs.

Also curious about your thoughts on the feasibility of the PyArrow workaround here.

Looks like it would solve our problem, but I don't know if it would be better to monkey patch PyArrow (or if it can be done), or if we should create a light wrapper around it.

@betodealmeida betodealmeida merged commit b9be01f into apache:master Sep 17, 2019
@betodealmeida betodealmeida mentioned this pull request Sep 18, 2019
12 tasks
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.35.0 labels Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/M 🚢 0.35.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pandas casting int64 to float64, misrepresenting value
3 participants