Handle int64 columns with missing data in SQL Lab #8226
superset/dataframe.py
Outdated
@@ -183,7 +205,7 @@ def agg_func(cls, dtype, column_name):
         if (
             hasattr(dtype, "type")
             and issubclass(dtype.type, np.generic)
-            and np.issubdtype(dtype, np.number)
+            and dtype._is_numeric
Actually, this method is Pandas specific. I'll have to use this and the previous one together.
@robdiciuccio ^^^
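A minimal sketch of how the two checks could be used together, assuming the goal is to accept both NumPy numeric dtypes and Pandas numeric extension dtypes. The helper name `is_numeric_dtype` is hypothetical, and `_is_numeric` is a private Pandas attribute, so this is an illustration and not necessarily the code that ends up in the PR.

```python
import numpy as np
import pandas as pd


def is_numeric_dtype(dtype) -> bool:
    """Return True for NumPy numeric dtypes and Pandas numeric extension dtypes.

    Hypothetical helper, for illustration only.
    """
    # Pandas extension dtypes (e.g. Int64Dtype) are not NumPy dtypes, so they
    # are checked through the Pandas-specific attribute instead.
    if isinstance(dtype, pd.api.extensions.ExtensionDtype):
        # _is_numeric is private to Pandas; default to False if it is absent.
        return bool(getattr(dtype, "_is_numeric", False))
    # Plain NumPy dtypes keep the original check from the diff.
    return bool(
        hasattr(dtype, "type")
        and issubclass(dtype.type, np.generic)
        and np.issubdtype(dtype, np.number)
    )
```

Pandas also exposes a public `pd.api.types.is_numeric_dtype`, which may be worth considering instead of relying on a private attribute.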
The dataframe implementation looks good, as does the Presto engine dtype fix, but does this fully address #8225 if only Presto is handled? Are there other databases this fix should be implemented for (even in a separate PR)? Also curious about your thoughts on the feasibility of the PyArrow workaround here.
You're right, this only fixes Presto. I'll do a separate PR addressing the other DBs.
Looks like it would solve our problem, but I don't know if it would be better to monkey patch PyArrow (or if it can be done), or if we should create a light wrapper around it.
This PR fixes #8225.
SUMMARY
When a column has `int64` integers and missing data, Pandas will cast it to `float64`, resulting in loss of precision and possibly returning incorrect numbers.

This PR fixes the bug by adding a method to the DB engine specs that returns a `dtype` based on the cursor description, currently implemented in Presto only. With the `dtype`, we can create a Pandas `Series` for each column, and create a `DataFrame` that has the proper types.

Note that in order to represent the column correctly we need to use a nullable data type, introduced in Pandas 0.24.0. Unfortunately, PyArrow is unable to serialize the resulting data frame, so `msgpack` has to be disabled.

TEST PLAN
Added unit test.
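The unit test itself is not shown here. As a rough sketch of the behavior such a test could check, the example below uses an integer just above 2**53, the largest range in which `float64` represents every integer exactly. The test names and the constant are hypothetical.

```python
import pandas as pd

# Hypothetical test value: 2**53 + 1 has no exact float64 representation.
BIG = 2**53 + 1


def test_default_inference_loses_precision():
    # With a missing value present, Pandas infers float64 for the column,
    # so the integer is rounded to a neighbouring representable value.
    series = pd.Series([BIG, None])
    assert series.dtype == "float64"
    assert int(series[0]) != BIG


def test_nullable_int64_preserves_value():
    # The nullable extension dtype keeps the integer exact and the null missing.
    series = pd.Series([BIG, None], dtype="Int64")
    assert series[0] == BIG
    assert pd.isna(series[1])
```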
ADDITIONAL INFORMATION
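To illustrate the approach described in the summary, here is a minimal sketch of building a `DataFrame` from per-column `Series` with dtypes derived from a cursor description. The names `TYPE_MAP`, `dtypes_from_description` and `build_dataframe` are hypothetical, and the sketch assumes each description entry starts with the column name followed by a type name string; the actual engine spec method may look different.

```python
import pandas as pd

# Hypothetical mapping from a database type name to a Pandas dtype.
# Integer types map to the nullable "Int64" extension dtype.
TYPE_MAP = {"bigint": "Int64", "integer": "Int64", "double": "float64"}


def dtypes_from_description(description):
    """Map each column name to a dtype, or None to let Pandas infer it."""
    return {col[0]: TYPE_MAP.get(col[1]) for col in description}


def build_dataframe(rows, description):
    """Create one Series per column so each column gets its own dtype."""
    dtypes = dtypes_from_description(description)
    names = [col[0] for col in description]
    # Transpose the row tuples into column tuples.
    columns = zip(*rows)
    data = {
        name: pd.Series(list(values), dtype=dtypes[name])
        for name, values in zip(names, columns)
    }
    return pd.DataFrame(data, columns=names)


description = [("id", "bigint"), ("name", "varchar")]
rows = [(9007199254740993, "a"), (None, "b")]
df = build_dataframe(rows, description)
print(df.dtypes)
```

Building the columns individually avoids letting the `DataFrame` constructor infer `float64` for an integer column that contains nulls.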
REVIEWERS
@villebro @robdiciuccio