ENH: DataFrame.astype(dtype: dict) should work in the presence of superfluous keys. #43837
Comments
Thanks for the suggestion, but I would be -1 on this enhancement, as I think promoting the more explicit behavior is better, as well as your other point:
|
I agree with @mroeschke. You can also get around it with a dict comprehension:
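The workaround meant here is presumably the dict comprehension quoted later in the thread. A minimal sketch, in which `dtypes_dict` and the example frame are made up for illustration:

```python
import pandas as pd

# Hypothetical project-wide mapping of column names to dtypes.
dtypes_dict = {"a": "int64", "b": "float64", "c": "string"}

# A frame that only contains a subset of those columns ("c" is missing).
df = pd.DataFrame({"a": [1.0, 2.0], "b": [1, 2]})

# Keep only the keys that are actual columns before calling astype.
result = df.astype({c: d for c, d in dtypes_dict.items() if c in df.columns})
```

The comprehension filters the mapping against `df.columns`, so the unused key `"c"` never reaches `astype`.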
|
A counterpoint would be that This also feels similar to So there is a consideration of API consistency as well as what is the optimal API for this individual method. The use case here would be to have a single dictionary of column names -> dtypes for a project, something I've done in the past that feels nice and explicit/organized. At various points in the process of data manipulation/analysis, you may end up with dataframes that have only a subset of the columns but do not naturally have an explicit representation of the subset of column names that you want. This alternate suggestion is not sufficient IMO
This does not work if you're writing a chain of operations, something like
(
    df
    .melt(...)  # Create new columns with values in the column index, that may not be typed correctly
    .astype(...)  # At this point there is no variable allowing access to the current set of column names
) |
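For reference, a chain can still reach the intermediate frame through `DataFrame.pipe`, although this keeps the extra comprehension that the thread objects to. A minimal sketch, in which the frame, the column names, and `dtypes_dict` are all assumptions for illustration:

```python
import pandas as pd

# Hypothetical project-wide mapping; "unused" is not a column of the melted frame.
dtypes_dict = {"id": "int64", "variable": "category", "value": "float64", "unused": "string"}

wide = pd.DataFrame({"id": [1, 2], "x": [1, 2], "y": [3, 4]})

tidy = (
    wide
    .melt(id_vars="id")  # produces the columns "id", "variable" and "value"
    # pipe passes the intermediate frame to the lambda, so its columns are accessible here
    .pipe(lambda d: d.astype({c: t for c, t in dtypes_dict.items() if c in d.columns}))
)
```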
And that's why I am not arguing for changing the default behaviour, except in the case when one can be relatively sure that no typos happened (i.e. when all columns are present as keys). @attack68 Of course one can get around it this way, but as I said earlier, I find this has several disadvantages, mainly that
df.astype({c: d for c, d in dtypes_dict.items() if c in df.columns}) compared to df.astype(dtypes_dict). To me, the former really just feels like it adds useless noise to the code, whereas the latter is to the point. |
@randolf-scholz I think this is a good suggestion and it's something I've stumbled over in the past. I might offer a friendly suggestion which is that the tone of your request, e.g.
could be decreasing enthusiasm for the idea. It reads (to me) as unnecessarily combative, and I think you might find more success by focusing on the value of your proposed enhancement rather than the (perceived) flaws of the existing API. |
Another consideration is that |
When I reconsider the proposal carefully, I must raise two points for discussion:
There may be a reduced chance of an error due to a typo in case 1, but that does not necessarily imply that the chance of error in the programme overall falls below the threshold of significance at which your condition makes it acceptable. It may or may not. The more complicated error-raising logic may (or may not) also have unwanted side effects on other programmes. For 2, @mwaskom makes a similar observation to my own. Were we discussing the issue from an original design perspective, I'm not sure what I would prefer, and perhaps that is why the API is inconsistent in parts: this is one of those subjective areas. However, with an established MO, at least for me, the bar must be suitably high, and the argument suitably strong and proven, before a change is warranted. But pandas is open source, so feel free to contribute a quick/draft PR if you feel the case is convincing enough and see how it's taken. |
I agree with @mroeschke here, -1 on changing. |
Currently, DataFrame.astype(dtype_dict: dict) requires that the dict keys are a subset of the DataFrame's columns. This feels like an unnecessary restriction; in my opinion it would suffice / be more intuitive if it would roughly perform:
The fact that this currently raises an error becomes annoying, for example if one needs to repair data types after they were destroyed by a stacking operation: one needs to slice the dtype_dict dictionary by the column keys every time!
Of course, a strong argument can be made that raising an error is a good idea to prevent users from erroneously believing the type-casting was performed in the case when a key was mistyped.
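The current behaviour described above can be reproduced with a small sketch. The frame and `dtype_dict` are made up for illustration; the only claim is that a key that is not a column makes `astype` raise a `KeyError`:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
dtype_dict = {"a": "float64", "b": "int64"}  # "b" is not a column of df

try:
    df.astype(dtype_dict)
    raised = False
except KeyError:
    # pandas rejects dtype mappings whose keys are not column names
    raised = True
```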
Describe the solution you'd like
I propose to consider either one of the following changes:
Option 1: If df.columns is a subset of the keys of dtype_dict, do not raise an error if superfluous keys are present. In this case all columns are identified and there is a negligible chance that there is an error due to a typo.
Option 2: Extend the functionality of the already present errors='ignore' option to also ignore superfluous keys in the dtype_dict.
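The semantics of Option 1 could be sketched as a user-side helper. `astype_option1` is a hypothetical name and not part of pandas; it only illustrates the proposed rule and assumes unique column names:

```python
import pandas as pd


def astype_option1(df: pd.DataFrame, dtype_dict: dict) -> pd.DataFrame:
    """Hypothetical helper sketching Option 1; not a pandas API."""
    if set(df.columns) <= set(dtype_dict):
        # Every column has an entry, so superfluous keys are dropped silently.
        return df.astype({c: dtype_dict[c] for c in df.columns})
    # Otherwise defer to the existing behaviour, which raises KeyError for unknown keys.
    return df.astype(dtype_dict)


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
out = astype_option1(df, {"a": "float64", "b": "float64", "c": "int64"})
```

With a mapping such as `{"a": "float64", "typo": "int64"}`, column `"b"` has no entry, so the helper falls through to the plain `astype` call and the usual `KeyError` is raised.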