Skip to content

ENH: DataFrame.astype(dtype: dict) should work in the presence of superfluous keys. #43837

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
randolf-scholz opened this issue Oct 1, 2021 · 8 comments
Labels
Astype Enhancement Needs Discussion Requires discussion from core team before further action

Comments

@randolf-scholz
Copy link
Contributor

randolf-scholz commented Oct 1, 2021

Currently, DataFrame.astype(dtype_dict: dict) requires that the dict keys are a subset of the DataFrame's columns. This feels like an unnecessary restriction, in my opinion it would suffice / be more intuitive if it would roughly perform:

for col in df.columns:
   if col in dtype:
      df[col] = df.col.astype(dtype_dict[col])

The fact that this currently raises an error becomes annoying, for example if one needs to repair data types after they became destroyed by a stacking operation - one needs to slice the dtype_dict-dictionary by the column keys every time!

Of course, a strong argument can be made that raising an error is a good idea to prevent users from erroneously believing the type-casting was performed in the case when a key was miss-typed.

Describe the solution you'd like

I propose to consider either one of the following changes:

Option 1: If df.columns is a subset of dtype_dict, do not raise an error if superfluous keys are present. In this case all columns are identified and there is a negligible chance that there is an error due to a typo.
Option 2: Extent the functionality of the already present errors='ignore' option to also ignore superfluous keys in the dtype_dict.

@randolf-scholz randolf-scholz added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 1, 2021
@mroeschke
Copy link
Member

Thanks for the suggestion, but I would be -1 on this enhancement as I think promoting the more explicitly behavior is better as well as your other point:

Of course, a strong argument can be made that raising an error is a good idea to prevent users from erroneously believing the type-casting was performed in the case when a key was miss-typed

@mroeschke mroeschke added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 2, 2021
@attack68
Copy link
Contributor

attack68 commented Oct 2, 2021

I agree with @mroeschke. You can also also get around it with dict comprehension:

df.astype({c: d for c, d in _dtypes.items() if c in df.columns})

@mwaskom
Copy link
Contributor

mwaskom commented Oct 2, 2021

A counterpoint would be that pd.read_csv accepts a dict for dtype= and does not error if keys in the dict do not match columns in the frame.

This also feels similar to df.rename which has errors="ignore" as default, allowing one to choose a stricter/safer interpretation if desired.

So there is a consideration of API consistency as well as what is the optimal API for this individual method.


The usecase here would be to have a single dictionary of column names -> dtypes for a project, something I've done in the past that feels nice and explicit/organized. At various points in the process of data manipulation/analysis, you may end up with dataframes that have only a subset of the columns but not naturally have a an explicit representation of the subtset of column names that you want.

This alternate suggestion is not sufficient IMO

You can also also get around it with dict comprehension:
df.astype({c: d for c, d in _dtypes.items() if c in df.columns})

This does not work if you're writing a chain of operations, something like

(
    df
    .melt(...)  # Create new columns with values in the column index, that may not be typed correctly
    .astype(...)  #  At this point there is no variable allowing access to the current set of column names
)

@randolf-scholz
Copy link
Contributor Author

@mroeschke

Thanks for the suggestion, but I would be -1 on this enhancement as I think promoting the more explicitly behavior is better as well as your other point.

And that's why I am not arguing for changing the default behaviour, except in the case when one can be relatively sure that no typos happened (i.e. when all columns are present as keys.).

@attack68 Of course one can get around it this way, but as I said earlier, I find this has several disadvantages, mainly that

  1. It becomes really annoying when you have to do it often - in my case this happens when constructing several tables with stack/unstack/melt operations which have a tendency to destroy the datatypes.
  2. It, in my opinion, hurts readability. There is no real useful extra information in the line
df.astype({c: d for c, d in dtypes_dict.items() if c in df.columns})

compared to

df.astype(dtypes_dict)

To me, the former really just feels like it adds useless noise to the code, whereas the latter is to the point.
I would kindly ask you to reconsider, and also carefully think about what I am actually proposing - which is not to change the default behaviour, except in cases where it is safe to do so.

@mwaskom
Copy link
Contributor

mwaskom commented Oct 2, 2021

@randolf-scholz I think this is a good suggestion and it's something I've stumbled over in the past. I might offer a friendly suggestion which is that the tone of your request, e.g.

This is a completely unnecessary restriction, it would clearly suffice if it would roughly perform ...

could be decreasing enthusiasm for the idea. It reads (to me) as unnecessarily combative, and I think you might find more success by focusing on the value of your proposed enhancement rather than the (perceived) flaws of the existing API.

@mwaskom
Copy link
Contributor

mwaskom commented Oct 2, 2021

Another consideration is thaterrors="ignore" would be ambiguous here in a way that it isn't with df.rename. What happens if you ask for an illegal conversion of a column that does exist (say, casting a non-numeric string column to float)?

@attack68
Copy link
Contributor

attack68 commented Oct 2, 2021

When I reconsider the the proposal carefully I must raise two points for discussion:

I propose the following changes:

  1. If df.columns is a subset of dtype, do not raise an error if superfluous keys are present
    • In this case all columns are identified and there is a negligible chance that there is an error due to a typo.
  2. Extent the functionality of the already present errors='ignore' option to also ignore superfluous keys in the dtype-dictionary.

There may be a reduced chance of an error due to a typo in case 1, but that does not necessarily imply the chance of error in the programme overall is below a threshold of significance that your condition makes it acceptable. It may be or may not be. The more complicated logic error behaviour may (or may not) also have unwanted side effects to other programmes.

For 2. @mwaskom makes a similar observation to my own.

Were we discussing the issue from an original design perspective I'm not sure what I would prefer, and perhaps that is why the API is inconsistent in parts because this is one of those subjective areas. However, having an established MO, means at least for me, the bar must be suitably high and the argument suitably strong and proven that a change is warranted.

But pandas is open source, feel free to contribute a quick/draft PR if you feel convincing enough and see hows its taken..

@jbrockmendel
Copy link
Member

I agree with @mroeschke here, -1 on changing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Astype Enhancement Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

5 participants