DISC: pd.DataFrame methods we specifically _don't_ want included #83

jbrockmendel · 2022-08-26T21:40:31Z

#3 has a discussion of which pd.DataFrame methods should be included in the standard based on a) measures of popularity among users and b) which are common across dataframe libraries. I'd like to try coming at it from the other direction: what existing pd.DataFrame methods can we exclude from consideration?

Throwing out some ideas here, none of these are strong opinions on my part:

metadata-centric attributes/methods
- attrs
- flags
- set_flags
deprecated pd.DataFrame methods
- tshift
- slice_shift
- lookup
- iteritems
- append
non-dunder arithmetic methods
- add
- radd
- mul
- rmul
- [26 of these total]
- eq
- ne
Methods that don't make sense if there isn't a row-index
- set_index
- reset_index
- sort_index
- stack
- unstack
- swapaxes
- asof
- between_time
- at_time
- last_valid_index
- first_valid_index
Methods that don't make sense if there isn't a MultiIndex
- droplevel
- reorder_levels
- swaplevel
Dtype specific methods
- explode (object)
- infer_objects
- sparse
Other
- to_records/from_records
- to_dict/from_dict
- potentially many other to_foo/pd.read_foo methods/functions
- style
- to_period/to_timestamp/tz_convert/tz_localize/asfreq
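
For the non-dunder arithmetic group, a minimal sketch of why these can look redundant in the simple case: with default arguments, the named methods give the same result as the corresponding operators. The frames and column names below are made up for illustration. Note that the named methods also accept extra parameters (such as `fill_value`) that the operators cannot express.

```python
import pandas as pd

# Hypothetical example data, not taken from the discussion.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
other = pd.DataFrame({"a": [10, 20, 30], "b": [0.5, 0.5, 0.5]})

# With default arguments, the named methods match the operators.
same_add = df.add(other).equals(df + other)
same_mul = df.mul(other).equals(df * other)
same_eq = df.eq(other).equals(df == other)

# The reflected variants swap the operand order: df.radd(1) is 1 + df.
same_radd = df.radd(1).equals(1 + df)
```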

rgommers · 2022-08-30T05:17:56Z

Thanks @jbrockmendel, that is a very good question. My handwavy suggestion would be "anything that's not core to the data structure or methods to manipulate it or do basic computations on it". Your list makes complete sense to me, and I'd add:

- I/O functionality
- Visualization functionality
- All options/settings type APIs: https://pandas.pydata.org/docs/user_guide/options.html
- Higher level statistical methods (e.g. pct_change, cov, corr, rank, spearman, kendall, pearson). It must be possible to defer such methods to statsmodels, scipy.stats or another such library, rather than requiring it to be reimplemented in every dataframe library.
- Special calendar handling, e.g. BusinessHour & co, Easter, SemiMonth, etc. Very little on https://pandas.pydata.org/docs/reference/offset_frequency.html aside from DateOffset and to_offset probably.
- Not sure, but probably most windowing functionality: https://pandas.pydata.org/docs/user_guide/window.html
- Exceptions and warnings (as a general principle, only expected input/behavior should be specified, not responses to unexpected input - because there's an endless amount of possible unexpected inputs).
- An API extension mechanism (https://pandas.pydata.org/docs/reference/extensions.html)

The above is all my sense of "not core". For a different reason I'd add eval and query - evaluating string expressions of syntax rather than the actual syntax seems like a performance optimization detail bubbled up to the end user. If you already optimize performance in a different way, or don't have something like numexpr, it probably doesn't make sense.
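
To make the query point concrete, here is a small sketch (the data and column names are invented for illustration) showing that a string expression passed to `query` selects the same rows as the equivalent boolean mask written in ordinary Python syntax:

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [8, 6, 4, 2]})

# String expression, parsed and evaluated by pandas.
via_query = df.query("a > 1 and b > 2")

# The same filter written with regular operators.
via_mask = df[(df["a"] > 1) & (df["b"] > 2)]

same_rows = via_query.equals(via_mask)
```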

shwina · 2022-08-30T11:34:04Z

Thanks! Largely agree with all the suggestions so far. Just one comment:

> Not sure, but probably most windowing functionality: https://pandas.pydata.org/docs/user_guide/window.html

I think windowing, and in particular, grouped-window functions, should stay. It's core to timeseries analysis and not easy for users to work around or implement themselves.
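
As an illustration of what a grouped-window function looks like in pandas today, here is a minimal sketch with made-up data: a rolling mean of width 2, computed separately within each group.

```python
import pandas as pd

# Hypothetical example data: two groups, "a" and "b".
df = pd.DataFrame(
    {"g": ["a", "a", "a", "b", "b"], "x": [1.0, 2.0, 3.0, 10.0, 20.0]}
)

# The window restarts at each group boundary, so the first row of every
# group has no full window and comes out as NaN.
rolled = df.groupby("g")["x"].rolling(2).mean()

# Drop the incomplete windows to keep only the computed means.
values = rolled.dropna().tolist()
```

Writing this by hand means splitting the data per group, tracking window boundaries, and reassembling the result, which is the workaround cost referred to above.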

kkraus14 · 2022-08-30T14:57:54Z

> Higher level statistical methods (e.g. pct_change, cov, corr, rank, spearman, kendall, pearson). It must be possible to defer such methods to statsmodels, scipy.stats or another such library, rather than requiring it to be reimplemented in every dataframe library.

Often these are used as groupby and window functions, where other libraries typically don't have grouped implementations and calling them per group leads to very subpar performance.

I'm still a +1 on removing them at least from the v1 of the API though 😄.

rgommers · 2022-08-31T14:28:52Z

Thanks for the details on grouped window functions, interesting. So it seems like an important topic to deal with at some point. It feels like something can be made more composable there - I had a quick look at how it's implemented in Pandas. For example 'spearman' is only used as:

class DataFrame:
    def corr(...):
        # ... other method branches elided ...
        elif method == "spearman":
            correl = libalgos.nancorr_spearman(mat, minp=min_periods)

And libalgos.nancorr_spearman looks like a standard statistical function - 2-D array in, numerical result out (no special handling of a groupby object):

def nancorr_spearman(ndarray[float64_t, ndim=2] mat, Py_ssize_t minp=1) -> ndarray:

So that seems generalizable to any callable and correlation metric. The same holds if it took some object that is the result of calling groupby: it should still be possible to define what such functions take in and return, so you don't have to reimplement such things over and over.
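
One way to picture that composability idea is sketched below in plain Python. Nothing here is pandas or standard API: `grouped_metric` and `pearson` are names invented for this sketch, and the only contract assumed is "two sequences in, one number out".

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation: two equal-length sequences in, float out."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def grouped_metric(keys, x, y, metric):
    """Hypothetical helper: apply any (x, y) -> float callable per group."""
    groups = defaultdict(lambda: ([], []))
    for key, a, b in zip(keys, x, y):
        groups[key][0].append(a)
        groups[key][1].append(b)
    return {key: metric(xs, ys) for key, (xs, ys) in groups.items()}


# Made-up data: group "u" is perfectly correlated, group "v" is anti-correlated.
keys = ["u", "u", "u", "v", "v", "v"]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0, 3.0, 2.0, 1.0]

result = grouped_metric(keys, x, y, pearson)
```

The sketch only shows the interface. A per-group Python loop like this is the slow path mentioned earlier in the thread, so a real library would still want a vectorized or specialized implementation behind the same contract.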

Windowing is a large topic (and has a large API surface), maybe worth splitting off into its own issue?

rgommers · 2022-09-26T18:21:03Z

In a call a few weeks ago, we had a little brainstorm on functions not to include. Here are some notes on what was discussed regarding APIs that would be good to exclude:

- APIs dealing with row indices & implicit alignment. These are basically impossible to do in parallel distributed settings. There are also not many use cases for them, though there are some: e.g. in Dask, sorting on the index allows for some performance optimizations. That's an implementation detail though.
- row indices in general
- do not allow calling general dataframe operations against groupby and windowing groups.
  - for apply, relevant operations need to be defined.
  - Outside of apply, there are various operations possible in pandas (e.g., performing a correlation on a groupby) which are not heavily used but require specialization in libraries such as cuDF. (Note that in pandas this isn't done, but it may be needed in principle to get good performance.)
- .empty / .is_empty -> all agree
- .bool -> all agree
- .describe and .info. Useful in interactive contexts, but not by libraries
- .explode -> several people find this useful (e.g. when flattening nested data), keep
- iter-like methods -> wide agreement, some discussion points made:
  - for iter methods, it would be nice if they could explicitly raise, rather than defer to library implementations. Would prefer users be guided to built-in methods.
  - we should not support making dataframes iterable in any way.
  - breaking backwards compat in e.g. pandas is tricky though. Question about what to include in the standard and what not to include.
    - even if the standard says to raise, that does not mean we cannot include it in the standard. There will have to be some sort of compatibility mode.
- .transpose / .T -> everyone agrees
- .head / .tail -> maybe: they are used, but they overlap with iloc and other methods. On the other hand, easy to implement.
- .squeeze
  - if there is no Series, then there's no use for .squeeze perhaps?
- .combine -> agree, remove
- .update -> agree, remove (indexing is better)
- .select_dtypes and .exclude_dtypes -> some discussion: useful for end users, less so for libraries. Likely leave out for now, but may revisit later
- .convert_dtypes (converts NumPy dtypes to nullable dtypes) -> agree, remove
- .mask? discussion: perhaps doesn't make sense if you are immutable; however, it doesn't have to be in-place. Could use where instead.
- .query -> agreement that this is a big hack/mess, good to remove.
  - If something like this is desired, some work needs to go into standardizing an expression format. Some work done in Arrow. Oriented toward devs and more onerous for users.
  - more attempts in that direction, like numexpr, patsy
  - related to eval. All of this fancy query stuff is a fancy way of doing loop fusion. pandas is just deferring to numexpr.
  - if one wants to do loop fusion, one should just write/use a compiler.
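
Two of the overlaps mentioned in these notes can be checked directly. The sketch below uses made-up data and shows that `mask` is `where` with the condition inverted (both return a new frame rather than modifying in place), and that `head`/`tail` match positional slices through `iloc`.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"a": [1, -2, 3], "b": [-4, 5, -6]})
cond = df < 0

# mask replaces values where cond is True;
# where replaces values where cond is False.
masked = df.mask(cond, 0)
via_where = df.where(~cond, 0)
same_mask = masked.equals(via_where)

# Neither call touched the original frame.
original_unchanged = df["a"].tolist() == [1, -2, 3]

# head/tail overlap with positional slicing.
same_head = df.head(2).equals(df.iloc[:2])
same_tail = df.tail(2).equals(df.iloc[-2:])
```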

jorisvandenbossche · 2022-09-29T17:03:18Z

Maybe controversial, but should loc and iloc be in this list?

jbrockmendel · 2022-09-29T18:01:36Z

> Maybe controversial, but should loc and iloc be in this list?

loc sounds like a good candidate bc in the absence of a row-index i think it would be dominated by __getitem__.

iloc I'd expect to be more universally useful, though it could be replaced by something like iat+slice+take

Definitely want to avoid overloading __getitem__ (pandas-dev/pandas#9595)
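
A rough sketch of the iat+slice+take idea, using made-up data. There is no dedicated positional slice method on a pandas DataFrame that I'd point to here, so expressing the contiguous case as `take` over a `range` is an assumption of this sketch rather than something proposed in the thread.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"a": [10, 20, 30, 40], "b": [1.5, 2.5, 3.5, 4.5]})

# Arbitrary row positions: take covers list-style positional selection.
same_take = df.iloc[[0, 2]].equals(df.take([0, 2]))

# Contiguous rows: written here as take over a range of positions
# (an assumption standing in for a positional slice).
same_slice = df.iloc[1:3].equals(df.take(list(range(1, 3))))

# Single scalar: iat is the positional scalar accessor.
same_scalar = bool(df.iloc[2, 1] == df.iat[2, 1])
```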

MarcoGorelli · 2023-10-27T15:00:08Z

I think everything here has been addressed (or rather, kept out), so I think this can be closed. Do let me know if I've misunderstood.

rgommers added the API design label Sep 13, 2022

MarcoGorelli closed this as completed Oct 27, 2023
