DISC: pd.DataFrame methods we specifically _don't_ want included #83

Closed
jbrockmendel opened this issue Aug 26, 2022 · 8 comments

Comments

@jbrockmendel
Contributor

jbrockmendel commented Aug 26, 2022

#3 has a discussion of which pd.DataFrame methods should be included in the standard based on a) measures of popularity among users and b) which are common across dataframe libraries. I'd like to try coming at it from the other direction: what existing pd.DataFrame methods can we exclude from consideration?

Throwing out some ideas here, none of these are strong opinions on my part:

  • metadata-centric attributes/methods
    • attrs
    • flags
    • set_flags
  • deprecated pd.DataFrame methods
    • tshift
    • slice_shift
    • lookup
    • iteritems
    • append
  • non-dunder arithmetic methods
    • add
    • radd
    • mul
    • rmul
    • [26 of these total]
    • eq
    • ne
  • Methods that don't make sense if there isn't a row-index
    • set_index
    • reset_index
    • sort_index
    • stack
    • unstack
    • swapaxes
    • asof
    • between_time
    • at_time
    • last_valid_index
    • first_valid_index
  • Methods that don't make sense if there isn't a MultiIndex
    • droplevel
    • reorder_levels
    • swaplevel
  • Dtype specific methods
    • explode (object)
    • infer_objects
    • sparse
  • Other
    • to_records/from_records
    • to_dict/from_dict
    • potentially many other to_foo/pd.read_foo methods/functions
    • style
    • to_period/to_timestamp/tz_convert/tz_localize/asfreq
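
To illustrate the "non-dunder arithmetic methods" group above, here is a minimal sketch showing that the method forms give the same results as the operators for plain operands. The frames `df` and `other` are made-up illustration data, not something from this thread.

```python
import pandas as pd

# Made-up illustration data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
other = pd.DataFrame({"a": [10, 20, 30], "b": [1, 1, 1]})

# Each non-dunder method mirrors an operator (dunder) form.
same_add = df.add(other).equals(df + other)
same_radd = df.radd(1).equals(1 + df)
same_mul = df.mul(2).equals(df * 2)
same_eq = df.eq(2).equals(df == 2)

print(same_add, same_radd, same_mul, same_eq)
```

The method forms do accept extra keywords such as `axis` and `fill_value` that the operators cannot express, so dropping them would lose that part of the functionality.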
@rgommers
Member

Thanks @jbrockmendel, that is a very good question. My handwavy suggestion would be "anything that's not core to the data structure or methods to manipulate it or do basic computations on it". Your list makes complete sense to me, and I'd add:

The above is all my sense of "not core". For a different reason I'd add eval and query - evaluating string expressions of syntax rather than the actual syntax seems like a performance optimization detail bubbled up to the end user. If you already optimize performance in a different way, or don't have something like numexpr, it probably doesn't make sense.
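
As a rough illustration of the point about `eval` and `query`, the sketch below compares the string-expression forms with the equivalent "actual syntax". The frame is made-up illustration data; which engine pandas uses underneath depends on whether numexpr is installed.

```python
import pandas as pd

# Made-up illustration data.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# String expression vs. ordinary boolean indexing.
via_query = df.query("a > 2 and b < 40")
via_mask = df[(df["a"] > 2) & (df["b"] < 40)]

# String expression vs. ordinary arithmetic.
via_eval = df.eval("a + b")
via_ops = df["a"] + df["b"]

print(via_query.equals(via_mask))
print(via_eval.tolist() == via_ops.tolist())
```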

@shwina
Contributor

shwina commented Aug 30, 2022

Thanks! Largely agree with all the suggestions so far. Just one comment:

> Not sure, but probably most windowing functionality: https://pandas.pydata.org/docs/user_guide/window.html

I think windowing, and in particular, grouped-window functions, should stay. It's core to timeseries analysis and not easy for users to work around or implement themselves.
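
For readers unfamiliar with grouped-window functions, here is a small sketch of what one looks like in pandas today, next to a manual per-group workaround. The data and the variable names are made up for illustration.

```python
import pandas as pd

# Made-up illustration data: two groups of different length.
df = pd.DataFrame(
    {"g": ["a", "a", "a", "b", "b"], "x": [1.0, 2.0, 3.0, 10.0, 20.0]}
)

# Grouped-window function: the rolling mean restarts within each group.
grouped = df.groupby("g")["x"].rolling(window=2).mean()

# Manual workaround: loop over the groups and stitch the pieces together.
manual = pd.concat(
    {name: grp["x"].rolling(window=2).mean() for name, grp in df.groupby("g")}
)

print(grouped.dropna().tolist())
print(manual.dropna().tolist())
```

The manual version works on this toy input, but it runs one Python-level iteration per group, which is the kind of workaround the comment above describes as not easy for users.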

@kkraus14
Collaborator

> • Higher level statistical methods (e.g. pct_change, cov, corr, rank, spearman, kendall, pearson). It must be possible to defer such methods to statsmodels, scipy.stats or another such library, rather than requiring it to be reimplemented in every dataframe library.

Often these are used as groupby and window functions, where other libraries typically don't have grouped implementations and calling them per group leads to very subpar performance.

I'm still a +1 on removing them at least from the v1 of the API though 😄.

@rgommers
Member

rgommers commented Aug 31, 2022

Thanks for the details on grouped window functions, interesting. So it seems like an important topic to deal with at some point. It feels like something can be made more composable there - I had a quick look at how it's implemented in Pandas. For example 'spearman' is only used as:

class DataFrame:
    def corr(...):
        elif method == "spearman":
            correl = libalgos.nancorr_spearman(mat, minp=min_periods)

And libalgos.nancorr_spearman looks like a standard statistical function - 2-D array in, numerical result out (no special handling of a groupby object):

def nancorr_spearman(ndarray[float64_t, ndim=2] mat, Py_ssize_t minp=1) -> ndarray:

So that seems generalizable to any callable and correlation metric. The same holds if it took an object that is the result of calling groupby: it should still be possible to define what such functions take in and return, so you don't have to reimplement such things over and over.

Windowing is a large topic (and has a large API surface), maybe worth splitting off into its own issue?
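
A minimal sketch of the composability idea described above, under stated assumptions: `spearman_matrix` and `per_group` are hypothetical names invented here, not pandas or standard APIs, and the Spearman implementation assumes no NaNs and no ties. It only illustrates the "2-D array in, numerical result out" interface applied to the groups of a groupby.

```python
import numpy as np
import pandas as pd


def spearman_matrix(mat: np.ndarray) -> np.ndarray:
    """Spearman correlation between the columns of a 2-D float array.

    Assumes no NaNs and no ties, so a double argsort yields valid ranks.
    """
    ranks = mat.argsort(axis=0).argsort(axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)


def per_group(df: pd.DataFrame, key: str, columns: list, func) -> dict:
    """Hypothetical helper: apply an array-in/array-out `func` to each group."""
    return {
        name: func(group[columns].to_numpy(dtype=float))
        for name, group in df.groupby(key)
    }


# Made-up illustration data without ties inside any group.
df = pd.DataFrame(
    {
        "g": ["a"] * 4 + ["b"] * 4,
        "x": [1.0, 3.0, 2.0, 5.0, 4.0, 1.5, 2.5, 0.5],
        "y": [2.0, 1.0, 4.0, 3.0, 0.1, 0.4, 0.3, 0.2],
    }
)

result = per_group(df, "g", ["x", "y"], spearman_matrix)
print(result["a"])
```

Note that this loops over groups in Python, which is the slow per-group pattern mentioned earlier in the thread; the sketch is about the interface, not about performance.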

@rgommers
Member

A few weeks ago we had a little brainstorm, in a call, on functions not to include. Here are some notes on what was discussed regarding APIs that would be good to exclude:

  • APIs dealing with row indices & implicit alignment. These are basically impossible to do in parallel distributed settings. There are also not many use cases for these, though there are some: e.g. in Dask, sorting on the index allows for some performance optimizations. That's an implementation detail though.
  • row indices in general
  • do not allow calling general dataframe operations against groupby and windowing groups.
    • for apply, relevant operations need to be defined.
    • Outside of apply, there are various operations possible in pandas (e.g., performing a correlation on a groupby) which are not heavily used but require specialization in libraries such as cuDF. (Note that in pandas this isn't done, but it may be needed in principle to get good performance.)
  • .empty / .is_empty -> all agree
  • .bool -> all agree
  • .describe and .info. Useful in interactive contexts, but not by libraries
  • .explode -> several people find this useful (e.g. when flattening nested data), keep
  • iter-like methods -> wide agreement, some discussion points made:
    • for iter methods, would be nice if they could explicitly raise, rather than defer to library implementations. Would prefer users be guided to built-in methods.
    • we should not support making dataframe iterables in any way.
    • breaking backwards compat in e.g. pandas is tricky though. Question about what to include in standard and what not to include.
      • even if standard says to raise, does not mean that we cannot include in standard. There will have to be some sort of compatibility mode.
  • .transpose / .T -> everyone agrees
  • .head / .tail -> maybe: they are used, but they overlap with iloc and other methods. On the other hand, easy to implement.
  • .squeeze
    • if there is no Series, then there's no use for .squeeze perhaps?
  • .combine -> agree, remove
  • .update -> agree, remove (indexing is better)
  • .select_dtypes and .exclude_dtypes -> some discussion: useful for end users, less so for libraries. likely leave out for now, but may revisit later
  • .convert_dtypes (converts NumPy dtypes to nullable dtypes) -> agree, remove
  • .mask? discussion: perhaps doesn't make sense if you are immutable; however doesn't have to be in-place. Could use where instead.
  • .query -> agreement that this is a big hack/mess, good to remove.
    • If something like this is desired, need to put some work in standardizing expression format. Some work done in Arrow. Oriented toward devs and more onerous for users.
    • more attempts in that direction, like numexpr, patsy
    • related to eval. All of this fancy query stuff is a fancy way of doing loop fusion. pandas is just deferring to numexpr.
    • if one wants to do loop fusion, one should just write/use a compiler.
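
Two of the overlaps mentioned in the notes above (`.head` / `.tail` versus `iloc`, and `.mask` versus `where`) can be checked directly. This is a small sketch on made-up data, not part of the original notes.

```python
import pandas as pd

# Made-up illustration data with some negative values to mask.
df = pd.DataFrame({"a": [1, -2, 3, -4], "b": [5.0, 6.0, -7.0, 8.0]})

# .head / .tail overlap with positional slicing via iloc.
head_same = df.head(2).equals(df.iloc[:2])
tail_same = df.tail(2).equals(df.iloc[-2:])

# .mask(cond) is .where with the condition inverted; neither is in-place.
cond = df < 0
mask_same = df.mask(cond).equals(df.where(~cond))

print(head_same, tail_same, mask_same)
```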

@jorisvandenbossche
Member

Maybe controversial, but should loc and iloc be in this list?

@jbrockmendel
Contributor Author

> Maybe controversial, but should loc and iloc be in this list?

loc sounds like a good candidate because in the absence of a row-index I think it would be dominated by __getitem__.

iloc I'd expect to be more universally useful, though it could be replaced by something like iat+slice+take

Definitely want to avoid overloading __getitem__ (pandas-dev/pandas#9595)
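
One possible reading of "iat+slice+take", sketched with existing pandas methods on made-up data. Expressing a positional slice as `take` over a range is an assumption made here for illustration; the comment does not spell out what the slice operation would look like.

```python
import pandas as pd

# Made-up illustration data.
df = pd.DataFrame({"a": [10, 20, 30, 40], "b": [1.5, 2.5, 3.5, 4.5]})

# Scalar access by position: iat covers iloc[row, col].
scalar_same = df.iloc[2, 0] == df.iat[2, 0]

# Row selection by a list of positions: take covers iloc[[...]].
take_same = df.iloc[[3, 0]].equals(df.take([3, 0]))

# A positional slice written as take over a range (assumed spelling).
slice_same = df.iloc[1:3].equals(df.take(list(range(1, 3))))

print(scalar_same, take_same, slice_same)
```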

@MarcoGorelli
Contributor

I think everything here has been addressed (or rather, kept out), so I think this can be closed. Do let me know if I've misunderstood.


6 participants