Split pandas package into pandas and pandas-core #57550

Closed
datapythonista opened this issue Feb 21, 2024 · 12 comments
Labels
Build Library building on various platforms Dependencies Required and optional dependencies Ideas Long-Term Enhancement Discussions Needs Discussion Requires discussion from core team before further action Release

Comments

@datapythonista
Member

Maybe worth a PDEP, but opening as an issue first to see what other people think, and see if it's needed or worth the time.

The status quo for dependencies in pandas has been to depend on numpy, pytz, and dateutil, and for everything else just make them optional. This has been working reasonably ok, but I don't think it's ideal, and the discussion on whether PyArrow should be required or optional is one example of it.

In my opinion, there are two main things to consider. The first one is about users, and I see two broad groups:

  1. The average user who will pip/conda install pandas and wants things to work without much hassle
  2. The advanced user who wants more control over what is installed

I think the current approach favors group 2. Users in group 1 run into many exceptions about missing dependencies when they use key functionality like .read_excel() or .plot(), or get suboptimal performance when they use .read_csv() and others without PyArrow. Of course this is avoided if they install pandas through a distribution that includes the dependencies, which I think is common.
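A minimal sketch of the group 1 problem, assuming a hypothetical mapping from features to the top-level module each one needs (the real pandas mapping may differ). It only uses the standard library to check whether a module is importable, and reports the features that would fail:

```python
from importlib.util import find_spec

# Hypothetical mapping for illustration: feature -> module it relies on.
# These pairings are assumptions, not pandas' authoritative list.
FEATURE_MODULES = {
    "read_excel": "openpyxl",
    "plot": "matplotlib",
    "pyarrow-backed I/O": "pyarrow",
}


def missing_features(feature_modules=FEATURE_MODULES):
    """Return the features whose backing module cannot be found."""
    return sorted(
        feature
        for feature, module in feature_modules.items()
        if find_spec(module) is None  # None means the module is not installed
    )
```

Calling missing_features() in a fresh environment with only a minimal installation would list every feature above; in an environment with all three modules installed it returns an empty list.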

A second consideration is how the code is structured for soft dependencies, but I will leave it out of this discussion, as it's another complex but somewhat independent topic.

What I propose regarding the packaging is what many other projects do, for example R as packaged in Linux distributions: distribute two different packages, pandas and pandas-core.

The existing package pandas would be renamed to pandas-core, and users who want a minimal installation would be able to use it. A new metapackage would be created with the existing name pandas. It'd be a metapackage / "empty" package with just dependencies on pandas-core, pyarrow, matplotlib... and any other package we consider important to have by default.
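A toy model of the proposed split, assuming hypothetical dependency tables (the package names come from this proposal, but the exact lists are illustrative only, and the transitive dependencies of pyarrow and matplotlib are left out). It shows that both packages end up installing the same pandas-core, and differ only in the extra packages pulled in:

```python
# Hypothetical dependency tables, for illustration only.
DEPENDS = {
    "pandas": ["pandas-core", "pyarrow", "matplotlib"],  # metapackage, no code
    "pandas-core": ["numpy", "pytz", "dateutil"],        # today's pandas
}


def install_set(package, depends=DEPENDS):
    """Return every package pulled in by installing `package` (toy resolver)."""
    seen = set()
    stack = [package]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        # Packages without an entry are treated as having no dependencies.
        stack.extend(depends.get(name, []))
    return seen
```

Here install_set("pandas-core") contains only the current required dependencies, while install_set("pandas") is a superset that adds pyarrow and matplotlib.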

I think this would solve in a reasonable way the discussion on whether to make PyArrow required, and in general improve the experience of most pandas users.

@pandas-dev/pandas-core thoughts?

@Dr-Irv
Contributor

Dr-Irv commented Feb 21, 2024

I like the idea of having a "minimal" installation that covers most common use cases and avoids downloading unneeded packages. I would suggest the name minipandas, akin to how miniconda and anaconda are the minimal and maximal versions of the Anaconda distribution.

With respect to the pyarrow issue, we'd then have to make sure that minipandas would work without pyarrow being installed.

One other thought - I imagine the current test suite would have to be split into tests appropriate for minipandas and for pandas, and there would be an additional burden when building distributions. We'd also have to carefully examine the docs to determine which parts need a "full pandas" label to indicate that you need the full package (or specific dependencies) for them to work.

@mroeschke
Member

I do think this would be useful but mainly for considering how the code is packaged and less about how dependencies are bundled (which seems to be the focus here?). For pip installations we have the pip extras set up and understandably conda doesn't have something like that (yet). If this re-packaging is to make the conda installation story nicer I'm not sure if it's worth it.

Just noting that core seems to be the "common" prefix for minimal packages in Python too:

https://anaconda.org/conda-forge/jupyter_core
https://anaconda.org/conda-forge/dask-core
https://anaconda.org/conda-forge/poetry-core
https://anaconda.org/conda-forge/botocore

@datapythonista
Member Author

> which seems to be the focus here?

My main point is about the UX; on anything else I'm personally flexible, and it can be discussed later.

I think pip install pandas and conda install pandas should install PyArrow, and possibly Matplotlib and other dependencies. And there should be a way to install pandas without any optional dependencies, pip install pandas[core] and conda install pandas-core, or whatever makes sense and is feasible.

@WillAyd
Member

WillAyd commented Feb 21, 2024

> I think pip install pandas and conda install pandas should install PyArrow, and possibly Matplotlib and other dependencies. And there should be a way to install pandas without any optional dependencies, pip install pandas[core] and conda install pandas-core, or whatever makes sense and is feasible.

As long as PDEP-10 holds, I think pyarrow is a core package. Outside of that, how much of a difference is this expected to make? I think there is also a downside to having separate packages, because then you start to fragment the user base.

@datapythonista
Member Author

The difference is that by default users will get our recommended dependencies, as opposed to now, since the main package will add them, while still leaving users the option to install a version with no optional dependencies.

Making up the numbers, but if 20% of users have PyArrow now, maybe we'll get to 80% of them, making pandas faster for many users who don't know or don't care much about what to install, and who trust us to provide what they need by default.

I personally don't see the fragmentation problem you mention. This solution has been implemented for decades in the Linux world. If you want KDE for example, you just install the kde package and you get a notepad, a calculator, a calendar... If you have a reason to not have everything that KDE provides, you can still install kde-core and the specific packages you want. I wouldn't say KDE users are fragmented because of this, or that pandas users will be. We are already dealing with a user base where each individual has a different set of dependencies. We'll affect the percentage of users that have some of the pandas optional dependencies, but other than that I personally don't see a significant change or any drawback. The pandas installed will be exactly the same: the one in pandas-core, which will be installed by both the pandas and the pandas-core packages.

@bashtage
Contributor

There is already a great mechanism and all that is needed are some recommendations like installing pandas[all] (or full or kitchen-sink) and possibly other subsets like pandas[io].
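A small sketch of how the extras mechanism can be inspected, assuming only the standard library: it reads the Provides-Extra metadata of a locally installed distribution. Which extras a given pandas version actually declares is not stated in this thread, so the helper simply reports whatever is found, and returns an empty list when the distribution is not installed:

```python
from importlib.metadata import PackageNotFoundError, metadata


def declared_extras(distribution):
    """Return the extras declared by a locally installed distribution."""
    try:
        meta = metadata(distribution)
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return []
    # get_all returns None when the field is absent, hence the fallback.
    return sorted(meta.get_all("Provides-Extra") or [])
```

With pandas installed, declared_extras("pandas") lists the names that can go inside the square brackets of a pip install pandas[...] command.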

I think it would be a mistake to try and redefine pandas to be some huge set of dependencies, and to introduce some other package to be the current pandas.

@rhshadrach
Member

I think this is well known, but feels worth stating anyways: no matter how it's implemented, if there are ways of using pandas without pyarrow, then we have to maintain both "pandas with pyarrow" and "pandas without pyarrow" - which to me was the main reason for PDEP-10.

If pyarrow is always opt-in, then I don't see much issue with this. But if we are having e.g. "string[pyarrow] when pyarrow is installed and otherwise numpy object" type inference, then users will have different behavior in pandas itself depending on whether a third party package is installed or not. That seems like a very bad user experience to me.
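A minimal sketch of the environment-dependent behavior described above. The function name and its parameter are hypothetical and this is not pandas' actual inference code; it only illustrates how the same call could return a different dtype depending on whether a third party package is installed:

```python
from importlib.util import find_spec


def inferred_string_dtype(pyarrow_available=None):
    """Pick a string dtype name based on whether pyarrow can be found.

    Hypothetical helper: passing True/False overrides the environment
    check, which makes both branches easy to exercise.
    """
    if pyarrow_available is None:
        pyarrow_available = find_spec("pyarrow") is not None
    return "string[pyarrow]" if pyarrow_available else "object"
```

Two users running identical code would get "string[pyarrow]" in one environment and "object" in the other, which is the inconsistency being warned about.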

@attack68
Contributor

> I think this is well known, but feels worth stating anyways: no matter how it's implemented, if there are ways of using pandas without pyarrow, then we have to maintain both "pandas with pyarrow" and "pandas without pyarrow" - which to me was the main reason for PDEP-10.

That was my understanding of one of the core reasons for PDEP-10. I was one of the few people voting against PDEP-10, but now that it has been voted on, is it not supposed to be accepted and stuck with? Unless a new PDEP or an amendment to it is put forward, surely this is out of scope until then. According to the PDEP, the warning should also be retained, and not repealed by a close majority vote, which might also not follow PDEP rules.

@datapythonista
Member Author

I agree, and it's surely not the goal of this issue to cancel PDEP-10. Also, while having two packages could be used to install PyArrow more broadly without requiring it, the scope of what I'm discussing here is not limited to PyArrow and could be applied to other dependencies that we recommend (or assume users are most likely to want) but don't want to force, for example Matplotlib.

From the previous discussions, it seems several people are interested in not moving forward with PDEP-10, at least as is. I fully agree that this issue is not where we want to decide or even discuss it. But if there is interest in implementing the two packages for default and minimal dependencies, I think it can make a difference for future discussions on requiring Arrow.

And clearly, this issue doesn't help with cleaning our codebase of "if pyarrow" checks or with having to deal with two separate cases. The main change I envision is a significant increase in the number of users who have PyArrow installed.

I'm personally +1 on moving forward with PDEP-10, fully requiring PyArrow and keeping the warning, but if many people dislike the PDEP now, I think we'll have to have a new discussion.

@lithomas1
Member

> There is already a great mechanism and all that is needed are some recommendations like installing pandas[all] (or full or kitchen-sink) and possibly other subsets like pandas[io].
>
> I think it would be a mistake to try and redefine pandas to be some huge set of dependencies, and to introduce some other package to be the current pandas.

Agreed with this. Extras should stay extras.

IIRC, the -core thing is probably specific to conda-forge; I've never seen it used with a project on PyPI.

@fangchenli
Member

> There is already a great mechanism and all that is needed are some recommendations like installing pandas[all] (or full or kitchen-sink) and possibly other subsets like pandas[io].
>
> I think it would be a mistake to try and redefine pandas to be some huge set of dependencies, and to introduce some other package to be the current pandas.

Agree. And we should have more detailed installation instructions to educate users on using extras.

And I think we could have a pandas-core containing all (or some) of the extension modules so we could have more fine-grained tests and benchmarks. It would also speed up CI and improve the developer experience.

@datapythonista
Member Author

Thanks all for the feedback. It doesn't seem there is much interest in moving forward with this at this point. I guess something similar can be considered in the future for conda-forge, which doesn't have extras like pip. I'll close this issue, which was specific to making the "normal" pandas package install a subset of optional dependencies, an idea that doesn't have much support.

9 participants