Split pandas package into pandas and pandas-core #57550
Comments
I like the idea of having a "minimal" installation that covers most common use cases and avoids downloading unneeded packages. I would suggest the name With respect to the One other thought - I imagine the current test suite would have to be split into tests appropriate for |
I do think this would be useful but mainly for considering how the code is packaged and less about how dependencies are bundled (which seems to be the focus here?). For pip installations we have the pip extras set up and understandably conda doesn't have something like that (yet). If this re-packaging is to make the conda installation story nicer I'm not sure if it's worth it. Just noting that https://anaconda.org/conda-forge/jupyter_core |
My main point is about the UX, anything else I'm personally flexible and can be discussed later. I think |
As long as PDEP-10 holds I think pyarrow is a core package. Outside of that how much of a difference is this expected to make? I think there is also a downside to having separate packages because then you start to fragment the user base |
The difference is that by default users will get our recommended dependencies, as opposed to now, since the main package will add them, while still leaving the option for users to install a version with no optional dependencies. Making up the numbers, but if 20% of users have PyArrow now, maybe we'll get 80% of them, making pandas faster for many users who don't know or don't care much about what to install, and trust us to provide what they need by default. I personally don't see the fragmentation problem you mention. This solution has been implemented for decades in the Linux world. If you want KDE for example, you just install the |
There is already a great mechanism and all that is needed are some recommendations like installing I think it would be a mistake to try and redefine pandas to be some huge set of dependencies, and to introduce some other package to be the current pandas. |
I think this is well known, but it feels worth stating anyway: no matter how it's implemented, if there are ways of using pandas without pyarrow, then we have to maintain both "pandas with pyarrow" and "pandas without pyarrow", which to me was the main reason for PDEP-10. If pyarrow is always opt-in, then I don't see much issue with this. But if we are having e.g. "string[pyarrow] when pyarrow is installed and otherwise numpy object" type inference, then users will have different behavior in pandas itself depending on whether a third-party package is installed or not. That seems like a very bad user experience to me. |
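To make the concern concrete, here is a minimal sketch of what install-dependent inference could look like. `inferred_string_dtype` is a hypothetical helper written for illustration; it is not a pandas function and does not describe how pandas actually infers dtypes.

```python
import importlib.util


def inferred_string_dtype(pyarrow_installed=None):
    """Hypothetical rule: choose a string dtype based on whether pyarrow is importable."""
    if pyarrow_installed is None:
        # Look the package up in the current environment without importing it.
        pyarrow_installed = importlib.util.find_spec("pyarrow") is not None
    return "string[pyarrow]" if pyarrow_installed else "object"


# The same call gives a different answer in two otherwise identical environments.
print(inferred_string_dtype(pyarrow_installed=True))
print(inferred_string_dtype(pyarrow_installed=False))
```

Under a rule like this, the result of the same user code changes with the environment, which is the behavior difference described above.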
That was my understanding of one of the core reasons for PDEP-10. I was one of the few people voting against PDEP-10, but now that it has been voted on, is it not supposed to be accepted and stuck with? Unless a new PDEP or an amendment to it is put forward, surely this is out of scope until then. According to the PDEP the warning should also be retained, and not repealed by a close majority vote which might also not follow PDEP rules. |
I agree, and it's surely not the goal of this issue to cancel PDEP-10. Also, while having two packages could be used to install PyArrow more broadly without requiring it, the scope of what I'm discussing here is not limited to PyArrow and could be applied to other dependencies that we recommend (or assume users are most likely to want) but don't want to force, for example Matplotlib. From the previous discussions, it seems like several people have an interest in not moving forward with PDEP-10, at least as is. I fully agree that this issue is not where we want to decide or even discuss it. But if there is interest in implementing the two packages for default and minimal dependencies, I think it can make a difference for future discussions on requiring Arrow. And clearly, this issue doesn't help with cleaning our codebase of I'm personally +1 on moving forward with PDEP-10, fully requiring PyArrow and keeping the warning, but if many people dislike the PDEP now, I think we'll have to have a new discussion. |
Agreed with this. Extras should stay extras. IIRC, the -core thing is probably specific to conda-forge, I've never seen it used with a project on PyPI. |
Agree. And we should have more detailed installation instructions to educate users on using extras. And I think we could have a |
Thanks all for the feedback. There doesn't seem to be much interest in moving forward with this at this point. I guess in the future something similar can be considered for conda-forge, which doesn't have extras like pip, but I'll close this issue, which was specific to making the "normal" |
Maybe worth a PDEP, but opening as an issue first to see what other people think, and see if it's needed or worth the time.
The status quo for dependencies in pandas has been to depend on numpy, pytz, and dateutil, and for everything else just make them optional. This has been working reasonably ok, but I don't think it's ideal, and the discussion on whether PyArrow should be required or optional is one example of it.
In my opinion, there are two main things to consider. The first one is about users, and I see two broad groups:
`pip/conda install pandas` and wants things to work without much hassle

I think the current approach favors group 2, and causes users in group 1 to experience many exceptions on missing dependencies if they want to use key functionalities like `.read_excel()` or `.plot()`, or to have suboptimal performance if they use `.read_csv()` and others without PyArrow. Of course this is avoided if they install pandas with a distribution that includes the dependencies, which I think is common.

There is a second thing, which is how code is structured for soft dependencies, but I will leave it out of this discussion, as it's another complex but somewhat independent topic.
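As a rough illustration of the missing-dependency problem, the sketch below reports which optional packages cannot be found in the current environment. The feature-to-package mapping is an assumption made for this example (for instance `openpyxl` as the Excel engine), and `missing_optional` is a hypothetical helper, not part of pandas.

```python
import importlib.util

# Assumed mapping from a pandas feature to the import name of a package it
# commonly relies on (illustrative only, not an official list).
OPTIONAL_FOR_FEATURE = {
    ".read_excel()": "openpyxl",
    ".plot()": "matplotlib",
    ".read_csv() with the PyArrow engine": "pyarrow",
}


def missing_optional(features=OPTIONAL_FOR_FEATURE):
    """Return {feature: module} for every module that cannot be found."""
    return {
        feature: module
        for feature, module in features.items()
        if importlib.util.find_spec(module) is None
    }


if __name__ == "__main__":
    for feature, module in missing_optional().items():
        print(f"{feature} needs '{module}', which is not installed")
```

In a minimal environment this would list every entry; in a full distribution it would likely print nothing.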
What I propose regarding the packaging is what many other packages do, for example R in the packaging on Linux distributions: distribute two different packages, `pandas` and `pandas-core`.

The existing package `pandas` would be renamed to `pandas-core`, and users who want a minimal installation would be able to use it. A new metapackage would be created with the existing name `pandas`. It'd be a metapackage / "empty" package with just dependencies on `pandas-core`, `pyarrow`, `matplotlib`... and any other package we consider important to have by default.

I think this would solve in a reasonable way the discussion on whether to make PyArrow required, and in general improve the experience of most pandas users.
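The effect of the split can be sketched with a toy dependency graph. This only models the proposal as described here: `REQUIRES` and `install_set` are made up for illustration, the package names are written as they appear in this issue, and the real transitive dependencies of `pyarrow` and `matplotlib` are left out.

```python
# Toy dependency graph: package name -> direct requirements.
# "pandas" carries no code of its own here, only dependencies.
REQUIRES = {
    "pandas-core": ["numpy", "pytz", "dateutil"],
    "pandas": ["pandas-core", "pyarrow", "matplotlib"],
}


def install_set(package, requires=REQUIRES):
    """Return the sorted list of packages pulled in by installing `package`."""
    seen = set()
    stack = [package]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        # Packages without an entry are treated as having no requirements.
        stack.extend(requires.get(name, []))
    return sorted(seen)


print(install_set("pandas-core"))  # minimal installation
print(install_set("pandas"))       # default installation via the metapackage
```

Installing `pandas-core` in this model gives today's minimal set, while installing `pandas` adds the recommended packages on top of it.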
@pandas-dev/pandas-core thoughts?