Skip to content

Support for array-like and pandas-like data #470

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

smathot
Copy link
Contributor

@smathot smathot commented Jun 19, 2025

This is a small patch that does not change any functionality or fixes issues, but rather makes Pingouin more flexible in the types of data that it accepts. Specifically, right now Pingouin does not (always) accept data that is compatible with numpy.ndarray or pandas.DataFrame when it is not exactly of that type. This patch adds some flexibility by checking whether data can be converted to an ndarray or DataFrame when it is not of those types. It seems that data-type checking happens at different places throughout the code, so I'm not sure I caught everything.

For me, the main reason to submit this PR is to make DataMatrix 2.0, which is close to fully compatible with DataFrame, also fully compatible with Pingouin. With this patch it is. The DataMatrix unit tests for now still fail on GitHub, but that's because it's pulling the latest stable version of Pingouin.

All Pingouin unittests pass with this patch.

@smathot smathot changed the title Support dataframe like Support for array-like and pandas-like data Jun 19, 2025
Copy link

codecov bot commented Jun 19, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.54%. Comparing base (c939d2d) to head (41df958).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #470   +/-   ##
=======================================
  Coverage   98.54%   98.54%           
=======================================
  Files          19       19           
  Lines        3360     3360           
  Branches      492      492           
=======================================
  Hits         3311     3311           
  Misses         26       26           
  Partials       23       23           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@remrama
Copy link
Contributor

remrama commented Jun 19, 2025

I'm hesitant to support this for fear of getting into a pattern of accommodating compatibility with niche packages. @smathot, is it too much for DataMatrix users to simply use the to-pandas features of DataMatrix right before calling Pingouin?

My primary concerns are:

  1. Checking for a __dataframe__ attribute doesn't have any immediate red flags to me now, but it seems more principled to check against the more reliable and consistent DataFrame type. This might introduce confusion later on, like if someone is using custom code with the same attribute.
  2. Accommodating one niche package now opens up the floodgates for the many niche communities that use Pingouin to do something similar, which could get messy.

It seems to me that DataMatrix could deal with the one line conversion prior to calling Pingouin, or alternatively the onus could be on the DataMatrix code base to add a method to DataMatrix objects that call Pingouin and do the conversion on the DataMatrix end. What do you think @smathot ? Am I missing a complexity here that makes this change to Pingouin more critical for your users?

@smathot
Copy link
Contributor Author

smathot commented Jun 19, 2025

Yes, I assumed that you wouldn't want to make ad-hoc exceptions. For this reason, you'll see that I avoided the patch from specifically targeting datamatrix, and instead used the DataFrame interchange protocol. This is an attempt at improving interoperability between packages that somehow represent data (not just DataMatrix), and essentially comes down to libraries implementing a __dataframe__() function. The same is true for the __array__() function when it comes NumPy, although that seems more like a soft standard that doesn't really have a name.

I think you can safely adopt these patches, because they don't have any disadvantage for existing users, are not specific to one package, and they have a small benefit of making it easier for other data packages to work with Pingouin ☺️

@remrama
Copy link
Contributor

remrama commented Jun 19, 2025

Oh cool, thanks for the explanation and link @smathot, I wasn't aware of this interchange protocol. I see how this could be useful.

It's hard for me to tell from that link how actively adopted this is. But some quick searching makes me skeptical that this has a lot of support these days. Again, please correct me! Sounds like a cool idea and I'd be happy to see any links to active projects that are developing this protocol, and especially some major projects that are actively utilizing it. Maybe the repo was transfered somewhere else or something and I'm looking at the wrong thing. But here's what I found:

  • The latest open Issue on the dataframe-api github repo is a request to archive the project (Please consider archiving project or marking as deprecated data-apis/dataframe-api#363)
  • From Pandas, the api.interchange.from_dataframe docs page has a Note about deprecation ("From pandas 2.3 onwards, from_dataframe uses the PyCapsule Interface, only falling back to the interchange protocol if that fails") and a Warning about "sever implementation issues". Maybe you want to use the "PyCapsule interface" mentioned here? (I didn't look into that)

If I'm wrong and this is an actively-supported thing and the community thinks this is a good idea for Pingouin, I would still recommend:

  • adding unit-tests that include non-pandas dataframes we expect to be compatible (e.g., polars)
  • adding an explanation or link to this interchange protocol thing in a comment line above the assertions
  • minor update to assertion statement that says something like "must be dataframe-like" or "must conform to dataframe API standard", maybe including a link (these last 2 points are just to help someone like me who had no idea about this)

@smathot
Copy link
Contributor Author

smathot commented Jun 20, 2025

Reading this, I think your assessment is regarding the DataFrame protocol is probably correct, and that not many libraries except for DataFrame and DataMatrix are going to implement a __dataframe__ property. Checking for a __dataframe__ property still seems safe to me, with little practical disadvantage, and as I mentioned with a small advantage.

When it comes to checking for __array__, I think there are more obvious data types that benefit, such as tensors from machine learning libraries, which are widely used. (Though of course not typically in the context that pingouin is used.)

It's up to you! I would of course prefer this patch to be merged, because DataMatrix benefits from it. If you agree, then I can also add the polish you suggest. But if not, then no hard feelings!

@remrama
Copy link
Contributor

remrama commented Jun 22, 2025

Okay, that makes sense. Thanks for the update.

I would still vote no on __dataframe__, mostly just to (a) keep Pingouin's underlying codebase as simple/straight-forward as possible and (b) prevent unforeseen issues with it in the future. I agree with you that it seems safe at first glance now, but I don't know how it would play out. There isn't a lot of active development on Pingouin right now, making it even more important to keep the risks low. (I don't really see a benefit of checking for __array__ either. I am not aware of any big need for that, and again I prefer to keep the codebase as self-explanatory as possible.)

But this decision is most definitely not up to only me! I have a minor voice here and we will have to wait for others to chime in before moving forward.

@raphaelvallat
Copy link
Owner

Hi there,

Thanks for the PR and detailed explanation @smathot. I tend to agree with @remrama that I don't see a clear benefit of adding this to Pingouin. Would the conversion that @remrama mentioned be feasible on your side?

It seems to me that DataMatrix could deal with the one line conversion prior to calling Pingouin, or alternatively the onus could be on the DataMatrix code base to add a method to DataMatrix objects that call Pingouin and do the conversion on the DataMatrix end.

Thank you both,
Raphael

@smathot
Copy link
Contributor Author

smathot commented Jun 23, 2025

Would the conversion that @remrama mentioned be feasible on your side?

Sure, it's possible. But it means that I have to instruct users with something like: "You can pass a DataMatrix to any function that accepts a DataFrame, except for the following pingouin functions: …". And the same is true for array-like objects like tensors. I don't fully understand the hesitation, to be honest! After all, this is a safe PR with no disadvantages, no meaningful decrease in readability, and only minor advantages. But it's 100% your call! Feel free to close the PR if it doesn't work for you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants