
ENH: Synchronize pickle with upstream #206


Merged: 14 commits, Sep 5, 2022

Conversation

bashtage
Contributor

Align APIs
Add tests

  • Tests added: Please use assert_type() to assert the type of any return value
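
The assert_type() convention named in the checklist can be sketched as follows. This is a minimal illustration that uses the standard-library pickle module rather than pandas; the helpers check, dump_to_path, and load_from_path are hypothetical stand-ins and are not the helpers from the pandas-stubs test suite. The idea is that assert_type() is verified by the static type checker, while check() verifies the same expectation at runtime.

```python
import os
import pickle
import tempfile
from typing import TypeVar

try:
    from typing import assert_type  # available from Python 3.11
except ImportError:
    def assert_type(val, typ):  # runtime no-op fallback for older Pythons
        return val

T = TypeVar("T")


def check(actual: T, klass: type) -> T:
    """Hypothetical runtime counterpart to assert_type()."""
    if not isinstance(actual, klass):
        raise RuntimeError(f"Expected type '{klass}' but got '{type(actual)}'")
    return actual


def dump_to_path(obj: object, path: str) -> None:
    """Stand-in for a to_pickle-style writer (assumption, not pandas)."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load_from_path(path: str) -> object:
    """Stand-in for a read_pickle-style reader (assumption, not pandas)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # The static checker validates assert_type(); check() validates at runtime.
    check(assert_type(dump_to_path({"a": 1}, path), None), type(None))
    restored = check(assert_type(load_from_path(path), object), object)
finally:
    os.unlink(path)
```

Pairing the two checks means a stub that drifts from the runtime behavior fails in at least one of them.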

@bashtage
Contributor Author

Failure is likely a bug in pyright

    ReadPickleBuffer,
    StorageOptions,
    WriteBuffer,
)

def to_pickle(
Collaborator

Since pandas.io.pickle.to_pickle() is not public, we should delete this here.

Collaborator

still need to handle removal of to_pickle() from this file (and then the associated tests)

tests/test_io.py Outdated
os.unlink(file.name)

with tempfile.NamedTemporaryFile(delete=False) as file:
    check(assert_type(to_pickle(DF, file), None), type(None))
Collaborator

Since to_pickle() is not public, no need to test it.

Contributor Author

It is listed in pandas/io/api.py, so I assume that makes it public even if it is not in the docs.

Collaborator

> It is listed in pandas/io/api.py, so I assume that makes it public even if it is not in the docs.

There are a few schools of thought here:

  1. We trim down pandas-stubs so that it only type checks what is in the documented public API
  2. If a function or class is "public" in the sense that it does not begin with an underscore but is not documented, we create a stub for it and test it.
  3. If a function or class is "public" in the sense that it does not begin with an underscore and is not documented, we are agnostic on creating a stub or testing that stub.

So far, @twoertwein and I have been leaning towards (1). With pandas.io.api.to_pickle(), you are proposing (2).

@twoertwein What are your thoughts?

Contributor Author

I think it is more a question of what the "API" is. To me it is something like

  • Documented as part of the docs
  • Imported into an api.py file or the top-level `__init__.py` file.

I think (2) is too broad because there hasn't been enough effort in pandas to prefix private modules, classes and methods with an underscore.

In short, if it seems to be part of an API, then it is reasonable to include it.

A related point is documenting public methods of classes that appear partially in the docs, something like Klass.method(arg1, arg2). Should all public methods of Klass be documented, or just Klass.__init__ and Klass.method?

Member

My goal is that everything that is meant to be public (which is often unclear) is documented and in pandas-stubs. Personally, I think the best way is to remove any symbol from the stubs that is not meant to be public.

  • If it seems reasonable that it is meant to be public, it might be a better user experience, if we first open an issue at pandas before potentially removing it.
  • If we remove too much, we get user feedback and can then create an issue at pandas.

I believe typeshed uses # undocumented to indicate which parts of their stubs are technically not documented. If we want to keep more in here, we could follow that approach. Luckily we have rather good connections to pandas :) so we might just open an issue there :)

Member

There are some grey zones, such as a private superclass whose inherited methods are public in a public child class. In that case I would keep the parent class (in the long term, I would like pandas-stubs to align with pandas), define __all__, but exclude the class from it.

Collaborator

> I think it is more of what is the "API". To me it is something like
>
>   • Documented as part of the docs
>   • Imported into an api.py file or the top-level `__init__.py` file.

I think that's a fair definition. By that definition, to_pickle() is public, but it appears to be undocumented, so someone should create an issue in the pandas repo to indicate that it should be documented. In that case, we'll get a reaction of either "wait - that shouldn't be public" or "Yes, let's document".

Looking at the source, pandas.to_pickle() is what is really public, so I think you should change the tests to use pandas.to_pickle() rather than pandas.io.api.to_pickle(). The same goes for pandas.read_pickle() versus pandas.io.api.read_pickle().

@Dr-Irv
Collaborator

Dr-Irv commented Aug 18, 2022

> Failure is likely a bug in pyright

If you can create a small test case for pyright, and submit it to them, they are usually very fast at doing fixes.

@twoertwein
Member

twoertwein commented Aug 18, 2022

I don't think this is a bug in pyright.

Typeshed says that readline takes an argument https://github.com/python/typeshed/blob/1e1a5868936145392d421e3e44258d9d0863ce4c/stdlib/tempfile.pyi#L211 but ReadPickleBuffer expects no argument for readline. I think pandas has the same issue.

edit: on the other hand, the argument is optional, so maybe it is a bug in pyright.

@Dr-Irv
Collaborator

Dr-Irv commented Aug 18, 2022

> I don't think this is a bug in pyright.
>
> Typeshed says that readline takes an argument https://github.com/python/typeshed/blob/1e1a5868936145392d421e3e44258d9d0863ce4c/stdlib/tempfile.pyi#L211 but ReadPickleBuffer expects no argument for readline. I think pandas has the same issue.
>
> edit: on the other hand, the argument is optional, so maybe it is a bug in pyright.

I think the arguments have to be consistent, even if it is optional. So if we just change readline() in the definition of ReadPickleBuffer to match the arguments in typeshed, then I think we will be fine.

It's not getting caught in pandas testing because the CI for pyright doesn't type check the testing code.

@twoertwein
Member

I opened pandas-dev/pandas#48144 to fix this issue. You can simply replace the return type of readline with bytes.
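
For context, a rough sketch of what a ReadPickleBuffer-style protocol with a bytes-returning readline could look like. PickleReadable and first_line are hypothetical names, and the signatures are assumptions for illustration, not the definitions used in pandas or pandas-stubs. Note that the runtime isinstance() check on a runtime_checkable protocol only looks for the method names; comparing the signatures is left to the static type checker.

```python
import io
from typing import Protocol, runtime_checkable


@runtime_checkable
class PickleReadable(Protocol):
    """Hypothetical stand-in for a ReadPickleBuffer-style protocol."""

    def read(self, n: int = -1) -> bytes: ...

    # Optional size argument and a plain bytes return type (assumed shape).
    def readline(self, size: int = -1) -> bytes: ...


def first_line(buf: PickleReadable) -> bytes:
    """Read one line from any object that structurally matches the protocol."""
    return buf.readline()


buffer = io.BytesIO(b"first\nsecond\n")
is_match = isinstance(buffer, PickleReadable)  # checks method presence only
line = first_line(buffer)
```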

Correct definition
Use ensure_clean
Ensure to_pickle is tested since it appears in pandas/io/api
Import read_pickle from main
Fix merge conflicts
Remove extra def
Test on series
Collaborator

@Dr-Irv Dr-Irv left a comment

We shouldn't test or have a stub for pd.io.api.to_pickle(). I think I brought that up somewhere in this PR.

@Dr-Irv
Collaborator

Dr-Irv commented Aug 22, 2022

Also need to resolve conflicts

@bashtage
Contributor Author

> We shouldn't test or have a stub for pd.io.api.to_pickle() I think I brought that up somewhere in this PR

It has an open issue on pandas. I was leaning towards anything in API being part of the API irrespective of whether it is in the docs. Path forward would be to deprecate from API in pandas, then drop here.

@Dr-Irv
Collaborator

Dr-Irv commented Aug 22, 2022

> > We shouldn't test or have a stub for pd.io.api.to_pickle() I think I brought that up somewhere in this PR
>
> It has an open issue on pandas. I was leaning towards anything in API being part of the API irrespective of whether it is in the docs. Path forward would be to deprecate from API in pandas, then drop here.

Can you link to that issue here? I know @twoertwein created an issue to ask how we want to handle the API in general (pandas-dev/pandas#48186), but I think creating a specific issue for pandas.io.api.to_pickle() over there would be good as well.

@Dr-Irv
Collaborator

Dr-Irv commented Aug 22, 2022

have to resolve conflicts

Collaborator

@Dr-Irv Dr-Irv left a comment

Have to remove to_pickle() from io/pickle.pyi

@bashtage
Contributor Author

bashtage commented Sep 5, 2022

I just noticed that to_pickle is a top-level function (i.e., pd.to_pickle), so I think it needs to be deprecated in pandas before it should be removed here. I think this is ready.

@bashtage
Contributor Author

bashtage commented Sep 5, 2022

green.

@Dr-Irv
Collaborator

Dr-Irv commented Sep 5, 2022

> I just noticed that to_pickle is a top-level function (i.e., pd.to_pickle), so I think it needs to be deprecated in pandas before it should be removed here. I think this is ready.

It may be a top-level function, but it is not documented, so I think that's an error in the implementation.

Can you create an issue in pandas for this?

@Dr-Irv
Collaborator

Dr-Irv commented Sep 5, 2022

> > I just noticed that to_pickle is a top-level function (i.e., pd.to_pickle), so I think it needs to be deprecated in pandas before it should be removed here. I think this is ready.
>
> It may be a top-level function, but it is not documented, so I think that's an error in the implementation.
>
> Can you create an issue in pandas for this?

I'll let @twoertwein give his opinion to resolve this. Summary:

  1. pd.to_pickle() is in the API, but not documented.
  2. DataFrame.to_pickle() and Series.to_pickle() are documented and in the API.
  3. In this PR, @bashtage has maintained a stub for pd.to_pickle().
  4. In my opinion, as it is not documented, we shouldn't provide a stub.

What do you think, @twoertwein?

@twoertwein
Member

Creating an issue/PR at pandas is probably best.

In the meantime, I wouldn't mind merging this PR. We can address this small inconsistency in a follow-up.

Collaborator

@Dr-Irv Dr-Irv left a comment

thanks @bashtage

@Dr-Irv Dr-Irv merged commit 2f2289c into pandas-dev:main Sep 5, 2022
@bashtage bashtage deleted the io-pickle branch September 15, 2022 18:33