Skip to content

BUG: Fix pd.read_orc raising AttributeError #40970

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions pandas/io/orc.py
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,8 @@ def read_orc(
if distutils.version.LooseVersion(pyarrow.__version__) < "0.13.0":
raise ImportError("pyarrow must be >= 0.13.0 for read_orc")

import pyarrow.orc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pyarrrow should only be imported once, so should remove the import pyarrow statement a couple lines up.

Copy link
Contributor Author

@amznero amznero Apr 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line-49 will check the PyArrow's version, it depends on line-47(import PyArrow).

Copy link
Member

@lithomas1 lithomas1 Apr 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking again, it's better to use

def import_optional_dependency(

so would be something like

from pandas.compat._optional import import_optional_dependency
orc = import_optional_dependency("pyarrow.orc")
...
orc_file = orc.ORCFile(handles.handle)

and then we could get rid of the other code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already tested it and it looks good. But it will remind users to install pyarrow.orc instead of pyarrow since import_optional_dependency cannot receive customized error messages. Is that OK?

Log:
ImportError: Missing optional dependency 'pyarrow.orc'. Use pip or conda to install pyarrow.orc.


with get_handle(path, "rb", is_text=False) as handles:
orc_file = pyarrow.orc.ORCFile(handles.handle)
return orc_file.read(columns=columns, **kwargs).to_pandas()
1 change: 0 additions & 1 deletion pandas/tests/io/test_orc.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,6 @@
import pandas._testing as tm

pytest.importorskip("pyarrow", minversion="0.13.0")
pytest.importorskip("pyarrow.orc")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the tests might fail without this, since orc doesn't import properly for certain OSes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will raise some errors on Windows env.

But if add this line to test_orc.py, the test case can not find the bug mentioned in #40918, it just skips this test case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe try pyarrow._orc? (That's what it tries to pyarrow.orc tries to find for me on windows)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, let me re-describe the reason for deleting this line.

If the PyArrow package is installed from Conda, pytest.importorskip("pyarrow.orc") will successfully import pyarrow.orc module, so test_orc.py-Line143 got = read_orc(inputfile).iloc[:10] will works fine.

But, AttributeError will be raised if the user uses pd.read_orc directly without importing pyarrow.orc first. The test case failed to find this bug.


Maybe we should keep pytest.importorskip("pyarrow.orc") and delete pytest.importorskip("pyarrow.orc"). Then fix this bug in pandas/io/orc.py, and make pyarrow be imported only once(discuss below).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I am hypothesizing that the bug will reproduce since pyarrow.orc is not imported by pytest.importorskip(pyarrow._orc), and running pytest.importorskip(pyarrow._orc) will skip the test on Windows where pyarrow orc support is not present, as I think it is pyarrow._orc where the orc stuff is actually implemented. I cannot verify this works, though.


pytestmark = pytest.mark.filterwarnings(
"ignore:RangeIndex.* is deprecated:DeprecationWarning"
Expand Down