Class to read OpenDocument Tables #25427
Merged
63 commits
479e639
Class to read OpenDocument Tables
detrout 8be4b67
Remove unneeded assignments
detrout 77d9033
Rename filepath_or_stream to filepath_or_buffer
detrout 47b2ffb
Use compat.string_types instead of str
detrout 0fa2ac9
Use pd as name as pandas
detrout e6e2365
Use single underscore for private functions
detrout 1bbf284
Return an unparsed sheet.
detrout d5c7ec0
Move ODFReader get_sheet exception testing code to its own function
detrout 691f1e9
Append _raises to end of function name that tests exceptions
detrout 93c2b66
Remove test docstrings that include no useful information
detrout 394c4bd
Indicate likely minimum version.
detrout b149d84
Convert notes about some OpenDocument tests to comments
detrout 19587b3
Add note about new OpenDocument functionality to whatsnew
detrout 60a5bc1
Sort imports correctly
detrout 1fef008
Use str instead of compat.string_types
detrout 7148995
Remove leading underscore from ODFParser
detrout 5db1a0b
Remove obsolete class (object)
detrout 83c0243
Improve docstring text
detrout 735e2b4
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd 8302fd7
Added test_odf
WillAyd d0df3bd
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd 47597c9
Class naming consistency
WillAyd 9e1799a
Whatsnew linting
WillAyd d5c60ab
Added optional dependency load
WillAyd 39cfecf
typo
WillAyd 8a9a66c
Updated inheritance to use excel reader interface
WillAyd fd7663f
Added ods test files
WillAyd 3bcc1b7
Updated tests
WillAyd 15e69eb
convert_float handling
WillAyd 65615cd
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd 9584753
Fixed missing value handling
WillAyd 9dc34f4
Fixed error handling
WillAyd 5e32f6d
Fixed bool handling
WillAyd 6360c07
Skip missing file on master
WillAyd 4227268
datetime compat
WillAyd 80607b0
fixed row repeat
WillAyd 43f7160
multiindex handling
WillAyd 4da0445
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd cbbc653
Handled horizontally merged cells
WillAyd 1227216
Converted to pytest idiom
WillAyd 696ed5d
Test idiom cleanup
WillAyd 49fff9f
Removed duplicative test files
WillAyd 7b08304
Raised NotImplemented for vertical merging
WillAyd 4d97d84
Table attribute access simplification
WillAyd 59cdf0b
Typing and func cleanups
WillAyd 98d3ca7
lint and isort
WillAyd fb48d8d
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd 6576af9
typing fixup
WillAyd 4dc1b51
Skip ods files for xlrd
WillAyd 8ce45b4
Removed one-off tests
WillAyd f9f88b0
Handled defusedxml warnings
WillAyd 3e0d758
Updated assert_warnings funcs to allow DeprecationWarnings
WillAyd ff28993
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd 7396ad6
Updated to config_init.py
WillAyd 5a440a4
Updated whatsnew
WillAyd 250a3d3
Updated io.rst
WillAyd d7e7d05
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd 93adedb
Refactored to simplify
WillAyd 62a37e7
Removed unnecessary test
WillAyd 13fb76f
lint fixup
WillAyd fb6c5ee
mypy error
WillAyd 5c839f4
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd 4026fc1
Doc updates
WillAyd
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -32,6 +32,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like | |
text;`HTML <https://en.wikipedia.org/wiki/HTML>`__;:ref:`read_html<io.read_html>`;:ref:`to_html<io.html>` | ||
text; Local clipboard;:ref:`read_clipboard<io.clipboard>`;:ref:`to_clipboard<io.clipboard>` | ||
binary;`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__;:ref:`read_excel<io.excel_reader>`;:ref:`to_excel<io.excel_writer>` | ||
binary;`OpenDocument <http://www.opendocumentformat.org>`__;:ref:`read_excel<io.ods>`; | ||
binary;`HDF5 Format <https://support.hdfgroup.org/HDF5/whatishdf5.html>`__;:ref:`read_hdf<io.hdf5>`;:ref:`to_hdf<io.hdf5>` | ||
binary;`Feather Format <https://github.com/wesm/feather>`__;:ref:`read_feather<io.feather>`;:ref:`to_feather<io.feather>` | ||
binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>` | ||
|
@@ -2779,9 +2780,10 @@ parse HTML tables in the top-level pandas io function ``read_html``. | |
Excel files | ||
----------- | ||
|
||
The :func:`~pandas.read_excel` method can read Excel 2003 (``.xls``) and | ||
Excel 2007+ (``.xlsx``) files using the ``xlrd`` Python | ||
module. The :meth:`~DataFrame.to_excel` instance method is used for | ||
The :func:`~pandas.read_excel` method can read Excel 2003 (``.xls``) | ||
files using the ``xlrd`` Python module. Excel 2007+ (``.xlsx``) files | ||
can be read using either ``xlrd`` or ``openpyxl``. | ||
The :meth:`~DataFrame.to_excel` instance method is used for | ||
saving a ``DataFrame`` to Excel. Generally the semantics are | ||
similar to working with :ref:`csv<io.read_csv_table>` data. | ||
See the :ref:`cookbook<cookbook.excel>` for some advanced strategies. | ||
|
@@ -3217,7 +3219,27 @@ The look and feel of Excel worksheets created from pandas can be modified using | |
* ``float_format`` : Format string for floating point numbers (default ``None``). | ||
* ``freeze_panes`` : A tuple of two integers representing the bottommost row and rightmost column to freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default ``None``). | ||
|
||
.. _io.ods: | ||
|
||
OpenDocument Spreadsheets | ||
------------------------- | ||
|
||
.. versionadded:: 0.25 | ||
|
||
The :func:`~pandas.read_excel` method can also read OpenDocument spreadsheets | ||
using the ``odfpy`` module when called with ``engine='odf'``. The semantics | ||
and features for reading OpenDocument spreadsheets match what can be done for | ||
`Excel files`_. | ||
|
||
can you show an example here (code is ok) |
||
.. code-block:: python | ||
|
||
   # Returns a DataFrame | ||
   pd.read_excel('path_to_file.ods', engine='odf') | ||
|
||
.. note:: | ||
|
||
   Currently pandas only supports *reading* OpenDocument spreadsheets. Writing | ||
   is not implemented. | ||
|
||
.. _io.clipboard: | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,176 @@ | ||
from typing import List | ||
|
||
from pandas.compat._optional import import_optional_dependency | ||
|
||
import pandas as pd | ||
from pandas._typing import FilePathOrBuffer, Scalar | ||
|
||
from pandas.io.excel._base import _BaseExcelReader | ||
|
||
|
||
class _ODFReader(_BaseExcelReader): | ||
    """Read tables out of OpenDocument formatted files | ||
|
||
    Parameters | ||
    ---------- | ||
    filepath_or_buffer: string, path to be parsed or | ||
        an open readable stream. | ||
    """ | ||
    def __init__(self, filepath_or_buffer: FilePathOrBuffer): | ||
        import_optional_dependency("odf") | ||
        super().__init__(filepath_or_buffer) | ||
|
||
    @property | ||
    def _workbook_class(self): | ||
        from odf.opendocument import OpenDocument | ||
        return OpenDocument | ||
|
||
    def load_workbook(self, filepath_or_buffer: FilePathOrBuffer): | ||
        from odf.opendocument import load | ||
        return load(filepath_or_buffer) | ||
|
||
    @property | ||
    def empty_value(self) -> str: | ||
        """Property for compat with other readers.""" | ||
        return '' | ||
|
||
    @property | ||
    def sheet_names(self) -> List[str]: | ||
        """Return a list of sheet names present in the document""" | ||
        from odf.table import Table | ||
|
||
        tables = self.book.getElementsByType(Table) | ||
        return [t.getAttribute("name") for t in tables] | ||
|
||
    def get_sheet_by_index(self, index: int): | ||
        from odf.table import Table | ||
        tables = self.book.getElementsByType(Table) | ||
        return tables[index] | ||
|
||
    def get_sheet_by_name(self, name: str): | ||
        from odf.table import Table | ||
|
||
        tables = self.book.getElementsByType(Table) | ||
|
||
        for table in tables: | ||
            if table.getAttribute("name") == name: | ||
                return table | ||
|
||
        raise ValueError("sheet {name} not found".format(name=name)) | ||
|
||
    def get_sheet_data(self, sheet, convert_float: bool) -> List[List[Scalar]]: | ||
        """Parse an ODF Table into a list of lists | ||
        """ | ||
        from odf.table import CoveredTableCell, TableCell, TableRow | ||
|
||
        covered_cell_name = CoveredTableCell().qname | ||
        table_cell_name = TableCell().qname | ||
        cell_names = {covered_cell_name, table_cell_name} | ||
|
||
        sheet_rows = sheet.getElementsByType(TableRow) | ||
        empty_rows = 0 | ||
        max_row_len = 0 | ||
|
||
        table = []  # type: List[List[Scalar]] | ||
|
||
        for i, sheet_row in enumerate(sheet_rows): | ||
            sheet_cells = [x for x in sheet_row.childNodes | ||
                           if x.qname in cell_names] | ||
            empty_cells = 0 | ||
            table_row = []  # type: List[Scalar] | ||
|
||
            for j, sheet_cell in enumerate(sheet_cells): | ||
                if sheet_cell.qname == table_cell_name: | ||
                    value = self._get_cell_value(sheet_cell, convert_float) | ||
                else: | ||
                    value = self.empty_value | ||
|
||
                column_repeat = self._get_column_repeat(sheet_cell) | ||
|
||
                # Queue up empty values, writing only if content succeeds them | ||
                if value == self.empty_value: | ||
                    empty_cells += column_repeat | ||
                else: | ||
                    table_row.extend([self.empty_value] * empty_cells) | ||
                    empty_cells = 0 | ||
                    table_row.extend([value] * column_repeat) | ||
|
||
            if max_row_len < len(table_row): | ||
                max_row_len = len(table_row) | ||
|
||
            row_repeat = self._get_row_repeat(sheet_row) | ||
            if self._is_empty_row(sheet_row): | ||
                empty_rows += row_repeat | ||
            else: | ||
                # add blank rows to our table | ||
                table.extend([[self.empty_value]] * empty_rows) | ||
                empty_rows = 0 | ||
                for _ in range(row_repeat): | ||
                    table.append(table_row) | ||
|
||
        # Make our table square | ||
        for row in table: | ||
            if len(row) < max_row_len: | ||
                row.extend([self.empty_value] * (max_row_len - len(row))) | ||
|
||
        return table | ||
|
||
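A note on the "queue up empty values" step in `get_sheet_data`: the idea can be sketched on plain Python data, without odfpy. This is only an illustration under stated assumptions, not the code from this PR. The names `expand_row`, `square`, and `EMPTY` are hypothetical, and each cell is modelled as a `(value, repeat)` pair standing in for a table cell and its `number-columns-repeated` attribute. Empty cells are held back and written only when a non-empty cell follows, so a long run of trailing empties (spreadsheets often pad a row out with one heavily repeated empty cell) never reaches the output. Rows are then padded to a common width.

```python
from typing import List, Tuple

EMPTY = ""  # hypothetical stand-in for the reader's empty_value


def expand_row(cells: List[Tuple[object, int]]) -> List[object]:
    """Expand (value, repeat) pairs, dropping trailing empty cells."""
    row: List[object] = []
    pending_empty = 0
    for value, repeat in cells:
        if value == EMPTY:
            # Hold empties back; they are written only if content follows.
            pending_empty += repeat
        else:
            row.extend([EMPTY] * pending_empty)
            pending_empty = 0
            row.extend([value] * repeat)
    return row


def square(rows: List[List[object]]) -> List[List[object]]:
    """Pad every row with EMPTY up to the length of the longest row."""
    width = max((len(r) for r in rows), default=0)
    # Build new lists so repeated rows never share one list object.
    return [r + [EMPTY] * (width - len(r)) for r in rows]


rows = [
    expand_row([("a", 1), (EMPTY, 2), ("b", 1), (EMPTY, 1000)]),
    expand_row([("c", 2)]),
]
table = square(rows)
```

Here the 1000 trailing empties in the first row are discarded, while the two empties between `"a"` and `"b"` are kept because content follows them.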
    def _get_row_repeat(self, row) -> int: | ||
        """Return number of times this row was repeated | ||
        Repeating an empty row appeared to be a common way | ||
        of representing sparse rows in the table. | ||
        """ | ||
        from odf.namespaces import TABLENS | ||
|
||
        return int(row.attributes.get((TABLENS, 'number-rows-repeated'), 1)) | ||
|
||
    def _get_column_repeat(self, cell) -> int: | ||
        from odf.namespaces import TABLENS | ||
        return int(cell.attributes.get( | ||
            (TABLENS, 'number-columns-repeated'), 1)) | ||
|
||
    def _is_empty_row(self, row) -> bool: | ||
        """Helper function to find empty rows | ||
        """ | ||
        for column in row.childNodes: | ||
            if len(column.childNodes) > 0: | ||
                return False | ||
|
||
        return True | ||
|
||
    def _get_cell_value(self, cell, convert_float: bool) -> Scalar: | ||
        from odf.namespaces import OFFICENS | ||
        cell_type = cell.attributes.get((OFFICENS, 'value-type')) | ||
        if cell_type == 'boolean': | ||
            if str(cell) == "TRUE": | ||
                return True | ||
            return False | ||
        if cell_type is None: | ||
            return self.empty_value | ||
        elif cell_type == 'float': | ||
            # GH5394 | ||
            cell_value = float(cell.attributes.get((OFFICENS, 'value'))) | ||
|
||
            if cell_value == 0. and str(cell) != cell_value:  # NA handling | ||
                return str(cell) | ||
|
||
            if convert_float: | ||
                val = int(cell_value) | ||
                if val == cell_value: | ||
                    return val | ||
            return cell_value | ||
        elif cell_type == 'percentage': | ||
            cell_value = cell.attributes.get((OFFICENS, 'value')) | ||
            return float(cell_value) | ||
        elif cell_type == 'string': | ||
            return str(cell) | ||
        elif cell_type == 'currency': | ||
            cell_value = cell.attributes.get((OFFICENS, 'value')) | ||
            return float(cell_value) | ||
        elif cell_type == 'date': | ||
            cell_value = cell.attributes.get((OFFICENS, 'date-value')) | ||
            return pd.to_datetime(cell_value) | ||
        elif cell_type == 'time': | ||
            return pd.to_datetime(str(cell)).time() | ||
        else: | ||
            raise ValueError('Unrecognized type {}'.format(cell_type)) |
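The `convert_float` branch in `_get_cell_value` returns an `int` whenever the stored float is integral. A minimal standalone sketch of that rule follows. The helper name `maybe_int` is hypothetical, and the `math.isfinite` guard is an addition of this sketch (the diff above calls `int(cell_value)` directly, which would raise on `inf` or `nan`).

```python
import math


def maybe_int(value: float, convert_float: bool = True):
    """Return an int when value is integral and conversion is requested."""
    # The isfinite guard is an assumption of this sketch, not part of the PR.
    if convert_float and math.isfinite(value):
        as_int = int(value)
        if as_int == value:
            return as_int
    return value


whole = maybe_int(3.0)                           # integral float becomes an int
fraction = maybe_int(3.5)                        # non-integral float is unchanged
untouched = maybe_int(3.0, convert_float=False)  # conversion switched off
```

With `convert_float=False` the value stays a float, which mirrors how the flag is threaded through `get_sheet_data` into the cell parser.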
Binary file not shown. (20 binary files)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
import functools | ||
|
||
import numpy as np | ||
import pytest | ||
|
||
import pandas as pd | ||
import pandas.util.testing as tm | ||
|
||
pytest.importorskip("odf") | ||
|
||
|
||
@pytest.fixture(autouse=True) | ||
def cd_and_set_engine(monkeypatch, datapath): | ||
    func = functools.partial(pd.read_excel, engine="odf") | ||
    monkeypatch.setattr(pd, 'read_excel', func) | ||
    monkeypatch.chdir(datapath("io", "data")) | ||
|
||
|
||
def test_read_invalid_types_raises(): | ||
    # the invalid_value_type.ods required manual editing | ||
    # of the included content.xml file | ||
    with pytest.raises(ValueError, | ||
                       match="Unrecognized type awesome_new_type"): | ||
        pd.read_excel("invalid_value_type.ods") | ||
|
||
|
||
def test_read_writer_table(): | ||
    # Also test reading tables from a text OpenDocument file | ||
    # (.odt) | ||
    index = pd.Index(["Row 1", "Row 2", "Row 3"], name="Header") | ||
    expected = pd.DataFrame([ | ||
        [1, np.nan, 7], | ||
        [2, np.nan, 8], | ||
        [3, np.nan, 9], | ||
    ], index=index, columns=["Column 1", "Unnamed: 2", "Column 3"]) | ||
|
||
    result = pd.read_excel("writertable.odt", 'Table1', index_col=0) | ||
|
||
    tm.assert_frame_equal(result, expected) |
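The autouse fixture above pins `engine="odf"` by wrapping `pd.read_excel` in `functools.partial`. How that pinning behaves can be shown without pandas or odfpy. In this sketch `read_table` is a hypothetical stand-in for a reader function and `read_ods` is a hypothetical name for the wrapped callable; only `functools.partial` itself comes from the test file.

```python
import functools


def read_table(path, engine=None):
    # Hypothetical stand-in for a reader such as pd.read_excel.
    return (path, engine)


# Pin the keyword once; callers no longer have to pass it.
read_ods = functools.partial(read_table, engine="odf")

default_call = read_ods("a.ods")
override_call = read_ods("a.xls", engine="xlrd")
```

A keyword given at call time still overrides the pinned one, so a test that passes `engine=` explicitly would bypass the default set by the fixture.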
can you add a versionchanged here