Class to read OpenDocument Tables #25427

Merged on Jul 3, 2019 (63 commits)

Commits
479e639
Class to read OpenDocument Tables
detrout Feb 27, 2019
8be4b67
Remove unneeded assignments
detrout Feb 28, 2019
77d9033
Rename filepath_or_stream to filepath_or_buffer
detrout Feb 28, 2019
47b2ffb
Use compat.string_types instead of str
detrout Feb 28, 2019
0fa2ac9
Use pd as name as pandas
detrout Feb 28, 2019
e6e2365
Use single underscore for private functions
detrout Feb 28, 2019
1bbf284
Return an unparsed sheet.
detrout Feb 28, 2019
d5c7ec0
Move ODFReader get_sheet exception testing code to its own function
detrout Feb 28, 2019
691f1e9
Append _raises to end of function name that tests exceptions
detrout Feb 28, 2019
93c2b66
Remove test docstrings that include no useful information
detrout Feb 28, 2019
394c4bd
Indicate likely minimum version.
detrout Feb 28, 2019
b149d84
Convert notes about some OpenDocument tests to comments
detrout Apr 5, 2019
19587b3
Add note about new OpenDocument functionality to whatsnew
detrout Apr 5, 2019
60a5bc1
Sort imports correctly
detrout Apr 7, 2019
1fef008
Use str instead of compat.string_types
detrout Apr 8, 2019
7148995
Remove leading underscore from ODFParser
detrout May 14, 2019
5db1a0b
Remove obsolete class (object)
detrout May 15, 2019
83c0243
Improve docstring text
detrout Jun 14, 2019
735e2b4
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd Jun 28, 2019
8302fd7
Added test_odf
WillAyd Jun 28, 2019
d0df3bd
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd Jun 29, 2019
47597c9
Class naming consistency
WillAyd Jun 29, 2019
9e1799a
Whatsnew linting
WillAyd Jun 29, 2019
d5c60ab
Added optional dependency load
WillAyd Jun 29, 2019
39cfecf
typo
WillAyd Jun 29, 2019
8a9a66c
Updated inheritance to use excel reader interface
WillAyd Jun 29, 2019
fd7663f
Added ods test files
WillAyd Jun 29, 2019
3bcc1b7
Updated tests
WillAyd Jun 29, 2019
15e69eb
convert_float handling
WillAyd Jun 29, 2019
65615cd
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd Jun 30, 2019
9584753
Fixed missing value handling
WillAyd Jun 30, 2019
9dc34f4
Fixed error handling
WillAyd Jun 30, 2019
5e32f6d
Fixed bool handling
WillAyd Jun 30, 2019
6360c07
Skip missing file on master
WillAyd Jun 30, 2019
4227268
datetime compat
WillAyd Jun 30, 2019
80607b0
fixed row repeat
WillAyd Jun 30, 2019
43f7160
multiindex handling
WillAyd Jun 30, 2019
4da0445
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd Jun 30, 2019
cbbc653
Handled horizontally merged cells
WillAyd Jun 30, 2019
1227216
Converted to pytest idiom
WillAyd Jun 30, 2019
696ed5d
Test idiom cleanup
WillAyd Jun 30, 2019
49fff9f
Removed duplicative test files
WillAyd Jun 30, 2019
7b08304
Raised NotImplemented for vertical merging
WillAyd Jun 30, 2019
4d97d84
Table attribute access simplification
WillAyd Jun 30, 2019
59cdf0b
Typing and func cleanups
WillAyd Jun 30, 2019
98d3ca7
lint and isort
WillAyd Jun 30, 2019
fb48d8d
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd Jun 30, 2019
6576af9
typing fixup
WillAyd Jun 30, 2019
4dc1b51
Skip ods files for xlrd
WillAyd Jun 30, 2019
8ce45b4
Removed one-off tests
WillAyd Jul 1, 2019
f9f88b0
Handled defusedxml warnings
WillAyd Jul 1, 2019
3e0d758
Updated assert_warnings funcs to allow DeprecationWarnings
WillAyd Jul 1, 2019
ff28993
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd Jul 1, 2019
7396ad6
Updated to config_init.py
WillAyd Jul 2, 2019
5a440a4
Updated whatsnew
WillAyd Jul 2, 2019
250a3d3
Updated io.rst
WillAyd Jul 2, 2019
d7e7d05
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd Jul 2, 2019
93adedb
Refactored to simplify
WillAyd Jul 2, 2019
62a37e7
Removed unnecessary test
WillAyd Jul 2, 2019
13fb76f
lint fixup
WillAyd Jul 2, 2019
fb6c5ee
mypy error
WillAyd Jul 2, 2019
5c839f4
Merge remote-tracking branch 'upstream/master' into libreoffice-support
WillAyd Jul 2, 2019
4026fc1
Doc updates
WillAyd Jul 2, 2019
1 change: 1 addition & 0 deletions ci/deps/travis-36-cov.yaml
@@ -16,6 +16,7 @@ dependencies:
- nomkl
- numexpr
- numpy=1.15.*
- odfpy
- openpyxl
- pandas-gbq
# https://github.com/pydata/pandas-gbq/issues/271
28 changes: 25 additions & 3 deletions doc/source/user_guide/io.rst
@@ -32,6 +32,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like
text;`HTML <https://en.wikipedia.org/wiki/HTML>`__;:ref:`read_html<io.read_html>`;:ref:`to_html<io.html>`
text; Local clipboard;:ref:`read_clipboard<io.clipboard>`;:ref:`to_clipboard<io.clipboard>`
binary;`MS Excel <https://en.wikipedia.org/wiki/Microsoft_Excel>`__;:ref:`read_excel<io.excel_reader>`;:ref:`to_excel<io.excel_writer>`
binary;`OpenDocument <http://www.opendocumentformat.org>`__;:ref:`read_excel<io.ods>`;
binary;`HDF5 Format <https://support.hdfgroup.org/HDF5/whatishdf5.html>`__;:ref:`read_hdf<io.hdf5>`;:ref:`to_hdf<io.hdf5>`
binary;`Feather Format <https://github.com/wesm/feather>`__;:ref:`read_feather<io.feather>`;:ref:`to_feather<io.feather>`
binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>`
@@ -2779,9 +2780,10 @@ parse HTML tables in the top-level pandas io function ``read_html``.
Excel files
-----------

-The :func:`~pandas.read_excel` method can read Excel 2003 (``.xls``) and
-Excel 2007+ (``.xlsx``) files using the ``xlrd`` Python
-module. The :meth:`~DataFrame.to_excel` instance method is used for
+The :func:`~pandas.read_excel` method can read Excel 2003 (``.xls``)
+files using the ``xlrd`` Python module. Excel 2007+ (``.xlsx``) files
+can be read using either ``xlrd`` or ``openpyxl``.
+The :meth:`~DataFrame.to_excel` instance method is used for
saving a ``DataFrame`` to Excel. Generally the semantics are
similar to working with :ref:`csv<io.read_csv_table>` data.
See the :ref:`cookbook<cookbook.excel>` for some advanced strategies.
@@ -3217,7 +3219,27 @@ The look and feel of Excel worksheets created from pandas can be modified using
* ``float_format`` : Format string for floating point numbers (default ``None``).
* ``freeze_panes`` : A tuple of two integers representing the bottommost row and rightmost column to freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default ``None``).

.. _io.ods:

Review comment (Contributor): can you add a versionchanged here

OpenDocument Spreadsheets
-------------------------

.. versionadded:: 0.25

The :func:`~pandas.read_excel` method can also read OpenDocument spreadsheets
using the ``odfpy`` module. The semantics and features for reading
OpenDocument spreadsheets match what can be done for `Excel files`_ using
``engine='odf'``.

Review comment (Contributor): can you show an example here (code is ok)

.. code-block:: python

# Returns a DataFrame
pd.read_excel('path_to_file.ods', engine='odf')

.. note::

Currently pandas only supports *reading* OpenDocument spreadsheets. Writing
is not implemented.
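
As a rough illustration of the section above, the sketch below picks an engine name from the file extension and forwards it to `pd.read_excel`. The helpers `pick_engine` and `read_spreadsheet` and the extension table are hypothetical and not part of pandas; only the `pd.read_excel(..., engine='odf')` call comes from this PR. pandas (and odfpy for ODF files) is imported lazily, so the mapping itself can be exercised without either installed.

```python
import os

# Hypothetical extension-to-engine table. Only the 'odf' engine name and the
# use of xlrd for Excel files are taken from the docs in this PR.
_ENGINE_BY_EXT = {
    ".ods": "odf",
    ".odt": "odf",
    ".xls": "xlrd",
    ".xlsx": "xlrd",
    ".xlsm": "xlrd",
}


def pick_engine(path):
    """Return the engine name to pass to read_excel for ``path``."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in _ENGINE_BY_EXT:
        raise ValueError("no reader engine known for {!r}".format(ext))
    return _ENGINE_BY_EXT[ext]


def read_spreadsheet(path, **kwargs):
    """Read ``path`` into a DataFrame (needs pandas, plus odfpy for ODF)."""
    import pandas as pd  # imported lazily; not needed by pick_engine

    return pd.read_excel(path, engine=pick_engine(path), **kwargs)
```

With pandas and odfpy installed, `read_spreadsheet('path_to_file.ods')` would be equivalent to the documented call with `engine='odf'`.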

.. _io.clipboard:

1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.25.0.rst
@@ -164,6 +164,7 @@ Other enhancements
- Added new option ``plotting.backend`` to be able to select a plotting backend different than the existing ``matplotlib`` one. Use ``pandas.set_option('plotting.backend', '<backend-module>')`` where ``<backend-module>`` is a library implementing the pandas plotting API (:issue:`14130`)
- :class:`pandas.offsets.BusinessHour` supports multiple opening hours intervals (:issue:`15481`)
- :func:`read_excel` can now use ``openpyxl`` to read Excel files via the ``engine='openpyxl'`` argument. This will become the default in a future release (:issue:`11499`)
- :func:`pandas.io.excel.read_excel` supports reading OpenDocument tables. Specify ``engine='odf'`` to enable. Consult the :ref:`IO User Guide <io.ods>` for more details (:issue:`9070`)

.. _whatsnew_0250.api_breaking:

1 change: 1 addition & 0 deletions pandas/compat/_optional.py
@@ -13,6 +13,7 @@
"lxml.etree": "3.8.0",
"matplotlib": "2.2.2",
"numexpr": "2.6.2",
"odfpy": "1.3.0",
"openpyxl": "2.4.8",
"pandas_gbq": "0.8.0",
"pyarrow": "0.9.0",
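
The version table above feeds pandas' `import_optional_dependency`, which the new reader calls with the module name `odf` (the distribution is published as `odfpy`). A minimal sketch of the underlying idea, assuming nothing about the real implementation: `import_optional` below is a hypothetical, simplified stand-in that only turns a failed import into a more helpful error and skips the minimum-version check.

```python
import importlib


def import_optional(name, extra=""):
    """Import module ``name`` or raise ImportError with an install hint.

    Hypothetical helper for illustration; it is not pandas' own function
    and does not compare installed versions against a minimum.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        message = "Missing optional dependency '{}'. {}".format(name, extra)
        raise ImportError(message.strip()) from None
```

Calling the helper inside `__init__` rather than at module import time keeps the dependency optional: users who never select the engine never need the package.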
9 changes: 9 additions & 0 deletions pandas/core/config_init.py
@@ -422,6 +422,7 @@ def use_inf_as_na_cb(key):
_xls_options = ['xlrd']
_xlsm_options = ['xlrd', 'openpyxl']
_xlsx_options = ['xlrd', 'openpyxl']
_ods_options = ['odf']


with cf.config_prefix("io.excel.xls"):
@@ -447,6 +448,14 @@ def use_inf_as_na_cb(key):
validator=str)


with cf.config_prefix("io.excel.ods"):
    cf.register_option("reader", "auto",
                       reader_engine_doc.format(
                           ext='ods',
                           others=', '.join(_ods_options)),
                       validator=str)


# Set up the io.excel specific writer configuration.
writer_engine_doc = """
: string
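
To make the registration pattern in this hunk concrete, here is a small option registry written from scratch. It is only a sketch: `register_option`, `set_option`, `get_option`, and `is_str` below are hypothetical stand-ins, and the assumption that a validator is a callable that raises on bad input is mine, not something this diff shows about `pandas._config`.

```python
_registry = {}


def register_option(key, default, validator=None):
    """Store ``default`` under ``key`` after validating it."""
    if validator is not None:
        validator(default)  # assumed to raise on invalid input
    _registry[key] = {"value": default, "validator": validator}


def set_option(key, value):
    """Validate and store a new value; KeyError for unregistered keys."""
    entry = _registry[key]
    if entry["validator"] is not None:
        entry["validator"](value)
    entry["value"] = value


def get_option(key):
    """Return the current value registered under ``key``."""
    return _registry[key]["value"]


def is_str(value):
    """Example validator: reject anything that is not a string."""
    if not isinstance(value, str):
        raise ValueError("expected a string, got {!r}".format(value))


# Mirrors the key and default registered in the hunk above.
register_option("io.excel.ods.reader", "auto", validator=is_str)
```

Validating before storing means a rejected `set_option` call leaves the previous value in place.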
4 changes: 3 additions & 1 deletion pandas/io/excel/_base.py
@@ -768,12 +768,14 @@ class ExcelFile:
Acceptable values are None or ``xlrd``.
"""

-    from pandas.io.excel._xlrd import _XlrdReader
+    from pandas.io.excel._odfreader import _ODFReader
     from pandas.io.excel._openpyxl import _OpenpyxlReader
+    from pandas.io.excel._xlrd import _XlrdReader

     _engines = {
         'xlrd': _XlrdReader,
         'openpyxl': _OpenpyxlReader,
+        'odf': _ODFReader,
     }

     def __init__(self, io, engine=None):
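
The `_engines` mapping above is a name-to-class registry. The sketch below shows the general dispatch pattern with placeholder classes; `SpreadsheetFile`, `BaseReader`, `XlrdReader`, and `OdfReader` are hypothetical names, and the fallback to `'xlrd'` when no engine is given is an assumption based on the surrounding docs rather than something visible in this hunk.

```python
class BaseReader:
    """Placeholder reader that only remembers its input."""

    def __init__(self, io):
        self.io = io


class XlrdReader(BaseReader):
    pass


class OdfReader(BaseReader):
    pass


class SpreadsheetFile:
    # Registry mapping an engine name to the class that implements it.
    _engines = {
        "xlrd": XlrdReader,
        "odf": OdfReader,
    }

    def __init__(self, io, engine=None):
        if engine is None:
            engine = "xlrd"  # assumed default for this sketch
        if engine not in self._engines:
            raise ValueError("Unknown engine: {}".format(engine))
        self.engine = engine
        self._reader = self._engines[engine](io)
```

Adding a new format then only requires a new reader class and one more dictionary entry, which is the shape of the change in this hunk.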
176 changes: 176 additions & 0 deletions pandas/io/excel/_odfreader.py
@@ -0,0 +1,176 @@
from typing import List

from pandas.compat._optional import import_optional_dependency

import pandas as pd
from pandas._typing import FilePathOrBuffer, Scalar

from pandas.io.excel._base import _BaseExcelReader


class _ODFReader(_BaseExcelReader):
    """Read tables out of OpenDocument formatted files

    Parameters
    ----------
    filepath_or_buffer: string, path to be parsed or
        an open readable stream.
    """
    def __init__(self, filepath_or_buffer: FilePathOrBuffer):
        import_optional_dependency("odf")
        super().__init__(filepath_or_buffer)

    @property
    def _workbook_class(self):
        from odf.opendocument import OpenDocument
        return OpenDocument

    def load_workbook(self, filepath_or_buffer: FilePathOrBuffer):
        from odf.opendocument import load
        return load(filepath_or_buffer)

    @property
    def empty_value(self) -> str:
        """Property for compat with other readers."""
        return ''

    @property
    def sheet_names(self) -> List[str]:
        """Return a list of sheet names present in the document"""
        from odf.table import Table

        tables = self.book.getElementsByType(Table)
        return [t.getAttribute("name") for t in tables]

    def get_sheet_by_index(self, index: int):
        from odf.table import Table
        tables = self.book.getElementsByType(Table)
        return tables[index]

    def get_sheet_by_name(self, name: str):
        from odf.table import Table

        tables = self.book.getElementsByType(Table)

        for table in tables:
            if table.getAttribute("name") == name:
                return table

        raise ValueError("sheet {name} not found".format(name=name))

    def get_sheet_data(self, sheet, convert_float: bool) -> List[List[Scalar]]:
        """Parse an ODF Table into a list of lists
        """
        from odf.table import CoveredTableCell, TableCell, TableRow

        covered_cell_name = CoveredTableCell().qname
        table_cell_name = TableCell().qname
        cell_names = {covered_cell_name, table_cell_name}

        sheet_rows = sheet.getElementsByType(TableRow)
        empty_rows = 0
        max_row_len = 0

        table = []  # type: List[List[Scalar]]

        for i, sheet_row in enumerate(sheet_rows):
            sheet_cells = [x for x in sheet_row.childNodes
                           if x.qname in cell_names]
            empty_cells = 0
            table_row = []  # type: List[Scalar]

            for j, sheet_cell in enumerate(sheet_cells):
                if sheet_cell.qname == table_cell_name:
                    value = self._get_cell_value(sheet_cell, convert_float)
                else:
                    value = self.empty_value

                column_repeat = self._get_column_repeat(sheet_cell)

                # Queue up empty values, writing only if content succeeds them
                if value == self.empty_value:
                    empty_cells += column_repeat
                else:
                    table_row.extend([self.empty_value] * empty_cells)
                    empty_cells = 0
                    table_row.extend([value] * column_repeat)

            if max_row_len < len(table_row):
                max_row_len = len(table_row)

            row_repeat = self._get_row_repeat(sheet_row)
            if self._is_empty_row(sheet_row):
                empty_rows += row_repeat
            else:
                # add blank rows to our table
                table.extend([[self.empty_value]] * empty_rows)
                empty_rows = 0
                for _ in range(row_repeat):
                    table.append(table_row)

        # Make our table square
        for row in table:
            if len(row) < max_row_len:
                row.extend([self.empty_value] * (max_row_len - len(row)))

        return table

    def _get_row_repeat(self, row) -> int:
        """Return number of times this row was repeated
        Repeating an empty row appeared to be a common way
        of representing sparse rows in the table.
        """
        from odf.namespaces import TABLENS

        return int(row.attributes.get((TABLENS, 'number-rows-repeated'), 1))

    def _get_column_repeat(self, cell) -> int:
        from odf.namespaces import TABLENS
        return int(cell.attributes.get(
            (TABLENS, 'number-columns-repeated'), 1))

    def _is_empty_row(self, row) -> bool:
        """Helper function to find empty rows
        """
        for column in row.childNodes:
            if len(column.childNodes) > 0:
                return False

        return True

    def _get_cell_value(self, cell, convert_float: bool) -> Scalar:
        from odf.namespaces import OFFICENS
        cell_type = cell.attributes.get((OFFICENS, 'value-type'))
        if cell_type == 'boolean':
            if str(cell) == "TRUE":
                return True
            return False
        if cell_type is None:
            return self.empty_value
        elif cell_type == 'float':
            # GH5394
            cell_value = float(cell.attributes.get((OFFICENS, 'value')))

            if cell_value == 0.:  # NA handling
                return str(cell)

            if convert_float:
                val = int(cell_value)
                if val == cell_value:
                    return val
            return cell_value
        elif cell_type == 'percentage':
            cell_value = cell.attributes.get((OFFICENS, 'value'))
            return float(cell_value)
        elif cell_type == 'string':
            return str(cell)
        elif cell_type == 'currency':
            cell_value = cell.attributes.get((OFFICENS, 'value'))
            return float(cell_value)
        elif cell_type == 'date':
            cell_value = cell.attributes.get((OFFICENS, 'date-value'))
            return pd.to_datetime(cell_value)
        elif cell_type == 'time':
            return pd.to_datetime(str(cell)).time()
        else:
            raise ValueError('Unrecognized type {}'.format(cell_type))
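
The core of `get_sheet_data` above is run-length expansion with trailing-blank trimming: repeated cells and rows are expanded, but runs of empties are only written once real content follows them. The sketch below replays that idea on plain `(value, repeat)` tuples instead of odfpy nodes. `expand_cells` and `expand_table` are hypothetical names, and treating a row as empty when it expands to nothing is a simplification of `_is_empty_row`.

```python
EMPTY = ""


def expand_cells(cells):
    """Expand ``(value, repeat)`` pairs, dropping trailing empty cells."""
    row = []
    pending_empty = 0
    for value, repeat in cells:
        if value == EMPTY:
            # Queue empties; they are written only if content follows.
            pending_empty += repeat
        else:
            row.extend([EMPTY] * pending_empty)
            pending_empty = 0
            row.extend([value] * repeat)
    return row


def expand_table(rows):
    """Expand ``(cells, repeat)`` rows and pad them to a common width."""
    table = []
    pending_rows = 0
    for cells, repeat in rows:
        row = expand_cells(cells)
        if not row:
            pending_rows += repeat
            continue
        # Flush queued blank rows, then the repeated content row.
        table.extend([EMPTY] for _ in range(pending_rows))
        pending_rows = 0
        table.extend(list(row) for _ in range(repeat))
    width = max((len(r) for r in table), default=0)
    for r in table:
        r.extend([EMPTY] * (width - len(r)))
    return table
```

Unlike the reader, this sketch builds a fresh list for every output row, so repeated rows do not share one list object. The reader's padding loop is guarded by a length check, so the sharing there does not change the result.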
Binary file added pandas/tests/io/data/blank.ods
Binary file not shown.
Binary file added pandas/tests/io/data/blank_with_header.ods
Binary file not shown.
Binary file added pandas/tests/io/data/invalid_value_type.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test1.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test2.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test3.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test4.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test5.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test_converters.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test_index_name_pre17.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test_multisheet.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test_squeeze.ods
Binary file not shown.
Binary file added pandas/tests/io/data/test_types.ods
Binary file not shown.
Binary file added pandas/tests/io/data/testdateoverflow.ods
Binary file not shown.
Binary file added pandas/tests/io/data/testdtype.ods
Binary file not shown.
Binary file added pandas/tests/io/data/testmultiindex.ods
Binary file not shown.
Binary file added pandas/tests/io/data/testskiprows.ods
Binary file not shown.
Binary file added pandas/tests/io/data/times_1900.ods
Binary file not shown.
Binary file added pandas/tests/io/data/times_1904.ods
Binary file not shown.
Binary file added pandas/tests/io/data/writertable.odt
Binary file not shown.
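
The `convert_float` branch of `_get_cell_value` in `_odfreader.py` above returns an `int` when a numeric cell holds a whole number. A minimal sketch of that rule follows; `convert_number` is a hypothetical helper, and it uses `float.is_integer()` instead of the reader's `int()` round-trip so that `nan` and `inf` pass through unchanged rather than raising.

```python
def convert_number(value, convert_float=True):
    """Parse ``value`` as a float, returning an int for whole numbers."""
    number = float(value)
    # is_integer() is False for nan and inf, so those stay floats.
    if convert_float and number.is_integer():
        return int(number)
    return number
```

Passing `convert_float=False` keeps every numeric cell as a `float`, which avoids mixed integer and float values within one column.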
2 changes: 1 addition & 1 deletion pandas/tests/io/excel/conftest.py
@@ -30,7 +30,7 @@ def df_ref():
     return df_ref


-@pytest.fixture(params=['.xls', '.xlsx', '.xlsm'])
+@pytest.fixture(params=['.xls', '.xlsx', '.xlsm', '.ods'])
 def read_ext(request):
     """
     Valid extensions for reading Excel files.
39 changes: 39 additions & 0 deletions pandas/tests/io/excel/test_odf.py
@@ -0,0 +1,39 @@
import functools

import numpy as np
import pytest

import pandas as pd
import pandas.util.testing as tm

pytest.importorskip("odf")


@pytest.fixture(autouse=True)
def cd_and_set_engine(monkeypatch, datapath):
    func = functools.partial(pd.read_excel, engine="odf")
    monkeypatch.setattr(pd, 'read_excel', func)
    monkeypatch.chdir(datapath("io", "data"))


def test_read_invalid_types_raises():
    # invalid_value_type.ods required manual editing
    # of the included content.xml file
    with pytest.raises(ValueError,
                       match="Unrecognized type awesome_new_type"):
        pd.read_excel("invalid_value_type.ods")


def test_read_writer_table():
    # Also test reading tables from a text OpenDocument file
    # (.odt)
    index = pd.Index(["Row 1", "Row 2", "Row 3"], name="Header")
    expected = pd.DataFrame([
        [1, np.nan, 7],
        [2, np.nan, 8],
        [3, np.nan, 9],
    ], index=index, columns=["Column 1", "Unnamed: 2", "Column 3"])

    result = pd.read_excel("writertable.odt", 'Table1', index_col=0)

    tm.assert_frame_equal(result, expected)