ENH20521 Added metadata argument to DataFrame.to_parquet #20534
Conversation
Not being overly familiar with this format, I'll admit I don't quite fully understand the purpose of this. Given we just went through a huge documentation sprint, this seems like something that could be documented with an example in the to_parquet docstring.
pandas/io/parquet.py
Outdated
    custom_metadata = kwargs.pop('metadata', {})
    if custom_metadata:
        if 'pandas' in custom_metadata:
            warn(
There should also be a test that asserts this warning actually happens
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - will add.
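The warning path is small enough to sketch standalone. Below is a minimal, hypothetical stand-in for the PR's merge-and-warn logic (`merge_metadata` is an illustrative helper, not pandas API), together with the kind of assertion the requested test would make:

```python
import warnings

def merge_metadata(custom_metadata, existing=None):
    # Illustrative helper, not pandas API: warn when the caller overrides
    # the 'pandas' key that pandas itself writes into the file metadata.
    existing = dict(existing or {})
    if 'pandas' in custom_metadata:
        warnings.warn("'pandas' key in metadata overwrites the default "
                      "pandas metadata", UserWarning)
    existing.update(custom_metadata)
    return existing

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    merged = merge_metadata({'pandas': 'custom'}, {'pandas': 'default'})

assert merged == {'pandas': 'custom'}
assert any(issubclass(w.category, UserWarning) for w in caught)
```

In the real test suite this would use `tm.assert_produces_warning` or `pytest.warns` around a `to_parquet` call rather than a standalone helper.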
pandas/io/parquet.py
Outdated
    else:
        table = self.api.Table.from_pandas(df)
    custom_metadata = kwargs.pop('metadata', {})
Is there a reason why you have this tucked away in kwargs rather than creating it as an optional named argument?
OK - I will change.
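For comparison, a sketch of the suggested change: an explicit keyword argument is visible in `help()` and `inspect.signature`, while a key popped from `**kwargs` is not. The signature below is illustrative only, not the actual pandas one:

```python
def write(df, path, metadata=None, **kwargs):
    # Explicit keyword: discoverable, documented, and default-checked,
    # unlike kwargs.pop('metadata', {}).
    custom_metadata = metadata if metadata is not None else {}
    # ... the actual write step would go here; we just return the
    # resolved metadata so the behaviour is easy to demonstrate.
    return custom_metadata

assert write(None, 'out.parquet') == {}
assert write(None, 'out.parquet', metadata={'creator': 'me'}) == {'creator': 'me'}
```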
pandas/tests/io/test_parquet.py
Outdated
@@ -437,6 +437,27 @@ def test_s3_roundtrip(self, df_compat, s3_resource, pa):
        check_round_trip(df_compat, pa,
                         path='s3://pandas-test/pyarrow.parquet')

    @pytest.mark.xfail(
        is_platform_windows() or is_platform_mac(),
This would cut out a pretty large user base - is this intentional to not support either of these platforms?
Not really. This feature is useful mostly when you have hundreds of large files.
Such deployments are almost 100% Linux; I've never seen one on Windows or Mac.
I added it because a couple of lines above you have
@pytest.mark.xfail(is_platform_windows() or is_platform_mac(),
                   reason="reading pa metadata failing on Windows/mac")
def test_cross_engine_pa_fp(df_cross_compat, pa, fp):
That marker suggests the feature is less mature on Windows/Mac.
I do not have Windows or Mac, so I cannot test or check it.
Anyone who needs it on Windows/Mac could validate and correct it.
Our CI runs on Mac, Linux, and Windows.
@JacekPliszka we have a test of cross-compatibility between fastparquet and pyarrow; that is a very specific instance and is not applicable to this PR.
Could you add API docs?
Could you add a fastparquet implementation?
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -345,6 +345,7 @@ Other Enhancements
- :meth:`DataFrame.to_sql` now performs a multivalue insert if the underlying connection supports it, rather than inserting row by row.
  ``SQLAlchemy`` dialects supporting multivalue inserts include: ``mysql``, ``postgresql``, ``sqlite`` and any dialect with ``supports_multivalues_insert``. (:issue:`14315`, :issue:`8953`)
- :func:`read_html` now accepts a ``displayed_only`` keyword argument to control whether or not hidden elements are parsed (``True`` by default) (:issue:`20027`)
- :func:`DataFrame.to_parquet` now accepts a ``metadata`` keyword argument. The object passed updates the key-value file metadata generated by pandas; if the ``pandas`` key is present, the default pandas value is overwritten and a warning is issued. The default value ``None`` means the standard pandas metadata is used. (:issue:`20521`)
This should be a short highlight with a link to the API docs or a new section in io.rst
if necessary.
Do you mean this should be shorter, with the sentence moved to the API docs?
Yeah, a link to the API documentation is probably best here.
pandas/tests/io/test_parquet.py
Outdated
        is_platform_windows() or is_platform_mac(),
        reason="reading pa metadata failing on Windows/mac"
    )
    def test_custom_metadata(self, pa_ge_070, df_full):
Add a comment about why pyarrow>=0.7.0 is required (is it?)
Not sure. It might work with 0.5.0.
Do you really think it is worth testing?
pyarrow is already at 0.9.0, and people should not use versions older than 0.7.0.
In my opinion, newer versions of pandas should require at least 0.7.0;
the older versions are not worth testing.
It's worth seeing if it works.
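If the test does need to be conditioned on the installed pyarrow version, a minimal version-gate sketch looks like the following (pandas has its own version-comparison helpers; the function below is illustrative only):

```python
def at_least(version, minimum):
    # Compare dotted version strings as integer tuples, so that
    # '0.10.0' correctly sorts after '0.9.0' (string comparison would not).
    parse = lambda v: tuple(int(p) for p in v.split('.'))
    return parse(version) >= parse(minimum)

assert at_least('0.9.0', '0.7.0')
assert at_least('0.10.0', '0.9.0')
assert not at_least('0.5.0', '0.7.0')
```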
pandas/io/parquet.py
Outdated
    custom_metadata = dict(
        table.schema.metadata or {},
        **custom_metadata
    )
Summing up: the pyarrow/pandas handling of file-level parquet metadata is ugly and should at some point be rewritten, but that looks like a larger task than I can afford to do. The simple hack above can be implemented by anyone interested on their own, instead of df.to_parquet.
It should be usable enough until a proper solution with access to all the metadata in a parquet file is developed.
@JacekPliszka coming back to this: can you be a bit more specific about what is missing in pyarrow (or pandas) for proper handling of parquet metadata?
Is what you are looking for the ability to specify additional metadata in the
It's been a couple of months since I looked at it, but parquet allows metadata at several levels: the full file, the column, and the page header. It would be nice to have read and write access to at least the first two. First, some kind of convention/format needs to be set for how to separate library/pandas metadata from user-controlled metadata. There is also the issue of parquet datasets: each individual file has its own metadata, while the files are read as a single pandas dataframe.
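The dataset point can be illustrated without pyarrow: each file in a dataset carries its own file-level metadata, so some policy is needed when the files are read back as one frame. One hypothetical policy, keeping only the keys on which every file agrees:

```python
# Hypothetical per-file metadata for a two-file dataset (values made up).
per_file = [
    {'source': 'etl', 'part': '1'},
    {'source': 'etl', 'part': '2'},
]

# Keep only keys whose value is identical across all files; keys that
# differ per file (like 'part') have no single value for the combined frame.
common = {k: v for k, v in per_file[0].items()
          if all(f.get(k) == v for f in per_file[1:])}

assert common == {'source': 'etl'}
```

Other policies (keep per-file values under namespaced keys, or take the first file's metadata) are equally possible, which is exactly why a convention would need to be agreed first.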
The argument allows custom file metadata to update the default metadata written by pandas.
Closes #20521
Checklist for other PRs (remove this part if you are doing a PR for the pandas documentation sprint):
git diff upstream/master -u -- "*.py" | flake8 --diff
Checklist from comments: