ENH20521 Added metadata argument to DataFrame.to_parquet #20534
Conversation
Not being overly familiar with this format, I'll admit I don't quite fully understand the purpose of this. Given we just went through a huge documentation sprint, this seems like something that could be documented with an example in the to_parquet docstring.
pandas/io/parquet.py
Outdated
    custom_metadata = kwargs.pop('metadata', {})
    if custom_metadata:
        if 'pandas' in custom_metadata:
            warn(
There should also be a test that asserts this warning actually happens
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - will add.
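The warning path is small enough to sketch standalone. Below is a minimal, hypothetical stand-in for the PR's merge-and-warn logic (`merge_metadata` is an illustrative helper, not pandas API), together with the kind of assertion the requested test would make:

```python
import warnings

def merge_metadata(custom_metadata, existing=None):
    # Illustrative helper, not pandas API: warn when the caller overrides
    # the 'pandas' key that pandas itself writes into the file metadata.
    existing = dict(existing or {})
    if 'pandas' in custom_metadata:
        warnings.warn("'pandas' key in metadata overwrites the default "
                      "pandas metadata", UserWarning)
    existing.update(custom_metadata)
    return existing

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    merged = merge_metadata({'pandas': 'custom'}, {'pandas': 'default'})

assert merged == {'pandas': 'custom'}
assert any(issubclass(w.category, UserWarning) for w in caught)
```

In the real test suite this would use `tm.assert_produces_warning` or `pytest.warns` around a `to_parquet` call rather than a standalone helper.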
pandas/io/parquet.py
Outdated
    else:
        table = self.api.Table.from_pandas(df)
    custom_metadata = kwargs.pop('metadata', {})
Is there a reason why you have this tucked away in kwargs rather than creating it as an optional named argument?
OK - I will change.
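For comparison, a sketch of the suggested change: an explicit keyword argument is visible in `help()` and `inspect.signature`, while a key popped from `**kwargs` is not. The signature below is illustrative only, not the actual pandas one:

```python
def write(df, path, metadata=None, **kwargs):
    # Explicit keyword: discoverable, documented, and default-checked,
    # unlike kwargs.pop('metadata', {}).
    custom_metadata = metadata if metadata is not None else {}
    # ... the actual write step would go here; we just return the
    # resolved metadata so the behaviour is easy to demonstrate.
    return custom_metadata

assert write(None, 'out.parquet') == {}
assert write(None, 'out.parquet', metadata={'creator': 'me'}) == {'creator': 'me'}
```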
pandas/tests/io/test_parquet.py
Outdated
@@ -437,6 +437,27 @@ def test_s3_roundtrip(self, df_compat, s3_resource, pa):
        check_round_trip(df_compat, pa,
                         path='s3://pandas-test/pyarrow.parquet')

    @pytest.mark.xfail(
        is_platform_windows() or is_platform_mac(),
This would cut out a pretty large user base - is this intentional to not support either of these platforms?
Not really. This feature is useful mostly when you have hundreds of large files.
Such deployments are almost 100% Linux; I've never seen one on Windows or Mac.
I added it because a couple of lines above you have
@pytest.mark.xfail(is_platform_windows() or is_platform_mac(),
                   reason="reading pa metadata failing on Windows/mac")
def test_cross_engine_pa_fp(df_cross_compat, pa, fp):
That marker suggests the feature is less mature on Windows/Mac.
I do not have Windows or Mac, so I cannot test or check it.
Anyone who needs it on Windows/Mac could validate and correct it.
Our CI runs on Mac, Linux, and Windows.
@JacekPliszka we have a test of cross-compatibility between fastparquet and pyarrow; that is a very specific instance and is not applicable to this PR.
Could you add API docs?
Could you add a fastparquet implementation?
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -345,6 +345,7 @@ Other Enhancements
- :meth:`DataFrame.to_sql` now performs a multivalue insert if the underlying connection supports it, rather than inserting row by row.
  ``SQLAlchemy`` dialects supporting multivalue inserts include: ``mysql``, ``postgresql``, ``sqlite`` and any dialect with ``supports_multivalues_insert``. (:issue:`14315`, :issue:`8953`)
- :func:`read_html` now accepts a ``displayed_only`` keyword argument to control whether or not hidden elements are parsed (``True`` by default) (:issue:`20027`)
- :func:`DataFrame.to_parquet` now accepts a ``metadata`` keyword argument. The object passed updates the key-value file metadata generated by pandas; if the ``pandas`` key is present, the default pandas value is overwritten and a warning is issued. The default value ``None`` means the standard pandas metadata is used. (:issue:`20521`)
This should be a short highlight with a link to the API docs or a new section in io.rst
if necessary.
Do you mean this should be shorter, with the sentence moved to the API docs?
Yeah, a link to the API documentation is probably best here.
pandas/tests/io/test_parquet.py
Outdated
        is_platform_windows() or is_platform_mac(),
        reason="reading pa metadata failing on Windows/mac"
    )
    def test_custom_metadata(self, pa_ge_070, df_full):
Add a comment about why pyarrow>=0.7.0 is required (is it?)
Not sure. It might work with 0.5.0.
Do you really think it is worth testing?
pyarrow is already at 0.9.0, and people should not use versions older than 0.7.0.
In my opinion, newer versions of pandas should require at least 0.7.0;
the older versions are not worth testing.
It's worth seeing if it works.
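If the test does need to be conditioned on the installed pyarrow version, a minimal version-gate sketch looks like the following (pandas has its own version-comparison helpers; the function below is illustrative only):

```python
def at_least(version, minimum):
    # Compare dotted version strings as integer tuples, so that
    # '0.10.0' correctly sorts after '0.9.0' (string comparison would not).
    parse = lambda v: tuple(int(p) for p in v.split('.'))
    return parse(version) >= parse(minimum)

assert at_least('0.9.0', '0.7.0')
assert at_least('0.10.0', '0.9.0')
assert not at_least('0.5.0', '0.7.0')
```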
pandas/io/parquet.py
Outdated
    custom_metadata = dict(
        table.schema.metadata or {},
        **custom_metadata
    )
Summing up: the pyarrow/pandas handling of file-level parquet metadata is ugly and should at some point be rewritten, but that looks like a larger task than I can afford to do. The simple hack above can be implemented by anyone interested on their own, instead of df.to_parquet.
It should be usable enough until a proper solution with access to all the metadata in a parquet file is developed.
@JacekPliszka coming back to this: can you be a bit more specific about what is missing in pyarrow (or pandas) for proper handling of parquet metadata?
Is what you are looking for the ability to specify additional metadata in the
It's been a couple of months since I looked at it, but parquet allows metadata at several levels: the full file, the column, and the page header. It would be nice to have read and write access to at least the first two. First, some kind of convention/format needs to be set for how to separate library/pandas metadata from user-controlled metadata. There is also the issue of parquet datasets: each individual file has its own metadata, while the files are read as a single pandas dataframe.
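The dataset point can be illustrated without pyarrow: each file in a dataset carries its own file-level metadata, so some policy is needed when the files are read back as one frame. One hypothetical policy, keeping only the keys on which every file agrees:

```python
# Hypothetical per-file metadata for a two-file dataset (values made up).
per_file = [
    {'source': 'etl', 'part': '1'},
    {'source': 'etl', 'part': '2'},
]

# Keep only keys whose value is identical across all files; keys that
# differ per file (like 'part') have no single value for the combined frame.
common = {k: v for k, v in per_file[0].items()
          if all(f.get(k) == v for f in per_file[1:])}

assert common == {'source': 'etl'}
```

Other policies (keep per-file values under namespaced keys, or take the first file's metadata) are equally possible, which is exactly why a convention would need to be agreed first.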
The argument allows custom file metadata to update the default metadata written by pandas.
Closes #20521
Checklist for other PRs (remove this part if you are doing a PR for the pandas documentation sprint):
git diff upstream/master -u -- "*.py" | flake8 --diff
Checklist from comments: