
ARROW-3903: [Python] Random array generator for Arrow conversion and Parquet testing #3301


Closed
kszucs wants to merge 4 commits into apache/arrow from kszucs/ARROW-3903

Conversation

kszucs
Member

@kszucs kszucs commented Jan 3, 2019

Generate random schemas, arrays, chunked_arrays, columns, record_batches and tables.
Slow, but it makes it quite easy to isolate corner cases (jira issues have already been created for some of them). In follow-up PRs we should use these strategies to increase test coverage, which should help us reduce the number of issues. We could even use them to periodically generate benchmark datasets (only if we persist them somewhere).

Example usage:

Run 10 samples (dev profile):
`pytest -sv pyarrow/tests/test_strategies.py::test_tables --enable-hypothesis --hypothesis-show-statistics --hypothesis-profile=dev`

Print the generated examples (debug):
`pytest -sv pyarrow/tests/test_strategies.py::test_schemas --enable-hypothesis --hypothesis-show-statistics --hypothesis-profile=debug`
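
For context, a minimal sketch of what such a Hypothesis strategy can look like (illustrative only; the names and the narrow type coverage below are simplified assumptions, while the real strategies live in `pyarrow.tests.strategies` and also cover chunked_arrays, columns, record_batches and tables):

```python
import hypothesis.strategies as st
import pyarrow as pa
from hypothesis import given


@st.composite
def arrays(draw, type=None):
    # Accept a concrete pyarrow type, or draw one from a small pool.
    if type is None:
        type = draw(st.sampled_from([pa.int64(), pa.float64(), pa.string()]))
    # Pick a value strategy matching the drawn type.
    if pa.types.is_integer(type):
        values = st.integers(min_value=-2**31, max_value=2**31 - 1)
    elif pa.types.is_floating(type):
        values = st.floats(allow_nan=False, allow_infinity=False)
    else:
        values = st.text()
    return pa.array(draw(st.lists(values)), type=type)


@given(arrays())
def test_array_roundtrip(arr):
    # Every generated example exercises a different type/length combination.
    assert arr.equals(arr)
```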

@kszucs kszucs added the WIP PR is work in progress label Jan 3, 2019
@kszucs kszucs removed the WIP PR is work in progress label Jan 30, 2019
if isinstance(type, st.SearchStrategy):
type = draw(type)

# TODO(kszucs): remove it, field metadata is not kept
Member Author


The type equality check fails at https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L297
We should probably use .equals(check_metadata=False) and find out why the two metadata are different.

I didn't file a jira issue because I couldn't create a reproducible example (the metadata is not displayed). However, commenting out the assume line reproduces the issue.
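
A rough sketch of the suggested workaround (assuming a field-level metadata mismatch is what trips the equality check; the schemas below are made up for illustration):

```python
import pyarrow as pa

# Two schemas that agree on names and types but differ in field metadata.
a = pa.schema([pa.field('x', pa.int64(), metadata={b'key': b'value'})])
b = pa.schema([pa.field('x', pa.int64())])

assert not a.equals(b, check_metadata=True)   # fails because of the metadata
assert a.equals(b, check_metadata=False)      # passes once metadata is ignored
```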

@kszucs kszucs requested review from wesm and xhochy January 30, 2019 20:21
Member

@xhochy xhochy left a comment


+1, LGTM.

@fjetter you will like this!

@@ -1155,9 +1155,9 @@ cdef class Table(_PandasConvertible):

Parameters
----------
arrays: list of pyarrow.Array or pyarrow.Column
arrays : list of pyarrow.Array or pyarrow.Column
Member


This is actually not needed in the latest numpydoc spec. But for docs improvements, we could probably build on pandas' work one day: pandas-dev/pandas#22408

@@ -32,6 +34,7 @@
pickle5 = None

import pyarrow as pa
import pyarrow.tests.strategies as past
Member


Nice word pun 😂

@kszucs kszucs removed the request for review from wesm February 7, 2019 12:35
@kszucs kszucs closed this in f957b5b Feb 7, 2019
xhochy pushed a commit that referenced this pull request Feb 8, 2019
…Parquet testing

Generate random schemas, arrays, chunked_arrays, columns, record_batches and tables.
Slow, but it makes it quite easy to isolate corner cases (jira issues have already been created for some of them). In follow-up PRs we should use these strategies to increase test coverage, which should help us reduce the number of issues. We could even use them to periodically generate benchmark datasets (only if we persist them somewhere).

Example usage:

Run 10 samples (dev profile):
`pytest -sv pyarrow/tests/test_strategies.py::test_tables --enable-hypothesis --hypothesis-show-statistics --hypothesis-profile=dev`

Print the generated examples (debug):
`pytest -sv pyarrow/tests/test_strategies.py::test_schemas --enable-hypothesis --hypothesis-show-statistics --hypothesis-profile=debug`

Author: Krisztián Szűcs <[email protected]>

Closes #3301 from kszucs/ARROW-3903 and squashes the following commits:

ff6654c <Krisztián Szűcs> finalize
8b5e7ea <Krisztián Szűcs> rat
61fe01d <Krisztián Szűcs> strategies for chunked_arrays, columns, record batches; test the strategies themselves
bdb63df <Krisztián Szűcs> hypothesis array strategy
trxcllnt pushed a commit to trxcllnt/arrow that referenced this pull request Feb 12, 2019