ARROW-3903: [Python] Random array generator for Arrow conversion and Parquet testing #3301
    if isinstance(type, st.SearchStrategy):
        type = draw(type)

    # TODO(kszucs): remove it, field metadata is not kept
The type equality check fails at https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L297
We should probably use .equals(check_metadata=False) and find out why the two sets of metadata differ.
I didn't file a JIRA issue because I couldn't create a reproducible example: the metadata is not displayed. However, commenting out the assume line reproduces the issue.
+1, LGTM.
@fjetter you will like this!
@@ -1155,9 +1155,9 @@ cdef class Table(_PandasConvertible):

         Parameters
         ----------
-        arrays: list of pyarrow.Array or pyarrow.Column
+        arrays : list of pyarrow.Array or pyarrow.Column
This is actually not needed in the latest numpydoc spec. But for docs improvement, we could probably build one day on pandas' work: pandas-dev/pandas#22408
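For readers unfamiliar with the two spellings in the diff, the sketch below shows a numpydoc-style `Parameters` section and a small tolerant matcher that accepts both `name: type` and `name : type`. The function `from_arrays` here is a hypothetical stand-in that only carries a docstring, and `parameter_types` is an illustrative helper, not numpydoc's own parser.

```python
import re


def from_arrays(arrays, names=None):
    """Hypothetical stand-in that only carries a numpydoc-style docstring.

    Parameters
    ----------
    arrays : list of pyarrow.Array or pyarrow.Column
        Equal-length arrays that should form the table.
    names : list of str, optional
        Names for the table columns.
    """
    return arrays


# Matches "name : type" as well as "name: type". A description line that
# happens to start with "word:" would also match; this is only a sketch.
PARAM_LINE = re.compile(r"^\s*(\w+)\s*:\s*(.+)$")


def parameter_types(docstring):
    """Return {name: type} for lines that look like parameter entries."""
    found = {}
    for line in docstring.splitlines():
        match = PARAM_LINE.match(line)
        if match:
            found[match.group(1)] = match.group(2).strip()
    return found


print(parameter_types(from_arrays.__doc__))
```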
@@ -32,6 +34,7 @@
     pickle5 = None

 import pyarrow as pa
+import pyarrow.tests.strategies as past
Nice word pun 😂
…Parquet testing

Generate random schemas, arrays, chunked_arrays, columns, record_batches and tables.

Slow, but makes it quite easy to isolate corner cases (JIRA issues already created). In follow-up PRs we should use these strategies to increase coverage. That should help reduce the number of issues. We could even use it to generate benchmark datasets periodically (only if we persist them somewhere).

Example usage:

Run 10 samples (dev profile):
`pytest -sv pyarrow/tests/test_strategies.py::test_tables --enable-hypothesis --hypothesis-show-statistics --hypothesis-profile=dev`

Print the generated examples (debug):
`pytest -sv pyarrow/tests/test_strategies.py::test_schemas --enable-hypothesis --hypothesis-show-statistics --hypothesis-profile=debug`

Author: Krisztián Szűcs <[email protected]>

Closes #3301 from kszucs/ARROW-3903 and squashes the following commits:

ff6654c <Krisztián Szűcs> finalize
8b5e7ea <Krisztián Szűcs> rat
61fe01d <Krisztián Szűcs> strategies for chunked_arrays, columns, record batches; test the strategies themselves
bdb63df <Krisztián Szűcs> hypothesis array strategy
Generate random schemas, arrays, chunked_arrays, columns, record_batches and tables.
Slow, but makes it quite easy to isolate corner cases (JIRA issues already created). In follow-up PRs we should use these strategies to increase coverage. That should help reduce the number of issues. We could even use it to generate benchmark datasets periodically (only if we persist them somewhere).
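As a rough illustration of how such strategies can be built, the sketch below uses Hypothesis's `@st.composite` with the same "accept a concrete value or a strategy" check that appears in the reviewed diff. It deliberately uses plain Python lists instead of Arrow types; `int_lists` and the two check functions are hypothetical names, not part of `pyarrow.tests.strategies`.

```python
import hypothesis.strategies as st
from hypothesis import given, settings


@st.composite
def int_lists(draw, size=st.integers(min_value=0, max_value=5)):
    # Same idea as in the diff: the argument may be a fixed value or a
    # strategy; a strategy is resolved to a concrete value by drawing from it.
    if isinstance(size, st.SearchStrategy):
        size = draw(size)
    return draw(st.lists(st.integers(), min_size=size, max_size=size))


@settings(max_examples=10)
@given(int_lists(size=3))
def check_fixed_size(values):
    # A concrete size is used as-is.
    assert len(values) == 3


@settings(max_examples=10)
@given(int_lists())
def check_drawn_size(values):
    # The default size is itself drawn from a strategy bounded by 0 and 5.
    assert 0 <= len(values) <= 5
```

Calling `check_fixed_size()` or `check_drawn_size()` runs the property against ten generated examples each.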
Example usage:
Run 10 samples (dev profile):
pytest -sv pyarrow/tests/test_strategies.py::test_tables --enable-hypothesis --hypothesis-show-statistics --hypothesis-profile=dev
Print the generated examples (debug):
pytest -sv pyarrow/tests/test_strategies.py::test_schemas --enable-hypothesis --hypothesis-show-statistics --hypothesis-profile=debug
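The `dev` and `debug` names passed to `--hypothesis-profile` refer to Hypothesis settings profiles, which have to be registered before they can be selected. A minimal sketch of such a registration follows; the values are assumptions chosen to match the descriptions above ("10 samples", "print the generated examples"), and the project's actual configuration may differ. The `--enable-hypothesis` flag appears to be specific to this project's test configuration rather than an option provided by Hypothesis itself.

```python
from hypothesis import Verbosity, settings

# Assumed profile definitions, typically placed in a conftest.py.
# "dev": a small number of examples for quick local runs.
settings.register_profile("dev", max_examples=10)

# "debug": same size, but verbose so each generated example is printed.
settings.register_profile("debug", max_examples=10, verbosity=Verbosity.verbose)

# Selecting a profile in code; on the command line this is what
# --hypothesis-profile=<name> does through the Hypothesis pytest plugin.
settings.load_profile("dev")
```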