CI: Fastparquet release broke ci #41366

Closed
phofl opened this issue May 7, 2021 · 9 comments · Fixed by dask/fastparquet#600, #41443 or dask/fastparquet#608
Labels: CI, Dependencies, Upstream issue

@phofl
Member

phofl commented May 7, 2021

It looks like the fastparquet release broke our CI.

Successful build:
https://github.com/pandas-dev/pandas/runs/2522988276

Erroneous build:
https://github.com/pandas-dev/pandas/runs/2524444950

This is the full version diff:


     packages     version_error_build          version_successful_build
38   boto3        1.17.68                      1.17.67
59   cramjam      2.3.0                        NaN
79   fastparquet  0.6.0                        0.5.0
228  pandas       1.3.0.dev0+1562.g6ea277f241  1.3.0.dev0+1561.gebf3b98596
316  sqlalchemy   1.4.14                       1.4.13
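
A diff like the one above can be sketched with plain dictionaries. This is only an illustration, assuming each build's environment has already been collected into a `{package: version}` mapping; `diff_versions` is a hypothetical helper, not something from the pandas CI tooling, and `unchanged-pkg` is a made-up placeholder entry that shows identical versions being filtered out.

```python
def diff_versions(error_build: dict, successful_build: dict) -> dict:
    """Return {package: (error_version, successful_version)} for every
    package whose version differs between the two builds.

    None marks a package that is missing from one of the builds.
    """
    packages = sorted(set(error_build) | set(successful_build))
    return {
        name: (error_build.get(name), successful_build.get(name))
        for name in packages
        if error_build.get(name) != successful_build.get(name)
    }


# Versions taken from the table in this issue; "unchanged-pkg" is a
# hypothetical placeholder for a package that did not change.
error_build = {
    "boto3": "1.17.68",
    "cramjam": "2.3.0",
    "fastparquet": "0.6.0",
    "sqlalchemy": "1.4.14",
    "unchanged-pkg": "1.0",
}
successful_build = {
    "boto3": "1.17.67",
    "fastparquet": "0.5.0",
    "sqlalchemy": "1.4.13",
    "unchanged-pkg": "1.0",
}

for name, (bad, good) in diff_versions(error_build, successful_build).items():
    print(f"{name:12} {bad!s:10} {good!s:10}")
```

A package that shows up in only one build (here cramjam, which arrived with the new fastparquet) appears with `None` on the other side, much like the `NaN` in the table.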
@phofl added the Bug, Needs Triage, CI, and Upstream issue labels and removed the Needs Triage label on May 7, 2021
@TomAugspurger
Contributor

cc @martindurant.

Is it just the doc build that failed? Can you point to a specific example that's causing the failure?

@phofl
Member Author

phofl commented May 7, 2021

It looks like

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": list("abc"),
        "b": list(range(1, 4)),
        "c": np.arange(3, 6).astype("u1"),
        "d": np.arange(4.0, 7.0, dtype="float64"),
        "e": [True, False, True],
        "f": pd.date_range("20130101", periods=3),
        "g": pd.date_range("20130101", periods=3, tz="US/Eastern"),
        "h": pd.Categorical(list("abc")),
        "i": pd.Categorical(list("abc"), ordered=True),
    }
)

df.to_parquet("example_fp.parquet", engine="fastparquet")
result = pd.read_parquet("example_fp.parquet", engine="fastparquet")

this is causing the failure, but I cannot reproduce it locally, and the IPython errors are hard to read.

Additionally, a few tests are failing in our Windows CI with NumPy 1.18:

FAILED pandas/tests/io/test_fsspec.py::test_s3_parquet - TypeError: Cannot co...
FAILED pandas/tests/io/test_parquet.py::test_cross_engine_pa_fp - PermissionE...
FAILED pandas/tests/io/test_parquet.py::TestParquetFastParquet::test_basic - ...
FAILED pandas/tests/io/test_parquet.py::TestParquetFastParquet::test_timezone_aware_index[timezone_aware_date_list1]
FAILED pandas/tests/io/test_parquet.py::TestParquetFastParquet::test_timezone_aware_index[timezone_aware_date_list2]
FAILED pandas/tests/io/test_parquet.py::TestParquetFastParquet::test_timezone_aware_index[timezone_aware_date_list3]
FAILED pandas/tests/io/test_parquet.py::TestParquetFastParquet::test_timezone_aware_index[timezone_aware_date_list4]
FAILED pandas/tests/io/test_parquet.py::TestParquetFastParquet::test_timezone_aware_index[timezone_aware_date_list5]
FAILED pandas/tests/io/test_parquet.py::TestParquetFastParquet::test_timezone_aware_index[timezone_aware_date_list6]

For example, here: https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=59379&view=logs&j=404760ec-14d3-5d48-e580-13034792878f&t=f81e4cc8-d61a-5fb8-36be-36768e5c561a&l=59

@phofl added the Dependencies label on May 7, 2021
@martindurant
Contributor

The error seems to be in fastparquet.df.empty, where the main (only?) change was dask/fastparquet#571 (cc @jbrockmendel). fastparquet is passing against released pandas (1.2.4).

@jbrockmendel
Member

I'll address the .empty issue. Some of the other failures look pytz-related?

@martindurant
Contributor

Can you post the traceback of a remaining error, please? I find the log hard to parse.

@phofl
Member Author

phofl commented May 7, 2021

Traceback:

    def test_timezone_aware_index(self, fp, timezone_aware_date_list):
        idx = 5 * [timezone_aware_date_list]
    
        df = pd.DataFrame(index=idx, data={"index_as_col": idx})
    
        expected = df.copy()
        expected.index.name = "index"
>       check_round_trip(df, fp, expected=expected)

pandas/tests/io/test_parquet.py:1065: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/io/test_parquet.py:219: in check_round_trip
    compare(repeat)
pandas/tests/io/test_parquet.py:207: in compare
    actual = read_parquet(path, **read_kwargs)
pandas/io/parquet.py:497: in read_parquet
    return impl.read(
pandas/io/parquet.py:341: in read
    parquet_file = self.api.ParquetFile(path, **parquet_kwargs)
../../anaconda3/envs/omi_reports/pandas-dev-test/lib/python3.8/site-packages/fastparquet/api.py:97: in __init__
    self._parse_header(fn, verify)
../../anaconda3/envs/omi_reports/pandas-dev-test/lib/python3.8/site-packages/fastparquet/api.py:142: in _parse_header
    self._set_attrs()
../../anaconda3/envs/omi_reports/pandas-dev-test/lib/python3.8/site-packages/fastparquet/api.py:155: in _set_attrs
    self._dtypes()
../../anaconda3/envs/omi_reports/pandas-dev-test/lib/python3.8/site-packages/fastparquet/api.py:530: in _dtypes
    z = pytz.timezone(f"Etc/GMT{z}")

>               raise UnknownTimeZoneError(zone)
E               pytz.exceptions.UnknownTimeZoneError: 'Etc/GMTUTC'

Is this helpful?

Edit:

Here are the fixtures:

import datetime

import pytest


@pytest.fixture(
    params=[
        datetime.datetime.now(datetime.timezone.utc),
        datetime.datetime.now(datetime.timezone.min),
        datetime.datetime.now(datetime.timezone.max),
        datetime.datetime.strptime("2019-01-04T16:41:24+0200", "%Y-%m-%dT%H:%M:%S%z"),
        datetime.datetime.strptime("2019-01-04T16:41:24+0215", "%Y-%m-%dT%H:%M:%S%z"),
        datetime.datetime.strptime("2019-01-04T16:41:24-0200", "%Y-%m-%dT%H:%M:%S%z"),
        datetime.datetime.strptime("2019-01-04T16:41:24-0215", "%Y-%m-%dT%H:%M:%S%z"),
    ]
)
def timezone_aware_date_list(request):
    return request.param

It looks like the test fails for 6 of the 7 parameters (the UTC one passes).

@martindurant
Contributor

That's interesting, yes. I wonder, what string would you give to pytz.timezone() to get the right timezone for these? Apparently I found a formalism that works for some subset (fractional hours, in particular, ought to be hard!).

@martindurant
Contributor

i.e., .tz_localize('-23:59') doesn't work.
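
To illustrate why only a subset of these offsets can be given a name: the IANA database only has fixed-offset `Etc/GMT±N` zones for whole hours from UTC-12 to UTC+14, and their sign is inverted relative to the UTC offset (`Etc/GMT-2` means UTC+02:00). The sketch below uses only the standard library; `etc_gmt_name` is a hypothetical helper written for this comment and is not how fastparquet resolves timezones.

```python
import datetime


def etc_gmt_name(offset: datetime.timedelta):
    """Return the IANA "Etc/GMT±N" name for a fixed UTC offset, or None.

    Hypothetical helper for illustration; not part of fastparquet or pandas.
    """
    seconds = offset.total_seconds()
    if seconds % 3600 != 0:
        return None  # fractional-hour offsets have no Etc/GMT zone
    hours = int(seconds // 3600)
    if not -12 <= hours <= 14:
        return None  # Etc/GMT zones only cover UTC-12 through UTC+14
    if hours == 0:
        return "Etc/GMT"
    # POSIX-style sign inversion: UTC+2 is named "Etc/GMT-2".
    return f"Etc/GMT{-hours:+d}"


# The same values as the timezone_aware_date_list fixture params.
fixture_params = [
    datetime.datetime.now(datetime.timezone.utc),
    datetime.datetime.now(datetime.timezone.min),
    datetime.datetime.now(datetime.timezone.max),
    datetime.datetime.strptime("2019-01-04T16:41:24+0200", "%Y-%m-%dT%H:%M:%S%z"),
    datetime.datetime.strptime("2019-01-04T16:41:24+0215", "%Y-%m-%dT%H:%M:%S%z"),
    datetime.datetime.strptime("2019-01-04T16:41:24-0200", "%Y-%m-%dT%H:%M:%S%z"),
    datetime.datetime.strptime("2019-01-04T16:41:24-0215", "%Y-%m-%dT%H:%M:%S%z"),
]

for dt in fixture_params:
    print(dt.tzname(), "->", etc_gmt_name(dt.utcoffset()))
```

Only the UTC, +02:00 and -02:00 params map to a named zone here. The other four (±23:59 and ±02:15) would need a fixed-offset tzinfo object rather than a zone name.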

@phofl
Member Author

phofl commented May 7, 2021

Reopening until the pin is reverted.

@phofl reopened this on May 7, 2021
@lithomas1 removed the Bug label on May 7, 2021
@phofl changed the title from "CI: Fastparquet release broke ci (probably)" to "CI: Fastparquet release broke ci" on May 7, 2021
@lithomas1 self-assigned this on May 12, 2021
@phofl reopened this on May 13, 2021
5 participants