CI: Fastparquet upgrade broke CI #42588

Closed
lithomas1 opened this issue Jul 17, 2021 · 10 comments · Fixed by #42919
Labels: CI (Continuous Integration), IO Parquet (parquet, feather)

@lithomas1
Member

Fastparquet updated to 0.7, which is breaking the CI.
See these logs from the database build, for example.

___________________________ test_cross_engine_fp_pa ____________________________
[gw1] linux -- Python 3.8.10 /usr/share/miniconda/envs/pandas-dev/bin/python

request = <FixtureRequest for <Function test_cross_engine_fp_pa>>
df_cross_compat =    a  b    d      e          f
0  a  1  4.0   True 2013-01-01
1  b  2  5.0  False 2013-01-02
2  c  3  6.0   True 2013-01-03
pa = 'pyarrow', fp = 'fastparquet'

    def test_cross_engine_fp_pa(request, df_cross_compat, pa, fp):
        # cross-compat with differing reading/writing engines
        df = df_cross_compat
        with tm.ensure_clean() as path:
            df.to_parquet(path, engine=fp, compression=None)
    
            with catch_warnings(record=True):
                result = read_parquet(path, engine=pa)
>               tm.assert_frame_equal(result, df)
E               AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="f") are different
E               
E               Attribute "dtype" are different
E               [left]:  datetime64[ns, UTC]
E               [right]: datetime64[ns]

pandas/tests/io/test_parquet.py:337: AssertionError
______________________ TestParquetFastParquet.test_basic _______________________
[gw1] linux -- Python 3.8.10 /usr/share/miniconda/envs/pandas-dev/bin/python

self = <pandas.tests.io.test_parquet.TestParquetFastParquet object at 0x7fd39d1cc070>
fp = 'fastparquet'
df_full =   string string_with_nan  ...               datetime_tz timedelta
0      a               a  ... 2013-01-01 00:00:00-05...01-02 00:00:00-05:00    2 days
2      c               c  ... 2013-01-03 00:00:00-05:00    3 days

[3 rows x 14 columns]

    def test_basic(self, fp, df_full):
        df = df_full
    
        dti = pd.date_range("20130101", periods=3, tz="US/Eastern")
        dti = dti._with_freq(None)  # freq doesn't round-trip
        df["datetime_tz"] = dti
        df["timedelta"] = pd.timedelta_range("1 day", periods=3)
>       check_round_trip(df, fp)

pandas/tests/io/test_parquet.py:915: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/io/test_parquet.py:220: in check_round_trip
    compare(repeat)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

repeat = 2

    def compare(repeat):
        for _ in range(repeat):
            df.to_parquet(path, **write_kwargs)
            with catch_warnings(record=True):
                actual = read_parquet(path, **read_kwargs)
    
>           tm.assert_frame_equal(
                expected,
                actual,
                check_names=check_names,
                check_like=check_like,
                check_dtype=check_dtype,
            )
E           AssertionError: Attributes of DataFrame.iloc[:, 6] (column name="uint") are different
E           
E           Attribute "dtype" are different
E           [left]:  uint8
E           [right]: UInt8

pandas/tests/io/test_parquet.py:210: AssertionError
__________________ TestParquetFastParquet.test_bool_with_none __________________
[gw1] linux -- Python 3.8.10 /usr/share/miniconda/envs/pandas-dev/bin/python

self = <pandas.tests.io.test_parquet.TestParquetFastParquet object at 0x7fd39fd34af0>
fp = 'fastparquet'

    def test_bool_with_none(self, fp):
        df = pd.DataFrame({"a": [True, None, False]})
        expected = pd.DataFrame({"a": [1.0, np.nan, 0.0]}, dtype="float16")
>       check_round_trip(df, fp, expected=expected)

pandas/tests/io/test_parquet.py:928: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/io/test_parquet.py:220: in check_round_trip
    compare(repeat)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

repeat = 2

    def compare(repeat):
        for _ in range(repeat):
            df.to_parquet(path, **write_kwargs)
            with catch_warnings(record=True):
                actual = read_parquet(path, **read_kwargs)
    
>           tm.assert_frame_equal(
                expected,
                actual,
                check_names=check_names,
                check_like=check_like,
                check_dtype=check_dtype,
            )
E           AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="a") are different
E           
E           Attribute "dtype" are different
E           [left]:  float16
E           [right]: boolean
@lithomas1 lithomas1 added the CI Continuous Integration label Jul 17, 2021
@lithomas1 lithomas1 added this to the 1.3.1 milestone Jul 17, 2021
@jreback
Contributor

jreback commented Jul 17, 2021

Can you report in that tracker as well?

There seem to be a lot of breaking changes in fastparquet lately.

@lithomas1
Member Author

lithomas1 commented Jul 19, 2021

Ok, investigating a bit more. The uint8 vs UInt8 change seems like a deliberate change in fastparquet, xref dask/fastparquet#623. In pandas, we make this opt-in with the use_nullable_dtypes keyword, but fastparquet now uses the nullable dtypes automatically.
test_bool_with_none seems to fail for the same reason: the nullable boolean dtype is used by default, when it actually should be opt-in.
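A minimal sketch of the uint8 mismatch, for illustration (the file name is hypothetical, and this assumes fastparquet 0.7 is the installed engine):

```python
import pandas as pd

# Hypothetical sketch, assuming fastparquet 0.7: write a non-nullable
# uint8 column and read it back with the same engine.
df = pd.DataFrame({"uint": pd.Series([1, 2, 3], dtype="uint8")})
df.to_parquet("example.parquet", engine="fastparquet", compression=None)

result = pd.read_parquet("example.parquet", engine="fastparquet")
# pandas treats nullable dtypes as opt-in (via use_nullable_dtypes=True),
# but fastparquet 0.7 returns the masked dtype unconditionally:
print(result["uint"].dtype)  # UInt8, where uint8 was expected
```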

The datetime64 one just feels like fastparquet is dropping the timezone, possibly related to dask/fastparquet#628, I think (didn't look closely).
EDIT: Never mind, it looks like fastparquet is adding a UTC tz when it should not have a timezone at all.

cc @martindurant

@martindurant
Contributor

The datetime64 one just feels like fastparquet is dropping the timezone, possibly related to dask/fastparquet#628, I think (didn't look closely).

Fastparquet uses ns precision to store timestamps as of that PR, which ought not to have changed the tz propagation - there are tests for that. I haven't looked at your tests yet, though.

@lithomas1
Member Author

Slightly unrelated, but perhaps fastparquet can set up CI against the dev version of pandas? That way, we can catch bugs/issues before a new pandas or fastparquet release. I'd be happy to set that up for fastparquet, if you want.

@martindurant
Contributor

btw: I would love it if there were a variant of assert_frame_equal that would treat, for example, uint8 and UInt8 as equivalent.
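Not part of pandas, but a sketch of what such a helper might look like, assuming the masked dtypes expose their backing numpy dtype via numpy_dtype and that the columns have no missing values:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def assert_frame_equal_lenient(left, right, **kwargs):
    """Hypothetical variant of assert_frame_equal that treats nullable
    extension dtypes (e.g. UInt8) and their numpy counterparts
    (e.g. uint8) as equivalent."""
    def denullify(df):
        out = df.copy()
        for col in out.columns:
            dtype = out[col].dtype
            # Masked dtypes expose the backing numpy dtype; this cast
            # would fail if the column actually contained pd.NA.
            if hasattr(dtype, "numpy_dtype"):
                out[col] = out[col].astype(dtype.numpy_dtype)
        return out
    assert_frame_equal(denullify(left), denullify(right), **kwargs)

# uint8 vs UInt8 now compare equal:
assert_frame_equal_lenient(
    pd.DataFrame({"a": pd.Series([1, 2], dtype="uint8")}),
    pd.DataFrame({"a": pd.array([1, 2], dtype="UInt8")}),
)
```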

@martindurant
Contributor

@lithomas1, yes, that sounds very useful. We already test against dask, for example.

@martindurant
Contributor

I appreciate people here sticking with me, and hope that all the improvements described at https://fastparquet.readthedocs.io/en/latest/releasenotes.html are worth it!

@lithomas1
Member Author

Thanks for developing fastparquet! :)

@lithomas1
Member Author

Hmmm... there might be another inconsistency between the fastparquet and pyarrow engines. We currently configure pyarrow to always use nullable dtypes when the keyword use_nullable_dtypes=True is passed to pandas.read_parquet. Fastparquet, I think, uses the nullable dtypes conditionally (please do correct me if I'm wrong).
From the fastparquet release notes:

If the metadata guarantees that there are no nulls, we still use the non-nullable variant unless the data was written with fastparquet/pyarrow, and the metadata indicates that the original datatype was nullable.

So, I think we need to decide whether use_nullable_dtypes should always produce nullable dtypes, even in the absence of nulls or metadata indicating a nullable column, or only conditionally create the nullable dtypes based on the values/metadata. Personally, I feel that conditionally using the nullable dtypes makes more sense as a user, because there's no benefit to using the nullable dtypes without nulls, and performance would suffer a bit (because of the mask that backs the nullable dtype, which is useless in the absence of nulls).
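A rough sketch of that mask overhead (the numbers are approximate and version-dependent):

```python
import numpy as np
import pandas as pd

n = 1_000_000
data = np.ones(n, dtype="uint8")

plain = pd.Series(data)                    # numpy-backed uint8
nullable = pd.Series(data, dtype="UInt8")  # masked extension dtype

# The masked dtype carries a boolean validity mask alongside the data,
# roughly doubling memory for a 1-byte dtype even with no nulls present.
print(plain.memory_usage(index=False))     # ~1 MB
print(nullable.memory_usage(index=False))  # ~2 MB (data + mask)
```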

cc @jorisvandenbossche (since last time, I think you had some reservations about making use_nullable_dtypes depend on whether NAs are present #40687 (comment))

@jorisvandenbossche
Member

There might be another inconsistency between the fastparquet and pyarrow engines. We currently configure pyarrow to always use nullable dtypes when the keyword use_nullable_dtypes=True is passed to pandas.read_parquet. Fastparquet, I think, uses the nullable dtypes conditionally (please do correct me if I'm wrong).

Since this use_nullable_dtypes keyword is a pandas keyword (and not of fastparquet or pyarrow), I think we should ensure that there is consistent behaviour for the different engines.

So, I think we need to decide whether use_nullable_dtypes should always produce nullable dtypes, even in the absence of nulls or metadata indicating a nullable column, or only conditionally create the nullable dtypes based on the values/metadata.

I indeed had reservations about doing this conditionally. IMO the keyword should give consistent behaviour, and it is also meant to opt in to all nullable dtypes, so that's how I originally implemented it. I would prefer to keep the behaviour that way.

Nullable vs non-nullable dtypes can behave differently for certain operations, even if the data as read has no nulls (you can easily introduce nulls with a subsequent operation). Yes, nullable dtypes can waste some space if there are no NAs (the mask), but there are other ways to resolve this (i.e. ensure we can avoid storing any mask in such a case -> #30435).
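A small sketch of how the two dtype families can diverge right after a read, even when the data itself has no nulls:

```python
import pandas as pd

# A later operation can introduce missing values, and the two dtype
# families then behave differently.
numpy_backed = pd.Series([1, 2, 3], dtype="int64")
nullable = pd.Series([1, 2, 3], dtype="Int64")

print(numpy_backed.shift(1).dtype)  # float64: NaN forces an upcast
print(nullable.shift(1).dtype)      # Int64: pd.NA fits the masked dtype
```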
