CI: Fastparquet upgrade broke CI #42588

Closed
lithomas1 opened this issue Jul 17, 2021 · 10 comments · Fixed by #42919
Labels: CI (Continuous Integration), IO Parquet (parquet, feather)

@lithomas1
Member

Fastparquet updated to 0.7, which is breaking the CI.
See these logs from the database build, for example.

___________________________ test_cross_engine_fp_pa ____________________________
[gw1] linux -- Python 3.8.10 /usr/share/miniconda/envs/pandas-dev/bin/python

request = <FixtureRequest for <Function test_cross_engine_fp_pa>>
df_cross_compat =    a  b    d      e          f
0  a  1  4.0   True 2013-01-01
1  b  2  5.0  False 2013-01-02
2  c  3  6.0   True 2013-01-03
pa = 'pyarrow', fp = 'fastparquet'

    def test_cross_engine_fp_pa(request, df_cross_compat, pa, fp):
        # cross-compat with differing reading/writing engines
        df = df_cross_compat
        with tm.ensure_clean() as path:
            df.to_parquet(path, engine=fp, compression=None)
    
            with catch_warnings(record=True):
                result = read_parquet(path, engine=pa)
>               tm.assert_frame_equal(result, df)
E               AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="f") are different
E               
E               Attribute "dtype" are different
E               [left]:  datetime64[ns, UTC]
E               [right]: datetime64[ns]

pandas/tests/io/test_parquet.py:337: AssertionError
______________________ TestParquetFastParquet.test_basic _______________________
[gw1] linux -- Python 3.8.10 /usr/share/miniconda/envs/pandas-dev/bin/python

self = <pandas.tests.io.test_parquet.TestParquetFastParquet object at 0x7fd39d1cc070>
fp = 'fastparquet'
df_full =   string string_with_nan  ...               datetime_tz timedelta
0      a               a  ... 2013-01-01 00:00:00-05...01-02 00:00:00-05:00    2 days
2      c               c  ... 2013-01-03 00:00:00-05:00    3 days

[3 rows x 14 columns]

    def test_basic(self, fp, df_full):
        df = df_full
    
        dti = pd.date_range("20130101", periods=3, tz="US/Eastern")
        dti = dti._with_freq(None)  # freq doesn't round-trip
        df["datetime_tz"] = dti
        df["timedelta"] = pd.timedelta_range("1 day", periods=3)
>       check_round_trip(df, fp)

pandas/tests/io/test_parquet.py:915: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/io/test_parquet.py:220: in check_round_trip
    compare(repeat)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

repeat = 2

    def compare(repeat):
        for _ in range(repeat):
            df.to_parquet(path, **write_kwargs)
            with catch_warnings(record=True):
                actual = read_parquet(path, **read_kwargs)
    
>           tm.assert_frame_equal(
                expected,
                actual,
                check_names=check_names,
                check_like=check_like,
                check_dtype=check_dtype,
            )
E           AssertionError: Attributes of DataFrame.iloc[:, 6] (column name="uint") are different
E           
E           Attribute "dtype" are different
E           [left]:  uint8
E           [right]: UInt8

pandas/tests/io/test_parquet.py:210: AssertionError
__________________ TestParquetFastParquet.test_bool_with_none __________________
[gw1] linux -- Python 3.8.10 /usr/share/miniconda/envs/pandas-dev/bin/python

self = <pandas.tests.io.test_parquet.TestParquetFastParquet object at 0x7fd39fd34af0>
fp = 'fastparquet'

    def test_bool_with_none(self, fp):
        df = pd.DataFrame({"a": [True, None, False]})
        expected = pd.DataFrame({"a": [1.0, np.nan, 0.0]}, dtype="float16")
>       check_round_trip(df, fp, expected=expected)

pandas/tests/io/test_parquet.py:928: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/io/test_parquet.py:220: in check_round_trip
    compare(repeat)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

repeat = 2

    def compare(repeat):
        for _ in range(repeat):
            df.to_parquet(path, **write_kwargs)
            with catch_warnings(record=True):
                actual = read_parquet(path, **read_kwargs)
    
>           tm.assert_frame_equal(
                expected,
                actual,
                check_names=check_names,
                check_like=check_like,
                check_dtype=check_dtype,
            )
E           AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="a") are different
E           
E           Attribute "dtype" are different
E           [left]:  float16
E           [right]: boolean
@lithomas1 lithomas1 added the CI Continuous Integration label Jul 17, 2021
@lithomas1 lithomas1 added this to the 1.3.1 milestone Jul 17, 2021
@jreback
Contributor

jreback commented Jul 17, 2021

Can you report in that tracker as well?

There seem to be a lot of breaking changes in fastparquet lately.

@lithomas1
Member Author

lithomas1 commented Jul 19, 2021

Ok, investigating a bit more. The uint8 vs UInt8 change seems like a deliberate change in fastparquet, xref dask/fastparquet#623. In pandas, we make this opt-in with the use_nullable_dtypes keyword, but fastparquet now uses the nullable dtypes automatically.
test_bool_with_none seems to fail for the same reason: the nullable boolean dtype is used by default, when it actually should be opt-in.
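A minimal sketch of the uint8 mismatch, for illustration (the file name is hypothetical, and this assumes fastparquet 0.7 is the installed engine):

```python
import pandas as pd

# Hypothetical sketch, assuming fastparquet 0.7: write a non-nullable
# uint8 column and read it back with the same engine.
df = pd.DataFrame({"uint": pd.Series([1, 2, 3], dtype="uint8")})
df.to_parquet("example.parquet", engine="fastparquet", compression=None)

result = pd.read_parquet("example.parquet", engine="fastparquet")
# pandas treats nullable dtypes as opt-in (via use_nullable_dtypes=True),
# but fastparquet 0.7 returns the masked dtype unconditionally:
print(result["uint"].dtype)  # UInt8, where uint8 was expected
```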

The datetime64 one just feels like fastparquet is dropping the timezone, possibly related to dask/fastparquet#628, I think (didn't look closely).
EDIT: Never mind, it looks like fastparquet is adding a UTC tz when it should not have a timezone at all.

cc @martindurant

@martindurant
Contributor

The datetime64 one just feels like fastparquet is dropping the timezone, possibly related to dask/fastparquet#628, I think (didn't look closely).

Fastparquet uses ns precision to store timestamps as of that PR, which ought not to have changed the tz propagation - there are tests for that. I haven't looked at your tests yet, though.

@lithomas1
Member Author

Slightly unrelated, but perhaps fastparquet can set up CI against the dev version of pandas? That way, we can catch bugs/issues before a new pandas or fastparquet release. I'd be happy to set that up for fastparquet, if you want.

@martindurant
Contributor

btw: I would love it if there were a variant of assert_frame_equal that would treat, for example, uint8 and UInt8 as equivalent.
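Not part of pandas, but a sketch of what such a helper might look like, assuming the masked dtypes expose their backing numpy dtype via numpy_dtype and that the columns have no missing values:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def assert_frame_equal_lenient(left, right, **kwargs):
    """Hypothetical variant of assert_frame_equal that treats nullable
    extension dtypes (e.g. UInt8) and their numpy counterparts
    (e.g. uint8) as equivalent."""
    def denullify(df):
        out = df.copy()
        for col in out.columns:
            dtype = out[col].dtype
            # Masked dtypes expose the backing numpy dtype; this cast
            # would fail if the column actually contained pd.NA.
            if hasattr(dtype, "numpy_dtype"):
                out[col] = out[col].astype(dtype.numpy_dtype)
        return out
    assert_frame_equal(denullify(left), denullify(right), **kwargs)

# uint8 vs UInt8 now compare equal:
assert_frame_equal_lenient(
    pd.DataFrame({"a": pd.Series([1, 2], dtype="uint8")}),
    pd.DataFrame({"a": pd.array([1, 2], dtype="UInt8")}),
)
```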

@martindurant
Contributor

@lithomas1, yes, that sounds very useful. We already test against dask, for example.

@martindurant
Contributor

I appreciate people here sticking with me, and hope that all the improvements described at https://fastparquet.readthedocs.io/en/latest/releasenotes.html are worth it!

@lithomas1
Member Author

Thanks for developing fastparquet! :)

@lithomas1
Member Author

Hmmm... there might be another inconsistency between the fastparquet and pyarrow engines. We currently configure pyarrow to always use nullable dtypes when the keyword use_nullable_dtypes=True is passed to pandas.read_parquet. Fastparquet, I think, uses the nullable dtypes conditionally (please do correct me if I'm wrong).
From the fastparquet release notes:

If the metadata guarantees that there are no nulls, we still use the non-nullable variant unless the data was written with fastparquet/pyarrow, and the metadata indicates that the original datatype was nullable.

So, I think we need to decide whether use_nullable_dtypes should always produce nullable dtypes, even in the absence of nulls or metadata indicating a nullable column, or only conditionally create the nullable dtypes based on the values/metadata. Personally, I feel that conditionally using the nullable dtypes makes more sense as a user, because there's no benefit to using the nullable dtypes without nulls, and performance would suffer a bit (because of the mask that backs the nullable dtype, which is useless in the absence of nulls).
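A rough sketch of that mask overhead (the numbers are approximate and version-dependent):

```python
import numpy as np
import pandas as pd

n = 1_000_000
data = np.ones(n, dtype="uint8")

plain = pd.Series(data)                    # numpy-backed uint8
nullable = pd.Series(data, dtype="UInt8")  # masked extension dtype

# The masked dtype carries a boolean validity mask alongside the data,
# roughly doubling memory for a 1-byte dtype even with no nulls present.
print(plain.memory_usage(index=False))     # ~1 MB
print(nullable.memory_usage(index=False))  # ~2 MB (data + mask)
```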

cc @jorisvandenbossche (since last time, I think you had some reservations about making use_nullable_dtypes depend on whether NAs are present #40687 (comment))

@jorisvandenbossche
Member

There might be another inconsistency between the fastparquet and pyarrow engines. We currently configure pyarrow to always use nullable dtypes when the keyword use_nullable_dtypes=True is passed to pandas.read_parquet. Fastparquet, I think, uses the nullable dtypes conditionally (please do correct me if I'm wrong).

Since this use_nullable_dtypes keyword is a pandas keyword (and not of fastparquet or pyarrow), I think we should ensure that there is consistent behaviour for the different engines.

So, I think we need to decide whether use_nullable_dtypes should always produce nullable dtypes, even in the absence of nulls or metadata indicating a nullable column, or only conditionally create the nullable dtypes based on the values/metadata.

I indeed had reservations about doing this conditionally. IMO the keyword should give consistent behaviour, and it is also meant to opt in to all nullable dtypes, so that's how I originally implemented it. I would prefer to keep the behaviour that way.

Nullable vs non-nullable dtypes can behave differently for certain operations, even if the data as read has no nulls (you can easily introduce nulls with a subsequent operation). Yes, nullable dtypes can waste some space if there are no NAs (the mask), but there are other ways to resolve this (i.e. ensure we can avoid storing any mask in such a case -> #30435).
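A small sketch of how the two dtype families can diverge right after a read, even when the data itself has no nulls:

```python
import pandas as pd

# A later operation can introduce missing values, and the two dtype
# families then behave differently.
numpy_backed = pd.Series([1, 2, 3], dtype="int64")
nullable = pd.Series([1, 2, 3], dtype="Int64")

print(numpy_backed.shift(1).dtype)  # float64: NaN forces an upcast
print(nullable.shift(1).dtype)      # Int64: pd.NA fits the masked dtype
```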
