Skip to content

CI: Add pyarrow 11 to ci #51308

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Conversation

phofl
Copy link
Member

@phofl phofl commented Feb 10, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Copy link
Member

@lithomas1 lithomas1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put a hold on this for now? I would like to discuss with Arrow folks about the best way to test between our two projects.

(Extended thoughts below)

@lithomas1
Copy link
Member

lithomas1 commented Feb 10, 2023

Status Quo

Currently, we test on pyarrow versions 8, 9, and 10 (IIRC because some version of pyarrow in between our min/latest version broke, but the min/latest version passed). This currently adds a lot of burden on our CI (builds are queueing for 2 hours again even though Github bumped up our concurrent runners), since the builds are created from the cross product of matrix parameters, and also hasn't really caught that many issues (which is expected, since failures would mean that pyarrow had a regression).

IMO, we should share some of that testing burden between the two projects, both to reduce load on the pandas CI, and to ensure that pyarrow has enough testing coverage to catch its own bugs in Pandas/Pyarrow integration.

Proposed changes

  • Pyarrow test pandas nightlies in addition to testing min/latest pandas

    • pyarrow seems to run their tests on pandas nightly.
    • They should add a job to run pandas tests that test Arrow Extension Arrays/CSV IO/Parquet IO etc.
      - Pyarrow uses pandas internals, so it is easy for them to break something, because of changes to internal structures on our side
      - We should provide a mark to only run Arrow tests so that they don't have to run the full test suite.
  • We drop testing of intermediate versions of pyarrow and move to testing only min/max again
    - This aligns with what we're doing with numpy
    - There is a risk of breakage due to pyarrow's usage of our internals, so we should kinda figure out what pyarrow needs in our internals and mark those functions in the code.
    - Maybe we can have a "pseudo-public" API for those functions.

  • (optionally) We test pyarrow nightlies
    - If CI capacity frees up enough, we should also test pyarrow nightlies (I couldn't find the nightly builds for the Python bindings though).

Alternatives

Add the "arrow" mark on our side and only run those tests on the parametrized arrow build.

I will open an issue on their side in a couple days if no objections.

cc @mroeschke @jorisvandenbossche

@mroeschke
Copy link
Member

Generally I would support only testing pyarrow min + max again right away.

pyarrow has seemed fairly stable within the past few versions and appears to be primarily additive in functionality. Especially since pyarrow functionality has been primarily opt-in-without-fallback, I don't see too much utility testing intermediate versions of pyarrow anymore.

@phofl
Copy link
Member Author

phofl commented Feb 10, 2023

I am a bit hesitant about only testing min and max as long as we have the flags that control behavior dependent on the pyarrow version. Wdyt about testing 8 on python 3.8, 9 on python 3.9 and so on?

@mroeschke
Copy link
Member

FWIW I think since we bumped up the pyarrow version recently we only have version dependent fallback behavior in 2 spots, argsort and fillna in special cases. (They appear to also be or versions < 7 which we recently made our min version)

argsort

        null_placement = {"last": "at_end", "first": "at_start"}.get(na_position, None)
        if null_placement is None or pa_version_under7p0:
            # Although pc.array_sort_indices exists in version 6
            # there's a bug that affects the pa.ChunkedArray backing
            # https://issues.apache.org/jira/browse/ARROW-12042
            fallback_performancewarning("7")
            return super().argsort(
                ascending=ascending, kind=kind, na_position=na_position
            )

fillna

        if method is not None and pa_version_under7p0:
            # fill_null_{forward|backward} added in pyarrow 7.0
            fallback_performancewarning(version="7")
            return super().fillna(value=value, method=method, limit=limit)

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Feb 10, 2023

I don't fully understand why we at some point moved towards parametrizing the build matrix on all supported versions of pyarrow. Personally, I would keep testing all versions of pyarrow (at least if we currently only support the latest 4 versions that seem doable), but that doesn't mean that we have to build every possible combination of python version and single-cpu / non-single-cpu with every version of pyarrow?

We could also generally run one version of pyarrow per version of Python? (roughly speaking) That gives 8 builds (4*2) instead of 32 (4*4*2)

@jorisvandenbossche
Copy link
Member

Pyarrow test pandas nightlies in addition to testing min/latest pandas

  • pyarrow seems to run their tests on pandas nightly.

Indeed, we already have a build with pandas nightly (which regularly catches some changes/regressions)

  • They should add a job to run pandas tests that test Arrow Extension Arrays/CSV IO/Parquet IO etc.

Yes, this is indeed something we could do. I am not actually sure why we don't do that for pandas, as we do exactly that for dask (run a subset of dask's tests using the nightly arrow)

I think starting with specifically running a few files/directories (parquet and csv tests, the arrow extension arrays) might be a good enough start (not necessarily needing a mark, although if we start with marking a few whole files, that's also easy to add)

(optionally) We test pyarrow nightlies

I would emphasize this a bit more, and it should actually be easy to add (we already have a build with upstream dev versions, and we can add pyarrow nightly to the same build?)

@mroeschke
Copy link
Member

but that doesn't mean that we have to build every possible combination of python version and single-cpu / non-single-cpu with every version of pyarrow

The pyaarrow CSV engine tests are marked as single_cpu while the other tests are not. In the past, I noticed the arrow CSV engine would stall if run with xdist.

@lithomas1
Copy link
Member

  • They should add a job to run pandas tests that test Arrow Extension Arrays/CSV IO/Parquet IO etc.

Yes, this is indeed something we could do. I am not actually sure why we don't do that for pandas, as we do exactly that for dask (run a subset of dask's tests using the nightly arrow)

I think starting with specifically running a few files/directories (parquet and csv tests, the arrow extension arrays) might be a good enough start (not necessarily needing a mark, although if we start with marking a few whole files, that's also easy to add)

Cool. I'll open up an issue on the Arrow side then.

(optionally) We test pyarrow nightlies

I would emphasize this a bit more, and it should actually be easy to add (we already have a build with upstream dev versions, and we can add pyarrow nightly to the same build?)

I can add this if you give me a link. Googling only gives me the R nightlies.

@jorisvandenbossche
Copy link
Member

but that doesn't mean that we have to build every possible combination of python version and single-cpu / non-single-cpu with every version of pyarrow

The pyaarrow CSV engine tests are marked as single_cpu while the other tests are not. In the past, I noticed the arrow CSV engine would stall if run with xdist.

Yes, but what I mentioned was running each combo of python version and single/non-single CPU just with 1 version of pyarrow instead of 4. That still runs the csv tests with pyarrow for single-cpu.

@jorisvandenbossche
Copy link
Member

I can add this if you give me a link. Googling only gives me the R nightlies.

@lithomas1 see https://arrow.apache.org/docs/dev/developers/python.html#installing-nightly-packages (it's indeed a bit hidden in the development section instead of having it on the install page).

@mroeschke
Copy link
Member

Yes, but what I mentioned was running each combo of python version and single/non-single CPU just with 1 version of pyarrow instead of 4. That still runs the csv tests with pyarrow for single-cpu.

Right that's fair. I think I would still prefer to test the min & max version of pyarrow. Anecdotally, pyarrow development has notably affected the minimum pyarrow version build or all builds affecting all Pyarrow versions.

@simonjayhawkins simonjayhawkins added CI Continuous Integration Arrow pyarrow functionality labels Feb 22, 2023
@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Mar 25, 2023
@lithomas1
Copy link
Member

Sorry, this slipped off my radar for a bit.

Right now, I'm currently working on getting rid of all hanging tests for read_csv with pyarrow. Then, we should be able to get rid of the single_cpu pyarrow builds.

I'm holding off on adding pyarrow nightlies, until we can first reduce the CI matrix first.

@phofl
Copy link
Member Author

phofl commented Mar 25, 2023

I stumbled upon this on the desk side today, what are your thoughts on adding the arrow nightly to the numpy nightly build?

@lithomas1
Copy link
Member

I stumbled upon this on the desk side today, what are your thoughts on adding the arrow nightly to the numpy nightly build?

I don't think they test with Cython 3 ATM, so it wouldn't be practical (They currently do support the latest beta as of a couple of days ago, but it is untested in their CI).
It also could cause some noise in our CI when numpydev breaks arrow.

@mroeschke
Copy link
Member

mroeschke commented Mar 25, 2023

IMO I think we could already:

  1. Start only testing pyarrow min version & unpinned (latest)
  2. Add a new job in the ubuntu.yml workflow for testing just arrow nightly

That would be a net of -3 jobs in the workflow while testing pyarrow nightly which is probably more valuable than testing intermediate versions of pyarrow

@mroeschke
Copy link
Member

I think we can close this as we cleaned up some of the Arrow testing in the CI as apart of #52211

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality CI Continuous Integration Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants