Skip to content

[NO MRG] Test CI #952

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from
Closed

Conversation

jakirkham
Copy link
Member

Opening just to test the current state of CI

@jakirkham
Copy link
Member Author

jakirkham commented Jan 27, 2022

Am seeing the same failure with commit ( e15001b ) on master (the latest change). Going to try reverting that to see what happens

Edit: Should add this seems potentially related as tests are hanging in test_sync.py as seen in this log and the latest change updated fasteners a library we use for locking.

Edit 2: Fortunately the changes in fasteners between 0.17.2 and 0.17.3 are quite small. So hopefully the issue is easy to narrow down once we have a reproducer.

Edit 3: Still seeing issues with 0.17.2, which came out a little over a week back. Trying 0.17.1.

We seem to be experiencing hangs in `test_sync.py` with this version. So
exclude it from testing.
joshmoore added a commit to joshmoore/zarr-python that referenced this pull request Jan 27, 2022
@jakirkham
Copy link
Member Author

jakirkham commented Jan 28, 2022

Here's a dependency diff between CI for commit ( 21739ad )'s run and commit ( e15001b )'s run where the issue first appeared

Diff:
27c27
< coverage==6.2
---
> coverage==6.3
38c38
< fasteners==0.17.2
---
> fasteners==0.17.3
62c62
< jsondiff==1.3.0
---
> jsondiff==1.3.1
68c68
< jupyter-client==7.1.1
---
> jupyter-client==7.1.2
78c78
< moto==2.3.2
---
> moto==3.0.1
82c82
< multidict==5.2.0
---
> multidict==6.0.2
90c90
< notebook==6.4.7
---
> notebook==6.4.8
103c103
< prometheus-client==0.12.0
---
> prometheus-client==0.13.0
107c107
< pure-eval==0.2.1
---
> pure-eval==0.2.2
115c115
< pyparsing==3.0.6
---
> pyparsing==3.0.7

@jakirkham
Copy link
Member Author

Spotted issue ( harlowja/fasteners#86 ), which may also be related.

@jakirkham
Copy link
Member Author

Looks like it was still hanging even with 0.16. Going to try running with more verbosity.

@jakirkham
Copy link
Member Author

Canceled the tests and looked at the tests run. They were definitely using the ProcessSynchronizer. There wasn't any one test that was jamming things up. Each job had a different test running when canceled. This makes me think there are still some locking issues, but when they crop up is sporadic.

@jakirkham
Copy link
Member Author

@joshmoore would be good to get your thoughts here on what we should do next? Should we debug further? Should we xfail the ProcessSynchronizer tests and go ahead with the release? Or should we do something else?

@joshmoore
Copy link
Member

Hmmmm..... so we're sure it's just a change in fasteners and not (also) a change in the mainline that's leading to this?

It would definitely be good to get 2.11 out. We could move the synchronizer tests out to their own workflow (which only gets run periodically?)

@jakirkham
Copy link
Member Author

jakirkham commented Jan 28, 2022

To be completely honest, I'm not sure.

At the moment I'm torn between whether a new copy of fasteners is sneaking in somewhere despite the version constraints or whether there is some subtle locking issue introduced by not writing empty chunks (or maybe something else?).

That said, CI on master was passing until the most recent commit, which coincides with the recent fasteners release.

Davis' change ( 831e687 ) went in a couple months back (October) and issues only cropped up recently. Think we would have seen issues sooner if that was the cause.

All of the testing here is without the recent empty chunks change and it still has issues. So doesn't seem like that's the cause.

Finally where the tests slow down is when testing ProcessSynchronizer, which is the one place fasteners is used.

To summarize, there isn't an obvious cause outside of fasteners. Though it is disquieting that pinning doesn't seem to work.

@joshmoore
Copy link
Member

joshmoore commented Jan 28, 2022

Currently trying to reproduce locally with:

pytest -svvvx --count=1000 --random-order zarr/tests/test_sync.py

@psarka
Copy link

psarka commented Jan 28, 2022

Hi, fasteners maintainer here, saw a note in our issue about the deadlock, got interested. I'm puzzled by this thing, looking at the commit log I see that your tests passed 8 days ago with fasteners 17.2, but didn't with 2 days ago with fasteners 17.3:
image

Which is weird, as you are using ProcessLock, which was not touched in 17.3. Maybe it would be helpful to rerun your CI on the commit 21739ad (the one that was fine 8 days ago) and see if it still passes?

@joshmoore
Copy link
Member

Thanks, @psarka. Happy to give it a try. The previous action run was:

https://github.com/zarr-developers/zarr-python/runs/4878605616?check_suite_focus=true
Screen Shot 2022-01-28 at 15 38 16

I've just re-launched. Let's see what happens.

@psarka
Copy link

psarka commented Jan 28, 2022

Looks like it is hanging 🤯

Is there any other explanation, other than that something changed with github actions infrastructure? I'll rerun fasteners CI just in case.

Edit: fasteners are passing fine. I'm out of ideas.

@jakirkham
Copy link
Member Author

Thanks for taking a look Paulius! 😄

Yeah the hang seems to be occurring with the ProcessSynchronizer tests, which use ProcessLock. That was what we were seeing here as well.

That said, we probably need to turn this into a smaller reproducer so it is a bit easier to dig into.

@joshmoore
Copy link
Member

I got side tracked by trying to reproduce locally with https://github.com/nektos/act (unsuccessfully). I will give up on that and try to find a way forward.

@joshmoore
Copy link
Member

Plan discussed with @jakirkham:

  • I will try on bare-metal linux & linux/docker to try to reproduce.
  • John will try to tease apart whether or not the version of fasteners that we expect is actually being installed.
  • If we can't find a solution by tomorrow EOB, we will disable the synchronizer test in GHA and move forward with the 2.11 release.

@joshmoore
Copy link
Member

Linux & Linux/docker don't show a hang locally when following the python-package.yml workflow instructions. Currently running with --count=1000 but I don't expect it will show anything.

@jakirkham
Copy link
Member Author

More strange is we saw PR ( #955 ) pass. Trying to see if we can replicate that elsewhere. If so, maybe it was a GH Action's change?

@joshmoore
Copy link
Member

Maybe it was a temporary issue in our segment of the GHA cloud?! (if so, reminder for next time: contact GH earlier)

@jakirkham
Copy link
Member Author

Checking here as well ( #950 ) just to make sure this is consistently working

@jakirkham
Copy link
Member Author

Looks like things are working 🎉

Sorry for the false alarm @psarka and thanks for the help 🙂

Going to go ahead and close

@jakirkham jakirkham closed this Feb 4, 2022
@jakirkham jakirkham deleted the tst_ci branch February 4, 2022 10:01
joshmoore added a commit that referenced this pull request Feb 4, 2022
* Set write_empty_chunks to default to False

* Add release entry for write_empty_chunks default

* add Empty chunks section to tutorial.rst

* add benchmarky example

* proper formatting of code block

* Fix abstore deprecated strings

* Also catch ValueError in all_equal

The call to `np.any(array)` in zarr.util.all_equal triggers the
following ValueError:

```
>       return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
E       ValueError: invalid literal for int() with base 10: 'baz'
```

Extending the catch block allows test_array_with_categorize_filter
to pass, but it's unclear if this points to a deeper issue.

* Add --timeout argument to all uses of pytest

* Pin fasteners to 0.16.3 (see #952)

Co-authored-by: Davis Vann Bennett <[email protected]>
Co-authored-by: Josh Moore <[email protected]>
Co-authored-by: jmoore <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants