Test failure in dataflow/gpu-workers #5247

Closed
tmatsuo opened this issue Jan 15, 2021 · 6 comments
Assignees: tmatsuo
Labels: api: dataflow, priority: p2, samples, type: bug

Comments

tmatsuo commented Jan 15, 2021

Example build

Probably a flake, restarting the build to see.

tmatsuo added the priority: p2, type: bug, api: dataflow, and samples labels Jan 15, 2021
tmatsuo self-assigned this Jan 15, 2021

tmatsuo commented Jan 15, 2021

It seems like it always fails.

@davidcavazos Do you have any idea why it fails?

tmatsuo commented Jan 15, 2021

I see a dlerror in the log:

2021-01-15 22:29:10.524176: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-01-15 22:29:10.524235: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

tmatsuo commented Jan 15, 2021

I found a more important error:

Startup of the worker pool in zone us-central1-a failed to bring up any of the desired 1 workers. ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS: The zone 'projects/python-docs-samples-tests/zones/us-central1-a' does not have enough resources available to fulfill the request.  '(resource type:compute)'.

tmatsuo commented Jan 15, 2021

There were leaked GCE instances. I deleted them and I think we're good to go.
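
A minimal sketch of how leaked instances could be spotted before deleting them, assuming the JSON output of `gcloud compute instances list --format=json` has been loaded into a list of dicts. The `name` and `creationTimestamp` fields come from the Compute Engine instance resource; the function name, the name prefix, the age threshold, and the sample data are all hypothetical and would need to match whatever the test jobs actually create.

```python
from datetime import datetime, timedelta, timezone


def find_stale_instances(instances, name_prefix, max_age, now=None):
    """Return names of instances that match the prefix and are older than max_age."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for instance in instances:
        if not instance["name"].startswith(name_prefix):
            continue
        # creationTimestamp is RFC 3339, e.g. "2021-01-15T14:29:10.524-08:00".
        created = datetime.fromisoformat(instance["creationTimestamp"])
        if now - created > max_age:
            stale.append(instance["name"])
    return stale


# Hypothetical sample data; real input would come from the gcloud JSON output.
sample = [
    {"name": "gpu-workers-test-1", "creationTimestamp": "2021-01-14T10:00:00.000-08:00"},
    {"name": "gpu-workers-test-2", "creationTimestamp": "2021-01-15T21:00:00.000-08:00"},
    {"name": "unrelated-vm", "creationTimestamp": "2021-01-01T00:00:00.000-08:00"},
]
now = datetime(2021, 1, 15, 22, 30, tzinfo=timezone(timedelta(hours=-8)))
print(find_stale_instances(sample, "gpu-workers-test-", timedelta(hours=6), now))
```

Each returned name could then be reviewed and passed to `gcloud compute instances delete` with the matching `--zone`.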

@tmatsuo
Copy link
Contributor Author

tmatsuo commented Jan 16, 2021

Now it's passing consistently.

tmatsuo closed this as completed Jan 16, 2021

davidcavazos commented

Thanks. Sorry, I missed these notifications. The dynamic library warning is normal. TensorFlow tries to load the CUDA runtime on the VM that launches the job, and that VM doesn't have GPUs. When the workers run the pipeline, they should work as expected.
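
As a rough illustration of why the warning shows up on the launcher VM, the same kind of check can be approximated with `ctypes` from the standard library. This is only a sketch, not TensorFlow's actual loader code, and the helper name is made up for the example.

```python
import ctypes


def can_load_shared_library(name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        # Same failure mode as the dlerror in the log: the loader cannot find the file.
        return False


# False on a machine without the CUDA 11.0 runtime installed,
# True where the runtime is available on the loader path.
print(can_load_shared_library("libcudart.so.11.0"))
```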

About the quota issue, Valentyn also mentioned that switching to us-central1-f could solve it. I applied that change in #5275.
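
The change described here is a plain zone switch. A different approach, sketched only as an illustration, would be to fall back across several zones when one is exhausted. Everything in this sketch is hypothetical: `ZoneExhaustedError` stands in for the `ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS` failure, and `launch` is whatever callable actually starts the job in a given zone.

```python
class ZoneExhaustedError(Exception):
    """Hypothetical error standing in for ZONE_RESOURCE_POOL_EXHAUSTED."""


def launch_with_zone_fallback(launch, zones):
    """Try each zone in order and return the first one where launch succeeds."""
    for zone in zones:
        try:
            launch(zone)
            return zone
        except ZoneExhaustedError:
            continue  # This zone has no capacity; try the next one.
    raise ZoneExhaustedError(f"no capacity in any of: {', '.join(zones)}")


def fake_launch(zone):
    # Stand-in launcher for the sketch: pretend us-central1-a is out of capacity.
    if zone == "us-central1-a":
        raise ZoneExhaustedError(zone)


print(launch_with_zone_fallback(fake_launch, ["us-central1-a", "us-central1-f"]))
```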
