x/build/cmd/coordinator: add health check for root filesystem of the Mac bastion host not being read-only #32449
Comments
Do you know if the remaining 9 machines are operating correctly? As I understood the output, the Mac builders are at half capacity rather than completely down. Or is there a problem that's causing it to not report that the remaining 9 machines aren't operating? Based on builds at https://build.golang.org/ and details at https://farmer.golang.org/#pools, it seems the remaining 9 machines are connected, but they offer |
Each of the hosts can run any VM type, but it doesn't actively rebalance. (And even if it did, in this case it can't even make API calls to the VMware API server, so it wouldn't be able to anyway.) The health checker could also report failures on the connected guest types. |
Yes, they'll probably each run fine for one build each and then kill themselves after the build and get stuck in the same state as the other 11. |
I'm starting to investigate this now.
One of them is running for 8 minutes now:
So we'll see what happens to it after it's done. |
That host completed the build and re-connected successfully after that. I tried restarting the physical host04 node via MacStadium UI. It previously had one of the VMs missing but another present. Now they're both missing and not coming back. So the problem is not there. The most immediate problem seems to be the
I haven't used |
There was a chance makemac was running as another user (instead of The entire root filesystem on the bastion host appears to be read-only. That isn't intentional, is it? I wonder if something changed recently that would result in that being the case. |
It shouldn't be read-only. The filesystem probably crapped itself. Check That's another thing that should be exported in makemac's status JSON. (My "Related: ..." comment above). |
Yeah,
|
I've restarted the bastion host. It came back up, and its filesystem is now writeable. As a result, calls to I suspect this will be enough to resolve the immediate issue, but there are more follow-up tasks to improve monitoring so we can spot some of these issues sooner.
Edit: By now, all Mac hosts are up, and https://build.golang.org is catching up on missed Mac builds. https://farmer.golang.org/status/macs
Some Mac VMs still do disappear occasionally. That has a different cause than this original outage and will need to be investigated separately. |
Change https://golang.org/cl/181217 mentions this issue: |
…oordinator
This adds information on warnings & errors to makemac's JSON status handler that is then parsed by the coordinator's health checking code, which already polls this JSON endpoint.
Updates golang/go#32449
Updates golang/go#15760
Change-Id: I69bea7b07c184d1f62a358bc317376aa97018230
Reviewed-on: https://go-review.googlesource.com/c/build/+/181217
Reviewed-by: Brad Fitzpatrick <[email protected]>
This happened again as part of #35109. It will be helpful to add a health check that verifies the bastion host's root filesystem is not mounted read-only, to help diagnose this kind of issue in the future. |
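For illustration, one way such a check could work is a simple write probe: try to create a temporary file and treat a failure as unhealthy. This is only a sketch and not necessarily what the eventual CL does. The name `checkWritable` and the probed directory are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// checkWritable is a hypothetical helper: it tries to create and remove a
// temporary file in dir, and returns an error if that is not possible.
func checkWritable(dir string) error {
	f, err := os.CreateTemp(dir, "healthcheck-*")
	if err != nil {
		// EROFS is the errno reported when writing to a read-only filesystem.
		if errors.Is(err, syscall.EROFS) {
			return fmt.Errorf("filesystem containing %s is read-only: %w", dir, err)
		}
		return fmt.Errorf("cannot write to %s: %w", dir, err)
	}
	name := f.Name()
	f.Close()
	return os.Remove(name)
}

func main() {
	// Probing the temp directory here; a real check would probe a path on
	// the root filesystem of the host being monitored.
	if err := checkWritable(os.TempDir()); err != nil {
		fmt.Println("unhealthy:", err)
		return
	}
	fmt.Println("ok")
}
```

A write probe catches any reason the directory is unwritable (permissions, full disk), not just a read-only mount, which is why the sketch reports the read-only case separately.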
Change https://golang.org/cl/202822 mentions this issue: |
Fixes golang/go#32449
Change-Id: I35d059778ab96ef4d57236aaccb41698314d6fac
Reviewed-on: https://go-review.googlesource.com/c/build/+/202822
Reviewed-by: Dmitri Shuralyov <[email protected]>
The Macs are down again:
https://farmer.golang.org/status/macs
Looking at the macstadiumd host's logs:
Something's wrong with the cluster.
Related: since the coordinator now polls the makemac JSON status URL (and it's currently reporting healthy), we should include errors like
getting VMWare state: Reading /MacStadium-ATL/host/MacMini_Cluster: EOF
in the makemac daemon's status response JSON, so they can be shown in the coordinator health output.
/cc @andybons @bcmills
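A rough sketch of the idea, assuming the status JSON gains `errors` and `warnings` arrays: those field names, the `status` type, and `checkStatus` are hypothetical and are not taken from makemac or the coordinator.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// status is a hypothetical shape for the makemac status JSON.
type status struct {
	Errors   []string `json:"errors,omitempty"`
	Warnings []string `json:"warnings,omitempty"`
}

// checkStatus parses a status document (as the poller might receive it)
// and returns a non-nil error if the document lists any errors.
func checkStatus(body []byte) error {
	var st status
	if err := json.Unmarshal(body, &st); err != nil {
		return fmt.Errorf("parsing status JSON: %w", err)
	}
	if len(st.Errors) > 0 {
		return errors.New(strings.Join(st.Errors, "; "))
	}
	return nil
}

func main() {
	// Example payload carrying the error quoted above.
	body := []byte(`{"errors":["getting VMWare state: Reading /MacStadium-ATL/host/MacMini_Cluster: EOF"]}`)
	if err := checkStatus(body); err != nil {
		fmt.Println("unhealthy:", err)
	}
}
```

With something like this, a daemon that is reachable but unable to talk to VMware would no longer look healthy to the poller.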