x/build/windows-arm64: recover from unresponsive VM #47018

Closed
toothrot opened this issue Jul 1, 2021 · 7 comments
Labels
Builders: x/build issues (builders, bots, dashboards)
FrozenDueToAge
NeedsInvestigation: Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

@toothrot
Contributor

toothrot commented Jul 1, 2021

What version of Go are you using (go version)?

Go tip: 4711bf3

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

windows/arm64

What did you do?

Caused a fatal OS error in the Windows ARM64 buildlet, which failed to reboot. (see #47017 for cause)

What did you expect to see?

The builder to exit successfully after a crash, and process a new build.

What did you see instead?

The Windows VM was stuck in the EFI boot stage and failed to boot Windows after a fatal error.

The script that runs the VM in a loop is naive: it waits indefinitely for the VM to exit. We should kill the VM if it is unresponsive for some time, perhaps by adding a /healthz endpoint to the Windows buildlet and exposing that endpoint to the host.

It probably makes sense to either extend buildlet to take on the responsibilities of something like rundockerbuildlet and makemac, or to add a new runqemubuildlet command.

@toothrot toothrot added the NeedsInvestigation label Jul 1, 2021
@toothrot toothrot added this to the Unreleased milestone Jul 1, 2021
@gopherbot gopherbot added the Builders label Jul 1, 2021
@gopherbot
Contributor

Change https://golang.org/cl/332492 mentions this issue: env/windows-arm64/macstadium: add image notes and qemu script

gopherbot pushed a commit to golang/build that referenced this issue Jul 7, 2021
Add barebones instructions for creating a macmini instance that runs a
Windows ARM64 buildlet in a loop. The instruction templates are from our
other macstadium builders.

See golang/go#47018 for improvements.

Updates golang/go#47018
Fixes golang/go#42604

Change-Id: I0bb092aaf99afb12a0e563a69bcb711333dda743
Reviewed-on: https://go-review.googlesource.com/c/build/+/332492
Trust: Alexander Rakoczy <[email protected]>
Run-TryBot: Alexander Rakoczy <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Carlos Amedee <[email protected]>
@toothrot toothrot self-assigned this Jul 12, 2021
@gopherbot
Contributor

Change https://golang.org/cl/334372 mentions this issue: cmd/runvmbuildlet: add command to run vm-based buildlets

@gopherbot
Contributor

Change https://golang.org/cl/334373 mentions this issue: cmd/buildlet: add healthz endpoint

gopherbot pushed a commit to golang/build that referenced this issue Jul 15, 2021
runqemubuildlet runs a qemu-based buildlet in a loop. This will allow us
to add better monitoring to the command than with the current bash
script.

WaitOrStop was originally implemented for x/playground in
golang.org/cl/228438. It provides a safe way to terminate programs after
a timeout, or to forcibly terminate them after a grace period.

For golang/go#47018

Change-Id: I205c53554bdf287997d567d530581a93febea648
Reviewed-on: https://go-review.googlesource.com/c/build/+/334372
Trust: Alexander Rakoczy <[email protected]>
Run-TryBot: Alexander Rakoczy <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
gopherbot pushed a commit to golang/build that referenced this issue Jul 16, 2021
This adds a healthz endpoint to buildlets. For reverse buildlets, it
also listens for healthz requests on a private port for a monitoring
process.

For golang/go#47018

Change-Id: I100a8939c5752664afb80472e567ab05a80649d7
Reviewed-on: https://go-review.googlesource.com/c/build/+/334373
Trust: Alexander Rakoczy <[email protected]>
Run-TryBot: Alexander Rakoczy <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
@gopherbot
Contributor

Change https://golang.org/cl/334953 mentions this issue: cmd/runqemubuildlet: pass command arguments correctly

gopherbot pushed a commit to golang/build that referenced this issue Jul 16, 2021
The spaces are not necessary, as each argument is passed correctly to
the command. Add Stdout/Stderr output from qemu.

For golang/go#47018

Change-Id: Ia908bf2cc639cc7d2a60bff137bc2e714a3ec6ef
Reviewed-on: https://go-review.googlesource.com/c/build/+/334953
Trust: Alexander Rakoczy <[email protected]>
Run-TryBot: Alexander Rakoczy <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
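
The point about spaces follows from how `os/exec` works: `exec.Command` does not invoke a shell, so each slice element becomes exactly one argv entry, and a flag and its value are passed as separate elements without any padding. A small sketch, with hypothetical argument values (the real runqemubuildlet argument list is not reproduced here):

```go
package main

import (
	"fmt"
	"os/exec"
)

// qemuArgs returns an illustrative argument list. The values are
// hypothetical; the point is that each flag and each value is its own
// element, with no leading or trailing spaces.
func qemuArgs(memMB int) []string {
	return []string{
		"-m", fmt.Sprint(memMB), // flag and value are separate elements
		"-nographic",
	}
}

func main() {
	// Nothing is executed here; cmd.Args only shows what would be passed.
	cmd := exec.Command("qemu-system-aarch64", qemuArgs(4096)...)
	for i, a := range cmd.Args {
		fmt.Printf("argv[%d] = %q\n", i, a)
	}
}
```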
@gopherbot
Contributor

Change https://golang.org/cl/336109 mentions this issue: cmd/runqemubuildlet: restart unresponsive qemu processes

gopherbot pushed a commit to golang/build that referenced this issue Jul 21, 2021
Expose the healthz port from the buildlet running under QEMU, and
periodically check it for a successful response. If it has been failing
for longer than ten minutes, try to restart the VM. This should
successfully restart VMs that failed to boot, failed to shut down, or
are otherwise unresponsive.

For golang/go#47018

Change-Id: I9218f94ee24de6e0a56ad60a18e075ce48893938
Reviewed-on: https://go-review.googlesource.com/c/build/+/336109
Trust: Alexander Rakoczy <[email protected]>
Run-TryBot: Alexander Rakoczy <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Carlos Amedee <[email protected]>
@toothrot
Contributor Author

Tested and verified.

@gopherbot
Contributor

Change https://golang.org/cl/336590 mentions this issue: cmd/{buildlet,runqemubuildlet}: use a functional healthAddr default

gopherbot pushed a commit to golang/build that referenced this issue Jul 22, 2021
Reverse buildlets now listen publicly, which allows the QEMU host
forwarding to route to the buildlet.

Also, print a newline at the end of the healthz response for legibility.

For golang/go#47018

Change-Id: I71ae1bf4d7cbee4867c42e863cb9f8c2569e1b69
Reviewed-on: https://go-review.googlesource.com/c/build/+/336590
Trust: Alexander Rakoczy <[email protected]>
Run-TryBot: Alexander Rakoczy <[email protected]>
Reviewed-by: Heschi Kreinick <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
TryBot-Result: Go Bot <[email protected]>
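
The difference between a loopback-only and a public listener can be seen from the listener address alone. The sketch below assumes, as the commit message implies, that connections forwarded by QEMU reach the guest over its network interface rather than its loopback device, so a listener bound only to 127.0.0.1 inside the guest would not receive them. The helper name `listenIP` and the use of port 0 are illustrative.

```go
package main

import (
	"fmt"
	"net"
)

// listenIP opens a TCP listener on addr, reports the IP it is bound to,
// and closes the listener again.
func listenIP(addr string) (net.IP, error) {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, err
	}
	defer ln.Close()
	return ln.Addr().(*net.TCPAddr).IP, nil
}

func main() {
	// "127.0.0.1:0" accepts only local connections; ":0" accepts
	// connections on every interface. Port 0 picks any free port.
	for _, addr := range []string{"127.0.0.1:0", ":0"} {
		ip, err := listenIP(addr)
		if err != nil {
			fmt.Println("listen:", err)
			continue
		}
		fmt.Printf("%q loopback=%v all-interfaces=%v\n", addr, ip.IsLoopback(), ip.IsUnspecified())
	}
}
```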
@heschi heschi moved this to Done in Go Release Sep 27, 2022
@golang golang locked and limited conversation to collaborators Jun 23, 2023

2 participants