x/build/cmd/coordinator: add health check for root filesystem of the Mac bastion host not being read-only #32449

Closed
@bradfitz

The Macs are down again:

https://farmer.golang.org/status/macs

# "macs" status: MacStadium Mac VMs
# Notes: https://github.com/golang/build/tree/master/env/darwin/macstadium
Warn: macstadium_host01a missing, not seen for 46h18m23s
Warn: macstadium_host01b missing, not seen for 54h25m13s
Warn: macstadium_host02a missing, not seen for 54h25m0s
Warn: macstadium_host02b missing, not seen for 48h3m37s
Warn: macstadium_host04b missing, not seen for 47h55m10s
Warn: macstadium_host07a missing, not seen for 46h17m36s
Warn: macstadium_host08a missing, not seen for 48h0m48s
Warn: macstadium_host08b missing, not seen for 46h9m34s
Warn: macstadium_host09a missing, not seen for 46h23m44s
Warn: macstadium_host10a missing, not seen for 112h46m24s
Warn: macstadium_host10b missing, not seen for 112h47m30s
Error: 11 machines missing, 55% of capacity

Looking at the macstadiumd host's logs:

gopher@godns:~$ sudo journalctl -f -u makemac
-- Logs begin at Wed 2019-06-05 07:30:30 PDT. --
Jun 05 08:24:56 godns makemac[2341]: 2019/06/05 08:24:56 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:24:56 godns makemac[2341]: 2019/06/05 08:24:56 getting VMWare state: Reading /MacStadium-ATL/host/MacMini_Cluster: EOF
Jun 05 08:24:57 godns makemac[2341]: 2019/06/05 08:24:57 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:24:57 godns makemac[2341]: 2019/06/05 08:24:57 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:24:58 godns makemac[2341]: 2019/06/05 08:24:58 getting VMWare state: Reading /MacStadium-ATL/host/MacMini_Cluster: EOF
Jun 05 08:24:59 godns makemac[2341]: 2019/06/05 08:24:59 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:24:59 godns makemac[2341]: 2019/06/05 08:24:59 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:00 godns makemac[2341]: 2019/06/05 08:25:00 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:00 godns makemac[2341]: 2019/06/05 08:25:00 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:01 godns makemac[2341]: 2019/06/05 08:25:01 getting VMWare state: Reading /MacStadium-ATL/host/MacMini_Cluster: EOF
Jun 05 08:25:02 godns makemac[2341]: 2019/06/05 08:25:02 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:03 godns makemac[2341]: 2019/06/05 08:25:03 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:03 godns makemac[2341]: 2019/06/05 08:25:03 getting VMWare state: Reading /MacStadium-ATL/host/MacMini_Cluster: EOF
Jun 05 08:25:03 godns makemac[2341]: 2019/06/05 08:25:03 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:04 godns makemac[2341]: 2019/06/05 08:25:04 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:05 godns makemac[2341]: 2019/06/05 08:25:05 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:05 godns makemac[2341]: 2019/06/05 08:25:05 getting VMWare state: Reading /MacStadium-ATL/host/MacMini_Cluster: EOF
Jun 05 08:25:06 godns makemac[2341]: 2019/06/05 08:25:06 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:06 godns makemac[2341]: 2019/06/05 08:25:06 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:07 godns makemac[2341]: 2019/06/05 08:25:07 getting VMWare state: Reading /MacStadium-ATL/host/MacMini_Cluster: EOF
Jun 05 08:25:07 godns makemac[2341]: 2019/06/05 08:25:07 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:08 godns makemac[2341]: 2019/06/05 08:25:08 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:09 godns makemac[2341]: 2019/06/05 08:25:09 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:09 godns makemac[2341]: 2019/06/05 08:25:09 getting VMWare state: Reading /MacStadium-ATL/host/MacMini_Cluster: EOF
Jun 05 08:25:10 godns makemac[2341]: 2019/06/05 08:25:10 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"
Jun 05 08:25:10 godns makemac[2341]: 2019/06/05 08:25:10 served cached buildlet of "97a16ac063b06959ba54c187354b7f12"

Something's wrong with the cluster.

Related: the coordinator now polls the makemac JSON status URL, which is currently reporting healthy even though the VMs are down. We should include errors like "getting VMWare state: Reading /MacStadium-ATL/host/MacMini_Cluster: EOF" in the makemac daemon's status response JSON so that they can be shown in the coordinator health output.
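
A rough sketch of what the makemac side could look like, in case it helps the discussion. Everything named here is hypothetical and not taken from the existing makemac code: the errorLog type, the recordf helper, the "healthy" and "errors" JSON fields, and the five-minute window are all assumptions. The idea is only that the daemon remembers its last few errors and reports itself unhealthy while any of them is recent.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"sync"
	"time"
)

// defaultWindow is an assumed cutoff: errors older than this no longer
// count against health.
const defaultWindow = 5 * time.Minute

// statusError is one remembered failure. The JSON field names are hypothetical.
type statusError struct {
	Time time.Time `json:"time"`
	Msg  string    `json:"msg"`
}

// statusResponse is a hypothetical shape for the status JSON.
type statusResponse struct {
	Healthy bool          `json:"healthy"`
	Errors  []statusError `json:"errors,omitempty"`
}

// errorLog keeps the most recent max errors in memory.
type errorLog struct {
	mu     sync.Mutex
	max    int
	window time.Duration
	errs   []statusError
}

func newErrorLog(max int) *errorLog {
	return &errorLog{max: max, window: defaultWindow}
}

// recordf logs the message as usual and also remembers it for the status JSON.
func (l *errorLog) recordf(format string, args ...interface{}) {
	msg := fmt.Sprintf(format, args...)
	log.Print(msg)
	l.mu.Lock()
	defer l.mu.Unlock()
	l.errs = append(l.errs, statusError{Time: time.Now(), Msg: msg})
	if len(l.errs) > l.max {
		l.errs = l.errs[len(l.errs)-l.max:] // drop the oldest entries
	}
}

// status reports unhealthy if any remembered error is newer than the window.
func (l *errorLog) status() statusResponse {
	l.mu.Lock()
	defer l.mu.Unlock()
	cutoff := time.Now().Add(-l.window)
	var recent []statusError
	for _, e := range l.errs {
		if e.Time.After(cutoff) {
			recent = append(recent, e)
		}
	}
	return statusResponse{Healthy: len(recent) == 0, Errors: recent}
}

// ServeHTTP lets an errorLog be mounted directly as the status handler.
func (l *errorLog) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(l.status())
}

func main() {
	errs := newErrorLog(10)
	errs.recordf("getting VMWare state: %s", "Reading /MacStadium-ATL/host/MacMini_Cluster: EOF")

	// Print what the status endpoint would return right now.
	out, err := json.MarshalIndent(errs.status(), "", "  ")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
}
```

With something like this in place, the coordinator would only need to decode the extra "errors" field from the JSON it already polls and list each message in the health output, instead of trusting a bare healthy flag.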

/cc @andybons @bcmills

Labels: Builders (x/build issues: builders, bots, dashboards), FrozenDueToAge, NeedsFix (The path to resolution is known, but the work has not been done.)
