x/playground: serving 502 errors #38530
I'll take a look this afternoon, but it just worked for me now on mobile. There are a couple more stability issues to resolve.
Change https://golang.org/cl/228438 mentions this issue:
Add support for returning a meaningful error message to the user, along with partial build output if present. This change also improves graceful command termination throughout the playground. It introduces internal.WaitOrStop, which will wait for a command to finish, or gracefully stop the command if it has not finished by the time the context is cancelled. This also resolves comments from CL 227351.

Updates golang/go#38052
Updates golang/go#38530
Updates golang/go#25224

Change-Id: I3a36ad2c5f3626c9cd5b3c1139543ccde3e17ba1
Reviewed-on: https://go-review.googlesource.com/c/playground/+/228438
Run-TryBot: Alexander Rakoczy <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Bryan C. Mills <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
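For readers unfamiliar with the pattern, here is a minimal sketch of what a WaitOrStop-style helper can look like, assuming an already-started *exec.Cmd; the actual internal.WaitOrStop in the playground repo may differ in signature and edge-case handling:

```go
package internal

import (
	"context"
	"os"
	"os/exec"
	"time"
)

// waitOrStop waits for the already-started cmd to exit. If ctx is
// cancelled first, it sends cmd the interrupt signal, then falls back
// to Kill if the command is still running after killDelay.
// This is a simplified sketch, not the playground's exact code.
func waitOrStop(ctx context.Context, cmd *exec.Cmd, interrupt os.Signal, killDelay time.Duration) error {
	errc := make(chan error)
	go func() {
		select {
		case errc <- nil:
			// cmd.Wait returned before ctx was cancelled; nothing to do.
			return
		case <-ctx.Done():
		}

		// Ask the process to stop gracefully.
		if err := cmd.Process.Signal(interrupt); err != nil {
			// The process likely already exited; nothing more to do.
			errc <- nil
			return
		}

		// Escalate to Kill if the process lingers past killDelay.
		if killDelay > 0 {
			timer := time.AfterFunc(killDelay, func() { cmd.Process.Kill() })
			defer timer.Stop()
		}

		// Report cancellation as the reason we stopped the command.
		errc <- ctx.Err()
	}()

	waitErr := cmd.Wait()
	if interruptErr := <-errc; interruptErr != nil {
		return interruptErr
	}
	return waitErr
}
```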
It looks like the sandbox instances are running out of memory. I'm investigating.
Never mind, that may be a red herring of Docker doing its job.
I was getting a 502 on a GET to /. (See screenshot.) That wouldn't make its way back to the gvisor world.
I understand that, but when too many requests return 500s, the AppEngine LB will return a 502 for all requests from a particular Flex instance.
Change https://golang.org/cl/229418 mentions this issue:
Thanks @odeke-em. We're aware of an ongoing issue and are still working to figure out the cause.
Sorry, closed the wrong issue while flipping through tabs.
For some context, the new architecture from #25224 introduces sandbox hosts that are separate from our GAE instances (which I have been calling frontends, in the Google parlance). The frontend health check looks like this:

```go
func (s *server) healthCheck() error {
	ctx := context.Background() // TODO: cap it to some reasonable timeout
	resp, err := compileAndRun(ctx, &request{Body: healthProg})
	if err != nil {
		return err
	}
	if resp.Errors != "" {
		return fmt.Errorf("compile error: %v", resp.Errors)
	}
	if len(resp.Events) != 1 || resp.Events[0].Message != "ok" {
		return fmt.Errorf("unexpected output: %v", resp.Events)
	}
	return nil
}
```

With the change to add new sandbox backends, this health check fails if all of the backends are down. AppEngine will mark our app as dead and will try restarting it. Until we understand the instability issues in our backend, I think we should change this health check to only do the "compile" phase, which happens on the frontend. It would still be helpful to mark instances unhealthy if binary compilation is failing entirely. This should help us stop serving 502s, but it doesn't get us out of the woods of our sandbox hosts being unstable. I have other CLs planned for investigating that, and it starts with measurements.
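As a rough sketch of that proposal, the frontend-only check could drop the run phase entirely; compileOnly below is a hypothetical helper for just the frontend build step, and the actual CL may structure this differently:

```go
// Sketch of a frontend-only health check. compileOnly is a hypothetical
// helper that runs only the build phase on the frontend, so the check no
// longer depends on the sandbox backends being reachable.
func (s *server) healthCheck(ctx context.Context) error {
	resp, err := compileOnly(ctx, &request{Body: healthProg})
	if err != nil {
		return err
	}
	if resp.Errors != "" {
		return fmt.Errorf("compile error: %v", resp.Errors)
	}
	return nil
}
```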
I've discussed this with @toothrot and agree that this is a good next step.
It's also worth mentioning that we should move to split health checks: https://cloud.google.com/appengine/docs/flexible/python/reference/app-yaml#updated_health_checks
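For reference, split health checks on App Engine flex are configured in app.yaml roughly like this; the tuning values below are illustrative placeholders, not the playground's actual settings:

```yaml
# Sketch of split (liveness + readiness) health checks in app.yaml.
# Interval and threshold values are assumptions for illustration.
liveness_check:
  path: "/liveness_check"
  check_interval_sec: 30
  failure_threshold: 4
readiness_check:
  path: "/readiness_check"
  check_interval_sec: 5
  app_start_timeout_sec: 300
```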
Change https://golang.org/cl/229677 mentions this issue:
This change avoids cascading sandbox backend failures to the frontend instances. The timeout TODO is safe to remove as we now pass in the request context for cancellation, and sandboxBuild has its own build timeout.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: If892f86bad08c55429b6ebab768b83c5d7621cf1
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229677
Run-TryBot: Alexander Rakoczy <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Carlos Amedee <[email protected]>
Change https://golang.org/cl/229679 mentions this issue:
Change https://golang.org/cl/229680 mentions this issue:
- Specify the correct image in konlet.yaml. gvisor-playground-sandbox is the child-process container.
- Correct interpolation in config identifiers, which is deprecated.
- Set min_ready_sec for the update policy to avoid causing an outage when updating.
- Use name_prefix for instance_template instead of name, which allows updates. Templates are immutable, so updating them in place was previously not possible.

Updates golang/go#38530
Updates golang/go#25224

Change-Id: I3f7618b8e378eaa9714e571b90390b7052bf2855
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229418
Run-TryBot: Alexander Rakoczy <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Change https://golang.org/cl/229681 mentions this issue:
This changes the sandbox VMs to allow them to write to StackDriver metrics.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: I82954b8eed3664289f5c69c0f5301a72206f0948
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229681
Run-TryBot: Alexander Rakoczy <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
This change adds OpenCensus's HTTP instrumentation to the sandbox. In development mode, it exposes a Prometheus metrics interface on /statusz. This is the first in a series of CLs to add instrumentation to different parts of the sandbox to help investigate instability issues. For now, reporting metrics on our responses per handler to StackDriver will be helpful. OpenTelemetry would be preferable, as it is the successor to OpenCensus; however, its StackDriver integration is not quite done.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: I600fd695bb66c8bee16bc0b778d51930f4cdd476
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229679
Run-TryBot: Alexander Rakoczy <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Carlos Amedee <[email protected]>
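For illustration, wiring OpenCensus's ochttp plugin into an HTTP server looks roughly like this; the handler names and port are placeholders, not the sandbox's actual wiring:

```go
package main

import (
	"log"
	"net/http"

	"go.opencensus.io/plugin/ochttp"
	"go.opencensus.io/stats/view"
)

func main() {
	// Register the default server views (request counts, latency, etc.).
	if err := view.Register(ochttp.DefaultServerViews...); err != nil {
		log.Fatalf("view.Register: %v", err)
	}

	mux := http.NewServeMux()
	mux.HandleFunc("/run", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok")) // placeholder handler for the sketch
	})

	// Wrapping the mux in ochttp.Handler records metrics for every request.
	log.Fatal(http.ListenAndServe(":8080", &ochttp.Handler{Handler: mux}))
}
```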
This change adds a metric to count the number of running containers, according to Docker.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: Id989986928dff594cb1de0903a56dcffed8220c4
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229680
Run-TryBot: Alexander Rakoczy <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Carlos Amedee <[email protected]>
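One way to source such a metric is to ask the Docker CLI directly; this is a hedged sketch of that approach, and the actual CL may use different flags or the Docker API instead:

```go
package main

import (
	"context"
	"os/exec"
	"strings"
)

// countRunningContainers asks Docker for the IDs of running containers
// and counts them. Sketch only; the playground's implementation may differ.
func countRunningContainers(ctx context.Context) (int, error) {
	// `docker ps -q` prints one container ID per line for running containers.
	out, err := exec.CommandContext(ctx, "docker", "ps", "-q").Output()
	if err != nil {
		return 0, err
	}
	trimmed := strings.TrimSpace(string(out))
	if trimmed == "" {
		return 0, nil
	}
	return len(strings.Split(trimmed, "\n")), nil
}
```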
Change https://golang.org/cl/229958 mentions this issue:
The generic_task target has a required label format that was not being met. Moves metadata fetching into a helper function. Removes the call to view.RegisterExporter for the StackDriver exporter, which was unnecessary.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: Ib009f5ce906f5b9479cdda8c7e8322d06e3036e4
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229958
Run-TryBot: Alexander Rakoczy <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Carlos Amedee <[email protected]>
Change https://golang.org/cl/229980 mentions this issue:
Change https://golang.org/cl/229981 mentions this issue:
After clearing up the concurrency issues and modifying the frontend health check, the latest round of metrics and load testing gave us the results we were looking for. I performed a load test of a trivial program to figure out how the sandbox scaled.
https://golang.org/cl/229980 adds container creation time metrics. Locally, it takes about 600ms to 1s to create a container, which means it could take as long as 8s to recover from 8 requests. https://golang.org/cl/229981 addresses this by creating containers in parallel, allowing us to keep up with the request load. I'm expecting this to improve CPU utilization, hopefully triggering auto-scaling under load.
This change measures the latency and success of container creation. These metrics will help capacity planning and investigating production issues.

Updates golang/go#25224
Updates golang/go#38530

Change-Id: Id7f373acb8741d4465c6e632badb188b6e855787
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229980
Run-TryBot: Alexander Rakoczy <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
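A latency metric like this is typically defined in OpenCensus as a measure plus a distribution view; the names and bucket bounds below are illustrative assumptions, not the playground's own:

```go
package metrics

import (
	"context"
	"log"
	"time"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// containerCreateLatency is an illustrative measure for how long a
// single sandbox container takes to create.
var containerCreateLatency = stats.Float64(
	"playground/container_create_latency",
	"Latency of sandbox container creation",
	stats.UnitMilliseconds,
)

func init() {
	// A distribution view buckets latencies so percentiles can be derived.
	err := view.Register(&view.View{
		Name:        "playground/container_create_latency",
		Description: "Distribution of container creation latency",
		Measure:     containerCreateLatency,
		Aggregation: view.Distribution(100, 250, 500, 1000, 2000, 5000),
	})
	if err != nil {
		log.Fatalf("view.Register: %v", err)
	}
}

// recordCreateLatency records how long one container creation took.
func recordCreateLatency(ctx context.Context, start time.Time) {
	ms := float64(time.Since(start)) / float64(time.Millisecond)
	stats.Record(ctx, containerCreateLatency.M(ms))
}
```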
Creating a container in the sandbox takes 500ms to 1s. The sandbox was creating containers serially, but serving requests in parallel. This means that we can be starved for workers with a trivial number of requests. In addition, the sandbox in production is not CPU-bound, meaning we probably have room to do some extra work while serving a request. This CL introduces a worker pool to create containers. It also changes the readyContainer chan to unbuffered to avoid having twice as many containers as we expect while idle (the container waiting to be sent plus those already in the channel's buffer).

Updates golang/go#25224
Updates golang/go#38530

Change-Id: I0e535cf65409c3dbf32329577a1c0687c2614a0d
Reviewed-on: https://go-review.googlesource.com/c/playground/+/229981
Run-TryBot: Alexander Rakoczy <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Carlos Amedee <[email protected]>
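Here is a hedged sketch of the pattern that commit message describes: a fixed pool of workers creating containers and handing them off over an unbuffered channel, so no extra containers pile up while idle. All names (container, createContainer, startContainerPool) are illustrative, not the sandbox's actual API:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"
)

// container stands in for a sandbox container; createContainer stubs
// the Docker call, simulating the ~500ms-1s creation cost.
type container struct{ id int }

func createContainer(ctx context.Context, id int) (*container, error) {
	time.Sleep(500 * time.Millisecond)
	return &container{id: id}, nil
}

// startContainerPool launches numWorkers goroutines that create
// containers and hand them off over an unbuffered channel, so at most
// numWorkers containers exist beyond those already handed out.
func startContainerPool(ctx context.Context, numWorkers int) <-chan *container {
	ready := make(chan *container) // unbuffered: no idle stockpile
	for i := 0; i < numWorkers; i++ {
		go func(worker int) {
			for n := 0; ctx.Err() == nil; n++ {
				c, err := createContainer(ctx, worker*1000+n)
				if err != nil {
					log.Printf("worker %d: creating container: %v", worker, err)
					continue
				}
				select {
				case ready <- c: // block until a request takes this container
				case <-ctx.Done():
					return
				}
			}
		}(i)
	}
	return ready
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	ready := startContainerPool(ctx, 4)
	for i := 0; i < 4; i++ {
		fmt.Printf("got container %d\n", (<-ready).id)
	}
}
```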
The results of those changes are similar, but I was unable to cause the instances to become unhealthy, nor could I make the instances autoscale with trivial programs. Many CPU-bound snippets would still trigger horizontal scaling successfully. I honestly think this is good enough for this issue. For scaling further, we may want to consider:
I'm going to close this issue, as I think we're now in a good place when it comes to our daily usage. We can re-open it, or open a new one, if we notice something specific. I'm very happy about our new metrics; hopefully we'll have much more insight into what's going on the next time something like this comes up for the playground.
I've been seeing a number of 502s on the playground just now.
(Hope it wasn't my fault!)
/cc @toothrot