Memory corruption with scheduler=coroutines #2101


Closed · dgryski opened this issue Sep 8, 2021 · 20 comments
Labels: bug, core

@dgryski (Member) commented Sep 8, 2021

package main

import (
        "context"
        "os"
)

func main() {
        flog := os.Stdout
        println("outside flog is", flog)

        ServeFunc(func(context context.Context) {
                println("inside flog is: ", flog)
        })
}

func Serve(h Handler) {
        _, cancel := context.WithCancel(context.Background())
        defer cancel()
        h.ServeHTTP(nil)
}

func ServeFunc(f HandlerFunc) {
        Serve(f)
}

type Handler interface {
        ServeHTTP(ctx context.Context)
}

type HandlerFunc func(ctx context.Context)

func (f HandlerFunc) ServeHTTP(ctx context.Context) {
        f(ctx)
}

Produces the following output:

outside flog is 0x00010704
panic: runtime error: nil pointer dereference
Error: failed to run main module `m.wasm`

Caused by:
    0: failed to invoke command default
    1: wasm trap: unreachable
       wasm backtrace:
           0: 0x2e1e - runtime.abort
                           at /Users/dgryski/go/src/tinygo.org/tinygo/src/runtime/runtime_tinygowasm.go:63:6
                     - runtime.runtimePanic
                           at /Users/dgryski/go/src/tinygo.org/tinygo/src/runtime/panic.go:20:7
           1:  0xc65 - runtime.nilPanic
                           at /Users/dgryski/go/src/tinygo.org/tinygo/src/runtime/panic.go:32:14
           2: 0xb9ae - main.Serve
                            at /Users/dgryski/go/src/github.com/dgryski/wasmplay/cmd/closenil/main.go
           3: 0x51f5 - (*internal/task.Task).Resume
                           at /Users/dgryski/go/src/tinygo.org/tinygo/src/internal/task/task_coroutine.go:23:12
                     - runtime.scheduler
                           at /Users/dgryski/go/src/tinygo.org/tinygo/src/runtime/scheduler.go:171:11
                     - runtime.run
                           at /Users/dgryski/go/src/tinygo.org/tinygo/src/runtime/scheduler_any.go:24:11
                     - _start
                           at /Users/dgryski/go/src/tinygo.org/tinygo/src/runtime/runtime_wasm_wasi.go:21:5

This was run with a TinyGo build from dev and the patch from #2100.

I've tried to shrink this down as much as I can. The println inside the closure generates a nil-pointer crash when trying to restore flog from the coroutine context.

It's possible this will go away with LLVM 12 and its coroutine changes.

This bug is not present when built with -gc=leaking, but I think that's because the LLVM coroutine logic needs more function calls (which are present when more logic is inserted for allocations, etc.), rather than this being an issue with the conservative collector itself. (We have a similar issue in a larger codebase that is built with the leaking collector.)

@aykevl (Member) commented Sep 8, 2021

This appears to be unrelated to #2041 because it doesn't trigger an assertion failure in LLVM. Which means it's probably a real bug in TinyGo.

@aykevl (Member) commented Sep 8, 2021

Some digging:

  • The panic goes away with -opt=1 (which makes debugging harder).
  • The panic only exists with -scheduler=coroutines.
  • The panic only exists on -target=wasi and baremetal targets (tested cortex-m-qemu and riscv-qemu). Not with -target=wasm, which is interesting.

The code here does seem to trigger the LLVM bug in #2041 but monkey-patching those issues doesn't fix it. So I think it's something else.

@aykevl (Member) commented Sep 8, 2021

This bug is not present when built with -gc=leaking, but I think that's because the LLVM coroutine logic needs more function calls (which are present when more logic is inserted for allocations, etc.), rather than this being an issue with the conservative collector itself.

Indeed. I get the same panic with -gc=leaking -scheduler=coroutines -target=cortex-m-qemu:

$ tinygo gdb -scheduler=coroutines -target=cortex-m-qemu -gc=leaking ./tmp/issue2101.go
[...]
Remote debugging using :1234
0x000019c6 in Reset_Handler () at /home/ayke/src/github.com/tinygo-org/tinygo/src/runtime/runtime_cortexm.go:36
36			*(*uint32)(dst) = *(*uint32)(src)
(gdb) b runtime.runtimePanic
Breakpoint 1 at 0xafe: file /home/ayke/src/github.com/tinygo-org/tinygo/src/runtime/panic.go, line 18.
(gdb) c
Continuing.
outside flog is 0x20001004

Breakpoint 1, runtime.runtimePanic (msg=...) at /home/ayke/src/github.com/tinygo-org/tinygo/src/runtime/panic.go:18
18		printstring("panic: runtime error: ")
(gdb) bt
#0  runtime.runtimePanic (msg=...) at /home/ayke/src/github.com/tinygo-org/tinygo/src/runtime/panic.go:18
#1  0x00000b60 in runtime.nilPanic () at /home/ayke/src/github.com/tinygo-org/tinygo/src/runtime/panic.go:32
#2  0x000026b2 in main.Serve (h=...) at /home/ayke/src/github.com/tinygo-org/tinygo/tmp/issue2101.go:17
#3  0x00001a34 in (*internal/task.Task).Resume (t=<optimized out>) at /home/ayke/src/github.com/tinygo-org/tinygo/src/internal/task/task_coroutine.go:23
#4  runtime.scheduler () at /home/ayke/src/github.com/tinygo-org/tinygo/src/runtime/scheduler.go:171
#5  runtime.run () at /home/ayke/src/github.com/tinygo-org/tinygo/src/runtime/scheduler_any.go:24
#6  0x000019d0 in Reset_Handler () at /home/ayke/src/github.com/tinygo-org/tinygo/src/runtime/runtime_cortexm_qemu.go:23
(gdb) 

So it's not related to the GC. I suspect it's the combination of -scheduler=coroutines and the specific os package that is in use (the os package is implemented somewhat differently for wasm and wasi).

@dgryski (Member, Author) commented Sep 9, 2021

I think the os package is a distraction. Here's a smaller version I managed to create:

package main

import (
        "context"
)

type flogStruct struct {
        i int
}

func main() {
        flog := &flogStruct{1}
        println("outside flog is", flog)

        Serve(func(i int) {
                println("inside flog is: ", flog)
        })
}

func Serve(f func(i int)) {
        ctx, _ := context.WithCancel(context.Background())
        _ = ctx
        f(2)
}

This one doesn't crash, but does show that flog inside the closure is incorrect:

outside flog is 0x00010830
inside flog is:  nil

deadprogram added the bug and wasm labels on Sep 9, 2021
@dgryski (Member, Author) commented Sep 9, 2021

@deadprogram This bug is not limited to wasm. The most recent reproducer fails when built on my Mac:

outside flog is 0x00000001113b4050
inside flog is 0x0000000000000001

Not nil, but also clearly not a valid pointer.

deadprogram added the core label and removed the wasm label on Sep 9, 2021
@dgryski (Member, Author) commented Sep 10, 2021

Here's another reduced test case with the contents of the context package inlined as much as possible:

package main

import (
        "sync"
)

type flogStruct struct {
        i int
}

func main() {
        flog := &flogStruct{23}
        println("outside flog is", flog)

        Serve(func() {
                println("inside flog is", flog)
        })
}

func Serve(f func()) {
        WithCancel(new(emptyCtx)) // nil: LLVM ERROR: Cannot select: intrinsic %llvm.coro.size
        f()
}

type Context interface {
        Done() <-chan struct{}
}

type emptyCtx int

func (*emptyCtx) Done() <-chan struct{} {
        return nil
}

func WithCancel(parent Context) (ctx Context, cancel func()) {
        c := cancelCtx{Context: nil}
        propagateCancel(parent, &c)
        return nil, nil
}

func propagateCancel(parent Context, _ Context) { // , _ Context) {
        done := parent.Done()
        select {
        case <-done:
                return
        default:
        }
}

type cancelCtx struct {
        Context
        mu sync.Mutex
}

func (c *cancelCtx) Done() <-chan struct{} {
        c.mu.Lock()
        return nil
}

Building with

tinygo build -opt=1 -gc=leaking -scheduler=coroutines -o m.tiny main.go

produces

outside flog is 0x0000000108c89050
inside flog is 0x0000000108c89000

Both flog addresses should be the same.

Edit: Just noticed this particular reducer only triggers with -opt=1. (Previous reducers triggered with -opt=z, the default, but I switched to -opt=1 in an attempt to ease debugging.)

dgryski changed the title from "Memory corruption with gc=conservative, scheduler=coroutines, target=wasi" to "Memory corruption with scheduler=coroutines" on Sep 11, 2021
@dgryski (Member, Author) commented Sep 15, 2021

This is starting to seem like a lifetime issue. Adding either of these lines (uncommented)

        // flog = nil
        // println("after flog", flog)

after the call to Serve() causes the bug to go away.

Oddly, adding _ = flog instead causes a substantially different heap address to be printed.
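
For reference, this is the workaround applied to the reducer above (a sketch; it reuses flogStruct and Serve from that reducer, and any later use of flog has the same effect):

func main() {
        flog := &flogStruct{23}
        println("outside flog is", flog)

        Serve(func() {
                println("inside flog is", flog)
        })

        println("after flog", flog) // keeping flog live past the call to
                                    // Serve extends its lifetime across the
                                    // suspension point, hiding the corruption
}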

@aykevl (Member) commented Sep 15, 2021

I have managed to reduce this bug even further. I believe this is still the same bug:

package main

import "time"

func main() {
	n := byte(23)
	foo(&n)
}

func foo(n *byte) {
	time.Sleep(1)
	println("n is", *n)
}

This should print 23, but TinyGo prints a garbage value instead:

$ tinygo run -opt=1 -gc=leaking -scheduler=coroutines tmp/issue2101.go
n is 120

The problem is that n in main is stack allocated, which is a valid optimization before the coroutines pass. The coroutines pass converts foo to a coroutine, but does not convert main to a coroutine. Therefore, the alloca is not saved as part of suspending the coroutine stack.

@niaow I'm fairly certain this is a bug in the coroutine conversion pass. It should convert parents of async functions that contain an alloca instruction that is alive between suspension points (or, conservatively, any function that contains an alloca instruction). What do you think? What would the intended behavior be in this case?

@niaow (Member) commented Sep 15, 2021

So, more specifically: main() contains only tail calls, so the function tries to skip the full coroutine transform. Unfortunately, the tail-call-only case does not currently handle this situation. I think the correct behavior would be to find live stack allocations and move them specifically to the heap in the tail-call-only case. The not-coroutines replacement I have been working on has a much less crazy solution, but I think this is the correct approach for this situation.
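
In Go terms, the effect of that transform on the earlier reducer would be roughly the following (a hypothetical source-level illustration; the actual pass rewrites LLVM IR, not Go source):

package main

import "time"

func main() {
        n := new(byte) // the live stack allocation, moved to the heap so it
        *n = 23        // remains valid while foo is suspended
        foo(n)
}

func foo(n *byte) {
        time.Sleep(1)       // suspension point: foo's coroutine frame is saved
        println("n is", *n) // n now points at a heap object, so this prints 23
}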

@aykevl (Member) commented Sep 15, 2021

I think the correct behavior would be to find live stack allocations and move them specifically to the heap in the tail-call-only case.

Yes, that would also be a possible solution. It seems slightly more risky to me, but I don't see a reason why it wouldn't work.

@niaow (Member) commented Sep 15, 2021

Alright, I can add a check and convert to a coroutine in that case I guess.

@aykevl (Member) commented Sep 15, 2021

Either seems fine by me, if it fixes the bug. Hopefully the new asyncify solution will make all of this obsolete.

@niaow (Member) commented Sep 15, 2021

Well, this appears to involve an additional problem. After switching over to the coroutine transform, the coroutine lowering seems not to be moving the alloca into the frame.

@niaow (Member) commented Sep 15, 2021

Oh... there is actually another bug with tail call lowering, I guess...

@niaow (Member) commented Sep 15, 2021

So it appears that all coroutine tail calls are implemented incorrectly.

@dgryski (Member, Author) commented Sep 18, 2021

I've minimized the reproducer in #2101 (comment) a bit more, now that I know about using time.Sleep(1) to force the coroutine pass to do stuff:

package main

import "time"

func main() {
        p := new(int)
        println("addr", p)

        f := func() {
                time.Sleep(1)
                println("addr", p)
        }

        f()
}

This fails even after the patch from #2117.

Building with -opt=1 gives

addr 0x000000010ef63070
addr 0x0000000000cca687

on my Mac.

Adding println("addr") after the call to f() doesn't change the result, but adding println("addr", p) does. So, again it looks like a lifetime issue related to closures and the coroutine scheduler.

@dgryski (Member, Author) commented Oct 6, 2021

@aykevl (Member) commented Oct 6, 2021

That first bug is just a code quality issue; it shouldn't result in an actual bug. The second one appears to me to be a Clang issue, although I'm not entirely sure.

In the end, the only real way to investigate is to shrink it down and inspect the IR. However, as I found in #1720, I believe there are fundamental issues in the way we lower goroutines in WebAssembly, so maybe a much bigger change is needed to fix it properly.

@dgryski (Member, Author) commented Oct 6, 2021

As I've mentioned, it seems like asyncify is the way forward in the short term, with the wasm stack-switching proposal as the longer-term solution.

@dgryski (Member, Author) commented Nov 4, 2022

Coroutines were removed a while ago.

dgryski closed this as completed on Nov 4, 2022