Skip to content

possible race conditions on intel MacOS #15730

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
siddarthkay opened this issue Jul 23, 2024 · 19 comments
Open

possible race conditions on intel MacOS #15730

siddarthkay opened this issue Jul 23, 2024 · 19 comments
Assignees
Labels
bug Something isn't working core-team crash E:Desktop Keycard Bug Bug found after initial keycard development wallet-team
Milestone

Comments

@siddarthkay
Copy link
Contributor

Bug Report

Description

It was observed that status-desktop would often crash on onboarding stage on intel MacOS.
After trying to figure out why that happens here #15134
I found out that this issue was often fixed by doing nothing.
All the signs point to a possible race condition in the thread pool setup where a nim service calls some business logic in status-go or status-keycard-go.

I also discovered that for nim interop of go we free the memory by calling a go method which is called via a nim interface.
Exhibit A :
Lets take a look at how we consume keycardInitFlow which lives in vendor/status-keycard-go/shared/main.go

Source of keycardInitFlow is :

//export KeycardInitFlow
func KeycardInitFlow(storageDir *C.char) *C.char {
	var err error
	globalFlow, err = skg.NewFlow(C.GoString(storageDir))

	return retErr(err)
}

This function is wrapped around another wrapper in vendor/nim-keycard-go/keycard_go.nim and the source looks like this :

import ./keycard_go/impl as go_shim

export KeycardSignalCallback

proc keycardInitFlow*(storageDir: string): string =
  var funcOut = go_shim.keycardInitFlow(storageDir.cstring)
  defer: go_shim.free(funcOut)
  return $funcOut

It is also important to look at the source of go_shim :
The source lives here : vendor/nim-keycard-go/keycard_go/impl.nim

# go functions do not raise nim exceptions and do not interact with the Nim gc
{.push raises: [], gcsafe.}

proc free*(param: pointer) {.importc: "Free".}

proc keycardInitFlow*(storageDir: cstring): cstring {.importc: "KeycardInitFlow".}

Finally this code is being consumed in service.nim like this :

  proc init*(self: Service) =
    if self.doLogging:
      debug "init keycard using ", pairingsJson=status_const.KEYCARDPAIRINGDATAFILE
    let initResp = keycard_go.keycardInitFlow(status_const.KEYCARDPAIRINGDATAFILE)
    if self.doLogging:
      debug "initialization response: ", initResp

service.nim lives here : src/app_service/service/keycard/service.nim

I tried to create a minimal reproduction repo here but I was unable to reproduce the crash :
https://github.com/siddarthkay/status-desktop-intel-crash-reproducer
Although my efforts did not include a thread pool and that could be the key to reproducing the race condition.

Another key factor in discovering this race condition was upgrading go to 1.21.
go 1.21 has brought significant changes to its garbage collector and the crash we would see would often link to the code related to garbage collection.

error message :

bad flushGen 0 in prepareForSweep; sweepgen 14
fatal error: bad flushGen

reference in go source :
https://github.com/golang/go/blob/8f5c6904b616fd97dde4a0ba2f5c71114e588afd/src/runtime/mcache.go#L325

// prepareForSweep flushes c if the system has entered a new sweep phase
// since c was populated. This must happen between the sweep phase
// starting and the first allocation from c.
func (c *mcache) prepareForSweep() {
	// Alternatively, instead of making sure we do this on every P
	// between starting the world and allocating on that P, we
	// could leave allocate-black on, allow allocation to continue
	// as usual, use a ragged barrier at the beginning of sweep to
	// ensure all cached spans are swept, and then disable
	// allocate-black. However, with this approach it's difficult
	// to avoid spilling mark bits into the *next* GC cycle.
	sg := mheap_.sweepgen
	flushGen := c.flushGen.Load()
	if flushGen == sg {
		return
	} else if flushGen != sg-2 {
		println("bad flushGen", flushGen, "in prepareForSweep; sweepgen", sg)
		throw("bad flushGen")
	}
	c.releaseAll()
	stackcache_clear(c)
	c.flushGen.Store(mheap_.sweepgen) // Synchronizes with gcStart
}

At the moment this issue is mitigated by introducing some sleep time in this PR : #15194
However this is not a proper solution and we may run into race conditions elsewhere in the future.

Steps to reproduce

  • find and remove the sleep code and start a fresh app on intel MacOS

Expected behaviour

  • must not crash

Actual behaviour

  • crash
@siddarthkay siddarthkay added the bug Something isn't working label Jul 23, 2024
@jrainville jrainville added this to the 2.31.0 Beta milestone Jul 23, 2024
@jrainville jrainville added E:Desktop Keycard Bug Bug found after initial keycard development backend-team labels Jul 23, 2024
@alaibe alaibe removed this from the 2.31.0 Beta milestone Aug 27, 2024
@iurimatias iurimatias added this to the 2.32.0 Beta milestone Oct 30, 2024
@jrainville jrainville modified the milestones: 2.32.0 Beta, 2.33.0 Beta Dec 4, 2024
@iurimatias iurimatias modified the milestones: 2.33.0 Beta, 2.34.0 Beta Jan 29, 2025
@Khushboo-dev-cpp
Copy link
Contributor

I am observing a crash on my intel macbook Sequoia. Maybe it is linked -

INFO [03-05|16:04:42.836] New local node record                    seq=1,741,190,682,642 id=e115e7f0b008553e ip=192.168.0.107 udp=64115 tcp=57374
fatal error: bad sweepgen in refill

goroutine 17 gp=0xc000006700 m=1 mp=0xc0000a6008 [running, locked to thread]:
runtime.throw({0x112123ba3?, 0x48?})
        /usr/local/go/src/runtime/panic.go:1023 +0x5c fp=0xc00006bb28 sp=0xc00006baf8 pc=0x111dbc99c
runtime.(*mcache).refill(0x10f5a71d8, 0x1?)
        /usr/local/go/src/runtime/mcache.go:157 +0x20d fp=0xc00006bb68 sp=0xc00006bb28 pc=0x111d9c2ad
runtime.(*mcache).nextFree(0x10f5a71d8, 0x59)
        /usr/local/go/src/runtime/malloc.go:948 +0x85 fp=0xc00006bba0 sp=0xc00006bb68 pc=0x111d932a5
runtime.mallocgc(0x1000, 0x11220e780, 0x1)
        /usr/local/go/src/runtime/malloc.go:1149 +0x4a5 fp=0xc00006bc28 sp=0xc00006bba0 pc=0x111d938c5
runtime.makeslice(0xc000006700?, 0x112111676?, 0x1c9d48ed8?)
        /usr/local/go/src/runtime/slice.go:107 +0x49 fp=0xc00006bc50 sp=0xc00006bc28 pc=0x111dd57e9
bufio.NewReaderSize(...)
        /usr/local/go/src/bufio/bufio.go:54
bufio.NewReader(...)
        /usr/local/go/src/bufio/bufio.go:60
net/http/httptest.NewRequest({0x112111676?, 0xc00305d3e0?}, {0x11211167a, 0x4}, {0x11228de18, 0xc002e4be30})
        /usr/local/go/src/net/http/httptest/httptest.go:44 +0x10a fp=0xc00006bd48 sp=0xc00006bc50 pc=0x112064fca
main.KeycardCallRPC(0x0?)
        /Users/khushboomehta/Documents/status-desktop/vendor/status-keycard-go/shared/api_session.go:56 +0xd8 fp=0xc00006be38 sp=0xc00006bd48 pc=0x1120eec38
_cgoexp_1e0169bf1ace_KeycardCallRPC(0x7ff7b2dd9570)
        _cgo_gotypes.go:200 +0x1e fp=0xc00006be58 sp=0xc00006be38 pc=0x1120ef53e
runtime.cgocallbackg1(0x1120ef520, 0x7ff7b2dd9570, 0x0)
        /usr/local/go/src/runtime/cgocall.go:420 +0x295 fp=0xc00006bf18 sp=0xc00006be58 pc=0x111d8a755
runtime.cgocallbackg(0x1120ef520, 0x7ff7b2dd9570, 0x0)
        /usr/local/go/src/runtime/cgocall.go:339 +0x136 fp=0xc00006bf90 sp=0xc00006bf18 pc=0x111d8a416
runtime.cgocallbackg(0x1120ef520, 0x7ff7b2dd9570, 0x0)
        <autogenerated>:1 +0x29 fp=0xc00006bfb8 sp=0xc00006bf90 pc=0x111df4589
runtime.cgocallback(0x0, 0x0, 0x0)
        /usr/local/go/src/runtime/asm_amd64.s:1079 +0xcc fp=0xc00006bfe0 sp=0xc00006bfb8 pc=0x111df1bec
runtime: g 17: unexpected return pc for runtime.cgocallback called from 0x116266c81
stack: frame={sp:0xc00006bfb8, fp:0xc00006bfe0} stack=[0xc000068000,0xc00006c000)
0x000000c00006beb8:  0x0000000000000000  0x000000c000006700 
0x000000c00006bec8:  0x000000c00006beb0  0x0000000111d8a800 <runtime.cgocallbackg1.deferwrap2+0x0000000000000000> 
0x000000c00006bed8:  0x000000c00006be9d  0x00007ff7b2ddd000 
0x000000c00006bee8:  0x0000000000000000  0x0000000000000000 
0x000000c00006bef8:  0x0000000000000000  0x000000c00006bed0 
0x000000c00006bf08:  0x000000c00006bf80  0x0000000111d8a416 <runtime.cgocallbackg+0x0000000000000136> 
0x000000c00006bf18:  0x00000001120ef520 <_cgoexp_1e0169bf1ace_KeycardCallRPC+0x0000000000000000>  0x00007ff7b2dd9570 
0x000000c00006bf28:  0x0000000000000000  0x0000000000000000 
0x000000c00006bf38:  0x0000000000000000  0x0000000000000000 
0x000000c00006bf48:  0x0000000000000000  0x0000000000000000 
0x000000c00006bf58:  0x0000000000000000  0x0000000116266c81 
0x000000c00006bf68:  0x000000c00006bfe0  0x000000c000006700 
0x000000c00006bf78:  0x000000c0000a6008  0x000000c00006bfa8 
0x000000c00006bf88:  0x0000000111df4589 <runtime.cgocallbackg+0x0000000000000029>  0x00000001120ef520 <_cgoexp_1e0169bf1ace_KeycardCallRPC+0x0000000000000000> 
0x000000c00006bf98:  0x00007ff7b2dd9570  0x0000000000000000 
0x000000c00006bfa8:  0x00007ff7b2dd9508  0x0000000111df1bec <runtime.cgocallback+0x00000000000000cc> 
0x000000c00006bfb8: <0x00000001120ef520 <_cgoexp_1e0169bf1ace_KeycardCallRPC+0x0000000000000000>  0x00007ff7b2dd9570 
0x000000c00006bfc8:  0x0000000000000000  0x0000000000000000 
0x000000c00006bfd8:  0x0000000116266c81 >0x0000000000000000 
0x000000c00006bfe8:  0x0000000000000000  0x0000000000000000 
0x000000c00006bff8:  0x0000000000000000 

goroutine 5 gp=0x1c0000061c0 m=5 mp=0x1c000079b08 [syscall]:
runtime.cgocall(0x1120f07d0, 0x1c00003da48)
        /usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0x1c00003da20 sp=0x1c00003d9e8 pc=0x111d8a00b
github.com/ebfe/scard._Cfunc_SCardGetStatusChange(0x1, 0xffffffff, 0x1c00041c080, 0x1)
        _cgo_gotypes.go:238 +0x4b fp=0x1c00003da48 sp=0x1c00003da20 pc=0x111fbfe6b
github.com/ebfe/scard.scardGetStatusChange.func1(0x1, 0xffffffff, 0x1c000416150?, {0x1c00041c040?, 0x1, 0x0?})
        /Users/khushboomehta/go/pkg/mod/github.com/ebfe/[email protected]/scard_darwin.go:72 +0x92 fp=0x1c00003da98 sp=0x1c00003da48 pc=0x111fc0cf2
github.com/ebfe/scard.scardGetStatusChange(0x1, 0xffffffff, {0x1c00041c040, 0x1, 0x1})
        /Users/khushboomehta/go/pkg/mod/github.com/ebfe/[email protected]/scard_darwin.go:72 +0x105 fp=0x1c00003db00 sp=0x1c00003da98 pc=0x111fc0b45
github.com/ebfe/scard.(*Context).GetStatusChange(0x1c0003140c8, {0x1c0003f2180, 0x1, 0x1125a9c20?}, 0x112268ea0?)
        /Users/khushboomehta/go/pkg/mod/github.com/ebfe/[email protected]/scard.go:118 +0x1b3 fp=0x1c00003dc10 sp=0x1c00003db00 pc=0x111fbe793
github.com/status-im/status-keycard-go/internal.(*KeycardContextV2).detectionRoutine(0x1c0002d6000, {0x112290fa8, 0x1c0003040a0}, 0x1c000332000)
        /Users/khushboomehta/Documents/status-desktop/vendor/status-keycard-go/internal/keycard_context_v2.go:221 +0x4a5 fp=0x1c00003df60 sp=0x1c00003dc10 pc=0x112057745
github.com/status-im/status-keycard-go/internal.(*KeycardContextV2).startDetectionLoop.func1()
        /Users/khushboomehta/Documents/status-desktop/vendor/status-keycard-go/internal/keycard_context_v2.go:178 +0xa5 fp=0x1c00003dfe0 sp=0x1c00003df60 pc=0x112057205
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0x1c00003dfe8 sp=0x1c00003dfe0 pc=0x111df1e21
created by github.com/status-im/status-keycard-go/internal.(*KeycardContextV2).startDetectionLoop in goroutine 17
        /Users/khushboomehta/Documents/status-desktop/vendor/status-keycard-go/internal/keycard_context_v2.go:173 +0xc5

goroutine 17 gp=0x1c000006700 m=1 mp=0x1c000078008 [syscall, locked to thread]:
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0x1c00023bfe8 sp=0x1c00023bfe0 pc=0x111df1e21

goroutine 2 gp=0x1c000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /usr/local/go/src/runtime/proc.go:402 +0xce fp=0x1c000072fa8 sp=0x1c000072f88 pc=0x111dbf8ce
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:408
runtime.forcegchelper()
        /usr/local/go/src/runtime/proc.go:326 +0xb3 fp=0x1c000072fe0 sp=0x1c000072fa8 pc=0x111dbf753
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0x1c000072fe8 sp=0x1c000072fe0 pc=0x111df1e21
created by runtime.init.6 in goroutine 1
        /usr/local/go/src/runtime/proc.go:314 +0x1a

goroutine 3 gp=0x1c000007500 m=nil [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /usr/local/go/src/runtime/proc.go:402 +0xce fp=0x1c000073780 sp=0x1c000073760 pc=0x111dbf8ce
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:408
runtime.bgsweep(0x1c000030070)
        /usr/local/go/src/runtime/mgcsweep.go:278 +0x94 fp=0x1c0000737c8 sp=0x1c000073780 pc=0x111dab794
runtime.gcenable.gowrap1()
        /usr/local/go/src/runtime/mgc.go:203 +0x25 fp=0x1c0000737e0 sp=0x1c0000737c8 pc=0x111da00e5
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0x1c0000737e8 sp=0x1c0000737e0 pc=0x111df1e21
created by runtime.gcenable in goroutine 1
        /usr/local/go/src/runtime/mgc.go:203 +0x66

goroutine 4 gp=0x1c0000076c0 m=nil [GC scavenge wait]:
runtime.gopark(0x1c000030070?, 0x1121dad48?, 0x1?, 0x0?, 0x1c0000076c0?)
        /usr/local/go/src/runtime/proc.go:402 +0xce fp=0x1c000073f78 sp=0x1c000073f58 pc=0x111dbf8ce
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:408
runtime.(*scavengerState).park(0x112549360)
        /usr/local/go/src/runtime/mgcscavenge.go:425 +0x49 fp=0x1c000073fa8 sp=0x1c000073f78 pc=0x111da9189
runtime.bgscavenge(0x1c000030070)
        /usr/local/go/src/runtime/mgcscavenge.go:653 +0x3c fp=0x1c000073fc8 sp=0x1c000073fa8 pc=0x111da971c
runtime.gcenable.gowrap2()
        /usr/local/go/src/runtime/mgc.go:204 +0x25 fp=0x1c000073fe0 sp=0x1c000073fc8 pc=0x111da0085
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0x1c000073fe8 sp=0x1c000073fe0 pc=0x111df1e21
created by runtime.gcenable in goroutine 1
        /usr/local/go/src/runtime/mgc.go:204 +0xa5

goroutine 18 gp=0x1c0000a4700 m=nil [finalizer wait]:
runtime.gopark(0x1c000072648?, 0x111d93e65?, 0xa8?, 0x1?, 0x1c0000061c0?)
        /usr/local/go/src/runtime/proc.go:402 +0xce fp=0x1c000072620 sp=0x1c000072600 pc=0x111dbf8ce
runtime.runfinq()
        /usr/local/go/src/runtime/mfinal.go:194 +0x107 fp=0x1c0000727e0 sp=0x1c000072620 pc=0x111d9f127
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0x1c0000727e8 sp=0x1c0000727e0 pc=0x111df1e21
created by runtime.createfing in goroutine 1
        /usr/local/go/src/runtime/mfinal.go:164 +0x3d

goroutine 35 gp=0x1c000324380 m=nil [select, locked to thread]:
runtime.gopark(0x1c000338f58?, 0x2?, 0x60?, 0x0?, 0x1c000338f24?)
        /usr/local/go/src/runtime/proc.go:402 +0xce fp=0x1c000338dc8 sp=0x1c000338da8 pc=0x111dbf8ce
runtime.selectgo(0x1c000338f58, 0x1c000338f20, 0x0?, 0x0, 0x0?, 0x1)
        /usr/local/go/src/runtime/select.go:327 +0x725 fp=0x1c000338ee8 sp=0x1c000338dc8 pc=0x111dd0c45
github.com/status-im/status-keycard-go/internal.(*KeycardContextV2).cardCommunicationRoutine(0x1c0002d6000, {0x112290fa8, 0x1c0003040a0})
        /Users/khushboomehta/Documents/status-desktop/vendor/status-keycard-go/internal/keycard_context_v2.go:150 +0x16a fp=0x1c000338fb8 sp=0x1c000338ee8 pc=0x112056d8a
github.com/status-im/status-keycard-go/internal.(*KeycardContextV2).Start.gowrap1()
        /Users/khushboomehta/Documents/status-desktop/vendor/status-keycard-go/internal/keycard_context_v2.go:118 +0x28 fp=0x1c000338fe0 sp=0x1c000338fb8 pc=0x112056b28
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0x1c000338fe8 sp=0x1c000338fe0 pc=0x111df1e21
created by github.com/status-im/status-keycard-go/internal.(*KeycardContextV2).Start in goroutine 17
        /Users/khushboomehta/Documents/status-desktop/vendor/status-keycard-go/internal/keycard_context_v2.go:118 +0x2a6
SIGABRT: Abnormal termination.
make: *** [run-macos] Abort trap: 6

@igor-sirotin igor-sirotin self-assigned this Mar 7, 2025
@igor-sirotin igor-sirotin added this to the 2.33.0 milestone Mar 7, 2025
@igor-sirotin igor-sirotin moved this from Next to In Progress in Status Desktop/Mobile Board Mar 10, 2025
@igor-sirotin
Copy link
Contributor

Was looking for solution, found this: https://stackoverflow.com/a/74249047/28484223

it is not possible to have two Go runtimes in one image.

We have both status-go and status-keycard-go. But why does the issue only occurs on Intel Mac 🤔

@igor-sirotin
Copy link
Contributor

igor-sirotin commented Mar 10, 2025

Here's a thread about this:

I see a few solutions to this:

  1. Build status-keycard-go and status-go into a single shared library.
    For this, we would need a new go module that would import all these 2, and build it as a shared library. And then link to status-desktop.
  2. Use one (or both) of status-go/status-keycard-go as a separate process, and access it through REST API.
    This has some overhead, and might have some consequences that are difficult to predict, but it's also a nice improvement that would allow debugging of status-go in status-desktop.
  3. Embed status-keycard-go into status-go.
    This could be done in various ways.

@igor-sirotin
Copy link
Contributor

igor-sirotin commented Mar 10, 2025

Meanwhile, I've pushed a fix attempt here: fix/keycard-go-InitializeLibrary.
Now, when we most certainly know the origin of the issue, I'm sure the fix won't help.

But let's see just in case, I'm waitin for @Khushboo-dev-cpp to test it.

If the fix doesn't help, I think we should go with 2.33 as is. It will take quite some time to fix this, and it's not some simple change. Perhaps we could try to increase some delays, as it helped previously, but it might not improve it.

UPD

Thanks @Khushboo-dev-cpp for testing the hotfix. It doesn't work, as expected 🙂
But now I can clearly see that the crash happens at the first call to runtime.GC(), we don't even reach the Go runtime initialization complete output:

tarting Go runtime initialization
bad flushGen 12 in prepareForSweep; sweepgen 0
fatal error: bad flushGen

runtime stack:
...
...
...
goroutine 18 gp=0xc000188380 m=nil [finalizer wait]:
runtime.gopark(0xc000072648?, 0x10fdd1b45?, 0xa8?, 0x1?, 0xc0000081c0?)
        /usr/local/go/src/runtime/proc.go:402 +0xce fp=0xc000072620 sp=0xc000072600 pc=0x10fdfd5ae
runtime.runfinq()
        /usr/local/go/src/runtime/mfinal.go:194 +0x107 fp=0xc0000727e0 sp=0xc000072620 pc=0x10fddce07
runtime.goexit({})
        /usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0000727e8 sp=0xc0000727e0 pc=0x10fe2fb01
created by runtime.createfing in goroutine 1
        /usr/local/go/src/runtime/mfinal.go:164 +0x3d
SIGABRT: Abnormal termination.
make: *** [run-macos] Abort trap: 6

@igor-sirotin
Copy link
Contributor

igor-sirotin commented Mar 10, 2025

Damn, and we won't be able to do (1), because we can't import status-go into a go project, because of:

@igor-sirotin
Copy link
Contributor

Thinking of potential solutions.

  1. Build status-keycard-go and status-go into a single shared library.
  1. Use one (or both) of status-go/status-keycard-go as a separate process, and access it through REST API.
  • 🥴 Has some performance overhead (network stack, HTTP, etc)
  • 🥴 Might have some consequences that are difficult to predict
  • ✅ Will enable debugging status-go/status-keycard-go in status-desktop
  • ✅ Will enable running status-go on a different machine (useful for running a community control node in cloud)
  • ⚠ Might be not exactly simple to implement in status-desktop. Though I see there're some Nim json-rpc implementations that we could use.

NOTE: status-keycard-go API is quite slim, we can try to implement it with it first. Doing this with status-go will be more difficult.

  1. Embed status-keycard-go into status-go.This could be done in various ways.
  • 😵‍💫 absolute madness, makes no sense
  • ❌ will bring status-keycard-go to status-mobile, which is not needed there

@siddarthkay
Copy link
Contributor Author

siddarthkay commented Mar 11, 2025

@igor-sirotin : are you sure this panic is because of multiple version of go libs ?
As mentioned in issue description, this bug could also be because of garbage collection features introduced in go 1.21
https://github.com/golang/go/blob/8f5c6904b616fd97dde4a0ba2f5c71114e588afd/src/runtime/mcache.go#L325

@siddarthkay
Copy link
Contributor Author

if you could add a threadpool the same way we have in status-desktop to this repo https://github.com/siddarthkay/status-desktop-intel-crash-reproducer and if we can reproduce the same error. Then we can be really sure.

@igor-sirotin
Copy link
Contributor

igor-sirotin commented Mar 11, 2025

@siddarthkay Oh, so you were able to reproduce it with a single Go library, that's something I've missed 👀
UPD: oh wait, you wasn't 🤔

are you sure this panic is because of multiple version of go libs ?

I'm not 🙂 I'm currently reading the thread about this.
But if you reproduced it with a single Go library, then it's a different issue I guess.

@alexjba
Copy link
Contributor

alexjba commented Mar 11, 2025

This can be reproduced on an M1 Mac as well when compiling the app for x86_64 and running with rosetta.
It requires an x86_64 qt (I have 5.15.2). Maybe it helps with the dev process.

@igor-sirotin
Copy link
Contributor

This can be reproduced on an M1 Mac as well when compiling the app for x86_64 and running with rosetta. It requires an x86_64 qt (I have 5.15.2). Maybe it helps with the dev process.

Thanks @alexjba, that'd be very helpful. I'll try it out.

@alexjba
Copy link
Contributor

alexjba commented Mar 11, 2025

This can be reproduced on an M1 Mac as well when compiling the app for x86_64 and running with rosetta. It requires an x86_64 qt (I have 5.15.2). Maybe it helps with the dev process.

Thanks @alexjba, that'd be very helpful. I'll try it out.

One mention would be that status-go doesn't cross-compile anymore with make. I've needed to disable the nix shell for status-go in its Makefile and build it manually

@alexjba
Copy link
Contributor

alexjba commented Mar 11, 2025

5. Embed status-keycard-go into status-go.This could be done in various ways.

  • 😵‍💫 absolute madness, makes no sense
  • ❌ will bring status-keycard-go to status-mobile, which is not needed there

status-keycard-go cannot be compiled for mobile in the current state as it depends on pcsc-lite. Probably not worth chasing at all.

Maybe a dumb question, but what about using plugins? Isn't it an option? Was looking at the golang issue attached here and people suggest this as the preferred way of integrating multiple go libs.

@igor-sirotin
Copy link
Contributor

I was able to reproduce on M1 with x86_64 qt build. Thanks @alexjba!

@igor-sirotin
Copy link
Contributor

igor-sirotin commented Mar 11, 2025

I think I reproduced this with a simple C++ app that imports status-go and status-keycard-go

#include "status-go/build/bin/libstatus.h"
#include "status-keycard-go/build/libkeycard/libkeycard.h"

#include <string>

int main() {

    // status-go
    char* jsonRequest = "{'dataDir': '/Users/igorsirotin/Repositories/Status/_sandbox/test-double-go-runtime/accounts'}";
    InitializeApplication(jsonRequest);

    // status-keycard-go
    InitializeLibrary();

    return 0;
}
status-go: Starting Go runtime initialization
status-go: Go runtime initialization complete
status-keycard-go: Starting Go runtime initialization
bad flushGen 14 in prepareForSweep; sweepgen 0
fatal error: bad flushGen

Simple example

But it didn't crash with 2 simplest Go libs.

It's also reproducible with 2 simplest Go libs, that forceruntime.GC():

👉 foo: Go runtime initialization starting
✅ foo: Go runtime initialization complete
👉 bar: Go runtime initialization starting
bad flushGen 2 in prepareForSweep; sweepgen 0
fatal error: bad flushGen
fatal error: bad sweepgen in refill

Code

Will push full code tomorrow Code is located here:

Works with delay

When adding a 50ms sleep between the 2 initializer functions, it works:

👉 foo: Go runtime initialization starting
✅ foo: Go runtime initialization complete
👉 bar: Go runtime initialization starting
✅ bar: Go runtime initialization complete

TODO

TODO for myself:

  • Try this with arm64 build
  • Try combining status-go/status-keycard-go/simple-lib-1/simple-lib-2
  • Try simple-lib-1+simple-lib-2 with x86_64 build. I think now that I was running it with arm64.

@igor-sirotin
Copy link
Contributor

igor-sirotin commented Mar 12, 2025

Some more testing.

Here's the code:

int main() {
    InitializeFoo();
    delay();
    InitializeBar();
    return 0;
}
Arch Sleep, ms Linked Go Libraries Success
arm64 0 foo + bar
arm64 0 status-go + bar
arm64 0 status-go + status-keycard-go
x86_64 0 foo + bar
x86_64 0 status-go + bar
x86_64 0 status-go + status-keycard-go
x86_64 50 foo + bar
x86_64 50 status-go + bar
x86_64 50 status-go + status-keycard-go

Basically, a 50ms delay helps in 100% cases. No clue why.
Also, not sure if the app would work properly after this.

I'll try to introduce this delay in some very early stages of status-desktop, let's see if it helps there as well.

@igor-sirotin
Copy link
Contributor

igor-sirotin commented Mar 12, 2025

I've tried to force GC as the first thing in status-desktop:

status_go.initializeStatusGo()
sleep(2000)
keycard_go.initializeStatusKeycardGo()
sleep(2000)

This didn't help. The initialization functions pass fine, but the app still crashes at the next call to status-keyacrd-go:

Running: bin/StatusDev.app/Contents/MacOS/nim_status_client
gc 1 @0.045s 0%: 0.012+1.1+0.007 ms clock, 0.12+1.0/1.6/0.67+0.075 ms cpu, 3->4->2 MB, 4 MB goal, 0 MB stacks, 1 MB globals, 10 P
gc 2 @0.076s 0%: 0.035+0.91+0.28 ms clock, 0.35+0.085/1.0/1.1+2.8 ms cpu, 5->5->3 MB, 5 MB goal, 0 MB stacks, 1 MB globals, 10 P
gc 3 @0.091s 1%: 0.042+1.7+0.16 ms clock, 0.42+0.087/2.2/1.9+1.6 ms cpu, 8->8->5 MB, 8 MB goal, 0 MB stacks, 1 MB globals, 10 P
gc 4 @0.107s 1%: 0.080+2.8+0.15 ms clock, 0.80+0/4.4/2.6+1.5 ms cpu, 12->12->7 MB, 12 MB goal, 0 MB stacks, 1 MB globals, 10 P
gc 5 @0.123s 1%: 0.040+3.3+0.18 ms clock, 0.40+0/4.8/2.3+1.8 ms cpu, 15->16->11 MB, 16 MB goal, 0 MB stacks, 1 MB globals, 10 P
gc 6 @0.175s 1%: 0.054+3.6+0.15 ms clock, 0.54+0/7.5/9.5+1.5 ms cpu, 22->22->15 MB, 23 MB goal, 0 MB stacks, 1 MB globals, 10 P
👉 status-go: Starting Go runtime initialization
gc 7 @0.185s 2%: 0.032+2.3+0.004 ms clock, 0.32+0/5.7/8.0+0.043 ms cpu, 19->19->16 MB, 32 MB goal, 0 MB stacks, 1 MB globals, 10 P (forced)
✅ status-go: Go runtime initialization complete
👉 status-keycard-go: Starting Go runtime initialization
gc 1 @1.193s 0%: 0.012+1.0+0.004 ms clock, 0.12+0/1.7/0.38+0.040 ms cpu, 2->2->0 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 10 P (forced)
✅ status-keycard-go: Go runtime initialization complete
INF 2025-03-12 13:32:00.329Z app info                                   topics="status-app" tid=39678523 file=nim_status_client.nim:256 version=v2.34.0-dev-53-ga8386034d-dirty commit=a8386034d currentDateTime=2025-03-12T13:32:00+00:00
INF 2025-03-12 13:32:00.330Z starting application controller...         topics="status-app" tid=39678523 file=nim_status_client.nim:258
DBG 2025-03-12 13:32:00.330Z KeycardServiceV2 init                      topics="keycardV2-service" tid=39678523 file=service.nim:95
fatal error: bad sweepgen in refill

TODO:

  • Try to run more operations on the lbiraries in the C++ minimal example
  • Try plugins
  • Try to disable Go optimizations

@iurimatias iurimatias modified the milestones: 2.33.0, 2.34.0 Mar 12, 2025
@siddarthkay
Copy link
Contributor Author

@igor-sirotin : I would suggest trying to remove the -03 optimisation flag on both keycard and status-go in Makefile and verify if turning off these performance optimisations fix the issue or not.

ps: I have no evidence that this will fix the issue, but worth a try.

@igor-sirotin
Copy link
Contributor

igor-sirotin commented Mar 19, 2025

❌ Try to disable Go optimizations

Did not change anything.

❌ plugins

I looked at plugins, which are mentioned in golang/go#65050 indeed.

Theoretically, plugins would work, because they don't contain the go runtime, they expect the application (that loads the plugin) to have the Go runtime. Thus, there will be only a single Go runtime.

Problem is that Go plugins are:

  1. Designed to be used by Go applications, they can't be loaded into a C++/Nim application
  2. Not supported on Windows

❌ Try to run more operations on the libraries in the C++ minimal example

In my example app, as soon as I add more Go calls without a delay in between, it crashes.

e.g.

int main() {
    InitializeStatusGo();
    delay();
    InitializeStatusKeycardGo();
    delay();
    StatusGoFoo();
    // delay();            <--- no delay
    StatusKeycardFoo(); // <--- panic here
    return 0;
}

It seems that it's enough to make sure that 2 libraries never work in parallel.
Although it's probably possible, the effort is too high, while the actual problem of having 2 Go runtimes will be around. Such fix would be a time bomb (which is also difficult to build 😄).


Current conclusion

I will make PoC with running status-keycard-go as a binary and connect to it with JSON-RPC.
Let's see how difficult it is.
And try to benchmark the solution, though keycard is not constantly used, and there's no many API calls, so even if it's x10 slower (which I doubt), I believe it will be fine for the user.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working core-team crash E:Desktop Keycard Bug Bug found after initial keycard development wallet-team
Projects
Status: Backlog
Development

No branches or pull requests

7 participants