sync: Map: internal/sync.HashTrieMap: ran out of hash bits while inserting #73427


Open
MasterDimmy opened this issue Apr 17, 2025 · 25 comments

Labels: BugReport (Issues describing a possible bug in the Go implementation), NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one), WaitingForInfo (Issue is not actionable because of missing required information, which needs to be provided)

Comments

@MasterDimmy

MasterDimmy commented Apr 17, 2025

Go version

go1.24.2 windows/amd64

Output of go env in your module/workspace:

set AR=ar
set CC=gcc
set CGO_CFLAGS=-O2 -g
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-O2 -g
set CGO_ENABLED=1
set CGO_FFLAGS=-O2 -g
set CGO_LDFLAGS=-O2 -g
set CXX=g++
set GCCGO=gccgo
set GO111MODULE=
set GOAMD64=v1
set GOARCH=amd64
set GOAUTH=netrc
set GOBIN=
set GOCACHE=C:\Users\1\AppData\Local\go-build
set GOCACHEPROG=
set GODEBUG=
set GOENV=C:\Users\1\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFIPS140=off
set GOFLAGS=
set GOGCCFLAGS=-m64 -mthreads -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=E:\TEMP\go-build1607997771=/tmp/go-build -gno-record-gcc-switches
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMOD=M:\Projects\AY\Trapper.System\pool_lines_provider\go.mod
set GOMODCACHE=D:\gopath\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=D:\gopath
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=D:\go
set GOSUMDB=sum.golang.org
set GOTELEMETRY=off
set GOTELEMETRYDIR=C:\Users\1\AppData\Roaming\go\telemetry
set GOTMPDIR=
set GOTOOLCHAIN=auto
set GOTOOLDIR=D:\go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.24.2
set GOWORK=
set PKG_CONFIG=pkg-config

What did you do?

var updatePoolLineUsage_keys sync.Map // key => ip
var key, ip string
...
updatePoolLineUsage_keys.Store(key, ip)

What did you see happen?

Panic in

internal/sync.HashTrieMap: ran out of hash bits while inserting
STACK:
goroutine 870014 [running]:
panic({0xfb2660?, 0x1434870?})
D:/go/src/runtime/panic.go:792 +0x132
internal/sync.(*HashTrieMap[...]).expand(0x14602c0?, 0xc003e8aba0, 0xc008401bc0, 0xa879ce25c8134d64, 0x0, 0xc004f668c0)
D:/go/src/internal/sync/hashtriemap.go:181 +0x1e5
internal/sync.(*HashTrieMap[...]).Swap(0x14602c0, {0xfb2660, 0xc0048a8950}, {0xfb2660, 0xc0048a8960})
D:/go/src/internal/sync/hashtriemap.go:272 +0x397
internal/sync.(*HashTrieMap[...]).Store(...)
D:/go/src/internal/sync/hashtriemap.go:200
sync.(*Map).Store(...)
D:/go/src/sync/hashtriemap.go:55

What did you expect to see?

ok

@gabyhelp gabyhelp added the BugReport Issues describing a possible bug in the Go implementation. label Apr 17, 2025
@seankhliao seankhliao changed the title internal/sync.HashTrieMap: ran out of hash bits while inserting internal/sync: HashTrieMap: ran out of hash bits while inserting Apr 17, 2025
@seankhliao seankhliao changed the title internal/sync: HashTrieMap: ran out of hash bits while inserting sync: Map: internal/sync.HashTrieMap: ran out of hash bits while inserting Apr 17, 2025
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Apr 17, 2025
@seankhliao seankhliao added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed compiler/runtime Issues related to the Go compiler and/or runtime. labels Apr 17, 2025
@MasterDimmy
Author

This may help: this code reproduces the bug every hour with nearly 100% probability:

var (
	updatePoolLineUsage_tx   sync.Once
	updatePoolLineUsage_keys sync.Map // key => ip
)

func updatePoolLineUsage(key, ip string) {
	defer handlePanic()

	updatePoolLineUsage_keys.Store(key, ip)

	updatePoolLineUsage_tx.Do(func() {
		go func() {
			defer handlePanic()

			ticker := time.NewTicker(time.Minute)
			for {
				select {
				case <-ticker.C:
					keys := []string{}
					updatePoolLineUsage_keys.Range(func(k, _ any) bool {
						keys = append(keys, k.(string))
						return true
					})

					for _, key := range keys {
						log(key)

						updatePoolLineUsage_keys.Delete(key)
					}
				}
			}
		}()
	})
}

updatePoolLineUsage_keys is not used anywhere else in the code.
"key" is a random string of 16 or 64 bytes. There are about 100 active keys every hour.

@mknyszek
Contributor

Thanks for the report.

Can you please provide a more complete reproducer? How is updatePoolLineUsage invoked in your program? Are the arguments created with unsafe.String or unsafely in some way? Also, is there a way to get it to reproduce faster? (By removing the time.Ticker and just iterating in a loop?)

@mknyszek mknyszek self-assigned this Apr 18, 2025
@mknyszek mknyszek added this to the Go1.25 milestone Apr 18, 2025
@seankhliao seankhliao added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Apr 18, 2025
@MasterDimmy
Author

Just this:

func readUserKey(r *http.Request) {
	key := r.FormValue("key")
	ip := r.RemoteAddr

	updatePoolLineUsage(key, ip)
}

And nothing more; there are no other calls. It just collects keys and dumps the used keys once per minute. This also reproduces on Ubuntu; I build the software with:

gox -os="linux" -arch="amd64" --ldflags="-s"

Because of this bug the only solution is to avoid using sync.Map. How can I revert to the previous sync.Map implementation?

@MasterDimmy
Author

I can't reproduce this faster. This has happened since Go was updated from 1.24 (.1?) to 1.24.2.
Earlier versions (~1.23.2, in January) worked fine.

@MasterDimmy
Author

Just caught this again in another part of this project:

internal/sync.HashTrieMap: ran out of hash bits while inserting
STACK:
goroutine 494781 [running]:
panic({0xfb2660?, 0x1434890?})
        D:/go/src/runtime/panic.go:792 +0x132
internal/sync.(*HashTrieMap[...]).expand(0x14602e0?, 0xc00383f5c0, 0xc00593fb90, 0x69ceaa8c17d6c757, 0x4, 0xc004b96b40)
        D:/go/src/internal/sync/hashtriemap.go:181 +0x1e5
internal/sync.(*HashTrieMap[...]).LoadOrStore(0x14602e0, {0xfb2660, 0xc004a95110}, {0xf83da0, 0xc002ceb46c})
        D:/go/src/internal/sync/hashtriemap.go:160 +0x38d
sync.(*Map).LoadOrStore(...)
        D:/go/src/sync/hashtriemap.go:67
main.(*SingleFunctionUsage).Using(0xc001a5d620, {0xc002b8c29c, 0x10})
        M:/Projects/proxy_pool/single_using.go:27 +0x6f

It happens constantly here:

package main

import (
	"sync"
	"sync/atomic"
)

/*
	Checks that a function is used only once via:
		usage := singleFunctionUsage.Get("somefunc")
		usageVal, used := usage.Using(key)
		if used {
			log("key is in use, please make only one call at a time")
			return
		}
		usage.End(usageVal)
*/

type SingleFunctionUsage struct {
	s sync.Map
}

var singleFunctionUsage = new(SingleFunctionUsage)

func (s *SingleFunctionUsage) Get(name string) *SingleFunctionUsage {
	old, _ := s.s.LoadOrStore(name, &SingleFunctionUsage{})
	return old.(*SingleFunctionUsage)
}

func (s *SingleFunctionUsage) Using(key string) (*int32, bool) {
	val, _ := s.s.LoadOrStore(key, new(int32))     // <<<<<<<<<<<<< panic
	return val.(*int32), atomic.CompareAndSwapInt32(val.(*int32), 0, 1) == false
}

func (s *SingleFunctionUsage) End(val *int32) {
	atomic.StoreInt32(val, 0)
}

Keys are random strings of 16 bytes only.
The name is the constant "usageChecker".
I store a pointer to the struct once; the other values are *int32.
If val failed the type assertion, it would be a different panic, not in HashTrieMap.

@seankhliao
Member

We need a complete runnable example.

@MasterDimmy
Author

It would require recreating the whole "key" database in the right order, with the failing "key" last... I'll try.
There is probably some hash collision on "key" in the map, so it eats all the hash bits in an unplanned order.

@mknyszek
Contributor

Because of this bug the only solution is to avoid using sync.Map. How can I revert to the previous sync.Map implementation?

You can try building your Go program with GOEXPERIMENT=nosynchashtriemap as a workaround, but this option will likely go away in the next release. (This is documented in the release notes: https://go.dev/doc/go1.24#syncpkgsync)
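For reference, this workaround is purely a build-time setting; a sketch of how it might be applied (the experiment name comes from the release notes linked above):

```shell
# Linux/macOS: build with the pre-1.24 sync.Map implementation.
# Per the release notes, this GOEXPERIMENT is expected to be removed
# in a future release, so treat it strictly as a temporary workaround.
GOEXPERIMENT=nosynchashtriemap go build ./...

# Windows (cmd.exe):
set GOEXPERIMENT=nosynchashtriemap
go build ./...
```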

Again, please be aware that sync.Map is fairly widely used, and Go 1.24 has been deployed fairly widely at this point. This is the first bug report we've received of this kind.

I understand the issue is frustrating, and it may well be a real bug in the map implementation, especially if you're using the map in a way that the existing test suite doesn't check for. Unfortunately without a complete runnable example, as @seankhliao mentioned, we have little way of knowing whether that's true. Based on your last message it sounds like you're working on that; thank you! That'll be really helpful to get to the bottom of this.

(Just looking at your examples briefly, I don't see usage that's out of the ordinary, or that we don't test. All the tests also run on windows/amd64, the same platform you're running on. That's why more context would be very helpful. An example that runs (doesn't need to reproduce instantly) on https://play.go.dev for example, where we can see all the code and the way it's being invoked. Or, detailed steps explaining how to set up the reproducing program and execute it.)

I cant reproduce this faster. This happens since go updated from 1.24 (.1 ?) to 1.24.2

Hm. I don't see any relevant changes on the release branch. The difference between 1.24.0 and 1.24.2 is fairly small. (https://go.googlesource.com/go/+log/refs/heads/release-branch.go1.24)

@MasterDimmy
Author

I have added this:

func (s *SingleFunctionUsage) Using(key string) (*int32, bool) {
	defer func() {
		if e := recover(); e != nil {
			errpf("PANIC: %v", e)

			keys := []string{}
			s.s.Range(func(key any, _ any) bool {
				keys = append(keys, key.(string))
				return true
			})
			data := struct {
				key  string
				keys []string
			}{
				key:  key,
				keys: keys,
			}
			buf, _ := json.MarshalIndent(&data, "", " ")
			os.WriteFile("./logs/sync_map_panic.log", buf, 0644)
		}
	}()
	val, _ := s.s.LoadOrStore(key, new(int32))
	return val.(*int32), atomic.CompareAndSwapInt32(val.(*int32), 0, 1) == false
}

So I got:

root@server:/home/project/logs# tail error.log
2025/04/18 18:36:51 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 18:37:28 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 18:37:54 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 18:38:26 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 18:38:51 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
root@server:/home/project/logs# cat sync_map_panic.log
{}
root@server:/home/project/logs#

Yes, no keys at all in the JSON panic log file!

This version never produced an error over 2 hours:

	usingKey, usingNow := singleFunctionUsage.Get("handler").Using(key)
	if usingNow {
		logWait(w, r, "wait")
		return
	}
	defer singleFunctionUsage.End(usingKey)

This version reproduces it about every 5 minutes, but not consistently; it's all random:

	dc := singleFunctionUsage.Get("handler")
	usingKey, usingNow := dc.Using(key)
	if usingNow {
		logWait(w, r, "wait")
		return
	}
	defer singleFunctionUsage.End(usingKey)

I'm investigating further. If you can, please give me a sync.Map library package with detailed logging enabled, so I can use it instead of the standard sync.Map package and we can get more debug data when the panic is intercepted inside sync.Map.

@mknyszek
Contributor

Yes, no keys at all in the JSON panic log file!

I think your marshalling code is incorrect, causing nothing to actually get marshalled. IIRC JSON can only marshal fields that are exported.

I investigate further. If you can, you can give me detailed-log-enabled-sync-map lib package, so i use it instead of standart sync.Map package, so we can get more debug data on panic interseption inside sync.Map.

If you edit the source code of the standard library where your Go tool is located (go env GOROOT), rebuilding your code with the same Go tool will just work. You don't have to do anything special or extra.

@MasterDimmy
Author

During testing I got a new, interesting panic:

runtime error: invalid memory address or nil pointer dereference
STACK:
panic({0x1014080?, 0x24e2a00?})
        D:/go/src/runtime/panic.go:792 +0x132
main.(*SingleFunctionUsage).End(...)
        single_using_old.go:55

It's here:

func (s *SingleFunctionUsage) End(val *int32) {
	atomic.StoreInt32(val, 0)   // <<<<<<<<
}

To reproduce:

	dc := singleFunctionUsage.Get("handler")
	usingKey, usingNow := dc.Using(key)
	if usingNow {
		logWait(w, r, "wait")
		return
	}
	defer singleFunctionUsage.End(usingKey) // <<<<<<<<<<<

But val can't be nil by definition:

	val, _ := s.s.LoadOrStore(key, new(int32))

Here val is never nil! All the library code was provided above.

So it seems like the GC removed the sync.Map data element as unused? Or LoadOrStore somehow returned a cleaned-up object?

There is no usage of any unsafe library in the code.
We get a huge volume of API calls with random "key" values; someone is brute-forcing us to find a correct "key", which is where the load comes from.

@MasterDimmy
Author

Again, please be aware that sync.Map is fairly widely used, and Go 1.24 has been deployed fairly widely at this point. This is the first bug report we've received of this kind.

It's not the first. It happened before, but disappeared:
#69534

I'm trying to catch the bug in a small test case, not in prod.

@mknyszek
Contributor

Here's my attempt at a reproducer based on the information you provided.

package main

import (
	"math/rand/v2"
	"sync"
	"time"
)

func main() {
	var s sync.Map

	go func() {
		for {
			var keys []string
			s.Range(func(k, _ any) bool {
				keys = append(keys, k.(string))
				return true
			})
			for _, key := range keys {
				s.Delete(key)
			}
			time.Sleep(10 * time.Millisecond)
		}
	}()

	for range 4 {
		go func() {
			for {
				s.LoadOrStore(randomString(), new(int32))
				time.Sleep(100 * time.Microsecond)
			}
		}()
	}
	select {}
}

func randomString() string {
	var b [16]byte
	for i := range b {
		b[i] = rand.N[byte](0x7e-0x30) + 0x30 // random valid ASCII
	}
	return string(b[:])
}

Does this seem accurate to you? If not, why? FTR, I cannot produce any crash with this code. (Including if I increase the rate at which entries are inserted or deleted, but I wanted to match the ~100 active keys (this is 400) in the map.)

Small note:

var keys []string
s.Range(func(k, _ any) bool {
	keys = append(keys, k.(string))
	return true
})
for _, key := range keys {
	s.Delete(key)
}

is unnecessarily expensive. You can use the Clear method which is much faster, if you want to delete every entry in the map.

@mknyszek
Contributor

mknyszek commented Apr 18, 2025

It's not the first. It happened before, but disappeared

I think you're misunderstanding what happened in #69534. The reporter discovered the same bug that was fixed by https://go.googlesource.com/go/+/21ac23a96f204dfb558a8d3071380c1d105a93ba, but made a mistake when they thought they tested with the fix. The issue was indeed fixed -- it did not just go away.

Also, the bug in that issue required using unsafe to construct the string value. The unique package was fixed to handle that gracefully by cloning the input, but note that sync.Map does not clone its input. If you are using unsafe to construct the string value (you say you're not using unsafe, so that's not it) and then mutating it via an aliasing byte slice (this violates Go invariants about strings), you can absolutely run into this error. (You would run into a crash with the builtin Go map, and the old sync.Map implementation, too.)

@MasterDimmy
Author

MasterDimmy commented Apr 18, 2025

is unnecessarily expensive. You can use the Clear method which is much faster, if you want to delete every entry in the map.

Not every entry, just the current ones; there are new keys arriving that haven't been dumped to the log yet, so I can't use "Clear". But thanks.

Try to reproduce with this:

	dc := singleFunctionUsage.Get("handler")
	usingKey, usingNow := dc.Using(key)
	if usingNow {
		logWait(w, r, "wait")
		return
	}
	defer singleFunctionUsage.End(usingKey)

and whole code of lib package i provided in
#73427 (comment)

This will be "clean" reproducing code. There are about 10-20 calls per second to it, and the bug appears 5-10 minutes after the app starts, so it takes roughly 15 * 60 * 10 calls to reproduce. I'm trying to capture the "key" data now to make a small 100% reproducer.

Thanks for the help.

@MasterDimmy
Author

I'm getting a huge number of these panics since I went back to using sync.Map.
When I replace it with a sync.Mutex plus a plain map, it works fine.

2025/04/18 23:35:26 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:35:51 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:36:26 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:36:51 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:37:27 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:37:52 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:38:26 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:38:51 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:39:26 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:39:51 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:40:25 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:40:50 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:41:26 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:41:51 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:42:27 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:42:52 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:43:25 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:43:50 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:44:25 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:44:50 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:45:25 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:45:50 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:46:26 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting
2025/04/18 23:46:51 hashtriemap.go:160=>:181=>panic.go:792=>single_using_old.go:32=>logger.go:162:
PANIC: internal/sync.HashTrieMap: ran out of hash bits while inserting

This panic call hierarchy is printed by the github.com/MasterDimmy/zipologger module, so I can see who started the goroutine and the whole call tree instead of just the function name from a standard panic. (There is no bug connected with this lib; it has worked fine for 5+ years.)

There is no data for the "key" value inside sync.Map after the panic has been caught. I can't reproduce this separately from the prod app right now.
Please implement some protection against this panic, and check all the use cases involving the GC. I need to turn off sync.Map usage on the prod server now; it's already too expensive, sorry. I'll try to recreate the bug separately, but please don't neglect it. I think this happens not only when the new sync.Map is used, but that it's somehow connected with the GC.

@thepudds
Contributor

Hi @MasterDimmy, you should definitely take one of the scenarios where this seems reproducible and build/run with the race detector enabled (-race).

@MasterDimmy
Author

Hi @MasterDimmy, you should definitely take one of the scenarios where this seems reproducible and build/run with the race detector enabled (-race).

#73441 (comment)

-race doesn't build when cross-compiling (on Windows 10 x64 for Linux x64); it builds fine without -race.

@thepudds
Contributor

Hi @MasterDimmy , three questions for you:

Question 1: To keep things simple, can you build on Linux with -race and then run on Linux in a scenario that has been reproducible for this problem?

Question 2: If not, can you build on Windows with -race and then run on Windows in a scenario that has been reproducible for this problem? (It is usually ~trivial to use the race detector on Linux, but sometimes takes a little more work to get it working on Windows, though it's usually not too bad -- usually it just involves installing the right flavor and version of gcc. Some more details are in the race detector documentation).

Question 3: In #73427 (comment), @mknyszek attempted a standalone reproducer. Can you more directly answer @mknyszek's question about that standalone reproducer:

Does this seem accurate to you? If not, why?

For example, you could say "That seems accurate based on the data I have so far", or "I suspect that won't reproduce the problem because the standalone reproducer is not doing X", or "I'm certain that won't reproduce the problem because it is doing Y". As you answer, it is helpful to communicate what you know with certainty, vs. what has some perhaps inconclusive data, vs. what is based on a suspicion or hunch.
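The race-detector builds asked about in Questions 1 and 2 might look like this (a sketch; `app` is a hypothetical output name, and the cgo note reflects the cross-compilation failure mentioned earlier in the thread):

```shell
# Build and run natively on Linux with the race detector enabled.
# -race needs cgo support, which is one reason it tends to fail when
# cross-compiling; building on the target OS avoids that problem.
CGO_ENABLED=1 go build -race -o app .
./app
```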

@MasterDimmy
Author

MasterDimmy commented Apr 19, 2025

Q1:
I build the app on Windows x64 for Ubuntu Linux x64 via the command line:
gox -os="linux" -arch="amd64" -ldflags="-s -X main.appVersion=%ver% -X main.appBuild=%btime%"
I'll try to build it on Linux with -race, but that will take 3+ days (I have to go now).

Q2:
I can't reproduce or even run the app on Windows in a similar environment (databases + load), because the app runs on a big production server with a huge user API load.

Q3:
In the simple case, with values of the same type in the sync.Map, everything is always fine.
I think the trouble is (maybe) with different value types used simultaneously, so the case is close to:

package main

import (
	"math/rand/v2"
	"sync"
	"sync/atomic"
)

type Parent struct {
	s sync.Map
}

var base = new(Parent)

func main() {
	// Get the child once, then work only with the child's sync.Map (the value).
	old_, _ := base.s.LoadOrStore("name1234", &Parent{})
	old := old_.(*Parent)

	go func() {
		for {
			var keys []string
			old.s.Range(func(k, _ any) bool {
				keys = append(keys, k.(string))
				return true
			})
			for _, key := range keys {
				old.s.Delete(key)
			}
		}
	}()

	for range 4 {
		go func() {
			for {
				t, _ := old.s.LoadOrStore(randomString(), new(int32)) // using "old", not "base"
				atomic.StoreInt32(t.(*int32), 1)
			}
		}()
	}
	select {}
}

func randomString() string {
	var b [16]byte
	for i := range b {
		b[i] = rand.N[byte](0xFF)
	}
	return string(b[:])
}

But I can't reproduce the load and the panic on Windows. It seems to happen only with cross-compilation.

In this code

	old_, _ := base.s.LoadOrStore("name1234", &Parent{})
	old := old_.(*Parent)

I take the value by key and then work with it. Maybe there is some swap inside LoadOrStore that moved the original &Parent{} value away, so old.s.LoadOrStore(randomString(), new(int32)) operates on a stale pointer to the sync.Map?

The case is using a sync.Map as a child of another sync.Map, which can be moved by a swap in the inner LoadOrStore. So it moves part of itself.

@thepudds
Contributor

One general comment about sync.Map is that I believe it does not provide any locking of the values stored within the sync.Map. In other words, if you get a certain object out of a sync.Map in one goroutine and also get the same object out of the sync.Map in another goroutine and then manipulate that object without any additional synchronization protection, sync.Map is not providing any additional protection of the data. I don't know if that applies here.

Separately, thank you for your answers. Part of the reason I asked about build on Windows & run on Windows (q1), or build on Linux & run on Linux (q2), is that would also eliminate the cross compilation step. (It sounds like you are using gox for cross compilation. I'm only lightly familiar with gox, but as I understand it, it does some clever things to help cgo work during cross compilation, and some maybe small chance that is involved here. In any event, eliminating cross compilation would help with triage).

In addition to identifying races (which at least in theory might be related to issue here), -race also enables some additional execution-time sanity checks of pointers and some other things that are helpful in some cases (like randomizing some scheduling decisions, etc.).

For building & running on Linux with -race, hopefully that's possible for you to try (e.g., maybe zip up all of the source code on Windows to copy it over; in some cases doing a temporary go mod vendor can make that more convenient, or maybe not in your case).

@thepudds
Contributor

The case is using a sync.Map as a child of another sync.Map, which can be moved by a swap in the inner LoadOrStore. So it moves part of itself.

I see you added this in an edit to your prior comment (which is fine; I just missed it at first).

I might have misunderstood, but I will note that the sync.Map documentation includes:

A Map must not be copied after first use.

@MasterDimmy
Author

I'll try to build -race on prod server, it takes time.

The case is using a sync.Map as a child of another sync.Map, which can be moved by a swap in the inner LoadOrStore. So it moves part of itself.

I see you added this in an edit to your prior comment (which is fine; I just missed it at first).

I might have misunderstood, but I will note that the sync.Map documentation includes:

A Map must not be copied after first use.

Sure! But I didn't do any copying (check the code I provided above; there is nothing besides it). I think the sync.Map is moving itself internally!! Via the inner swap function in LoadOrStore, I think.

To reproduce: create sync.Maps as values inside a parent sync.Map, then insert values into the parent sync.Map. Does it move the children around? Swap them? Then loop over a child sync.Map in a goroutine; at some moment it loses its control bits and panics.

@MasterDimmy
Author

D:/go/src/internal/sync/hashtriemap.go:181 +0x1e5
internal/sync.(*HashTrieMap[...]).Swap(0x14602c0, {0xfb2660, 0xc0048a8950}, {0xfb2660, 0xc0048a8960})

This is the swap I'm talking about. It moves a sync.Map used as a value to another place.
Before 1.24 this kind of code always worked fine (sketch):

var a sync.Map
a.LoadOrStore(1, &sync.Map{})
b_, _ := a.LoadOrStore(2, &sync.Map{})
b := b_.(*sync.Map)
b.LoadOrStore(3, 4) // << panic here

Trying to reproduce.


6 participants