runtime: performance issue with the new page allocator on aix/ppc64 #35451
The problem might be larger than just this new page allocator. Its tests seem to pass correctly, but a freeze (about 30s) occurs after each one, even with the old page allocator. It seems to loop inside
The problem comes from
My understanding is that (Nonetheless, I hope we can figure out how to resolve it before the release.)
@Helflym thanks for looking into it. I was aware of this issue but was unable to dig deeper because I couldn't get a gomote over the last few days. The mapping of 2 TiB is completely intentional; AIX claims to have a 60-bit address space in the code. This mapping is also Also, when you say reducing
The allocation is actually working pretty well; there is no freeze at that point. The freeze occurs once a PageAlloc test has ended (maybe during a GC phase or something like that).
This munmap is needed because AIX's mmap doesn't work on an already-mmapped area, which is the case when doing
Yes, I've set the Linux value, and it does work. (I haven't run all.bash though.)
@Helflym OK, it would be nice not to have to support the full 60-bit address space if this is the cause of the performance slowdown. The comment on I also don't fully understand why that would be the cause of the slowdown, unless manipulating that mapping is particularly expensive (which could be because of the munmap; @rsc tells me that on some systems mprotect is better at not messing with VMAs and such in the kernel). This is the assumption I'm making, since you say the "allocation" is working fine, by which I assume you mean the original sysReserve call at start-up. I can write a couple of patches, and if you have some time could you try them for me? Or if I had access to a machine (a gomote is going to be really hard to get now, due to slowness) I could dig deeper into debugging this myself. With regard to the freeze: each PageAlloc test creates two of these mappings and then frees both at the end of the test, so if manipulating the mapping is this expensive, that test will definitely cause an apparent hang.
Anyway, I've made some tests with different values for
As you can see, performance drops sharply after 0x0400000000. I think it should be possible to use this value on AIX. Because if I understand correctly (and please tell me if I'm completely wrong), you need a chunk for every possible address returned by mmap, i.e. from
I've created a special user for this kind of purpose on our build machine; just send me your SSH key by mail and I'll grant you access. Meanwhile, I'll continue to search for a solution.
If we know all returned addresses are that high, then we can set This is a fine enough solution for me, by the way. It would be nice to understand why AIX doesn't like this large mapping, but that's not a priority for the release. But ah, I remember now that
I do agree with you. I've always wanted to use this Note that I've traced a little which unmapping is slowing the tests, and it's indeed when releasing the object created by this
Does it matter if any pages have been touched in the mapping?
@Helflym When you say "similar Also: thank you for looking into this!
It seems it's the slow munmap, yes. I've been using this code.
Anyway, Note, using
Change https://golang.org/cl/206841 mentions this issue:
It would be better to use However, the fact that
What policy do you mean by this? The fact that
Change https://golang.org/cl/207237 mentions this issue:
Yeah, that's what I meant. If it's that strict then perhaps it's OK. @aclements?
While that's true, very large mappings are not nearly as costly on other systems (though to be honest I only have hard numbers for Linux right now, and for 32 TiB With that said, we're exploring an incremental mapping approach. We could also add a layer of indirection, but this complicates the code more than the incremental mapping approach @aclements suggested (IMO), and is probably safer considering we're in the freeze. Anything else that limits how much address space we map is likely going to be more complex than what we have now, which is why I proposed the current approach in the first place.
That's good to know. If there's a strict compatibility policy on this, I'm much more comfortable with the runtime assuming mappings will start at 0x0a00000000000000.
There are certainly ways to engineer around this, but the current approach keeps the runtime much simpler and generally performs better. Adding a level of indirection would also be unfortunate considering the MMU hardware already implements all of the indirection we need directly in hardware. The arena index was originally just one level and depended entirely on MMU indirection (it's still one level on most platforms), and switching to two levels significantly increased its complexity and cost about 2% in performance IIRC.
Ok, thanks for your answers. And if, one day, incremental mapping is implemented, we can still try to remove this arenaBaseOffset and see how AIX performs. But at the moment, I'd rather have a working AIX builder, and I don't think we have any other options, right?
@Helflym I'm working on a couple of patches so that this shouldn't be a problem for the foreseeable future (involving incremental mapping). I hope to have them up for review today.
@Helflym Ahhh... I spoke too soon. There are some problems. I think landing the arenaBaseOffset change is the way to go for now to unblock the builder.
Let's please unblock the builder somehow. |
@ianlancetaylor I +2'd both changes. @Helflym if you can confirm that the patches work on/fix AIX, we can land them any time and unblock the builder (modulo some minor comments; submit when ready).
Change https://golang.org/cl/207497 mentions this issue:
On AIX, addresses returned by mmap are between 0x0a00000000000000 and 0x0afffffffffffff. The previous solution to handle these large addresses was to increase the arena size up to 60-bit addresses, cf. CL 138736. However, with the new page allocator, the 60-bit heap addresses cause huge memory allocations, especially by (s *pageAlloc).init. The mmap and munmap syscalls dealing with these allocations reduce the performance of every Go program.

In order to avoid these allocations, arenaBaseOffset is set to 0x0a00000000000000 and heap addresses are 48-bit, as on other operating systems.

Updates: #35451

Change-Id: Ice916b8578f76703428ec12a82024147a7592bc0
Reviewed-on: https://go-review.googlesource.com/c/go/+/206841
Run-TryBot: Michael Knyszek <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
@mknyszek I've tried your patch with the 60-bit addresses and it's far quicker, but sadly not enough. With your patch,
With the arenaBaseOffset patch,
I didn't have time to investigate much further, though. I'll try to do it by the end of the day or tomorrow and will keep you updated.
@Helflym Yeah, that's what I figured. I think for this release we're going to stick with the 48-bit address space assumption as the fix.

I tried to do an incremental mapping approach wherein we would keep making mappings twice the size, copying data between them each time (there are at most 64 times that we would perform this doubling, and the copying wouldn't actually take very long; a very conservative upper bound would be 500µs for a 1 TiB heap, and it would only happen once). This seemed like a great idea in principle, but there are some sharp edges. The following is more for me to document why we ended up not doing this, so I apologize for the low level of specifics in the rest of this comment.

Firstly, picking the contiguous section of the summaries that the mapping represents is tricky. Ideally you want some kind of "centering" in case the heap keeps growing down in the address space. Next, we want to maintain the illusion to the page allocator that this mapping is actually one enormous mapping, but it's possible to get an address from mmap that's too low to store the pointer, and we'd have to rely on slice calculations in the compiler doing the right thing on overflow. It would work, but could lead to unexpected bugs in the future. The alternative would be to maintain an explicit offset and apply it everywhere, but this significantly complicates the code.

We also considered doing a sparse-array approach for the summaries, but that complicates the code a lot as well, since each summary level is a different size (generics over constants would fix this, but that's a slippery slope).
I think that's better anyway. Even if it's not exactly how AIX works, it will be closer to AIX and avoid unnecessary build failures every time something dealing with memory is changed. If one day there are other big changes that might allow 60-bit addresses again, just warn me and I'll try them ASAP. But until then, let's keep AIX with an arena similar to Linux's; it will be safer.
@Helflym Would you consider this bug fixed for now? We can open a new issue if we see a need for 60-bit addresses again.
Currently the page allocator bitmap is implemented as a single giant memory mapping which is reserved at init time and committed as needed. This causes problems on systems that don't handle large uncommitted mappings well, or that institute low virtual address space defaults as a memory limiting mechanism.

This change modifies the implementation of the page allocator bitmap away from a directly-mapped set of bytes to a sparse array in the same vein as mheap.arenas. This will hurt performance a little, but the biggest gains are from the lockless allocation possible with the page allocator, so the impact of this extra layer of indirection should be minimal. In fact, this is exactly what we see: https://perf.golang.org/search?q=upload:20191125.5

This reduces the amount of mapped (PROT_NONE) memory needed on systems with 48-bit address spaces to ~600 MiB, down from almost 9 GiB. The bulk of the remaining memory is used by the summaries.

Go processes with 32-bit address spaces now always commit 128 KiB of memory for the bitmap. Previously it would only commit the pages in the bitmap which represented the range of addresses (lowest address to highest address, even if there are unused regions in that range) used by the heap.

Updates #35568.
Updates #35451.

Change-Id: I0ff10380156568642b80c366001eefd0a4e6c762
Reviewed-on: https://go-review.googlesource.com/c/go/+/207497
Run-TryBot: Michael Knyszek <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Austin Clements <[email protected]>
Reviewed-by: Cherry Zhang <[email protected]>
AIX doesn't allow mmapping an already mmapped address. The previous way to deal with this behavior was to munmap before calling mmap again. However, the mprotect syscall is able to change protections on a memory range, so memory mapped by sysReserve can be remapped using it. Note that sysMap is always called with a non-nil pointer, so mprotect is always possible.

Updates: #35451

Change-Id: I1fd1e1363d9ed9eb5a8aa7c8242549bd6dad8cd0
Reviewed-on: https://go-review.googlesource.com/c/go/+/207237
Reviewed-by: Austin Clements <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Since the new page allocator (#35112) was enabled by default, the runtime has been extremely slow. Some recent failures on the aix/ppc64 builder also seem related, starting from CL 190622. There is a timeout during runtime tests (cf. https://build.golang.org/log/7e68765a5fe5e9887ef06fd90de1c9ae6682e73d and https://build.golang.org/log/d09de5259a97061169ce2648541666ba1101fc1c).
Before CL 201765:
After:
cc @mknyszek