Segfault in 24.1.0-dev GFTC builds with pg driver #3590

Closed
nirvdrum opened this issue Jun 14, 2024 · 6 comments · Fixed by oracle/graal#9146

@nirvdrum
Collaborator

nirvdrum commented Jun 14, 2024

While running the benchmarks from the ORM benchmarks discussion, I ran into a segfault using the latest 24.1.0-dev GFTC JVM builds. I haven't seen the issue with the GFTC native builds. The crash occurs 100% reliably on my Ubuntu 24.04 x86_64 system.

> ruby -v
truffleruby 24.1.0-dev-51b497f9, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-linux]

Steps:

  1. Install the latest 24.1.0-dev GFTC build (24.1.0-ea10 at the moment)
  2. Clone the ORM benchmark repo
  3. cd activerecord_truffleruby
  4. bundle install
  5. Start the PostgreSQL container (either Docker or Podman)
  6. Set the DATABASE_URL environment variable to connect to the container (e.g., the value is postgres://postgres:postgres@localhost:36319/TestAR on my machine because local port 36319 forwards to 5432 in the container)
  7. Run ruby benchmark.rb
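
To make step 6 concrete, here is a minimal Python sketch that splits the example DATABASE_URL into its parts. The URL is the value quoted in the step; the helper name `describe_database_url` is hypothetical and is not part of the benchmark repo.

```python
from urllib.parse import urlsplit


def describe_database_url(url: str) -> dict:
    """Split a postgres:// URL into user, password, host, port and database."""
    parts = urlsplit(url)
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,  # host-side port that forwards to 5432 in the container
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    # Example value from the report; the forwarded port differs per machine.
    print(describe_database_url("postgres://postgres:postgres@localhost:36319/TestAR"))
```

If the port printed here does not match the port the container actually publishes, the benchmark will fail to connect before it ever reaches the crash.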

hs_err_pid700062.log

internal issue: [GR-54771]

@nirvdrum nirvdrum changed the title Segfault in Segfault in 24.1.0-dev GFTC builds with pg driver Jun 14, 2024
@eregon
Member

eregon commented Jun 14, 2024

Stack: [0x00007b9840af4000,0x00007b9840bf4000],  sp=0x00007b9840bf27a0,  free space=1017k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libffi.so.8+0x383a]
C  [libtrufflenfi.so+0x723c]  Java_com_oracle_truffle_nfi_backend_libffi_ClosureNativePointer_freeClosure+0x6c
j  com.oracle.truffle.nfi.backend.libffi.ClosureNativePointer.freeClosure(J)V+0 com.oracle.truffle.truffle_nfi_libffi
j  com.oracle.truffle.nfi.backend.libffi.ClosureNativePointer$NativeDestructor.destroy()V+4 com.oracle.truffle.truffle_nfi_libffi
j  com.oracle.truffle.nfi.backend.libffi.NativeAllocation$1.run()V+22 com.oracle.truffle.truffle_nfi_libffi
j  java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@23
j  java.lang.Thread.run()V+19 java.base@23
v  ~StubRoutines::call_stub 0x00007b9866d03ca6
V  [libjvm.so+0x8d8ebb]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x2db
V  [libjvm.so+0x8da822]  JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)+0x1c2
V  [libjvm.so+0x9b22ac]  thread_entry(JavaThread*, JavaThread*)+0x8c
V  [libjvm.so+0x8ef3a8]  JavaThread::thread_main_inner() [clone .part.0]+0xb8
V  [libjvm.so+0xeab1df]  Thread::call_run()+0x9f
V  [libjvm.so+0xcc8095]  thread_native_entry(Thread*)+0xd5
C  [libc.so.6+0x9ca94]
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  com.oracle.truffle.nfi.backend.libffi.ClosureNativePointer.freeClosure(J)V+0 com.oracle.truffle.truffle_nfi_libffi
j  com.oracle.truffle.nfi.backend.libffi.ClosureNativePointer$NativeDestructor.destroy()V+4 com.oracle.truffle.truffle_nfi_libffi
j  com.oracle.truffle.nfi.backend.libffi.NativeAllocation$1.run()V+22 com.oracle.truffle.truffle_nfi_libffi
j  java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@23
j  java.lang.Thread.run()V+19 java.base@23
v  ~StubRoutines::call_stub 0x00007b9866d03ca6

So that sounds like an issue in TruffleNFI.

Could you try with 24.0.1 (JVM) too?

@nirvdrum
Collaborator Author

I'm sorry. I had tested with 24.0.1 but forgot to note it. I'm only seeing the problem with the 24.1.0-dev GFTC JVM build. I don't see it with native builds and I don't see it with a CE JVM build. I also tried with the cext lock enabled and disabled -- that has no impact. The stack does look NFI related, but I wonder if it's something about the pg driver. I tried the sqlite3 benchmark and that didn't crash.

@andrykonchin andrykonchin self-assigned this Jun 17, 2024
@rschatz
Member

rschatz commented Jun 18, 2024

I can reproduce this using your docker containers. Strangely enough I can't reproduce it on my host system.

I'm pretty sure the issue is that there is a second libffi coming from somewhere. The first one is statically linked into libtrufflenfi.so. Not sure where the second one comes from; this might just be a transitive library dependency, either of hotspot or the postgres driver.

What's happening here is that the dynamic loader is confusing those two libraries, and it seems to be mixing symbols from them. E.g., it uses ffi_closure_alloc from our libffi, but ffi_closure_free from the other one. And that leads to the segfault.

I tried to rename all the libffi symbols in libtrufflenfi.so manually, and that seems to fix the issue. I'm not 100% sure how to actually do this without manually messing with the libtrufflenfi.so, but there has to be some way. objcopy --redefine-sym unfortunately doesn't work: it renames only the static symbols, and we need to rename the dynamic symbols.
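
One way to look for this kind of conflict is to compare the dynamic symbol tables of the two libraries. The following is a minimal sketch, assuming GNU `nm` is on the PATH. The function names are hypothetical, and the library paths in the usage comment are placeholders to adjust for the local install.

```python
import subprocess

# nm type letters for globally visible definitions (text, data, bss, read-only, weak).
EXPORTED_TYPES = {"T", "D", "B", "R", "W", "V"}


def exported_ffi_symbols(nm_output: str) -> list:
    """Return the ffi_* names that an `nm -D --defined-only` listing exports."""
    names = set()
    for line in nm_output.splitlines():
        fields = line.split()
        # Defined symbols appear as "<address> <type> <name>".
        if len(fields) != 3:
            continue
        _address, sym_type, name = fields
        # Drop a symbol-version suffix such as "@@SOME_VERSION" if present.
        name = name.split("@", 1)[0]
        if name.startswith("ffi_") and sym_type in EXPORTED_TYPES:
            names.add(name)
    return sorted(names)


def shared_ffi_exports(lib_a: str, lib_b: str) -> list:
    """Run nm on two shared libraries and list the ffi_* symbols both export."""
    def listing(path: str) -> str:
        result = subprocess.run(
            ["nm", "-D", "--defined-only", path],
            check=True, capture_output=True, text=True,
        )
        return result.stdout

    a = set(exported_ffi_symbols(listing(lib_a)))
    b = set(exported_ffi_symbols(listing(lib_b)))
    return sorted(a & b)


# Hypothetical usage; both paths are assumptions, not taken from the report:
# print(shared_ffi_exports("path/to/libtrufflenfi.so", "path/to/libffi.so.8"))
```

A non-empty result would be consistent with the symbol mix-up described above, though it does not show which definition the loader actually bound at run time.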

@nirvdrum
Collaborator Author

@rschatz Interesting. If it helps any, I'm seeing the crash when running on my Ubuntu 24.04 host. Is there something in particular I can search for that would help you see if it's a naming conflict?

@rschatz
Member

rschatz commented Jun 19, 2024

This was actually easier than I thought. Just adding -fvisibility=hidden to the libffi build fixes the problem, no need to actually rename any symbols.

I made a PR: oracle/graal#9146
For convenience I made the PR based on the commit of the 24.1.0-ea10 build. If you want to try it out, you can just cd truffle; mx build, and swap out the libtrufflenfi.so in the GFTC build.

This fixes the problem for me on your containers.

@nirvdrum
Collaborator Author

Thanks. I can confirm the process no longer segfaults.
