Skip to content

Exception stack traces on macOS/iOS are in a bad state #4895

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
s-ludwig opened this issue Mar 29, 2025 · 24 comments
Open

Exception stack traces on macOS/iOS are in a bad state #4895

s-ludwig opened this issue Mar 29, 2025 · 24 comments

Comments

@s-ludwig
Copy link
Contributor

At least since the change to use "atos" for resolving stack frames, the state of printing call stacks is pretty disastrous on macOS, regardless of the -g switch:

  1. Printing an exception stack trace will freeze the calling thread for an extended period of time (can be a few hundred ms), which means that a process that logs/prints a larger number of exceptions can easily become unusable or would be a very easy DDoS target.

  2. For x86-64 builds in particular, the stack unwinding functionality appears to be broken and only the topmost stack frame is being displayed.

  3. Resolving file names and line numbers doesn't seem to work at all.

  4. Stack traces generated for system crash reports usually stop at the first extern (D) function (I'm not sure whether this might be a simple issue of --frame-pointer, I'll have to check whether that makes a difference)

  5. On some systems atos isn't available at all

In comparison, dmd appears to work fine (stack traces with symbols working regardless and file/line number working correctly when passing -g.

This has been an issue that wanted to dig deeper into for a while now, but never really got to it, so I just wanted to at least get this written down to maybe start a discussion.

@kinke
Copy link
Member

kinke commented Mar 29, 2025

This can hardly be the general case, as macOS CI includes backtrace tests and is happy on both x86_64 and arm64. So at least some aspects of this seem specific to your setup.

Are you talking about unoptimized -g code, or optimized builds incl. -O? In the latter case, the missing frame pointers might be a problem (not just in your code, but druntime/Phobos too, unless using -link-defaultlib-debug). But those are going to be kept by default for Apple targets with the next version (#4889).

The atos-based implementation has been upstreamed, but DMD's druntime most likely doesn't use it because it probably still prevents the linker from stripping the __debug_line section (which we did before LDC v1.31 by hacking LLVM, before the atos-based implementation was merged in #4291).

Are you talking about distributed binaries, where e.g. the .dSYM files might be missing? [Note that I don't have any Apple stuff and cannot test anything myself.] Ref: #4006

@s-ludwig
Copy link
Contributor Author

I've tested with and without -g and with and without optimizations. It's very well possible that there are some system dependent factors here. I'll try to summarize what I got here:

Snippet used for testing:

void foo(int i)
{
	if (i > 0)
		foo(i - 1);
	else assert(false);
}

void bar()
{
	foo(4);
}

void main()
{
	bar();
}

macOS 15.3.1 arm64, latest Xcode, LDC 1.40.1 - same result for ldc2, ldc2 -g, ldc2 -d --frame-pointer=all:

[email protected](5): Assertion failure
----------------
??:? object.Throwable.TraceInfo core.runtime.defaultTraceHandler(void*) [0x10472b103]
??:? void test.foo(int) [0x1046eb773]
??:? void test.foo(int) [0x1046eb773]
??:? void test.foo(int) [0x1046eb773]
??:? void test.foo(int) [0x1046eb773]
??:? void test.bar() [0x1046eb7bb]
??:? _Dmain [0x1046eb7cf]

With ldc2 -g --frame-pointer=all -O:

[email protected](5): Assertion failure
----------------
??:? object.Throwable.TraceInfo core.runtime.defaultTraceHandler(void*) [0x102f63103]
??:? void rt.dmain2._d_run_main2(char[][], ulong, extern (C) int function(char[][])*).runAll() [0x102f6b893]
??:? _d_run_main [0x102f6b5bf]
??:? start [0x189f70273]
??:? .constarray [0xffffffffffffffff]

Compiling for x64 (ldc2 -g --frame-pointer=all -march=x86-64 -run test.d):

[email protected](5): Assertion failure
----------------
??:? object.Throwable.TraceInfo core.runtime.defaultTraceHandler(void*) [0x10015d4d2]

I'm also getting the same result when compiling natively on an Intel Mac, both for macOS 15.3 and macOS 10.15.

Running dsymutil on the executable did fix line numbers for the arm64 build, but not for the x64 one.

However, compiling with LDC 1.31.0 for x64 results in a stack trace comparable to arm64:

[email protected](5): Assertion failure
----------------
??:? _d_assert [0x104d091bf]
??:? void test.foo(int) [0x104b8ba6d]
??:? void test.foo(int) [0x104b8ba4b]
??:? void test.foo(int) [0x104b8ba4b]
??:? void test.foo(int) [0x104b8ba4b]
??:? void test.foo(int) [0x104b8ba4b]
??:? void test.bar() [0x104b8ba8d]
??:? _Dmain [0x104b8ba98]

Interestingly, for LDC 1.31.0 on x64, line numbers are getting displayed correctly after dsymutil has been run - even after deleting the test.dSYM folder and rebuilding the executable with a different name and changed code... but not when using -run. Seems like dsymutil stores/caches something somewhere else, too?!

[email protected](5): Assertion failure
----------------
??:? _d_assert [0x10485a1bf]
/Users/build/test.d:5 void test.foo(int) [0x1046dca6d]
/Users/build/test.d:4 void test.foo(int) [0x1046dca4b]
/Users/build/test.d:4 void test.foo(int) [0x1046dca4b]
/Users/build/test.d:4 void test.foo(int) [0x1046dca4b]
/Users/build/test.d:4 void test.foo(int) [0x1046dca4b]
/Users/build/test.d:10 void test.bar() [0x1046dca8d]
/Users/build/test.d:15 _Dmain [0x1046dca98]

The excessive CPU use of the atos invocations seems to be consistent across all systems, developer machine or not.

@kinke
Copy link
Member

kinke commented Mar 30, 2025

As said above, you need an additional -link-defaultlib-debug to make sure druntime & Phobos keep the frame pointers.

Then my understanding is that Apple's ld64 linker at one point decided not to keep the __debug_line section in the executable anymore (without a switch to keep it AFAIK - as that would resolve all of your problems, incl. runtime performance - atos is only used as fallback), but to reference that DWARF data from the linked object files: https://stackoverflow.com/questions/10044697/where-how-does-apples-gcc-store-dwarf-inside-an-executable/12827463#12827463

With -run, the object file is temporary and removed right after linking. So when running the executable, the DWARF data isn't available anymore. So please don't use -run for these tests here.

AFAIU, running dsymutil extracts the debug infos from the object files to .dSYM files, so that they can be distributed with the executables. Similar to .pdb files on Windows. And atos then takes care of processing these files for us. Ideally, we'd have an in-process library solution as on Windows (dbghelp.dll), but no idea what Apple provides.

@JohanEngelen
Copy link
Member

JohanEngelen commented Mar 30, 2025

Just now I did some testing on my macbook (macOS Sonoma 14.6.1 arm64), with LDC 1.40.1 and got different results for the testsnippet given above.
ldc2(i.e. compile to executable separately from running that executable) --> no file or line numbers, as expected because there is no debug info.
ldc2 -g --> file and line numbers as expected
ldc2 -g --frame-pointer=all --> no delta with -g (as expected, I think)
ldc2 -g --frame-pointer=all -run foo.d --> no line numbers.
ldc2 -g -O --> No file or line numbers as expected, because the code is too simply optimized to nop, and then the stack trace is purely druntime with no debug info linked in.
ldc2 -g -O -frame-pointer=all --> No file or line numbers as expected. Framepointer=all does not prevent inlining/elision etc. It only prevents frame pointer elision in functions that managed to survive other optimizations (possibly helping with stack variable detection), I was not expecting it to help with call stack tracing.

All results are as expected except for no line number with -run. There was no need to run dsymutil for these tests. (although the point #4006 still stands)

The first post contains a number of separate items without clear testcase or actionable items, so I don't know what to do with those. I never had issues with atos taking too long, but also don't really do a lot of programming / testing.

@s-ludwig
Copy link
Contributor Author

As said above, you need an additional -link-defaultlib-debug to make sure druntime & Phobos keep the frame pointers.

The only part from Druntime should be _d_assert here, so I don't think this should be relevant in this case?

W.r.t. -run, I am now getting line numbers for the arm64 build when compiling normally, so maybe that was actually okay all along and there is something else going on in our build setup.

However, the x86-64 build is only showing a single frame, regardless of -run - @JohanEngelen, can you try what happens for you with -march=x86-64?

Regarding atos, calling Throwable.toString() on Windows with a very short call stack takes about 160 us, whereas on macOS the same will take about 430 ms - so it's over 2500 times slower and even logging a single exception is out of the question for many applications. At least on macOS, there is the backtrace/backtrace_symbols API that AFAIR worked fine in my tests a while ago, could that be an alternative?

@kinke
Copy link
Member

kinke commented Mar 30, 2025

Btw guys, what is that .dSYM bundle? From the linked stackoverflow post:

dsymutil can be thought of as a debug info linker. It does this same process -- read the debug map, load the DWARF from the .o files, relocate all the addresses -- and then outputs a single binary of all the DWARF at their final, linked addresses. This is the dSYM bundle.

Once you have a dSYM bundle, you've got plain old standard DWARF that any dwarf reading tool (which can deal with Mach-O binary files) can process.

Does this mean that running dsymutil on a binary fattens it up with DWARF data, in-place? So no extra files to bundle as I thought, but bloating the executable with the debuginfos, so that druntime doesn't have to resort to atos (https://github.com/dlang/dmd/blob/6656cd920cefebff836aaf70fbf4c570f908e959/druntime/src/core/internal/backtrace/dwarf.d#L247-L251)?

@s-ludwig
Copy link
Contributor Author

From what I can see, the original binary stays untouched. The dSYM output is a special folder (bundle) that contains one or more files with DWARF debug information, as well as separate address relocation data as YAML files.

@s-ludwig
Copy link
Contributor Author

At least on macOS, there is the backtrace/backtrace_symbols API that AFAIR worked fine in my tests a while ago, could that be an alternative?

Unfortunately, backtrace_symbols doesn't return line numbers, so scratch that.

@kinke
Copy link
Member

kinke commented Mar 30, 2025

At least on macOS, there is the backtrace/backtrace_symbols API that AFAIR worked fine in my tests a while ago, could that be an alternative?

No, atos is specifically only used to resolve the file/line infos in case the executable doesn't have the section directly (which DMD enforces by setting an unconventional flag; something we can't do in LLVM without hacking it, as far as we know). We don't need it for the rest, i.e., druntime already uses backtrace[_symbols].

If the implementation can be sped up, that should be done upstream, even if DMD can work around this performance issue.

From what I can see, the original binary stays untouched.

Ah too bad, so we do need atos or some other stuff handling the debuginfo reading for us - either from the executable's map and the linked object files, or these extra .dSYM bundles.

@s-ludwig
Copy link
Contributor Author

Yeah, from everything I can find, it looks like atos is unfortunately the only viable option, apart from resorting to using private frameworks, which would mean AppStore distribution wouldn't be possible. In that case I think we should at least have a Druntime runtime option to disable atos in order to avoid performance hazards.

Another possible improvement would be to cache the results of atos so that at least repeating call stacks would be fast.

@JohanEngelen
Copy link
Member

However, the x86-64 build is only showing a single frame, regardless of -run - @JohanEngelen, can you try what happens for you with -march=x86-64?

Yeah I see the same single frame issue here.

@jacob-carlborg
Copy link
Contributor

As long as macOS has used DWARF the linker has stripped all debug info. You can see my comment how this works on macOS: https://github.com/dlang/dmd/blob/605fb8bf5b522f197798ab4a5197883a7fb0c964/compiler/src/dmd/backend/dwarfdbginf.d#L16-L27

@jacob-carlborg
Copy link
Contributor

Yeah, from everything I can find, it looks like atos is unfortunately the only viable option

We can implement whatever atos is doing ourselves. We're reverse engineering everything else, I don't see why the line is drawn right here at atos.

@JohanEngelen
Copy link
Member

Yeah, from everything I can find, it looks like atos is unfortunately the only viable option

We can implement whatever atos is doing ourselves. We're reverse engineering everything else, I don't see why the line is drawn right here at atos.

Can you be more specific on "everything else"? I am personally quite opposed to implementing (and maintaining!!!!!!) things ourselves that already have implementations elsewhere. I feel that there is already too much work for our available resources.

@jacob-carlborg
Copy link
Contributor

Can you be more specific on "everything else"?

  • Compiler
  • Runtime
  • Standard library
  • Build tool
  • Librarian (or whatever it's called, ar)
  • Linker (if anyone is still maintaining the OMF linker)

@JohanEngelen
Copy link
Member

Can you be more specific on "everything else"?

  • Compiler
  • Runtime
  • Standard library
  • Build tool
  • Librarian (or whatever it's called, ar)
  • Linker (if anyone is still maintaining the OMF linker)

:) when I see this list, I see a huge confirmation of my point that it is very bad idea to implement things ourselves. Build tool, librarian and linker, are all very bad when compared to alternatives. Compiler, runtime, and stdlib cannot be avoided when making a new language, and all are sub-par perhaps partly due to the effort spent on the rest... Even for the compiler, "outsourcing" codegen to LLVM (or GCC) is a huge win in quality, performance, and platform support.

Back on topic: implementing atos ourselves did not happen for the >2 years that symbolization was badly broken on macOS with vanilla LLVM (e.g. homebrew ldc), see https://issues.dlang.org/show_bug.cgi?id=20510. I don't think this is a viable solution path.
For faster symbolization / exception work, we can make external tools (file:line lookup) optional. We can also give an option to select another symbolization program (I read that llvm_symbolizer is much faster than atos).

@jacob-carlborg
Copy link
Contributor

I read that llvm_symbolizer is much faster than atos

Well there you go 😉. Seems like llvm-symbolizer is available as a library as well, in true LLVM fashion [1]. LDC can use that at least, DMD won't.

[1] https://llvm.org/doxygen/classllvm_1_1symbolize_1_1LLVMSymbolizer.html

@s-ludwig
Copy link
Contributor Author

s-ludwig commented Apr 25, 2025

Just noticed another issue - the atos processes get left behind as zombie processes and eventually the process will run out of available file descriptors (after around 2400 exception backtraces, further exceptions print "some pipe creation error!").

@kinke
Copy link
Member

kinke commented Apr 26, 2025

One thing I could easily do is restoring the -preserve-dwarf-line-section workaround, by re-hacking the upcoming LLVM 20 binaries used for the official LDC binaries from GitHub.

However, the x86-64 build is only showing a single frame

Is that still the case with v1.41.0-beta1 and its kept frame pointers?

@s-ludwig
Copy link
Contributor Author

-march x86-64 does seem to work as expected with v1.41.0-beta1 now!

@kinke
Copy link
Member

kinke commented Apr 28, 2025

Okay as expected, and if you had tested -link-defaultlib-debug as suggested earlier, we should have known in advance. ;)

@s-ludwig
Copy link
Contributor Author

@kinke I actually did and it didn't make a difference for x86-64. But can you explain why the stack walking the top frames would be affected by DRuntime code when that only occurs in the lower stack frames?

@kinke
Copy link
Member

kinke commented Apr 28, 2025

I actually did and it didn't make a difference for x86-64

That doesn't make a lot of sense - with -frame-pointer=all for the new code? Anyway, doesn't matter anymore.

But can you explain why the stack walking the top frames

I think it starts from the top (innermost) frames, which are in druntime (_d_throw_exception and friends), so if that goes wrong already, it can't make any progress.

@s-ludwig
Copy link
Contributor Author

Hm, of course, I completely forgot about the exception related code that precedes the actual stack walking invocation.

I don't think I used -frame-pointer=all and -link-defaultlib-debug together, though, so that might have worked indeed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants