Micro optimizations to improve the performance of EH stackwalking, particularly in the X86 with Funclets model #114582
…rticularly in the X86 with Funclets model

- Implement EHEnumInitFromStackFrameIterator as a SuppressGCTransition QCall and optimize its performance
  - This allows skipping setting the InlinedCallFrame to indicate that it is an EH frame (as suppress GC transition frames are NOT generated in that situation)
  - On X86 this also allows skipping using an EH prolog for this function
  - Only update the MethodRegionInfo if there are EH regions to walk
- Improve the codegen and reduce the usage of the UpdateRuntimeWrappedExceptions api
  - Previously we would call IsRuntimeWrappedExceptions, which would lazily compute the flag. However, since we can't actually run the lazy computation during EH, we had already forced it to be initialized, so we didn't actually need to have the full lazy computation logic in place.
  - Also, we were setting this flag as we walked the stack frame via SfiInit and SfiNext, but only checking it when parsing the EH clause data. Move the computation to the EHEnumInitFromStackFrameIterator api, and only compute the correct version of the flag IF there are clauses to walk.
  - Only update the IsRuntimeWrappedExceptions flag and the MethodRegionInfo if there are EH regions to walk
- Improve the performance of AppendExceptionStackFrame slightly by using the variant of GCX_COOP which takes a Thread* instead of getting it from the TLS data.
- Improve the performance of StackTraceInfo::AppendElement
  - It always calls EnsureStackTraceArray, which ALSO needs to have a protected GC variable. Instead of doing that locally in EnsureStackTraceArray, make the GCPROTECT in EnsureStackTraceArray be a bit larger. This avoids modifying the TLS linked list of GCFrames, and also avoids needing an x86 EH prolog for EnsureStackTraceArray.
- Update ExceptionObject::GetStackTrace to use an out-of-line copy of the code that clones the stack trace array in the multi-threaded scenario. This avoids the EH prolog on X86.
- Update ExceptionObject::GetStackTrace to avoid needing to regather the current thread, instead taking it as a parameter
- Change ExceptionObject::GetStackTraceParts to use a faster technique for checking whether the array is an sbyte array or an object[].
- Update NotifyFunctionEnter to check CORProfileTrackExceptions before calling the various profiler reporting functions. This allows the check to happen only once instead of 4 times, and also allowed me to outline some logic so that the function didn't need an EH prolog on X86
- For handling of the m_emptyDebuggerExState on the ThreadExceptionState object, move that into a global static variable to improve the performance of calls to the ThreadExceptionState::GetDebuggerState api, and add a new ThreadExceptionState::SetDebuggerIndicatedFramePointer api to avoid even touching the empty debugger state
- Refactor the EECodeInfo::DecodeGCHdrInfo function into a fast inlineable path and a slow path that does a lot of work, for better inlining behavior.
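For readers unfamiliar with the "check once, outline the rest" pattern mentioned for NotifyFunctionEnter, here is a minimal sketch. It is not the runtime's code: `ProfilerSketch`, `ReportFunctionEnterSlow`, `NotifyFunctionEnterSketch`, and the `SKETCH_NOINLINE` macro are all hypothetical names, and the four reporting calls are modeled as a simple counter.

```cpp
#include <cassert>

// Portable "do not inline" marker for the cold path (hypothetical macro name).
#if defined(_MSC_VER)
#define SKETCH_NOINLINE __declspec(noinline)
#else
#define SKETCH_NOINLINE __attribute__((noinline))
#endif

// Hypothetical stand-in for profiler state; not a runtime type.
struct ProfilerSketch {
    bool trackExceptions = false;  // plays the role of a CORProfileTrackExceptions-style query
    int flagChecks = 0;            // how many times the flag was consulted
    int callbacks = 0;             // how many reporting calls were made

    bool IsTrackingExceptions() {
        ++flagChecks;
        return trackExceptions;
    }
};

// Cold path, kept out of line so the hot caller stays small.
// It performs four reporting steps and never re-checks the flag.
SKETCH_NOINLINE void ReportFunctionEnterSlow(ProfilerSketch& profiler) {
    profiler.callbacks += 4;
}

// Hot path: a single check guards all of the reporting work.
inline void NotifyFunctionEnterSketch(ProfilerSketch& profiler) {
    if (!profiler.IsTrackingExceptions())
        return;
    ReportFunctionEnterSlow(profiler);
}
```

When tracking is off, the caller pays for one flag check and returns without touching the outlined function.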
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
@@ -56,8 +56,13 @@ struct StackTraceElement

class StackTraceInfo
{
    struct StackTraceArrayProtect
Consider adding inline documentation to clarify the roles of m_pStackTraceArray and m_pStackTraceArrayNew within StackTraceArrayProtect to improve future maintainability.
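As a rough illustration of why grouping two references into one protected struct can help, the sketch below models a per-thread list of protected frames and counts list updates. Everything here is an assumption made for illustration: `FrameSketch`, `ThreadSketch`, `ProtectScope`, and `ArrayProtectSketch` are hypothetical, and the roles given to the two fields are a guess, not the documented meaning of m_pStackTraceArray and m_pStackTraceArrayNew.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of one entry in a per-thread list of protected frames.
struct FrameSketch {
    void** slots;       // address of the first protected reference
    std::size_t count;  // number of consecutive references covered
    FrameSketch* next;  // previously linked frame
};

struct ThreadSketch {
    FrameSketch* frames = nullptr;
    int listUpdates = 0;  // pushes plus pops performed on the list
};

// RAII guard: links one frame on construction, unlinks it on destruction.
class ProtectScope {
public:
    ProtectScope(ThreadSketch& thread, void** slots, std::size_t count)
        : m_thread(thread), m_frame{slots, count, thread.frames} {
        thread.frames = &m_frame;
        ++thread.listUpdates;
    }
    ~ProtectScope() {
        m_thread.frames = m_frame.next;
        ++m_thread.listUpdates;
    }
    ProtectScope(const ProtectScope&) = delete;
    ProtectScope& operator=(const ProtectScope&) = delete;

private:
    ThreadSketch& m_thread;
    FrameSketch m_frame;
};

// Two references that are always protected together (field roles are assumed).
struct ArrayProtectSketch {
    void* current = nullptr;      // array currently in use
    void* replacement = nullptr;  // newly allocated array, if one was needed
};

// Caller and callee each protect their own local: four list updates.
int UpdatesWithSeparateScopes(ThreadSketch& thread) {
    void* a = nullptr;
    void* b = nullptr;
    {
        ProtectScope outer(thread, &a, 1);
        {
            ProtectScope inner(thread, &b, 1);
        }
    }
    return thread.listUpdates;
}

// The caller protects both references at once: two list updates.
int UpdatesWithCombinedScope(ThreadSketch& thread) {
    ArrayProtectSketch both;
    {
        ProtectScope scope(thread, &both.current, 2);
    }
    return thread.listUpdates;
}
```

The combined version halves the number of list modifications in this model, which is the effect the PR description attributes to making the GCPROTECT "a bit larger".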
}
CONTRACTL_END

_ASSERTE(IsRuntimeWrapExceptionsStatusComputed());
It may be helpful to document the assumption that the runtime wrap exception status is computed before calling IsRuntimeWrapExceptionsDuringEH, to aid future maintainers in understanding the precondition.
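One way to make that precondition visible is to pair an eager "ensure computed" step with an accessor that only asserts and reads. The sketch below assumes a two-bit flag layout; `ModuleFlagsSketch` and its members are hypothetical names and do not reflect the runtime's actual flag storage.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical holder for a lazily computed, cached boolean property.
class ModuleFlagsSketch {
public:
    // Potentially expensive step. Must run before exception handling begins,
    // because it cannot safely run while an exception is being dispatched.
    void EnsureWrapStatusComputed(bool wrapsExceptions) {
        if (IsWrapStatusComputed())
            return;  // the first computed value wins
        m_flags |= kComputed | (wrapsExceptions ? kWraps : 0u);
    }

    bool IsWrapStatusComputed() const {
        return (m_flags & kComputed) != 0;
    }

    // Precondition: EnsureWrapStatusComputed has already been called.
    // This accessor never computes anything; it only reads the cached bit.
    bool IsWrapExceptionsDuringEH() const {
        assert(IsWrapStatusComputed());
        return (m_flags & kWraps) != 0;
    }

private:
    static constexpr std::uint32_t kComputed = 0x1;  // "value is valid" bit
    static constexpr std::uint32_t kWraps = 0x2;     // the cached value
    std::uint32_t m_flags = 0;
};
```

Because the accessor contains no lazy-initialization branch, it stays small, and the assert documents the ordering requirement at the point of use.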
GCPROTECT_BEGIN(newStackTrace);

size_t stackTraceCapacity = pStackTrace->Capacity();
size_t stackTraceCapacity = pStackTraceProtected->m_pStackTraceArray.Capacity();
Consider adding a comment to explain the use of m_pStackTraceArray within the StackTraceArrayProtect structure, to clarify how capacity and copying are managed in the new implementation.
LGTM!
src/coreclr/vm/exstate.cpp
Outdated
@@ -291,6 +291,8 @@ ExceptionFlags* ThreadExceptionState::GetFlags()
#if !defined(DACCESS_COMPILE)

#ifdef DEBUGGING_SUPPORTED
static DebuggerExState m_emptyDebuggerExState;
Suggested change:
- static DebuggerExState m_emptyDebuggerExState;
+ static DebuggerExState s_emptyDebuggerExState;
src/coreclr/vm/exstate.cpp
Outdated
#if defined(_MSC_VER)
#pragma warning(default : 4640)
#endif
return &m_emptyDebuggerExState;
Suggested change:
- return &m_emptyDebuggerExState;
+ return &s_emptyDebuggerExState;
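The idea of a single file-level placeholder plus a setter that never writes to it can be sketched as follows. This is an illustration under assumptions: `DebuggerStateSketch`, `ThreadExStateSketch`, `SetActiveState`, and the fallback logic are hypothetical and are not taken from exstate.cpp.

```cpp
#include <cassert>

// Hypothetical debugger-related state attached to an in-flight exception.
struct DebuggerStateSketch {
    const void* indicatedFramePointer = nullptr;
};

// One process-wide placeholder instead of one instance per thread object.
// The s_ prefix follows the naming suggested in the review above.
static DebuggerStateSketch s_emptyStateSketch;

class ThreadExStateSketch {
public:
    void SetActiveState(DebuggerStateSketch* state) { m_active = state; }

    // Returns the real state when one exists, otherwise the shared placeholder.
    DebuggerStateSketch* GetDebuggerState() {
        return m_active != nullptr ? m_active : &s_emptyStateSketch;
    }

    // Writes only when real state exists, so the shared placeholder
    // is never modified and never needs to be touched on this path.
    void SetDebuggerIndicatedFramePointer(const void* framePointer) {
        if (m_active != nullptr)
            m_active->indicatedFramePointer = framePointer;
    }

private:
    DebuggerStateSketch* m_active = nullptr;
};
```

Every thread object without active state hands out the same placeholder address, and the dedicated setter keeps that shared object read-only in practice.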
_ASSERTE(hdrInfoSize != 0);
m_hdrInfoTable = (PTR_CBYTE)gcInfoToken.Info + hdrInfoSize;
}
GCInfoToken gcInfoToken = GetGCInfoToken();
Suggested change:
- GCInfoToken gcInfoToken = GetGCInfoToken();
+ _ASSERTE(m_hdrInfoTable == NULL);
+ GCInfoToken gcInfoToken = GetGCInfoToken();
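The fast-path/slow-path split, together with the suggested assert, might look roughly like the sketch below. The class `CodeInfoSketch`, its members, and the one-byte "header size" encoding are assumptions made purely for illustration; the real GC header decoding is far more involved.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cache of a decoded header location.
class CodeInfoSketch {
public:
    explicit CodeInfoSketch(const std::uint8_t* encoded) : m_encoded(encoded) {}

    // Fast path: small enough to inline at call sites.
    // Once the table pointer is cached this is a single null check.
    const std::uint8_t* GetHeaderTable() {
        if (m_table != nullptr)
            return m_table;
        return DecodeHeaderSlow();
    }

    int DecodeCount() const { return m_decodeCount; }

private:
    // Slow path: runs at most once per object.
    const std::uint8_t* DecodeHeaderSlow() {
        assert(m_table == nullptr);  // mirrors the assert suggested in review
        ++m_decodeCount;
        // Assumed encoding: the first byte stores the header size.
        const std::uint8_t headerSize = m_encoded[0];
        assert(headerSize != 0);
        m_table = m_encoded + headerSize;
        return m_table;
    }

    const std::uint8_t* m_encoded;
    const std::uint8_t* m_table = nullptr;
    int m_decodeCount = 0;
};
```

The assert in the slow path states the invariant that only the fast path's cache miss can reach it, which is what the review suggestion makes explicit.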
Some of the test failures look legit:
I didn't try to track it down yet, but it seems related to the |
Co-authored-by: Filip Navara <[email protected]>
…ghton/runtime into X86FuncletPerfMicroOpts
Seems like the tests passed. Can we get this merged or is there still outstanding work to be done?
Thanks a lot!
@filipnavara I'm sorry about the delay, I've been unavoidably away from work for the last week, and I'm only now getting back. I didn't get a chance to look into the debugger scenarios, so I'll try to see if I can get that done some time this week, but I may not be able to.
No worries. I appreciate any help with this! Thanks for keeping me in the loop.
This was caused by changes in PR dotnet#114582 where Exception::GetStackTrace called GetThread() which throws an error exception in the DAC. Changed the enummem.cpp code in ClrDataAccess::DumpManagedExcepObject() to call the GetStackTrace overload that allows a NULL pCurrentThread parameter to be passed.
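The shape of that fix, an overload that accepts a possibly-null thread so that callers without a current thread never trigger the lookup, can be sketched like this. All names (`ThreadHandleSketch`, `g_inspectionMode`, `GetCurrentThreadSketch`, `GetStackTraceSketch`) are hypothetical, and the "inspection mode" switch is only a stand-in for a context such as the DAC where no current thread is available.

```cpp
#include <cassert>
#include <stdexcept>

struct ThreadHandleSketch {
    int id;
};

// Hypothetical switch: true models a context with no current thread.
static bool g_inspectionMode = false;
static ThreadHandleSketch g_onlyThread{1};

// Thread lookup that fails when no current thread exists.
ThreadHandleSketch* GetCurrentThreadSketch() {
    if (g_inspectionMode)
        throw std::runtime_error("no current thread in inspection mode");
    return &g_onlyThread;
}

struct StackTraceSketch {
    int frames;
    bool threadChecked;  // whether thread-dependent work was performed
};

// Core overload: the caller supplies the thread.
// nullptr means "skip anything that depends on the current thread".
StackTraceSketch GetStackTraceSketch(int frames, ThreadHandleSketch* pCurrentThread) {
    return StackTraceSketch{frames, pCurrentThread != nullptr};
}

// Convenience overload: looks the thread up itself,
// so it is not usable when no current thread exists.
StackTraceSketch GetStackTraceSketch(int frames) {
    return GetStackTraceSketch(frames, GetCurrentThreadSketch());
}
```

A caller running without a current thread would use the two-argument form with `nullptr`, which never reaches the throwing lookup.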
This PR improves the performance of deep stack EH throwing by micro-optimizing a number of scenarios. In particular, in the Windows X86 Funclet model, it achieves about a 15% improvement on a simple benchmark.