GCStress: try to reduce races and tolerate races better #17330
Conversation
@swaroop-sridhar @janvorli @jkotas let me know if you think this is the wrong angle to pursue. It does not look like there is any easy way to eliminate the races in stress instruction updates, so this change tries to tolerate them, as typically they should sort themselves out fairly quickly. Without something like this we'll likely have to exclude many multi-threaded test cases from stress runs to get the stress legs into acceptable shape.

test Windows_NT x64 Checked gcstress0xc
test Windows_NT x86 Checked gcstress0xc
👍
Jenkins seems to have lost track of some legs?

@dotnet-bot retest test Windows_NT x64 Checked gcstress0xc
Seems like early "publishing" of the m_GcCover …
Hmm, lots of odd failures
test Windows_NT x64 Checked gcstress0xc
test Windows_NT x86 Checked gcstress0xc
Looks like tests that intentionally AV on x64 now fail under stress, so something is amiss in the fault retry path. Need to dig in more. x86 stress looks pretty good, though, with only the HW intrinsic failures and one other test failing. Also another appearance of #17242 in the x86 minopts gcstress leg.
Looks like on x64 the redirect machinery maybe messes up the retry logic, as the same fault gets associated with different IPs over time. Am looking at trying to recognize these "spurious" AVs by their exception info signature instead. The kernel describes these as reads from address 0xFF..FF, so we'll use that as a screen too. We still might get confused by an explicit AV that is trying to read from 0xFF..FF, but hopefully those are rare.
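For illustration, a minimal sketch of such a screen, assuming the standard Win32 EXCEPTION_RECORD layout; the helper name is hypothetical and the actual check in the PR may be shaped differently:

#include <windows.h>

// Hypothetical helper: for these spurious AVs the kernel reports a read
// (ExceptionInformation[0] == 0) from address 0xFF..FF
// (ExceptionInformation[1] == all ones), so both are used as a screen.
static bool LooksLikeSpuriousGcStressAV(const EXCEPTION_RECORD* pRec)
{
    return pRec->ExceptionCode == STATUS_ACCESS_VIOLATION
        && pRec->NumberParameters >= 2
        && pRec->ExceptionInformation[0] == 0               // read, not write
        && pRec->ExceptionInformation[1] == (ULONG_PTR)-1;  // address ~0
}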
test Windows_NT x64 Checked gcstress0xc
test Windows_NT x86 Checked gcstress0xc
Peeking in at the logs there are a few failures here and there, but no more than a couple so far in any leg, and the total set of failures is not all that large, maybe 7 tests in all. So if this holds up and these changes look reasonable, we may be able to just exclude a small number of tests from gc stress. The only surprise so far is a failure in regex-redux-5, which is one of the cases I was stressing locally with the fix. Should be a few more hours before all the legs are done.
@fiigii seeing this failure pretty consistently from
You probably can repro with just x64 jitstress=1 or =2. |
This can be fixed by #17180
Several fixes were checked in over the weekend. Also am curious how repeatable some of these failures are. So I am going to do another round of tests and summarize what is still failing.

test Windows_NT x64 Checked gcstress0xc
test Windows_NT x86 Checked gcstress0xc
7 of the gc stress legs passed, 7 others failed. Here's a breakdown of the failures:

* baseservices_threading._mutex_misc_waitone2 -- Not able to repro this locally, and the failure in CI seems rare. Am going to ignore this one for now.
* JIT_Methodical._refany__il_relarray3 -- fix in progress: #17363
* baseservices_threading._commitstackonlyasneeded_DefaultStackCommit -- #17404 (x86 gcstress 0xC zapdisable heapverify). Not able to repro these locally so far. Not multithreaded, so the fact that failures are not deterministic is a bit odd. The return code in failing CI has an odd pattern, 0xCCCCCD40 (debug) or 0xCCCCCDC4 (rel), which may be a clue. Will keep at it. Note we have 0xCCCCCCCD used in the gc in places as an "invalid gc value" when the shadow heap is enabled. However, enabling the shadow heap appears to require heapverify=2 and we're running with heapverify=1. Note heapverify=1 also fills free memory within a heap with 0xCC.
* JIT_Methodical._ELEMENT_TYPE_IU__il_dbgu_fld -- Forking this off as #17428.
* JIT_HardwareIntrinsics._X86_Avx2_Avx2_ro -- failed to validate; not gc stress related. Opened #17389.
* CoreMangLib_cti._system_gc_GCCollect_GCCollect -- seems incompatible; exempted via #17390.

Also, it is still not clear whether this change really improves things. Am going to look at the now-passing legs and see if there's sufficient evidence.
Per #17319 we should extend exclusion of CoreMangLib_cti._system_gc_GCCollect to x86/x64.
test Windows_NT x64 Checked gcstress0xc_jitstress2
Interesting, the rerun of x64 gcstress 0xC jitstress 2 also kicks off the x64 gcstress 0xC again, and this time the latter shows 4 new failures, all AVs. x64 gcstress 0xC jitstress 2 reproed the Avx2_ro failure.
It looks like this does help reduce some instances of artificial stress failures, both on x64 and x86. And the remaining failures have been investigated and now are tracked by their own issues. So I'm removing the no merge tag and asking for reviews. @swaroop-sridhar @janvorli @jkotas PTAL
It seems the following test has uncovered a real GC hole on x86 Windows:
Ah, I can see that we already have an issue tracking it.
Thread* pThread = GetThread();
if (exceptionCode == STATUS_ACCESS_VIOLATION &&
    GCStress<cfg_instr>::IsEnabled() &&
    pExceptionRecord->ExceptionInformation[0] == 0 &&
This new check will always be false on Unix, so we can disable all the last-AV-address stuff for Unix. It seems that the cleanest way would be to create a new ifdef that would be used instead of the defined(_TARGET_X86_) || defined(_TARGET_AMD64_).
I am a bit unclear on your feedback here.
It looks like in some cases signal processing can fall back to parsing the instruction to classify the signal. If that's true, then we might want to keep the last-address check (e.g. require that "true" AVs be ones that happen again if we re-execute the faulting instruction) and just drop the exception information checks.
Or, if we think the signal-to-SEH conversion is largely exempt from spurious AVs, we can disable the whole thing.
I am actually not sure what would happen on Unix in this case. We would need to debug it. Since this code was not enabled for Unix amd64 before and we didn't do any GC stress testing of Unix x86, we would need to try to enable it and do some debugging to see if we ever hit this path.
So maybe something like this?
#if defined(HAVE_GCCOVER) && (defined(_TARGET_X86_) || defined(_TARGET_AMD64_)) && !defined(FEATURE_PAL)
#define GCCOVER_TOLERATE_SPURIOUS_AV
#endif
Sounds good
src/vm/excep.cpp (Outdated)
if (exceptionCode == STATUS_CLR_GCCOVER_CODE)
{
    if (OnGcCoverageInterrupt(pContext))
    {
        // printf("was indeed GC..\n");
A nit - can you please remove the commented-out debug printfs at several places?
Will do.
src/vm/excep.cpp (Outdated)
pThread->GetLastAVAddress() != (LPVOID)GetIP(pContext) &&
pThread->PreemptiveGCDisabled() &&
I think this condition means that we care only about AVs that happened in managed code. Why have you removed it?
When pinvoke stubs are inlined we transition the thread out of and back into cooperative mode while still within managed code. See the big block comment above GCCoverageInfo::SprinkleBreakpoints for why we instrument for gc coverage within this region and then try to un-instrument once we realize where we are. The un-instrumenting is what leads to all the trouble....
So we can see spurious AVs in managed code with preemptive GC enabled.
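As a hedged illustration of the racy restore being described (the function and variable names are assumptions, not the exact gccover.cpp code):

#include <windows.h>

// Thread A decides the stress interrupt is not safe here and restores the
// original byte(s); thread B may already be faulting on the old bytes, and
// the kernel can then classify B's fault against the rewritten instruction.
static void RestoreOriginalInstruction(BYTE* instrPtr, BYTE savedOrigByte)
{
    *(volatile BYTE*)instrPtr = savedOrigByte;               // racy overwrite
    FlushInstructionCache(GetCurrentProcess(), instrPtr, 1); // make it visible
}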
src/vm/gccover.cpp (Outdated)
@@ -84,6 +84,10 @@ void SetupAndSprinkleBreakpoints(
    gcCover->callerThread = 0;
    gcCover->doingEpilogChecks = true;

    // Expose m_GcCover early before we add breakpoints.
    _ASSERTE(!pMD->m_GcCover);
    *EnsureWritablePages(&pMD->m_GcCover) = gcCover;
Just to make sure I understand the reason for this change - it is to ensure that if another thread was running the function into which we sprinkle the breakpoints and it hits one of the breakpoints, we have the m_GcCover set, right?
Yes, a fairly common spurious failure mode is for a thread to hit a GC cover interrupt and then discover that m_GcCover is null. There are comments elsewhere about locks preventing other threads from running the code while it's being instrumented but it does seem to happen.
This will replace an easy-to-diagnose failure mode (m_GcCover is null) with a different failure mode that is much harder to diagnose (m_GcCover still being initialized by one thread while the other thread is using it to handle coverage interrupts).
Should we rather fix the race that leads to code running before m_GcCover is initialized?
It doesn't appear that m_GcCover is lacking anything important for correctness when it's published early. A racing thread may see that lastMD was not cleared and so change the logging a little, but I currently don't see how the early publish will lead to a new class of hard-to-diagnose failures.
But this kind of thing can be hard to reason about with confidence, and it would be better to eliminate the races. The one I saw frequently was in regex-redux-5 and came from a dynamically generated regex method. I'll look more closely at that case and try to figure out where the race comes from.
@@ -1677,7 +1680,8 @@ void DoGcStress (PCONTEXT regs, MethodDesc *pMD)
    }

    // Must flush instruction cache before returning as instruction has been modified.
    FlushInstructionCache(GetCurrentProcess(), (LPCVOID)instrPtr, 6);
    // Note this needs to reach beyond the call by up to 4 bytes.
Can you please explain why we need to reach beyond the call?
In the atCall clause in DoGcStress we sometimes decide to add a new GC interrupt at the instruction after the call. So it seemed prudent to extend the flush to cover this instruction update too.
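Roughly, the extended flush might look like this sketch; the 6-byte base length comes from the diff above and the extra 4 bytes from the new note, but the exact constants and shape in the PR may differ:

#include <windows.h>

// Flush enough bytes to cover both the rewritten instruction at instrPtr and
// a possible new GC interrupt placed just after the call (hence the +4).
static void FlushStressInstruction(LPCVOID instrPtr)
{
    FlushInstructionCache(GetCurrentProcess(), instrPtr, 6 + 4);
}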
LGTM
Rebased and squashed.

test Windows_NT x64 Checked gcstress0xc
test Windows_NT x86 Checked gcstress0xc
test Ubuntu arm Cross Checked gcstress0xc Build and Test
Looks like NGENd code is getting instrumented twice. So will need to sort this out.
Tier1 rejits will also need to reset m_GcCover.
We invoke SetupGcCoverage for prejitted methods more than once. Previously we'd just bail out of SetupGcCoverage if m_GcCover was already set.

It might be better down the road to not do up-front instrumentation for all prejitted methods in a module, since this probably contributes to making GC stress slow. We can instead use the new lock to protect the instrument/publish dance for precompiled methods and so instrument them as they are prepared. But it will take some reworking of the code paths, as the instrumentation and publishing points need to be brought closer together, and there are some variant paths here (e.g. IL stubs).

For tier1 rejits we are in a bind, as the tier0 code may remain active even after the tier1 code is published. So we can't safely clear m_GcCover.

GC Coverage is vulnerable to other cases of silent drop-out, since there is currently no easy way to audit that all methods that are executed are also instrumented. Methods without coverage instrumentation typically won't provoke new GC stress failures and so won't call attention to themselves. That was happening for dynamic methods with recycled MethodDescs previously, and there may be other cases I haven't found yet. In particular I suspect R2R code never gets instrumented for GC Coverage, though I still need to verify this. But if true, that is also something we might want to address separately. Similarly, profiler-injected code that overrides a prejitted method won't get new instrumentation.

So to sum up, I propose:
test Windows_NT x64 Checked gcstress0xc
test Windows_NT x86 Checked gcstress0xc
test Ubuntu arm Cross Checked gcstress0xc Build and Test
Sounds good.
I do not think this is going to work, because both the old and the new method code can be executing at the same time. I think that to handle cases like these, we would need to change the …
Failures in the gc stress legs include a few new tests failing with timeout, and the two hard-to-repro failures we've been seeing:
The increased number of timeouts is not too surprising given we are now holding a lock while instrumenting. However, the failing tests don't appear to be multithreaded, and in single-threaded cases the lock shouldn't add much overhead. So not sure what to think of those failures.

A non-stress failure seen in [Windows_NT arm64 Cross Checked](https://ci.dot.net/job/dotnet_coreclr/job/master/job/arm64_cross_checked_windows_nt_innerloop_flow_prtest/751/) looks like it was possibly an infrastructure issue:
Not sure if the FATAL here came from the test spew or the infrastructure. Suspect the latter.
I thought this was one of the race types you were trying to fix.
No, I haven't looked into this "pending update" assert yet.
FWIW, while I 100% support eliminating flaky GCStress failures, when we start talking about missing coverage, such as potentially uninstrumented code, we might want to step back and think of all the ways we can improve GCStress effectiveness. One current problem: GCStress doesn't actually move objects most of the time, so bad GC info doesn't always get noticed. Another: we only instrument once, so if the hole isn't uncovered on the first run, it isn't noticed. Perhaps we could re-instrument periodically so non-first-run holes get found.
Yes, looks like R2R code is bypassing GcCover instrumentation. Opened some follow-on issues:
@BruceForstall feel free to open issues and we can collect them up and have some idea about how to prioritize.
Ubuntu arm Cross Checked gcstress0xc looks like it may take up to 44 hours, so should finish sometime Friday morning.
8 instances of the pending update failure in the arm 0xC stress. Also one other failure which I haven't seen before:
I don't believe any of the above failures are caused by this change, and there were no spurious AVs or privileged instruction asserts. So I think we probably should merge this.
fyi, that eventsourcetrace issue is https://github.com/dotnet/coreclr/issues/17188
This change addresses races that cause spurious failures when running
GC stress on multithreaded applications.
* Instruction update race

Threads that hit a gc cover interrupt where gc is not safe can race to
overwrite the interrupt instruction and change it back to the original
instruction.
This can cause confusion when handling stress exceptions as the exception code
raised by the kernel may be determined by disassembling the instruction that
caused the fault, and this instruction may now change between the time the
fault is raised and the instruction is disassembled. When this happens the
kernel may report an ACCESS_VIOLATION where there was actually an attempt to
execute a privileged instruction.
x86 already had a tolerance mechanism here: when gc stress was active
and the exception status was ACCESS_VIOLATION, the faulting instruction would
be retried to see if it faults the same way again. In this change we extend
this tolerance to cover x64 and also enable it regardless of the gc mode.
We use the exception information to further screen, as these spurious AVs look
like reads from address 0xFF..FF.
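A hedged sketch of this retry, using the Thread accessors visible in the review diff above (GetLastAVAddress, GetIP); the SetLastAVAddress call and the filter's overall wiring are assumptions, not the exact excep.cpp code:

// On a first suspected-spurious AV at a given IP, remember the IP and retry
// the instruction; if the same IP faults the same way again, it is treated
// as a genuine AV and handled normally.
LONG FilterPossiblySpuriousAV(Thread* pThread, EXCEPTION_RECORD* pRec, CONTEXT* pCtx)
{
    if (pRec->ExceptionCode == STATUS_ACCESS_VIOLATION
        && pRec->ExceptionInformation[0] == 0              // read ...
        && pRec->ExceptionInformation[1] == (ULONG_PTR)-1  // ... from 0xFF..FF
        && pThread->GetLastAVAddress() != (LPVOID)GetIP(pCtx))
    {
        pThread->SetLastAVAddress((LPVOID)GetIP(pCtx));    // assumed setter
        return EXCEPTION_CONTINUE_EXECUTION;               // re-execute
    }
    return EXCEPTION_CONTINUE_SEARCH;                      // real AV path
}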
* Instrumentation vs execution race

The second race happens when one thread is jitting a method and another is
about to call the method. The first thread finishes jitting and publishes the
method code, then starts instrumenting the method for gc coverage. While this
instrumentation is ongoing, the second thread then calls the method and hits
a gc interrupt instruction. The code that recognizes the fault as a gc coverage
interrupt gets confused as the instrumentation is not yet complete -- in
particular the m_GcCover member of the MethodDesc is not yet set. So the second
thread triggers an assert.
The fix for this is to instrument for GcCoverage before publishing the code.
Since multiple threads can be jitting a method concurrently, the instrument and
publish steps are done under a lock to ensure that the instrumentation and code
are consistent (come from the same thread).
With this lock in place we have removed the secondary locking done in
SetupGcCoverage as it is no longer needed; only one thread can be instrumenting
a given jitted method for GcCoverage.
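In outline, the ordering might look like the following sketch; the lock object, the helper name, and the publish call are assumptions (the actual code paths differ by code kind):

// Hypothetical shape of the fix: instrument first, then publish, with both
// steps under one lock so the published code and its GC coverage info are
// guaranteed to come from the same jitting thread.
static void InstrumentAndPublish(MethodDesc* pMD, PCODE pCode)
{
    CrstHolder gcCoverLock(&g_GcCoverCrst);  // assumed lock name
    SetupGcCoverage(pMD, (BYTE*)pCode);      // sprinkle breakpoints first
    pMD->SetNativeCodeInterlocked(pCode);    // then make the code visible
}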
* Instruction cache flushes

In some cases when replacing the interrupt instruction with the original, the
instruction cache was either not flushed or not flushed with sufficient length.
This possibly leads to an increased frequency of the above races.
No impact for non-gc stress scenarios.
Addresses the spurious GC stress failures seen in #17027 and #17610.