GCStress: try to reduce races and tolerate races better #17330
Conversation
@swaroop-sridhar @janvorli @jkotas let me know if you think this is the wrong angle to pursue. It does not look like there is any easy way to eliminate the races in stress instruction updates, so this change tries to tolerate them, as typically they should sort themselves out fairly quickly. Without something like this we'll likely have to exclude many multi-threaded test cases from stress runs to get the stress legs into acceptable shape.

test Windows_NT x64 Checked gcstress0xc
test Windows_NT x86 Checked gcstress0xc
👍
Jenkins seems to have lost track of some legs?

@dotnet-bot retest test Windows_NT x64 Checked gcstress0xc
Seems like early "publishing" of the m_GcCover …
Hmm, lots of odd failures
test Windows_NT x64 Checked gcstress0xc
test Windows_NT x86 Checked gcstress0xc
Looks like tests that intentionally AV on x64 now fail under stress, so something is amiss in the fault retry path. Need to dig in more. x86 stress looks pretty good, though, with only the HW intrinsic failures and one other test failing. Also another appearance of #17242 in the x86 minopts gcstress leg.
Looks like on x64 the redirect machinery maybe messes up the retry logic, as the same fault gets associated with different IPs over time. Am looking at trying to recognize these "spurious" AVs by their exception info signature instead. The kernel describes these as reads from address 0xFF..FF, so we'll use that as a screen too. We still might get confused by an explicit AV that is trying to read from 0xFF..FF, but hopefully those are rare.
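For illustration, a minimal sketch of such a screen, assuming the standard Win32 EXCEPTION_RECORD layout; the helper name is hypothetical and the actual check in the PR may be shaped differently:

#include <windows.h>

// Hypothetical helper: for these spurious AVs the kernel reports a read
// (ExceptionInformation[0] == 0) from address 0xFF..FF
// (ExceptionInformation[1] == all ones), so both are used as a screen.
static bool LooksLikeSpuriousGcStressAV(const EXCEPTION_RECORD* pRec)
{
    return pRec->ExceptionCode == STATUS_ACCESS_VIOLATION
        && pRec->NumberParameters >= 2
        && pRec->ExceptionInformation[0] == 0               // read, not write
        && pRec->ExceptionInformation[1] == (ULONG_PTR)-1;  // address ~0
}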
test Windows_NT x64 Checked gcstress0xc
test Windows_NT x86 Checked gcstress0xc
Peeking in at the logs there are a few failures here and there, but no more than a couple so far in any leg, and the total set of failures is not all that large, maybe 7 tests in all. So if this holds up and these changes look reasonable, we may be able to just exclude a small number of tests from gc stress. The only surprise so far is a failure in regex-redux-5, which is one of the cases I was stressing locally with the fix. Should be a few more hours before all the legs are done.
@fiigii seeing this failure pretty consistently from
You probably can repro with just x64 jitstress=1 or =2. |
This can be fixed by #17180
Several fixes were checked in over the weekend. Also am curious how repeatable some of these failures are. So I am going to do another round of tests and summarize what is still failing.

test Windows_NT x64 Checked gcstress0xc
test Windows_NT x86 Checked gcstress0xc
7 of the gc stress legs passed, 7 others failed. Here's a breakdown of the failures:

* baseservices_threading._mutex_misc_waitone2 -- Not able to repro this locally, and the failure in CI seems rare. Am going to ignore this one for now.
* JIT_Methodical._refany__il_relarray3 -- fix in progress: #17363
* baseservices_threading._commitstackonlyasneeded_DefaultStackCommit -- #17404 (x86 gcstress 0xC zapdisable heapverify). Not able to repro these locally so far. Not multithreaded, so the fact that failures are not deterministic is a bit odd. The return code in failing CI has an odd pattern, 0xCCCCCD40 (debug) or 0xCCCCCDC4 (rel), which may be a clue. Will keep at it. Note we have 0xCCCCCCCD used in the gc in places as an "invalid gc value" when the shadow heap is enabled. However, enabling the shadow heap appears to require heapverify=2 and we're running with heapverify=1. Note heapverify=1 also fills free memory within a heap with 0xCC.
* JIT_Methodical._ELEMENT_TYPE_IU__il_dbgu_fld -- Forking this off as #17428.
* JIT_HardwareIntrinsics._X86_Avx2_Avx2_ro -- failed to validate; not gc stress related. Opened #17389.
* CoreMangLib_cti._system_gc_GCCollect_GCCollect -- seems incompatible; exempted via #17390.

Also, it is still not clear whether this change really improves things. Am going to look at the now-passing legs and see if there's sufficient evidence.
Per #17319 we should extend exclusion of CoreMangLib_cti._system_gc_GCCollect to x86/x64.
test Windows_NT x64 Checked gcstress0xc_jitstress2
Interesting, the rerun of x64 gcstress 0xC jitstress 2 also kicks off the x64 gcstress 0xC again, and this time the latter shows 4 new failures, all AVs. x64 gcstress 0xC jitstress 2 reproed the Avx2_ro failure.
It looks like this does help reduce some instances of artificial stress failures, both on x64 and x86. And the remaining failures have been investigated and now are tracked by their own issues. So I'm removing the no merge tag and asking for reviews. @swaroop-sridhar @janvorli @jkotas PTAL
It seems the following test has uncovered a real GC hole on x86 Windows:
Ah, I can see that we already have an issue tracking it.
Thread* pThread = GetThread();
if (exceptionCode == STATUS_ACCESS_VIOLATION &&
    GCStress<cfg_instr>::IsEnabled() &&
    pExceptionRecord->ExceptionInformation[0] == 0 &&
This new check will always be false on Unix, so we can disable all the last-AV-address stuff for Unix. It seems that the cleanest way would be to create a new ifdef that would be used instead of the defined(_TARGET_X86_) || defined(_TARGET_AMD64_).
I am a bit unclear on your feedback here.
It looks like in some cases signal processing can fall back to parsing the instruction to classify the signal. If that's true, then we might want to keep the last-address check (e.g. require that "true" AVs be ones that happen again if we re-execute the faulting instruction) and just drop the exception information checks.
Or, if we think the signal-to-SEH conversion is largely exempt from spurious AVs, we can disable the whole thing.
I am actually not sure what would happen on Unix in this case. We would need to debug it. Since this code was not enabled for Unix amd64 before and we didn't do any GC stress testing of Unix x86, we would need to try to enable it and do some debugging to see if we ever hit this path.
So maybe something like this?
#if defined(HAVE_GCCOVER) && (defined(_TARGET_X86_) || defined(_TARGET_AMD64_)) && !defined(FEATURE_PAL)
#define GCCOVER_TOLERATE_SPURIOUS_AV
#endif
Sounds good
src/vm/excep.cpp (Outdated)
if (exceptionCode == STATUS_CLR_GCCOVER_CODE)
{
    if (OnGcCoverageInterrupt(pContext))
    {
        // printf("was indeed GC..\n");
A nit - can you please remove the commented-out debug printfs at several places?
Will do.
src/vm/excep.cpp (Outdated)
pThread->GetLastAVAddress() != (LPVOID)GetIP(pContext) &&
pThread->PreemptiveGCDisabled() &&
I think this condition means that we care only about AVs that happened in managed code. Why have you removed it?
When pinvoke stubs are inlined we transition the thread out of and back into cooperative mode while still within managed code. See the big block comment above GCCoverageInfo::SprinkleBreakpoints for why we instrument for gc coverage within this region and then try to un-instrument once we realize where we are. The un-instrumenting is what leads to all the trouble....
So we can see spurious AVs in managed code with preemptive GC enabled.
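As a hedged illustration of the racy restore being described (the function and variable names are assumptions, not the exact gccover.cpp code):

#include <windows.h>

// Thread A decides the stress interrupt is not safe here and restores the
// original byte(s); thread B may already be faulting on the old bytes, and
// the kernel can then classify B's fault against the rewritten instruction.
static void RestoreOriginalInstruction(BYTE* instrPtr, BYTE savedOrigByte)
{
    *(volatile BYTE*)instrPtr = savedOrigByte;               // racy overwrite
    FlushInstructionCache(GetCurrentProcess(), instrPtr, 1); // make it visible
}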
src/vm/gccover.cpp (Outdated)
@@ -84,6 +84,10 @@ void SetupAndSprinkleBreakpoints(
    gcCover->callerThread = 0;
    gcCover->doingEpilogChecks = true;

    // Expose m_GcCover early before we add breakpoints.
    _ASSERTE(!pMD->m_GcCover);
    *EnsureWritablePages(&pMD->m_GcCover) = gcCover;
Just to make sure I understand the reason for this change - it is to ensure that if another thread was running the function into which we sprinkle the breakpoints and it hits one of the breakpoints, we have the m_GcCover set, right?
Yes, a fairly common spurious failure mode is for a thread to hit a GC cover interrupt and then discover that m_GcCover is null. There are comments elsewhere about locks preventing other threads from running the code while it's being instrumented but it does seem to happen.
This will replace an easy-to-diagnose failure mode (m_GcCover is null) with a different failure mode that is much harder to diagnose (m_GcCover still being initialized by one thread while the other thread is using it to handle coverage interrupts).
Should we rather fix the race that leads to code running before m_GcCover is initialized?
It doesn't appear that m_GcCover is lacking anything important for correctness when it's published early. A racing thread may see that lastMD was not cleared and so change the logging a little, but I currently don't see how the early publish will lead to a new class of hard-to-diagnose failures.
But this kind of thing can be hard to reason about with confidence, and it would be better to eliminate the races. The one I saw frequently was in regex-redux-5 and came from a dynamically generated regex method. I'll look more closely at that case and try to figure out where the race comes from.
@@ -1677,7 +1680,8 @@ void DoGcStress (PCONTEXT regs, MethodDesc *pMD)
    }

    // Must flush instruction cache before returning as instruction has been modified.
    FlushInstructionCache(GetCurrentProcess(), (LPCVOID)instrPtr, 6);
    // Note this needs to reach beyond the call by up to 4 bytes.
Can you please explain why we need to reach beyond the call?
In the atCall clause in DoGcStress we sometimes decide to add a new GC interrupt at the instruction after the call. So it seemed prudent to extend the flush to cover this instruction update too.
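Roughly, the extended flush might look like this sketch; the 6-byte base length comes from the diff above and the extra 4 bytes from the new note, but the exact constants and shape in the PR may differ:

#include <windows.h>

// Flush enough bytes to cover both the rewritten instruction at instrPtr and
// a possible new GC interrupt placed just after the call (hence the +4).
static void FlushStressInstruction(LPCVOID instrPtr)
{
    FlushInstructionCache(GetCurrentProcess(), instrPtr, 6 + 4);
}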
LGTM
Rebased and squashed.

test Windows_NT x64 Checked gcstress0xc
test Windows_NT x86 Checked gcstress0xc
test Ubuntu arm Cross Checked gcstress0xc Build and Test
Looks like NGENd code is getting instrumented twice. So will need to sort this out.
Tier1 rejits will also need to reset m_GcCover.
We invoke SetupGcCoverage for prejitted methods more than once. Previously we'd just bail out of SetupGcCoverage if m_GcCover was already set.

It might be better down the road to not do up-front instrumentation for all prejitted methods in a module, since this probably contributes to making GC stress slow. We can instead use the new lock to protect the instrument/publish dance for precompiled methods and so instrument them as they are prepared. But it will take some reworking of the code paths, as the instrumentation and publishing points need to be brought closer together, and there are some variant paths here (e.g. IL stubs).

For tier1 rejits we are in a bind, as the tier0 code may remain active even after the tier1 code is published. So we can't safely clear m_GcCover.

GC Coverage is vulnerable to other cases of silent drop-out, since there is currently no easy way to audit that all methods that are executed are also instrumented. Methods without coverage instrumentation typically won't provoke new GC stress failures and so won't call attention to themselves. That was happening for dynamic methods with recycled MethodDescs previously, and there may be other cases I haven't found yet. In particular I suspect R2R code never gets instrumented for GC Coverage, though I still need to verify this. But if true, that is also something we might want to address separately. Similarly, profiler-injected code that overrides a prejitted method won't get new instrumentation.

So to sum up, I propose:
test Windows_NT x64 Checked gcstress0xc
test Windows_NT x86 Checked gcstress0xc
test Ubuntu arm Cross Checked gcstress0xc Build and Test
Sounds good.
I do not think this is going to work, because both the old and the new method code can be executing at the same time. I think that to handle cases like these, we would need to change the …
Failures in the gc stress legs include a few new tests failing with timeout, and the two hard-to-repro failures we've been seeing:
The increased number of timeouts is not too surprising given we are now holding a lock while instrumenting. However, the failing tests don't appear to be multithreaded, and in single-threaded cases the lock shouldn't add much overhead. So not sure what to think of those failures.

A non-stress failure seen in [Windows_NT arm64 Cross Checked](https://ci.dot.net/job/dotnet_coreclr/job/master/job/arm64_cross_checked_windows_nt_innerloop_flow_prtest/751/) looks like it was possibly an infrastructure issue:
Not sure if the FATAL here came from the test spew or the infrastructure. Suspect the latter.
I thought this was one of the race types you were trying to fix.
No, I haven't looked into this "pending update" assert yet.
FWIW, while I 100% support eliminating flaky GCStress failures, when we start talking about missing coverage, such as potentially uninstrumented code, we might want to step back and think of all the ways we can improve GCStress effectiveness. One current problem: GCStress doesn't actually move objects most of the time, so bad GC info doesn't always get noticed. Another: we only instrument once, so if the hole isn't uncovered on the first run, it isn't noticed. Perhaps we could re-instrument periodically so non-first-run holes get found.
Yes, looks like R2R code is bypassing GcCover instrumentation. Opened some follow-on issues:
@BruceForstall feel free to open issues and we can collect them up and have some idea about how to prioritize.
Ubuntu arm Cross Checked gcstress0xc looks like it may take up to 44 hours, so should finish sometime Friday morning.
8 instances of the pending update failure in the arm 0xC stress. Also one other failure which I haven't seen before:
I don't believe any of the above failures are caused by this change, and there were no spurious AVs or privileged instruction asserts. So I think we probably should merge this.
fyi, that eventsourcetrace issue is https://github.com/dotnet/coreclr/issues/17188
This change addresses races that cause spurious failures when running
GC stress on multithreaded applications.
* Instruction update race

Threads that hit a gc cover interrupt where gc is not safe can race to
overwrite the interrupt instruction and change it back to the original
instruction.
This can cause confusion when handling stress exceptions as the exception code
raised by the kernel may be determined by disassembling the instruction that
caused the fault, and this instruction may now change between the time the
fault is raised and the instruction is disassembled. When this happens the
kernel may report an ACCESS_VIOLATION where there was actually an attempt to
execute a privileged instruction.
x86 already had a tolerance mechanism here: when gc stress was active
and the exception status was ACCESS_VIOLATION, the faulting instruction would
be retried to see if it faults the same way again. In this change we extend
this tolerance to cover x64 and also enable it regardless of the gc mode.
We use the exception information to further screen, as these spurious AVs look
like reads from address 0xFF..FF.
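A hedged sketch of this retry, using the Thread accessors visible in the review diff above (GetLastAVAddress, GetIP); the SetLastAVAddress call and the filter's overall wiring are assumptions, not the exact excep.cpp code:

// On a first suspected-spurious AV at a given IP, remember the IP and retry
// the instruction; if the same IP faults the same way again, it is treated
// as a genuine AV and handled normally.
LONG FilterPossiblySpuriousAV(Thread* pThread, EXCEPTION_RECORD* pRec, CONTEXT* pCtx)
{
    if (pRec->ExceptionCode == STATUS_ACCESS_VIOLATION
        && pRec->ExceptionInformation[0] == 0              // read ...
        && pRec->ExceptionInformation[1] == (ULONG_PTR)-1  // ... from 0xFF..FF
        && pThread->GetLastAVAddress() != (LPVOID)GetIP(pCtx))
    {
        pThread->SetLastAVAddress((LPVOID)GetIP(pCtx));    // assumed setter
        return EXCEPTION_CONTINUE_EXECUTION;               // re-execute
    }
    return EXCEPTION_CONTINUE_SEARCH;                      // real AV path
}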
* Instrumentation vs execution race

The second race happens when one thread is jitting a method and another is
about to call the method. The first thread finishes jitting and publishes the
method code, then starts instrumenting the method for gc coverage. While this
instrumentation is ongoing, the second thread then calls the method and hits
a gc interrupt instruction. The code that recognizes the fault as a gc coverage
interrupt gets confused as the instrumentation is not yet complete -- in
particular the m_GcCover member of the MethodDesc is not yet set. So the second
thread triggers an assert.
The fix for this is to instrument for GcCoverage before publishing the code.
Since multiple threads can be jitting a method concurrently, the instrument and
publish steps are done under a lock to ensure that the instrumentation and code
are consistent (come from the same thread).
With this lock in place we have removed the secondary locking done in
SetupGcCoverage as it is no longer needed; only one thread can be instrumenting
a given jitted method for GcCoverage.
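In outline, the ordering might look like the following sketch; the lock object, the helper name, and the publish call are assumptions (the actual code paths differ by code kind):

// Hypothetical shape of the fix: instrument first, then publish, with both
// steps under one lock so the published code and its GC coverage info are
// guaranteed to come from the same jitting thread.
static void InstrumentAndPublish(MethodDesc* pMD, PCODE pCode)
{
    CrstHolder gcCoverLock(&g_GcCoverCrst);  // assumed lock name
    SetupGcCoverage(pMD, (BYTE*)pCode);      // sprinkle breakpoints first
    pMD->SetNativeCodeInterlocked(pCode);    // then make the code visible
}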
* Instruction cache flushes

In some cases when replacing the interrupt instruction with the original, the
instruction cache was either not flushed or not flushed with sufficient length.
This possibly leads to an increased frequency of the above races.
No impact for non-gc stress scenarios.
Addresses the spurious GC stress failures seen in #17027 and #17610.