
GCStress: try to reduce races and tolerate races better #17330

Merged
AndyAyersMS merged 2 commits into dotnet:master from AndyAyersMS:GcStressAvRetryX64 on Apr 19, 2018

Conversation

AndyAyersMS
Member

@AndyAyersMS AndyAyersMS commented Mar 29, 2018

This change addresses races that cause spurious failures when running
GC stress on multithreaded applications.

  • Instruction update race

Threads that hit a gc cover interrupt where gc is not safe can race to
overwrite the interrupt instruction and change it back to the original
instruction.

This can cause confusion when handling stress exceptions, as the exception code
raised by the kernel may be determined by disassembling the instruction that
caused the fault, and this instruction may change between the time the fault is
raised and the time the instruction is disassembled. When this happens the
kernel may report an ACCESS_VIOLATION where there was actually an attempt to
execute a privileged instruction.

x86 already had a tolerance mechanism here: when gc stress was active and the
exception status was ACCESS_VIOLATION, the faulting instruction would be
retried to see if it faults the same way again. In this change we extend this
tolerance to cover x64 and also enable it regardless of the gc mode. We use
the exception information as a further screen, since these spurious AVs look
like reads from address 0xFF..FF.
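
As a rough sketch (not the exact code in this change), the screen amounts to something like the helper below. ExceptionInformation[0] (0 for a read) and ExceptionInformation[1] (the faulting address) are standard Windows EXCEPTION_RECORD fields; the helper name and the exact all-ones comparison are illustrative.

#include <windows.h>

// A spurious AV produced by a half-rewritten gc stress interrupt
// instruction presents as a read from the all-ones address.
bool LooksLikeSpuriousGcStressAV(const EXCEPTION_RECORD* pRec)
{
    return pRec->ExceptionCode == STATUS_ACCESS_VIOLATION
        && pRec->ExceptionInformation[0] == 0                // read access
        && pRec->ExceptionInformation[1] == (ULONG_PTR)-1;   // address 0xFF..FF
}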

  • Instrumentation vs execution race

The second race happens when one thread is jitting a method and another is
about to call the method. The first thread finishes jitting and publishes the
method code, then starts instrumenting the method for gc coverage. While this
instrumentation is ongoing, the second thread then calls the method and hits
a gc interrupt instruction. The code that recognizes the fault as a gc coverage
interrupt gets confused as the instrumentation is not yet complete -- in
particular the m_GcCover member of the MethodDesc is not yet set. So the second
thread triggers an assert.

The fix for this is to instrument for GcCoverage before publishing the code.
Since multiple threads can be jitting a method concurrently, the instrument
and publish steps are done under a lock to ensure that the instrumentation and
code are consistent (come from the same thread).

With this lock in place we have removed the secondary locking done in
SetupGcCoverage as it is no longer needed; only one thread can be instrumenting
a given jitted method for GcCoverage.
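
In outline, the instrument-then-publish dance looks something like the sketch below. The lock, helper names, and types are illustrative stand-ins (CoreCLR uses its own locks, not std::mutex), not the actual identifiers from this change.

#include <cstdint>
#include <mutex>

struct MethodDesc;                    // opaque here
typedef uintptr_t PCODE;              // code address

// Hypothetical helpers standing in for the real runtime operations.
bool HasPublishedCode(MethodDesc* pMD);
void SetupGcCoverage(MethodDesc* pMD, PCODE pCode);
void PublishCode(MethodDesc* pMD, PCODE pCode);

std::mutex g_gcCoverPublishLock;      // hypothetical lock

void InstrumentAndPublish(MethodDesc* pMD, PCODE pCode)
{
    std::lock_guard<std::mutex> hold(g_gcCoverPublishLock);

    if (!HasPublishedCode(pMD))       // another jitting thread may have won
    {
        SetupGcCoverage(pMD, pCode);  // instrument first: set m_GcCover and
                                      // write the interrupt instructions
        PublishCode(pMD, pCode);      // only then make the code visible
    }
}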

  • Instruction cache flushes

In some cases, when replacing the interrupt instruction with the original, the
instruction cache was either not flushed or not flushed with sufficient length.
This can increase the frequency of the above races.

No impact for non-gc stress scenarios.

Addresses the spurious GC stress failures seen in #17027 and #17610.

@AndyAyersMS AndyAyersMS added GCStress * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) labels Mar 29, 2018
@AndyAyersMS
Member Author

@swaroop-sridhar @janvorli @jkotas let me know if you think this is the wrong angle to pursue.
cc @dotnet/jit-contrib

It does not look like there is any easy way to eliminate the races in stress instruction updates so this change tries to tolerate them as typically they should sort themselves out fairly quickly.

Without something like this we'll likely have to block running stress on many multithreaded test cases to get the stress legs into acceptable shape.

test Windows_NT x64 Checked gcstress0xc
test Windows_NT x64 Checked gcstress0xc_jitstress1
test Windows_NT x64 Checked gcstress0xc_jitstress2
test Windows_NT x64 Checked gcstress0xc_minopts_heapverify1
test Windows_NT x64 Checked gcstress0xc_zapdisable
test Windows_NT x64 Checked gcstress0xc_zapdisable_heapverify1
test Windows_NT x64 Checked gcstress0xc_zapdisable_jitstress2

test Windows_NT x86 Checked gcstress0xc
test Windows_NT x86 Checked gcstress0xc_jitstress1
test Windows_NT x86 Checked gcstress0xc_jitstress2
test Windows_NT x86 Checked gcstress0xc_minopts_heapverify1
test Windows_NT x86 Checked gcstress0xc_zapdisable
test Windows_NT x86 Checked gcstress0xc_zapdisable_heapverify1
test Windows_NT x86 Checked gcstress0xc_zapdisable_jitstress2

@swaroop-sridhar

👍

@AndyAyersMS
Member Author

Jenkins seems to have lost track of some legs?

@dotnet-bot retest test Windows_NT x64 Checked gcstress0xc

@AndyAyersMS
Member Author

Seems like early "publishing" of the m_GcCover value may remove the need for that second retry loop I added, as when a thread hits a cover interrupt now, the most common reason for a second retry is that m_GcCover is null.

@AndyAyersMS
Member Author

Hmm, lots of odd failures.
Let's see if my updated version fares any better.

@AndyAyersMS
Member Author

test Windows_NT x64 Checked gcstress0xc
test Windows_NT x64 Checked gcstress0xc_jitstress1
test Windows_NT x64 Checked gcstress0xc_jitstress2
test Windows_NT x64 Checked gcstress0xc_minopts_heapverify1
test Windows_NT x64 Checked gcstress0xc_zapdisable
test Windows_NT x64 Checked gcstress0xc_zapdisable_heapverify1
test Windows_NT x64 Checked gcstress0xc_zapdisable_jitstress2

test Windows_NT x86 Checked gcstress0xc
test Windows_NT x86 Checked gcstress0xc_jitstress1
test Windows_NT x86 Checked gcstress0xc_jitstress2
test Windows_NT x86 Checked gcstress0xc_minopts_heapverify1
test Windows_NT x86 Checked gcstress0xc_zapdisable
test Windows_NT x86 Checked gcstress0xc_zapdisable_heapverify1
test Windows_NT x86 Checked gcstress0xc_zapdisable_jitstress2

@AndyAyersMS
Member Author

AndyAyersMS commented Mar 30, 2018

Looks like tests that intentionally AV on x64 now fail under stress. So something is amiss in the fault retry path. Need to dig in more.

x86 stress looks pretty good though, with only the HW intrinsic failures and one other test failing. Also another appearance of #17242 in the x86 minopts gcstress leg.

@AndyAyersMS
Member Author

Looks like on x64 maybe the redirect stuff messes up the retry logic, as the same fault gets associated with different IPs over time, and we call IsGcMarker multiple times when resolving one fault.

Am looking at trying to recognize these "spurious" AVs by their exception info signature instead. The kernel describes these as reads from address 0xFF..FF so we'll use that as a screen too.

We still might get confused by an explicit AV that is trying to read from 0xFF...FF but hopefully those are rare.

@AndyAyersMS
Member Author

test Windows_NT x64 Checked gcstress0xc
test Windows_NT x64 Checked gcstress0xc_jitstress1
test Windows_NT x64 Checked gcstress0xc_jitstress2
test Windows_NT x64 Checked gcstress0xc_minopts_heapverify1
test Windows_NT x64 Checked gcstress0xc_zapdisable
test Windows_NT x64 Checked gcstress0xc_zapdisable_heapverify1
test Windows_NT x64 Checked gcstress0xc_zapdisable_jitstress2

test Windows_NT x86 Checked gcstress0xc
test Windows_NT x86 Checked gcstress0xc_jitstress1
test Windows_NT x86 Checked gcstress0xc_jitstress2
test Windows_NT x86 Checked gcstress0xc_minopts_heapverify1
test Windows_NT x86 Checked gcstress0xc_zapdisable
test Windows_NT x86 Checked gcstress0xc_zapdisable_heapverify1
test Windows_NT x86 Checked gcstress0xc_zapdisable_jitstress2

1 similar comment
@AndyAyersMS
Member Author

test Windows_NT x64 Checked gcstress0xc
test Windows_NT x64 Checked gcstress0xc_jitstress1
test Windows_NT x64 Checked gcstress0xc_jitstress2
test Windows_NT x64 Checked gcstress0xc_minopts_heapverify1
test Windows_NT x64 Checked gcstress0xc_zapdisable
test Windows_NT x64 Checked gcstress0xc_zapdisable_heapverify1
test Windows_NT x64 Checked gcstress0xc_zapdisable_jitstress2

test Windows_NT x86 Checked gcstress0xc
test Windows_NT x86 Checked gcstress0xc_jitstress1
test Windows_NT x86 Checked gcstress0xc_jitstress2
test Windows_NT x86 Checked gcstress0xc_minopts_heapverify1
test Windows_NT x86 Checked gcstress0xc_zapdisable
test Windows_NT x86 Checked gcstress0xc_zapdisable_heapverify1
test Windows_NT x86 Checked gcstress0xc_zapdisable_jitstress2

@AndyAyersMS
Member Author

Peeking in at the logs there are a few failures here and there, but no more than a couple so far in any leg and the total set of failures is not all that large, maybe 7 tests in all. So if this holds up and these changes look reasonable, we may be able to just exclude a small number of tests from gc stress.

The only surprise so far is a failure in regex-redux-5 which is one of the cases I was stressing locally with the fix.

Should be a few more hours before all the legs are done.

@AndyAyersMS
Member Author

@fiigii seeing this failure pretty consistently from X86_Avx2_Avx2_ro

Assert failure(PID 2976 [0x00000ba0], Thread: 10328 [0x2858]): 
Assertion failed 'emitInstCodeSz(id) >= csz' in
'JIT.HardwareIntrinsics.X86.SimpleUnaryOpTest__ExtractVector128SByte1Store:RunBasicScenario_UnsafeRead():this' (IL size 67)

You probably can repro with just x64 jitstress=1 or =2.

@fiigii

fiigii commented Mar 31, 2018

This can be fixed by #17180

@AndyAyersMS
Member Author

Several fixes were checked in over the weekend. Also am curious how repeatable some of these failures are. So I am going to do another round of tests and summarize what is still failing.

test Windows_NT x64 Checked gcstress0xc
test Windows_NT x64 Checked gcstress0xc_jitstress1
test Windows_NT x64 Checked gcstress0xc_jitstress2
test Windows_NT x64 Checked gcstress0xc_minopts_heapverify1
test Windows_NT x64 Checked gcstress0xc_zapdisable
test Windows_NT x64 Checked gcstress0xc_zapdisable_heapverify1
test Windows_NT x64 Checked gcstress0xc_zapdisable_jitstress2

test Windows_NT x86 Checked gcstress0xc
test Windows_NT x86 Checked gcstress0xc_jitstress1
test Windows_NT x86 Checked gcstress0xc_jitstress2
test Windows_NT x86 Checked gcstress0xc_minopts_heapverify1
test Windows_NT x86 Checked gcstress0xc_zapdisable
test Windows_NT x86 Checked gcstress0xc_zapdisable_heapverify1
test Windows_NT x86 Checked gcstress0xc_zapdisable_jitstress2

@AndyAyersMS
Member Author

AndyAyersMS commented Apr 3, 2018

7 of the gc stress legs passed, 7 others failed. Here's a breakdown of the failures:

baseservices_threading._mutex_misc_waitone2

Not able to repro this locally. Also failure in CI seems rare. Am going to ignore this one for now.

JIT_Methodical._refany__il_relarray3 -- fix in progress #17363

baseservices_threading._commitstackonlyasneeded_DefaultStackCommit -- #17404

* x86 gcstress 0xC zapdisable heapverify

Not able to repro these locally so far. Not multithreaded so the fact that failures are not deterministic is a bit odd. Return code in failing CI has odd pattern 0xCCCCCD40 (debug) or 0xCCCCCDC4 (rel) which may be a clue. Will keep at it.

Note we have 0x0CCCCCCCD used in the gc at places as an "invalid gc value" when the shadow heap is enabled. However, enabling the shadow heap appears to require heapverify=2 and we're running with heapverify=1.

Note heapverify=1 also fills free memory within a heap to 0xCC.

JIT_Methodical._ELEMENT_TYPE_IU__il_dbgu_fld
JIT_Methodical._ELEMENT_TYPE_IU__il_relu_fld

Forking this off as #17428.

Suspect #17387 and #17398 are responsible for many of these:

CoreMangLib_cti._system_gc_GCCollect (now marked stress incompatible)
baseservices_threading._waithandle_waitall_waitallex10a
baseservices_threading._waithandle_waitall_waitallex10
baseservices_threading._waithandle_waitall_waitallex6a
baseservices_threading._waithandle_waitall_waitallex11a
baseservices_threading._waithandle_waitall_waitallex2a
baseservices_threading._waithandle_waitall_waitallex7a
baseservices_threading._waithandle_waitall_waitallex3a
baseservices_threading._waithandle_waitall_waitallex7
baseservices_threading._waithandle_waitall_waitallex6
baseservices_threading._waithandle_waitall_waitallex3
baseservices_threading._mutex_abandonedmutex_am01waitallneg
baseservices_threading._waithandle_waitall_waitallex11
baseservices_threading._waithandle_waitall_waitallex2
CoreMangLib_cti._system_string_StringFormat2
CoreMangLib_cti._system_string_StringFormat1
CoreMangLib_cti._system_collections_generic_dictionaryvaluecollection_DictionaryValueCollectionCopyTo

JIT_HardwareIntrinsics._X86_Avx2_Avx2_ro -- failed to validate.

Not gc stress related. Opened #17389.

CoreMangLib_cti._system_gc_GCCollect_GCCollect -- seems incompatible; exempted via #17390.

Also still not clear if this change really improves things. Am going to look at the now-passing legs and see if there's sufficient evidence.

@AndyAyersMS
Member Author

Per #17319 we should extend exclusion of CoreMangLib_cti._system_gc_GCCollect to x86/x64.

@fiigii

fiigii commented Apr 3, 2018

test Windows_NT x64 Checked gcstress0xc_jitstress2

@AndyAyersMS
Member Author

Interesting, the rerun of x64 gcstress 0xC jitstress 2 also kicks off the x64 gcstress 0xC again, and this time the latter shows 4 new failures, all AVs.

x64 gcstress 0xC jitstress 2 reproed the Avx2_ro failure.

@AndyAyersMS AndyAyersMS removed the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Apr 4, 2018
@AndyAyersMS
Member Author

It looks like this does help reduce some instances of artificial stress failures, both on x64 and x86. And the remaining failures have been investigated and now are tracked by their own issues.

So I'm removing the no merge tag and asking for reviews.

@swaroop-sridhar @janvorli @jkotas PTAL
cc @dotnet/jit-contrib

@janvorli
Member

janvorli commented Apr 5, 2018

It seems the following test has uncovered a real GC hole on x86 Windows:
_commitstackonlyasneeded_DefaultStackCommit_DefaultStackCommit_cmd

@janvorli
Member

janvorli commented Apr 5, 2018

Ah, I can see that we already have an issue tracking it.

Thread* pThread = GetThread();
if (exceptionCode == STATUS_ACCESS_VIOLATION &&
GCStress<cfg_instr>::IsEnabled() &&
pExceptionRecord->ExceptionInformation[0] == 0 &&
Member

This new check will always be false on Unix. So we can disable all the last AV address stuff for Unix. It seems that the cleanest way would be to create a new ifdef that would be used instead of the defined(_TARGET_X86_) || defined(_TARGET_AMD64_).

Member Author

I am a bit unclear on your feedback here.

It looks like in some cases signal processing can fall back to parsing the instruction to classify the signal. If that's true then we might want to keep the last address check (eg require "true" AVs be ones that happen again if we re-execute the faulting instruction) and just drop the exception information checks.

Or if we think the signal to seh conversion is largely exempt from spurious AVs we can disable the whole thing.

Member

I am actually not sure what would happen on Unix in this case. We would need to debug it. Since this code was not enabled for Unix amd64 before and we didn't do any GC stress testing of Unix x86, we would need to try to enable it and do some debugging to see if we ever hit this path.

Member Author

So maybe something like this?

#if defined(HAVE_GCCOVER) && (defined(_TARGET_X86_) || defined(_TARGET_AMD64_)) && !defined(FEATURE_PAL)
#define GCCOVER_TOLERATE_SPURIOUS_AV
#endif 

Member

Sounds good

src/vm/excep.cpp Outdated

if (exceptionCode == STATUS_CLR_GCCOVER_CODE)
{
if (OnGcCoverageInterrupt(pContext))
{
// printf("was indeed GC..\n");
Member

A nit - can you please remove the commented out debug printfs at several places?

Member Author

Will do.

src/vm/excep.cpp Outdated
pThread->GetLastAVAddress() != (LPVOID)GetIP(pContext) &&
pThread->PreemptiveGCDisabled() &&
Member

I think this condition means that we care only about AV that happened in managed code. Why have you removed it?

Member Author

When pinvoke stubs are inlined we transition the thread into and back out of cooperative mode while still within managed code.

See the big block comment above GCCoverageInfo::SprinkleBreakpoints for why we instrument for gc coverage within this region and then try and un-instrument once we realize where we are. The un-instrumenting is what leads to all the trouble....

So we can see spurious AVs in managed code with preemptive GC disabled.

@@ -84,6 +84,10 @@ void SetupAndSprinkleBreakpoints(
gcCover->callerThread = 0;
gcCover->doingEpilogChecks = true;

// Expose m_GcCover early before we add breakpoints.
_ASSERTE(!pMD->m_GcCover);
*EnsureWritablePages(&pMD->m_GcCover) = gcCover;
Member

Just to make sure I understand the reason for this change - it is to ensure that if another thread was running the function into which we sprinkle the breakpoints and it hits one of the breakpoints, we have the m_GcCover set, right?

Member Author

Yes, a fairly common spurious failure mode is for a thread to hit a GC cover interrupt and then discover that m_GcCover is null. There are comments elsewhere about locks preventing other threads from running the code while it's being instrumented but it does seem to happen.

Member

This will replace easy to diagnose failure mode (m_GcCover is null) with a different failure mode that is much harder to diagnose (m_GcCover still being initialized by one thread while the other thread is using it to handle coverage interrupts).

Should we rather fix the race that leads to code running before m_GcCover is initialized?

Member Author

It doesn't appear that m_GcCover is lacking anything important for correctness when it's published early. A racing thread may see that lastMD was not cleared and so change the logging a little, but I currently don't see how the early publish will lead to a new class of hard to diagnose failures.

But this kind of thing can be hard to reason about with confidence and it would be better to eliminate the races. The one I saw frequently was in regex-redux-5 and came from a dynamically generated regex method. I'll look more closely at that case and try and figure out where the race comes from.

@@ -1677,7 +1680,8 @@ void DoGcStress (PCONTEXT regs, MethodDesc *pMD)
}

// Must flush instruction cache before returning as instruction has been modified.
FlushInstructionCache(GetCurrentProcess(), (LPCVOID)instrPtr, 6);
// Note this needs to reach beyond the call by up to 4 bytes.
Member

Can you please explain why we need to reach beyond the call?

Member Author

In the atCall clause in DoGcStress we sometimes decide to add a new GC interrupt at the instruction after the call. So it seemed prudent to extend the flush to cover this instruction update too.
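
Concretely, the change is of this shape (a sketch; the 6-byte length comes from the original line above, and the extra reach is the "up to 4 bytes" in the new comment):

// Before: flush only the bytes of the call site itself.
FlushInstructionCache(GetCurrentProcess(), (LPCVOID)instrPtr, 6);

// After (sketch): also cover an interrupt instruction that DoGcStress may
// have just written at the instruction following the call.
FlushInstructionCache(GetCurrentProcess(), (LPCVOID)instrPtr, 6 + 4);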

Member

@janvorli janvorli left a comment

LGTM

@AndyAyersMS
Member Author

Rebased and squashed.

test Windows_NT x64 Checked gcstress0xc
test Windows_NT x64 Checked gcstress0xc_jitstress1
test Windows_NT x64 Checked gcstress0xc_jitstress2
test Windows_NT x64 Checked gcstress0xc_minopts_heapverify1
test Windows_NT x64 Checked gcstress0xc_zapdisable
test Windows_NT x64 Checked gcstress0xc_zapdisable_heapverify1
test Windows_NT x64 Checked gcstress0xc_zapdisable_jitstress2

test Windows_NT x86 Checked gcstress0xc
test Windows_NT x86 Checked gcstress0xc_jitstress1
test Windows_NT x86 Checked gcstress0xc_jitstress2
test Windows_NT x86 Checked gcstress0xc_minopts_heapverify1
test Windows_NT x86 Checked gcstress0xc_zapdisable
test Windows_NT x86 Checked gcstress0xc_zapdisable_heapverify1
test Windows_NT x86 Checked gcstress0xc_zapdisable_jitstress2

@AndyAyersMS
Member Author

test Ubuntu arm Cross Checked gcstress0xc Build and Test

@AndyAyersMS
Member Author

Looks like NGENd code is getting instrumented twice. So will need to sort this out.

@AndyAyersMS
Member Author

Tier1 rejits will also need to reset m_GcCover.

@AndyAyersMS
Member Author

AndyAyersMS commented Apr 18, 2018

We invoke SetupGcCoverage for NGEN code when the module is first loaded (for all methods via SetupGcCoverageForNativeImage) and also when a particular method in that module is prepared (similarly for prejitted IL stubs). The second instrumentation call triggers the assert in my new code that m_GcCover is not set entering SetupGcCoverage.

Previously we'd just bail out of SetupGcCoverage if m_GcCover was set, and that is likely an acceptable workaround for now, as we sort through the issues below. It will mean that sometimes the GcCover info and the method code won't match, but when that happens the method won't have any gc stress breakpoints so I think it works out.
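
The workaround is roughly the following early-out at the top of SetupGcCoverage (a sketch of the shape, not the exact diff):

void SetupGcCoverage(MethodDesc* pMD, BYTE* methodStartPtr)
{
    // Prejitted methods can reach here twice: once when the native image
    // is loaded, and again when the method is prepared. Keep the first
    // instrumentation and bail out on the second call.
    if (pMD->m_GcCover != NULL)
        return;

    // ... allocate the GCCoverageInfo, set m_GcCover, sprinkle breakpoints ...
}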

It might be better down the road to not do up-front instrumentation for all prejitted methods in a module since this probably contributes to making GC stress slow. We can instead use the new lock to protect the instrument/publish dance for precompiled methods and so instrument them as they are prepared. But it will take some reworking of the code paths as the instrumentation and publishing points need to be brought closer together, and there are some variant paths here (eg IL stubs).

For tier1 rejits we are in a bind as the tier0 code may remain active even after the tier1 code is published. So we can't safely clear m_GcCover in a tier1 rejit, or update the m_GcCover info. So for now it seems best to leave m_GcCover set when we tier1 rejit and use the proposed bail out logic in SetupGcCoverage to avoid a second instrumentation. This means we will lose coverage for that rejitted method. Down the road perhaps we will need to retain multiple versions of m_GcCover.

GC Coverage is vulnerable to other cases of silent drop-out since there is currently no easy way to audit that all methods that are executed are also instrumented. Methods without coverage instrumentation typically won't provoke new GC stress failures and so won't call attention to themselves. That was happening for dynamic methods with recycled MethodDescs previously and there may be other cases I haven't found yet.

In particular I suspect R2R code never gets instrumented for GC Coverage though still need to verify this. But if true, that is also something we might want to address separately.

Similarly, profiler injected code that overrides a prejitted method won't get new instrumentation.

So to sum up, I propose:

  • As a workaround, re-introduce the early bail-out in SetupGcCoverage if m_GcCover is already set
  • Open issues for cases where this causes us to miss out on instrumenting methods
  • Look at R2R more closely and open an issue if those methods are not getting instrumented
  • Think about ways of verifying that all expected code gets instrumented

@AndyAyersMS
Member Author

test Windows_NT x64 Checked gcstress0xc
test Windows_NT x64 Checked gcstress0xc_jitstress1
test Windows_NT x64 Checked gcstress0xc_jitstress2
test Windows_NT x64 Checked gcstress0xc_minopts_heapverify1
test Windows_NT x64 Checked gcstress0xc_zapdisable
test Windows_NT x64 Checked gcstress0xc_zapdisable_heapverify1
test Windows_NT x64 Checked gcstress0xc_zapdisable_jitstress2

test Windows_NT x86 Checked gcstress0xc
test Windows_NT x86 Checked gcstress0xc_jitstress1
test Windows_NT x86 Checked gcstress0xc_jitstress2
test Windows_NT x86 Checked gcstress0xc_minopts_heapverify1
test Windows_NT x86 Checked gcstress0xc_zapdisable
test Windows_NT x86 Checked gcstress0xc_zapdisable_heapverify1
test Windows_NT x86 Checked gcstress0xc_zapdisable_jitstress2

test Ubuntu arm Cross Checked gcstress0xc Build and Test

@jkotas
Member

jkotas commented Apr 18, 2018

Sounds good.

rejits will also will need to reset m_GcCover

I do not think this is going to work, because both the old and the new method code can be executing at the same time. I think that to handle cases like these, we would need to change the m_GcCover field from MethodDesc to a hashtable with the code start address as the key, or something similar.
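
A rough sketch of the shape being suggested here, using hypothetical names (CoreCLR would use its own hash table rather than the STL):

#include <cstdint>
#include <unordered_map>

struct GCCoverageInfo;   // per-code-version coverage state, opaque here

// Keyed by code start address rather than a single pointer hanging off
// the MethodDesc, so tier0, tier1, and rejitted code can each keep their
// own coverage info while running concurrently.
using GcCoverageMap = std::unordered_map<uintptr_t /*code start*/, GCCoverageInfo*>;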

@AndyAyersMS
Member Author

Failures in gc stress legs include a few new tests failing with timeout, and the two hard-to-repro failures we've been seeing:

  • x86 gcstress 0xC: baseservices_threading._mutex_misc_waitone2 (no tracking bug yet)
  • x86 gcstress 0xC zapdisable heapverify: JIT_Methodical._ELEMENT_TYPE_IU__il_dbgu_fld (#17428)
  • and an instance of #17598 -- in an [x64 gcstress 0xC](https://ci.dot.net/job/dotnet_coreclr/job/master/job/jitstress/job/x64_checked_windows_nt_gcstress0xc_prtest/66/testReport/baseservices_threading/mutex_misc_mutexctor1_mutexctor1/_mutex_misc_mutexctor1_mutexctor1_cmd/) run

The increased number of timeouts is not too surprising given we are now holding a lock while instrumenting. However, the failing tests don't appear to be multithreaded, and in single-threaded cases the lock shouldn't add much overhead. So I'm not sure what to think of those failures.

A non-stress failure seen in [Windows_NT arm64 Cross Checked](https://ci.dot.net/job/dotnet_coreclr/job/master/job/arm64_cross_checked_windows_nt_innerloop_flow_prtest/751/) looks like it was possibly an infrastructure issue:

08:13:56 Pass: 1944 Fail: 0 Complete: 99% Start:08:13:59 Test:[mandelbrot-7.cmd_11717] JIT\Performance\CodeQuality\BenchmarksGame\mandelbrot\mandelbrot-7\mandelbrot-7.cmd
08:13:56 Pass: 1945 Fail: 0 Complete: 99% Start:08:13:59 Test:[k-nucleotide-FATAL: command execution failed
08:15:52 java.nio.channels.ClosedChannelException
08:15:52 	at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
08:15:52 	at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:179)

Not sure if the FATAL here came from the test spew or the infrastructure. Suspect the latter.

@sdmaclea

and an instance of #17598 -- in an x64 gcstress 0xC run

I thought this was one of the race types you were trying to fix.

@AndyAyersMS
Member Author

and an instance of #17598 -- in an x64 gcstress 0xC run

I thought this was one of the race types you were trying to fix.

No, I haven't looked into this "pending update" assert yet.

@BruceForstall

FWIW, while I 100% support eliminating GCStress flaky failures, when we start talking about missing coverage, such as potentially uninstrumented code, we might want to step back and think of all the ways we can improve GCStress effectiveness. One current problem: GCStress doesn't actually move objects most of the time, so bad GC info doesn't always get noticed. Another: we only instrument once, so if the hole isn't uncovered on the first run, it isn't noticed. Perhaps we could re-instrument periodically so non-first-run holes get found.

@AndyAyersMS
Member Author

Yes, looks like R2R code is bypassing GcCover instrumentation.

Opened some follow-on issues:

  • #17645 for prejitting related issues in gc stress (instrumenting NGEN 2x, R2R never)
  • #17646 for rejitting related issues in gc stress
  • #17647 for some kind of coverage validation

@BruceForstall feel free to open issues and we can collect them up and have some idea about how to prioritize.

@AndyAyersMS
Member Author

Ubuntu arm Cross Checked gcstress0xc looks like it may take up to 44 hours, so should finish sometime Friday morning.


@AndyAyersMS
Member Author

8 instances of the pending update failure in the arm 0xC stress. Also one other failure which I haven't seen before:

00:59:11 FAILED   - tracing/eventsourcetrace/eventsourcetrace/eventsourcetrace.sh
00:59:11                BEGIN EXECUTION
00:59:11                /ssd/j/workspace/dotnet_coreclr/master/jitstress/arm_cross_checked_ubuntu_gcstress0xc_tst_prtest/bin/tests/Linux.arm.Checked/Tests/Core_Root/corerun eventsourcetrace.exe
00:59:11                	Start: Enable tracing.
00:59:11                	End: Enable tracing.
00:59:11                
00:59:11                	Start: Messaging.
00:59:11                	End: Messaging.
00:59:11                
00:59:11                	Start: Disable tracing.
00:59:11                FailFast:
00:59:11                A callback was made on a garbage collected delegate of type 'System.Private.CoreLib!Microsoft.Win32.UnsafeNativeMethods+ManifestEtw+EtwEnableCallback::Invoke'.
00:59:11                
00:59:11                   at System.Diagnostics.Tracing.EventPipeInternal.Disable()
00:59:11                   at System.Diagnostics.Tracing.EventPipe.Disable()
00:59:11                   at System.RuntimeMethodHandle.InvokeMethod(System.Object, System.Object[], System.Signature, Boolean, Boolean)
00:59:11                   at System.Reflection.RuntimeMethodInfo.Invoke(System.Object, System.Reflection.BindingFlags, System.Reflection.Binder, System.Object[], System.Globalization.CultureInfo)
00:59:11                   at Tracing.Tests.EventSourceTrace.Main(System.String[])
00:59:11                ./eventsourcetrace.sh: line 244: 20300 Aborted                 $_DebuggerFullPath "$CORE_ROOT/corerun" $ExePath $CLRTestExecutionArguments
00:59:11                Expected: 100
00:59:11                Actual: 134

@AndyAyersMS
Member Author

I don't believe any of the above failures are caused by this change, and there were no spurious AVs or privileged instruction asserts.

So I think we probably should merge this.

@AndyAyersMS AndyAyersMS merged commit 571b1a7 into dotnet:master Apr 19, 2018
@AndyAyersMS AndyAyersMS deleted the GcStressAvRetryX64 branch April 19, 2018 17:17
@BruceForstall

fyi, that eventsourcetrace issue is https://github.com/dotnet/coreclr/issues/17188
