Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Commit 571b1a7

Browse files
authored
GCStress: try to reduce races and tolerate races better (#17330)
This change addresses races that cause spurious failures in when running GC stress on multithreaded applications. * Instruction update race Threads that hit a gc cover interrupt where gc is not safe can race to overrwrite the interrupt instruction and change it back to the original instruction. This can cause confusion when handling stress exceptions as the exception code raised by the kernel may be determined by disassembling the instruction that caused the fault, and this instruction may now change between the time the fault is raised and the instruction is disassembled. When this happens the kernel may report an ACCESS_VIOLATION where there was actually an attempt to execute a priveledged instruction. x86 already had a tolerance mechanism here where when gc stress was active and the exception status was ACCESS_VIOLATION the faulting instruction would be retried to see if it faults the same way again. In this change we extend this to tolerance to cover x64 and also enable it regardless of the gc mode. We use the exception information to further screen as these spurious AVs look like reads from address 0xFF..FF. * Instrumentation vs execution race The second race happens when one thread is jitting a method and another is about to call the method. The first thread finishes jitting and publishes the method code, then starts instrumenting the method for gc coverage. While this instrumentation is ongoing, the second thread then calls the method and hits a gc interrupt instruction. The code that recognizes the fault as a gc coverage interrupt gets confused as the instrumentation is not yet complete -- in particular the m_GcCover member of the MethodDesc is not yet set. So the second thread triggers an assert. The fix for this is to instrument for GcCoverage before publishing the code. Since multiple threads can be jitting a method concurrently the instrument and public steps are done under a lock to ensure that the instrumentation and code are consistent (come from the same thread). With this lock in place we have removed the secondary locking done in SetupGcCoverage as it is no longer needed; only one thread can be instrumenting a given jitted method for GcCoverage. However we retain a bailout` clause that first looks to see if m_GcCover is set and if so skips instrumentation, as there are prejit and rejit cases where we will retry instrumentation. * Instruction cache flushes In some cases when replacing the interrupt instruction with the original the instruction cache was either not flushed or not flushed with sufficient length. This possibly leads to an increased frequency of the above races. No impact expected for non-gc stress scenarios, though some of the code changes are in common code paths. Addresses the spurious GC stress failures seen in #17027 and #17610.
1 parent 204c11b commit 571b1a7

14 files changed

+237
-211
lines changed

src/inc/CrstTypes.def

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -352,6 +352,10 @@ Crst NativeImageCache
352352
Unordered
353353
End
354354

355+
Crst GCCover
356+
AcquiredBefore LoaderHeap
357+
End
358+
355359
Crst GCMemoryPressure
356360
End
357361

src/inc/crsttypes.h

Lines changed: 108 additions & 105 deletions
Original file line numberDiff line numberDiff line change
@@ -83,111 +83,112 @@ enum CrstType
8383
CrstFusionPolicyConfigPool = 64,
8484
CrstFusionSingleUse = 65,
8585
CrstFusionWarningLog = 66,
86-
CrstGCMemoryPressure = 67,
87-
CrstGlobalStrLiteralMap = 68,
88-
CrstHandleTable = 69,
89-
CrstHostAssemblyMap = 70,
90-
CrstHostAssemblyMapAdd = 71,
91-
CrstIbcProfile = 72,
92-
CrstIJWFixupData = 73,
93-
CrstIJWHash = 74,
94-
CrstILFingerprintCache = 75,
95-
CrstILStubGen = 76,
96-
CrstInlineTrackingMap = 77,
97-
CrstInstMethodHashTable = 78,
98-
CrstInterfaceVTableMap = 79,
99-
CrstInterop = 80,
100-
CrstInteropData = 81,
101-
CrstIOThreadpoolWorker = 82,
102-
CrstIsJMCMethod = 83,
103-
CrstISymUnmanagedReader = 84,
104-
CrstJit = 85,
105-
CrstJitGenericHandleCache = 86,
106-
CrstJitPerf = 87,
107-
CrstJumpStubCache = 88,
108-
CrstLeafLock = 89,
109-
CrstListLock = 90,
110-
CrstLoaderAllocator = 91,
111-
CrstLoaderAllocatorReferences = 92,
112-
CrstLoaderHeap = 93,
113-
CrstMda = 94,
114-
CrstMetadataTracker = 95,
115-
CrstModIntPairList = 96,
116-
CrstModule = 97,
117-
CrstModuleFixup = 98,
118-
CrstModuleLookupTable = 99,
119-
CrstMulticoreJitHash = 100,
120-
CrstMulticoreJitManager = 101,
121-
CrstMUThunkHash = 102,
122-
CrstNativeBinderInit = 103,
123-
CrstNativeImageCache = 104,
124-
CrstNls = 105,
125-
CrstNotifyGdb = 106,
126-
CrstObjectList = 107,
127-
CrstOnEventManager = 108,
128-
CrstPatchEntryPoint = 109,
129-
CrstPEFileSecurityManager = 110,
130-
CrstPEImage = 111,
131-
CrstPEImagePDBStream = 112,
132-
CrstPendingTypeLoadEntry = 113,
133-
CrstPinHandle = 114,
134-
CrstPinnedByrefValidation = 115,
135-
CrstProfilerGCRefDataFreeList = 116,
136-
CrstProfilingAPIStatus = 117,
137-
CrstPublisherCertificate = 118,
138-
CrstRCWCache = 119,
139-
CrstRCWCleanupList = 120,
140-
CrstRCWRefCache = 121,
141-
CrstReadyToRunEntryPointToMethodDescMap = 122,
142-
CrstReDacl = 123,
143-
CrstReflection = 124,
144-
CrstReJITDomainTable = 125,
145-
CrstReJITGlobalRequest = 126,
146-
CrstReJITSharedDomainTable = 127,
147-
CrstRemoting = 128,
148-
CrstRetThunkCache = 129,
149-
CrstRWLock = 130,
150-
CrstSavedExceptionInfo = 131,
151-
CrstSaveModuleProfileData = 132,
152-
CrstSecurityPolicyCache = 133,
153-
CrstSecurityPolicyInit = 134,
154-
CrstSecurityStackwalkCache = 135,
155-
CrstSharedAssemblyCreate = 136,
156-
CrstSharedBaseDomain = 137,
157-
CrstSigConvert = 138,
158-
CrstSingleUseLock = 139,
159-
CrstSpecialStatics = 140,
160-
CrstSqmManager = 141,
161-
CrstStackSampler = 142,
162-
CrstStressLog = 143,
163-
CrstStrongName = 144,
164-
CrstStubCache = 145,
165-
CrstStubDispatchCache = 146,
166-
CrstStubUnwindInfoHeapSegments = 147,
167-
CrstSyncBlockCache = 148,
168-
CrstSyncHashLock = 149,
169-
CrstSystemBaseDomain = 150,
170-
CrstSystemDomain = 151,
171-
CrstSystemDomainDelayedUnloadList = 152,
172-
CrstThreadIdDispenser = 153,
173-
CrstThreadpoolEventCache = 154,
174-
CrstThreadpoolTimerQueue = 155,
175-
CrstThreadpoolWaitThreads = 156,
176-
CrstThreadpoolWorker = 157,
177-
CrstThreadStaticDataHashTable = 158,
178-
CrstThreadStore = 159,
179-
CrstTPMethodTable = 160,
180-
CrstTypeEquivalenceMap = 161,
181-
CrstTypeIDMap = 162,
182-
CrstUMEntryThunkCache = 163,
183-
CrstUMThunkHash = 164,
184-
CrstUniqueStack = 165,
185-
CrstUnresolvedClassLock = 166,
186-
CrstUnwindInfoTableLock = 167,
187-
CrstVSDIndirectionCellLock = 168,
188-
CrstWinRTFactoryCache = 169,
189-
CrstWrapperTemplate = 170,
190-
kNumberOfCrstTypes = 171
86+
CrstGCCover = 67,
87+
CrstGCMemoryPressure = 68,
88+
CrstGlobalStrLiteralMap = 69,
89+
CrstHandleTable = 70,
90+
CrstHostAssemblyMap = 71,
91+
CrstHostAssemblyMapAdd = 72,
92+
CrstIbcProfile = 73,
93+
CrstIJWFixupData = 74,
94+
CrstIJWHash = 75,
95+
CrstILFingerprintCache = 76,
96+
CrstILStubGen = 77,
97+
CrstInlineTrackingMap = 78,
98+
CrstInstMethodHashTable = 79,
99+
CrstInterfaceVTableMap = 80,
100+
CrstInterop = 81,
101+
CrstInteropData = 82,
102+
CrstIOThreadpoolWorker = 83,
103+
CrstIsJMCMethod = 84,
104+
CrstISymUnmanagedReader = 85,
105+
CrstJit = 86,
106+
CrstJitGenericHandleCache = 87,
107+
CrstJitPerf = 88,
108+
CrstJumpStubCache = 89,
109+
CrstLeafLock = 90,
110+
CrstListLock = 91,
111+
CrstLoaderAllocator = 92,
112+
CrstLoaderAllocatorReferences = 93,
113+
CrstLoaderHeap = 94,
114+
CrstMda = 95,
115+
CrstMetadataTracker = 96,
116+
CrstModIntPairList = 97,
117+
CrstModule = 98,
118+
CrstModuleFixup = 99,
119+
CrstModuleLookupTable = 100,
120+
CrstMulticoreJitHash = 101,
121+
CrstMulticoreJitManager = 102,
122+
CrstMUThunkHash = 103,
123+
CrstNativeBinderInit = 104,
124+
CrstNativeImageCache = 105,
125+
CrstNls = 106,
126+
CrstNotifyGdb = 107,
127+
CrstObjectList = 108,
128+
CrstOnEventManager = 109,
129+
CrstPatchEntryPoint = 110,
130+
CrstPEFileSecurityManager = 111,
131+
CrstPEImage = 112,
132+
CrstPEImagePDBStream = 113,
133+
CrstPendingTypeLoadEntry = 114,
134+
CrstPinHandle = 115,
135+
CrstPinnedByrefValidation = 116,
136+
CrstProfilerGCRefDataFreeList = 117,
137+
CrstProfilingAPIStatus = 118,
138+
CrstPublisherCertificate = 119,
139+
CrstRCWCache = 120,
140+
CrstRCWCleanupList = 121,
141+
CrstRCWRefCache = 122,
142+
CrstReadyToRunEntryPointToMethodDescMap = 123,
143+
CrstReDacl = 124,
144+
CrstReflection = 125,
145+
CrstReJITDomainTable = 126,
146+
CrstReJITGlobalRequest = 127,
147+
CrstReJITSharedDomainTable = 128,
148+
CrstRemoting = 129,
149+
CrstRetThunkCache = 130,
150+
CrstRWLock = 131,
151+
CrstSavedExceptionInfo = 132,
152+
CrstSaveModuleProfileData = 133,
153+
CrstSecurityPolicyCache = 134,
154+
CrstSecurityPolicyInit = 135,
155+
CrstSecurityStackwalkCache = 136,
156+
CrstSharedAssemblyCreate = 137,
157+
CrstSharedBaseDomain = 138,
158+
CrstSigConvert = 139,
159+
CrstSingleUseLock = 140,
160+
CrstSpecialStatics = 141,
161+
CrstSqmManager = 142,
162+
CrstStackSampler = 143,
163+
CrstStressLog = 144,
164+
CrstStrongName = 145,
165+
CrstStubCache = 146,
166+
CrstStubDispatchCache = 147,
167+
CrstStubUnwindInfoHeapSegments = 148,
168+
CrstSyncBlockCache = 149,
169+
CrstSyncHashLock = 150,
170+
CrstSystemBaseDomain = 151,
171+
CrstSystemDomain = 152,
172+
CrstSystemDomainDelayedUnloadList = 153,
173+
CrstThreadIdDispenser = 154,
174+
CrstThreadpoolEventCache = 155,
175+
CrstThreadpoolTimerQueue = 156,
176+
CrstThreadpoolWaitThreads = 157,
177+
CrstThreadpoolWorker = 158,
178+
CrstThreadStaticDataHashTable = 159,
179+
CrstThreadStore = 160,
180+
CrstTPMethodTable = 161,
181+
CrstTypeEquivalenceMap = 162,
182+
CrstTypeIDMap = 163,
183+
CrstUMEntryThunkCache = 164,
184+
CrstUMThunkHash = 165,
185+
CrstUniqueStack = 166,
186+
CrstUnresolvedClassLock = 167,
187+
CrstUnwindInfoTableLock = 168,
188+
CrstVSDIndirectionCellLock = 169,
189+
CrstWinRTFactoryCache = 170,
190+
CrstWrapperTemplate = 171,
191+
kNumberOfCrstTypes = 172
191192
};
192193

193194
#endif // __CRST_TYPES_INCLUDED
@@ -265,6 +266,7 @@ int g_rgCrstLevelMap[] =
265266
4, // CrstFusionPolicyConfigPool
266267
5, // CrstFusionSingleUse
267268
6, // CrstFusionWarningLog
269+
3, // CrstGCCover
268270
0, // CrstGCMemoryPressure
269271
11, // CrstGlobalStrLiteralMap
270272
1, // CrstHandleTable
@@ -441,6 +443,7 @@ LPCSTR g_rgCrstNameMap[] =
441443
"CrstFusionPolicyConfigPool",
442444
"CrstFusionSingleUse",
443445
"CrstFusionWarningLog",
446+
"CrstGCCover",
444447
"CrstGCMemoryPressure",
445448
"CrstGlobalStrLiteralMap",
446449
"CrstHandleTable",

src/inc/switches.h

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -114,6 +114,12 @@
114114
#define HAVE_GCCOVER
115115
#endif
116116

117+
// Some platforms may see spurious AVs when GcCoverage is enabled because of races.
118+
// Enable further processing to see if they recur.
119+
#if defined(HAVE_GCCOVER) && (defined(_TARGET_X86_) || defined(_TARGET_AMD64_)) && !defined(FEATURE_PAL)
120+
#define GCCOVER_TOLERATE_SPURIOUS_AV
121+
#endif
122+
117123
//Turns on a startup delay to allow simulation of slower and faster startup times.
118124
#define ENABLE_STARTUP_DELAY
119125

src/vm/ceemain.cpp

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1124,6 +1124,10 @@ void EEStartupHelper(COINITIEE fFlags)
11241124

11251125
#endif // _DEBUG
11261126

1127+
#ifdef HAVE_GCCOVER
1128+
MethodDesc::Init();
1129+
#endif
1130+
11271131
#endif // !CROSSGEN_COMPILE
11281132

11291133
ErrExit: ;

src/vm/dynamicmethod.cpp

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -280,6 +280,11 @@ DynamicMethodDesc* DynamicMethodTable::GetDynamicMethod(BYTE *psig, DWORD sigSiz
280280
pNewMD->m_pszDebugClassName = (LPUTF8)"dynamicclass";
281281
pNewMD->m_pszDebugMethodSignature = "DynamicMethod Signature not available";
282282
#endif // _DEBUG
283+
284+
#ifdef HAVE_GCCOVER
285+
pNewMD->m_GcCover = NULL;
286+
#endif
287+
283288
pNewMD->SetNotInline(TRUE);
284289
pNewMD->GetLCGMethodResolver()->Reset();
285290

src/vm/excep.cpp

Lines changed: 21 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -6811,34 +6811,39 @@ DWORD GetGcMarkerExceptionCode(LPVOID ip)
68116811
}
68126812

68136813
// Did we hit an DO_A_GC_HERE marker in JITted code?
6814-
bool IsGcMarker(DWORD exceptionCode, CONTEXT *pContext)
6814+
bool IsGcMarker(CONTEXT* pContext, EXCEPTION_RECORD *pExceptionRecord)
68156815
{
6816+
DWORD exceptionCode = pExceptionRecord->ExceptionCode;
68166817
#ifdef HAVE_GCCOVER
68176818
WRAPPER_NO_CONTRACT;
68186819

68196820
if (GCStress<cfg_any>::IsEnabled())
68206821
{
6821-
#ifdef _TARGET_X86_
6822-
// on x86 we can't suspend EE to update the GC marker instruction so
6823-
// we update it directly without suspending. this can sometimes yield
6824-
// a STATUS_ACCESS_VIOLATION instead of STATUS_CLR_GCCOVER_CODE. in
6825-
// this case we let the AV through and retry the instruction. we'll
6826-
// track the IP of the instruction that generated an AV so we don't
6827-
// mix up a real AV with a "fake" AV.
6828-
// see comments in function DoGcStress for more details on this race.
6829-
// also make sure that the thread is actually in managed code since AVs
6830-
// outside of of JIT code will never be potential GC markers
6822+
#if defined(GCCOVER_TOLERATE_SPURIOUS_AV)
6823+
6824+
// We sometimes can't suspend the EE to update the GC marker instruction so
6825+
// we update it directly without suspending. This can sometimes yield
6826+
// a STATUS_ACCESS_VIOLATION instead of STATUS_CLR_GCCOVER_CODE. In
6827+
// this case we let the AV through and retry the instruction as hopefully
6828+
// the race will have resolved. We'll track the IP of the instruction
6829+
// that generated an AV so we don't mix up a real AV with a "fake" AV.
6830+
//
6831+
// See comments in function DoGcStress for more details on this race.
6832+
//
6833+
// Note these "fake" AVs will be reported by the kernel as reads from
6834+
// address 0xF...F so we also use that as a screen.
68316835
Thread* pThread = GetThread();
68326836
if (exceptionCode == STATUS_ACCESS_VIOLATION &&
68336837
GCStress<cfg_instr>::IsEnabled() &&
6838+
pExceptionRecord->ExceptionInformation[0] == 0 &&
6839+
pExceptionRecord->ExceptionInformation[1] == ~0 &&
68346840
pThread->GetLastAVAddress() != (LPVOID)GetIP(pContext) &&
6835-
pThread->PreemptiveGCDisabled() &&
68366841
!IsIPInEE((LPVOID)GetIP(pContext)))
68376842
{
68386843
pThread->SetLastAVAddress((LPVOID)GetIP(pContext));
68396844
return true;
68406845
}
6841-
#endif // _TARGET_X86_
6846+
#endif // defined(GCCOVER_TOLERATE_SPURIOUS_AV)
68426847

68436848
if (exceptionCode == STATUS_CLR_GCCOVER_CODE)
68446849
{
@@ -7737,7 +7742,7 @@ VEH_ACTION WINAPI CLRVectoredExceptionHandlerPhase3(PEXCEPTION_POINTERS pExcepti
77377742
// NOTE: this is effectively ifdef (_TARGET_AMD64_ || _TARGET_ARM_), and does not actually trigger
77387743
// a GC. This will redirect the exception context to a stub which will
77397744
// push a frame and cause GC.
7740-
if (IsGcMarker(exceptionCode, pContext))
7745+
if (IsGcMarker(pContext, pExceptionRecord))
77417746
{
77427747
return VEH_CONTINUE_EXECUTION;;
77437748
}
@@ -8161,11 +8166,11 @@ LONG WINAPI CLRVectoredExceptionHandlerShim(PEXCEPTION_POINTERS pExceptionInfo)
81618166

81628167
#ifdef USE_REDIRECT_FOR_GCSTRESS
81638168
// This is AMD64 & ARM specific as the macro above is defined for AMD64 & ARM only
8164-
bIsGCMarker = IsGcMarker(dwCode, pExceptionInfo->ContextRecord);
8169+
bIsGCMarker = IsGcMarker(pExceptionInfo->ContextRecord, pExceptionInfo->ExceptionRecord);
81658170
#elif defined(_TARGET_X86_) && defined(HAVE_GCCOVER)
81668171
// This is the equivalent of the check done in COMPlusFrameHandler, incase the exception is
81678172
// seen by VEH first on x86.
8168-
bIsGCMarker = IsGcMarker(dwCode, pExceptionInfo->ContextRecord);
8173+
bIsGCMarker = IsGcMarker(pExceptionInfo->ContextRecord, pExceptionInfo->ExceptionRecord);
81698174
#endif // USE_REDIRECT_FOR_GCSTRESS
81708175

81718176
// Do not update the TLS with exception details for exceptions pertaining to GCStress

src/vm/excep.h

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -766,7 +766,7 @@ void CPFH_AdjustContextForThreadSuspensionRace(T_CONTEXT *pContext, Thread *pThr
766766
#endif // _TARGET_X86_
767767

768768
DWORD GetGcMarkerExceptionCode(LPVOID ip);
769-
bool IsGcMarker(DWORD exceptionCode, T_CONTEXT *pContext);
769+
bool IsGcMarker(T_CONTEXT *pContext, EXCEPTION_RECORD *pExceptionRecord);
770770

771771
void InitSavedExceptionInfo();
772772

src/vm/exceptionhandling.cpp

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -936,7 +936,7 @@ ProcessCLRException(IN PEXCEPTION_RECORD pExceptionRecord
936936
//
937937
{
938938
#ifndef USE_REDIRECT_FOR_GCSTRESS
939-
if (IsGcMarker(pExceptionRecord->ExceptionCode, pContextRecord))
939+
if (IsGcMarker(pContextRecord, pExceptionRecord))
940940
{
941941
returnDisposition = ExceptionContinueExecution;
942942
goto lExit;
@@ -5227,7 +5227,7 @@ BOOL HandleHardwareException(PAL_SEHException* ex)
52275227
// A hardware exception is handled only if it happened in a jitted code or
52285228
// in one of the JIT helper functions (JIT_MemSet, ...)
52295229
PCODE controlPc = GetIP(ex->GetContextRecord());
5230-
if (ExecutionManager::IsManagedCode(controlPc) && IsGcMarker(ex->GetExceptionRecord()->ExceptionCode, ex->GetContextRecord()))
5230+
if (ExecutionManager::IsManagedCode(controlPc) && IsGcMarker(ex->GetContextRecord(), ex->GetExceptionRecord()))
52315231
{
52325232
// Exception was handled, let the signal handler return to the exception context. Some registers in the context can
52335233
// have been modified by the GC.

0 commit comments

Comments
 (0)