Skip to content

[release/9.0] Fix edge cases in Tarjan GC bridge (Android) #114391

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 11, 2025

Conversation

filipnavara
Copy link
Member

@filipnavara filipnavara commented Apr 8, 2025

Backport of #112825, #112970, #113044 and #113703 to release/9.0-staging

/cc @BrzVlad @vitek-karas

Customer Impact

  • Customer reported
  • Found internally

Two customer issues on Android were traced back to bugs in the MonoVM Tarjan GC bridge responsible for bridging .NET and Java garbage collectors.

In the first issue (#115611) the runtime failed to report edges to the Android GC bridge code, resulting in alive objects collected in the Java runtime and their .NET peers throwing ObjectDisposedException. The particular pattern could have been triggered by code using await on the UI thread where the continuation was scheduled to run through synchronization context on the same thread.

In the second issue (#106410) the runtime crashed on an assertion for object graphs of certain shape. The cause of the crash was an incorrect assumption about duplicate edges in a compressed object graph and how later stages of the algorithm process them. When the later stages of the processing deduplicated the edges it may have resulted in an unforseen change to some of the graph properties leading to a logic error in the algorithm.

Regression

  • Yes
  • No

Both of these bugs existed in .NET MonoVM / Android code since the inception. Legacy Xamarin.Android offered 3 different bridge algorithms (old, new, and tarjan) with the new bridge being the default for a while. The default was changed to tarjan before MonoVM was merged into .NET.

Testing

Test cases reproducing both issues were presented and verified against the fix. Additionally, a new testing code was introduced to specifically test the problematic patterns as part of .NET runtime testing infrastructure.

Risk

Medium - the changes simply make reporting the graph more accurate. Even though we have high confidence in the changes, we recognize it does touch a sensitive area.

filipnavara and others added 4 commits April 8, 2025 19:12
…otnet#112970)

Fix typo in GC bridge comparison message (SCCS -> XREFS)
* [mono][sgen] Fix DUMP_GRAPH debug option build for tarjan bridge

* [mono][sgen] Don't create ScanData* during debug dumping of SCCs

It serves no purpose and it would later crash the runtime since we didn't patch the lockword back in place.

* [mono][sgen] Fix some null deref crashes in DUMP_GRAPH debug option

* [mono][tests] Add bridge tests

These are ported from some of the bridge tests we had on mono/mono. In order to test them we compare between the output of the new and the tarjan bridge.
…formation (dotnet#112825)

* Fix an edge case in the Tarjan SCC that lead to losing xref information

In the Tarjan SCC bridge processing there's a color graph used to find out
connections between SCCs. There was a rare case which only manifested when
a cycle in the object graph points to another cycle that points to a bridge
object. We only recognized direct bridge pointers but not pointers to other
non-bridge SCCs that in turn point to bridges and where we already calculated
the xrefs. These xrefs were then lost.

* Add test case to sgen-bridge-pathologies and add an assert to catch the original bug

* Add review

---------

Co-authored-by: Vlad Brezae <[email protected]>
…duplication (dotnet#113044)

* [SGen/Tarjan] Handle edge case with node heaviness changing due to deduplication

Do early deduplication

Fix Windows build

Add test cases to sgen-bridge-pathologies

* Move test code

* Remove old code

* Add extra check (no change to functionality)
@steveisok steveisok removed the request for review from lambdageek April 10, 2025 10:53
Copy link
Member

@jeffschwMSFT jeffschwMSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. we will take for consideration in 9.0.x

@jeffschwMSFT jeffschwMSFT added this to the 9.0.x milestone Apr 10, 2025
@leecow leecow added Servicing-approved Approved for servicing release and removed Servicing-consider Issue for next servicing release review labels Apr 10, 2025
@leecow leecow modified the milestones: 9.0.x, 9.0.5 Apr 10, 2025
@akoeplinger
Copy link
Member

@filipnavara there's a failure in the new test on wasm:

    GC/Features/Bridge/BridgeTester/BridgeTester.sh [FAIL]
      
      Return code:      1
      Raw output file:      /root/helix/work/workitem/uploads/Reports/GC.Features/Bridge/BridgeTester/BridgeTester.output.txt
      Raw output:
      BEGIN EXECUTION
      MSBuild version 17.12.27+e0b90a9a8 for .NET
        AppDir: /root/helix/work/workitem/e/GC/Features/Bridge/BridgeTester/WasmApp/
        TestBinDir: /root/helix/work/workitem/e/GC/Features/Bridge/BridgeTester
        ArtifactsBinDir: 
        Generated app bundle at /root/helix/work/workitem/e/GC/Features/Bridge/BridgeTester/WasmApp/
      Warning: unknown flag --expose_wasm.
      Try --help for options
      Incoming arguments: --run BridgeTester.dll
      Application arguments: --run BridgeTester.dll
      console.info: Initializing dotnet version 9.0.5 commit hash badb953d260984e8897ad6a65194e5bf1f27a078
      System.ArgumentNullException: Value cannot be null. (Parameter 'path1')
         at System.ArgumentNullException.Throw(String paramName) in /_/src/libraries/System.Private.CoreLib/src/System/ArgumentNullException.cs:line 97
         at System.ArgumentNullException.ThrowIfNull(Object argument, String paramName) in /_/src/libraries/System.Private.CoreLib/src/System/ArgumentNullException.cs:line 59
         at System.IO.Path.Combine(String path1, String path2) in /_/src/libraries/System.Private.CoreLib/src/System/IO/Path.cs:line 346
         at BridgeTester.RunTests()
         at __GeneratedMainWrapper.Main()

Not sure why it doesn't happen in main

@filipnavara
Copy link
Member Author

Would it make sense to disable the test on WASM to unblock us? WASM doesn't use the GC bridge anyway.

(I am currently in middle of moving our office so I don't know when I will be able to address it.)

@akoeplinger
Copy link
Member

akoeplinger commented Apr 11, 2025

Sure, I can handle it.

@carlossanlop
Copy link
Contributor

Friendly reminder that code complete for the may release is next Monday April 14th. If you want this change included in that release, please merge the PR before EOD Monday.

@akoeplinger
Copy link
Member

/ba-g failures in the wasm tests are unrelated, it doesn't even compile this code.

@akoeplinger akoeplinger merged commit 703efd5 into dotnet:release/9.0-staging Apr 11, 2025
120 of 124 checks passed
@simonrozsival
Copy link
Member

I can no longer reproduce the original bug reported by a customer in #115611 using their repro project with the latest changes.

steveisok added a commit that referenced this pull request Apr 14, 2025
steveisok added a commit that referenced this pull request Apr 14, 2025
…114391)" (#114641)

This reverts commit 703efd5.

Reverting due to assertions found in Android SDK testing.
@github-actions github-actions bot locked and limited conversation to collaborators May 15, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-GC-mono community-contribution Indicates that the PR has been added by a community member os-android Servicing-approved Approved for servicing release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants