
[lld-macho] Fix branch extension thunk estimation logic #120529


Merged
merged 6 commits into llvm:main
Jan 9, 2025

Conversation

alx32
Contributor

@alx32 alx32 commented Dec 19, 2024

This patch improves the linker’s ability to estimate stub reachability in the TextOutputSection::estimateStubsInRangeVA function. It does so by including thunks that have already been placed ahead of the current call site address when calculating the threshold for direct stub calls.

Before this fix, the estimation process overlooked existing forward thunks. This could result in some thunks not being inserted where needed. In rare situations, particularly with large and specially arranged codebases, this might lead to branch instructions being out of range, causing linking errors.

Although this patch successfully addresses the problem, it is not feasible to create a test for this issue. The specific layout and order of thunk creation required to reproduce the corner case are too complex, making test creation impractical.

Example error messages the issue could generate:

ld64.lld: error: banana.o:(symbol OUTLINED_FUNCTION_24949_3875): relocation BRANCH26 is out of range: 134547892 is not in [-134217728, 134217727]; references objc_autoreleaseReturnValue
ld64.lld: error: main.o:(symbol _main+0xc): relocation BRANCH26 is out of range: 134544132 is not in [-134217728, 134217727]; references objc_release
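To make the arithmetic concrete, below is a minimal standalone sketch (not lld code; every constant is hypothetical) of how budgeting for the already-placed forward thunks raises the stubsInRangeVA threshold:

```
// Standalone illustration of the estimation change; all values are made up.
// Only the shape of the formula mirrors the patch.
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t forwardBranchRange = 0x8000000; // ~128 MiB forward branch reach
  const uint64_t isecEnd = 0x7f00000;            // estimated end of the text inputs
  const uint64_t thunkSize = 12;                 // size of one branch-extension thunk
  const uint64_t stubsSize = 0x40000;            // size of the stubs section
  const uint64_t maxPotentialThunks = 90000;     // thunks that may still be created
  const uint64_t existingForwardThunks = 20000;  // thunks already placed past callIdx

  // Old estimate: ignores thunks already inserted ahead of the call site.
  uint64_t oldVA =
      isecEnd + maxPotentialThunks * thunkSize + stubsSize - forwardBranchRange;
  // New estimate: also budgets for the already-placed forward thunks, pushing
  // the "safe to call stubs directly" threshold higher (more conservative).
  uint64_t newVA = isecEnd +
                   (maxPotentialThunks + existingForwardThunks) * thunkSize +
                   stubsSize - forwardBranchRange;

  printf("old stubsInRangeVA estimate: %#llx\n", (unsigned long long)oldVA);
  printf("new stubsInRangeVA estimate: %#llx\n", (unsigned long long)newVA);
  return 0;
}
```

Call sites below the threshold keep getting thunks, so a higher (more conservative) estimate avoids the out-of-range branches shown above.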

@alx32 alx32 marked this pull request as ready for review December 19, 2024 07:46
@llvmbot
Member

llvmbot commented Dec 19, 2024

@llvm/pr-subscribers-lld

Author: None (alx32)

Changes

This patch improves the linker’s ability to estimate stub reachability in the TextOutputSection::estimateStubsInRangeVA function. It does so by including thunks that have already been placed ahead of the current call site address when calculating the threshold for direct stub calls.

Before this fix, the estimation process overlooked existing forward thunks. This could result in some thunks not being inserted where needed. In rare situations, particularly with large and specially arranged codebases, this might lead to branch instructions being out of range, causing linking errors.

Although this patch successfully addresses the problem, it is not feasible to create a test for this issue. The specific layout and order of thunk creation required to reproduce the corner case are too complex, making test creation impractical.


Full diff: https://github.com/llvm/llvm-project/pull/120529.diff

1 File Affected:

  • (modified) lld/MachO/ConcatOutputSection.cpp (+24-2)
diff --git a/lld/MachO/ConcatOutputSection.cpp b/lld/MachO/ConcatOutputSection.cpp
index 6a9301f84a03ef..9610a80e309bac 100644
--- a/lld/MachO/ConcatOutputSection.cpp
+++ b/lld/MachO/ConcatOutputSection.cpp
@@ -184,6 +184,15 @@ uint64_t TextOutputSection::estimateStubsInRangeVA(size_t callIdx) const {
     InputSection *isec = inputs[i];
     isecEnd = alignToPowerOf2(isecEnd, isec->align) + isec->getSize();
   }
+
+  // Tally up any thunks that have already been placed that have address higher
+  // than the equivalent callIdx. We first find the index of the first thunk
+  // that is beyond the current inputs[callIdx].
+  auto itPostcallIdxThunks = std::partition_point(
+      thunks.begin(), thunks.end(),
+      [isecVA](const ConcatInputSection *t) { return t->getVA() <= isecVA; });
+  uint32_t existingForwardThunks = thunks.end() - itPostcallIdxThunks;
+
   // Estimate the address after which call sites can safely call stubs
   // directly rather than through intermediary thunks.
   uint64_t forwardBranchRange = target->forwardBranchRange;
@@ -191,8 +200,21 @@ uint64_t TextOutputSection::estimateStubsInRangeVA(size_t callIdx) const {
          "should not run thunk insertion if all code fits in jump range");
   assert(isecEnd - isecVA <= forwardBranchRange &&
          "should only finalize sections in jump range");
-  uint64_t stubsInRangeVA = isecEnd + maxPotentialThunks * target->thunkSize +
-                            in.stubs->getSize() - forwardBranchRange;
+
+  // Estimate the maximum size of the code, right before the stubs section
+  uint32_t maxTextSize = 0;
+  // Add the size of all the inputs, including the unprocessed ones.
+  maxTextSize += isecEnd;
+
+  // Add the size of the thunks that may be created in the future
+  maxTextSize += maxPotentialThunks * target->thunkSize;
+
+  // Add the size of the thunks that have already been created that are ahead
+  maxTextSize += existingForwardThunks * target->thunkSize;
+
+  uint64_t stubsInRangeVA =
+      maxTextSize + in.stubs->getSize() - forwardBranchRange;
+
   log("thunks = " + std::to_string(thunkMap.size()) +
       ", potential = " + std::to_string(maxPotentialThunks) +
       ", stubs = " + std::to_string(in.stubs->getSize()) + ", isecVA = " +

@llvmbot
Member

llvmbot commented Dec 19, 2024

@llvm/pr-subscribers-lld-macho

@kyulee-com kyulee-com requested a review from smeenai December 19, 2024 18:11
@kyulee-com
Contributor

@int3 can you take a look when you're available? This is a critical fix to deal with large binaries.

// Tally up any thunks that have already been placed that have address higher
// than the equivalent callIdx. We first find the index of the first thunk
// that is beyond the current inputs[callIdx].
auto itPostcallIdxThunks = std::partition_point(
Contributor

I think we can use llvm::partition_point()
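For reference, a minimal sketch of what that looks like: llvm::partition_point (from llvm/ADT/STLExtras.h) is a range-based wrapper around std::partition_point, so the iterator pair can be dropped. The container and values below are simplified stand-ins, not lld's types, and assume the LLVM headers are available:

```
#include "llvm/ADT/STLExtras.h"

#include <algorithm>
#include <cstdint>
#include <vector>

int main() {
  // Pretend these are thunk VAs, already sorted in ascending order.
  std::vector<uint64_t> thunkVAs = {0x1000, 0x2000, 0x3000, 0x4000};
  const uint64_t isecVA = 0x2800;

  // Iterator form currently in the patch:
  auto it1 = std::partition_point(
      thunkVAs.begin(), thunkVAs.end(),
      [isecVA](uint64_t va) { return va <= isecVA; });
  // Range form suggested here:
  auto it2 = llvm::partition_point(
      thunkVAs, [isecVA](uint64_t va) { return va <= isecVA; });

  // Both point at the first VA beyond isecVA (0x3000 in this example).
  return it1 == it2 ? 0 : 1;
}
```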

Contributor Author

@alx32 alx32 Dec 19, 2024

Do you mean another form of it? We're already using partition_point.

EDIT: I see ... std=>llvm

@alx32 alx32 requested review from gkmhub and nico December 19, 2024 18:27
@alx32
Contributor Author

alx32 commented Dec 19, 2024

@nico - could you have a look at this?
Your previous comment here is relevant:

Yes, this computes the end address of all sections. After the loop, isecEnd points right after where the last section would end up, assuming we don't insert thunk sections anywhere along the way.

This fix addresses the "assuming we don't insert thunk sections anywhere along the way" part - we do end up inserting thunks, so this fix ensures we count them.
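For illustration, here is a tiny standalone sketch of the accumulation described in that quote; the sizes, alignments, and the local alignToPowerOf2 helper are hypothetical stand-ins, not lld's code:

```
#include <cstdint>
#include <vector>

// Round value up to the next multiple of align (align must be a power of two).
static uint64_t alignToPowerOf2(uint64_t value, uint64_t align) {
  return (value + align - 1) & ~(align - 1);
}

struct FakeSection {
  uint64_t size;
  uint64_t align;
};

int main() {
  // Remaining (not yet finalized) input sections.
  std::vector<FakeSection> inputs = {{0x34, 16}, {0x110, 4}, {0x8, 8}};
  uint64_t isecEnd = 0x1000; // end of the sections finalized so far
  for (const FakeSection &isec : inputs)
    isecEnd = alignToPowerOf2(isecEnd, isec.align) + isec.size;
  // isecEnd now points right past where the last section would end, assuming
  // no thunks get inserted along the way - the assumption the patch
  // compensates for by counting existing forward thunks.
  return 0;
}
```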

// Estimated maximum VA of last stub.
uint64_t maxVAOfLastStub = maxTextSize + in.stubs->getSize();

// Calculate the first address that is guaranteed to not need a thunk to
Contributor

Maybe use (move) the original comment on Line 197 here.

Comment on lines +188 to +196
// Tally up any thunks that have already been placed that have VA higher than
// inputs[callIdx]. First, find the index of the first thunk that is beyond
// the current inputs[callIdx].
auto itPostcallIdxThunks =
    llvm::partition_point(thunks, [isecVA](const ConcatInputSection *t) {
      return t->getVA() <= isecVA;
    });
uint64_t existingForwardThunks = thunks.end() - itPostcallIdxThunks;

Contributor

IIUC, the thunks used here are derived from the thunkMap (see 1), and thunkMap is iterated above (lines 171-179). One should be able to figure out this value during the previous iteration and avoid the partition_point call.

Something like:

// Tally the functions which still have call sites remaining to process,
// which yields the maximum number of thunks we might yet place.
size_t maxPotentialThunks = 0;
// Tally up any thunks that have already been placed that have VA higher than
// inputs[callIdx].
size_t existingForwardThunks = 0;
for (auto &tp : thunkMap) {
  ThunkInfo &ti = tp.second;
  // This overcounts: Only sections that are in forward jump range from the
  // currently-active section get finalized, and all input sections are
  // finalized when estimateStubsInRangeVA() is called. So only backward
  // jumps will need thunks, but we count all jumps.
  if (ti.callSitesUsed < ti.callSiteCount)
    maxPotentialThunks += 1;
  if (ti.isec->getVA() <= isecVA) // it might be `>` in the conditional, not sure
    existingForwardThunks += 1;
}

Because of the dependency on isecVA, the code needs to move below the calculation of isecVA (more or less where the new code is added in this PR).

(It also looks like we might be double counting in maxPotentialThunks and existingForwardThunks, but there might be some not-easy-to-see dependency in how thunkMap is used and when estimateStubsInRangeVA is called that avoids that situation).
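For concreteness, here is a self-contained sketch of that single-pass counting idea; ThunkInfo, thunkMap, and all values are hypothetical stand-ins rather than lld's types, and the comparison direction is the one the suggestion flags as uncertain:

```
#include <cstdint>
#include <cstdio>
#include <map>

struct ThunkInfo {
  uint32_t callSitesUsed;
  uint32_t callSiteCount;
  uint64_t isecVA; // stand-in for ti.isec->getVA()
};

int main() {
  std::map<int, ThunkInfo> thunkMap = {
      {0, {2, 5, 0x1000}}, // call sites remaining -> potential future thunk
      {1, {3, 3, 0x9000}}, // fully used, placed beyond isecVA -> existing forward thunk
      {2, {1, 1, 0x0800}}, // fully used, placed before isecVA -> neither bucket
  };
  const uint64_t isecVA = 0x4000;

  size_t maxPotentialThunks = 0;
  size_t existingForwardThunks = 0;
  for (const auto &tp : thunkMap) {
    const ThunkInfo &ti = tp.second;
    if (ti.callSitesUsed < ti.callSiteCount)
      maxPotentialThunks += 1;
    else if (ti.isecVA > isecVA)    // '>' matches the "forward" naming; the
      existingForwardThunks += 1;   // suggestion notes the direction is unsure
  }
  printf("potential=%zu existingForward=%zu\n", maxPotentialThunks,
         existingForwardThunks);
  return 0;
}
```

The merged patch keeps the partition_point approach instead, as discussed below.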

Contributor Author

One should be able to figure out this value during the previous iteration and avoid the partition_point call.

This is true, but I think this might be less performant. The for loop iterates over all thunks, of which there can be many (100k+). That means adding a branch that executes 100k times.

Computing existingForwardThunks with partition_point is O(log n), so for ~100k input thunks it will do fewer than 20 iterations and execute a few hundred instructions.

Generally we favor more performant code for LLD - even if here it probably won't really be measurable.
Or do we want this for easier code readability / understandability?

It also looks like we might be double counting in maxPotentialThunks and existingForwardThunks, but there might be some not-easy-to-see dependency in how thunkMap is used and when estimateStubsInRangeVA is called that avoids that situation

This is basically the issue being fixed. maxPotentialThunks is intended to (and appears to) contain all the possible thunks, so adding in existingForwardThunks would appear to be double counting. But that's exactly the original bug: maxPotentialThunks does not include all the thunks, so we need to count them all by adding existingForwardThunks.

Contributor

I did not have an intuition for the total number of entries in thunkMap. If we are talking about 100k or numbers of that order, doing only ~16 comparisons is probably always going to be quicker.

My comment about double counting is because of the name maxPotentialThunks and the way thunkMap seems to be copied progressively into thunks but never cleaned up. I don't pretend to understand the flow of the code (it is difficult to figure out what's going on and how things will happen at runtime), but the name maxPotentialThunks might not be the best if it was undercounting originally.

IIUC the code calls estimateStubsInRangeVA only once, during finalize (which I think also happens only once). At the point estimateStubsInRangeVA is called, part of thunkMap has already been pushed into thunks. The more we overcount, the higher stubsInRangeVA will be, and it might be that we are creating thunks near the end of the text section when a simple jump would have been enough. I don't know how bad that size increase in the final binary is when we overcount.

It seems to me that we were undercounting when ti.callSitesUsed == ti.callSiteCount (I imagine callSitesUsed cannot be higher than callSiteCount). It might be possible to get a better estimate during the thunkMap processing, for example by changing the if (ti.isec->getVA() <= isecVA) I was proposing above to else if (ti.isec->getVA() <= isecVA) (which I think covers the case where the original code was undercounting).

Again, it is complicated code, I probably don't understand it correctly, and there are a couple of things that might make overcounting less important than undercounting. Feel free to ignore my feedback.

Contributor Author

I see. Indeed, this is very tricky code. I should have explained this better: the thing being undercounted is the number of thunks needed to adjust the VA estimate. maxPotentialThunks itself is correct, because it counts the maximum number of thunks that can still be created in the future - which it does correctly. The issue is that the VA also needs to be adjusted for the existing thunks, which hadn't been taken into consideration.

The implicit problematic assumption was that maxPotentialThunks includes the existing thunks, but it actually doesn't.

Contributor

If I understand your answer correctly, then the case where if (ti.callSitesUsed < ti.callSiteCount) is not taken is the only case where the previous code was undercounting, so adding the right else if there (either ti.isec->getVA() <= isecVA or ti.isec->getVA() > isecVA, I am not sure which) should only add to the estimation those thunks that were forgotten before. Yes, the new code will be doing more checks than with partition_point (especially because for the last input section I would expect many thunks to match ti.callSitesUsed == ti.callSiteCount), but at the same time it will reduce the overcounting, which should improve the size of the final binary (by not unnecessarily creating more stubs because stubsInRangeVA is too high).

Contributor Author

Hmm, not quite. From debugging, there was no opportunity to make the estimation work in all cases without changing the fundamental algorithm. The already-placed thunks were simply being ignored - basically forgotten about. I am not sure your proposal would work - I don't think it would in all cases - there are a few more variables in play.

I do think that we can improve the algorithm a bit - but this is the direct fix for the current bug.
I am hesitant to try to improve the underlying algorithm right now, as that could lead to more issues in unknown scenarios.

@alx32 alx32 merged commit 156e605 into llvm:main Jan 9, 2025
8 checks passed
BaiXilin pushed a commit to BaiXilin/llvm-fix-vnni-instr-types that referenced this pull request Jan 12, 2025
This patch improves the linker’s ability to estimate stub reachability
in the `TextOutputSection::estimateStubsInRangeVA` function. It does so
by including thunks that have already been placed ahead of the current
call site address when calculating the threshold for direct stub calls.

Before this fix, the estimation process overlooked existing forward
thunks. This could result in some thunks not being inserted where
needed. In rare situations, particularly with large and specially
arranged codebases, this might lead to branch instructions being out of
range, causing linking errors.

Although this patch successfully addresses the problem, it is not
feasible to create a test for this issue. The specific layout and order
of thunk creation required to reproduce the corner case are too complex,
making test creation impractical.

Example error messages the issue could generate:
```
ld64.lld: error: banana.o:(symbol OUTLINED_FUNCTION_24949_3875): relocation BRANCH26 is out of range: 134547892 is not in [-134217728, 134217727]; references objc_autoreleaseReturnValue
ld64.lld: error: main.o:(symbol _main+0xc): relocation BRANCH26 is out of range: 134544132 is not in [-134217728, 134217727]; references objc_release
```