
[lld-macho] Fix branch extension thunk estimation logic #120529


Merged
merged 6 commits into llvm:main
Jan 9, 2025

Conversation

alx32
Contributor

@alx32 alx32 commented Dec 19, 2024

This patch improves the linker’s ability to estimate stub reachability in the TextOutputSection::estimateStubsInRangeVA function. It does so by including thunks that have already been placed ahead of the current call site address when calculating the threshold for direct stub calls.

Before this fix, the estimation process overlooked existing forward thunks. This could result in some thunks not being inserted where needed. In rare situations, particularly with large and specially arranged codebases, this might lead to branch instructions being out of range, causing linking errors.

Although this patch successfully addresses the problem, it is not feasible to create a test for this issue. The specific layout and order of thunk creation required to reproduce the corner case are too complex, making test creation impractical.

Example error messages the issue could generate:

ld64.lld: error: banana.o:(symbol OUTLINED_FUNCTION_24949_3875): relocation BRANCH26 is out of range: 134547892 is not in [-134217728, 134217727]; references objc_autoreleaseReturnValue
ld64.lld: error: main.o:(symbol _main+0xc): relocation BRANCH26 is out of range: 134544132 is not in [-134217728, 134217727]; references objc_release
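To make the arithmetic concrete, below is a minimal standalone sketch (not lld code; every constant is hypothetical) of how budgeting for the already-placed forward thunks raises the stubsInRangeVA threshold:

```
// Standalone illustration of the estimation change; all values are made up.
// Only the shape of the formula mirrors the patch.
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t forwardBranchRange = 0x8000000; // ~128 MiB forward branch reach
  const uint64_t isecEnd = 0x7f00000;            // estimated end of the text inputs
  const uint64_t thunkSize = 12;                 // size of one branch-extension thunk
  const uint64_t stubsSize = 0x40000;            // size of the stubs section
  const uint64_t maxPotentialThunks = 90000;     // thunks that may still be created
  const uint64_t existingForwardThunks = 20000;  // thunks already placed past callIdx

  // Old estimate: ignores thunks already inserted ahead of the call site.
  uint64_t oldVA =
      isecEnd + maxPotentialThunks * thunkSize + stubsSize - forwardBranchRange;
  // New estimate: also budgets for the already-placed forward thunks, pushing
  // the "safe to call stubs directly" threshold higher (more conservative).
  uint64_t newVA = isecEnd +
                   (maxPotentialThunks + existingForwardThunks) * thunkSize +
                   stubsSize - forwardBranchRange;

  printf("old stubsInRangeVA estimate: %#llx\n", (unsigned long long)oldVA);
  printf("new stubsInRangeVA estimate: %#llx\n", (unsigned long long)newVA);
  return 0;
}
```

Call sites below the threshold keep getting thunks, so a higher (more conservative) estimate avoids the out-of-range branches shown above.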

@alx32 alx32 marked this pull request as ready for review December 19, 2024 07:46
@llvmbot
Member

llvmbot commented Dec 19, 2024

@llvm/pr-subscribers-lld

Author: None (alx32)

Changes

This patch improves the linker’s ability to estimate stub reachability in the TextOutputSection::estimateStubsInRangeVA function. It does so by including thunks that have already been placed ahead of the current call site address when calculating the threshold for direct stub calls.

Before this fix, the estimation process overlooked existing forward thunks. This could result in some thunks not being inserted where needed. In rare situations, particularly with large and specially arranged codebases, this might lead to branch instructions being out of range, causing linking errors.

Although this patch successfully addresses the problem, it is not feasible to create a test for this issue. The specific layout and order of thunk creation required to reproduce the corner case are too complex, making test creation impractical.


Full diff: https://github.com/llvm/llvm-project/pull/120529.diff

1 File Affected:

  • (modified) lld/MachO/ConcatOutputSection.cpp (+24-2)
diff --git a/lld/MachO/ConcatOutputSection.cpp b/lld/MachO/ConcatOutputSection.cpp
index 6a9301f84a03ef..9610a80e309bac 100644
--- a/lld/MachO/ConcatOutputSection.cpp
+++ b/lld/MachO/ConcatOutputSection.cpp
@@ -184,6 +184,15 @@ uint64_t TextOutputSection::estimateStubsInRangeVA(size_t callIdx) const {
     InputSection *isec = inputs[i];
     isecEnd = alignToPowerOf2(isecEnd, isec->align) + isec->getSize();
   }
+
+  // Tally up any thunks that have already been placed that have address higher
+  // than the equivalent callIdx. We first find the index of the first thunk
+  // that is beyond the current inputs[callIdx].
+  auto itPostcallIdxThunks = std::partition_point(
+      thunks.begin(), thunks.end(),
+      [isecVA](const ConcatInputSection *t) { return t->getVA() <= isecVA; });
+  uint32_t existingForwardThunks = thunks.end() - itPostcallIdxThunks;
+
   // Estimate the address after which call sites can safely call stubs
   // directly rather than through intermediary thunks.
   uint64_t forwardBranchRange = target->forwardBranchRange;
@@ -191,8 +200,21 @@ uint64_t TextOutputSection::estimateStubsInRangeVA(size_t callIdx) const {
          "should not run thunk insertion if all code fits in jump range");
   assert(isecEnd - isecVA <= forwardBranchRange &&
          "should only finalize sections in jump range");
-  uint64_t stubsInRangeVA = isecEnd + maxPotentialThunks * target->thunkSize +
-                            in.stubs->getSize() - forwardBranchRange;
+
+  // Estimate the maximum size of the code, right before the stubs section
+  uint32_t maxTextSize = 0;
+  // Add the size of all the inputs, including the unprocessed ones.
+  maxTextSize += isecEnd;
+
+  // Add the size of the thunks that may be created in the future
+  maxTextSize += maxPotentialThunks * target->thunkSize;
+
+  // Add the size of the thunks that have already been created that are ahead
+  maxTextSize += existingForwardThunks * target->thunkSize;
+
+  uint64_t stubsInRangeVA =
+      maxTextSize + in.stubs->getSize() - forwardBranchRange;
+
   log("thunks = " + std::to_string(thunkMap.size()) +
       ", potential = " + std::to_string(maxPotentialThunks) +
       ", stubs = " + std::to_string(in.stubs->getSize()) + ", isecVA = " +

@llvmbot
Member

llvmbot commented Dec 19, 2024

@llvm/pr-subscribers-lld-macho

@kyulee-com kyulee-com requested a review from smeenai December 19, 2024 18:11
@kyulee-com
Contributor

@int3 can you take a look when you're available? This is a critical fix to deal with large binaries.

// Tally up any thunks that have already been placed that have address higher
// than the equivalent callIdx. We first find the index of the first thunk
// that is beyond the current inputs[callIdx].
auto itPostcallIdxThunks = std::partition_point(
Contributor

I think we can use llvm::partition_point()
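For reference, a minimal sketch of what that looks like: llvm::partition_point (from llvm/ADT/STLExtras.h) is a range-based wrapper around std::partition_point, so the iterator pair can be dropped. The container and values below are simplified stand-ins, not lld's types, and assume the LLVM headers are available:

```
#include "llvm/ADT/STLExtras.h"

#include <algorithm>
#include <cstdint>
#include <vector>

int main() {
  // Pretend these are thunk VAs, already sorted in ascending order.
  std::vector<uint64_t> thunkVAs = {0x1000, 0x2000, 0x3000, 0x4000};
  const uint64_t isecVA = 0x2800;

  // Iterator form currently in the patch:
  auto it1 = std::partition_point(
      thunkVAs.begin(), thunkVAs.end(),
      [isecVA](uint64_t va) { return va <= isecVA; });
  // Range form suggested here:
  auto it2 = llvm::partition_point(
      thunkVAs, [isecVA](uint64_t va) { return va <= isecVA; });

  // Both point at the first VA beyond isecVA (0x3000 in this example).
  return it1 == it2 ? 0 : 1;
}
```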

Contributor Author

@alx32 alx32 Dec 19, 2024

Do you mean another form of it? We're already using partition_point.

EDIT: I see ... std=>llvm

@alx32 alx32 requested review from gkmhub and nico December 19, 2024 18:27
@alx32
Contributor Author

alx32 commented Dec 19, 2024

@nico - could you have a look at this?
Your previous comment here is relevant:

Yes, this computes the end address of all sections. After the loop, isecEnd points right after where the last section would end up, assuming we don't insert thunk sections anywhere along the way.

This fix addresses the "assuming we don't insert thunk sections anywhere along the way" part - we do end up inserting thunks, so this fix ensures we count them.
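For illustration, here is a tiny standalone sketch of the accumulation described in that quote; the sizes, alignments, and the local alignToPowerOf2 helper are hypothetical stand-ins, not lld's code:

```
#include <cstdint>
#include <vector>

// Round value up to the next multiple of align (align must be a power of two).
static uint64_t alignToPowerOf2(uint64_t value, uint64_t align) {
  return (value + align - 1) & ~(align - 1);
}

struct FakeSection {
  uint64_t size;
  uint64_t align;
};

int main() {
  // Remaining (not yet finalized) input sections.
  std::vector<FakeSection> inputs = {{0x34, 16}, {0x110, 4}, {0x8, 8}};
  uint64_t isecEnd = 0x1000; // end of the sections finalized so far
  for (const FakeSection &isec : inputs)
    isecEnd = alignToPowerOf2(isecEnd, isec.align) + isec.size;
  // isecEnd now points right past where the last section would end, assuming
  // no thunks get inserted along the way - the assumption the patch
  // compensates for by counting existing forward thunks.
  return 0;
}
```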

// Estimated maximum VA of last stub.
uint64_t maxVAOfLastStub = maxTextSize + in.stubs->getSize();

// Calculate the first address that is guaranteed to not need a thunk to
Contributor

Maybe use (move) the original comment on Line 197 here.

Comment on lines +188 to +196
// Tally up any thunks that have already been placed that have VA higher than
// inputs[callIdx]. First, find the index of the first thunk that is beyond
// the current inputs[callIdx].
auto itPostcallIdxThunks =
    llvm::partition_point(thunks, [isecVA](const ConcatInputSection *t) {
      return t->getVA() <= isecVA;
    });
uint64_t existingForwardThunks = thunks.end() - itPostcallIdxThunks;

Contributor

IIUC, the thunks used here are derived from the thunkMap (see 1), and thunkMap is iterated above (lines 171-179). One should be able to figure out this value during the previous iteration and avoid the partition_point call.

Something like:

// Tally the functions which still have call sites remaining to process,
// which yields the maximum number of thunks we might yet place.
size_t maxPotentialThunks = 0;
// Tally up any thunks that have already been placed that have VA higher than
// inputs[callIdx].
size_t existingForwardThunks = 0;
for (auto &tp : thunkMap) {
  ThunkInfo &ti = tp.second;
  // This overcounts: Only sections that are in forward jump range from the
  // currently-active section get finalized, and all input sections are
  // finalized when estimateStubsInRangeVA() is called. So only backward
  // jumps will need thunks, but we count all jumps.
  if (ti.callSitesUsed < ti.callSiteCount)
    maxPotentialThunks += 1;
  if (ti.isec->getVA() <= isecVA) // it might be `>` in the conditional, not sure
    existingForwardThunks += 1;
}

Because of the dependency on isecVA, the code needs to move below the calculation of isecVA (more or less where the new code is added in this PR).

(It also looks like we might be double counting in maxPotentialThunks and existingForwardThunks, but there might be some not-easy-to-see dependency in how thunkMap is used and when estimateStubsInRangeVA is called that avoids that situation).
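For concreteness, here is a self-contained sketch of that single-pass counting idea; ThunkInfo, thunkMap, and all values are hypothetical stand-ins rather than lld's types, and the comparison direction is the one the suggestion flags as uncertain:

```
#include <cstdint>
#include <cstdio>
#include <map>

struct ThunkInfo {
  uint32_t callSitesUsed;
  uint32_t callSiteCount;
  uint64_t isecVA; // stand-in for ti.isec->getVA()
};

int main() {
  std::map<int, ThunkInfo> thunkMap = {
      {0, {2, 5, 0x1000}}, // call sites remaining -> potential future thunk
      {1, {3, 3, 0x9000}}, // fully used, placed beyond isecVA -> existing forward thunk
      {2, {1, 1, 0x0800}}, // fully used, placed before isecVA -> neither bucket
  };
  const uint64_t isecVA = 0x4000;

  size_t maxPotentialThunks = 0;
  size_t existingForwardThunks = 0;
  for (const auto &tp : thunkMap) {
    const ThunkInfo &ti = tp.second;
    if (ti.callSitesUsed < ti.callSiteCount)
      maxPotentialThunks += 1;
    else if (ti.isecVA > isecVA)    // '>' matches the "forward" naming; the
      existingForwardThunks += 1;   // suggestion notes the direction is unsure
  }
  printf("potential=%zu existingForward=%zu\n", maxPotentialThunks,
         existingForwardThunks);
  return 0;
}
```

The merged patch keeps the partition_point approach instead, as discussed below.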

Contributor Author

One should be able to figure out this value during the previous iteration and avoid the partition_point call.

This is true, but I think this might be less performant. The for loop iterates over all thunks, of which there can be many (100k+). That means adding a branch that executes 100k times.

Computing existingForwardThunks with partition_point is O(log n), so for ~100k input thunks it will do fewer than 20 iterations and execute a few hundred instructions.

Generally we favor more performant code for LLD - even if here it probably won't really be measurable.
Or do we want this for easier code readability / understandability?

It also looks like we might be double counting in maxPotentialThunks and existingForwardThunks, but there might be some not-easy-to-see dependency in how thunkMap is used and when estimateStubsInRangeVA is called that avoids that situation

This is basically the issue being fixed. maxPotentialThunks is intended to (and appears to) contain all the possible thunks, so adding in existingForwardThunks would appear to be double counting. But that's exactly the original bug: maxPotentialThunks does not include all the thunks, so we need to count them all by adding existingForwardThunks.

Contributor

I did not have an intuition for the total number of entries in thunkMap. If we are talking about 100k or numbers of that order, doing only ~16 comparisons is probably always going to be quicker.

My comment about double counting is because of the name maxPotentialThunks and the way thunkMap seems to be copied progressively into thunks but never cleaned up. I don't pretend to understand the flow of the code (it is difficult to figure out what's going on and how things will happen at runtime), but the name maxPotentialThunks might not be the best if it was undercounting originally.

IIUC the code calls estimateStubsInRangeVA only once, during finalize (which I think also happens only once). At the point estimateStubsInRangeVA is called, part of thunkMap has already been pushed into thunks. The more we overcount, the higher stubsInRangeVA will be, and it might be that we are creating thunks near the end of the text section when a simple jump would have been enough. I don't know how bad that size increase in the final binary is when we overcount.

It seems to me that we were undercounting when ti.callSitesUsed == ti.callSiteCount (I imagine callSitesUsed cannot be higher than callSiteCount). It might be possible to get a better estimate during the thunkMap processing, for example by changing the if (ti.isec->getVA() <= isecVA) I was proposing above to else if (ti.isec->getVA() <= isecVA) (which I think covers the case where the original code was undercounting).

Again, it is complicated code, I probably don't understand it correctly, and there are a couple of things that might make overcounting less important than undercounting. Feel free to ignore my feedback.

Contributor Author

I see. Indeed, this is very tricky code. I should have explained this better: the thing being undercounted is the number of thunks needed to adjust the VA estimate. maxPotentialThunks itself is correct, because it counts the maximum number of thunks that can still be created in the future - which it does correctly. The issue is that the VA also needs to be adjusted for the existing thunks, which hadn't been taken into consideration.

The implicit problematic assumption was that maxPotentialThunks includes the existing thunks, but it actually doesn't.

Contributor

If I understand your answer correctly, then the case where if (ti.callSitesUsed < ti.callSiteCount) is not taken is the only case where the previous code was undercounting, so adding the right else if there (either ti.isec->getVA() <= isecVA or ti.isec->getVA() > isecVA, I am not sure which) should only add to the estimation those thunks that were forgotten before. Yes, the new code will be doing more checks than with partition_point (especially because for the last input section I would expect many thunks to match ti.callSitesUsed == ti.callSiteCount), but at the same time it will reduce the overcounting, which should improve the size of the final binary (by not unnecessarily creating more stubs because stubsInRangeVA is too high).

Contributor Author

Hmm, not quite. From debugging, there was no opportunity to make the estimation work in all cases without changing the fundamental algorithm. The already-placed thunks were simply being ignored - basically forgotten about. I am not sure your proposal would work - I don't think it would in all cases - there are a few more variables in play.

I do think that we can improve the algorithm a bit - but this is the direct fix for the current bug.
I am hesitant to try to improve the underlying algorithm right now, as that could lead to more issues in unknown scenarios.

@alx32 alx32 merged commit 156e605 into llvm:main Jan 9, 2025
8 checks passed
BaiXilin pushed a commit to BaiXilin/llvm-fix-vnni-instr-types that referenced this pull request Jan 12, 2025
This patch improves the linker’s ability to estimate stub reachability
in the `TextOutputSection::estimateStubsInRangeVA` function. It does so
by including thunks that have already been placed ahead of the current
call site address when calculating the threshold for direct stub calls.

Before this fix, the estimation process overlooked existing forward
thunks. This could result in some thunks not being inserted where
needed. In rare situations, particularly with large and specially
arranged codebases, this might lead to branch instructions being out of
range, causing linking errors.

Although this patch successfully addresses the problem, it is not
feasible to create a test for this issue. The specific layout and order
of thunk creation required to reproduce the corner case are too complex,
making test creation impractical.

Example error messages the issue could generate:
```
ld64.lld: error: banana.o:(symbol OUTLINED_FUNCTION_24949_3875): relocation BRANCH26 is out of range: 134547892 is not in [-134217728, 134217727]; references objc_autoreleaseReturnValue
ld64.lld: error: main.o:(symbol _main+0xc): relocation BRANCH26 is out of range: 134544132 is not in [-134217728, 134217727]; references objc_release
```