[lld-macho] Fix branch extension thunk estimation logic #120529

Conversation
@llvm/pr-subscribers-lld Author: None (alx32)

Changes

This patch improves the linker’s ability to estimate stub reachability in the `TextOutputSection::estimateStubsInRangeVA` function. It does so by including thunks that have already been placed ahead of the current call site address when calculating the threshold for direct stub calls.

Before this fix, the estimation process overlooked existing forward thunks. This could result in some thunks not being inserted where needed. In rare situations, particularly with large and specially arranged codebases, this might lead to branch instructions being out of range, causing linking errors.

Although this patch successfully addresses the problem, it is not feasible to create a test for this issue. The specific layout and order of thunk creation required to reproduce the corner case are too complex, making test creation impractical.

Full diff: https://github.com/llvm/llvm-project/pull/120529.diff

1 Files Affected:
diff --git a/lld/MachO/ConcatOutputSection.cpp b/lld/MachO/ConcatOutputSection.cpp
index 6a9301f84a03ef..9610a80e309bac 100644
--- a/lld/MachO/ConcatOutputSection.cpp
+++ b/lld/MachO/ConcatOutputSection.cpp
@@ -184,6 +184,15 @@ uint64_t TextOutputSection::estimateStubsInRangeVA(size_t callIdx) const {
InputSection *isec = inputs[i];
isecEnd = alignToPowerOf2(isecEnd, isec->align) + isec->getSize();
}
+
+ // Tally up any thunks that have already been placed that have address higher
+ // than the equivalent callIdx. We first find the index of the first thunk
+ // that is beyond the current inputs[callIdx].
+ auto itPostcallIdxThunks = std::partition_point(
+ thunks.begin(), thunks.end(),
+ [isecVA](const ConcatInputSection *t) { return t->getVA() <= isecVA; });
+ uint32_t existingForwardThunks = thunks.end() - itPostcallIdxThunks;
+
// Estimate the address after which call sites can safely call stubs
// directly rather than through intermediary thunks.
uint64_t forwardBranchRange = target->forwardBranchRange;
@@ -191,8 +200,21 @@ uint64_t TextOutputSection::estimateStubsInRangeVA(size_t callIdx) const {
"should not run thunk insertion if all code fits in jump range");
assert(isecEnd - isecVA <= forwardBranchRange &&
"should only finalize sections in jump range");
- uint64_t stubsInRangeVA = isecEnd + maxPotentialThunks * target->thunkSize +
- in.stubs->getSize() - forwardBranchRange;
+
+ // Estimate the maximum size of the code, right before the stubs section
+ uint32_t maxTextSize = 0;
+ // Add the size of all the inputs, including the unprocessed ones.
+ maxTextSize += isecEnd;
+
+ // Add the size of the thunks that may be created in the future
+ maxTextSize += maxPotentialThunks * target->thunkSize;
+
+ // Add the size of the thunks that have already been created that are ahead
+ maxTextSize += existingForwardThunks * target->thunkSize;
+
+ uint64_t stubsInRangeVA =
+ maxTextSize + in.stubs->getSize() - forwardBranchRange;
+
log("thunks = " + std::to_string(thunkMap.size()) +
", potential = " + std::to_string(maxPotentialThunks) +
", stubs = " + std::to_string(in.stubs->getSize()) + ", isecVA = " +
@int3 can you take a look when you're available? This is a critical fix to deal with large binaries.
lld/MachO/ConcatOutputSection.cpp
Outdated
// Tally up any thunks that have already been placed that have address higher
// than the equivalent callIdx. We first find the index of the first thunk
// that is beyond the current inputs[callIdx].
auto itPostcallIdxThunks = std::partition_point(
I think we can use llvm::partition_point()
Do you mean another form of it? We're already using `partition_point`.

EDIT: I see ... `std` => `llvm`
@nico - could you have a look at this?
This fix is addressing the
lld/MachO/ConcatOutputSection.cpp
Outdated
// Estimated maximum VA of last stub.
uint64_t maxVAOfLastStub = maxTextSize + in.stubs->getSize();

// Calculate the first address that is guaranteed to not need a thunk to
Maybe use (move) the original comment on Line 197 here.
// Tally up any thunks that have already been placed that have VA higher than
// inputs[callIdx]. First, find the index of the first thunk that is beyond
// the current inputs[callIdx].
auto itPostcallIdxThunks =
    llvm::partition_point(thunks, [isecVA](const ConcatInputSection *t) {
      return t->getVA() <= isecVA;
    });
uint64_t existingForwardThunks = thunks.end() - itPostcallIdxThunks;
IIUC, the `thunks` used here are derived from the `thunkMap` (see 1), and `thunkMap` is iterated above (lines 171-179). One should be able to figure out this value during the previous iteration and avoid the `partition_point` call.
Something like:
// Tally the functions which still have call sites remaining to process,
// which yields the maximum number of thunks we might yet place.
size_t maxPotentialThunks = 0;
// Tally up any thunks that have already been placed that have VA higher than
// inputs[callIdx].
size_t existingForwardThunks = 0;
for (auto &tp : thunkMap) {
  ThunkInfo &ti = tp.second;
  // This overcounts: Only sections that are in forward jump range from the
  // currently-active section get finalized, and all input sections are
  // finalized when estimateStubsInRangeVA() is called. So only backward
  // jumps will need thunks, but we count all jumps.
  if (ti.callSitesUsed < ti.callSiteCount)
    maxPotentialThunks += 1;
  if (ti.isec->getVA() <= isecVA) // it might be `>` in the conditional, not sure
    existingForwardThunks += 1;
}
Because of the dependency on `isecVA`, the code needs to move below the calculation of `isecVA` (more or less where the new code is added in this PR).

(It also looks like we might be double counting in `maxPotentialThunks` and `existingForwardThunks`, but there might be some not-easy-to-see dependency in how `thunkMap` is used and when `estimateStubsInRangeVA` is called that avoids that situation.)
One should be able to figure out this value during the previous iteration and avoid the partition_point call.

This is true, but I think this might be less performant. The for loop iterates over all thunks, of which there can be many (100k+). That means adding a branch that executes 100k times. Computing this with `partition_point` is O(log n), so for ~100k input thunks it takes <20 iterations and a few hundred executed instructions.

Generally we favor more performant code for LLD, even if here it probably won't really be measurable. Or do we want this for easier code readability / understandability?
It also looks like we might be double counting in `maxPotentialThunks` and `existingForwardThunks`, but there might be some not-easy-to-see dependency in how thunkMap is used and when estimateStubsInRangeVA is called that avoids that situation

This is basically the issue to be fixed. `maxPotentialThunks` is intended to, and appears to, contain all the possible thunks, so adding in `existingForwardThunks` would appear to be double counting. But that's exactly the original bug: `maxPotentialThunks` does not include all the thunks, so we need to count them all by adding `existingForwardThunks`.
I did not have an intuition for the total number of entries in `thunkMap`. If we are talking about 100k or those numbers, doing only ~16 comparisons is probably going to always be quicker.

My comment about double counting is because of the name `maxPotentialThunks` and the way `thunkMap` seems to be copied progressively into `thunks`, but never cleaned up. I don't pretend to understand the flow of the code (it is difficult to figure out what's going on and how things will happen during runtime), but the name `maxPotentialThunks` might not be the best if it was undercounting (originally). IIUC the code calls `estimateStubsInRangeVA` only once during `finalize` (which I think only happens once). At the point `estimateStubsInRangeVA` is called, part of `thunkMap` has already been pushed into `thunks`. The more we overcount, the higher `stubsInRangeVA` will be, and it might be that we are creating thunks for the end of the text section when a simple jump would have been enough. I don't know how bad that size increase in the final binary is when we are overcounting. It seems to me that we were undercounting when `ti.callSitesUsed == ti.callSiteCount` (I imagine `callSitesUsed` cannot be higher than `callSiteCount`). It might be possible to have a better estimation during the `thunkMap` processing, for example changing the `if (ti.isec->getVA() <= isecVA)` I was proposing above to `else if (ti.isec->getVA() <= isecVA)` (which I think was when the original code was undercounting).

Again, it is complicated code, I probably don't understand it correctly, and there are a couple of things that might make overcounting not as important as undercounting. Feel free to ignore my feedback.
I see. Indeed this is very tricky code. I should have clarified better: the thing that is being undercounted is the count of thunks that are needed to adjust the VA. `maxPotentialThunks` is correct, because it's counting the max thunks that can be created in the future, which it does correctly. The issue is that the VA also needs to be adjusted by the existing thunks that haven't been taken into consideration.

The implicit problematic assumption is that `maxPotentialThunks` includes the existing thunks, but it actually doesn't.
If I understand your answer correctly, then when `if (ti.callSitesUsed < ti.callSiteCount)` is not taken, it is the only case where the previous code was undercounting, so adding the right `else if` in there (either `ti.isec->getVA() <= isecVA` or `ti.isec->getVA() > isecVA`, which one I am not sure) should only add to the estimation those thunks that were forgotten before. Yes, the new code will be doing more checks than with `partition_point` (especially because for the last input section I would expect many thunks to match `ti.callSitesUsed == ti.callSiteCount`), but at the same time it will reduce the overcounting, which should improve the size of the final binary (by not unnecessarily creating more stubs because `stubsInRangeVA` is too high).
Hmm, not quite. From debugging, there was no opportunity to actually improve the estimation to work in all cases (without changing the fundamental algorithm). The already placed thunks were just being ignored, basically forgotten to be taken into consideration. I am not sure your proposal will work; I don't think it will for all cases, as there are a few more variables in play.

I do think that we can improve the algorithm a bit, but this is the direct fix for the current bug. I am hesitant to try to improve the underlying algorithm right now, as that can lead to more issues in unknown scenarios.
This patch improves the linker’s ability to estimate stub reachability in the `TextOutputSection::estimateStubsInRangeVA` function. It does so by including thunks that have already been placed ahead of the current call site address when calculating the threshold for direct stub calls.

Before this fix, the estimation process overlooked existing forward thunks. This could result in some thunks not being inserted where needed. In rare situations, particularly with large and specially arranged codebases, this might lead to branch instructions being out of range, causing linking errors.

Although this patch successfully addresses the problem, it is not feasible to create a test for this issue. The specific layout and order of thunk creation required to reproduce the corner case are too complex, making test creation impractical.

Example error messages the issue could generate:

```
ld64.lld: error: banana.o:(symbol OUTLINED_FUNCTION_24949_3875): relocation BRANCH26 is out of range: 134547892 is not in [-134217728, 134217727]; references objc_autoreleaseReturnValue
ld64.lld: error: main.o:(symbol _main+0xc): relocation BRANCH26 is out of range: 134544132 is not in [-134217728, 134217727]; references objc_release
```