[lld-macho] Fix branch extension thunk estimation logic #120529


Merged: 6 commits, Jan 9, 2025
lld/MachO/ConcatOutputSection.cpp (32 additions, 4 deletions)
@@ -184,15 +184,43 @@ uint64_t TextOutputSection::estimateStubsInRangeVA(size_t callIdx) const {
     InputSection *isec = inputs[i];
     isecEnd = alignToPowerOf2(isecEnd, isec->align) + isec->getSize();
   }
-  // Estimate the address after which call sites can safely call stubs
-  // directly rather than through intermediary thunks.
+
+  // Tally up any thunks that have already been placed that have VA higher than
+  // inputs[callIdx]. First, find the index of the first thunk that is beyond
+  // the current inputs[callIdx].
+  auto itPostcallIdxThunks =
+      llvm::partition_point(thunks, [isecVA](const ConcatInputSection *t) {
+        return t->getVA() <= isecVA;
+      });
+  uint64_t existingForwardThunks = thunks.end() - itPostcallIdxThunks;
+
Comment on lines +188 to +196 (Contributor):

IIUC, the thunks used here are derived from the thunkMap (see 1), and thunkMap is iterated above (lines 171-179). One should be able to figure out this value during the previous iteration and avoid the partition_point call.

Something like:

// Tally the functions which still have call sites remaining to process,
// which yields the maximum number of thunks we might yet place.
size_t maxPotentialThunks = 0;
// Tally up any thunks that have already been placed that have VA higher than
// inputs[callIdx].
size_t existingForwardThunks = 0;
for (auto &tp : thunkMap) {
  ThunkInfo &ti = tp.second;
  // This overcounts: Only sections that are in forward jump range from the
  // currently-active section get finalized, and all input sections are
  // finalized when estimateStubsInRangeVA() is called. So only backward
  // jumps will need thunks, but we count all jumps.
  if (ti.callSitesUsed < ti.callSiteCount)
    maxPotentialThunks += 1;
  if (ti.isec->getVA() <= isecVA) // it might be `>` in the conditional, not sure
    existingForwardThunks += 1;
}

Because of the dependency on isecVA, this code needs to move below the calculation of isecVA (more or less where the new code is added in this PR).

(It also looks like we might be double counting in maxPotentialThunks and existingForwardThunks, but there might be some not-easy-to-see dependency in how thunkMap is used and when estimateStubsInRangeVA is called that avoids that situation).

Contributor Author:

> One should be able to figure out this value during the previous iteration and avoid the partition_point call.

This is true, but I think this might be less performant. The for loop iterates over all thunks, of which there can be many (100k+). That means adding a branch that executes 100k times.

The partition_point approach is O(log n), so for ~100k input thunks it will take <20 iterations and execute a few hundred instructions.

Generally we favor more performant code for LLD, even if here it probably won't really be measurable. Or do we want this for easier code readability/understandability?

> It also looks like we might be double counting in maxPotentialThunks and existingForwardThunks, but there might be some not-easy-to-see dependency in how thunkMap is used and when estimateStubsInRangeVA is called that avoids that situation.

This is basically the issue being fixed. maxPotentialThunks is intended to (and appears to) contain all the possible thunks, so adding in existingForwardThunks would appear to be double counting. But that's exactly the original bug: maxPotentialThunks does not include all the thunks, so we need to count them all by adding existingForwardThunks.
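
For reference, a minimal, self-contained sketch of the counting pattern under discussion (this is illustrative code, not an excerpt from the patch; the container and values are made up):

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  // Hypothetical VAs of thunks that were already placed, kept sorted ascending.
  std::vector<uint64_t> thunkVAs = {0x1000, 0x2000, 0x3000, 0x8000, 0x9000};
  uint64_t isecVA = 0x4000; // VA of the current input section

  // partition_point binary-searches for the first element where the predicate
  // flips from true to false, i.e. the first thunk placed beyond isecVA.
  // This costs O(log n) comparisons instead of a full O(n) scan.
  auto it = std::partition_point(
      thunkVAs.begin(), thunkVAs.end(),
      [isecVA](uint64_t va) { return va <= isecVA; });

  size_t existingForwardThunks = thunkVAs.end() - it;
  std::printf("forward thunks: %zu\n", existingForwardThunks); // prints 2
}

The patch does the same thing over the sorted thunks vector: everything past the partition point is a thunk already placed beyond inputs[callIdx].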

Contributor:

I did not have an intuition for the total number of entries in thunkMap. If we are talking about 100k or numbers in that range, doing only ~16 comparisons is probably always going to be quicker.

My comment about double counting is because of the name maxPotentialThunks and the way thunkMap seems to be copied progressively into thunks but never cleaned up. I don't pretend to understand the flow of the code (it is difficult to figure out what's going on and how things will happen at runtime), but the name maxPotentialThunks might not be the best if it was undercounting (originally).

IIUC, the code calls estimateStubsInRangeVA only once, during finalize (which I think also happens only once). At the point estimateStubsInRangeVA is called, part of thunkMap has already been pushed into thunks. The more we overcount, the higher stubsInRangeVA will be, and it might be that we are creating thunks near the end of the text section where a simple jump would have been enough. I don't know how bad that size increase in the final binary is when we overcount.

It seems to me that we were undercounting when ti.callSitesUsed == ti.callSiteCount (I imagine callSitesUsed cannot be higher than callSiteCount). It might be possible to get a better estimate during the thunkMap processing, for example by changing the if (ti.isec->getVA() <= isecVA) I proposed above to else if (ti.isec->getVA() <= isecVA) (which I think covers the case where the original code was undercounting).

Again, this is complicated code, I probably don't understand it correctly, and there are a couple of things that might make overcounting less important than undercounting. Feel free to ignore my feedback.

Contributor Author:

I see. Indeed, this is very tricky code. I should have clarified better: the thing being undercounted is the number of thunks needed to adjust the VA. maxPotentialThunks is correct, because it counts the maximum number of thunks that can be created in the future, which it does correctly. The issue is that the VA also needs to be adjusted by the existing thunks, which haven't been taken into consideration.

The problematic implicit assumption was that maxPotentialThunks includes the existing thunks, but it actually doesn't.
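
To make that concrete, a back-of-the-envelope illustration (every number below is invented; the thunk size and branch range merely resemble typical arm64 values):

isecEnd               = 0x0C000000   // all input sections laid out
existingForwardThunks = 500          // thunks already placed beyond inputs[callIdx]
maxPotentialThunks    = 1000         // upper bound on thunks still to be created
thunkSize             = 12           // bytes per thunk
stubsSize             = 0x4000
forwardBranchRange    = 0x08000000   // 128 MiB

Old estimate: stubsInRangeVA = isecEnd + 1000 * 12 + stubsSize - forwardBranchRange.
New estimate: 500 * 12 = 6000 bytes higher, because the existing forward thunks are now counted. Call sites whose VA falls inside that 6000-byte window were previously told they could branch directly to the stubs, even though the 500 interleaved thunks may push the stubs out of forward branch range.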

Contributor:

If I understand your answer correctly, the case where if (ti.callSitesUsed < ti.callSiteCount) is not taken is the only case where the previous code was undercounting, so adding the right else if there (either ti.isec->getVA() <= isecVA or ti.isec->getVA() > isecVA, I am not sure which) should add to the estimation only those thunks that were forgotten before. Yes, the new code would do more checks than with partition_point (especially because, for the last input section, I would expect many thunks to satisfy ti.callSitesUsed == ti.callSiteCount), but at the same time it would reduce the overcounting, which should improve the size of the final binary (by not unnecessarily creating more thunks because stubsInRangeVA is too high).

Contributor Author:

Hmm, not quite. From debugging, there was no opportunity to actually improve the estimation to work in all cases (without changing the fundamental algorithm). The already-placed thunks were simply being ignored, basically forgotten to be taken into consideration. I am not sure your proposal will work; I don't think it will for all cases, as there are a few more variables in play.

I do think we can improve the algorithm a bit, but this is the direct fix for the current bug. I am hesitant to try to improve the underlying algorithm right now, as that can lead to more issues in unknown scenarios.

   uint64_t forwardBranchRange = target->forwardBranchRange;
   assert(isecEnd > forwardBranchRange &&
          "should not run thunk insertion if all code fits in jump range");
   assert(isecEnd - isecVA <= forwardBranchRange &&
          "should only finalize sections in jump range");
-  uint64_t stubsInRangeVA = isecEnd + maxPotentialThunks * target->thunkSize +
-                            in.stubs->getSize() - forwardBranchRange;
+
+  // Estimate the maximum size of the code, right before the stubs section.
+  uint64_t maxTextSize = 0;
+  // Add the size of all the inputs, including the unprocessed ones.
+  maxTextSize += isecEnd;
+
+  // Add the size of the thunks that have already been created that are ahead of
+  // inputs[callIdx]. These are already created thunks that will be interleaved
+  // with inputs[callIdx...end].
+  maxTextSize += existingForwardThunks * target->thunkSize;
+
+  // Add the size of the thunks that may be created in the future. Since
+  // 'maxPotentialThunks' overcounts, this is an estimate of the upper limit.
+  maxTextSize += maxPotentialThunks * target->thunkSize;
+
+  // Estimated maximum VA of last stub.
+  uint64_t maxVAOfLastStub = maxTextSize + in.stubs->getSize();
+
+  // Estimate the address after which call sites can safely call stubs
+  // directly rather than through intermediary thunks.
+  uint64_t stubsInRangeVA = maxVAOfLastStub - forwardBranchRange;
+
   log("thunks = " + std::to_string(thunkMap.size()) +
       ", potential = " + std::to_string(maxPotentialThunks) +
       ", stubs = " + std::to_string(in.stubs->getSize()) + ", isecVA = " +
(remainder of diff collapsed)
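
Putting the added pieces together, the new computation can be read as a single formula (a sketch using the names from the diff, not a verbatim excerpt):

stubsInRangeVA = isecEnd                                   // all text laid out so far
               + existingForwardThunks * target->thunkSize // already-placed thunks ahead of callIdx
               + maxPotentialThunks * target->thunkSize    // upper bound on future thunks
               + in.stubs->getSize()                       // stubs follow all the text
               - forwardBranchRange;                       // back off by the reachable range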