Batching L-BFGS-B (using as much as possible from scipy) #2870
Open
SamuelGabriel wants to merge 2 commits into pytorch:main from SamuelGabriel:export-D75977689
Conversation
…orts Summary: This is solely a base diff for the next diff, so that the next diff's changes look reasonable, since this code is copied. Differential Revision: D75980919
This pull request was exported from Phabricator. Differential Revision: D75977689
SamuelGabriel added a commit to SamuelGabriel/botorch that referenced this pull request on Jun 6, 2025
Force-pushed from cded081 to 316f41b
This pull request was exported from Phabricator. Differential Revision: D75977689
Force-pushed from 316f41b to d495710
Summary:
This PR changes our acquisition optimization to use a properly batched version of L-BFGS-B in the cases where we use L-BFGS-B.
This PR goes a little bit into the internals of scipy to make their FORTRAN code work as well as possible for us.
For this, I have edited their scipy/optimize/_lbfgsb_py.py file to handle batched functions as well. The Fortran code still runs sequentially; only the function evaluations are run in parallel.
In a toy test I ran locally on Ackley_4d, using it as a drop-in replacement in the standard SingleTaskGP generation strategy led to speed-ups of 2.5x in generation time (which clearly dominates the overall time) and to better performance (though performance should be judged after the nightly has run, as I only looked at 1 problem and 2 seeds).
The speedups are a lower bound (plus some noise), as we can now make the batch sizes larger and will likely get further speedups through that (these should be very sizable for PFNs, but also for GPs, especially on GPUs).
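The key mechanism is that whenever several of the independent L-BFGS-B restarts request an evaluation, those requests are stacked and answered with a single batched call to the acquisition function and its gradient. Below is a minimal sketch of such a batched value-and-gradient helper, assuming PyTorch autograd; the name `batched_value_and_grad` and the exact shapes are illustrative, not the PR's actual API:

```python
import torch

def batched_value_and_grad(acq, X):
    """Evaluate an acquisition function and its gradient for a whole batch.

    `acq` maps a (b, q, d) tensor to a (b,) tensor of acquisition values,
    and X stacks one (q, d) candidate set per restart that currently
    requests an evaluation.
    """
    X = X.detach().requires_grad_(True)
    vals = acq(X)  # shape (b,): one value per restart
    # Each restart's value depends only on its own row, so summing gives
    # the per-restart gradients in a single backward pass.
    grads = torch.autograd.grad(vals.sum(), X)[0]  # shape (b, q, d)
    return vals.detach(), grads

# Toy usage with a quadratic "acquisition": 8 restarts, q=1, d=3.
acq = lambda X: -(X - 1.0).pow(2).sum(dim=(-2, -1))
vals, grads = batched_value_and_grad(acq, torch.rand(8, 1, 3))
print(vals.shape, grads.shape)  # torch.Size([8]) torch.Size([8, 1, 3])
```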
Why does this speed things up?
1. We converge faster, as each subproblem is simpler.
2. When some sub-problem has not converged, only that sub-problem is re-evaluated.
How is this slower?
1. Based on my measurements, the L-BFGS-B implementation is really fast at taking steps for "larger" dimensionalities, while we run a for-loop there. (It is much slower to converge, though, as one would expect when batching a problem the way we do.) This might be possible to speed up again via threading/multi-processing, but beware: PyTorch CUDA tends to work really badly with multi-processing.
Benchmarking
[new stop crit]
Benchmarking (nightly) currently running: https://www.internalfb.com/mlhub/flow/745576495/overview
[no batch_limit]
Another benchmark, which does the following:
- going back to our previous defaults for the tolerance on the function optimum (a slightly different stopping criterion)
- no more batch limits, as they are not needed anymore; this should yield big speedups. I have removed them for initialization, too, even though that is somewhat unrelated (that speedup is not too high compared to the speedup via the classic `batch_limit`; see the illustrative call below)
- I have improved the algorithm to make optimal batches, since, unexpectedly to me, the optimizer function sometimes needs to be called multiple times before consuming an evaluation again (see the driver-loop sketch after the TODO section below)
https://internalfb.com/intern/fblearner/details/746051698/?notif_channel=cli
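For context, `batch_limit` is the option that today splits the restarts into sequentially processed chunks for L-BFGS-B. A rough sketch of a typical caller-side configuration is below; the toy model/data and the specific option values are illustrative, and exact defaults may differ across BoTorch versions:

```python
import torch
from botorch.models import SingleTaskGP
from botorch.acquisition import LogExpectedImprovement
from botorch.optim import optimize_acqf

# Toy data on the unit square; hyperparameters are left at their defaults
# rather than fitted, just to keep the sketch short.
train_X = torch.rand(10, 2, dtype=torch.double)
train_Y = train_X.sum(dim=-1, keepdim=True).sin()
model = SingleTaskGP(train_X, train_Y)
bounds = torch.tensor([[0.0, 0.0], [1.0, 1.0]], dtype=torch.double)

# `batch_limit` caps how many restarts are optimized jointly per scipy call;
# with a per-restart batched L-BFGS-B it is no longer needed.
candidate, value = optimize_acqf(
    acq_function=LogExpectedImprovement(model, best_f=train_Y.max()),
    bounds=bounds,
    q=1,
    num_restarts=32,
    raw_samples=512,
    options={"batch_limit": 5, "maxiter": 200},
)
```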
[no batch_limit & 128 restarts]
Another benchmark, increasing `num_restarts` to 128 by default.
https://internalfb.com/intern/fblearner/details/746054543/?notif_channel=cli
[no batch_limit & new stop crit]
https://internalfb.com/intern/fblearner/details/746392062/?notif_channel=cli
[no batch_limit & new stop crit & 32 restarts]
https://internalfb.com/intern/fblearner/details/746394172/?notif_channel=cli
Results

| Version | Final Score Improvement Share | Average Score Improvement Share | Gen Time Speedup |
| --- | --- | --- | --- |
| [new stop crit] | 59% | 58% | 1.4x |
| [no batch_limit] | 55% | 52% | 2.4x |
| [no batch_limit & 128 restarts] | 61% | 52% | -2x (slowdown) |
| [no batch_limit & new stop crit] | TBD | TBD | TBD |
| [no batch_limit & new stop crit & 32 restarts] | TBD | TBD | TBD |

Note: The negative value in the [no batch_limit & 128 restarts] row indicates a slowdown rather than a speedup.
TODO
~~Batching can be further improved by waiting until all states are either done or need to do a function evaluation.~~ (Done; this yielded roughly 10-20% speedups on Ackley4.)
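A toy sketch of that driver loop follows: every restart's state machine is driven until it is either finished or waiting for an evaluation, and only then is a single batched call issued. The `_ToyState` class uses a plain gradient step purely as a runnable stand-in for scipy's `setulb`-driven L-BFGS-B state, and `batched_minimize` is an illustrative name, not the PR's API:

```python
import numpy as np

class _ToyState:
    """One restart's state machine; a gradient step stands in for L-BFGS-B."""

    def __init__(self, x0, lr=0.1, tol=1e-6, max_steps=200):
        self.x = np.asarray(x0, dtype=float)
        self.lr, self.tol, self.max_steps = lr, tol, max_steps
        self.steps, self.done = 0, False

    def step(self, f, g):
        """Consume the latest (f, g) and return "done" or "eval".

        "eval" means the state needs (f, g) at the current self.x; f is
        unused by this toy step but kept to mirror the real interface.
        """
        if self.done:
            return "done"
        if g is not None:
            if np.linalg.norm(g) < self.tol or self.steps >= self.max_steps:
                self.done = True
                return "done"
            self.x = self.x - self.lr * g  # take a step, then ask for a new eval
            self.steps += 1
        return "eval"

def batched_minimize(batched_fg, x0s):
    """Run independent states, batching all pending evaluations into one call.

    `batched_fg(X)` takes an (m, d) array and returns (values (m,), grads (m, d)).
    """
    states = [_ToyState(x0) for x0 in x0s]
    fs = [None] * len(states)
    gs = [None] * len(states)
    while True:
        # Drive every state until it is done or requests an evaluation, so
        # that each batched call below is as full as possible.
        pending = [i for i, s in enumerate(states) if s.step(fs[i], gs[i]) == "eval"]
        if not pending:
            break
        X = np.stack([states[i].x for i in pending])
        vals, grads = batched_fg(X)  # one batched evaluation for all pending states
        fs = [None] * len(states)
        gs = [None] * len(states)
        for j, i in enumerate(pending):
            fs[i], gs[i] = vals[j], grads[j]
    return np.stack([s.x for s in states])

# Toy usage: minimize ||x - 1||^2 from 8 random starts in 3 dimensions.
def fg(X):
    diff = X - 1.0
    return (diff ** 2).sum(axis=1), 2.0 * diff

print(np.round(batched_minimize(fg, np.random.rand(8, 3)), 3))  # rows near [1, 1, 1]
```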
WARNINGS:
1. Currently this method uses slightly different defaults than `scipy.optimize.minimize`; I did not test which ones are better. This should probably be done before merging (see the reference call after this list).
2. This is already working, but it is not yet turned on by default, which it should be before shipping (so: DO NOT MERGE JUST YET), and there should be some tests.
3. It is not yet tested for multi-objective and batched BO, where some shapes are different and thus things might break.
4. Currently the code for LbfgsInvHessProduct is copied over, as it changes with the newest SciPy version, potentially creating conflicts. But it might still be better to try to copy less.
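For reference, warning 1 is about scipy's documented L-BFGS-B defaults; an explicit call spelling them out looks like the sketch below (option values are taken from the scipy documentation and may shift between SciPy versions; whether the batched code should match them is exactly the open question above):

```python
import numpy as np
from scipy.optimize import minimize

def rosen_with_grad(x):
    # 2-D Rosenbrock function and its gradient, just to have something to minimize.
    f = 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2
    g = np.array([
        -400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
        200.0 * (x[1] - x[0] ** 2),
    ])
    return f, g

res = minimize(
    rosen_with_grad,
    x0=np.zeros(2),
    jac=True,                 # the callable returns (value, gradient)
    method="L-BFGS-B",
    bounds=[(-2.0, 2.0), (-2.0, 2.0)],
    options={
        "maxcor": 10,                   # history size of the limited-memory Hessian
        "ftol": 2.220446049250313e-09,  # relative reduction tolerance on f
        "gtol": 1e-05,                  # projected-gradient tolerance
        "maxfun": 15000,
        "maxiter": 15000,
        "maxls": 20,                    # max line-search steps per iteration
    },
)
print(res.x, res.nit)
```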
Differential Revision: D75977689