Batching L-BFGS-B (using as much as possible from scipy) #2870

Open · SamuelGabriel wants to merge 2 commits into main

Conversation

SamuelGabriel (Contributor)

Summary:
This PR changes our acquisition optimization to use a properly batched version of L-BFGS-B in the cases where we use L-BFGS-B.
It goes a little bit into the internals of SciPy to make their FORTRAN code work as well as possible for us.

For this, I have edited their scipy/optimize/_lbfgsb_py.py file to also handle batched functions. The FORTRAN code still runs sequentially: only the function evaluations are run in parallel (batched).
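
As a rough illustration of the mechanism, here is a minimal sketch (not the actual code in this diff); `step_until_evaluation_needed`, `requested_point`, `consume_evaluation`, `best_f`, and `acq_batched` are hypothetical stand-ins for the per-restart reverse-communication state that the edited `_lbfgsb_py.py` drives:

```python
import torch

# Minimal sketch of the batched driver loop. The per-restart state objects and
# their methods are hypothetical stand-ins, not the API introduced in this diff.
def batched_lbfgsb_sketch(states, acq_batched, max_outer_iters=1000):
    for _ in range(max_outer_iters):
        pending = []
        for s in states:
            if s.converged:
                continue
            # Run the sequential FORTRAN driver for this restart until it
            # either converges or asks for a new (value, gradient) pair.
            s.step_until_evaluation_needed()
            if not s.converged:
                pending.append(s)
        if not pending:
            break
        # All requested points are evaluated in one batched call -- this is
        # the part that runs "in parallel" and gives the speedup.
        X = torch.stack([s.requested_point for s in pending])  # (b, d)
        f, g = acq_batched(X)  # (b,), (b, d)
        for s, fi, gi in zip(pending, f, g):
            s.consume_evaluation(fi, gi)
    # Return the restart with the best objective value found.
    return min(states, key=lambda s: s.best_f)
```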

In a toy test I ran locally on Ackley_4d, using it as a drop-in replacement in the standard SingleTaskGP generation strategy led to a 2.5x speedup in generation time (which clearly dominates the overall time) and better performance (though performance should be judged after the nightly has run, as I only looked at 1 problem and 2 seeds).

These speedups are a lower bound (plus some noise): we can now make the batch sizes larger and will likely get further speedups from that (they should be very sizable for PFNs, but also for GPs, especially on GPUs).

Why does this speed things up?

  1. We converge faster, as each sub-problem is simpler.
  2. When some sub-problem has not converged, only that sub-problem is re-evaluated.

How is this slower?

  1. Based on my measurements, the L-BFGS-B implementation is really fast at taking steps for "larger" dimensionalities, while we now run a Python for-loop over the sub-problems there. (It converges much more slowly, though, as one can imagine when batching a problem the way we do.) This could likely be sped up again via threading/multi-processing, but beware: PyTorch CUDA tends to work really badly with multi-processing.
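
If the sequential stepping ever becomes the bottleneck, a thread-based variant (not part of this PR) could look roughly like the sketch below; it reuses the hypothetical per-state helpers from the sketch above, keeps the batched torch evaluation in the main thread, and only helps to the extent that the underlying FORTRAN routine releases the GIL.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: advance all un-converged L-BFGS-B states in threads
# instead of a plain Python for-loop. Threads (not processes) avoid the
# PyTorch-CUDA-with-multiprocessing problems; the benefit depends on the
# FORTRAN stepping code releasing the GIL.
def step_states_threaded(states, max_workers=8):
    active = [s for s in states if not s.converged]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(lambda s: s.step_until_evaluation_needed(), active))
    # States that still have not converged now hold an evaluation request.
    return [s for s in active if not s.converged]
```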

Benchmarking

[new stop crit]
Benchmarking (nightly) currently running: https://www.internalfb.com/mlhub/flow/745576495/overview

[no batch_limit]
Another benchmark, which does the following:

  • going back to our previous defaults for tolerance on the function optimum (a slightly different stopping criterion)
  • no more batch limits, as they are no longer needed; this should yield big speedups. I have removed them for initialization, too, even though that is somewhat unrelated (that speedup is not large compared to the speedup from the classic batch_limit); see the sketch below for how batch_limit is used today
  • I have improved the algorithm to build optimal batches, since, unexpectedly to me, the optimizer function sometimes needs to be called multiple times before it consumes an evaluation again

https://internalfb.com/intern/fblearner/details/746051698/?notif_channel=cli
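
For context, batch_limit is the existing option that chunks the num_restarts candidates through the acquisition function. A typical call looks roughly like this (model, train_Y, and bounds are assumed to already exist, and the option values are illustrative):

```python
from botorch.acquisition import LogExpectedImprovement
from botorch.optim import optimize_acqf

# Illustrative values only: batch_limit currently caps how many restart
# candidates are pushed through the acquisition function at once.
acqf = LogExpectedImprovement(model=model, best_f=train_Y.max())
candidate, value = optimize_acqf(
    acq_function=acqf,
    bounds=bounds,                 # (2, d) tensor of box bounds
    q=1,
    num_restarts=20,
    raw_samples=1024,
    options={"batch_limit": 5},    # the chunking option this change makes unnecessary
)
```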

[no batch_limit & 128 restarts]

Another benchmark, increasing the default num_restarts to 128.

https://internalfb.com/intern/fblearner/details/746054543/?notif_channel=cli

[no batch_limit & new stop crit]

https://internalfb.com/intern/fblearner/details/746392062/?notif_channel=cli

[no batch_limit & new stop crit & 32 restarts]

https://internalfb.com/intern/fblearner/details/746394172/?notif_channel=cli

Results

| Version | Final Score Improvement Share | Average Score Improvement Share | Gen Time Speedup |
| --- | --- | --- | --- |
| [new stop crit] | 59% | 58% | 1.4x |
| [no batch_limit] | 55% | 52% | 2.4x |
| [no batch_limit & 128 restarts] | 61% | 52% | -2x (slowdown) |
| [no batch_limit & new stop crit] | TBD | TBD | TBD |
| [no batch_limit & new stop crit & 32 restarts] | TBD | TBD | TBD |

Note: The negative value in the [no batch_limit & 128 restarts] row indicates a slowdown rather than a speedup.

TODO

(Done) Batching can be further improved by waiting until all states are either done or need a function evaluation; this yielded roughly 10-20% speedups on Ackley4.

WARNINGS:

1. Currently this method uses slightly different defaults than scipy.optimize.minimize. I did not test which ones are better; this should probably be done before merging (scipy's documented defaults are shown in the snippet after these warnings).

2. This is already working, but it is not yet turned on by default, which it should be before shipping (so: DO NOT MERGE JUST YET), and there should be some tests.

3. It is not yet tested for multi-objective and batched BO, where some shapes are different and thus things might break.

4. Currently the code for LbfgsInvHessProduct is copied over, as it changes with the newest SciPy version, potentially creating conflicts. It might still be better to try to copy less.
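
For reference on warning 1, these are the documented defaults of scipy.optimize.minimize(method="L-BFGS-B") in recent SciPy releases (worth re-checking against the pinned SciPy version; fun and x0 below are placeholders):

```python
from scipy.optimize import minimize

# Documented L-BFGS-B defaults in recent SciPy releases (double-check against
# the installed version); `fun` and `x0` are placeholders.
result = minimize(
    fun,
    x0,
    method="L-BFGS-B",
    options={
        "maxcor": 10,                   # number of limited-memory correction pairs
        "ftol": 2.220446049250313e-09,  # relative reduction tolerance on f
        "gtol": 1e-05,                  # tolerance on the projected gradient
        "maxfun": 15000,                # max number of function evaluations
        "maxiter": 15000,               # max number of iterations
        "maxls": 20,                    # max line-search steps per iteration
    },
)
```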

Differential Revision: D75977689

…orts

Summary: This is solely a base diff for the next diff, so that the next diff's changes look reasonable, since this code is copied.

Differential Revision: D75980919
facebook-github-bot added the CLA Signed label on Jun 6, 2025
facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D75977689

SamuelGabriel added a commit to SamuelGabriel/botorch that referenced this pull request Jun 6, 2025