Batching L-BFGS-B (using as much as possible from scipy) #2870

Open · SamuelGabriel wants to merge 2 commits into main

Conversation

SamuelGabriel (Contributor)

Summary:
This PR changes our acquisition optimization to use a properly batched version of L-BFGS-B in the cases where we use L-BFGS-B.
It goes a little bit into the internals of SciPy to make their FORTRAN code work as well as possible for us.

For this, I have edited their scipy/optimize/_lbfgsb_py.py file to also handle batched functions. The FORTRAN code still runs sequentially: only the function evaluations are run in parallel (batched).
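
As a rough illustration of the mechanism, here is a minimal sketch (not the actual code in this diff); `step_until_evaluation_needed`, `requested_point`, `consume_evaluation`, `best_f`, and `acq_batched` are hypothetical stand-ins for the per-restart reverse-communication state that the edited `_lbfgsb_py.py` drives:

```python
import torch

# Minimal sketch of the batched driver loop. The per-restart state objects and
# their methods are hypothetical stand-ins, not the API introduced in this diff.
def batched_lbfgsb_sketch(states, acq_batched, max_outer_iters=1000):
    for _ in range(max_outer_iters):
        pending = []
        for s in states:
            if s.converged:
                continue
            # Run the sequential FORTRAN driver for this restart until it
            # either converges or asks for a new (value, gradient) pair.
            s.step_until_evaluation_needed()
            if not s.converged:
                pending.append(s)
        if not pending:
            break
        # All requested points are evaluated in one batched call -- this is
        # the part that runs "in parallel" and gives the speedup.
        X = torch.stack([s.requested_point for s in pending])  # (b, d)
        f, g = acq_batched(X)  # (b,), (b, d)
        for s, fi, gi in zip(pending, f, g):
            s.consume_evaluation(fi, gi)
    # Return the restart with the best objective value found.
    return min(states, key=lambda s: s.best_f)
```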

In a toy test I ran locally on Ackley_4d, using it as a drop-in replacement in the standard SingleTaskGP generation strategy led to a 2.5x speedup in generation time (which clearly dominates the overall time) and better performance (though performance should be judged after the nightly has run, as I only looked at 1 problem and 2 seeds).

These speedups are a lower bound (plus some noise): we can now make the batch sizes larger and will likely get further speedups from that (they should be very sizable for PFNs, but also for GPs, especially on GPUs).

Why does this speed things up?

  1. We converge faster, as each sub-problem is simpler.
  2. When some sub-problem has not converged, only that sub-problem is re-evaluated.

How is this slower?

  1. Based on my measurements, the L-BFGS-B implementation is really fast at taking steps for "larger" dimensionalities, while we now run a Python for-loop over the sub-problems there. (It converges much more slowly, though, as one can imagine when batching a problem the way we do.) This could likely be sped up again via threading/multi-processing, but beware: PyTorch CUDA tends to work really badly with multi-processing.
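
If the sequential stepping ever becomes the bottleneck, a thread-based variant (not part of this PR) could look roughly like the sketch below; it reuses the hypothetical per-state helpers from the sketch above, keeps the batched torch evaluation in the main thread, and only helps to the extent that the underlying FORTRAN routine releases the GIL.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: advance all un-converged L-BFGS-B states in threads
# instead of a plain Python for-loop. Threads (not processes) avoid the
# PyTorch-CUDA-with-multiprocessing problems; the benefit depends on the
# FORTRAN stepping code releasing the GIL.
def step_states_threaded(states, max_workers=8):
    active = [s for s in states if not s.converged]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(lambda s: s.step_until_evaluation_needed(), active))
    # States that still have not converged now hold an evaluation request.
    return [s for s in active if not s.converged]
```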

Benchmarking

[new stop crit]
Benchmarking (nightly) currently running: https://www.internalfb.com/mlhub/flow/745576495/overview

[no batch_limit]
Another benchmark, which does the following:

  • going back to our previous defaults for tolerance on the function optimum (a slightly different stopping criterion)
  • no more batch limits, as they are no longer needed; this should yield big speedups. I have removed them for initialization, too, even though that is somewhat unrelated (that speedup is not large compared to the speedup from the classic batch_limit); see the sketch below for how batch_limit is used today
  • I have improved the algorithm to build optimal batches, since, unexpectedly to me, the optimizer function sometimes needs to be called multiple times before it consumes an evaluation again

https://internalfb.com/intern/fblearner/details/746051698/?notif_channel=cli
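
For context, batch_limit is the existing option that chunks the num_restarts candidates through the acquisition function. A typical call looks roughly like this (model, train_Y, and bounds are assumed to already exist, and the option values are illustrative):

```python
from botorch.acquisition import LogExpectedImprovement
from botorch.optim import optimize_acqf

# Illustrative values only: batch_limit currently caps how many restart
# candidates are pushed through the acquisition function at once.
acqf = LogExpectedImprovement(model=model, best_f=train_Y.max())
candidate, value = optimize_acqf(
    acq_function=acqf,
    bounds=bounds,                 # (2, d) tensor of box bounds
    q=1,
    num_restarts=20,
    raw_samples=1024,
    options={"batch_limit": 5},    # the chunking option this change makes unnecessary
)
```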

[no batch_limit & 128 restarts]

Another benchmark, increasing the default num_restarts to 128.

https://internalfb.com/intern/fblearner/details/746054543/?notif_channel=cli

[no batch_limit & new stop crit]

https://internalfb.com/intern/fblearner/details/746392062/?notif_channel=cli

[no batch_limit & new stop crit & 32 restarts]

https://internalfb.com/intern/fblearner/details/746394172/?notif_channel=cli

Results

| Version | Final Score Improvement Share | Average Score Improvement Share | Gen Time Speedup |
| --- | --- | --- | --- |
| [new stop crit] | 59% | 58% | 1.4x |
| [no batch_limit] | 55% | 52% | 2.4x |
| [no batch_limit & 128 restarts] | 61% | 52% | -2x (slowdown) |
| [no batch_limit & new stop crit] | TBD | TBD | TBD |
| [no batch_limit & new stop crit & 32 restarts] | TBD | TBD | TBD |

Note: The negative value in the [no batch_limit & 128 restarts] row indicates a slowdown rather than a speedup.

TODO

(Done) Batching can be further improved by waiting until all states are either done or need a function evaluation; this yielded roughly 10-20% speedups on Ackley4.

WARNINGS:

1. Currently this method uses slightly different defaults than scipy.optimize.minimize. I did not test which ones are better; this should probably be done before merging (scipy's documented defaults are shown in the snippet after these warnings).

2. This is already working, but it is not yet turned on by default, which it should be before shipping (so: DO NOT MERGE JUST YET), and there should be some tests.

3. It is not yet tested for multi-objective and batched BO, where some shapes are different and thus things might break.

4. Currently the code for LbfgsInvHessProduct is copied over, as it changes with the newest SciPy version, potentially creating conflicts. It might still be better to try to copy less.
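
For reference on warning 1, these are the documented defaults of scipy.optimize.minimize(method="L-BFGS-B") in recent SciPy releases (worth re-checking against the pinned SciPy version; fun and x0 below are placeholders):

```python
from scipy.optimize import minimize

# Documented L-BFGS-B defaults in recent SciPy releases (double-check against
# the installed version); `fun` and `x0` are placeholders.
result = minimize(
    fun,
    x0,
    method="L-BFGS-B",
    options={
        "maxcor": 10,                   # number of limited-memory correction pairs
        "ftol": 2.220446049250313e-09,  # relative reduction tolerance on f
        "gtol": 1e-05,                  # tolerance on the projected gradient
        "maxfun": 15000,                # max number of function evaluations
        "maxiter": 15000,               # max number of iterations
        "maxls": 20,                    # max line-search steps per iteration
    },
)
```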

Differential Revision: D75977689

…orts

Summary: This is solely a base diff for the next diff, so that the next diff's changes look reasonable, since this code is copied.

Differential Revision: D75980919
facebook-github-bot added the CLA Signed label on Jun 6, 2025
facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D75977689

SamuelGabriel added a commit to SamuelGabriel/botorch that referenced this pull request Jun 6, 2025