🐛 Bug
I'm trying to get the L-BFGS optimizer to work on TPU, but I'm facing a huge slowdown at the first call to step(). The original code can be found here. I also saw a similar issue with the LAMB optimizer, and I was facing a deprecation warning from _add() in this step. I am unable to pinpoint the exact line in the code that is the major cause of the slowdown.
In my analysis, max_iter used in this loop is 30; the first and second iterations are fast, but the later ones suddenly become very slow. What else can I do? I used the VM configuration specified in the Fairseq tutorial. Should I increase the number of cores for my problem? Do I need to move the variables in the fitting code to the xla_device?
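To narrow down which step() call slows down, I used a timing harness like the following (a plain-CPU sketch with a hypothetical toy model; on TPU, torch_xla's `xm.mark_step()` / `xm.wait_device_ops()` would be needed before reading the clock, to force lazy execution):

```python
import time
import torch

# Hypothetical toy model standing in for the real fitting code.
model = torch.nn.Linear(3, 1)
x, y = torch.randn(32, 3), torch.randn(32, 1)
opt = torch.optim.LBFGS(model.parameters(), max_iter=30)

def closure():
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    return loss

times = []
for i in range(3):
    t0 = time.perf_counter()
    opt.step(closure)
    # On TPU, call xm.wait_device_ops() here before timing.
    times.append(time.perf_counter() - t0)
    print(f"step {i}: {times[-1]:.3f}s")
```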
To Reproduce
Steps to reproduce the behaviour:
- I take a GCP VM instance and a TPU processing node by following these steps.
- Then, I follow the steps given in the SMPL-X repository to install SMPL-X on the VM.
- In the code, I change all instances of `device` as given here to `device = xm.xla_device()`, and `optimizer.step` to `loss = xm.optimizer_step(optimizer, optimizer_args={'closure': closure}, barrier=True)`.
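The two changes above can be sketched roughly as follows (a CPU-runnable toy model stands in for the SMPL-X fitting code; the TPU-specific lines are shown as comments, since torch_xla only works on an XLA-enabled machine):

```python
import torch

torch.manual_seed(0)

# On TPU the device line would become:
#   import torch_xla.core.xla_model as xm
#   device = xm.xla_device()
device = torch.device("cpu")

# Hypothetical toy data/model in place of SMPL-X.
x = torch.randn(64, 3, device=device)
target = torch.randn(64, 1, device=device)
model = torch.nn.Linear(3, 1).to(device)
optimizer = torch.optim.LBFGS(model.parameters(), max_iter=30)

def closure():
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), target)
    loss.backward()
    return loss

initial = closure().item()
optimizer.step(closure)
# On TPU the step call would become:
#   loss = xm.optimizer_step(optimizer,
#                            optimizer_args={'closure': closure},
#                            barrier=True)
final = closure().item()
```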
Environment
- Reproducible on XLA backend [CPU/TPU]: TPU
- torch_xla version: 1.6
- OS: Linux transformer-tutorial 4.9.0-13-amd64 #1 SMP Debian 4.9.228-1 (2020-07-05) x86_64 GNU/Linux
- GCC version: 6.3.0
- Python version: 3.6 (64-bit runtime)
- Is CUDA available: False
- CUDA runtime version: No CUDA
- GPU models and configuration: No CUDA
- Nvidia driver version: No CUDA
- cuDNN version: No CUDA
- numpy 1.16.3 pip
- numpy 1.19.1 py36hbc911f0_0
- numpy-base 1.19.1 py36hfa32c7d_0
- numpydoc 1.1.0 py_0
- torch 1.6.0 pip
- torch-xla 1.6 pip
- torchgeometry 0.1.2 pip
- torchvision 0.7.0 pip
- blas 1.0 mkl
- mkl 2020.3 intel_279 intel
- mkl-service 2.3.0 py36he904b0f_0
- mkl_fft 1.1.0 py36h23d657b_0
- mkl_random 1.1.1 py36h0573a6f_0