Bug description
If `BatchSizeFinder` is used with `trainer.validate` or `trainer.test`, it leaves the model in training mode, which keeps `Dropout` and other sources of model randomness enabled. That influences the model's predictions and produces unreliable validation metrics.
Steps to reproduce:

1. Add `BatchSizeFinder` to a trainer.
2. Run validation (without fit).
3. Observe that the model output differs from a run without `BatchSizeFinder` (it can also be influenced by changing the random state).

While this doesn't matter for `trainer.fit` (or the LightningCLI `fit` subcommand, for that matter), it creates undesired randomness when a user wants to re-evaluate a model they have already trained.
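The effect can be seen with plain PyTorch, outside Lightning entirely. The sketch below uses a hypothetical toy model containing `Dropout`: in training mode (the state `BatchSizeFinder` leaves the model in), repeated forward passes on the same input disagree, while in eval mode they are deterministic.

```python
import torch

# Toy model: a Linear layer followed by Dropout. Any model with Dropout
# (or other train-time randomness) would show the same effect.
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.Dropout(p=0.5))
x = torch.ones(1, 32)

model.train()  # the state BatchSizeFinder leaves the model in
out_a, out_b = model(x), model(x)
print(torch.equal(out_a, out_b))  # False: Dropout re-randomizes each pass

model.eval()   # the state validation/testing expects
out_c, out_d = model(x), model(x)
print(torch.equal(out_c, out_d))  # True: deterministic output
```

Any validation metric computed from the train-mode outputs inherits this randomness, which is exactly what the loss numbers below show.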
What version are you seeing the problem on?
v1.8, v2.0
How to reproduce the bug
Minimal example with 3 validation outputs:

- Without `BatchSizeFinder`
- With the default `BatchSizeFinder`
- With a `BatchSizeFinder` that calls `trainer.model.eval()`
Note that something like `Dropout` needs to be present in the model to replicate this behavior.
Error messages and logs
Without BatchSizeFinder:
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Validate metric DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
valid_loss 15.337783813476562
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
With BatchSizeFinder:
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Validate metric DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
valid_loss 26.495624542236328
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
With BatchSizeFinder that calls to `trainer.model.eval()`:
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Validate metric DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
valid_loss 15.337783813476562
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
As you can see, the loss with `BatchSizeFinder` differs from the other two options.
The difference in loss can be fixed either by calling `trainer.model.eval()` or by removing the randomness from the model (`Dropout` in this case).
Environment
Current environment
- GPU:
- NVIDIA GeForce RTX 3050 Laptop GPU
- available: True
- version: 12.1
- lightning: 2.1.0
- lightning-utilities: 0.9.0
- pytorch-lightning: 2.1.0
- torch: 2.1.0
- torchmetrics: 1.2.0
- aiohttp: 3.8.6
- aiosignal: 1.3.1
- async-timeout: 4.0.3
- attrs: 23.1.0
- certifi: 2023.7.22
- charset-normalizer: 3.3.0
- filelock: 3.12.4
- frozenlist: 1.4.0
- fsspec: 2023.9.2
- idna: 3.4
- jinja2: 3.1.2
- lightning: 2.1.0
- lightning-utilities: 0.9.0
- markupsafe: 2.1.3
- mpmath: 1.3.0
- multidict: 6.0.4
- networkx: 3.1
- numpy: 1.24.4
- nvidia-cublas-cu12: 12.1.3.1
- nvidia-cuda-cupti-cu12: 12.1.105
- nvidia-cuda-nvrtc-cu12: 12.1.105
- nvidia-cuda-runtime-cu12: 12.1.105
- nvidia-cudnn-cu12: 8.9.2.26
- nvidia-cufft-cu12: 11.0.2.54
- nvidia-curand-cu12: 10.3.2.106
- nvidia-cusolver-cu12: 11.4.5.107
- nvidia-cusparse-cu12: 12.1.0.106
- nvidia-nccl-cu12: 2.18.1
- nvidia-nvjitlink-cu12: 12.2.140
- nvidia-nvtx-cu12: 12.1.105
- packaging: 23.2
- pip: 23.2.1
- pytorch-lightning: 2.1.0
- pyyaml: 6.0.1
- requests: 2.31.0
- setuptools: 68.1.2
- sympy: 1.12
- torch: 2.1.0
- torchmetrics: 1.2.0
- tqdm: 4.66.1
- triton: 2.1.0
- typing-extensions: 4.8.0
- urllib3: 2.0.7
- wheel: 0.41.2
- yarl: 1.9.2
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.18
- release: 5.15.0-83-generic
- version: #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023
More info
No response