Describe the bug
Trying to do multi-GPU training of FastFlow by setting strategy: ddp and accelerator: gpu in the config, I get the following error:
```
Traceback (most recent call last):
File "/home/sean/combinedpipe/run_anomalib.py", line 40, in <module>
trainer.fit(model=model, datamodule=datamodule)
File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
self._run_train()
File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
self.fit_loop.run()
File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.on_advance_end()
File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 250, in on_advance_end
self._run_validation()
File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 308, in _run_validation
self.val_loop.run()
File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 206, in run
output = self.on_run_end()
File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 180, in on_run_end
self._evaluation_epoch_end(self._outputs)
File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 288, in _evaluation_epoch_end
self.trainer._call_lightning_module_hook(hook_name, output_or_outputs)
File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/sean/anomalib/src/anomalib/models/components/base/anomaly_module.py", line 145, in validation_epoch_end
self._compute_adaptive_threshold(outputs)
File "/home/sean/anomalib/src/anomalib/models/components/base/anomaly_module.py", line 162, in _compute_adaptive_threshold
self.image_threshold.compute()
File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/metric.py", line 529, in wrapped_func
with self.sync_context(
File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
return next(self.gen)
File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/metric.py", line 500, in sync_context
self.sync(
File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/metric.py", line 452, in sync
self._sync_dist(dist_sync_fn, process_group=process_group)
File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/metric.py", line 364, in _sync_dist
output_dict = apply_to_collection(
File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 203, in apply_to_collection
return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 203, in <dictcomp>
return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 209, in apply_to_collection
return elem_type([apply_to_collection(d, dtype, function, *args, **kwargs) for d in data])
File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 209, in <listcomp>
return elem_type([apply_to_collection(d, dtype, function, *args, **kwargs) for d in data])
File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 199, in apply_to_collection
return function(data, *args, **kwargs)
File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/utilities/distributed.py", line 131, in gather_all_tensors
torch.distributed.all_gather(local_sizes, local_size, group=group)
File "/home/sean/sean/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
return func(*args, **kwargs)
File "/home/sean/sean/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2450, in all_gather
work = group.allgather([tensor_list], [tensor])
RuntimeError: Tensors must be CUDA and dense
```
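For context, the failure happens because torch.distributed.all_gather is handed CPU tensors while Lightning's ddp strategy on GPUs uses the NCCL backend, which can only gather CUDA tensors. A minimal standalone sketch (assuming 2 GPUs and the NCCL backend; the script structure is illustrative, not from the report) reproduces the same RuntimeError:

```python
# Minimal repro sketch of "Tensors must be CUDA and dense" (illustrative):
# NCCL collectives reject CPU tensors.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def demo(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Metric state left on the CPU, as in anomaly_module.py after .cpu():
    local = torch.zeros(1)
    gathered = [torch.zeros(1) for _ in range(world_size)]
    dist.all_gather(gathered, local)  # RuntimeError: Tensors must be CUDA and dense

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(demo, args=(2,), nprocs=2)
```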
Working around the RuntimeError: Tensors must be CUDA and dense error by removing all .cpu() calls in src/anomalib/models/components/base/anomaly_module.py results in image_F1Score being 0.0 during both validation and testing.
Why is AnomalyScoreThreshold incompatible with multi-GPU training, and how could it be modified to be compatible?
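One possible direction, sketched here as an untested assumption rather than the project's actual fix: keep the threshold metrics on the module's device while compute() runs its cross-rank sync, then move them back to CPU. Since torchmetrics.Metric subclasses nn.Module, .to() and .cpu() move its states between devices. The attribute names below come from the traceback (pixel_threshold is assumed to exist alongside image_threshold); the collection and update of outputs done by the real method is elided:

```python
# Hedged sketch, NOT the project's actual fix: move metric states to the GPU
# so the NCCL all_gather inside compute() sees CUDA tensors, then restore
# the CPU placement the rest of the module expects.
def _compute_adaptive_threshold(self, outputs) -> None:
    # ... the real method first feeds `outputs` into the metrics (elided) ...
    self.image_threshold.to(self.device)   # NCCL requires CUDA tensors
    self.pixel_threshold.to(self.device)   # assumed sibling of image_threshold
    try:
        self.image_threshold.compute()     # cross-rank sync happens here
        self.pixel_threshold.compute()
    finally:
        self.image_threshold.cpu()         # back to CPU for the rest of the run
        self.pixel_threshold.cpu()
```

This only addresses the device mismatch; the zero image_F1Score seen after deleting the .cpu() calls suggests a second problem (for example, per-rank thresholds diverging or outputs not being gathered before the threshold is fit), which this sketch does not diagnose.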
Dataset
Other (please specify in the text field below)
Model
FastFlow
Steps to reproduce the behavior
See bug description.
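The run_anomalib.py from the traceback isn't included in the report; a hypothetical reconstruction, modeled on the anomalib 0.x tools/train.py entry point (the trainer override keys are assumptions based on the settings described above):

```python
# Hypothetical reconstruction of run_anomalib.py (the actual script is not
# shown in the report), following the anomalib 0.x training entry points.
from pytorch_lightning import Trainer

from anomalib.config import get_configurable_parameters
from anomalib.data import get_datamodule
from anomalib.models import get_model
from anomalib.utils.callbacks import get_callbacks

# Settings from the report: FastFlow, DDP across 2 GPUs.
config = get_configurable_parameters(model_name="fastflow")
config.trainer.accelerator = "gpu"
config.trainer.strategy = "ddp"
config.trainer.devices = 2

datamodule = get_datamodule(config)  # hazelnut toy dataset per the report
model = get_model(config)
trainer = Trainer(**config.trainer, callbacks=get_callbacks(config))
trainer.fit(model=model, datamodule=datamodule)
```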
OS information
OS information:
- OS: Ubuntu 22.04.3
- Python version: 3.10.12
- Anomalib version: main branch on GitHub
- PyTorch version: 2.0.1
- CUDA/cuDNN version: 12.2
- GPU models and configuration: 2x NVIDIA RTX 6000 Ada
- Any other relevant information: I'm using the hazelnut toy dataset
Expected behavior
I expected to be able to do multi-GPU training using FastFlow and for the F1 score to be non-zero.
Screenshots
No response
Pip/GitHub
GitHub
What version/branch did you use?
main
Configuration YAML
See bug description.
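Since the file itself isn't attached, a hedged sketch of the trainer keys implied by the report, assuming the standard anomalib 0.x fastflow config layout:

```yaml
trainer:
  accelerator: gpu
  strategy: ddp   # multi-GPU DistributedDataParallel
  devices: 2      # 2x RTX 6000 Ada per the report
```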
Logs
See bug description.
Code of Conduct
- I agree to follow this project's Code of Conduct