Replace uses of custom all_gather with PyTorch's all_gather_object #3804

fmassa · 2021-05-10T11:27:45Z

🚀 Feature

In references/detection in torchvision, we have a primitive called all_gather that distributes arbitrary picklable objects among different processes; see

vision/references/detection/utils.py

Lines 75 to 115 in d6fee5a

    
    def all_gather(data):
        """
        Run all_gather on arbitrary picklable data (not necessarily tensors)
        Args:
            data: any picklable object
        Returns:
            list[data]: list of data gathered from each rank
        """
        world_size = get_world_size()
        if world_size == 1:
            return [data]

        # serialized to a Tensor
        buffer = pickle.dumps(data)
        storage = torch.ByteStorage.from_buffer(buffer)
        tensor = torch.ByteTensor(storage).to("cuda")

        # obtain Tensor size of each rank
        local_size = torch.tensor([tensor.numel()], device="cuda")
        size_list = [torch.tensor([0], device="cuda") for _ in range(world_size)]
        dist.all_gather(size_list, local_size)
        size_list = [int(size.item()) for size in size_list]
        max_size = max(size_list)

        # receiving Tensor from all ranks
        # we pad the tensor because torch all_gather does not support
        # gathering tensors of different shapes
        tensor_list = []
        for _ in size_list:
            tensor_list.append(torch.empty((max_size,), dtype=torch.uint8, device="cuda"))
        if local_size != max_size:
            padding = torch.empty(size=(max_size - local_size,), dtype=torch.uint8, device="cuda")
            tensor = torch.cat((tensor, padding), dim=0)
        dist.all_gather(tensor_list, tensor)

        data_list = []
        for size, tensor in zip(size_list, tensor_list):
            buffer = tensor.cpu().numpy().tobytes()[:size]
            data_list.append(pickle.loads(buffer))

        return data_list

Since pytorch/pytorch#42189, PyTorch natively supports this primitive, so we should replace its usage in the references with PyTorch's equivalent and verify that we obtain the same results as before.
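For illustration, here is a minimal sketch of what the replacement could look like. It is not necessarily the change that was eventually merged. It assumes `get_world_size` is a small helper that returns 1 when `torch.distributed` is unavailable or not initialized; that helper is not shown in this issue, so the version below is a stand-in.

```python
import torch.distributed as dist


def get_world_size():
    # Assumed helper: fall back to 1 when no process group is set up.
    if not dist.is_available() or not dist.is_initialized():
        return 1
    return dist.get_world_size()


def all_gather(data):
    """Gather arbitrary picklable data from every rank into a list."""
    world_size = get_world_size()
    if world_size == 1:
        return [data]
    # all_gather_object fills the pre-sized list in place, one slot per rank.
    data_list = [None] * world_size
    dist.all_gather_object(data_list, data)
    return data_list
```

Note that with an NCCL process group, `all_gather_object` moves its internal buffers to the device given by `torch.cuda.current_device()`, so each rank would likely need to call `torch.cuda.set_device` beforehand.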

The implementation of PyTorch's all_gather_object is a useful reference for validating that both functions do indeed implement the same thing.
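When comparing the two, the property to check is that every rank gets back, in rank order, objects equal to the ones contributed. As a rough aid, the sketch below mimics the custom helper's serialize, pad, and truncate steps in a single process using only the standard library. The name `simulate_padded_gather` is hypothetical, no communication takes place, and zero bytes stand in for the uninitialized padding used by the original code.

```python
import pickle


def simulate_padded_gather(objects):
    """Single-process stand-in: one entry of `objects` per pretend rank."""
    # Serialize each rank's object and record the true payload sizes.
    payloads = [pickle.dumps(obj) for obj in objects]
    size_list = [len(p) for p in payloads]
    max_size = max(size_list)
    # Pad every payload to max_size, as the custom helper does before all_gather.
    padded = [p + b"\x00" * (max_size - len(p)) for p in payloads]
    # Truncate back to the recorded size before unpickling.
    return [pickle.loads(buf[:size]) for size, buf in zip(size_list, padded)]


per_rank = [{"boxes": [1, 2, 3]}, "short", list(range(50))]
assert simulate_padded_gather(per_rank) == per_rank
```

A real validation would run both functions under an initialized process group on each rank and compare the returned lists.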


fmassa added the enhancement, module: reference scripts, and good first issue labels May 10, 2021

fmassa assigned prabhat00155 May 10, 2021

prabhat00155 mentioned this issue May 18, 2021

Updated all_gather() to make use of all_gather_object() #3857

Merged

prabhat00155 closed this as completed in #3857 May 19, 2021
