The goal of this issue is to discuss a solution in EB to allow for faster development of the PyTorch ecosystem in each toolchain. The two main motivations are:
- there are more and more packages that depend on PyTorch, and the easyconfigs of PyTorch are a bottleneck to start working on those
- building PyTorch is also becoming more and more complex (just look at the number of patches in 2.6), which reduces the number of people who are willing or able to tackle the development of easyconfigs for new versions
The initial idea is to have a 2-stage distribution of PyTorch easyconfigs:
- Stage-1: easyconfigs using pre-built binaries of PyTorch from wheels
  - easy to update in EB: with usual PythonBundle easyblock installs
  - no need for extensive testing: the pre-built binaries are already tested, so we can just check that they work in the host system, avoiding testing the thousand features in PyTorch (as we do with builds from source)
  - has to integrate with the dependencies in EB
- Stage-2: easyconfigs building PyTorch from source (as usual)
  - the same type of installation we currently do where PyTorch is built from source
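
To make the Stage-1 idea more concrete, here is a rough sketch of what such an easyconfig might look like. It is only an illustration under assumptions: the version, version suffix, toolchain, dependency list, wheel URL and wheel file name are all placeholders chosen for the example (none of them come from a tested recipe), and checksums are left out on purpose.

```python
# Hypothetical Stage-1 easyconfig sketch. Every version, suffix, URL and
# file name below is an illustrative assumption, not a tested recipe.
easyblock = 'PythonBundle'

name = 'PyTorch'
version = '2.6.0'
versionsuffix = '-whl'  # assumed suffix to tell the wheel install apart

homepage = 'https://pytorch.org/'
description = "PyTorch installed from the upstream pre-built wheel (Stage-1)."

toolchain = {'name': 'foss', 'version': '2024a'}

# Dependencies provided by EB that the wheel has to integrate with
# (assumed, incomplete set).
dependencies = [
    ('Python', '3.12.3'),
    ('SciPy-bundle', '2024.05'),
]

exts_list = [
    ('torch', version, {
        # Placeholder wheel location and file name for a CPU-only build;
        # both must be checked against the actual upstream wheel index.
        'source_urls': ['https://download.pytorch.org/whl/cpu/'],
        'source_tmpl': 'torch-%(version)s+cpu-cp312-cp312-linux_x86_64.whl',
        'modulename': 'torch',
    }),
]

# Light smoke test on the host system instead of the full upstream test suite.
sanity_check_commands = [
    "python -c 'import torch; x = torch.ones(2, 2); assert (x @ x).sum().item() == 8.0'",
]

moduleclass = 'ai'
```

The point of the sketch is the shape, not the details: updating to a new PyTorch release would mostly mean bumping `version` and the wheel file name, and the sanity check only verifies that the pre-built binary loads and computes on the host.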
The key feature for Stage-2 is that it should be transparent to users. This means that there should be some mechanism that allows deploying the optimized PyTorch side by side with the wheel PyTorch and automagically swapping the wheel module for the optimized one, without reinstalling any of the software that depends on that version of PyTorch.
Some preliminary info:
- The wheels from the PyTorch project already cover CPU, CUDA and ROCm; so there is no need to look elsewhere for our purpose.
- In Brussels we already have some experience in doing module swaps with PyTorch. We installed two different versions of PyTorch in the same toolchain without reinstalling any of the packages on top of it. This happened for 2023a: we started by installing PyTorch v2.1.2, then the tens of packages needing PyTorch were gradually added on top of it. Months later we installed v2.3.0 and just instructed our users to manually swap the PyTorch modules. That worked very well and we have not received any reports of issues with that approach. However, that was before EB 5 and without RPATH.