Implement solution for faster release of PyTorch easyconfigs #921

@lexming

The goal of this issue is to discuss a solution in EB to allow for faster development of the PyTorch ecosystem in each toolchain. The two main motivations are:

  • there are more and more packages that depend on PyTorch, and the easyconfigs of PyTorch are a bottleneck to start working on those
  • building PyTorch is also becoming more and more complex (just look at the number of patches in 2.6), which reduces the number of people who are willing or able to tackle the development of easyconfigs for new versions

The initial idea is to have a 2-stage distribution of PyTorch easyconfigs:

  • Stage-1: easyconfigs using pre-built binaries of PyTorch from wheels
      • easy to update in EB: uses the usual PythonBundle easyblock installs
      • no need for extensive testing: the pre-built binaries are already tested, so we can just check that they work on the host system and avoid testing the thousands of features in PyTorch (as we do with builds from source)
      • has to integrate with the dependencies in EB
  • Stage-2: easyconfigs building PyTorch from source (as usual)
      • the same type of installation we currently do, where PyTorch is built from source
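To make the Stage-1 idea more concrete, here is a rough sketch of what a wheel-based easyconfig might look like. This is only an illustration, not a tested recipe: the `-whl` version suffix, the toolchain, and the dependency versions are assumptions, and the wheel file name and checksums are deliberately left out.

```python
# Hypothetical Stage-1 easyconfig sketch (not a tested recipe).
easyblock = 'PythonBundle'

name = 'PyTorch'
version = '2.6.0'
versionsuffix = '-whl'  # hypothetical suffix to mark the wheel-based install

homepage = 'https://pytorch.org/'
description = "PyTorch installed from the upstream pre-built wheel (Stage-1)."

# Assumed toolchain and dependency versions, for illustration only.
toolchain = {'name': 'foss', 'version': '2023b'}

dependencies = [
    ('Python', '3.11.5'),
    ('SciPy-bundle', '2023.11'),
]

exts_list = [
    ('torch', version, {
        # The wheel file name and checksums are intentionally omitted here;
        # a real easyconfig would have to specify them.
    }),
]

# Light check that the wheel imports on the host, instead of the full test suite.
sanity_check_commands = ["python -c 'import torch; print(torch.__version__)'"]

moduleclass = 'ai'
```

The point of the sketch is that updating to a new PyTorch release would mostly mean bumping `version` and the wheel details, while the dependency list is where the integration with the rest of EB has to happen.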

The key feature for Stage-2 is that it should be transparent to users. This means there should be some mechanism that allows deploying the optimized PyTorch side by side with the wheel PyTorch and automagically swapping the wheel module for the optimized one, without reinstalling any of the software that depends on that version of PyTorch.

Some preliminary info:

  • The wheels from the PyTorch project already cover CPU, CUDA and ROCm, so there is no need to look elsewhere for our purpose.
  • In Brussels we already have some experience in doing module swaps with PyTorch. We installed two different versions of PyTorch in the same toolchain without reinstalling any of the packages on top of it. This happened for 2023a: we started by installing PyTorch v2.1.2, and the tens of packages needing PyTorch were then gradually added on top of it. Months later we installed v2.3.0 and just instructed our users to manually swap the PyTorch modules. That worked very well and we have not received any reports of issues with that approach. However, that was before EB 5 and without RPATH.
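After a swap like the one described above, it could be useful to confirm which PyTorch installation Python actually picks up and that it still runs a trivial operation. The following is a minimal sketch of such a check; the function name `describe_loaded` and the shape of the report are hypothetical, and it uses only standard `importlib` calls plus basic `torch` tensor operations.

```python
import importlib
import importlib.util


def describe_loaded(module_name="torch"):
    """Report which installation of a module Python resolves, if any."""
    report = {"module": module_name, "installed": False,
              "version": None, "location": None, "ok": False}
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        # Module is not visible in the current environment.
        return report
    report["installed"] = True
    report["location"] = spec.origin  # path shows which install is in use
    mod = importlib.import_module(module_name)
    report["version"] = getattr(mod, "__version__", None)
    if module_name == "torch":
        # Tiny functional check: a 2x2 matrix product of ones sums to 8.
        x = mod.ones(2, 2)
        report["ok"] = bool((x @ x).sum().item() == 8.0)
        report["cuda"] = mod.cuda.is_available()
    else:
        report["ok"] = True
    return report


if __name__ == "__main__":
    print(describe_loaded("torch"))
```

Running this before and after swapping the modules should show a different `version` and `location`, while the packages installed on top of PyTorch stay untouched.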
