-
Notifications
You must be signed in to change notification settings - Fork 308
make TensorFlow easyblock ignore the PKG_REVISION identifier if NCCL version if it exists #2085
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
FZJ-JSC has a NCCL versioned 2.4.6-1. This makes the bazel question about NCCL version to fail - as the version on the `nccl.h` is like this: ```#define NCCL_MAJOR 2 #define NCCL_MINOR 4 #define NCCL_PATCH 6 #define NCCL_SUFFIX "" #define NCCL_VERSION_CODE 2406``` This patch makes sure that NCCL versions with patch names (and sub-names) work.
easybuild/easyblocks/t/tensorflow.py
Outdated
| raise EasyBuildError("TensorFlow has a strict dependency on cuDNN if CUDA is enabled") | ||
| if nccl_root: | ||
| nccl_version = get_software_version('NCCL') | ||
| nccl_maj_min_ver = '.'.join(nccl_version.split('.')[:2]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means we are reporting 2.4 as opposed to 2.4.6 (or 2.4.6-1), are we sure this doesn't have consequences? We could be more conservative here and do
# Ignore the PKG_REVISION identifier (i.e., report 2.4.6 for 2.4.6-1 or 2.4.6-2)
nccl_version = nccl_version.split('-')[0]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's better.
I just scrapped the -x on the version when creating the NCCL ECs, didn't see a reason to keep them.
And it's not really a pre-release identifier, it's just what the OS agnostic version uses instead of "-ga"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did take a look at the releases last night in https://github.com/NVIDIA/nccl/releases and there is a 2.5.6-2 in there, so I don't think they can be completely ignored
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW: I checked the TF sources on what they do with this: They search the provided nccl installation paths for a nccl.h and compare the given version against the major.minor.patch given in the header by matching it at the start. I.e. if "major.minor.patch".startswith(TF_NCCL_VERSION) --> OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what you would suggest we do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I committed directly to the branch, I meant to make a suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically omitting the patch version works, but I'd stay with the more explicit approach and include it. Hence the current state with the above suggestion is fine
|
@akesandgren Can you review/merge this since I've now made a commit on this PR, thanks. |
akesandgren
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Going in, thanks @surak! |
FZJ-JSC has a NCCL versioned 2.4.6-1. This makes the bazel question about NCCL version to fail - as the version on the
nccl.his like this: