@surak
Contributor

@surak surak commented Jun 25, 2020

FZJ-JSC has NCCL installed as version 2.4.6-1. This makes the Bazel question about the NCCL version fail, because the version in `nccl.h` is defined like this:

```
#define NCCL_MAJOR 2
#define NCCL_MINOR 4
#define NCCL_PATCH 6
#define NCCL_SUFFIX ""
#define NCCL_VERSION_CODE 2406
```

This patch makes sure that NCCL versions with patch names (and sub-names) work.
    raise EasyBuildError("TensorFlow has a strict dependency on cuDNN if CUDA is enabled")
if nccl_root:
    nccl_version = get_software_version('NCCL')
    nccl_maj_min_ver = '.'.join(nccl_version.split('.')[:2])
Member

@ocaisa ocaisa Jun 26, 2020

This means we are reporting 2.4 as opposed to 2.4.6 (or 2.4.6-1). Are we sure this doesn't have consequences? We could be more conservative here and do:

# Ignore the PKG_REVISION identifier (i.e., report 2.4.6 for 2.4.6-1 or 2.4.6-2)
nccl_version = nccl_version.split('-')[0]
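
To make the difference between the two approaches concrete, here is a minimal standalone sketch. The helper names `strip_pkg_revision` and `major_minor` are hypothetical and only wrap the two expressions quoted above; they are not part of the easyblock.

```python
def strip_pkg_revision(version):
    """Drop a trailing PKG_REVISION suffix such as '-1' (hypothetical helper)."""
    return version.split('-')[0]


def major_minor(version):
    """Keep only major.minor, as in the diff under review (hypothetical helper)."""
    return '.'.join(version.split('.')[:2])


for version in ('2.4.6-1', '2.5.6-2', '2.4.6'):
    # The first value keeps the patch level, the second one discards it.
    print(version, strip_pkg_revision(version), major_minor(version))
```

A version without a suffix passes through `strip_pkg_revision` unchanged, so the suggestion should be safe for NCCL easyconfigs that never carried the `-x` part.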

Contributor

Yeah, that's better.
I just dropped the -x from the version when creating the NCCL ECs, as I didn't see a reason to keep it.

And it's not really a pre-release identifier, it's just what the OS-agnostic version uses instead of "-ga".

Member

I did take a look at the releases last night at https://github.com/NVIDIA/nccl/releases and there is a 2.5.6-2 in there, so I don't think the revisions can be completely ignored.

Contributor

FWIW: I checked the TF sources to see what they do with this. They search the provided NCCL installation paths for an nccl.h and compare the given version against the major.minor.patch from the header by matching it at the start, i.e. if "major.minor.patch".startswith(TF_NCCL_VERSION) --> OK
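
The check described above can be sketched roughly as follows. This is an illustration only, not TensorFlow's actual code: the function names `header_version` and `version_matches` are made up, and the header text is the `nccl.h` excerpt from the PR description.

```python
import re

# nccl.h excerpt as quoted in the PR description.
NCCL_HEADER = '''\
#define NCCL_MAJOR 2
#define NCCL_MINOR 4
#define NCCL_PATCH 6
#define NCCL_SUFFIX ""
#define NCCL_VERSION_CODE 2406
'''


def header_version(header_text):
    """Build 'major.minor.patch' from the NCCL_* defines (hypothetical helper)."""
    parts = []
    for name in ('NCCL_MAJOR', 'NCCL_MINOR', 'NCCL_PATCH'):
        match = re.search(r'^#define\s+%s\s+(\d+)' % name, header_text, re.M)
        if match is None:
            raise ValueError("%s not found in header" % name)
        parts.append(match.group(1))
    return '.'.join(parts)


def version_matches(header_text, requested):
    """Prefix match of the requested version against the header version."""
    return header_version(header_text).startswith(requested)


for requested in ('2.4', '2.4.6', '2.4.6-1'):
    print(requested, version_matches(NCCL_HEADER, requested))
```

Under this prefix match both `2.4` and `2.4.6` are accepted, while `2.4.6-1` is rejected because the header carries no revision suffix, which is consistent with the failure reported in the PR description.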

Contributor Author

So what would you suggest we do?

Member

Sorry, I committed directly to the branch; I meant to make a suggestion.

Contributor

Technically, omitting the patch version works, but I'd stay with the more explicit approach and include it. Hence the current state with the above suggestion is fine.

@boegel boegel added the bug fix label Jul 3, 2020
@boegel boegel added this to the release after 4.2.2 (4.2.3?) milestone Jul 3, 2020
@ocaisa
Member

ocaisa commented Jul 9, 2020

@akesandgren Can you review/merge this since I've now made a commit on this PR, thanks.

Contributor

@akesandgren akesandgren left a comment

LGTM

@akesandgren
Contributor

Going in, thanks @surak!

@akesandgren akesandgren merged commit d935243 into easybuilders:develop Jul 13, 2020
@migueldiascosta migueldiascosta changed the title Tensorflow.py fails with bigger NCCL version names ignore the PKG_REVISION identifier if NCCL version if it exists, in Tensorflow easyblock Sep 11, 2020
@migueldiascosta migueldiascosta changed the title ignore the PKG_REVISION identifier if NCCL version if it exists, in Tensorflow easyblock make TensorFlow easyblock ignore the PKG_REVISION identifier if NCCL version if it exists Sep 11, 2020