@Flamefire Flamefire commented Jun 10, 2025

(created using eb --new-pr)

A few variables were renamed or removed. The most important ones are the LOCAL_* variables used to find the installed CUDA.

I also made the warning about unknown variables "passed" to configure more visible, as it was easy to miss.
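To illustrate the kind of check behind that warning, here is a minimal sketch, not the easyblock's actual code. It assumes a variable counts as "unknown" when its name never appears in the text of the configure script. The function name `find_unknown_vars`, the sample script text, and the variable names are all hypothetical placeholders.

```python
import re


def find_unknown_vars(env_vars, configure_text):
    """Return the names in env_vars that never appear as whole words in configure_text."""
    unknown = []
    for name in sorted(env_vars):
        # \b keeps e.g. CUDA_PATH from matching inside LOCAL_CUDA_PATH
        if not re.search(r'\b%s\b' % re.escape(name), configure_text):
            unknown.append(name)
    return unknown


# Hypothetical configure script contents and settings, for illustration only
configure_text = (
    "cuda_path = environ.get('LOCAL_CUDA_PATH')\n"
    "use_cuda = environ.get('TF_NEED_CUDA')\n"
)
env_vars = {
    'TF_NEED_CUDA': '1',
    'LOCAL_CUDA_PATH': '/opt/cuda',
    'OLD_CUDA_DIR': '/opt/cuda',  # stands in for a variable that was renamed or removed
}

unknown = find_unknown_vars(env_vars, configure_text)
for name in unknown:
    print("WARNING: variable %s is not used by configure" % name)
```

A whole-word match is a deliberately crude heuristic: it can miss variables that the script builds dynamically, so it is only suited to raising a warning, not to failing the build.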

Requires rebuild of CUDA with

Test report: easybuilders/easybuild-easyconfigs#22921 (comment)

@Flamefire Flamefire marked this pull request as draft June 11, 2025 05:55
Flamefire commented Jun 17, 2025

As for building TensorFlow 2.18+ with our CUDA: they no longer officially support that and strongly suggest using the "hermetic" one, i.e. letting Bazel download it during the build.

They argue that the build already "downloads half the internet", so one more download doesn't hurt, and they also use checksums for verification.

Would that be acceptable for us or shall we still pursue using our CUDA? See easybuilders/easybuild-easyconfigs#22921 (comment)

Edit:

Solution implemented in #3791: symlink the CUPTI files in the CUDA module so they will be found.
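The symlinking idea can be sketched roughly as follows. This is not the code from #3791; it assumes the usual CUDA toolkit layout in which CUPTI lives under `extras/CUPTI/include` and `extras/CUPTI/lib64`, and the helper name `symlink_cupti` is hypothetical.

```python
import os


def symlink_cupti(cuda_root):
    """Symlink files from extras/CUPTI/{include,lib64} into cuda_root/{include,lib64}.

    Existing files are never overwritten. Returns the list of created links.
    """
    created = []
    for subdir in ('include', 'lib64'):
        src_dir = os.path.join(cuda_root, 'extras', 'CUPTI', subdir)
        dst_dir = os.path.join(cuda_root, subdir)
        if not os.path.isdir(src_dir):
            continue
        os.makedirs(dst_dir, exist_ok=True)
        for name in sorted(os.listdir(src_dir)):
            dst = os.path.join(dst_dir, name)
            if os.path.lexists(dst):
                continue  # keep whatever is already installed there
            src = os.path.join(src_dir, name)
            # Relative links keep working if the installation directory is moved
            os.symlink(os.path.relpath(src, dst_dir), dst)
            created.append(dst)
    return created
```

Calling `symlink_cupti('/path/to/cuda')` on an installation prefix would make headers such as `cupti.h` visible in the top-level `include` directory, which is where a build that only receives a single CUDA path would look for them.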

lexming commented Jul 28, 2025

Test report by @lexming

Overview of tested easyconfigs (in order)

Build succeeded for 0 out of 1 (1 easyconfigs in total)
node251.hydra.os - Linux Rocky Linux 9.5 (Blue Onyx), x86_64, Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 1 x NVIDIA Tesla P100-PCIE-16GB, 570.158.01, Python 3.9.21
See https://gist.github.com/lexming/418cf5d65687c372f22903112ca8fd9b for a full test report.

@Flamefire

@lexming Can you search the log for the error? With easybuilders/easybuild-framework#4942 the test report would likely contain it, so maybe we can get that in soon.

lexming commented Jul 28, 2025

Test report by @lexming

Overview of tested easyconfigs (in order)

Build succeeded for 0 out of 1 (1 easyconfigs in total)
node252.hydra.os - Linux Rocky Linux 9.5 (Blue Onyx), x86_64, Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 1 x NVIDIA Tesla P100-PCIE-16GB, 570.158.01, Python 3.9.21
See https://gist.github.com/lexming/e2a524b2012bf1810c869cabab5fc4bd for a full test report.

lexming commented Jul 29, 2025

@Flamefire my tests on 2022a and 2023a failed due to linking issues with the system's OpenSSL:

/apps/brussel/RL8/broadwell/software/binutils/2.38-GCCcore-11.3.0/bin/ld: bazel-out/k8-opt/bin/_solib_local/_U_S_Stensorflow_Scc_Cops_Suser_Uops_Ugen_Ucc___Utensorflow/libtensorflow_framework.so.2: undefined reference to `EVP_DigestSignUpdate'

This error is not caused by this PR though. The problem is that these old toolchains use OpenSSL v1.1, while my system (Rocky 9) has OpenSSL v3 and only some compat libs for OpenSSL v1.1. This means that the OpenSSL headers under /usr/include are all for v3, and they declare EVP_DigestSignUpdate, which was not present in v1.1.

So, rebuilding OpenSSL v1.1 from source and testing again...
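A quick way to spot this kind of header/library mismatch is to read the major version out of `opensslv.h`. The sketch below is only an illustration and not part of EasyBuild; the helper name `openssl_header_major` is hypothetical. It relies on OpenSSL 3.x headers defining `OPENSSL_VERSION_MAJOR` and on 1.x headers encoding the major version in the top hex digit of `OPENSSL_VERSION_NUMBER`.

```python
import re


def openssl_header_major(header_text):
    """Return the OpenSSL major version found in the text of opensslv.h, or None."""
    # OpenSSL 3.x defines the major version directly
    match = re.search(r'^\s*#\s*define\s+OPENSSL_VERSION_MAJOR\s+(\d+)',
                      header_text, re.MULTILINE)
    if match:
        return int(match.group(1))
    # OpenSSL 1.x only has the packed hex number; the major version is the top nibble
    match = re.search(r'^\s*#\s*define\s+OPENSSL_VERSION_NUMBER\s+0x([0-9a-fA-F]+)',
                      header_text, re.MULTILINE)
    if match:
        return int(match.group(1), 16) >> 28
    return None


# Sample header lines in the style of each release series
v3_header = "# define OPENSSL_VERSION_MAJOR  3\n# define OPENSSL_VERSION_MINOR  0\n"
v11_header = "# define OPENSSL_VERSION_NUMBER  0x101010bfL\n"

print(openssl_header_major(v3_header), openssl_header_major(v11_header))
```

Running the function on the contents of the system's `opensslv.h` and comparing the result with the OpenSSL version the toolchain expects would flag the situation described above before the link step fails.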

lexming commented Jul 29, 2025

Test report by @lexming

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.11.0-foss-2022a-CUDA-11.7.0.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
node405.hydra.os - Linux Rocky Linux 9.5 (Blue Onyx), x86_64, AMD EPYC 7282 16-Core Processor, 1 x NVIDIA NVIDIA A100-PCIE-40GB, 570.158.01, Python 3.9.21
See https://gist.github.com/lexming/0fddce97078a3bb10245118cc55d2893 for a full test report.

lexming commented Jul 29, 2025

Test report by @lexming

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.15.1-foss-2023a-CUDA-12.1.1.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
node405.hydra.os - Linux Rocky Linux 9.5 (Blue Onyx), x86_64, AMD EPYC 7282 16-Core Processor, 1 x NVIDIA NVIDIA A100-PCIE-40GB, 570.158.01, Python 3.9.21
See https://gist.github.com/lexming/3e2c31366769a98333314abe782eacad for a full test report.

lexming commented Jul 29, 2025

Test report by @lexming

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.11.0-foss-2022a-CUDA-11.7.0.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
node250.hydra.os - Linux Rocky Linux 9.5 (Blue Onyx), x86_64, Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 1 x NVIDIA Tesla P100-PCIE-16GB, 570.158.01, Python 3.9.21
See https://gist.github.com/lexming/72a82ecfdd972abb681c0d01035cb3e5 for a full test report.

@lexming lexming left a comment

LGTM

lexming commented Jul 29, 2025

Merging, thanks @Flamefire !

@lexming lexming merged commit ff6c83f into easybuilders:develop Jul 29, 2025
17 checks passed
@Flamefire Flamefire deleted the 20250610174336_new_pr_tensorflow branch July 30, 2025 07:11
@boegel boegel changed the title from "Update tensorflow easyblock for CUDA support in TensorFlow 2.18+" to "update TensorFlow easyblock for CUDA support in TensorFlow 2.18+" Sep 25, 2025