73 changes: 73 additions & 0 deletions easybuild/easyconfigs/p/PyTorch/PyTorch-1.8.0-foss-2020b.eb
@@ -0,0 +1,73 @@
name = 'PyTorch'
version = '1.8.0'

homepage = 'https://pytorch.org/'
description = """Tensors and Dynamic neural networks in Python with strong GPU acceleration.
PyTorch is a deep learning framework that puts Python first."""

toolchain = {'name': 'foss', 'version': '2020b'}

sources = [{
    'filename': '%(name)s-%(version)s.tar.gz',
    'git_config': {
        'url': 'https://github.com/pytorch',
        'repo_name': 'pytorch',
        'tag': 'v%(version)s',
        'recursive': True,
Member

@branfosj Any concerns here w.r.t. reproducibility? Or are the submodules "locked" to a particular commit anyway?

Member Author

They are all locked to specific commits - see https://github.com/pytorch/pytorch/tree/v1.8.0/third_party and subdirectories. The only issue we'd have is if PyTorch reused the tag - then we'd get a different download (with, potentially, a different set of items in the third_party directory).
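
One way to check the "locked to specific commits" point on a local checkout is to inspect the output of `git submodule status --recursive`, where a leading space means the checked-out commit matches the one recorded in the superproject. The following is only a minimal sketch of parsing that output; the helper names `parse_submodule_status` and `unpinned` are hypothetical, and the parser assumes submodule paths contain no spaces.

```python
import re

# Each line of `git submodule status` looks like:
#   "<flag><40-hex sha> <path> (<describe>)"
# where <flag> is ' ' (in sync), '-' (not initialized), '+' (checked-out
# commit differs from the recorded one) or 'U' (merge conflicts).
# Assumption: paths contain no whitespace.
STATUS_LINE = re.compile(r"^([ +\-U])([0-9a-f]{40}) (\S+)(?: \((.*)\))?$")


def parse_submodule_status(output):
    """Return a list of (flag, sha, path) tuples parsed from the status output."""
    entries = []
    for line in output.splitlines():
        match = STATUS_LINE.match(line)
        if match:
            flag, sha, path, _describe = match.groups()
            entries.append((flag, sha, path))
    return entries


def unpinned(entries):
    """Paths whose checkout does not match the commit recorded in the superproject."""
    return [path for flag, _sha, path in entries if flag != " "]
```

In a real checkout the input string could come from running `git submodule status --recursive` via `subprocess.run(..., capture_output=True, text=True)`; an empty result from `unpinned` would then indicate that every submodule sits at its recorded commit.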

Contributor

Hm, one downside I see is that --fetch likely doesn't work, i.e. a full offline install fails. Or does EB handle that?
Also, there are no checksums...
BTW: There is a script in framework that creates the sources list from a valid git checkout (git submodule update must have been run first).

Member Author

--fetch works (so long as you do not hit easybuilders/easybuild-framework#3619).

    },
}]
patches = [
    'PyTorch-1.6.0_fix-test-dataloader-fixed-affinity.patch',
    'PyTorch-1.7.0_avoid-nan-in-test-torch.patch',
    'PyTorch-1.7.0_increase-distributed-test-timeout.patch',
    'PyTorch-1.7.0_disable-dev-shm-test.patch',
]
checksums = [
    None,  # can't add proper SHA256 checksum, because source tarball is created locally after recursive 'git clone'
    # PyTorch-1.6.0_fix-test-dataloader-fixed-affinity.patch
    'a4208a46cd2098744daaba96cebb96cd91166f8fc616924315e05974bad80c67',
    'b899aa94d9e60f11ee75a706563312ccefa9cf432756c470caa8e623991c8f18',  # PyTorch-1.7.0_avoid-nan-in-test-torch.patch
    # PyTorch-1.7.0_increase-distributed-test-timeout.patch
    '95abb468a35451fbd0f864ca843f6ad15ff8bfb909c3fd580f65859b26c9691c',
    '622cb1eaeadc06e13128a862d9946bcc1f1edd3d02b259c56a9aecc4d5406b8a',  # PyTorch-1.7.0_disable-dev-shm-test.patch
]

osdependencies = [OS_PKG_IBVERBS_DEV]

builddependencies = [
    ('CMake', '3.18.4'),
    ('hypothesis', '5.41.5'),
]

dependencies = [
    ('Ninja', '1.10.1'),  # Required for JIT compilation of C++ extensions
    ('Python', '3.8.6'),
    ('protobuf', '3.14.0'),
    ('protobuf-python', '3.14.0'),
    ('pybind11', '2.6.0'),
    ('SciPy-bundle', '2020.11'),
    ('typing-extensions', '3.7.4.3'),
    ('PyYAML', '5.3.1'),
    ('MPFR', '4.1.0'),
    ('GMP', '6.2.0'),
    ('numactl', '2.0.13'),
    ('FFmpeg', '4.3.1'),
    ('Pillow', '8.0.1'),
]

excluded_tests = {
    '': [
        # Tests from this suite often time out. The process group backend is deprecated anyway
        'distributed/rpc/test_process_group_agent',
        # Potentially problematic save/load issue with test_lstm on only some machines. Tell users to verify save&load!
        # https://github.com/pytorch/pytorch/issues/43209
        'test_quantization',
Member

@branfosj Did you check whether we still see failures?

I can test on our Cascade Lake system where I saw issues with this earlier (cf. pytorch/pytorch#43209)

Member Author

I've not yet checked that. I'll run a test on our Cascade Lake where we run that test - though I do not know if we saw the issue you saw or not.

Member Author

I see the same failure with PyTorch 1.7.1 and 1.8.0 on our Cascade Lake.

======================================================================
FAIL: test_lstm (quantization.test_backward_compatibility.TestSerialization)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-cas/software/PyTorch/1.8.0-foss-2020b/lib/python3.8/site-packages/torch/testing/_internal/common_quantized.py", line 151, in test_fn
    qfunction(*args, **kwargs)
  File "/rds/projects/2017/branfosj-rse/ProblemSolving/pyt18/pytorch/test/quantization/test_backward_compatibility.py", line 230, in test_lstm
    self._test_op(mod, input_size=[4, 4, 3], input_quantized=False, generate=False, new_zipfile_serialization=True)
  File "/rds/projects/2017/branfosj-rse/ProblemSolving/pyt18/pytorch/test/quantization/test_backward_compatibility.py", line 76, in _test_op
    self.assertEqual(qmodule(input_tensor), expected, atol=prec)
  File "/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-cas/software/PyTorch/1.8.0-foss-2020b/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1198, in assertEqual
    self.assertEqual(x_, y_, atol=atol, rtol=rtol, msg=msg,
  File "/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-cas/software/PyTorch/1.8.0-foss-2020b/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1165, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 13 element(s) (out of 112) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.9640435565029293 (4.41188467448228e-06 vs. 0.9640479683876038), which occurred at index (3, 0, 6).
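
To put the numbers in that message in perspective, here is a small sketch of the usual closeness rule, `|actual - expected| <= atol + rtol * |expected|`, applied to the worst element reported above. It assumes the first value in the message is the computed output and the second the expected one; the function name `within_tolerance` is made up for illustration and is not PyTorch's own helper.

```python
def within_tolerance(actual, expected, rtol, atol):
    """Closeness rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Values copied from the assertion message above.
actual = 4.41188467448228e-06
expected = 0.9640479683876038
rtol, atol = 1.3e-06, 1e-05

allowed = atol + rtol * abs(expected)  # margin permitted for this element
difference = abs(actual - expected)    # observed absolute difference

print("allowed margin:", allowed)
print("difference:    ", difference)
print("within tolerance:", within_tolerance(actual, expected, rtol, atol))
```

The observed difference is several orders of magnitude larger than the allowed margin, so this is not a borderline rounding discrepancy that a slightly looser tolerance would hide.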

Member Author

This test failure still occurs when I build with MKL.

@boegel @branfosj

Were you using a bare-metal Cascade Lake machine, or were you using a VM on it?
With a Linux VM (with KVM hypervisor), I reproduced this issue on a Cascade Lake machine.
However, if you were using a bare-metal machine, isn't it possible that there are latent issues with Cascade Lake machines that might surface later?

Contributor

I can confirm that there are issues when optimizing for a Cascade Lake machine, e.g. tensorflow/tensorflow#47179
I've seen that with 2019b but not with newer compilers, though it may still be possible there.

Thanks a lot for the info! I had used gcc/g++ 9.3, but that TensorFlow issue you posted also seems quite relevant. I can try testing with a more recent version of gcc, although gcc 9.3 was released in March 2020.

Contributor

FWIW: 2019b uses GCC 8.3.0 and 2020a (IIRC) uses 9.3.0, which solves the TF issue for us. Since that is a whole toolchain generation, updated dependencies might also play a role, so it may not be only the compiler. Still, the compiler is the best bet, as this looks like a misoptimization.

    ]
}

runtest = 'cd test && PYTHONUNBUFFERED=1 %(python)s run_test.py --verbose %(excluded_tests)s'

sanity_check_commands = ["python -c 'import caffe2.python'"]
tests = ['PyTorch-check-cpp-extension.py']

moduleclass = 'devel'
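
The `runtest` command above leaves `%(excluded_tests)s` to be filled in at build time. As a rough sketch of what that substitution could look like: the code below assumes the entries under the `''` key apply to every architecture, that architecture-specific keys are merged in when they match, and that the result is passed to `run_test.py` through its `--exclude` option. The helper `expand_runtest` is hypothetical and is not the easyblock's actual code.

```python
def expand_runtest(template, excluded_tests, arch, python="python"):
    """Fill in %(python)s and %(excluded_tests)s in a runtest template.

    Assumption: the '' key holds tests excluded on all architectures, and any
    other key names a specific CPU architecture.
    """
    selected = []
    for key, tests in excluded_tests.items():
        if key == "" or key == arch:
            selected.extend(tests)
    # Only emit the option when there is something to exclude.
    exclude_opt = "--exclude " + " ".join(selected) if selected else ""
    return template % {"python": python, "excluded_tests": exclude_opt}


template = "cd test && PYTHONUNBUFFERED=1 %(python)s run_test.py --verbose %(excluded_tests)s"
excluded = {
    "": [
        "distributed/rpc/test_process_group_agent",
        "test_quantization",
    ],
}
print(expand_runtest(template, excluded, arch="x86_64"))
```

With the two exclusions from this easyconfig, the printed command ends in `--exclude distributed/rpc/test_process_group_agent test_quantization`.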