
@Flamefire
Contributor

@Flamefire Flamefire commented Jul 30, 2020

@boegel boegel added the update label Aug 17, 2020
@boegel boegel added this to the next release (4.2.3?) milestone Aug 17, 2020
@boegel boegel changed the title from "Add PyTorch-1.6.0-fosscuda-2019b-Python-3.7.4" to "{devel}[fosscuda/2019b] PyTorch v1.6.0 w/ Python 3.7.4" Aug 17, 2020
@boegel
Member

boegel commented Aug 17, 2020

@Flamefire I see one failing test:

======================================================================
ERROR: test_set_affinity_in_worker_init (__main__.TestSetAffinity)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_dataloader.py", line 2004, in test_set_affinity_in_worker_init
    for sample in dataloader:
  File "/tmp/eb-qcz48h81/tmposm2a_uv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/tmp/eb-qcz48h81/tmposm2a_uv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/tmp/eb-qcz48h81/tmposm2a_uv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/tmp/eb-qcz48h81/tmposm2a_uv/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
OSError: Caught OSError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/tmp/eb-qcz48h81/tmposm2a_uv/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 142, in _worker_loop
    init_fn(worker_id)
  File "test_dataloader.py", line 1992, in worker_set_affinity
    os.sched_setaffinity(0, [2])
OSError: [Errno 22] Invalid argument

@Flamefire
Contributor Author

Flamefire commented Aug 17, 2020

No idea. os.sched_setaffinity(0, [2]) looks like a valid call. Or wait: Do you happen to have only 1 CPU (or less than 3) allocated? Nice PyTorch! Really nice!

Guess we should disable this test then?

@boegel
Member

boegel commented Aug 18, 2020

> No idea. os.sched_setaffinity(0, [2]) looks like a valid call. Or wait: Do you happen to have only 1 CPU (or less than 3) allocated? Nice PyTorch! Really nice!
>
> Guess we should disable this test then?

I was testing in a Slurm job, so definitely with restricted access to cores, but I certainly had more than 1 core...
Or do you mean sockets?

@Flamefire
Contributor Author

According to https://docs.python.org/3/library/os.html#os.sched_setaffinity it is CPUs. So no idea... Maybe because you don't have access to CPU 2?
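One way to check this hypothesis is to compare the CPU id hard-coded in the test against the set of CPUs the job is actually allowed to use. The following is a minimal sketch, assuming Linux and assuming the restriction is visible through the process affinity mask (which depends on how Slurm is configured). `cpu_is_allowed` and `pick_allowed_cpu` are hypothetical helper names, not part of PyTorch or its test suite.

```python
import os


def cpu_is_allowed(cpu, allowed=None):
    """Return True if `cpu` is in the set of CPUs this process may run on."""
    if allowed is None:
        # Linux-only call: CPU ids the calling process (pid 0 = self) is restricted to.
        allowed = os.sched_getaffinity(0)
    return cpu in allowed


def pick_allowed_cpu(preferred, allowed=None):
    """Return `preferred` if it is usable, otherwise the lowest allowed CPU id."""
    if allowed is None:
        allowed = os.sched_getaffinity(0)
    return preferred if preferred in allowed else min(allowed)


if __name__ == "__main__" and hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)
    print("allowed CPUs:", sorted(allowed))
    print("CPU 2 allowed:", cpu_is_allowed(2, allowed))
    print("would pin to:", pick_allowed_cpu(2, allowed))
```

If CPU 2 is missing from the printed set when run inside the failing job, that would be consistent with the `Errno 22` from `os.sched_setaffinity(0, [2])`.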

@boegel
Member

boegel commented Aug 18, 2020

Tried again in a Slurm job with 8 cores (out of 36 available, on a 2-socket system with 18-core processors); this time test_set_affinity_in_worker_init passed. But another test failed:

======================================================================
FAIL: test_lstm (quantization.test_backward_compatibility.TestSerialization)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-1hbt9lal/tmpiuhh9sss/lib/python3.7/site-packages/torch/testing/_internal/common_quantized.py", line 124, in test_fn
    qfunction(*args, **kwargs)
  File "/tmp/vsc40023/easybuild_build/PyTorch/1.6.0/fosscuda-2019b-Python-3.7.4/pytorch-1.6.0/test/quantization/test_backward_compatibility.py", line 132, in test_lstm
    self._test_op(mod, input_size=[4, 4, 3], input_quantized=False, generate=False, new_zipfile_serialization=True)
  File "/tmp/vsc40023/easybuild_build/PyTorch/1.6.0/fosscuda-2019b-Python-3.7.4/pytorch-1.6.0/test/quantization/test_backward_compatibility.py", line 68, in _test_op
    self.assertEqual(qmodule(input_tensor), expected, atol=prec)
  File "/tmp/eb-1hbt9lal/tmpiuhh9sss/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1111, in assertEqual
    exact_dtype=exact_dtype, exact_device=exact_device)
  File "/tmp/eb-1hbt9lal/tmpiuhh9sss/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1085, in assertEqual
    self.assertTrue(result, msg=msg)
AssertionError: False is not true : Tensors failed to compare as equal! With rtol=1.3e-06 and atol=1e-05, found 13 element(s) (out of 112) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.9640435565029293 (4.41188467448228e-06 vs. 0.9640479683876038), which occurred at index (3, 0, 6).

----------------------------------------------------------------------
Ran 276 tests in 415.229s

FAILED (failures=1, skipped=7)
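To put the numbers in the failure message into perspective, here is a small sketch of the usual allclose-style criterion, `|actual - expected| <= atol + rtol * |expected|`. That `assertEqual` applies exactly this formula, and which of the two reported values is the "expected" one, are assumptions; `within_tolerance` is a hypothetical helper written for illustration.

```python
def within_tolerance(actual, expected, rtol, atol):
    """allclose-style check: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Values copied from the failure message above.
actual, expected = 4.41188467448228e-06, 0.9640479683876038
rtol, atol = 1.3e-06, 1e-05

diff = abs(actual - expected)
allowed = atol + rtol * abs(expected)
print(f"difference = {diff:.6g}, allowed = {allowed:.6g}")
print("within tolerance:", within_tolerance(actual, expected, rtol, atol))
```

The difference is several orders of magnitude larger than the allowed margin, so this does not look like ordinary floating-point noise.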

@Flamefire
Contributor Author

-.- No idea. Another test to exclude I think. Reported it to pytorch: pytorch/pytorch#43209

@terjekv
Collaborator

terjekv commented Aug 21, 2020

Test report by @terjekv
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in this PR)
ninhursaga.uio.no - Linux RHEL 8.2, x86_64, Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz, Python 3.6.8
See https://gist.github.com/34bfbc21291ef5224ac11406afc1ad84 for a full test report.

@Flamefire
Contributor Author

@terjekv Same issue as #11041 (comment)

Not sure what went wrong there, and it's hard to investigate as I'm not seeing it on any of our systems... It looks bad to me, but I don't know what to do about it except disable the whole test suite this test is in.

@terjekv
Collaborator

terjekv commented Aug 21, 2020

Yeah, no idea either. I have a user request for 1.6.0 as 1.4.0 has some bug that keeps killing his code. I'll strip the test in question and see what comes out of that, and point out to the user that save/load should be verified before using it.

@terjekv
Collaborator

terjekv commented Aug 22, 2020

As expected, skipping the quantization tests makes the EC pass. I added the following to excluded_tests:

        # Save/load issue with test_lstm: https://github.com/pytorch/pytorch/issues/43209
        'test_quantization',
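For context, a sketch of how that entry might sit in the easyconfig (easyconfigs use Python syntax). The surrounding structure, a dict keyed by architecture with `''` meaning "all architectures", is an assumption about the PyTorch easyblock and is not shown in this thread; only the comment and the `'test_quantization'` entry come from the snippet above.

```python
# Fragment of a PyTorch easyconfig (illustrative only).
# Assumption: excluded_tests is a dict keyed by architecture, '' applying to all.
excluded_tests = {
    '': [
        # Save/load issue with test_lstm: https://github.com/pytorch/pytorch/issues/43209
        'test_quantization',
    ],
}
```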

@Flamefire
Contributor Author

@terjekv Added that, although I don't feel comfortable doing so. So I at least added a dangerous-sounding comment...

@Flamefire Flamefire force-pushed the PyTorch16 branch 2 times, most recently from 504e953 to c99a6d7 Compare September 7, 2020 15:13
@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in this PR)
taurusml8 - Linux RHEL 7.6, POWER, 8335-GTX, Python 2.7.5
See https://gist.github.com/7f87f8cec657943cf2fb8aab7a840a93 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in this PR)
taurusml8 - Linux RHEL 7.6, POWER, 8335-GTX, Python 2.7.5
See https://gist.github.com/cab0461dfaee9afd4bcff3d23a23192a for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in this PR)
taurusml24 - Linux RHEL 7.6, POWER, 8335-GTX, Python 2.7.5
See https://gist.github.com/b2cefcd7058e8bc13b389e3931484abf for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in this PR)
taurusa7 - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz, Python 2.7.5
See https://gist.github.com/f875edcd2065d4dd0e8dde317d52b378 for a full test report.

@easybuilders easybuilders deleted a comment from boegelbot Sep 8, 2020
@easybuilders easybuilders deleted a comment from boegelbot Sep 8, 2020
@easybuilders easybuilders deleted a comment from boegelbot Sep 8, 2020
@boegel
Member

boegel commented Sep 8, 2020

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=11041 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_11041 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 5275

Test results coming soon (I hope)...

Details

- notification for comment with ID 689119645 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in this PR)
generoso-x-1 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/590acb67c77383e6839e3c896a7a5c1b for a full test report.

@boegel
Member

boegel commented Sep 9, 2020

Test report by @boegel
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in this PR)
node3415.kirlia.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (cascadelake), Python 2.7.5
See https://gist.github.com/a27f46dda95db5c8a6ebef4b5c146093 for a full test report.

@boegel
Member

boegel commented Sep 9, 2020

@boegelbot please test @ generoso

@boegel
Member

boegel commented Sep 9, 2020

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in this PR)
node3404.kirlia.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (cascadelake), Python 2.7.5
See https://gist.github.com/0cd5a2e02a4990aaf96d6e9c4227520c for a full test report.

@easybuilders easybuilders deleted a comment from boegelbot Sep 9, 2020
@easybuilders easybuilders deleted a comment from boegelbot Sep 9, 2020
@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=11041 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_11041 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 5535

Test results coming soon (I hope)...

Details

- notification for comment with ID 689351336 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel
Member

boegel commented Sep 9, 2020

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in this PR)
node3300.joltik.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), Python 3.6.8
See https://gist.github.com/82356d0a556c41096b59816707abfa2a for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in this PR)
generoso-x-1 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/49ccdfee7e6b8973866e84d6e431f455 for a full test report.

@boegel
Member

boegel commented Sep 9, 2020

Going in, thanks @Flamefire!

@boegel boegel merged commit 372e0f3 into easybuilders:develop Sep 9, 2020
@Flamefire Flamefire deleted the PyTorch16 branch September 9, 2020 13:08
@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in this PR)
taurusml23 - Linux RHEL 7.6, POWER, 8335-GTX, Python 2.7.5
See https://gist.github.com/f937a2145648cbf296d6e9965b5c628a for a full test report.
