
Conversation

@Flamefire
Contributor

@Flamefire Flamefire commented Feb 21, 2025

(created using eb --new-pr)

Requires

Some explanations:

  • In a discussion in a PyTorch issue it turned out that the only machine-readable output is the test XML files, which are only generated on (their) CI
  • The easyblock applies patches to allow enabling test reports by setting an EasyBuild-specific environment variable.
    • There is an option to pass to run_test.py that is supposed to enable that, but it isn't passed on to the subprocesses and hence is not reliable
    • Another bug results in no test reports being generated outside CI, even with the option passed
    • As the patched file gets installed we shouldn't change its (default) behavior in case users use it, hence the env variable (a rough sketch of this pattern follows this list)
  • The PyTorch test suite uses Python unittest, pytest and, since 2.3, custom logic to rerun failed tests. That generates XML result files in different formats and potentially with duplicates
    • The successful reruns might be reported alongside their previous failures, but in different files, so "merging" is required to keep only the successful ones
    • The implemented parser collects all results and attributes them to their "test suite" (usually the Python file executed, which might include or run other files)
    • Some tests are run multiple times in different configurations, i.e. the same test file is executed multiple times with an environment variable set to choose e.g. the distribution backend. Those need to be treated as separate tests
    • Afterwards all results are combined/merged. Each test from the same test suite that is found multiple times is considered successful if at least one of the duplicates was successful (a sketch of this merge step also follows this list)
  • I used Python type hints to make it a bit easier to follow
  • In many places assumptions are verified by raising a descriptive error. This should make it possible to detect changes in PyTorch that affect the logic
  • The "old" (current) parsing of the stdout is still used
    • The new logic is only enabled when the PyTorch easyconfig has the required xmlrunner Python package directly or transitively. We have unittest-xml-reporting ECs for that
    • For PyTorch < 2.3 (since 2.3 that parser isn't really useful anymore) the found results are compared and differences are shown in the logfile. They should match of course, but in the end the result from the XML files is used
    • The final output of PyTorch's run_test.py contains a list of failed test suites. We match against that as before to detect when we missed something.
    • That also detects test suites that failed to start, e.g. due to syntax errors introduced by our patches. In that case no XML file is generated and we'd miss it, but we should handle all those cases by fixing the issue or skipping the test
  • I considered verifying the found suites against the list of suites to run as printed by run_test.py, but some of the test files are missing the code required to start the test and hence show up in that list but produce no output at all
  • The easyblock file can be run directly and accepts:
    • An EasyBuild log file: parses the stdout of run_test as found in the log to test the old parser. This exists already
    • A directory: runs the new (XML) parser on a test-results folder containing the XML reports
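
As a rough illustration of the env-variable opt-in mentioned above (the variable name is made up here and this is not the actual patch, just the idea that the installed file only changes behavior when EasyBuild explicitly asks for it):

import os

# Hypothetical opt-in via an EasyBuild-specific variable (name invented for this sketch):
# XML test reports are only enabled when EasyBuild explicitly sets it, so the
# installed script keeps its upstream default behavior for everyone else.
SAVE_XML_REPORTS = os.environ.get('EASYBUILD_PYTORCH_SAVE_XML', '') == '1'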

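A minimal sketch of the collect-and-merge step from the list above, with hypothetical names (not the actual easyblock code), assuming every XML <testcase> has already been reduced to a suite name, a test id and a pass/fail flag:

from collections import defaultdict
from typing import Dict, List, Tuple

# (suite, test id) -> all recorded outcomes, including reruns and duplicates
# coming from different XML files and configurations
outcomes: Dict[Tuple[str, str], List[bool]] = defaultdict(list)

def record(suite: str, test_id: str, passed: bool) -> None:
    """Collect one result; the same test may legitimately appear several times."""
    outcomes[(suite, test_id)].append(passed)

def merged_failures() -> List[Tuple[str, str]]:
    """After merging, a test counts as failed only if none of its duplicates passed."""
    return sorted(key for key, results in outcomes.items() if not any(results))
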
I prepared PRs for new and old PyTorch ECs to include the dependency required for the XML reporting. Those can be used to test this PR:

All look fine. A couple show a warning (as designed):

WARNING Found differences when parsing stdout and XML files:
Different number of tests in XML files: 211197 != 211209

I traced that down to tests being rerun in two dimensions: with different backends (e.g. NCCL and Gloo), which I accounted for by taking the dist-nccl/dist-gloo prefix into account so those count as different tests. For the second dimension (init method: file, env or none, i.e. serial) there is no such "tag", so my code merges them and they count as a single test. I created a PR which is merged for the next PyTorch version; adding the patch to existing ECs is IMO not worth it. This is only an issue if a test fails with one init method but succeeds with another, because that looks like a successful rerun and we'd miss the failure. I didn't observe that though, and I'd assume a test fails with all of them or none.
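
To make the first dimension concrete, a small sketch (hypothetical helper, not the actual easyblock code) of deriving a per-backend key from the path of the XML report; results under test-reports/dist-nccl/... and test-reports/dist-gloo/... then count as different tests, while the init-method duplicates within one backend still collapse because neither the path nor the XML distinguishes them:

import os

def suite_key(xml_path: str, suite_name: str) -> str:
    """Prefix the suite with the dist-<backend> directory, if present,
    so the same test run with NCCL and Gloo is counted separately."""
    backend = next((part for part in xml_path.split(os.sep) if part.startswith('dist-')), None)
    return f'{backend}/{suite_name}' if backend else suite_name

# suite_key('test-reports/dist-gloo/.../report.xml', 'distributed.algorithms.quantization.test_quantization')
# -> 'dist-gloo/distributed.algorithms.quantization.test_quantization'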

FTR in 2.1.2 I collected:

Running distributed tests for the test backend with env init_method
  Skipped
Running distributed tests for the test backend with file init_method
  Skipped
Running distributed tests for the nccl backend with env init_method
  6 tests:
    test/distributed/algorithms/quantization/test_quantization.py::DistQuantizationTests::test_all_gather_bfp16
    test/distributed/algorithms/quantization/test_quantization.py::DistQuantizationTests::test_all_gather_fp16
    test/distributed/algorithms/quantization/test_quantization.py::DistQuantizationTests::test_all_to_all_bfp16
    test/distributed/algorithms/quantization/test_quantization.py::DistQuantizationTests::test_all_to_all_fp16
    test/distributed/algorithms/quantization/test_quantization.py::DistQuantizationTests::test_all_to_all_single_bfp16
    test/distributed/algorithms/quantization/test_quantization.py::DistQuantizationTests::test_all_to_all_single_fp16
Running distributed tests for the nccl backend with file init_method
  6 tests, same as above
Running distributed tests for the gloo backend with env init_method
  same
Running distributed tests for the gloo backend with file init_method
  same

So for each of the 2 backends the 2 duplicate init methods get merged, which gives 2*6 = 12 "missed" tests, matching the difference of 12 in the warning above. Those appear as "=== 1 passed in 4.45s ====" in the log, so the stdout-parsing logic collects them. --> All fine

For one such test the XML files look like this, which shows that there is no way to reasonably deduplicate them:

$ xmllint --format test-reports/dist-gloo/distributed.algorithms.quantization.test_quantization/distributed.algorithms.quantization.test_quantization-c089f7b2dd0b9af4.xml
<?xml version="1.0"?>
<testsuites>
  <testsuite name="pytest" errors="0" failures="0" skipped="0" tests="1" time="4.467" timestamp="2025-03-06T11:59:56.386921" hostname="i8021">
    <testcase classname="DistQuantizationTests" name="test_all_gather_bfp16" time="4.412" file="distributed/algorithms/quantization/test_quantization.py"/>
  </testsuite>
</testsuites>
$ xmllint --format test-reports/dist-gloo/distributed.algorithms.quantization.test_quantization/distributed.algorithms.quantization.test_quantization-ef746dd9698fd528.xml
<?xml version="1.0"?>
<testsuites>
  <testsuite name="pytest" errors="0" failures="0" skipped="0" tests="1" time="4.649" timestamp="2025-03-06T12:00:52.179397" hostname="i8021">
    <testcase classname="DistQuantizationTests" name="test_all_gather_bfp16" time="4.612" file="distributed/algorithms/quantization/test_quantization.py"/>
  </testsuite>
</testsuites>

They could be added to the "rerun" statistic though.
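
If that were followed up on, something along these lines (purely illustrative, reusing the outcomes mapping from the sketch further up) could fold such duplicates into the rerun statistic instead of counting them as extra tests:

def count_duplicate_runs(outcomes: Dict[Tuple[str, str], List[bool]]) -> int:
    """Every occurrence beyond the first for the same (suite, test id) key
    is counted as a rerun rather than as a separate test."""
    return sum(len(results) - 1 for results in outcomes.values())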

@akesandgren
Contributor

@Flamefire Do you expect to have more to add to this one?

@Flamefire
Contributor Author

Likely not. I'm still checking a couple GPU builds but the CPU builds have been verified so I don't expect anything new

@akesandgren
Contributor

Ok, I'll review and merge later today then.


self.has_xml_test_reports = False
if pytorch_version >= '1.10.0':
    out, ec = run_cmd(self.python_cmd + " -c 'import xmlrunner'", log_ok=False)
Contributor

Hmm, why not just check that unittest-xml-reporting is a builddep?

Contributor Author

Because I wanted to anticipate that we might have it in Python or Python-bundle-PyPI

Contributor

Ok, makes sense, but that's likely only ever going to be a builddep and thus doesn't belong in either python or python-bundle-* in my opinion.

Anyway if needed it can be changed later :-)

Contributor Author

I can imagine others need that too, even users, and we provide e.g. pytest too. And the one additional call doesn't hurt in the testing of PyTorch ;-)
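
For reference, a minimal sketch of the alternative discussed here, checking the dependency names instead of importing xmlrunner; it assumes the EasyBuild EasyConfig API where self.cfg.dependencies() returns dicts with a 'name' key, and it would not catch the case where the package is provided transitively via e.g. Python-bundle-PyPI, which is why the import check was kept:

# Hypothetical alternative: look for unittest-xml-reporting among the declared
# (build) dependencies instead of trying to import xmlrunner.
dep_names = {dep['name'] for dep in self.cfg.dependencies()}
self.has_xml_test_reports = 'unittest-xml-reporting' in dep_names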

if new_suites:
    diffs.append(f'Found {len(new_suites)} new suites in XML files: {", ".join(sorted(new_suites))}')
if missing_suites:
    diffs.append(f'Did not found {len(missing_suites)} suites in XML files: ' +
Contributor

Do you mean "Did not find" or "Did find"

Contributor Author

Yes, just a typo. "not find". Fixed

if new_tests:
    diffs.append(f'Found {len(new_tests)} new tests with errors in XML files: {", ".join(sorted(new_tests))}')
if missing_tests:
    diffs.append(f'Did not found {len(missing_tests)} tests with errors in XML files: ' +
Contributor

Same here, "Did not find" or something

if new_tests:
    diffs.append(f'Found {len(new_tests)} new failed tests in XML files: {", ".join(sorted(new_tests))}')
if missing_tests:
    diffs.append(f'Did not found {len(missing_tests)} failed tests in XML files: ' +
Contributor

And here

Contributor

@akesandgren akesandgren left a comment

LGTM

@akesandgren akesandgren added this to the release after 4.9.4 milestone Mar 10, 2025
@akesandgren
Contributor

Going in, thanks @Flamefire!

@akesandgren akesandgren merged commit 0d1bf10 into easybuilders:develop Mar 10, 2025
41 checks passed
@Flamefire Flamefire deleted the 20250221130009_new_pr_pytorch branch March 10, 2025 15:40
@boegel boegel modified the milestones: release after 4.9.4, 5.0.0 Mar 18, 2025