[lit] Echo full RUN lines in case of external shells #65267

jdenny-ornl · 2023-09-04T15:06:03Z

Before https://reviews.llvm.org/D154984 and https://reviews.llvm.org/D156954, lit reported full RUN lines in a Script: section. Now, in the case of lit's internal shell, it's the execution trace that includes them. However, if lit is configured to use an external shell (e.g., bash, windows cmd), they aren't reported at all.

A fix was requested at the following:

This patch does not correctly address the case when the external shell is windows cmd. As discussed at #65242, it's not clear whether that's a use case that people still care about, and it seems to be generally broken anyway.

RoboTux · 2023-09-04T15:57:37Z

llvm/utils/lit/lit/TestRunner.py

+                commands[j] += f": {shlex.quote(command.lstrip())} >&2 " \
+                               f"&& {command}"
+            else:
+                commands[j] += " has no command after substitutions"


Could you add a testcase for this codepath? Thanks!

Good idea, especially given that I forgot the stderr redirection in that case. I pushed a commit.

banach-space

LGTM, thank you!

mikaelholmen · 2023-09-05T06:03:15Z

I like this. Thank you!

RoboTux

LGTM too.

Before <https://reviews.llvm.org/D154984> and <https://reviews.llvm.org/D156954>, lit reported full RUN lines in a `Script:` section. Now, in the case of lit's internal shell, it's the execution trace that includes them. However, if lit is configured to use an external shell (e.g., bash, windows `cmd`), they aren't reported at all. A fix was requested at the following: * <https://reviews.llvm.org/D154984#4627605> * <https://discourse.llvm.org/t/rfc-improving-lits-debug-output/72839/35?u=jdenny-ornl> This patch does not correctly address the case when the external shell is windows `cmd`. As discussed at <llvm#65242>, it's not clear whether that's a use case that people still care about, and it seems to be generally broken anyway.

jdenny-ornl · 2023-09-05T14:48:34Z

Thanks for the reviews.

jsji · 2023-09-05T21:14:00Z

Looks like this PR might cause unexpected failure in testing, and kill buildbot worker (also python code).
Examples can be found:
https://lab.llvm.org/buildbot/#/builders/184/builds/6236
https://lab.llvm.org/buildbot/#/builders/197/builds/9358
https://lab.llvm.org/buildbot/#/builders/75/builds/35421
https://lab.llvm.org/buildbot/#/builders/240/builds/14507

Buildbots failed after this landed, as reported at: <#65267 (comment)> This reverts commit 9191ba7.

jdenny-ornl · 2023-09-05T21:26:25Z

I pushed a revert (efec733). I cannot tell from the logs what happened.

jdenny-ornl · 2023-09-05T21:56:27Z

@jsji Thanks for alerting me to the bot fails. The revert seems to have fixed at list one bot. Did you see any additional clue in the logs about what happened? All I see is:

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]

I don't know how to debug that.

jsji · 2023-09-05T22:30:21Z

Sorry, no idea about why either.

dyung · 2023-09-06T08:24:16Z

It's probably not much help, but this affected one of my build bots, and when I ran check-all manually on the machine, it died with an error "User defined signal 2":

PASS: libFuzzer-x86_64-default-Linux :: recommended-dictionary.test (72990 of 75935)
PASS: libFuzzer-x86_64-default-Linux :: cleanse.test (72991 of 75935)                                                                                                                                                                             
User defined signal 2

petrhosek · 2023-09-06T16:52:26Z

We saw the same issue on our bots as well, see for example:

https://luci-milo.appspot.com/ui/p/fuchsia/builders/toolchain.ci/clang-linux-x64/b8770760294040862593/overview

We did some investigation but haven't found the culprit either, but we saw trap faults in the kernel log.

jdenny-ornl · 2023-09-06T16:57:51Z

Thanks for the info, and sorry for all the trouble. I am now able to reproduce on a local system: ninja check-fuzzer kills my terminal window [edit: well, the shell running in it] with signal 12. I'll keep poking at it as I have time. If you see any additional clues, please post them here.

uweigand · 2023-09-06T20:17:30Z

I'm running into this issue on the s390x builder as well. It seems the problem happens in the ./compiler-rt/test/fuzzer/fork-sigusr.test case here:

RUN: %run %t/ForkSIGUSR -fork=3 -ignore_crashes=1 2>%t/log & export PID=$!
RUN: sleep 3
RUN: kill -SIGUSR2 $PID
RUN: sleep 6
RUN: cat %t/log | FileCheck %s --dump-input=fail

Normally, the PID variable is set to the pid of the ForkSIGUSR process, which then gets killed with SIGUSR2 and terminates cleanly.

However, with this patch applied, the PID variable get set to the pid of a shell process executing

  echo "..." && ../ForkSIGUSR ...

The shell eats the SIGUSR2 and prints a message, but the ForkSIGUSR process never gets the signal and keeps running indefinitely.

This somehow causes the buildbot connection to terminate - not sure exactly why but possibly some bad reaction to that shell getting the signal?

Unfortunately whenever I re-start the build bot, it thinks it need to re-try the previous build since that had the failed connection - but that previous build still has this patch applied and therefore terminates again :-( I'm trying to work around this now.

jdenny-ornl · 2023-09-06T21:42:15Z

@uweigand Thanks! That was the clue I needed.

I found there are four tests that have a similar pattern and cause trouble on my local system:

fork-sigusr.test: Seemed to hang, but I never saw it kill the shell that was running check-fuzzer. That appears to be a different result than @uweigand saw, so maybe there's something non-deterministic, or maybe our shells are different.
merge-sigusr.test: Killed the shell.
sigint.test: Seemed to hang.
sigusr.test: Killed the shell.

If I disable those tests, check-fuzzer behaves as at the parent commit.

I have a simple lit fix I've pushed to this branch. It also makes check-fuzzer behave as at the parent commit.

I'm hesitant to land the new version as last time was a bit of a disaster. Based on @uweigand's comments, some bots are still recovering. Sorry! Who knows what other trouble might be lurking on the next attempt.

Any advice on how to proceed?

jdenny-ornl · 2023-09-06T21:46:07Z

Here's the fix: jdenny-ornl@61e272e

Unfortunately, I made the mistake of pushing the revert directly to git (old workflow) instead of clicking the revert button here in the PR. I'm not sure how to reopen the PR now. I can start a new PR if that's desirable.

dyung · 2023-09-07T03:34:51Z

Unfortunately whenever I re-start the build bot, it thinks it need to re-try the previous build since that had the failed connection - but that previous build still has this patch applied and therefore terminates again :-( I'm trying to work around this now.

@uweigand if you are still trying to recover your buildbot, this is the process I used which I was able to recover mine. Thanks to the info everyone found here, it seems there are 4 affected tests fork-sigusr.test, merge-sigusr.test, sigint.test and sigusr.test. With the worker offline, I manually edited the files to add a , * to the UNSUPPORTED line to get the run to skip the tests so it wouldn't fail. I then stashed the change and brought the worker back online. When the job rerun had finished the sync and was doing the build, I went to the local git repo and unstashed the test changes and then let the build/test complete. Then before the next run is executed, just revert the local test changes. The tests were not exactly skipped (they ended up being "UNRESOLVED" because of the new syntax required for the UNSUPPORTED lines), but the net result was the run completed and moved on to the next job which then passed. Hope this helps you to recover your bot.

"Completed" job: https://lab.llvm.org/buildbot/#/builders/247/builds/8680

uweigand · 2023-09-07T14:24:05Z

Thanks for the tips. I've managed to get the bot going again by clearing the build queues in the web interface, so it would reset to current mainline. All s390x bots are now green again.

This reverts commit 19b44c2. The reason for the revert is discussed at: https://discourse.llvm.org/t/rfc-improving-lits-debug-output/72839/52

Before <https://reviews.llvm.org/D154984> and <https://reviews.llvm.org/D156954>, lit reported full RUN lines in a `Script:` section. Now, in the case of lit's internal shell, it's the execution trace that includes them. However, if lit is configured to use an external shell (e.g., bash, windows `cmd`), they aren't reported at all. A fix was requested at the following: * <https://reviews.llvm.org/D154984#4627605> * <https://discourse.llvm.org/t/rfc-improving-lits-debug-output/72839/35?u=jdenny-ornl> This patch does not correctly address the case when the external shell is windows `cmd`. As discussed at <llvm#65242>, it's not clear whether that's a use case that people still care about, and it seems to be generally broken anyway.

Buildbots failed after this landed, as reported at: <llvm#65267 (comment)> This reverts commit 9191ba7.

)" This reverts commit 19b44c2. The reason for the revert is discussed at: https://discourse.llvm.org/t/rfc-improving-lits-debug-output/72839/52

jdenny-ornl · 2023-09-14T16:58:15Z

I'm trying to figure out how to resurrect this PR after rebasing onto main and applying fixes. It will need additional review. Is it best to just to create a new PR, or is there some way to reopen this one?

jdenny-ornl · 2023-09-14T17:42:47Z

Based on the information I found, it's not possible to reopen this PR, so I created a new one: #66408

Buildbots failed after this landed, as reported at: <llvm/llvm-project#65267 (comment)> This reverts commit 9191ba7.

Buildbots failed after this landed, as reported at: <llvm/llvm-project#65267 (comment)> This reverts commit 9191ba7144b39f5af699993d66f3587d5da49759.

jdenny-ornl requested review from rnk, mstorsjo, MaskRay, yln, ldionne, banach-space, asavonic, AaronBallman, Endilll, bjope, jh7370 and mikaelholmen September 4, 2023 15:06

jdenny-ornl requested a review from a team as a code owner September 4, 2023 15:06

philnik777 added the tools:llvm-lit label Sep 4, 2023

RoboTux requested changes Sep 4, 2023

View reviewed changes

banach-space approved these changes Sep 4, 2023

View reviewed changes

RoboTux approved these changes Sep 5, 2023

View reviewed changes

jdenny-ornl force-pushed the lit-external-shell-echo-run-lines branch from 0054ad4 to 21444d6 Compare September 5, 2023 14:43

github-actions bot added the llvm-lit label Sep 5, 2023

jdenny-ornl force-pushed the lit-external-shell-echo-run-lines branch from 21444d6 to 9191ba7 Compare September 5, 2023 14:48

jdenny-ornl merged commit 19b44c2 into llvm:main Sep 5, 2023

jdenny-ornl mentioned this pull request Sep 5, 2023

[lit] Fix some issues from --per-test-coverage #65242

Merged

jdenny-ornl added a commit that referenced this pull request Sep 5, 2023

Revert "[lit] Echo full RUN lines in case of external shells"

efec733

Buildbots failed after this landed, as reported at: <#65267 (comment)> This reverts commit 9191ba7.

avillega pushed a commit to avillega/llvm-project that referenced this pull request Sep 11, 2023

Revert "[lit] Echo full RUN lines in case of external shells"

0e8c6cc

Buildbots failed after this landed, as reported at: <llvm#65267 (comment)> This reverts commit 9191ba7.

jdenny-ornl mentioned this pull request Sep 14, 2023

[lit] Echo full RUN lines in case of external shells #66408

Merged

qihangkong pushed a commit to rvgpu/llvm that referenced this pull request Apr 18, 2024

Revert "[lit] Echo full RUN lines in case of external shells"

20484d6

Buildbots failed after this landed, as reported at: <llvm/llvm-project#65267 (comment)> This reverts commit 9191ba7.

[lit] Echo full RUN lines in case of external shells #65267

[lit] Echo full RUN lines in case of external shells #65267

Uh oh!

Conversation

jdenny-ornl commented Sep 4, 2023

Uh oh!

RoboTux Sep 4, 2023

Choose a reason for hiding this comment

Uh oh!

jdenny-ornl Sep 4, 2023

Choose a reason for hiding this comment

Uh oh!

banach-space left a comment

Choose a reason for hiding this comment

Uh oh!

mikaelholmen commented Sep 5, 2023

Uh oh!

RoboTux left a comment

Choose a reason for hiding this comment

Uh oh!

jdenny-ornl commented Sep 5, 2023

Uh oh!

jsji commented Sep 5, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jdenny-ornl commented Sep 5, 2023

Uh oh!

jdenny-ornl commented Sep 5, 2023

Uh oh!

jsji commented Sep 5, 2023

Uh oh!

dyung commented Sep 6, 2023

Uh oh!

petrhosek commented Sep 6, 2023

Uh oh!

jdenny-ornl commented Sep 6, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

uweigand commented Sep 6, 2023

Uh oh!

jdenny-ornl commented Sep 6, 2023

Uh oh!

jdenny-ornl commented Sep 6, 2023

Uh oh!

dyung commented Sep 7, 2023

Uh oh!

uweigand commented Sep 7, 2023

Uh oh!

jdenny-ornl commented Sep 14, 2023

Uh oh!

jdenny-ornl commented Sep 14, 2023

Uh oh!

Uh oh!

jsji commented Sep 5, 2023 •

edited

Loading

jdenny-ornl commented Sep 6, 2023 •

edited

Loading