Debug subprocess management by pytest #3975
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I don't know how to pursue this, but looking at

```python
def test_python_evaluation(testcase):
    # ...
    returncode, out = run(mypy_cmdline)
    if returncode == 0:
        returncode, interp_out = run([interpreter, program])
        out += interp_out
    os.remove(program_path)
    assert_string_arrays_equal(adapt_output(testcase), out,
                               'Invalid output ({}, line {})'.format(
                                   testcase.file, testcase.line))
```

and

```python
def run(cmdline, env, timeout=30) -> Tuple[int, List[str]]:
    """A poor man's subprocess.run() for 3.3 and 3.4 compatibility."""
    process = subprocess.Popen(cmdline, env=env,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE,
                               cwd=test_temp_dir)
    try:
        out, err = process.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        out = err = b''
        process.kill()
    return process.returncode, split_lines(out, err)
```

So empty output could be caused by a simple timeout. Besides, I'm not sure the value 30 is reasonable.
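For illustration, here's a minimal, hypothetical repro of that failure mode (using `sleep` as a stand-in for a slow subprocess and a 1-second timeout instead of 30; assumes a POSIX system): when `communicate()` times out, the output is thrown away, and because the child is killed but never reaped, `returncode` is still `None`.

```python
import subprocess

proc = subprocess.Popen(['sleep', '60'],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
try:
    out, err = proc.communicate(timeout=1)  # shortened from 30s for the demo
except subprocess.TimeoutExpired:
    out = err = b''   # this is what later shows up as "Actual: (empty)"
    proc.kill()       # kill() sends the signal but does not wait for the child

# Prints "None b''" -- no exit status and no output, just like the flaky failures.
print(proc.returncode, out)
```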
I think it would be nice to get more debug information on which subprocess is timing out, and perhaps under what conditions. A good first step would be to not be silent when we catch the `TimeoutExpired` exception.
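A sketch of what "not being silent" could look like (hypothetical helper; the real `run()` also passes `env` and `cwd` and splits the output): report which command timed out, and reap the child so its exit status is meaningful.

```python
import subprocess
import sys

def run_verbose(cmdline, env=None, timeout=30):
    """Like run(), but reports timeouts instead of swallowing them."""
    process = subprocess.Popen(cmdline, env=env,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    try:
        out, err = process.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()
        # A second communicate() reaps the child and collects any partial output.
        out, err = process.communicate()
        print('TIMEOUT after {}s: {}'.format(timeout, ' '.join(cmdline)),
              file=sys.stderr)
    return process.returncode, out, err
```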
It appears (based on the normal subprocess.run) that the timeout parameter is normally just None, i.e. no timeout at all.
But, barring crazy experiments or excessive load, the 30-second timeout should be fine, right?
That depends on the fairness of the scheduler. I would think it should be fine, but it seems like it isn't. So maybe the scheduler is not fair, and it happens that some process is starved for too long.
If we have way more processes than we can actually run in parallel, and if there are multiple other heavy (non-mypy) jobs on the same Travis slave, there could be long stalls, though 30 seconds sounds quite high. I've seen similar (maybe not quite as pathological) behavior in our internal CI system as well. If we increase the timeout, maybe we should increase it even more than that -- say, to 90 seconds? Legitimate hangs are probably pretty rare, and flaky builds are very annoying.

Here's a somewhat crazy idea: could we detect the maximum amount of real parallelism that we can get at the beginning of a build, and adjust runtests.py parallelism based on the measured value? The performance of Travis CI builds seems to vary a lot depending on load, and maybe it would be useful to dynamically adapt to this.
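One hypothetical way to implement that measurement (nothing like this exists in runtests.py today, as far as this thread goes): time a fixed CPU-bound batch at increasing worker counts and stop as soon as adding a worker no longer scales.

```python
import multiprocessing as mp
import time

def _burn(_):
    # A fixed chunk of CPU work; the absolute duration doesn't matter.
    total = 0
    for i in range(2000000):
        total += i * i
    return total

def measured_parallelism(max_workers=8):
    """Return the largest worker count whose tasks still run roughly in parallel."""
    baseline = None
    best = 1
    for workers in range(1, max_workers + 1):
        start = time.monotonic()
        with mp.Pool(workers) as pool:
            pool.map(_burn, range(workers))  # one task per worker
        elapsed = time.monotonic() - start
        if baseline is None:
            baseline = elapsed
        elif elapsed > baseline * 1.5:
            break  # tasks started queueing behind each other: out of real cores
        else:
            best = workers
    return best

if __name__ == '__main__':
    print('suggested -j value:', measured_parallelism())
```

Unlike a static core count, a probe like this would also pick up interference from other jobs running on the same worker.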
I agree with Jukka -- we are putting "excessive load" on Travis and Appveyor. Technically we should only be running two processes. And if a test actually hangs, it will just take 30s longer to tell if we up the timeout.
Technically it is 2 cores. That is what the docs say. So either that isn't accurate and each job is not run in a separate container, or Travis is doing more scheduling, because we do get more speed from running >2 processes.
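A quick diagnostic for what the container actually exposes (a sketch; `os.sched_getaffinity` is Linux-only, which should hold on a Travis Linux worker):

```python
import os

print('os.cpu_count():', os.cpu_count())              # CPUs the kernel reports
print('usable CPUs:', len(os.sched_getaffinity(0)))   # CPUs this process may run on
```

If both numbers come out larger than 2, the advertised "2 cores" is presumably a CPU quota or share rather than a hard affinity mask, which would be consistent with extra processes still helping.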
Lowering to normal priority; ideally this should be figured out, but our current solution is working for now.
I don't think this is relevant anymore. We haven't had issues with this in a long time. |
We keep seeing test flakes in Travis, attributed to #3543. They are kind of hard to recognize, since the failures just have the form "Expected: ; Actual: (empty)".
We suspect that the subprocesses created by the integration tests (the `python*eval` tests) are running out of memory because there are too many of them, and are being killed, but whatever is managing the subprocesses doesn't notice the exit status. IIRC Linux OOM kills appear as killed by signal 9, i.e. SIGKILL, but if you just read the pipe you'll never know.
We need some investigation into this theory and improved status checking, so at least the crashes are easier to diagnose. @ethanhs @elazarg is either of you interested?
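For the "improved status checking" part, a small hypothetical helper like this could be folded into the assert message (on POSIX, `Popen.returncode` is negative when the child died from a signal, so an OOM kill shows up as -9):

```python
import signal

def describe_exit(returncode):
    """Render a Popen returncode as a human-readable status for test output."""
    if returncode is None:
        return 'still running (child was never reaped)'
    if returncode < 0:
        # Negative means "terminated by a signal"; SIGKILL (9) is the usual OOM-kill case.
        return 'killed by {}'.format(signal.Signals(-returncode).name)
    return 'exited with code {}'.format(returncode)
```

Appending `describe_exit(returncode)` to the 'Invalid output' message would make an OOM-killed subprocess distinguishable from a run that genuinely produced no output.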
My theory could be patently false, e.g. if some of the failures seen don't come from the `python*eval` tests. It could also be that somehow `runtests.py -j12` spins up `pytest -n12` plus 11 other tasks, which would mean we're putting double the intended load on Travis.