
Debug subprocess management by pytest #3975


Closed
gvanrossum opened this issue Sep 20, 2017 · 9 comments

@gvanrossum
Member

We keep seeing test flakes in Travis, attributed to #3543. They are kind of hard to recognize, since the failures just have the form "Expected: ; Actual: (empty)".

We suspect that the subprocesses created by the integration tests (the python*eval tests) are running out of memory because there are too many of them, and are being killed, but whatever is managing the subprocesses doesn't notice the exit status. IIRC Linux OOM kills appear as killed by signal 9, i.e. SIGKILL, but if you just read the pipe you'll never know.

We need some investigation into this theory and improved status checking, so at least the crashes are easier to diagnose. @ethanhs @elazarg are either of you interested?

My theory could be patently false, e.g. if some of the failures seen don't come from "python*eval" tests. It could also be that somehow runtests.py -j12 spins up pytest -n12 plus 11 other tasks, which would mean we're putting double the load on Travis that we intended.
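(For context: a negative Popen return code encodes the signal that killed the child, so an OOM kill would show up as -9. A minimal sketch of a helper that would surface this in a failure message; the helper itself is hypothetical, not existing mypy code.)

import signal

def describe_exit(returncode):
    # Popen sets returncode to the negated signal number when the child was
    # killed by a signal; -9 is SIGKILL, which the Linux OOM killer sends.
    if returncode < 0:
        return 'killed by {} (possibly the OOM killer)'.format(
            signal.Signals(-returncode).name)
    return 'exited with code {}'.format(returncode)

# describe_exit(-9) -> 'killed by SIGKILL (possibly the OOM killer)'
# describe_exit(1)  -> 'exited with code 1'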

@elazarg
Contributor

elazarg commented Sep 23, 2017

I don't know how to pursue this, but looking at testpythoneval.py, the code that executes the tests is this (after removing some comments and type hints):

def test_python_evaluation(testcase):
    # ...
    returncode, out = run(mypy_cmdline)
    if returncode == 0:
        returncode, interp_out = run([interpreter, program])
        out += interp_out
    os.remove(program_path)
    assert_string_arrays_equal(adapt_output(testcase), out,
                               'Invalid output ({}, line {})'.format(
                                   testcase.file, testcase.line))

And run() is defined as

def run(cmdline, env=None, timeout=30) -> Tuple[int, List[str]]:
    """A poor man's subprocess.run() for 3.3 and 3.4 compatibility."""
    process = subprocess.Popen(cmdline, env=env,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE,
                               cwd=test_temp_dir)
    try:
        out, err = process.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        out = err = b''
        process.kill()
    return process.returncode, split_lines(out, err)

So empty output could be caused by a simple timeout. Besides, I'm not sure the 30-second value is reasonable, and the return code is not checked at the call site.
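A rough sketch of the kind of call-site check this implies (hypothetical, not a proposed patch; note that with the quoted run(), the return code can even be None after a timeout, since the killed child is never waited on):

returncode, out = run(mypy_cmdline)
if returncode is None:
    # run() killed the child after a timeout and never reaped it.
    raise AssertionError('mypy subprocess timed out')
if returncode < 0:
    # The child was killed by a signal; -9 (SIGKILL) is what an OOM kill looks like.
    raise AssertionError('mypy subprocess killed by signal {}'.format(-returncode))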

@emmatyping
Member

I think it would be nice to get more debug information on which subprocess is timing out, and perhaps under what conditions. A good first step would be to not be silent when we catch the TimeoutExpired exception and to print out the command. Additionally, I think it would help to increase the timeout threshold.
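Concretely, something along these lines (a sketch only; the message wording and the larger default are just the suggestions from this thread, and test_temp_dir / split_lines are as in the run() quoted above):

import subprocess
import sys

def run(cmdline, env=None, timeout=60):
    """Like run() above, but reports timeouts instead of swallowing them."""
    process = subprocess.Popen(cmdline, env=env,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE,
                               cwd=test_temp_dir)
    try:
        out, err = process.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        print('Timed out after {}s: {}'.format(timeout, ' '.join(cmdline)),
              file=sys.stderr)
        process.kill()
        # Reap the child and collect whatever it wrote before the kill.
        out, err = process.communicate()
    return process.returncode, split_lines(out, err)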

@emmatyping
Member

It appears (based on the normal subprocess.run) that the timeout parameter is normally just None. I was testing running many more processes than threads for #3895, and I found it caused processes to time out. I think a simple solution (which would allow us not to decrease the process count) is to increase the timeout parameter to 60.

@gvanrossum
Member Author

But, barring crazy experiments or excessive load, the 30-second timeout should be fine, right?

@elazarg
Contributor

elazarg commented Sep 26, 2017

That depends on the fairness of the scheduler. I would think it should be fine, but it seems like it isn't. So maybe the scheduler is not fair, and it happens that some process is starved for too long.

@JukkaL
Collaborator

JukkaL commented Sep 26, 2017

If we have way more processes than we can actually run in parallel and if there are multiple other heavy (non-mypy) jobs on the same Travis slave, there could be long stalls, though 30 seconds sounds quite high. I've seen similar (maybe not quite as pathological) behavior in our internal CI system as well. If we increase the timeout, maybe we should increase it even more than that -- say, to 90 seconds? Legitimate hangs are probably pretty rare, and flaky builds are very annoying.

Here's a somewhat crazy idea: could we detect the maximum amount of real parallelism that we can get at the beginning of a build, and adjust runtests.py parallelism based on the measured value? The performance of Travis CI builds seems to vary a lot depending on load, and maybe it would be useful to dynamically adapt to this.
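To make that concrete, a back-of-the-envelope sketch of such a probe (entirely hypothetical; runtests.py has nothing like this): time a fixed chunk of CPU work with an increasing number of worker processes and stop once the wall time starts climbing.

import multiprocessing
import time

def _burn(iterations=5000000):
    # A fixed amount of pure-CPU work (the same for every worker).
    total = 0
    for i in range(iterations):
        total += i * i
    return total

def measured_parallelism(max_workers=12):
    baseline = None
    usable = 1
    for n in range(1, max_workers + 1):
        start = time.monotonic()
        workers = [multiprocessing.Process(target=_burn) for _ in range(n)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        elapsed = time.monotonic() - start
        if baseline is None:
            baseline = elapsed
        if elapsed > 1.5 * baseline:
            # The extra workers no longer ran in parallel; stop here.
            break
        usable = n
    return usable

# runtests.py could then cap its -j value at measured_parallelism().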

@emmatyping
Member

I agree with Jukka -- we are putting "excessive load" on Travis and Appveyor. Technically we should only be running two processes. And if a test actually hangs, it will just take 30s longer to tell if we up the timeout.

> could we detect the maximum amount of real parallelism that we can get at the beginning of a build

Technically it is 2 cores. That is what the docs say. So either that isn't accurate and each job is not run in a separate container, or Travis is doing more scheduling, because we do get more speed from running >2 processes.
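For what it's worth, the container's effective CPU allowance can be checked directly with standard library calls plus the usual cgroup v1 files (which may be mounted elsewhere or absent, so this is only a sketch):

import os

print('os.cpu_count():', os.cpu_count())                    # logical CPUs on the host
print('sched_getaffinity:', len(os.sched_getaffinity(0)))   # CPUs this process may use
try:
    with open('/sys/fs/cgroup/cpu/cpu.cfs_quota_us') as f:
        quota = int(f.read())
    with open('/sys/fs/cgroup/cpu/cpu.cfs_period_us') as f:
        period = int(f.read())
    if quota > 0:
        print('cgroup CPU quota:', quota / period)           # effective cores in the container
except OSError:
    pass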

@emmatyping
Member

Lowering to normal priority. Ideally this should be figured out, but our current solution is working for now.

@emmatyping
Member

I don't think this is relevant anymore. We haven't had issues with this in a long time.
