
CI: Speed up coverage tests by splitting files #29671


Closed
datapythonista opened this issue Nov 17, 2019 · 18 comments
Labels

  • CI (Continuous Integration)
  • Closing Candidate (May be closeable, needs more eyeballs)
  • Testing (pandas testing functions or related to the test suite)

Comments

@datapythonista
Member

In #28531 we started to parallelize tests in the CI by file, meaning that each file in pandas/tests/ is assigned to a core and runs entirely there. Before, each individual test was assigned to a core. This makes things much simpler, since tests that cannot run in parallel just need to live in the same file. But it also implies that if a test file is big and contains many tests, it slows down the whole test run.

As an example, imagine we have 1,000 tests that take one second each, and we have 100 cores. With per-test scheduling, 10 tests are assigned to each core, and the test suite runs in 10 seconds across the 100 cores. But with the new setup, if we have a file with 100 tests, all of them will run on a single core, and the test suite will take the 100 seconds that core needs to run those 100 tests.

In practice, after #28531 most of the builds in the CI didn't increase their runtime much. Only the coverage build is in some cases taking significantly longer (from around 30 minutes to around 45 minutes).

The solution to this is:

  • See how long each test file takes
  • Split the files that take the longest in a meaningful way

Note that tests are skipped if their dependencies are not found. For example, if matplotlib is not installed in the environment, the plotting tests will simply be skipped. So you need to create an environment with the dependencies of the build, ci/deps/travis-36-cov.yaml.

To run the exact same tests as the build, see the call in run_tests.sh, and replace the parameters with the ones defined in the build. A rough sketch of this is shown below.
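As a minimal sketch (assuming the environment file and script paths mentioned above; the environment name and the exact pytest flags are illustrative, the authoritative ones live in ci/deps/travis-36-cov.yaml and ci/run_tests.sh):

```bash
# Create and activate an environment with the coverage build's dependencies
# (the environment name "pandas-cov" is just an illustrative choice).
conda env create -f ci/deps/travis-36-cov.yaml -n pandas-cov
conda activate pandas-cov

# Build pandas in place and install it in development mode.
python setup.py build_ext --inplace -j 4
python -m pip install -e .

# Run the suite roughly the way the coverage build does; copy the exact
# flags and marker expression from ci/run_tests.sh and the build definition.
pytest pandas -m "((not slow and not network) or (single and db))" \
    -n 2 --dist=loadfile --cov=pandas --durations=30
```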

@datapythonista added the Testing, CI, and good first issue labels on Nov 17, 2019
@Tirore96

Hi I'd like to work on this issue

@Tirore96

Regarding running run_tests.sh, I have two questions.

  1. The parameter values specified in the build don't address $TEST_ARGS and $COVERAGE. What values should these have?
  2. When running pytest it reports an "unrecognized arguments: --strict-data-files" error. --strict-data-files is specified as an addopt in the config file, but looking at the options page for pytest I can't find --strict-data-files. Am I missing something?

@datapythonista
Member Author

  1. $COVERAGE is usually empty, except for the job that computes and uploads the coverage, where it's true: https://github.com/pandas-dev/pandas/blob/master/.travis.yml#L47. It's similar for $TEST_ARGS: I'm not sure which job uses it, but it's usually empty, and some builds use it to provide extra args to pytest (if I'm not wrong). See the illustration after this list.
  2. It looks like you don't have the right dependencies or the right version of pytest. Use conda for it: in the root of pandas run conda env update if you already created the conda environment, or conda env create if you didn't. Make sure you run conda activate pandas-dev after that.
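For reference, a hedged illustration of how those variables could map to the coverage job (the PATTERN and COVERAGE values are the ones quoted later in this thread; run_tests.sh is assumed to read them from the environment, as the Travis config suggests):

```bash
# Values used by the coverage job; TEST_ARGS is left empty, as for most builds.
export PATTERN="((not slow and not network) or (single and db))"
export COVERAGE=true
export TEST_ARGS=""
```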

@datapythonista
Member Author

To work on an issue, you can write take in a comment and it'll be assigned to you. Please unassign yourself if you stop working on it. Thanks!

@Tirore96

Thanks, I'll remember that ;)

@Tirore96 Tirore96 removed their assignment Nov 27, 2019
@lambertsbennett

I'd love to take a stab at this. I'll take this issue and try to get it sorted.

@datapythonista
Member Author

Thanks @lambertsbennett. I assigned you the issue; you can do it yourself next time by writing a comment with just the word take. If you can't finish it in the end, it would be great if you unassign yourself from the issue, so someone else can give it a try. Thanks!

@lambertsbennett

Hi, sorry to ask, but it's my first time trying something like this. I carried out the following steps:

  1. Forked and cloned pandas.
  2. Navigated to ci/deps/ and created an environment with Anaconda using the file travis-36-cov.yaml.
  3. Activated the environment.
  4. Ran python setup.py build_ext --inplace -j 4
  5. Ran pip install -e .
  6. Navigated to ci/
  7. Set variable PATTERN="((not slow and not network) or (single and db))" as indicated in the build file.
  8. Set variable COVERAGE="true" as indicated in the build file.

When I execute run_tests.sh I get the same error Tirore96 indicated above. When I run the pytest command alone in the directory above, the tests run for some time and then begin to throw many errors. I'm not sure if I'm doing something fundamentally stupid here.

@datapythonista
Member Author

The first thing I'd recommend is to make sure you use Miniconda (not Anaconda) and that you never install any dependency in the base environment. That's the cause of many errors and of misleading error messages, often errors related to pytest.

But in this particular case I think the problem may be the path you're calling the script from. I think it should be called from the root of the project, so it's invoked as ci/run_tests.sh. The error you get seems to be caused by pandas/conftest.py not being found by pytest, and I guess the path is the problem.
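A minimal example of what that could look like (the clone path is illustrative):

```bash
# Call the script from the root of the clone, not from inside ci/, so that
# pytest finds pandas/conftest.py.
cd ~/pandas
ci/run_tests.sh
```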

You can see in the Travis config how it is called, and use it the same way.

Hope this helps.

@lambertsbennett

Ok thanks for the pointers, I'll keep on it!

@lambertsbennett

Hi again,

So I've worked at splitting test files to decrease the build time of the coverage build. To do so, I did the following:

  1. Ran pytest and analysed which test files take the longest to run.
  2. Split the files with the longest run times (about 10 of the original test files) into as many smaller units as possible while maintaining some sort of logical system (often placing test classes into separate files).
  3. Split some test classes into multiple classes across files when tests were in a large monolithic class.
  4. Re-ran Travis CI to check with each set of changes.

At this point, each coverage build takes between 30 and 35 minutes. It's not a great gain in build time, and I worry that splitting things even further will create a bit of a mess. I was surprised to see that splitting the lengthiest files more than 5 times each still didn't have much of an effect on the overall build time.

@datapythonista
Member Author

Thanks for the effort @lambertsbennett.

I think we're missing something here. My expectation was that the times would improve by just splitting a couple of files in two. So if after what you did there is no clear gain, I think there is something we're not understanding.

I think what would be useful first is to gather all the information about what is going on: how long every file takes, how many parallel jobs are being executed in the CI, and, if it's not too complex to obtain, which files are being executed in each process and how long the total for each process is (a rough sketch of gathering the per-file timings is below).

I guess you already have some of this info. I think once we have all of it, we should be able to understand what's going on and what the best solution is.
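As a rough sketch of how the per-file timings could be gathered, assuming the coverage environment is active and the command is run from the repository root (the PATTERN value is the one quoted earlier in this thread):

```bash
# Time each test file individually and print the 20 slowest. This is only an
# approximation (each file pays its own collection/startup cost), but it shows
# which files would dominate a worker's runtime.
PATTERN="((not slow and not network) or (single and db))"
for f in $(find pandas/tests -name 'test_*.py'); do
    start=$(date +%s)
    pytest -q -m "$PATTERN" "$f" > /dev/null 2>&1
    end=$(date +%s)
    echo "$((end - start))s  $f"
done | sort -rn | head -20
```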

Thanks!

@lambertsbennett

OK, I will get into rounding up this info!

@lambertsbennett

Looking at the Travis CI output, it looks like the tests are only being split across two jobs, with each job taking half the tests. Maybe splitting the files more doesn't make much of a difference because there are too few threads/jobs?

Another thing I noticed is that when the coverage reports are uploaded there are several 'failed to upload' messages followed by 'sleeping for 30s and trying again...'. Depending on how long it attempts to upload before failing, this could add a decent amount of time.

However, comparing to the '3.6, locale' build, a similar number of tests is run (a difference of about 160), but pytest executes in 17 min 30 s vs 23 min 30 s, so there is definitely a hold-up in terms of pytest taking longer.

@datapythonista
Member Author

Thanks for the research @lambertsbennett

If the "sleeping for 30s" is happening often, we should have a look and see what's going on.

My understanding is that until recently, pytest was splitting all the tests between the two processes individually. So, if there are 1,000 tests, it would take 500 randomly for one process and send the other 500 to the other. That won't make both processes have exactly the same amount of work and take the same time, but the difference shouldn't be big. The slowest tests take around 30 seconds, so even if we're unlucky and one process gets several more slow tests than the other, the difference won't be more than a couple of minutes.

You can see the slowest tests for example here: https://travis-ci.org/pandas-dev/pandas/jobs/635275875#L2143

Now, we're instead splitting by file. I'm not sure how long each file takes, but I assume the slowest take several minutes. If we have 100 files, pytest should divide them 50 and 50 randomly, and if the tests in each split take around the same time, all will be good. But if some files are much slower, taking for example 5 minutes, and one process gets a couple more of those than the other, one process will finish quite fast while the other takes 10 minutes longer, and the job as a whole will take much longer (5 minutes more) than before.

I think the solution is to detect how long every file takes, and make sure there are no files that take a lot of time. Even with only 2 jobs, if we have more files and all of them take a short time, it's very unlikely that the difference between the two processes is big.
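For reference, a sketch of how the two scheduling modes discussed here map to pytest-xdist options (the paths and worker count are illustrative; #28531 presumably switched the CI to the per-file mode):

```bash
# Per-test scheduling: individual tests are handed out to workers.
pytest pandas/tests -n 2 --dist=load

# Per-file scheduling: all tests in a file stay on the same worker, so a
# single very slow file can dominate one worker's runtime.
pytest pandas/tests -n 2 --dist=loadfile
```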

Does this make sense?

@lambertsbennett

Sorry for the delayed response. This totally makes sense. I had calculated the run length of the files prior to splitting, but I will check again now that I've split files to see how long the new set of files takes.

@lambertsbennett lambertsbennett removed their assignment Feb 13, 2020
@jbrockmendel
Member

Is there a final word on how effective this will be?

@jbrockmendel added the Closing Candidate label on Sep 22, 2020
@mroeschke
Member

It appears we don't have a specific coverage build anymore, since this originally referenced a Travis CI build we no longer have. Not sure if this is actionable anymore, so closing, but we can reopen if there's a specific area of the code coverage that can be addressed.
