CI: Speed up coverage tests by splitting files #29671
Comments
Hi, I'd like to work on this issue.
Regarding running run_tests.sh, I have two questions.
To work on an issue, you can write a comment with just the word take, and the issue will be assigned to you.
Thanks, I'll remember that ;)
I'd love to take a stab at this. I'll take this issue and try to get it sorted.
Thanks @lambertsbennett. I assigned you the issue; you can do it yourself next time by writing a comment with just the word take.
Hi, sorry to ask, but it's my first time trying something like this. I carried out the following steps:
When I execute run_tests.sh I get the same error Tiore96 indicated above. When I run the pytest command alone in the directory above, the tests run for some time and then begin to throw many errors. I'm not sure if I'm doing something fundamentally stupid here.
The first thing I'd recommend is to make sure you use Miniconda (not Anaconda), and that you never install any dependency in the base environment; that's the cause of many errors and of misleading error messages, often ones related to pytest. In this particular case, though, I suspect the problem may be the path you're calling the script from. It should be called from the root of this project. You can see in the Travis config how it is called, and use it the same way. Hope this helps.
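For reference, a minimal sketch of the setup being described, run from the repository root. The environment name, the build step, and the script path under ci/ are assumptions; check the dependency file and the contributing guide if they differ in your checkout.

```bash
# Create a clean environment from the coverage build's dependency file
# (the environment name defined inside it is assumed to be "pandas-dev").
conda env create -f ci/deps/travis-36-cov.yaml
conda activate pandas-dev

# Build pandas into that environment (the exact build command may differ;
# see the pandas contributing guide).
python -m pip install -e . --no-build-isolation

# Run the test script the same way the Travis build does, from the repo root.
./ci/run_tests.sh
```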
Ok, thanks for the pointers, I'll keep on it!
Hi again. I've worked on splitting test files to decrease the build time of the coverage build. To do so, I did the following:
At this point, each coverage build takes between 30 and 35 minutes. It is not the best gain in build time, but I worry that splitting things even further will create a bit of a mess. I was surprised to see that splitting the lengthiest files more than 5 times each still didn't have much of an effect on the overall build time.
Thanks for the effort @lambertsbennett. I think we're missing something here. My expectation was that the times would improve by just splitting a couple of files in two. So if there is no clear gain after what you did, I think there is something we're not understanding. What would be useful first is to gather all the information about what is going on: how long every file takes, how many parallel jobs are executed in the CI, and, if it's not too complex to obtain, which files are executed in each process and how long each process takes in total. I guess you already have some of this info. When we have all of it, we should be able to understand what's going on and what the best solution is. Thanks!
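A rough sketch of how this information could be gathered locally. The paths, the GNU time usage (Linux), and the worker count are assumptions, not the exact CI setup:

```bash
# Time each test file individually (serial, so this takes a while).
# GNU time prints to stderr, so the per-file timing is still visible.
for f in $(find pandas/tests -name 'test_*.py'); do
    /usr/bin/time -f "%e s  $f" pytest -q "$f" > /dev/null
done

# Show the slowest individual tests.
pytest pandas/tests --durations=25

# With pytest-xdist, verbose output prefixes each test with the worker
# ([gw0], [gw1], ...) that ran it, which shows how files were distributed.
pytest pandas/tests -n 2 --dist=loadfile -v
```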
OK, I will get into rounding up this info! |
Looking at the TravisCI output, it looks like the tests are only being split across two jobs, with each job taking half the tests. Maybe splitting the files further doesn't make much of a difference because there are too few threads/jobs? Another thing I noticed is that when the coverage reports are detected, there are several 'failed to upload' messages followed by 'sleeping for 30s and trying again...'. Depending on how long it attempts to upload before failing, this could add a decent amount of time. However, comparing with the '3.6, locale' build, a similar number of tests is run (a difference of about 160), but pytest executes in 17 min 30 s vs 23 min 30 s, so the coverage build is definitely hung up on pytest taking longer.
Thanks for the research @lambertsbennett. If the "sleeping for 30s" is happening often, we should have a look and see what's going on.

My understanding is that until recently, pytest was splitting all the tests between the two processes individually. So, if there are 1,000 tests, it'd take 500 randomly for one process, and send the other 500 to the other. That won't make both processes have exactly the same amount of work and take the same time, but the difference shouldn't be big. The slowest tests take around 30 seconds, so even if we're unlucky and one process gets several more slow tests than the other, the difference won't be more than a couple of minutes. You can see the slowest tests for example here: https://travis-ci.org/pandas-dev/pandas/jobs/635275875#L2143

Now, we're instead splitting by file. I'm not sure how long each file takes, but I assume the slowest take several minutes. If we have 100 files, pytest should divide them 50 and 50 randomly, and if the tests in each split take around the same time, all will be good. But if some files are much slower, taking for example 5 minutes, and one process gets a couple more of those than the other, one process will finish quite fast while the other takes 10 minutes more, and the job as a whole will take much longer (5 minutes more) than before.

I think the solution is to detect how long every file takes, and make sure there are no files that take a lot of time. Even with only 2 jobs, if we have more files and all of them take a short time, it's very unlikely that the difference between the two processes will be big. Does this make sense?
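For context, a sketch of the two pytest-xdist distribution modes being discussed. The worker count and test path are illustrative, not necessarily the exact options the coverage build uses:

```bash
# Default "load" distribution: each individual test is sent to whichever
# worker is free, so work balances out even if some tests are slow.
pytest pandas/tests -n 2

# "loadfile" distribution (what the CI switched to in #28531): all tests in
# a given file run on the same worker, so one very large or slow file can
# leave the other worker idle near the end of the run.
pytest pandas/tests -n 2 --dist=loadfile
```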
Sorry for the delayed response. This totally makes sense. I had calculated the run length of the files prior to splitting, but I will check again, now that I've split the files, to see how long the new set takes.
Is there a final word on how effective this will be? |
It appears we don't have a specific coverage build anymore, since this originally referenced a Travis CI build that no longer exists. Not sure this is actionable anymore, so closing, but we can reopen if there's a specific area of the code coverage run that can be addressed.
In #28531 we started to parallelize tests in the CI by file, meaning that each file in pandas/tests/ will be assigned to a core and will run fully there. Before, each individual test was assigned to a core. This makes things much simpler, since tests that cannot run in parallel just need to live in the same file. But it implies that if a test file is big and contains many tests, it will slow down the running of all tests.

As an example, imagine we have 1,000 tests that take one second each, and we have 100 cores. 10 tests will be assigned to each core, and the test suite will run in 10 seconds using the 100 cores. But with the new setup, if we have a file with 100 tests, all of them will run on a single core, and the test suite will take the 100 seconds that core needs to run those 100 tests.
In practice, after #28531 the running time of most of the CI builds didn't increase much. Only the coverage build is in some cases taking significantly longer (from around 30 minutes to around 45 minutes).
The solution to this is to find which test files take the longest to run, and split them into smaller files so the work can be parallelized more evenly.
Note that tests can be skipped if dependencies are not found. For example, if matplotlib is not installed in the environment, the plotting tests will simply be skipped. So, it's necessary to create an environment with the dependencies of the build, ci/deps/travis-36-cov.yaml. To call the exact same tests as the build, see the call in run_tests.sh, and replace the parameters with the ones defined in the build.
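As a rough illustration only, this is the general shape of the pytest call that run_tests.sh assembles. The marker expression, worker count, target path, and coverage flags below are placeholders; take the real values from the coverage build definition and from run_tests.sh itself:

```bash
# Placeholder values throughout; substitute the ones defined for the build.
pytest pandas/tests \
    -m "not slow and not network" \
    -n 2 --dist=loadfile \
    --durations=25 \
    --cov=pandas --cov-report=xml
```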