-
Notifications
You must be signed in to change notification settings - Fork 805
[WIP] Use helix on CI #628
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Preserving unaddressed comments from @natemcmaster on old pr:
|
Finally have a clean helix test run, 2 of the failing tests were fixed and the rest were skipped, looks like it typically takes between 15-30 minutes for the helix job all inclusive, not sure why there's such a large range in running times: Issues for the skipped tests: https://mc.dot.net/#/user/dotnet-github-anon-kaonashi-bot/pr~2Faspnet~2Fextensions/ci/20181212.16 |
@natemcmaster how about using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there. A few things left to do:
- Update Helix SDK to something that has this fix Work around Microsoft/msbuild#3345 arcade#1683
- Document how to run Helix tests locally (for internal contributors only). Are there any existing docs on how to use Helix that we can link to?
- Coordinate this with @ryanbrandenburg who is about to merge stuff that will break this (Use Arcade #586).
- Fix or skip one more test failures (not yet reported to Azure Pipelines).
Failed Microsoft.Extensions.Localization.ResourceManagerStringLocalizerTest.EnumeratorCachesCultureWalkForSameAssembly
Error Message:
Assert.Equal() Failure
Expected: 2
Actual: 1
Stack Trace:
at Microsoft.Extensions.Localization.ResourceManagerStringLocalizerTest.EnumeratorCachesCultureWalkForSameAssembly() in /_/src/Localization/Localization/test/ResourceManagerStringLocalizerTest.cs:line 56
build/repo.targets
Outdated
|
||
<!-- will move into korebuild --> | ||
<PropertyGroup> | ||
<DefaultHelixTestTimeout>00:30:00</DefaultHelixTestTimeout> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't used anywhere, so I believe you can remove it.
Directory.Build.targets
Outdated
<TestName>$(MSBuildProjectName)-$(TargetFramework)</TestName> | ||
<Command Condition="$(HelixTargetQueue.StartsWith('Windows'))">runtests.cmd $(TargetFileName) $(NETCoreSdkVersion) $(RuntimeFrameworkVersion)</Command> | ||
<Command Condition="!$(HelixTargetQueue.StartsWith('Windows'))">./runtests.sh $(TargetFileName) $(NETCoreSdkVersion) $(RuntimeFrameworkVersion)</Command> | ||
<TestTimeout>$(DistributedTestTimeout)</TestTimeout> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DistributedTestTimeout isn't set anywhere. Was this mean to align with DefaultHelixTestTimeout? If so, the timeout property needs to be in Directory.Build.props, not build/repo.targets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, but do you think we really need different timeouts per test project? I would imagine something standard like 30min or 1 hour should be plenty, do we really want any tests projects that take longer than that to run?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we don't specify anything, what is the default? I'm okay if we don't add this knob until we need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default if we don't specify anything looks like only 5 minutes, https://github.com/dotnet/arcade/blob/bd3e6b007d9a4b5d8926343e79e538dc2a52d7e5/src/Microsoft.DotNet.Helix/JobSender/WorkItemDefinition.cs#L21
That seems kinda short but we could try it if we want?
Results from more target queues is here: https://mc.dot.net/#/user/aspnet-extensions/pr~2Faspnet~2Fextensions/ci/20190107.16 Redhat appears busto on dotnet-install. |
Nice the helix fix works, test failures now fail the build (I tried it on the EF repo: https://dnceng.visualstudio.com/public/_build/results?buildId=67638) I'll redisable the localization tests, and update the sdk here |
@natemcmaster Kicking off a helix run locally is just |
Can you add to this doc? https://github.com/aspnet/Extensions/blob/master/docs/build-from-source.md |
It looks like all RedHat 6 tests failure with:
Since tests pass on the RedHat 7 image, which has ICU installed, I'm OK if we just remove RedHat 6. |
dotnet-install.sh is notoriously flaky on Linux. Let's try this:
|
Added a section to the doc and removed redhat 6 |
Can you merge in the latest release/2.1 changes, which dropped netcoreapp2.0? I want to see how long tests take on Helix. |
This reverts commit 3ca3bab.
@natemcmaster hard to say for sure what average numbers would be long term, but after rebasing total time is down to 25 min for a helix run, was 36 min last night, so that squares with 30% drop... |
Everything looks to be green or skipped with the exception of a few flaky caching tests, I'll kick off a few more runs to see what the bad boys are and disable them, so far I've seen: AbsoluteExpirationExpiresInBackground (Debian) |
25 min doesn't seem too bad. I imagine this will get faster if Helix implements de-duping on test payloads, right? I expect using Helix is going to make flaky tests even more obvious in PRs. By running on 10 different OSes, you increase the probability of tests failing on PRs. |
Yeah possibly, I'm not sure how much time is spent on the CI uploading the payloads, its definitely still a sizeable chunk though as sometimes many of the early queues are done completely before the tests for the later ones are even finished uploading yet, so if we wanted to, we could speed things up by splitting the job into two that published in parallel to half the queues each, or I could even try adding a ci job per helix queue as an experiment to see if that is the fastest overall run time? |
Like ideally we could
Although just running this would add whatever the build time to the front of each which maybe isn't too bad I'll try it just to see
|
FYI second run with failing tests disabled only took 16 minutes |
16-25 minutes is still not bad for running a few thousand test on 10 operating systems. I wish their was more visibility into the status of Helix tests. Combing thru the build log to find the link to Mission Control is annoying. |
Totally agree, and then clicking thru the MC links is another 3-4 slowww clicks to actually get to the actual failing queues. I thought they were eventually going to move to the azure pipeline reporting tabs right? |
They demoed some prototypes back in November. I don't know what the timeline is for making it a real thing. |
Closing. This was a really useful experiment in using Helix for the first time. We discussed as a team, and don't plan to run tests on Helix for this repo. The distribute testing and platform coverage isn't really necessary and doesn't warrant the overhead at this time. |
Replaces #567 (which I busted with a bad force push)
PR validation contains a new helix job which can be used to view MC tests results (in bottom build.cmd output for link to mc.dot.net)