Skip to content

[WIP] Use helix on CI #628

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 43 commits into from
Closed

[WIP] Use helix on CI #628

wants to merge 43 commits into from

Conversation

HaoK
Copy link
Member

@HaoK HaoK commented Dec 11, 2018

Replaces #567 (which I busted with a bad force push)

PR validation contains a new helix job which can be used to view MC tests results (in bottom build.cmd output for link to mc.dot.net)

@HaoK HaoK changed the base branch from master to release/2.1 December 11, 2018 21:03
@HaoK
Copy link
Member Author

HaoK commented Dec 11, 2018

Preserving unaddressed comments from @natemcmaster on old pr:

  • Starting to think we should have runtests.cmd and runtests.ps1, and make the cmd a dummy script that starts powershell.
  • We'll need to use different queue names for internal builds. We should find a different way to check if a queue is targeting windows that doesn't depend on the exact queue name "Windows.10.Amd64.Open"
  • Use $(NETCoreSdkVersion) to get the .NET Core SDK version.

@HaoK
Copy link
Member Author

HaoK commented Dec 12, 2018

Finally have a clean helix test run, 2 of the failing tests were fixed and the rest were skipped, looks like it typically takes between 15-30 minutes for the helix job all inclusive, not sure why there's such a large range in running times:

Issues for the skipped tests:
#627
#624
#623
#913

https://mc.dot.net/#/user/dotnet-github-anon-kaonashi-bot/pr~2Faspnet~2Fextensions/ci/20181212.16

@HaoK
Copy link
Member Author

HaoK commented Jan 7, 2019

@natemcmaster how about using $(HelixTargetQueue.StartsWith('Windows') to determine windows queue or not? That doesn't seem too unreasonable

Copy link

@natemcmaster natemcmaster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there. A few things left to do:

  1. Update Helix SDK to something that has this fix Work around Microsoft/msbuild#3345 arcade#1683
  2. Document how to run Helix tests locally (for internal contributors only). Are there any existing docs on how to use Helix that we can link to?
  3. Coordinate this with @ryanbrandenburg who is about to merge stuff that will break this (Use Arcade #586).
  4. Fix or skip one more test failures (not yet reported to Azure Pipelines).
Failed   Microsoft.Extensions.Localization.ResourceManagerStringLocalizerTest.EnumeratorCachesCultureWalkForSameAssembly
Error Message:
 Assert.Equal() Failure
Expected: 2
Actual:   1
Stack Trace:
   at Microsoft.Extensions.Localization.ResourceManagerStringLocalizerTest.EnumeratorCachesCultureWalkForSameAssembly() in /_/src/Localization/Localization/test/ResourceManagerStringLocalizerTest.cs:line 56


<!-- will move into korebuild -->
<PropertyGroup>
<DefaultHelixTestTimeout>00:30:00</DefaultHelixTestTimeout>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't used anywhere, so I believe you can remove it.

<TestName>$(MSBuildProjectName)-$(TargetFramework)</TestName>
<Command Condition="$(HelixTargetQueue.StartsWith('Windows'))">runtests.cmd $(TargetFileName) $(NETCoreSdkVersion) $(RuntimeFrameworkVersion)</Command>
<Command Condition="!$(HelixTargetQueue.StartsWith('Windows'))">./runtests.sh $(TargetFileName) $(NETCoreSdkVersion) $(RuntimeFrameworkVersion)</Command>
<TestTimeout>$(DistributedTestTimeout)</TestTimeout>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DistributedTestTimeout isn't set anywhere. Was this mean to align with DefaultHelixTestTimeout? If so, the timeout property needs to be in Directory.Build.props, not build/repo.targets.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, but do you think we really need different timeouts per test project? I would imagine something standard like 30min or 1 hour should be plenty, do we really want any tests projects that take longer than that to run?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we don't specify anything, what is the default? I'm okay if we don't add this knob until we need it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default if we don't specify anything looks like only 5 minutes, https://github.com/dotnet/arcade/blob/bd3e6b007d9a4b5d8926343e79e538dc2a52d7e5/src/Microsoft.DotNet.Helix/JobSender/WorkItemDefinition.cs#L21

That seems kinda short but we could try it if we want?

@HaoK
Copy link
Member Author

HaoK commented Jan 7, 2019

Results from more target queues is here: https://mc.dot.net/#/user/aspnet-extensions/pr~2Faspnet~2Fextensions/ci/20190107.16

Redhat appears busto on dotnet-install.
There are two unbuntu failures with one being a 'killed' dotnet-install hang? The caching test I believe might just be flaky since its passed before

@HaoK
Copy link
Member Author

HaoK commented Jan 7, 2019

Nice the helix fix works, test failures now fail the build (I tried it on the EF repo: https://dnceng.visualstudio.com/public/_build/results?buildId=67638)

I'll redisable the localization tests, and update the sdk here

@HaoK HaoK mentioned this pull request Jan 7, 2019
@HaoK
Copy link
Member Author

HaoK commented Jan 7, 2019

@natemcmaster Kicking off a helix run locally is just build.cmd /t:Helix no PAT needed or anything anymore so its pretty straight forward, should I just add it to one of the internal wikis? I think linking to https://github.com/dotnet/arcade/blob/master/Documentation/AzureDevOps/SendingJobsToHelix.md is probably a good enough starting point for more info about helix

@natemcmaster
Copy link

@natemcmaster
Copy link

It looks like all RedHat 6 tests failure with:

Couldn't find a valid ICU package installed on the system. Set the configuration flag System.Globalization.Invariant to true if you want to run with no globalization support.

Since tests pass on the RedHat 7 image, which has ICU installed, I'm OK if we just remove RedHat 6.

@natemcmaster
Copy link

There are two unbuntu failures with one being a 'killed' dotnet-install hang?

dotnet-install.sh is notoriously flaky on Linux. Let's try this:

  1. Add a retry around the dotnet-install.sh call
  2. Remove the --no-cdn flag.

@HaoK
Copy link
Member Author

HaoK commented Jan 7, 2019

Added a section to the doc and removed redhat 6

@natemcmaster
Copy link

Can you merge in the latest release/2.1 changes, which dropped netcoreapp2.0? I want to see how long tests take on Helix.

@HaoK
Copy link
Member Author

HaoK commented Jan 8, 2019

@natemcmaster hard to say for sure what average numbers would be long term, but after rebasing total time is down to 25 min for a helix run, was 36 min last night, so that squares with 30% drop...

@HaoK
Copy link
Member Author

HaoK commented Jan 8, 2019

Everything looks to be green or skipped with the exception of a few flaky caching tests, I'll kick off a few more runs to see what the bad boys are and disable them, so far I've seen:

AbsoluteExpirationExpiresInBackground (Debian)
DoNotAddIfSizeOverflows (Windows, Debian)
AddingReplacementWhenTotalSizeExceedsCapacityDoesNotUpdateRemovesOldEntryAndTriggersCompaction(Ubuntu, Debian)

@natemcmaster
Copy link

25 min doesn't seem too bad. I imagine this will get faster if Helix implements de-duping on test payloads, right?

I expect using Helix is going to make flaky tests even more obvious in PRs. By running on 10 different OSes, you increase the probability of tests failing on PRs.

@HaoK
Copy link
Member Author

HaoK commented Jan 8, 2019

Yeah possibly, I'm not sure how much time is spent on the CI uploading the payloads, its definitely still a sizeable chunk though as sometimes many of the early queues are done completely before the tests for the later ones are even finished uploading yet, so if we wanted to, we could speed things up by splitting the job into two that published in parallel to half the queues each, or I could even try adding a ci job per helix queue as an experiment to see if that is the fastest overall run time?

@HaoK
Copy link
Member Author

HaoK commented Jan 8, 2019

Like ideally we could

  1. Build/publish everything
  2. publish to each helix queue in parallel using the publish dirs (vs today's publish to each helix queue sequentially)

Although just running this would add whatever the build time to the front of each which maybe isn't too bad I'll try it just to see

./build.cmd /t:Helix /p:HelixQueue=WIndows.Amd64.Open
./build.cmd /t:Helix /p:HelixQueue=OSX10.Open
...etc

@HaoK
Copy link
Member Author

HaoK commented Jan 8, 2019

FYI second run with failing tests disabled only took 16 minutes

@natemcmaster
Copy link

16-25 minutes is still not bad for running a few thousand test on 10 operating systems. I wish their was more visibility into the status of Helix tests. Combing thru the build log to find the link to Mission Control is annoying.

@HaoK
Copy link
Member Author

HaoK commented Jan 8, 2019

Totally agree, and then clicking thru the MC links is another 3-4 slowww clicks to actually get to the actual failing queues. I thought they were eventually going to move to the azure pipeline reporting tabs right?

@natemcmaster
Copy link

They demoed some prototypes back in November. I don't know what the timeline is for making it a real thing.

@natemcmaster
Copy link

Closing. This was a really useful experiment in using Helix for the first time. We discussed as a team, and don't plan to run tests on Helix for this repo. The distribute testing and platform coverage isn't really necessary and doesn't warrant the overhead at this time.

@mmitche mmitche deleted the haok/helix3 branch January 4, 2021 21:33
@ghost ghost locked as resolved and limited conversation to collaborators May 29, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants