Revert "Switch libraries testing to Ubuntu 22 temporarily" #115312


Closed
richlander wants to merge 1 commit from the revert-114268-azl branch

Conversation

richlander
Member

Reverts #114268

Context: #114276 (comment)

@jkotas

@Copilot Copilot AI review requested due to automatic review settings May 5, 2025 19:46
@ghost ghost added the needs-area-label label May 5, 2025
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR reverts the earlier temporary change that switched libraries testing to Ubuntu 22, restoring the previous helix queue names.

  • Reverts the public helix queue from "Ubuntu.2204.Amd64.Open" to "azurelinux.3.amd64.open".
  • Reverts the internal helix queue from "Ubuntu.2204.Amd64" to "azurelinux.3.amd64".
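
For context, a minimal sketch of how a Helix queue setup template typically selects these queue names per team project. The conditions and surrounding structure below are illustrative assumptions, not copied from eng/pipelines/coreclr/templates/helix-queues-setup.yml:

```yaml
# Illustrative Azure DevOps template-expression sketch only; the real
# template's parameters and conditions may differ.
- ${{ if eq(variables['System.TeamProject'], 'public') }}:
  - azurelinux.3.amd64.open
- ${{ if eq(variables['System.TeamProject'], 'internal') }}:
  - azurelinux.3.amd64
```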
Comments suppressed due to low confidence (2)

eng/pipelines/coreclr/templates/helix-queues-setup.yml:101

  • [nitpick] Ensure that changing the queue name for the public team project to "azurelinux.3.amd64.open" is consistent with naming conventions used across the pipeline configuration.
-        - azurelinux.3.amd64.open

eng/pipelines/coreclr/templates/helix-queues-setup.yml:103

  • [nitpick] Verify that the internal team project queue name "azurelinux.3.amd64" aligns with the established naming patterns in the broader system.
-        - azurelinux.3.amd64

@richlander
Member Author

It looks like the issues are related to timeouts / space limitations. I think we were seeing those before. Did anyone develop any insight on that? I'm not sure how that conversation ended. @mthalman @AndyAyersMS @jkotas @dougbu @agocke

Example:

    System.Net.Mail.Tests.SmtpClientTlsTest_Send.ClientCertificateRequired_Sent [FAIL]
      Assert.Null() Failure: Value is not null
      Expected: null
      Actual:   System.Net.Mail.SmtpException: The operation has timed out.
         at System.Net.Mail.SmtpClient.Send(MailMessage message) in /_/src/libraries/System.Net.Mail/src/System/Net/Mail/SmtpClient.cs:line 521
         at System.Net.Mail.Tests.LoopbackServerTestBase`1.SendMailInternal(MailMessage msg, CancellationToken cancellationToken, Nullable`1 asyncExpectDirectException) in /_/src/libraries/System.Net.Mail/tests/Functional/LoopbackServerTestBase.cs:line 74
      Stack Trace:
        /_/src/libraries/System.Net.Mail/tests/Functional/LoopbackServerTestBase.cs(155,0): at System.Net.Mail.Tests.LoopbackServerTestBase`1.SendMail(MailMessage msg, CancellationToken cancellationToken)
        /_/src/libraries/System.Net.Mail/tests/Functional/SmtpClientTlsTest.cs(159,0): at System.Net.Mail.Tests.SmtpClientTlsTest`1.ClientCertificateRequired_Sent()
        --- End of stack trace from previous location ---
      Output:
        Server> 220 localhost
        Client> EHLO a000K7N
        Server> 250-localhost, mock server here
              > 250-STARTTLS
        Server> 250 AUTH PLAIN LOGIN
        Client> STARTTLS
        Server> 220 Ready to start TLS

https://helixr1107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-pull-115312-merge-c8aa9ab112ad47839a/System.Net.Mail.Functional.Tests/1/console.46ccc21a.log?helixlogtype=result

@jkotas jkotas added the area-System.Net label and removed the needs-area-label label May 6, 2025
Contributor

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

@jkotas
Member

jkotas commented May 6, 2025

It looks like the issues are related to timeouts / space limitations

The failures are networking-related hangs, very similar to the hangs that led to the disabling of Azure Linux 3 testing. #114152 is an example of a networking-related hang that we have seen earlier.

The Azure Linux 3 SymCrypt problem does not appear to be fixed. My guess is that the fix is not deployed to our CI machines. These CI legs run on the physical OS; they do not run a containerized OS. #114276 (comment) does not imply that the fix is deployed to the affected CI machines.

@richlander
Member Author

Oh, you are right. I didn't think of the E2E workflow.

@ilyas1974 -- Can you check which SymCrypt version is on the CI machines? The context link above should help.

@ilyas1974

The images currently associated with those pools are from 4/17. I am currently checking which version of SymCrypt is installed on these systems.

@vcsjones
Member

vcsjones commented May 6, 2025

A discussion point raised in a different medium that I wanted to make sure was better understood: do we have Linux testing that we feel is adequate that is not Azure Linux 3? Azure Linux 3's OpenSSL is different enough from vanilla OpenSSL that they are more or less two different things. We want to make sure we also have CI protections that are representative of what the Linux community uses.

@richlander
Member Author

richlander commented May 6, 2025

Good point. The original plan was to switch to all AZL3 VMs. That's now looking like an impractical plan, for two reasons:

  • It's quite different (what you said).
  • Concerns about ongoing quality/reliability/breakage (we need to be conservative).

I'm concerned that SymCrypt changes will continue to break us. We need a way to easily fall back to Ubuntu. That would be much easier if we had a guarded rollout scheme (via PR), where VMs were not updated in place.

Just so I/we understand ... Some of these crypto APIs must have a kernel implementation. It's the kernel that gets shared between host and guest and (AFAIK) nothing else. That's the architectural reason for this conversation, yes?

@vcsjones
Member

vcsjones commented May 6, 2025

@ilyas1974 Specifically, the package under question is SymCrypt-OpenSSL.

The fix is in >= 1.8.0.

The image currently associated with those pools are from 4/17.

Based on the repo information, 1.8.0 was released on 4/29. I suspect the VMs are on 1.7.0-1.azl3.
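
One way to confirm would be a version check on the machine itself. This is a sketch under assumptions from this thread: an rpm-based Azure Linux image, the package name SymCrypt-OpenSSL, and 1.8.0 as the first fixed release; the rpm query shown in the comment is hypothetical and the installed value is a placeholder:

```shell
#!/bin/sh
# On the CI machine, the real query would be something like (hypothetical):
#   installed="$(rpm -q --qf '%{VERSION}' SymCrypt-OpenSSL)"
installed="1.7.0"   # placeholder standing in for the rpm query output
minimum="1.8.0"     # first release containing the fix, per the comment above

# sort -V orders version strings numerically; if the installed version
# sorts first and differs from the minimum, it predates the fix.
if [ "$(printf '%s\n' "$minimum" "$installed" | sort -V | head -n1)" = "$installed" ] \
   && [ "$installed" != "$minimum" ]; then
  echo "SymCrypt-OpenSSL $installed is older than $minimum: fix missing"
else
  echo "SymCrypt-OpenSSL $installed includes the fix"
fi
```

With the suspected 1.7.0-1.azl3 image this reports the fix as missing; a 1.8.0 or later package reports it as present.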

@richlander
Member Author

The May Azure Linux 3.0 update has been released! (3.0.20250429)

Just got this in mail.

@jkotas
Member

jkotas commented May 6, 2025

Some of these crypto APIs must have a kernel implementation. It's the kernel that gets shared between host and guest and (AFAIK) nothing else. That's the architectural reason for this conversation, yes?

A standard OpenSSL configuration does not have any special kernel components. Some containerized distros run on the Azure Linux kernel, and we have not observed any issues with it so far: https://github.com/dotnet/runtime/blob/main/eng/pipelines/coreclr/templates/helix-queues-setup.yml#L79-L82

That would be much easier if we had a guarded rollout scheme (via PR), where VMs were not updated in place.

Right, this is ongoing pain from unguarded rollouts that we have been living with for years. I would not over-index on the Azure Linux SymCrypt problem. I can bet that the next five surprise CI breaks from unguarded image rollouts are going to be somewhere else.

@richlander
Member Author

richlander commented May 6, 2025

I see now. I just re-looked at the changes. This is the scenario where we're testing with raw VMs. That resolves my confusion. A simple solution is that we don't do that with Azure Linux (and close this PR).

@jkotas
Member

jkotas commented May 6, 2025

A simple solution is that we don't do that with Azure Linux

There are two aspects:

  • Whether to test Azure Linux in default CI: Our strategy has been to have a mix of OSes and architectures in the default CI so that we have high probability of finding most breakages quickly. We need to have Azure Linux coverage somewhere. Dropping Azure Linux testing in the default CI runs would not be an improvement. Azure Linux breakages come from both Azure Linux changes (less common) and our code changes (more common). If we do not have Azure Linux testing in default CI, more breakages will sneak through to outer loop.
  • Whether to test Azure Linux on raw VMs: We should have Azure Linux raw VMs coverage somewhere. We can choose to move it to outer loop (as long as we have testing for containerized Azure Linux in default CI).

@richlander
Member Author

Yes, I was thinking of having containerized Azure Linux in default CI, which you had suggested earlier. The idea of raw VM testing in outer loop hadn't occurred to me. That is likely a good idea. I will close this PR since it is no longer relevant.

I read into this an underlying statement that raw-VM testing is an important modality.

@richlander richlander closed this May 6, 2025
@jkotas jkotas deleted the revert-114268-azl branch May 23, 2025 22:04
@github-actions github-actions bot locked and limited conversation to collaborators Jun 23, 2025