pacevedom
Contributor

Which issue(s) this PR addresses:

Closes #

@openshift-ci-robot

openshift-ci-robot commented Apr 3, 2024

@pacevedom: This pull request references USHIFT-2443 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

Which issue(s) this PR addresses:

Closes #

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 3, 2024
@pacevedom
Contributor Author

/hold
Tests pending.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 3, 2024
@openshift-ci openshift-ci bot requested review from ggiguash and pliurh April 3, 2024 12:17
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 3, 2024
"routeAdmissionPolicy"
],
"properties": {
"expose": {
Contributor

Most of the other fields like this are nouns. Maybe we can find a noun similar to advertiseAddress for this one?

Contributor Author

How about listenAddress? It also allows NIC names, but always translates to IP addresses.

Contributor

What is the field called in the kube API where these values end up being copied?

Contributor Author

That would be this one: https://github.com/kubernetes/api/blob/master/core/v1/types.go#L5058
Which is under service.status.loadBalancer.ingress.ip

Contributor

If we call it listenAddresses how confusing will it be that it takes an interface, too? Does listenInterfaces have the connotation that it could be a NIC name or IP?

Contributor Author

I suggested listenAddress because, after all, everything gets translated into IP addresses, and users can see them in the service itself. It is true that it does not accommodate NIC names very well, though. I see listenInterfaces as having a similar issue in the opposite direction. What do you think of endpoints? That would work for anything that accepts network connections.
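
To make the naming trade-off concrete, here is a minimal sketch of how an entry that may be either an IP address or a NIC name could be translated into IP addresses. The helper name `resolveEntry` is hypothetical and this is not the code from the PR; it only uses the Go standard library `net` package.

```go
package main

import (
	"fmt"
	"net"
)

// resolveEntry returns the IP addresses for one configuration entry.
// The entry is treated as a literal IP when it parses as one; otherwise it is
// looked up as a network interface name. Hypothetical helper, not from the PR.
func resolveEntry(entry string) ([]string, error) {
	if ip := net.ParseIP(entry); ip != nil {
		return []string{ip.String()}, nil
	}
	iface, err := net.InterfaceByName(entry)
	if err != nil {
		return nil, fmt.Errorf("%q is neither an IP address nor a known interface: %w", entry, err)
	}
	addrs, err := iface.Addrs()
	if err != nil {
		return nil, err
	}
	var ips []string
	for _, addr := range addrs {
		// Interface addresses are normally *net.IPNet values (IP plus mask).
		if ipNet, ok := addr.(*net.IPNet); ok {
			ips = append(ips, ipNet.IP.String())
		}
	}
	return ips, nil
}

func main() {
	// "lo" is the usual Linux loopback name; on other systems this entry
	// simply reports an error.
	for _, entry := range []string{"10.44.0.0", "lo"} {
		ips, err := resolveEntry(entry)
		fmt.Println(entry, ips, err)
	}
}
```

Whatever name the field ends up with, a sketch like this shows why both readings are defensible: the input can be an interface, while the output is always a list of addresses.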

return addresses, nil
}

func GetConfiguredAddresses() ([]string, error) {
Contributor

Does this need to be a public function?

Contributor Author

Renamed the function. This is used from both validations and the load balancer service controller to obtain all the allowed IP addresses in the host. They may change dynamically without restarting MicroShift.

return addressList, nil
}

func GetHostNICNames() ([]string, error) {
Contributor

Does this need to be a public function?

Contributor Author

Renamed the function. It is used in the same way as the function that gets all the allowed IP addresses; interfaces may also change dynamically.

IP: c.NodeIP,
})
//TODO use annotations instead.
if svc.Name == "router-default" {
Contributor

This conditional statement needs a comment explaining why we treat this service in a special way.

Contributor Author

Added.

@@ -209,3 +238,59 @@ func (c *LoadbalancerServiceController) patchStatus(svc *corev1.Service, newStat

return err
}

func (c *LoadbalancerServiceController) getRouterIPAddressList() ([]string, error) {
configuredAddresses, err := config.GetConfiguredAddresses()
Contributor

In what sense are these addresses configured? Are they part of the config data the user gave us, or are they present on the host?

Contributor Author

These are the IPs already present on the host. Renamed both the function and the variable. I was thinking about a context where configuredXYZ means it's on the host and anything coming from the user is in cfg. This can be misleading though.


for _, ip := range c.IPAddresses {
if !slices.Contains(configuredAddresses, ip) {
klog.Infof("IP address %v not found in the host. Removing it", ip)
Contributor

The address is just being ignored, right, not removed? Will a reader of the log understand why we're reporting about that address at all?

Contributor Author

Depends on the context. These are the addresses that a user configured, and they were in the service status field for some time.
If somehow the IP disappears from the host then it is effectively removed from the service status. This may qualify better as a warning instead, as it points to a misconfiguration of MicroShift/the host.
It is true though that the removal only happens once. The next execution loop will not remove it because it was not there (provided that the situation persists). Will reword to skip and raise the level to warning.
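
The behaviour described in this reply (skip user-configured addresses that are not present on the host, and say so at warning level) can be sketched as follows. This is an illustration under assumptions: `filterPresent` is a hypothetical name, and the standard `log` package stands in for klog so the example stays self-contained.

```go
package main

import (
	"fmt"
	"log"
	"slices"
)

// filterPresent returns the configured addresses that also exist on the host.
// Addresses missing from the host are skipped and reported as warnings, since
// they point to a mismatch between the configuration and the host.
func filterPresent(configured, present []string) []string {
	kept := make([]string, 0, len(configured))
	for _, ip := range configured {
		if !slices.Contains(present, ip) {
			log.Printf("WARNING: configured IP address %s not found on the host, skipping it", ip)
			continue
		}
		kept = append(kept, ip)
	}
	return kept
}

func main() {
	configured := []string{"10.44.0.0", "192.0.2.1"}
	present := []string{"10.192.10.65", "10.44.0.0", "10.42.0.2"}
	fmt.Println(filterPresent(configured, present))
}
```

Note that a filter like this reports the missing address on every pass while the situation persists, which is the reason "skipping" describes it better than "removing".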

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 4, 2024
@pacevedom
Contributor Author

/test ?

Contributor

openshift-ci bot commented Apr 8, 2024

@pacevedom: The following commands are available to trigger required jobs:

  • /test images
  • /test metal-periodic-test
  • /test metal-periodic-test-arm
  • /test microshift-metal-cache
  • /test microshift-metal-cache-arm
  • /test microshift-metal-tests
  • /test microshift-metal-tests-arm
  • /test ocp-conformance-rhel-eus
  • /test ocp-conformance-rhel-eus-arm
  • /test test-rpm
  • /test test-unit
  • /test verify

The following commands are available to trigger optional jobs:

  • /test test-rebase

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-microshift-main-images
  • pull-ci-openshift-microshift-main-microshift-metal-tests
  • pull-ci-openshift-microshift-main-microshift-metal-tests-arm
  • pull-ci-openshift-microshift-main-ocp-conformance-rhel-eus
  • pull-ci-openshift-microshift-main-ocp-conformance-rhel-eus-arm
  • pull-ci-openshift-microshift-main-test-rpm
  • pull-ci-openshift-microshift-main-test-unit
  • pull-ci-openshift-microshift-main-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 8, 2024
@pacevedom
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 8, 2024
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 8, 2024
@pacevedom pacevedom force-pushed the USHIFT-2443 branch 2 times, most recently from d5a8c59 to e8e2216 Compare April 8, 2024 21:38
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 8, 2024
@pacevedom
Contributor Author

/test microshift-metal-tests-arm

2 similar comments
@pacevedom
Contributor Author

/test microshift-metal-tests-arm

@pacevedom
Contributor Author

/test microshift-metal-tests-arm

@pacevedom
Contributor Author

/hold
It seems the tests are taking too long and timing out. Looking into where to cut time now.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 9, 2024
@ShudiLi
Member

ShudiLi commented Apr 11, 2024

The ingress part, including listenAddress and ports, was added to config.yaml and the microshift service was then restarted, but sometimes the cluster could no longer be reached (the HTTP route could be curled on the specified LB IP and HTTP port 10080, but curling the HTTPS route on the specified LB IP and HTTPS port 10443 failed).

One test result is below:

  1. debug node and check the host ips:
    sh-5.1# ip address | grep "inet 10."
    inet 10.192.10.65/24 brd 10.192.10.255 scope global dynamic noprefixroute eth0
    inet 10.44.0.0/32 scope global br-ex
    inet 10.42.0.2/24 brd 10.42.0.255 scope global ovn-k8s-mp0
    sh-5.1#

  2. Check the default load balancer IPs, which are 10.192.10.65 and 10.44.0.0 as expected:
    % oc -n openshift-ingress get svc
    NAME                      TYPE           CLUSTER-IP    EXTERNAL-IP              PORT(S)                      AGE
    router-default            LoadBalancer   10.43.37.67   10.192.10.65,10.44.0.0   80:32373/TCP,443:32150/TCP   24m
    router-internal-default   ClusterIP      10.43.40.42                            80/TCP,443/TCP,1936/TCP      24m

  3. Create two routes:
    % oc get route
    NAME            HOST                                     ADMITTED   SERVICE         TLS
    edge1           edge1-default.apps.example.com           True       unsec-server3
    unsec-server3   unsec-server3-default.apps.example.com   True       unsec-server3

  4. curl the routes with destination to the LB
    sh-4.4# curl http://unsec-server3-default.apps.example.com --resolve unsec-server3-default.apps.example.com:80:10.192.10.65
    this a test!
    sh-4.4# curl http://unsec-server3-default.apps.example.com --resolve unsec-server3-default.apps.example.com:80:10.44.0.0
    this a test!
    sh-4.4#
    sh-4.4# curl https://edge1-default.apps.example.com -k --resolve edge1-default.apps.example.com:443:10.192.10.65
    this a test!
    sh-4.4# curl https://edge1-default.apps.example.com -k --resolve edge1-default.apps.example.com:443:10.44.0.0
    this a test!
    sh-4.4#

  5. debug node and modify config.yaml, and then restart the microshift service
    sh-5.1# vi config.yaml
    sh-5.1# sudo systemctl restart microshift

  6. After about 1h, it was no longer possible to connect to the server:
    % oc get nodes
    The connection to the server 35.94.60.42:6443 was refused - did you specify the right host or port?

@ShudiLi
Member

ShudiLi commented Apr 11, 2024

Tested by modifying the config.yaml file to add the ingress part (specifying the listening address 10.44.0.0). After restarting the microshift service:
1.
% oc -n openshift-ingress get svc
NAME                      TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
router-default            LoadBalancer   10.43.21.230   10.44.0.0     80:30448/TCP,443:31863/TCP   69m
router-internal-default   ClusterIP      10.43.88.186                 80/TCP,443/TCP,1936/TCP      69m

% oc get route
NAME            HOST                                     ADMITTED   SERVICE         TLS
edge1           edge1-default.apps.example.com           True       unsec-server3
unsec-server3   unsec-server3-default.apps.example.com   True       unsec-server3

% oc rsh centos-pod2
sh-4.4# curl http://unsec-server3-default.apps.example.com --resolve unsec-server3-default.apps.example.com:80:10.44.0.0
this a test!
sh-4.4#
sh-4.4# curl https://edge1-default.apps.example.com -k --resolve edge1-default.apps.example.com:443:10.44.0.0
this a test!
sh-4.4#

@ShudiLi
Member

ShudiLi commented Apr 11, 2024

/label qe-approved
thanks

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Apr 11, 2024
@openshift-bot

openshift-bot commented Apr 11, 2024

@pacevedom: This pull request references USHIFT-2443 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

Which issue(s) this PR addresses:

Closes #

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@ShudiLi
Member

ShudiLi commented Apr 11, 2024

tests for the ports:
1.
% oc -n openshift-ingress get svc
NAME                      TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)                           AGE
router-default            LoadBalancer   10.43.21.230   10.44.0.0     10080:30448/TCP,10443:31863/TCP   113m
router-internal-default   ClusterIP      10.43.88.186                 80/TCP,443/TCP,1936/TCP           113m

% oc get route
NAME            HOST                                     ADMITTED   SERVICE         TLS
edge1           edge1-default.apps.example.com           True       unsec-server3
passthrough1    passthrough1-default.apps.example.com    True       sec-httpbin
reencrypt1      reencrypt1-default.apps.example.com      True       sec-httpbin
unsec-server3   unsec-server3-default.apps.example.com   True       unsec-server3

sh-4.4# curl https://edge1-default.apps.example.com:10443 -k --resolve edge1-default.apps.example.com:10443:10.44.0.0
this a test!
sh-4.4# curl http://unsec-server3-default.apps.example.com:10080 --resolve unsec-server3-default.apps.example.com:10080:10.44.0.0
this a test!
sh-4.4#
sh-4.4# curl https://passthrough1-default.apps.example.com:10443/headers -k --resolve passthrough1-default.apps.example.com:10443:10.44.0.0
{
"headers": {
"Accept": "*/*",
"Host": "passthrough1-default.apps.example.com:10443",
"User-Agent": "curl/7.61.1"
}
}
sh-4.4#
sh-4.4# curl https://reencrypt1-default.apps.example.com:10443/headers -k --resolve reencrypt1-default.apps.example.com:10443:10.44.0.0
{
"headers": {
"Accept": "*/*",
"Forwarded": "for=10.42.0.11;host=reencrypt1-default.apps.example.com:10443;proto=https",
"Host": "reencrypt1-default.apps.example.com:10443",
"User-Agent": "curl/7.61.1",
"X-Forwarded-Host": "reencrypt1-default.apps.example.com:10443"
}
}
sh-4.4#

@pacevedom
Contributor Author

/hold cancel
Ready for final review

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 11, 2024
klog.Infof("P333. This is something I am waiting for: %v", err)
if err != nil {
klog.Errorf("unable to update default router service status: %v", err)
break
Member

Why not retry in some time? I guess this also leaves MicroShift without this "controller"?

Contributor Author

Will include this as part of the initial loop. However, falling into this error would mean no more dynamically changing the load balancer IP addresses. The last good known state would remain until an admin took action by restarting MicroShift. This may be considered as a degraded state, but not failed.

Member

Exactly, but are we communicating this in any way other than logging it? How should the admin recognize that MicroShift needs restarting?

Contributor Author

There is no other way right now but logs and docs. This situation shall be explicitly stated there as a degraded condition alongside a remediation procedure to solve it.

Member

ack

}
nicAddresses, err := ipAddressesFromNIC(nicName)
if err != nil {
return nil, err
Member

Above and below we're continuing on "errors", but here we're quitting. What do you think about continuing on this error as well (or ignoring it in the defaultRouterWatch) so this "controller" keeps running?

Contributor Author

By continuing on errors, do you mean skipping IPs that are no longer present? That is not an abnormal situation in MicroShift itself, but in the configuration. If the values you configured are gone, then it is the admin who should take action to fix it.
If, however, MicroShift is unable to retrieve host addresses or NICs, then this is a different issue. Since it's difficult to determine the cause from here, an error log will be written in the top handler and the last good known state will not be changed. The next IP change (or an admin restarting something) will trigger this function again.
The key in this controller is avoiding changing the last good known state unless the new one is accurate.

Member

True, but it seems we're not communicating the degraded state in any way (besides logs, but come on)? And the remediation is basically "restart microshift" - either manually or by reconfiguring some IPs?

Contributor Author

This shall be part of the docs. We do not have a way of signaling this kind of condition in MicroShift. Killing the process could have other implications for an app if it's unexpected.

Member

ack

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 12, 2024
Contributor

openshift-ci bot commented Apr 12, 2024

@pacevedom: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/metal-bootc-test-arm 7ee7bf9da53e8a6a961a45df1c4dba541c68062d link true /test metal-bootc-test-arm
ci/prow/metal-bootc-test 7ee7bf9da53e8a6a961a45df1c4dba541c68062d link true /test metal-bootc-test

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 12, 2024
@pmtk
Member

pmtk commented Apr 12, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 12, 2024
Contributor

openshift-ci bot commented Apr 12, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pacevedom, pmtk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit fcf56a9 into openshift:main Apr 12, 2024
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR

7 participants