pmtk
Member

@pmtk pmtk commented May 14, 2025

Removed packaging/observability/opentelemetry-collector.yaml as it should be equivalent to the large example (see openshift/openshift-docs#92758 (comment)).

packaging/observability/opentelemetry-collector-large.yaml is copied as the default config to /etc/microshift/observability/opentelemetry-collector.yaml. It is copied rather than symlinked to avoid a problematic scenario such as:

  • User edits opentelemetry-collector.yaml, which is a symlink to opentelemetry-collector-large.yaml.
  • User upgrades MicroShift; opentelemetry-collector-large.yaml is overwritten, and so is the user's configuration.
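
The scenario above can be reproduced in a throwaway directory. The following is a minimal sketch, not the packaging logic itself: the upgrade is simulated by simply rewriting the preset file, and the names symlinked.yaml, copied.yaml, and user_setting are hypothetical stand-ins chosen for the demo.

```shell
#!/bin/sh
# Sketch: why a symlinked default config loses user edits on upgrade.
# Everything happens in a temp directory; only the preset file name
# mirrors the PR, the other names are made up for illustration.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# "Package-owned" preset, as it would be shipped.
printf 'preset: large\n' > "$workdir/opentelemetry-collector-large.yaml"

# Variant 1: the default config is a symlink to the preset.
ln -s opentelemetry-collector-large.yaml "$workdir/symlinked.yaml"
# Variant 2: the default config is an independent copy of the preset.
cp "$workdir/opentelemetry-collector-large.yaml" "$workdir/copied.yaml"

# The user customizes both variants.
printf 'user_setting: true\n' >> "$workdir/symlinked.yaml"  # writes through to the preset
printf 'user_setting: true\n' >> "$workdir/copied.yaml"

# Simulated upgrade: the package overwrites its own preset file.
printf 'preset: large\n' > "$workdir/opentelemetry-collector-large.yaml"

# Count how many user edits survived in each variant.
symlink_edits=$(grep -c 'user_setting' "$workdir/symlinked.yaml" || true)
copy_edits=$(grep -c 'user_setting' "$workdir/copied.yaml" || true)
echo "edits surviving in symlinked config: $symlink_edits"
echo "edits surviving in copied config:    $copy_edits"
```

In this sketch the edit made through the symlink disappears with the preset, while the copied file keeps it.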

Presets/examples description updates:

  • Changed Container, Pod, Volume, and Node metrics to avoid the interpretation that the Pod's /metrics are also included
  • Removed (Warnings only) regarding events - could not find any configuration or logic in the receiver's source code related to filtering event types
  • Added explicit priority: info (which is the default value) to the journald receiver to match the description

Changed the OTEL backend vars to OTEL_BACKEND to avoid confusion when modifying the configuration. The kubelet-stats receiver's endpoint should not be changed, but if it is named the same as the backend vars, it is easy to bulk-replace it by mistake or to misinterpret what needs reconfiguring.
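
The bulk-replace hazard can be illustrated with a small sketch. The YAML fragments below are illustrative only and are not the shipped MicroShift config: the kubeletstats port 10250 and the backend hostname otel.example.com are assumptions made for the demo.

```shell
#!/bin/sh
# Sketch: bulk-replacing a shared variable name also rewrites the receiver.
# The fragments are hypothetical; port 10250 and otel.example.com are
# assumptions, not values taken from the PR.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Before the rename: receiver and exporter reference the same variable.
cat > "$workdir/shared.yaml" <<'EOF'
receivers:
  kubeletstats:
    endpoint: ${env:K8S_NODE_NAME}:10250
exporters:
  otlp:
    endpoint: ${env:K8S_NODE_NAME}:4317
EOF

# After the rename: the exporter has its own OTEL_BACKEND variable.
cat > "$workdir/distinct.yaml" <<'EOF'
receivers:
  kubeletstats:
    endpoint: ${env:K8S_NODE_NAME}:10250
exporters:
  otlp:
    endpoint: ${env:OTEL_BACKEND}:4317
EOF

# A user points "the endpoint" at a backend with a global search-and-replace.
sed 's/\${env:K8S_NODE_NAME}/otel.example.com/g' "$workdir/shared.yaml" > "$workdir/shared.edited.yaml"
sed 's/\${env:OTEL_BACKEND}/otel.example.com/g' "$workdir/distinct.yaml" > "$workdir/distinct.edited.yaml"

# Count rewritten endpoints and check that the receiver kept its variable.
shared_hits=$(grep -c 'otel.example.com' "$workdir/shared.edited.yaml" || true)
distinct_hits=$(grep -c 'otel.example.com' "$workdir/distinct.edited.yaml" || true)
receiver_intact=$(grep -c 'K8S_NODE_NAME' "$workdir/distinct.edited.yaml" || true)
echo "endpoints rewritten with a shared name:   $shared_hits"
echo "endpoints rewritten with distinct names:  $distinct_hits"
echo "receiver lines still using K8S_NODE_NAME: $receiver_intact"
```

With a shared name both endpoints are rewritten, including the receiver that should stay untouched; with a dedicated OTEL_BACKEND name only the exporter changes.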

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference (indicates that this PR references a valid Jira ticket of any type) and jira/valid-bug (indicates that a referenced Jira bug is valid for the branch this PR is targeting) labels May 14, 2025
@openshift-ci-robot

@pmtk: This pull request references Jira Issue OCPBUGS-56157, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jogeo

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from jogeo, pacevedom and vanhalenar May 14, 2025 06:45
@openshift-ci openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) May 14, 2025
@pmtk
Member Author

pmtk commented May 14, 2025

/hold

Thinking about changing variable names to avoid confusion

@openshift-ci openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) May 14, 2025
@pmtk pmtk force-pushed the otel/default-symlink-to-large branch from 560b581 to 0626b99 Compare May 14, 2025 11:04
@pmtk
Member Author

pmtk commented May 14, 2025

/retest

@pmtk
Member Author

pmtk commented May 14, 2025

/retest

@pmtk
Member Author

pmtk commented May 15, 2025

/retest

@pmtk
Member Author

pmtk commented May 15, 2025

/test e2e-aws-tests-bootc-arm

@openshift-ci-robot

@pmtk: This pull request references Jira Issue OCPBUGS-56157, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jogeo


@pmtk
Member Author

pmtk commented May 16, 2025

standard1 test taking too long due to recent router test suite size increase
/retest

endpoint: ${env:K8S_NODE_NAME}:4317
# Endpoint must point to an IP or hostname and port of an OTLP backend service.
# Here, the OTEL_BACKEND env var is used. It should be changed to point to the backend.
# If left unchanged and undefined, it will be resolved to localhost.
Contributor

I'm curious, where is the default value of ${OTEL_BACKEND} set?

Member Author

@pmtk pmtk May 16, 2025

Nowhere, actually. But it doesn't cause an error; it just evaluates to an empty string, which is later replaced with localhost:

2025-05-16T09:08:10.316+0200        warn        envprovider/provider.go:51        Configuration references unset environment variable        {"name": "OTEL_BACKEND"}
...
2025-05-16T09:08:10.359+0200        warn        zapgrpc/zapgrpc.go:193        [core] [Channel #1 SubChannel #3]grpc: addrConn.createTransport failed to connect to {Addr: "[::1]:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp [::1]:4317: connect: connection refused"

So the user can either replace it directly in the config or add an Environment= to microshift-observability.service, but I prefer the former.
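
For illustration, here is a rough sketch of both points. Shell expansion is used only as an analogy for the collector's environment provider, which is an assumption of this sketch rather than a description of its internals. The drop-in file name 10-otel-backend.conf and the hostname otel.example.com are hypothetical, and the temp directory stands in for /etc/systemd/system/microshift-observability.service.d.

```shell
#!/bin/sh
# Sketch: what an unset OTEL_BACKEND leaves behind, and one way to define it.
set -eu

# 1. An unset variable expands to an empty string, so only the port remains.
unset OTEL_BACKEND
endpoint="${OTEL_BACKEND:-}:4317"
echo "endpoint with unset variable: '$endpoint'"

# 2. A systemd drop-in that defines the variable for the service.
#    The temp directory is a stand-in for the real drop-in directory;
#    the file name and hostname are made up for this example.
dropin_dir=$(mktemp -d)
trap 'rm -rf "$dropin_dir"' EXIT
cat > "$dropin_dir/10-otel-backend.conf" <<'EOF'
[Service]
Environment=OTEL_BACKEND=otel.example.com
EOF
cat "$dropin_dir/10-otel-backend.conf"
```

On a real system, a drop-in like this would take effect only after `systemctl daemon-reload` and a restart of the service.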

An alternative is to use a word like backend, which would be expected to be an actual resolvable hostname, but again this doesn't cause otel-collector to exit with an error:

exporters:
  otlp:
    endpoint: "backend:4317"
2025-05-16T10:58:52.371+0200        info        exporterhelper/retry_sender.go:118        Exporting failed. Will retry the request after interval.        {"kind": "exporter", "data_type": "logs", "name": "otlp", "error": "rpc error: code = Unavailable desc = name resolver error: produced zero addresses", "interval": "2.644063999s"}

I'm open to other alternatives, anything besides reusing K8S_NODE_NAME, as it's confusing.

Contributor

@agullon agullon May 16, 2025

So OTEL internally sets localhost if the var is not defined. Fair enough.
Thinking about user experience, I'm not sure what the best approach is. I suggest adding that OTEL defaults it to localhost, but I'm OK either way.

Suggested change
# If left unchanged and undefined, it will be resolved to localhost.
# If left unchanged and undefined, OTEL will resolve it to localhost.

Member Author

Changed the comment line from

    # If left unchanged and undefined, it will be resolved to localhost.

to

    # Unless replaced in config or defined in service file, it'll be empty and OTEL will use 'localhost' instead.

WDYT?

Contributor

I prefer "resolve" rather than "use" when talking about setting localhost here: OTEL will use 'localhost' instead. But I'm not really sure what the correct verb is, because if it is not set, it cannot be resolved, so...

Apart from that, I agree with your proposal.

Member Author

When I think of "resolve" I think of DNS, and there's no DNS here. Maybe someone else will chime in with ideas.

@agullon
Contributor

agullon commented May 16, 2025

this PR looks good to me

@pmtk pmtk force-pushed the otel/default-symlink-to-large branch from 84cc8e5 to 1dd20c5 Compare May 16, 2025 11:26
@pmtk
Member Author

pmtk commented May 16, 2025

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) May 16, 2025
@copejon
Contributor

copejon commented May 19, 2025

Looks like the observability rewrite of the otel config is tripping the rpm -V --nomtime checks run by other tests in the suite.

@agullon
Contributor

agullon commented May 20, 2025

Looks like the observability rewrite of the otel config is tripping the rpm -V --nomtime checks run by other tests in the suite.

I also saw this error/issue in QE regression tests for 4.19 RC. Wondering if we should exclude configuration files from the rpm -V checks. wdyt?

@openshift-merge-robot openshift-merge-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) May 20, 2025
@pmtk
Member Author

pmtk commented May 20, 2025

Looks like the observability rewrite of the otel config is tripping the rpm -V --nomtime checks run by other tests in the suite.

I also saw this error/issue in QE regression tests for 4.19 RC. Wondering if we should exclude configuration files from the rpm -V checks. wdyt?

Looks like it loses the EOF newline; I'll just copy it back from the large preset.

$ diff opentelemetry-collector.yaml opentelemetry-collector-large.yaml
127c127
<                 protocol: http/protobuf
\ No newline at end of file
---
>                 protocol: http/protobuf
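
A missing final newline like the one in the diff above can be detected without diffing against a reference file. The following is a minimal sketch; the file names and the helper `ends_with_newline` are hypothetical and exist only for this demo.

```shell
#!/bin/sh
# Sketch: detect whether a file ends with a newline character.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Two files with the same text; only the second ends with a newline.
printf 'protocol: http/protobuf'   > "$workdir/without-newline.yaml"
printf 'protocol: http/protobuf\n' > "$workdir/with-newline.yaml"

# tail -c 1 prints the last byte; wc -l counts newline characters, so the
# count is 1 when the file ends with a newline and 0 otherwise.
ends_with_newline() {
    [ "$(tail -c 1 "$1" | wc -l | tr -d ' ')" -eq 1 ]
}

if ends_with_newline "$workdir/without-newline.yaml"; then without_ok=yes; else without_ok=no; fi
if ends_with_newline "$workdir/with-newline.yaml"; then with_ok=yes; else with_ok=no; fi

# The files differ by a single byte, which is enough to change size and checksum.
if cmp -s "$workdir/without-newline.yaml" "$workdir/with-newline.yaml"; then same=yes; else same=no; fi

echo "without-newline.yaml ends with newline: $without_ok"
echo "with-newline.yaml ends with newline:    $with_ok"
echo "files identical: $same"
```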

@pmtk pmtk force-pushed the otel/default-symlink-to-large branch from 1dd20c5 to 592be3b Compare May 20, 2025 11:35
@pmtk pmtk force-pushed the otel/default-symlink-to-large branch from 592be3b to ec65b98 Compare May 20, 2025 11:35
@openshift-merge-robot openshift-merge-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) May 20, 2025
@pmtk pmtk force-pushed the otel/default-symlink-to-large branch from ec65b98 to 1ae4110 Compare May 20, 2025 13:29
@copejon
Contributor

copejon commented May 20, 2025

Looks like the observability rewrite of the otel config is tripping the rpm -V --nomtime checks run by other tests in the suite.

I also saw this error/issue in QE regression tests for 4.19 RC. Wondering if we should exclude configuration files from the rpm -V checks. wdyt?

Yeah, it's probably a good idea. @pmtk @pacevedom what do y'all think?

@pmtk
Member Author

pmtk commented May 21, 2025

/test e2e-aws-tests-bootc

@pmtk
Member Author

pmtk commented May 21, 2025

Looks like the observability rewrite of the otel config is tripping the rpm -V --nomtime checks run by other tests in the suite.

I also saw this error/issue in QE regression tests for 4.19 RC. Wondering if we should exclude configuration files from the rpm -V checks. wdyt?

Yeah, it's probably a good idea. @pmtk @pacevedom what do y'all think?

I fixed the test to copy back the correct config example, so it should be good now - no need.

Contributor

openshift-ci bot commented May 21, 2025

@pmtk: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@pmtk
Member Author

pmtk commented May 21, 2025

/cherrypick release-4.19

@openshift-cherrypick-robot

@pmtk: once the present PR merges, I will cherry-pick it on top of release-4.19 in a new PR and assign it to you.

@ggiguash
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) May 22, 2025
Contributor

openshift-ci bot commented May 22, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ggiguash, pmtk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 57ff22b into openshift:main May 22, 2025
10 checks passed
@openshift-ci-robot

@pmtk: Jira Issue OCPBUGS-56157: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-56157 has been moved to the MODIFIED state.

@openshift-cherrypick-robot

@pmtk: new pull request created: #4954
