
Commit 6c0e6f0

Author: Adrian Fernandez De La Torre

Update RFC-0011 based on the discussion had on Flux Dev Meeting 2025-05-07

1 parent 9ce7d02 · commit 6c0e6f0

1 file changed: rfcs/0011-opentelemetry-tracing/README.md (+77 additions, −84 deletions)

**Creation date:** 2025-04-24

**Last update:** 2025-05-15

## Summary
The aim is to collect traces via OpenTelemetry (OTel) across all Flux-related objects, such as HelmReleases and Kustomizations, among others. These traces may be sent to a tracing provider, where they can be stored and visualized. Flux takes no responsibility for storing or visualizing them; it remains completely stateless. To keep this seamless for the user, the implementation will be part of the already existing `Alert` API type: `EventSources` selects the events belonging to the specified sources, which are then looked up and sent out to the configured `Provider`. In this way, it can facilitate the observability and monitoring of Flux-related objects.

## Motivation
This RFC was born out of a need for end-to-end visibility into Flux's multi-controller GitOps workflow. At one time Flux was a single monolithic controller; it has since split into several specialized controllers (source-, kustomize-, helm-, notification-controller, etc.), which makes tracing the path of a single "Source change → applied resource → notification" much harder. Additionally, with this feature users do not have to implement and maintain extra tooling or sidecars of their own.

Correlate a Git commit with all downstream actions. You want one single trace that (via multiple spans) shows:
- The Alert reference, based on a unique ID (root span).
- Any source pulling new content, based on a new digest checksum.
- Any subsequent reconciliations that ran.
- Events emitted and notifications sent by the notification-controller.

On top of this, custom UIs can be built that surface trace timelines alongside Git commits or Docker image tags, so operators can answer "what exactly happened when I tagged v1.2.3?" in a single pane of glass.

### Goals
- **End-to-end GitOps traceability:** Capture the traces that follow "a Git change" (any source change) through all Flux controllers to simplify debugging and root-cause analysis.
- **Declarative, CRD-driven configuration:** Reuse the concept of `Alerts` so this capability works out of the box: users link `EventSources` to the `Provider` the traces will be sent to.
- **Notification-controller as the trace collector:** Leverage the notification-controller's existing event-watching pipeline to ingest reconciliation events and turn them into OpenTelemetry spans, which are forwarded to an OTLP-compatible backend (the `Provider`).
- **Cross-controller span correlation:** Ensure spans emitted by multiple stateless controllers can be stitched together into a single trace by using Flux's "revision" annotation.

### Non-Goals
- **Not a full tracing backend:** We won't build or bundle a storage/visualization system. Users still have to rely on an external collector for long-term retention, querying, and UI.
- **Not automatic instrumentation of user workloads:** This integration only captures Flux controller events (Source, Kustomize, Helm, etc.). It won't auto-inject spans into your application pods or third-party controllers running in the same cluster.
- **Not a replacement for metrics or logs:** Flux's existing Prometheus metrics and structured logging remain the primary way to monitor performance and errors. Tracing is purely for request-flow visibility, not for time-series monitoring or log aggregation.
- **No deep code-level spans beyond CRUD events:** Spans will be emitted around high-level reconciliation steps (e.g. "reconcile GitRepository", "dispatch Notification"), but we're not aiming to instrument every internal function call or library method within each controller.
- **Not a service mesh integration:** Tying this into Istio, Linkerd, or other mesh-sidecar approaches is out of scope. It's strictly a controller-driven, CRD-based model.
- **No per-span custom enrichment beyond basic metadata:** At least initially, it won't support complex span attributes or tag-enrichment rules. You may have to handle those in your downstream collector/processor if needed.
- **Not a replacement for user-driven OpenTelemetry SDKs:** If you already have a Go-based operator that embeds OpenTelemetry's SDK directly, this feature won't override or duplicate that. Think of it as a complementary, declarative layer for Flux controllers.

## Proposal
The implementation will extend the notification-controller with OpenTelemetry tracing capabilities by leveraging the existing Alert API object model. This approach maintains Flux's declarative configuration paradigm while adding powerful distributed tracing functionality.

### Core Implementation Strategy

1. **Extend the notification-controller:** Add OpenTelemetry tracing support to the notification-controller, which already has visibility into events across the Flux ecosystem (a minimal setup sketch follows this list).
2. **Leverage the existing Alert CRD structure:** Use the Alert Kind API object as the configuration entry point, where:
   - `EventSources` define which Flux resources to trace (GitRepositories, Kustomizations, HelmReleases, etc.).
   - `Provider` specifies where to send the trace data (Jaeger, Tempo, or other OpenTelemetry-compatible backends).
3. **Span generation and correlation:** Generate spans for each reconciliation event from watched resources, ensuring proper parent-child relationships and context propagation using Flux's revision annotations as correlation identifiers.
4. **Provider compatibility and fallback mechanism:** The implementation supports any provider that implements the OpenTelemetry Protocol (OTLP). When traces are sent to OTLP-compatible providers (like Jaeger or Tempo), they are transmitted as proper OpenTelemetry spans. For non-OTLP providers, the system gracefully degrades by logging trace information as structured warnings in the notification-controller logs, ensuring no alerting functionality is disrupted. This approach maintains system stability while encouraging the use of proper tracing backends.
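
A minimal sketch of what point 1 could look like, assuming the OpenTelemetry Go SDK with its OTLP/HTTP exporter is used inside the notification-controller; the `newTracerProvider` helper, the example endpoint and the service name are illustrative assumptions, not part of the proposal:

```go
package tracing

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// newTracerProvider wires an OTLP/HTTP exporter pointing at the address taken
// from the referenced Provider object and registers it globally.
func newTracerProvider(ctx context.Context, endpoint string) (*sdktrace.TracerProvider, error) {
    exporter, err := otlptracehttp.New(ctx,
        otlptracehttp.WithEndpoint(endpoint), // e.g. "jaeger-collector.jaeger-system.svc.cluster.local:4318" (hypothetical)
        otlptracehttp.WithInsecure(),         // TLS/auth would instead come from the Provider's secretRef
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("notification-controller"),
        )),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}
```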

This approach allows users to declaratively configure tracing using familiar Flux patterns, without requiring code changes to their applications or additional sidecar deployments. The notification-controller will handle the collection, correlation, and forwarding of spans to the configured tracing backend.

Example Configuration:
```yaml
# Define a tracing provider
apiVersion: notification.toolkit.fluxcd.io/v1beta1
kind: Provider
metadata:
  name: jaeger
  namespace: default
spec:
  type: jaeger
  address: http://jaeger-collector.jaeger-system.svc.cluster.local:9411 # Provider endpoint
  secretRef:
    name: jaeger-secret # Optional authentication
---
# Configure an alert (includes the tracing out-of-the-box)
apiVersion: notification.toolkit.fluxcd.io/v1beta1
kind: Alert
metadata:
  name: webapp-tracing
  namespace: default
spec:
  providerRef:
    name: jaeger
  eventSources:
    - kind: GitRepository # Source controller resources
      name: webapp-source
    - kind: Kustomization # Kustomize controller resources
      name: webapp-backend
    - kind: HelmRelease # Helm controller resources
      name: webapp-frontend
```
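
As an illustration only, each event matching one of the `eventSources` above could be turned into an OpenTelemetry span roughly along these lines; the `Event` struct is a simplified stand-in for Flux's events API payload, and the attribute keys are placeholder assumptions rather than an agreed convention:

```go
package tracing

import (
    "context"
    "fmt"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

// Event is a simplified stand-in for the Flux events API payload; only the
// fields relevant to span creation are shown.
type Event struct {
    Kind      string            // e.g. "Kustomization"
    Name      string            // e.g. "webapp-backend"
    Namespace string            // e.g. "default"
    Reason    string            // e.g. "ReconciliationSucceeded"
    Metadata  map[string]string // carries the revision reported by the emitting controller
    Timestamp time.Time
}

// handleEvent turns one reconciliation event into one span; the revision is
// attached as an attribute so spans from different controllers can be correlated.
func handleEvent(ctx context.Context, ev Event) {
    tracer := otel.Tracer("notification-controller")

    _, span := tracer.Start(ctx,
        fmt.Sprintf("reconcile %s/%s", ev.Kind, ev.Name),
        trace.WithTimestamp(ev.Timestamp),
        trace.WithAttributes(
            attribute.String("flux.object.kind", ev.Kind),
            attribute.String("flux.object.name", ev.Name),
            attribute.String("flux.object.namespace", ev.Namespace),
            attribute.String("flux.event.reason", ev.Reason),
            attribute.String("flux.revision", ev.Metadata["revision"]),
        ),
    )
    // Ended immediately in this sketch; a real implementation would derive the
    // span's duration from the event it describes.
    span.End()
}
```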

Based on this configuration, the notification-controller will:
- Watch for events from the specified resources.
- Generate OpenTelemetry spans for each reconciliation event, as sketched above.
- Correlate spans across controllers using Flux's revision annotations.
- Forward the spans to the configured Jaeger endpoint (the `Provider`).

This implementation maintains Flux's stateless design principles while providing powerful distributed tracing capabilities that help users understand and troubleshoot their GitOps workflows.

### Alternatives
<!--
List plausible alternatives to the proposal and explain why the proposal is superior.

This is a good place to incorporate suggestions made during discussion of the RFC.
-->

## Design Details

### Trace Identity and Correlation

A key challenge in distributed tracing is establishing a reliable correlation mechanism that works across multiple controllers in a stateless, potentially unreliable environment. Our solution addresses this with a robust span identification strategy.

The root span ID is generated using a deterministic approach that combines:
- **Alert Object UID** (guaranteed unique by Kubernetes across all clusters).
- **Source's revision ID** (extracted from event payloads).

These values are concatenated and passed through a configurable checksum algorithm (SHA-256 by default). This approach ensures:
- Globally unique trace identifiers across multi-tenant and multi-cluster environments.
- Consistent trace correlation even when events arrive out of order.
- Reliable identification of the originating source event.

Example:
```yaml
# Input values
Alert UID: "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
Source Revision: "sha256:2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae"

# Concatenated value
"a1b2c3d4-e5f6-7890-abcd-ef1234567890(<Alert-UID>):sha256:2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae(<source-revision>)"

# Apply SHA-256 (default algorithm)
Root Span ID: "f7846f55cf23e14eebeab5b4e1550cad5b509e3348fbc4efa3a1413d393cb650"
```

When events occur in the system:
1. A GitRepository reconciliation event with revision "sha256:2c26..." is captured by the notification-controller, which creates the root span with ID "f7846f55...".
2. A Kustomization acting on that source emits another event with the same revision, which creates a child span linked to "f7846f55...".
3. A HelmRelease event with the same revision creates another child span.
4. All spans are collected into a single trace viewable in the tracing backend.
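
To make the derivation concrete, here is a minimal Go sketch of how the deterministic root identifier could be computed with the default SHA-256 algorithm; truncating the 32-byte digest to the 16 bytes of an OpenTelemetry trace ID is an implementation assumption, not something the RFC prescribes:

```go
package tracing

import (
    "crypto/sha256"
    "fmt"

    "go.opentelemetry.io/otel/trace"
)

// deterministicTraceID derives a stable OpenTelemetry trace ID from the Alert
// object UID and the source revision, so every controller's events for the
// same revision land in the same trace.
func deterministicTraceID(alertUID, revision string) trace.TraceID {
    sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%s", alertUID, revision)))

    // OTel trace IDs are 16 bytes; keep the first half of the digest (assumption).
    var tid trace.TraceID
    copy(tid[:], sum[:16])
    return tid
}
```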

### Resilient Span Management

The design accounts for the distributed nature of Flux controllers and the potential delays and downtime that a distributed system always implies:
- **Asynchronous Event Processing:** Since events may arrive in any order due to the distributed nature of Flux controllers, the system doesn't assume sequential processing. Each event can independently locate its parent span or create a new root span as needed.
- **Fault Tolerance:** If the notification-controller experiences downtime or latency issues, it implements a recovery mechanism (sketched at the end of this section):
  - When processing an event, it first attempts to locate the existing root span based on the calculated ID.
  - If found, it attaches the new span as a child to maintain the trace hierarchy.
  - If not found (due to previous failures or out-of-order processing), it automatically creates a new root span.
- **Span Hierarchy Maintenance:** All subsequent spans related to the same revision are properly attached to their parent spans, creating a coherent trace visualization regardless of when events are processed.

This design ensures trace continuity even in challenging distributed environments while maintaining Flux's core principles of statelessness and resilience.
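
Building on the `deterministicTraceID` sketch above, the lookup-or-create behaviour can stay completely stateless: because the root's trace ID (and a span ID derived the same way) are reproducible from the Alert UID and the revision, a child span can be parented onto a reconstructed remote span context without the controller keeping anything in memory. The helper names and the span-ID derivation are assumptions for illustration only:

```go
package tracing

import (
    "context"
    "crypto/sha256"
    "fmt"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/trace"
)

// deterministicSpanID mirrors deterministicTraceID but yields the 8-byte span
// ID assumed for the deterministic root span (last 8 bytes of the digest).
func deterministicSpanID(alertUID, revision string) trace.SpanID {
    sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%s", alertUID, revision)))
    var sid trace.SpanID
    copy(sid[:], sum[24:])
    return sid
}

// childSpanForEvent parents a new span onto the deterministic root span of the
// trace identified by (alertUID, revision), rebuilding the parent context on
// the fly so no state has to survive controller restarts or out-of-order events.
func childSpanForEvent(ctx context.Context, alertUID, revision, name string) (context.Context, trace.Span) {
    parent := trace.NewSpanContext(trace.SpanContextConfig{
        TraceID:    deterministicTraceID(alertUID, revision),
        SpanID:     deterministicSpanID(alertUID, revision),
        TraceFlags: trace.FlagsSampled,
        Remote:     true,
    })

    tracer := otel.Tracer("notification-controller")
    return tracer.Start(trace.ContextWithSpanContext(ctx, parent), name)
}
```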

## Implementation History