Fix silent drop of ztunnel counter metrics in Istio ambient mode#23707
Conversation
Ztunnel emits the modern OpenMetrics counter convention (TYPE declared with the base name, samples emitted with the _total suffix). The integration's ztunnel sub-scraper was routed through the legacy prometheus_client parser, which does not add _total to a counter's allowed sample names and yields each sample as an untyped singleton. Combined with construct_metrics_config stripping _total from registration keys, every ztunnel counter was silently dropped (failure visible only at DEBUG level). Force use_latest_spec=True on the ztunnel sub-scraper so the OpenMetrics parser is used. Scoped to ztunnel only; waypoint (Envoy, legacy convention) and istiod (Go client_golang, legacy convention) are unaffected. The user can still opt out by setting use_latest_spec: false on the instance. Adds an ambient e2e environment (py3.13-1.24-ambient) that installs Istio 1.24.3 with the ambient profile, deploys bookinfo, applies a waypoint, and runs a traffic generator so ztunnel counters are non-zero. The unit test fixture is now the real output captured from ztunnel running in that environment; the prior hand-written 1.5/ztunnel.txt fixture used a non-realistic TYPE convention and is removed. Adds test_ambient_ztunnel_legacy_parser_drops_counters that pins the broken behavior under use_latest_spec=false so a future regression that removes the default cannot reintroduce the silent drop.
Codecov Report❌ Patch coverage is Additional details and impacted files🚀 New features to boost your workflow:
|
🎉 All green!🧪 All tests passed 🔗 Commit SHA: 50f5109 | Docs | Datadog PR Page | Give us feedback! |
In CI, ddev env test runs the dd_environment fixture's setup and teardown in different pytest invocations. The DDEV_E2E_ENV_* env vars that map per-port-forward TempDir names to actual paths do not survive that transition, so KillProcess opens a freshly mkdtemp'd path and raises FileNotFoundError on kubectl.pid lookup. The kubectl port-forward processes are already dying on their own because kind delete cluster has removed the API server they were forwarding to, so swallowing the missing-pid-file error is safe. Wrap port_forward in a safe_port_forward context manager that suppresses FileNotFoundError on exit. Sidecar (single port-forward) is not affected; ambient (three port-forwards) hits this every run.
The previous attempt port-forwarded pod/ztunnel-<random-suffix>. The dd_environment fixture runs in two pytest invocations under ddev env test (setup and teardown), and on the teardown invocation _get_first_ztunnel_pod runs against a kind cluster that has already been deleted, returning an empty string. port_forward then builds a different TempDir key than setup, the env-var bridge misses, and KillProcess opens an empty mkdtemp'd dir, raising FileNotFoundError on kubectl.pid. Other integrations like kuma avoid this by port-forwarding a Service (stable name). Create a Service in setup_istio_ambient via `kubectl expose daemonset ztunnel`, port-forward service/ztunnel-metrics. Drop the _get_first_- ztunnel_pod helper and the safe_port_forward workaround.
kubectl expose does not support DaemonSets (error: cannot expose a DaemonSet.apps). The previous attempt to use kubectl expose silently failed and the port-forward of service/ztunnel-metrics then connected to nothing, surfacing as 'Connection refused' on the agent side. Apply a tracked Service manifest at tests/kind/ztunnel_service.yaml instead.
The matrix previously folded mode into a single version-like dimension (version = ["1.13", "1.24-ambient"]), which scales poorly: adding ambient support for a new Istio version meant introducing a new "1.xx-ambient" pseudo-version alongside the plain "1.xx" entry. Split into orthogonal dimensions: version = ["1.13"] mode = ["sidecar"] # block 1 version = ["1.24"] mode = ["ambient"] # block 2 Two matrix blocks express the constraint that 1.13 does not support ambient (it was GA in 1.24). Adding a future version is a one-line change to the relevant block; adding a new mode on an existing version is a one-line change too once the matching setup function is in place. Env names become py3.13-1.13-sidecar and py3.13-1.24-ambient, with ISTIO_VERSION and ISTIO_MODE env vars driven independently from the two matrix axes. conftest's fallthrough switches to `else` (i.e. the sidecar branch) to match the new matrix semantics rather than gating on a hard-coded VERSION string.
Migrate setup_istio off the baked IstioOperator manifest and onto `istioctl install --set profile=demo`, which is version-agnostic. Drop the now-unused 1.13-only manifest and extend the sidecar matrix to also cover 1.24, so the integration is exercised end-to-end against a modern Istio release in sidecar mode.
A fresh Istio 1.24 install on bookinfo emits istio.galley.validation.passed but not the .failed variant (no validation errors occur). The .passed.count v2 sibling was also missing from the list. Move both pairs into the intermittent list so the e2e tests pass on a clean cluster.
The Go runtime metrics istio.go.memstats.gc_cpu_fraction and lookups_total are no longer emitted by Istio 1.24's binary (deprecated in Go 1.20 and later removed from client_golang's default exposition), but they are still emitted by Istio 1.13. Calling them "intermittent" on 1.13 silently hid any collection bug there. Pull them out into a LEGACY_GO_METRICS set that is strictly asserted on 1.13 envs and skipped on 1.24+. Genuinely-conditional metrics (validation failures, listener conflicts, sidecar injection events, etc.) remain in INTERMITTENT_METRICS and are asserted at_least=0 on every version.
- test_e2e_ambient: replace the dead at_least=0 conditional on ISTIOD_V2_METRICS with the existing _assert_istiod_metric helper so ambient mode also strictly asserts non-intermittent istiod metrics. - check.py: log a warning when the user opts out of the OpenMetrics parser (use_latest_spec: false) while ztunnel is being scraped. Comment the asymmetry between namespace (always restored) and use_latest_spec (user-overridable default) inside _generate_config. - conftest.py: replace the hard-coded 15-second sleep after the traffic generator pod is created with a 300-second polling loop that waits for ztunnel to report a non-zero TCP connection counter. - conftest.py: return osx-amd64 (not osx) for intel macOS in the Istio release suffix helper so local-dev downloads do not 404. - changelog: rephrase to name the affected version (9.4.0) so customers searching the changelog for the regression window can find this fix.
- _wait_for_ztunnel_traffic was exec'ing into the ztunnel pod, which runs a minimal Rust binary without curl. Move the kubectl exec into the traffic-gen pod (curlimages/curl) and target the ztunnel-metrics Service applied earlier in setup. Capture stderr instead of leaking it through CI logs. - Extend the matrix with Istio 1.29 (the current supported release as of 2026-05-18) in both sidecar and ambient mode. 1.24 stays since that is where ambient GA'd; 1.13 stays for the legacy Go-runtime safety net.
- check.py: use is_affirmative for the use_latest_spec opt-out guard
so YAML string booleans ("false") also trigger the warning.
- check.py: replace use_latest_spec_default kwarg with a keyword-only
scraper_defaults mapping; precedence (defaults -> instance -> per-
scraper restore) is now visible at the signature without an inline
comment.
- test_unit_istio_v2: assert directly that the ztunnel scraper config
carries use_latest_spec=True, so a refactor that drops the default
fails on shape rather than on fixture parsing.
- test_e2e: treat unparseable ISTIO_VERSION as legacy so misconfigured
local runs fail loudly instead of silently skipping LEGACY_GO_METRICS.
- conftest: justify the per-line ValueError skip in _ztunnel_has_traffic
with an inline comment.
- check.py: restore the invariant comment next to `config.update` so the namespace-must-follow-update rule is visible in code shape. - metadata.csv: rename the four ztunnel TCP counters from .total to .count to match the names the OpenMetrics V2 transformer actually submits (construct_metrics_config strips the _total suffix). This closes the metadata loop on the same regression class the PR fixes. - test_unit_istio_v2: add assert_metrics_using_metadata to the ztunnel metrics test so future mismatches between metrics.py, the transformer, and metadata.csv fail fast. - conftest: kubectl wait for traffic-gen Ready before polling so the 300s budget is spent on the actual wait-for-traffic signal instead of on "container not found" exec failures while the image pulls. - conftest: capture the last non-zero exec stderr and surface it in the RuntimeError so CI failures point at the real cause instead of a generic timeout message.
- Raise on the kubectl wait for traffic-gen so a timeout / pod-schedule failure surfaces immediately instead of being swallowed and then misattributed by the 300s polling loop downstream. - Build the ztunnel /stats/prometheus URL from the existing module constants so a future port or service rename touches one place. - Capture the last stdout excerpt on iterations where curl exits 0 but ztunnel reports no traffic (-sm 5 does not pass -f, so HTTP 4xx/5xx bodies land in stdout); include it in the final RuntimeError so the failure points at the real cause rather than reading "<none>".
The OpenMetrics V2 transformer strips _total from the registered name
and appends .count for counter-type metrics, so eight more ztunnel
counter rows in metadata.csv (DNS, on_demand_dns, xds.connection_
terminations, connection.{opens,closes,termination}) were declared
under the wrong suffix and went orphan once the V2 parser engaged.
Basenames already match what metrics.py registers and what ztunnel
1.24 emits, so the rename is mechanical.
The four remaining .total rows (proxies_{started,stopped}, active/
pending_proxy_count) have wrong basenames at the source — ztunnel
1.24 emits these under workload_manager_* — and need their
registration corrected in a follow-up, not a suffix-only rewrite.
The four in-pod proxy management metrics were registered under istio_proxies_started_total, istio_active_proxy_count_total etc., but ztunnel 1.24 emits them under the workload_manager_* family. Two more counters ztunnel emits (istio_xds_message_total and istio_xds_message_bytes_total) were not registered at all. Realign the registrations against the captured ztunnel exposition, update the four corresponding metadata.csv rows (correct suffix per gauge/counter type), add metadata entries for the two new xds counters, and expand V2_ZTUNNEL_METRICS so the unit and e2e tests assert all nine ztunnel metrics that the integration is now collecting. Split V2_ZTUNNEL_METRICS into counter / gauge sub-lists so the legacy-parser regression test pins the broken behavior on the counter set only (gauges are not affected by the parser bug).
- conftest: drop the seven-line docstring on the private _wait_for_ztunnel_traffic helper for a one-liner per AGENTS.md, with a short inline comment at the call site explaining the curl-from-traffic-gen indirection. - test_unit_istio_v2: assert in the happy-path ztunnel test that the opt-out warning stays silent, so a future change that emits it unconditionally fails the test instead of shipping noise. - common.py: collapse the multi-line block introducing the V2_ZTUNNEL_COUNTER_METRICS / V2_ZTUNNEL_GAUGE_METRICS split into two one-line comments.
- check.py: replace the four-line block above the ztunnel if-block with a single line. The scraper_defaults kwarg and the opt-out warning already explain the why. - common.py: replace the wrapped comment above V2_ZTUNNEL_COUNTER_METRICS with a single line that names the format-level reason (TYPE base name + _total samples) instead of the PR-narrative "before this PR's fix" framing.
These three counter registrations point at metric names ztunnel never emits. A search across the ztunnel source confirms only the tcp_connections_* and xds_connection_terminations families exist; the istio_connection_opens_total / closes_total / termination_total entries were dead from the start and carried matching orphan rows in metadata.csv. Removing both.
Review from joepeeples is dismissed. Related teams and files:
- documentation
- istio/metadata.csv
|
The following files, which will be shipped with the agent, were modified in this PR and You can ignore this if you are sure the changes in this PR do not require QA. Otherwise, List of modified files that will be shipped with the agent |
|
Non-blocking since it's assets - we should probably update assets and the README for ambient mode. |
nubtron
left a comment
There was a problem hiding this comment.
We should make the ambient settings in the spec as fleet_configurable.
nubtron
left a comment
There was a problem hiding this comment.
Downgraded adding "fleet_configurable" to non-blocking as well, since it's an asset.
PR #22581 introduced ambient mode but never updated the docs or spec metadata. The three instance options it added (istio_mode, ztunnel_endpoint, waypoint_endpoint) were the only instance-level settings in the spec without fleet_configurable: true, so customers managing Istio from Datadog Fleet Automation could remote-configure every Istio setting except those three. The README similarly had no mention of ambient mode at all. - Add fleet_configurable: true to the three ambient settings in assets/configuration/spec.yaml; regenerate config_models. - Add an "Ambient mode configuration" subsection to the README under Metric collection, showing the ztunnel annotation pattern and pointing at waypoint as the L7 option.
Review from nubtron is dismissed. Related teams and files:
- agent-integrations
- istio/README.md
- istio/assets/configuration/spec.yaml
- istio/datadog_checks/istio/config_models/init.py
- istio/datadog_checks/istio/config_models/defaults.py
- istio/datadog_checks/istio/config_models/instance.py
- istio/datadog_checks/istio/config_models/shared.py
Review from nubtron is dismissed. Related teams and files:
- agent-integrations
- istio/README.md
- istio/assets/configuration/spec.yaml
- istio/datadog_checks/istio/config_models/init.py
- istio/datadog_checks/istio/config_models/defaults.py
- istio/datadog_checks/istio/config_models/instance.py
- istio/datadog_checks/istio/config_models/shared.py
The previous wording suggested a second instance for waypoint metrics, but _parse_ambient_config reads ztunnel_endpoint, waypoint_endpoint, and istiod_endpoint from the same instance and spawns sub-scrapers for each. Replace the per-pod annotation example with a static conf.yaml covering all three components in one instance.
Port 15020 is the Istio default but configurable; mentioning specific numbers in the prose can mislead a reader running a non-default exposition. The example URLs still show the conventional defaults, which is the right place for that information since users edit the URLs directly when their setup differs.
- Spell out GA as 'generally available' on first mention. - Split the sentence joined by an em dash into two complete sentences. - Capitalize Autodiscovery (Datadog feature name).
Local ddev's model regeneration produced these files without the Datadog license header that the master branch tracks. CI's validate models check caught the drift.
Validation ReportAll 20 validations passed. Show details
|
) (#23761) * Fix silent drop of ztunnel counter metrics in Istio ambient mode Ztunnel emits the modern OpenMetrics counter convention (TYPE declared with the base name, samples emitted with the _total suffix). The integration's ztunnel sub-scraper was routed through the legacy prometheus_client parser, which does not add _total to a counter's allowed sample names and yields each sample as an untyped singleton. Combined with construct_metrics_config stripping _total from registration keys, every ztunnel counter was silently dropped (failure visible only at DEBUG level). Force use_latest_spec=True on the ztunnel sub-scraper so the OpenMetrics parser is used. Scoped to ztunnel only; waypoint (Envoy, legacy convention) and istiod (Go client_golang, legacy convention) are unaffected. The user can still opt out by setting use_latest_spec: false on the instance. Adds an ambient e2e environment (py3.13-1.24-ambient) that installs Istio 1.24.3 with the ambient profile, deploys bookinfo, applies a waypoint, and runs a traffic generator so ztunnel counters are non-zero. The unit test fixture is now the real output captured from ztunnel running in that environment; the prior hand-written 1.5/ztunnel.txt fixture used a non-realistic TYPE convention and is removed. Adds test_ambient_ztunnel_legacy_parser_drops_counters that pins the broken behavior under use_latest_spec=false so a future regression that removes the default cannot reintroduce the silent drop. * Add changelog entry * Make ambient e2e teardown robust to missing kubectl.pid In CI, ddev env test runs the dd_environment fixture's setup and teardown in different pytest invocations. The DDEV_E2E_ENV_* env vars that map per-port-forward TempDir names to actual paths do not survive that transition, so KillProcess opens a freshly mkdtemp'd path and raises FileNotFoundError on kubectl.pid lookup. The kubectl port-forward processes are already dying on their own because kind delete cluster has removed the API server they were forwarding to, so swallowing the missing-pid-file error is safe. Wrap port_forward in a safe_port_forward context manager that suppresses FileNotFoundError on exit. Sidecar (single port-forward) is not affected; ambient (three port-forwards) hits this every run. * Use a Service for ztunnel port-forward instead of a dynamic pod The previous attempt port-forwarded pod/ztunnel-<random-suffix>. The dd_environment fixture runs in two pytest invocations under ddev env test (setup and teardown), and on the teardown invocation _get_first_ztunnel_pod runs against a kind cluster that has already been deleted, returning an empty string. port_forward then builds a different TempDir key than setup, the env-var bridge misses, and KillProcess opens an empty mkdtemp'd dir, raising FileNotFoundError on kubectl.pid. Other integrations like kuma avoid this by port-forwarding a Service (stable name). Create a Service in setup_istio_ambient via `kubectl expose daemonset ztunnel`, port-forward service/ztunnel-metrics. Drop the _get_first_- ztunnel_pod helper and the safe_port_forward workaround. * Create ztunnel Service via manifest, not kubectl expose kubectl expose does not support DaemonSets (error: cannot expose a DaemonSet.apps). The previous attempt to use kubectl expose silently failed and the port-forward of service/ztunnel-metrics then connected to nothing, surfacing as 'Connection refused' on the agent side. Apply a tracked Service manifest at tests/kind/ztunnel_service.yaml instead. * Restructure matrix orthogonally: version x mode The matrix previously folded mode into a single version-like dimension (version = ["1.13", "1.24-ambient"]), which scales poorly: adding ambient support for a new Istio version meant introducing a new "1.xx-ambient" pseudo-version alongside the plain "1.xx" entry. Split into orthogonal dimensions: version = ["1.13"] mode = ["sidecar"] # block 1 version = ["1.24"] mode = ["ambient"] # block 2 Two matrix blocks express the constraint that 1.13 does not support ambient (it was GA in 1.24). Adding a future version is a one-line change to the relevant block; adding a new mode on an existing version is a one-line change too once the matching setup function is in place. Env names become py3.13-1.13-sidecar and py3.13-1.24-ambient, with ISTIO_VERSION and ISTIO_MODE env vars driven independently from the two matrix axes. conftest's fallthrough switches to `else` (i.e. the sidecar branch) to match the new matrix semantics rather than gating on a hard-coded VERSION string. * Use matrix-axis shorthand for ISTIO_MODE env var * Add Istio 1.24 to the sidecar e2e matrix Migrate setup_istio off the baked IstioOperator manifest and onto `istioctl install --set profile=demo`, which is version-agnostic. Drop the now-unused 1.13-only manifest and extend the sidecar matrix to also cover 1.24, so the integration is exercised end-to-end against a modern Istio release in sidecar mode. * Mark Istio galley validation pass/fail metrics intermittent A fresh Istio 1.24 install on bookinfo emits istio.galley.validation.passed but not the .failed variant (no validation errors occur). The .passed.count v2 sibling was also missing from the list. Move both pairs into the intermittent list so the e2e tests pass on a clean cluster. * Mark gc_cpu_fraction intermittent for Istio 1.24 * Mark memstats.lookups intermittent for Istio 1.24 * Mark all pilot.conflict variants intermittent * Split legacy Go metrics from intermittent in Istio e2e The Go runtime metrics istio.go.memstats.gc_cpu_fraction and lookups_total are no longer emitted by Istio 1.24's binary (deprecated in Go 1.20 and later removed from client_golang's default exposition), but they are still emitted by Istio 1.13. Calling them "intermittent" on 1.13 silently hid any collection bug there. Pull them out into a LEGACY_GO_METRICS set that is strictly asserted on 1.13 envs and skipped on 1.24+. Genuinely-conditional metrics (validation failures, listener conflicts, sidecar injection events, etc.) remain in INTERMITTENT_METRICS and are asserted at_least=0 on every version. * Parse Istio version numerically for legacy check * Use packaging.version for Istio version comparison * Address PR review feedback - test_e2e_ambient: replace the dead at_least=0 conditional on ISTIOD_V2_METRICS with the existing _assert_istiod_metric helper so ambient mode also strictly asserts non-intermittent istiod metrics. - check.py: log a warning when the user opts out of the OpenMetrics parser (use_latest_spec: false) while ztunnel is being scraped. Comment the asymmetry between namespace (always restored) and use_latest_spec (user-overridable default) inside _generate_config. - conftest.py: replace the hard-coded 15-second sleep after the traffic generator pod is created with a 300-second polling loop that waits for ztunnel to report a non-zero TCP connection counter. - conftest.py: return osx-amd64 (not osx) for intel macOS in the Istio release suffix helper so local-dev downloads do not 404. - changelog: rephrase to name the affected version (9.4.0) so customers searching the changelog for the regression window can find this fix. * Fix ztunnel poll target and add Istio 1.29 to the matrix - _wait_for_ztunnel_traffic was exec'ing into the ztunnel pod, which runs a minimal Rust binary without curl. Move the kubectl exec into the traffic-gen pod (curlimages/curl) and target the ztunnel-metrics Service applied earlier in setup. Capture stderr instead of leaking it through CI logs. - Extend the matrix with Istio 1.29 (the current supported release as of 2026-05-18) in both sidecar and ambient mode. 1.24 stays since that is where ambient GA'd; 1.13 stays for the legacy Go-runtime safety net. * Address round-2 review feedback - check.py: use is_affirmative for the use_latest_spec opt-out guard so YAML string booleans ("false") also trigger the warning. - check.py: replace use_latest_spec_default kwarg with a keyword-only scraper_defaults mapping; precedence (defaults -> instance -> per- scraper restore) is now visible at the signature without an inline comment. - test_unit_istio_v2: assert directly that the ztunnel scraper config carries use_latest_spec=True, so a refactor that drops the default fails on shape rather than on fixture parsing. - test_e2e: treat unparseable ISTIO_VERSION as legacy so misconfigured local runs fail loudly instead of silently skipping LEGACY_GO_METRICS. - conftest: justify the per-line ValueError skip in _ztunnel_has_traffic with an inline comment. * Address round-3 review feedback - check.py: restore the invariant comment next to `config.update` so the namespace-must-follow-update rule is visible in code shape. - metadata.csv: rename the four ztunnel TCP counters from .total to .count to match the names the OpenMetrics V2 transformer actually submits (construct_metrics_config strips the _total suffix). This closes the metadata loop on the same regression class the PR fixes. - test_unit_istio_v2: add assert_metrics_using_metadata to the ztunnel metrics test so future mismatches between metrics.py, the transformer, and metadata.csv fail fast. - conftest: kubectl wait for traffic-gen Ready before polling so the 300s budget is spent on the actual wait-for-traffic signal instead of on "container not found" exec failures while the image pulls. - conftest: capture the last non-zero exec stderr and surface it in the RuntimeError so CI failures point at the real cause instead of a generic timeout message. * Address round-4 review feedback - Raise on the kubectl wait for traffic-gen so a timeout / pod-schedule failure surfaces immediately instead of being swallowed and then misattributed by the 300s polling loop downstream. - Build the ztunnel /stats/prometheus URL from the existing module constants so a future port or service rename touches one place. - Capture the last stdout excerpt on iterations where curl exits 0 but ztunnel reports no traffic (-sm 5 does not pass -f, so HTTP 4xx/5xx bodies land in stdout); include it in the final RuntimeError so the failure points at the real cause rather than reading "<none>". * Rename mechanical ztunnel .total counters to .count The OpenMetrics V2 transformer strips _total from the registered name and appends .count for counter-type metrics, so eight more ztunnel counter rows in metadata.csv (DNS, on_demand_dns, xds.connection_ terminations, connection.{opens,closes,termination}) were declared under the wrong suffix and went orphan once the V2 parser engaged. Basenames already match what metrics.py registers and what ztunnel 1.24 emits, so the rename is mechanical. The four remaining .total rows (proxies_{started,stopped}, active/ pending_proxy_count) have wrong basenames at the source — ztunnel 1.24 emits these under workload_manager_* — and need their registration corrected in a follow-up, not a suffix-only rewrite. * Realign ztunnel metric registration with real exposition The four in-pod proxy management metrics were registered under istio_proxies_started_total, istio_active_proxy_count_total etc., but ztunnel 1.24 emits them under the workload_manager_* family. Two more counters ztunnel emits (istio_xds_message_total and istio_xds_message_bytes_total) were not registered at all. Realign the registrations against the captured ztunnel exposition, update the four corresponding metadata.csv rows (correct suffix per gauge/counter type), add metadata entries for the two new xds counters, and expand V2_ZTUNNEL_METRICS so the unit and e2e tests assert all nine ztunnel metrics that the integration is now collecting. Split V2_ZTUNNEL_METRICS into counter / gauge sub-lists so the legacy-parser regression test pins the broken behavior on the counter set only (gauges are not affected by the parser bug). * Tighten prose hygiene and warning-silence regression - conftest: drop the seven-line docstring on the private _wait_for_ztunnel_traffic helper for a one-liner per AGENTS.md, with a short inline comment at the call site explaining the curl-from-traffic-gen indirection. - test_unit_istio_v2: assert in the happy-path ztunnel test that the opt-out warning stays silent, so a future change that emits it unconditionally fails the test instead of shipping noise. - common.py: collapse the multi-line block introducing the V2_ZTUNNEL_COUNTER_METRICS / V2_ZTUNNEL_GAUGE_METRICS split into two one-line comments. * Collapse prose comments per AGENTS.md - check.py: replace the four-line block above the ztunnel if-block with a single line. The scraper_defaults kwarg and the opt-out warning already explain the why. - common.py: replace the wrapped comment above V2_ZTUNNEL_COUNTER_METRICS with a single line that names the format-level reason (TYPE base name + _total samples) instead of the PR-narrative "before this PR's fix" framing. * Expand changelog to cover full ambient mode fix * Consolidate ambient changelog into a single fix entry * Tighten changelog wording * Drop dead istio_connection_* registrations from ZTUNNEL_METRICS These three counter registrations point at metric names ztunnel never emits. A search across the ztunnel source confirms only the tcp_connections_* and xds_connection_terminations families exist; the istio_connection_opens_total / closes_total / termination_total entries were dead from the start and carried matching orphan rows in metadata.csv. Removing both. * Document ambient mode and mark its settings fleet-configurable PR #22581 introduced ambient mode but never updated the docs or spec metadata. The three instance options it added (istio_mode, ztunnel_endpoint, waypoint_endpoint) were the only instance-level settings in the spec without fleet_configurable: true, so customers managing Istio from Datadog Fleet Automation could remote-configure every Istio setting except those three. The README similarly had no mention of ambient mode at all. - Add fleet_configurable: true to the three ambient settings in assets/configuration/spec.yaml; regenerate config_models. - Add an "Ambient mode configuration" subsection to the README under Metric collection, showing the ztunnel annotation pattern and pointing at waypoint as the L7 option. * Correct README: one ambient instance scrapes all three endpoints The previous wording suggested a second instance for waypoint metrics, but _parse_ambient_config reads ztunnel_endpoint, waypoint_endpoint, and istiod_endpoint from the same instance and spawns sub-scrapers for each. Replace the per-pod annotation example with a static conf.yaml covering all three components in one instance. * Drop hard-coded ports from ambient README prose Port 15020 is the Istio default but configurable; mentioning specific numbers in the prose can mislead a reader running a non-default exposition. The example URLs still show the conventional defaults, which is the right place for that information since users edit the URLs directly when their setup differs. * Address documentation review feedback on ambient README section - Spell out GA as 'generally available' on first mention. - Split the sentence joined by an em dash into two complete sentences. - Capitalize Autodiscovery (Datadog feature name). * Restore license headers on regenerated config_models Local ddev's model regeneration produced these files without the Datadog license header that the master branch tracks. CI's validate models check caught the drift. (cherry picked from commit 3bce4ec) Co-authored-by: Juanpe Araque <juanpedro.araque@datadoghq.com>
What does this PR do?
Make Istio ambient mode collect ztunnel metrics. The ambient feature shipped in integration 9.4.0 was broken by two bugs:
# TYPE foo counter+foo_total{} Nis not reconciled by the legacy parser'sallowed_names).istio_*_totalnames ztunnel never emits — the real names areworkload_manager_*. Two xDS counters were not registered at all.Fix is to default
use_latest_spec: Trueon the ztunnel sub-scraper only (waypoint and istiod stay on the legacy parser), realign the fourworkload_manager_*registrations against the captured 1.24 exposition, add the two xDS counters, and rename the 12 affectedmetadata.csvrows from.totalto.count(or drop the suffix on gauges). Users can still overrideuse_latest_spec; the integration warns when that re-introduces the silent-drop bug.The matrix gains
py3.13-1.24-ambient,py3.13-1.29-ambient,py3.13-1.24-sidecar, andpy3.13-1.29-sidecar. Sidecarsetup_istiois migrated from the baked 1.13-only IstioOperator manifest toistioctl install --set profile=demo.Motivation
Customer support case AGENT-16157 — customer upgraded to 7.78.0 specifically to switch to native
istio.ztunnel.*metrics and lost their existing coverage with no visible error. Tracks GitHub issue #19166.Review checklist (to be filled by reviewers)
qa/skip-qalabel if the PR doesn't need to be tested during QA.backport/<branch-name>label to the PR and it will automatically open a backport PR once this one is merged