Skip to content

Fix silent drop of ztunnel counter metrics in Istio ambient mode#23707

Merged
AAraKKe merged 33 commits into
masterfrom
aarakke/istio-ambient-ztunnel-fix
May 20, 2026
Merged

Fix silent drop of ztunnel counter metrics in Istio ambient mode#23707
AAraKKe merged 33 commits into
masterfrom
aarakke/istio-ambient-ztunnel-fix

Conversation

@AAraKKe

@AAraKKe AAraKKe commented May 14, 2026

Copy link
Copy Markdown
Collaborator

What does this PR do?

Make Istio ambient mode collect ztunnel metrics. The ambient feature shipped in integration 9.4.0 was broken by two bugs:

  1. The ztunnel sub-scraper went through the legacy Prometheus parser, which silently dropped every counter (# TYPE foo counter + foo_total{} N is not reconciled by the legacy parser's allowed_names).
  2. Four proxy management metrics were registered under istio_*_total names ztunnel never emits — the real names are workload_manager_*. Two xDS counters were not registered at all.

Fix is to default use_latest_spec: True on the ztunnel sub-scraper only (waypoint and istiod stay on the legacy parser), realign the four workload_manager_* registrations against the captured 1.24 exposition, add the two xDS counters, and rename the 12 affected metadata.csv rows from .total to .count (or drop the suffix on gauges). Users can still override use_latest_spec; the integration warns when that re-introduces the silent-drop bug.

The matrix gains py3.13-1.24-ambient, py3.13-1.29-ambient, py3.13-1.24-sidecar, and py3.13-1.29-sidecar. Sidecar setup_istio is migrated from the baked 1.13-only IstioOperator manifest to istioctl install --set profile=demo.

Motivation

Customer support case AGENT-16157 — customer upgraded to 7.78.0 specifically to switch to native istio.ztunnel.* metrics and lost their existing coverage with no visible error. Tracks GitHub issue #19166.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

Ztunnel emits the modern OpenMetrics counter convention (TYPE declared with
the base name, samples emitted with the _total suffix). The integration's
ztunnel sub-scraper was routed through the legacy prometheus_client parser,
which does not add _total to a counter's allowed sample names and yields
each sample as an untyped singleton. Combined with construct_metrics_config
stripping _total from registration keys, every ztunnel counter was silently
dropped (failure visible only at DEBUG level).

Force use_latest_spec=True on the ztunnel sub-scraper so the OpenMetrics
parser is used. Scoped to ztunnel only; waypoint (Envoy, legacy convention)
and istiod (Go client_golang, legacy convention) are unaffected. The user
can still opt out by setting use_latest_spec: false on the instance.

Adds an ambient e2e environment (py3.13-1.24-ambient) that installs
Istio 1.24.3 with the ambient profile, deploys bookinfo, applies a
waypoint, and runs a traffic generator so ztunnel counters are non-zero.
The unit test fixture is now the real output captured from ztunnel
running in that environment; the prior hand-written 1.5/ztunnel.txt
fixture used a non-realistic TYPE convention and is removed.

Adds test_ambient_ztunnel_legacy_parser_drops_counters that pins the
broken behavior under use_latest_spec=false so a future regression that
removes the default cannot reintroduce the silent drop.
@AAraKKe AAraKKe added the qa/skip-qa Automatically skip this PR for the next QA label May 14, 2026
@codecov

codecov Bot commented May 14, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 44.95413% with 60 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.01%. Comparing base (c0767bc) to head (50f5109).
⚠️ Report is 34 commits behind head on master.

Additional details and impacted files
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@datadog-prod-us1-3

datadog-prod-us1-3 Bot commented May 14, 2026

Copy link
Copy Markdown

Tests

🎉 All green!

🧪 All tests passed
❄️ No new flaky tests detected

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 50f5109 | Docs | Datadog PR Page | Give us feedback!

AAraKKe added 24 commits May 14, 2026 17:55
In CI, ddev env test runs the dd_environment fixture's setup and
teardown in different pytest invocations. The DDEV_E2E_ENV_* env vars
that map per-port-forward TempDir names to actual paths do not survive
that transition, so KillProcess opens a freshly mkdtemp'd path and
raises FileNotFoundError on kubectl.pid lookup. The kubectl
port-forward processes are already dying on their own because kind
delete cluster has removed the API server they were forwarding to, so
swallowing the missing-pid-file error is safe. Wrap port_forward in a
safe_port_forward context manager that suppresses FileNotFoundError on
exit. Sidecar (single port-forward) is not affected; ambient (three
port-forwards) hits this every run.
The previous attempt port-forwarded pod/ztunnel-<random-suffix>. The
dd_environment fixture runs in two pytest invocations under ddev env
test (setup and teardown), and on the teardown invocation
_get_first_ztunnel_pod runs against a kind cluster that has already
been deleted, returning an empty string. port_forward then builds a
different TempDir key than setup, the env-var bridge misses, and
KillProcess opens an empty mkdtemp'd dir, raising FileNotFoundError on
kubectl.pid. Other integrations like kuma avoid this by port-forwarding
a Service (stable name).

Create a Service in setup_istio_ambient via `kubectl expose daemonset
ztunnel`, port-forward service/ztunnel-metrics. Drop the _get_first_-
ztunnel_pod helper and the safe_port_forward workaround.
kubectl expose does not support DaemonSets (error: cannot expose a
DaemonSet.apps). The previous attempt to use kubectl expose silently
failed and the port-forward of service/ztunnel-metrics then connected
to nothing, surfacing as 'Connection refused' on the agent side. Apply
a tracked Service manifest at tests/kind/ztunnel_service.yaml instead.
The matrix previously folded mode into a single version-like dimension
(version = ["1.13", "1.24-ambient"]), which scales poorly: adding
ambient support for a new Istio version meant introducing a new
"1.xx-ambient" pseudo-version alongside the plain "1.xx" entry. Split
into orthogonal dimensions:

  version = ["1.13"] mode = ["sidecar"]   # block 1
  version = ["1.24"] mode = ["ambient"]   # block 2

Two matrix blocks express the constraint that 1.13 does not support
ambient (it was GA in 1.24). Adding a future version is a one-line
change to the relevant block; adding a new mode on an existing version
is a one-line change too once the matching setup function is in place.
Env names become py3.13-1.13-sidecar and py3.13-1.24-ambient, with
ISTIO_VERSION and ISTIO_MODE env vars driven independently from the
two matrix axes. conftest's fallthrough switches to `else` (i.e. the
sidecar branch) to match the new matrix semantics rather than gating
on a hard-coded VERSION string.
Migrate setup_istio off the baked IstioOperator manifest and onto
`istioctl install --set profile=demo`, which is version-agnostic. Drop
the now-unused 1.13-only manifest and extend the sidecar matrix to also
cover 1.24, so the integration is exercised end-to-end against a
modern Istio release in sidecar mode.
A fresh Istio 1.24 install on bookinfo emits istio.galley.validation.passed
but not the .failed variant (no validation errors occur). The .passed.count
v2 sibling was also missing from the list. Move both pairs into the
intermittent list so the e2e tests pass on a clean cluster.
The Go runtime metrics istio.go.memstats.gc_cpu_fraction and lookups_total
are no longer emitted by Istio 1.24's binary (deprecated in Go 1.20 and
later removed from client_golang's default exposition), but they are still
emitted by Istio 1.13. Calling them "intermittent" on 1.13 silently hid any
collection bug there.

Pull them out into a LEGACY_GO_METRICS set that is strictly asserted on
1.13 envs and skipped on 1.24+. Genuinely-conditional metrics (validation
failures, listener conflicts, sidecar injection events, etc.) remain in
INTERMITTENT_METRICS and are asserted at_least=0 on every version.
- test_e2e_ambient: replace the dead at_least=0 conditional on
  ISTIOD_V2_METRICS with the existing _assert_istiod_metric helper so
  ambient mode also strictly asserts non-intermittent istiod metrics.
- check.py: log a warning when the user opts out of the OpenMetrics
  parser (use_latest_spec: false) while ztunnel is being scraped.
  Comment the asymmetry between namespace (always restored) and
  use_latest_spec (user-overridable default) inside _generate_config.
- conftest.py: replace the hard-coded 15-second sleep after the traffic
  generator pod is created with a 300-second polling loop that waits
  for ztunnel to report a non-zero TCP connection counter.
- conftest.py: return osx-amd64 (not osx) for intel macOS in the Istio
  release suffix helper so local-dev downloads do not 404.
- changelog: rephrase to name the affected version (9.4.0) so customers
  searching the changelog for the regression window can find this fix.
- _wait_for_ztunnel_traffic was exec'ing into the ztunnel pod, which
  runs a minimal Rust binary without curl. Move the kubectl exec into
  the traffic-gen pod (curlimages/curl) and target the ztunnel-metrics
  Service applied earlier in setup. Capture stderr instead of leaking
  it through CI logs.
- Extend the matrix with Istio 1.29 (the current supported release as
  of 2026-05-18) in both sidecar and ambient mode. 1.24 stays since
  that is where ambient GA'd; 1.13 stays for the legacy Go-runtime
  safety net.
- check.py: use is_affirmative for the use_latest_spec opt-out guard
  so YAML string booleans ("false") also trigger the warning.
- check.py: replace use_latest_spec_default kwarg with a keyword-only
  scraper_defaults mapping; precedence (defaults -> instance -> per-
  scraper restore) is now visible at the signature without an inline
  comment.
- test_unit_istio_v2: assert directly that the ztunnel scraper config
  carries use_latest_spec=True, so a refactor that drops the default
  fails on shape rather than on fixture parsing.
- test_e2e: treat unparseable ISTIO_VERSION as legacy so misconfigured
  local runs fail loudly instead of silently skipping LEGACY_GO_METRICS.
- conftest: justify the per-line ValueError skip in _ztunnel_has_traffic
  with an inline comment.
- check.py: restore the invariant comment next to `config.update`
  so the namespace-must-follow-update rule is visible in code shape.
- metadata.csv: rename the four ztunnel TCP counters from .total to
  .count to match the names the OpenMetrics V2 transformer actually
  submits (construct_metrics_config strips the _total suffix). This
  closes the metadata loop on the same regression class the PR fixes.
- test_unit_istio_v2: add assert_metrics_using_metadata to the ztunnel
  metrics test so future mismatches between metrics.py, the transformer,
  and metadata.csv fail fast.
- conftest: kubectl wait for traffic-gen Ready before polling so the
  300s budget is spent on the actual wait-for-traffic signal instead
  of on "container not found" exec failures while the image pulls.
- conftest: capture the last non-zero exec stderr and surface it in
  the RuntimeError so CI failures point at the real cause instead of
  a generic timeout message.
- Raise on the kubectl wait for traffic-gen so a timeout / pod-schedule
  failure surfaces immediately instead of being swallowed and then
  misattributed by the 300s polling loop downstream.
- Build the ztunnel /stats/prometheus URL from the existing module
  constants so a future port or service rename touches one place.
- Capture the last stdout excerpt on iterations where curl exits 0 but
  ztunnel reports no traffic (-sm 5 does not pass -f, so HTTP 4xx/5xx
  bodies land in stdout); include it in the final RuntimeError so the
  failure points at the real cause rather than reading "<none>".
The OpenMetrics V2 transformer strips _total from the registered name
and appends .count for counter-type metrics, so eight more ztunnel
counter rows in metadata.csv (DNS, on_demand_dns, xds.connection_
terminations, connection.{opens,closes,termination}) were declared
under the wrong suffix and went orphan once the V2 parser engaged.
Basenames already match what metrics.py registers and what ztunnel
1.24 emits, so the rename is mechanical.

The four remaining .total rows (proxies_{started,stopped}, active/
pending_proxy_count) have wrong basenames at the source — ztunnel
1.24 emits these under workload_manager_* — and need their
registration corrected in a follow-up, not a suffix-only rewrite.
The four in-pod proxy management metrics were registered under
istio_proxies_started_total, istio_active_proxy_count_total etc.,
but ztunnel 1.24 emits them under the workload_manager_* family.
Two more counters ztunnel emits (istio_xds_message_total and
istio_xds_message_bytes_total) were not registered at all.

Realign the registrations against the captured ztunnel exposition,
update the four corresponding metadata.csv rows (correct suffix per
gauge/counter type), add metadata entries for the two new xds
counters, and expand V2_ZTUNNEL_METRICS so the unit and e2e tests
assert all nine ztunnel metrics that the integration is now
collecting. Split V2_ZTUNNEL_METRICS into counter / gauge sub-lists
so the legacy-parser regression test pins the broken behavior on
the counter set only (gauges are not affected by the parser bug).
- conftest: drop the seven-line docstring on the private
  _wait_for_ztunnel_traffic helper for a one-liner per AGENTS.md,
  with a short inline comment at the call site explaining the
  curl-from-traffic-gen indirection.
- test_unit_istio_v2: assert in the happy-path ztunnel test that
  the opt-out warning stays silent, so a future change that emits
  it unconditionally fails the test instead of shipping noise.
- common.py: collapse the multi-line block introducing the
  V2_ZTUNNEL_COUNTER_METRICS / V2_ZTUNNEL_GAUGE_METRICS split into
  two one-line comments.
- check.py: replace the four-line block above the ztunnel
  if-block with a single line. The scraper_defaults kwarg and
  the opt-out warning already explain the why.
- common.py: replace the wrapped comment above
  V2_ZTUNNEL_COUNTER_METRICS with a single line that names the
  format-level reason (TYPE base name + _total samples) instead
  of the PR-narrative "before this PR's fix" framing.
joepeeples
joepeeples previously approved these changes May 18, 2026
Comment thread istio/datadog_checks/istio/metrics.py Outdated
These three counter registrations point at metric names ztunnel never
emits. A search across the ztunnel source confirms only the
tcp_connections_* and xds_connection_terminations families exist; the
istio_connection_opens_total / closes_total / termination_total entries
were dead from the start and carried matching orphan rows in
metadata.csv. Removing both.
@temporal-github-worker-1 temporal-github-worker-1 Bot dismissed joepeeples’s stale review May 19, 2026 15:13

Review from joepeeples is dismissed. Related teams and files:

  • documentation
    • istio/metadata.csv
@dd-octo-sts

dd-octo-sts Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

⚠️ The qa/skip-qa label has been added with shippable changes

The following files, which will be shipped with the agent, were modified in this PR and
the qa/skip-qa label has been added.

You can ignore this if you are sure the changes in this PR do not require QA. Otherwise,
consider removing the label.

List of modified files that will be shipped with the agent
istio/changelog.d/23707.fixed
istio/datadog_checks/istio/check.py
istio/datadog_checks/istio/metrics.py
istio/hatch.toml

nubtron
nubtron previously approved these changes May 19, 2026
@nubtron

nubtron commented May 19, 2026

Copy link
Copy Markdown
Contributor

Non-blocking since it's assets - we should probably update assets and the README for ambient mode.

@nubtron nubtron left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make the ambient settings in the spec as fleet_configurable.

nubtron
nubtron previously approved these changes May 19, 2026

@nubtron nubtron left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Downgraded adding "fleet_configurable" to non-blocking as well, since it's an asset.

PR #22581 introduced ambient mode but never updated the docs or spec
metadata. The three instance options it added (istio_mode,
ztunnel_endpoint, waypoint_endpoint) were the only instance-level
settings in the spec without fleet_configurable: true, so customers
managing Istio from Datadog Fleet Automation could remote-configure
every Istio setting except those three. The README similarly had no
mention of ambient mode at all.

- Add fleet_configurable: true to the three ambient settings in
  assets/configuration/spec.yaml; regenerate config_models.
- Add an "Ambient mode configuration" subsection to the README under
  Metric collection, showing the ztunnel annotation pattern and
  pointing at waypoint as the L7 option.
@temporal-github-worker-1 temporal-github-worker-1 Bot dismissed stale reviews from nubtron and nubtron May 19, 2026 16:36

Review from nubtron is dismissed. Related teams and files:

  • agent-integrations
    • istio/README.md
    • istio/assets/configuration/spec.yaml
    • istio/datadog_checks/istio/config_models/init.py
    • istio/datadog_checks/istio/config_models/defaults.py
    • istio/datadog_checks/istio/config_models/instance.py
    • istio/datadog_checks/istio/config_models/shared.py

Review from nubtron is dismissed. Related teams and files:

  • agent-integrations
    • istio/README.md
    • istio/assets/configuration/spec.yaml
    • istio/datadog_checks/istio/config_models/init.py
    • istio/datadog_checks/istio/config_models/defaults.py
    • istio/datadog_checks/istio/config_models/instance.py
    • istio/datadog_checks/istio/config_models/shared.py
AAraKKe added 4 commits May 19, 2026 18:40
The previous wording suggested a second instance for waypoint metrics,
but _parse_ambient_config reads ztunnel_endpoint, waypoint_endpoint,
and istiod_endpoint from the same instance and spawns sub-scrapers
for each. Replace the per-pod annotation example with a static
conf.yaml covering all three components in one instance.
Port 15020 is the Istio default but configurable; mentioning specific
numbers in the prose can mislead a reader running a non-default
exposition. The example URLs still show the conventional defaults,
which is the right place for that information since users edit the
URLs directly when their setup differs.
- Spell out GA as 'generally available' on first mention.
- Split the sentence joined by an em dash into two complete sentences.
- Capitalize Autodiscovery (Datadog feature name).
Local ddev's model regeneration produced these files without the
Datadog license header that the master branch tracks. CI's validate
models check caught the drift.
@dd-octo-sts

dd-octo-sts Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

Validation Report

All 20 validations passed.

Show details
Validation Description Status
agent-reqs Verify check versions match the Agent requirements file
ci Validate CI configuration and Codecov settings
codeowners Validate every integration has a CODEOWNERS entry
config Validate default configuration files against spec.yaml
dep Verify dependency pins are consistent and Agent-compatible
http Validate integrations use the HTTP wrapper correctly
imports Validate check imports do not use deprecated modules
integration-style Validate check code style conventions
jmx-metrics Validate JMX metrics definition files and config
labeler Validate PR labeler config matches integration directories
legacy-signature Validate no integration uses the legacy Agent check signature
license-headers Validate Python files have proper license headers
licenses Validate third-party license attribution list
metadata Validate metadata.csv metric definitions
models Validate configuration data models match spec.yaml
openmetrics Validate OpenMetrics integrations disable the metric limit
package Validate Python package metadata and naming
readmes Validate README files have required sections
saved-views Validate saved view JSON file structure and fields
version Validate version consistency between package and changelog

View full run

@AAraKKe AAraKKe enabled auto-merge May 20, 2026 07:39
@AAraKKe AAraKKe added this pull request to the merge queue May 20, 2026
Merged via the queue into master with commit 3bce4ec May 20, 2026
41 checks passed
@AAraKKe AAraKKe deleted the aarakke/istio-ambient-ztunnel-fix branch May 20, 2026 07:42
@dd-octo-sts dd-octo-sts Bot added this to the 7.81.0 milestone May 20, 2026
AAraKKe added a commit that referenced this pull request May 20, 2026
) (#23761)

* Fix silent drop of ztunnel counter metrics in Istio ambient mode

Ztunnel emits the modern OpenMetrics counter convention (TYPE declared with
the base name, samples emitted with the _total suffix). The integration's
ztunnel sub-scraper was routed through the legacy prometheus_client parser,
which does not add _total to a counter's allowed sample names and yields
each sample as an untyped singleton. Combined with construct_metrics_config
stripping _total from registration keys, every ztunnel counter was silently
dropped (failure visible only at DEBUG level).

Force use_latest_spec=True on the ztunnel sub-scraper so the OpenMetrics
parser is used. Scoped to ztunnel only; waypoint (Envoy, legacy convention)
and istiod (Go client_golang, legacy convention) are unaffected. The user
can still opt out by setting use_latest_spec: false on the instance.

Adds an ambient e2e environment (py3.13-1.24-ambient) that installs
Istio 1.24.3 with the ambient profile, deploys bookinfo, applies a
waypoint, and runs a traffic generator so ztunnel counters are non-zero.
The unit test fixture is now the real output captured from ztunnel
running in that environment; the prior hand-written 1.5/ztunnel.txt
fixture used a non-realistic TYPE convention and is removed.

Adds test_ambient_ztunnel_legacy_parser_drops_counters that pins the
broken behavior under use_latest_spec=false so a future regression that
removes the default cannot reintroduce the silent drop.

* Add changelog entry

* Make ambient e2e teardown robust to missing kubectl.pid

In CI, ddev env test runs the dd_environment fixture's setup and
teardown in different pytest invocations. The DDEV_E2E_ENV_* env vars
that map per-port-forward TempDir names to actual paths do not survive
that transition, so KillProcess opens a freshly mkdtemp'd path and
raises FileNotFoundError on kubectl.pid lookup. The kubectl
port-forward processes are already dying on their own because kind
delete cluster has removed the API server they were forwarding to, so
swallowing the missing-pid-file error is safe. Wrap port_forward in a
safe_port_forward context manager that suppresses FileNotFoundError on
exit. Sidecar (single port-forward) is not affected; ambient (three
port-forwards) hits this every run.

* Use a Service for ztunnel port-forward instead of a dynamic pod

The previous attempt port-forwarded pod/ztunnel-<random-suffix>. The
dd_environment fixture runs in two pytest invocations under ddev env
test (setup and teardown), and on the teardown invocation
_get_first_ztunnel_pod runs against a kind cluster that has already
been deleted, returning an empty string. port_forward then builds a
different TempDir key than setup, the env-var bridge misses, and
KillProcess opens an empty mkdtemp'd dir, raising FileNotFoundError on
kubectl.pid. Other integrations like kuma avoid this by port-forwarding
a Service (stable name).

Create a Service in setup_istio_ambient via `kubectl expose daemonset
ztunnel`, port-forward service/ztunnel-metrics. Drop the _get_first_-
ztunnel_pod helper and the safe_port_forward workaround.

* Create ztunnel Service via manifest, not kubectl expose

kubectl expose does not support DaemonSets (error: cannot expose a
DaemonSet.apps). The previous attempt to use kubectl expose silently
failed and the port-forward of service/ztunnel-metrics then connected
to nothing, surfacing as 'Connection refused' on the agent side. Apply
a tracked Service manifest at tests/kind/ztunnel_service.yaml instead.

* Restructure matrix orthogonally: version x mode

The matrix previously folded mode into a single version-like dimension
(version = ["1.13", "1.24-ambient"]), which scales poorly: adding
ambient support for a new Istio version meant introducing a new
"1.xx-ambient" pseudo-version alongside the plain "1.xx" entry. Split
into orthogonal dimensions:

  version = ["1.13"] mode = ["sidecar"]   # block 1
  version = ["1.24"] mode = ["ambient"]   # block 2

Two matrix blocks express the constraint that 1.13 does not support
ambient (it was GA in 1.24). Adding a future version is a one-line
change to the relevant block; adding a new mode on an existing version
is a one-line change too once the matching setup function is in place.
Env names become py3.13-1.13-sidecar and py3.13-1.24-ambient, with
ISTIO_VERSION and ISTIO_MODE env vars driven independently from the
two matrix axes. conftest's fallthrough switches to `else` (i.e. the
sidecar branch) to match the new matrix semantics rather than gating
on a hard-coded VERSION string.

* Use matrix-axis shorthand for ISTIO_MODE env var

* Add Istio 1.24 to the sidecar e2e matrix

Migrate setup_istio off the baked IstioOperator manifest and onto
`istioctl install --set profile=demo`, which is version-agnostic. Drop
the now-unused 1.13-only manifest and extend the sidecar matrix to also
cover 1.24, so the integration is exercised end-to-end against a
modern Istio release in sidecar mode.

* Mark Istio galley validation pass/fail metrics intermittent

A fresh Istio 1.24 install on bookinfo emits istio.galley.validation.passed
but not the .failed variant (no validation errors occur). The .passed.count
v2 sibling was also missing from the list. Move both pairs into the
intermittent list so the e2e tests pass on a clean cluster.

* Mark gc_cpu_fraction intermittent for Istio 1.24

* Mark memstats.lookups intermittent for Istio 1.24

* Mark all pilot.conflict variants intermittent

* Split legacy Go metrics from intermittent in Istio e2e

The Go runtime metrics istio.go.memstats.gc_cpu_fraction and lookups_total
are no longer emitted by Istio 1.24's binary (deprecated in Go 1.20 and
later removed from client_golang's default exposition), but they are still
emitted by Istio 1.13. Calling them "intermittent" on 1.13 silently hid any
collection bug there.

Pull them out into a LEGACY_GO_METRICS set that is strictly asserted on
1.13 envs and skipped on 1.24+. Genuinely-conditional metrics (validation
failures, listener conflicts, sidecar injection events, etc.) remain in
INTERMITTENT_METRICS and are asserted at_least=0 on every version.

* Parse Istio version numerically for legacy check

* Use packaging.version for Istio version comparison

* Address PR review feedback

- test_e2e_ambient: replace the dead at_least=0 conditional on
  ISTIOD_V2_METRICS with the existing _assert_istiod_metric helper so
  ambient mode also strictly asserts non-intermittent istiod metrics.
- check.py: log a warning when the user opts out of the OpenMetrics
  parser (use_latest_spec: false) while ztunnel is being scraped.
  Comment the asymmetry between namespace (always restored) and
  use_latest_spec (user-overridable default) inside _generate_config.
- conftest.py: replace the hard-coded 15-second sleep after the traffic
  generator pod is created with a 300-second polling loop that waits
  for ztunnel to report a non-zero TCP connection counter.
- conftest.py: return osx-amd64 (not osx) for intel macOS in the Istio
  release suffix helper so local-dev downloads do not 404.
- changelog: rephrase to name the affected version (9.4.0) so customers
  searching the changelog for the regression window can find this fix.

* Fix ztunnel poll target and add Istio 1.29 to the matrix

- _wait_for_ztunnel_traffic was exec'ing into the ztunnel pod, which
  runs a minimal Rust binary without curl. Move the kubectl exec into
  the traffic-gen pod (curlimages/curl) and target the ztunnel-metrics
  Service applied earlier in setup. Capture stderr instead of leaking
  it through CI logs.
- Extend the matrix with Istio 1.29 (the current supported release as
  of 2026-05-18) in both sidecar and ambient mode. 1.24 stays since
  that is where ambient GA'd; 1.13 stays for the legacy Go-runtime
  safety net.

* Address round-2 review feedback

- check.py: use is_affirmative for the use_latest_spec opt-out guard
  so YAML string booleans ("false") also trigger the warning.
- check.py: replace use_latest_spec_default kwarg with a keyword-only
  scraper_defaults mapping; precedence (defaults -> instance -> per-
  scraper restore) is now visible at the signature without an inline
  comment.
- test_unit_istio_v2: assert directly that the ztunnel scraper config
  carries use_latest_spec=True, so a refactor that drops the default
  fails on shape rather than on fixture parsing.
- test_e2e: treat unparseable ISTIO_VERSION as legacy so misconfigured
  local runs fail loudly instead of silently skipping LEGACY_GO_METRICS.
- conftest: justify the per-line ValueError skip in _ztunnel_has_traffic
  with an inline comment.

* Address round-3 review feedback

- check.py: restore the invariant comment next to `config.update`
  so the namespace-must-follow-update rule is visible in code shape.
- metadata.csv: rename the four ztunnel TCP counters from .total to
  .count to match the names the OpenMetrics V2 transformer actually
  submits (construct_metrics_config strips the _total suffix). This
  closes the metadata loop on the same regression class the PR fixes.
- test_unit_istio_v2: add assert_metrics_using_metadata to the ztunnel
  metrics test so future mismatches between metrics.py, the transformer,
  and metadata.csv fail fast.
- conftest: kubectl wait for traffic-gen Ready before polling so the
  300s budget is spent on the actual wait-for-traffic signal instead
  of on "container not found" exec failures while the image pulls.
- conftest: capture the last non-zero exec stderr and surface it in
  the RuntimeError so CI failures point at the real cause instead of
  a generic timeout message.

* Address round-4 review feedback

- Raise on the kubectl wait for traffic-gen so a timeout / pod-schedule
  failure surfaces immediately instead of being swallowed and then
  misattributed by the 300s polling loop downstream.
- Build the ztunnel /stats/prometheus URL from the existing module
  constants so a future port or service rename touches one place.
- Capture the last stdout excerpt on iterations where curl exits 0 but
  ztunnel reports no traffic (-sm 5 does not pass -f, so HTTP 4xx/5xx
  bodies land in stdout); include it in the final RuntimeError so the
  failure points at the real cause rather than reading "<none>".

* Rename mechanical ztunnel .total counters to .count

The OpenMetrics V2 transformer strips _total from the registered name
and appends .count for counter-type metrics, so eight more ztunnel
counter rows in metadata.csv (DNS, on_demand_dns, xds.connection_
terminations, connection.{opens,closes,termination}) were declared
under the wrong suffix and went orphan once the V2 parser engaged.
Basenames already match what metrics.py registers and what ztunnel
1.24 emits, so the rename is mechanical.

The four remaining .total rows (proxies_{started,stopped}, active/
pending_proxy_count) have wrong basenames at the source — ztunnel
1.24 emits these under workload_manager_* — and need their
registration corrected in a follow-up, not a suffix-only rewrite.

* Realign ztunnel metric registration with real exposition

The four in-pod proxy management metrics were registered under
istio_proxies_started_total, istio_active_proxy_count_total etc.,
but ztunnel 1.24 emits them under the workload_manager_* family.
Two more counters ztunnel emits (istio_xds_message_total and
istio_xds_message_bytes_total) were not registered at all.

Realign the registrations against the captured ztunnel exposition,
update the four corresponding metadata.csv rows (correct suffix per
gauge/counter type), add metadata entries for the two new xds
counters, and expand V2_ZTUNNEL_METRICS so the unit and e2e tests
assert all nine ztunnel metrics that the integration is now
collecting. Split V2_ZTUNNEL_METRICS into counter / gauge sub-lists
so the legacy-parser regression test pins the broken behavior on
the counter set only (gauges are not affected by the parser bug).

* Tighten prose hygiene and warning-silence regression

- conftest: drop the seven-line docstring on the private
  _wait_for_ztunnel_traffic helper for a one-liner per AGENTS.md,
  with a short inline comment at the call site explaining the
  curl-from-traffic-gen indirection.
- test_unit_istio_v2: assert in the happy-path ztunnel test that
  the opt-out warning stays silent, so a future change that emits
  it unconditionally fails the test instead of shipping noise.
- common.py: collapse the multi-line block introducing the
  V2_ZTUNNEL_COUNTER_METRICS / V2_ZTUNNEL_GAUGE_METRICS split into
  two one-line comments.

* Collapse prose comments per AGENTS.md

- check.py: replace the four-line block above the ztunnel
  if-block with a single line. The scraper_defaults kwarg and
  the opt-out warning already explain the why.
- common.py: replace the wrapped comment above
  V2_ZTUNNEL_COUNTER_METRICS with a single line that names the
  format-level reason (TYPE base name + _total samples) instead
  of the PR-narrative "before this PR's fix" framing.

* Expand changelog to cover full ambient mode fix

* Consolidate ambient changelog into a single fix entry

* Tighten changelog wording

* Drop dead istio_connection_* registrations from ZTUNNEL_METRICS

These three counter registrations point at metric names ztunnel never
emits. A search across the ztunnel source confirms only the
tcp_connections_* and xds_connection_terminations families exist; the
istio_connection_opens_total / closes_total / termination_total entries
were dead from the start and carried matching orphan rows in
metadata.csv. Removing both.

* Document ambient mode and mark its settings fleet-configurable

PR #22581 introduced ambient mode but never updated the docs or spec
metadata. The three instance options it added (istio_mode,
ztunnel_endpoint, waypoint_endpoint) were the only instance-level
settings in the spec without fleet_configurable: true, so customers
managing Istio from Datadog Fleet Automation could remote-configure
every Istio setting except those three. The README similarly had no
mention of ambient mode at all.

- Add fleet_configurable: true to the three ambient settings in
  assets/configuration/spec.yaml; regenerate config_models.
- Add an "Ambient mode configuration" subsection to the README under
  Metric collection, showing the ztunnel annotation pattern and
  pointing at waypoint as the L7 option.

* Correct README: one ambient instance scrapes all three endpoints

The previous wording suggested a second instance for waypoint metrics,
but _parse_ambient_config reads ztunnel_endpoint, waypoint_endpoint,
and istiod_endpoint from the same instance and spawns sub-scrapers
for each. Replace the per-pod annotation example with a static
conf.yaml covering all three components in one instance.

* Drop hard-coded ports from ambient README prose

Port 15020 is the Istio default but configurable; mentioning specific
numbers in the prose can mislead a reader running a non-default
exposition. The example URLs still show the conventional defaults,
which is the right place for that information since users edit the
URLs directly when their setup differs.

* Address documentation review feedback on ambient README section

- Spell out GA as 'generally available' on first mention.
- Split the sentence joined by an em dash into two complete sentences.
- Capitalize Autodiscovery (Datadog feature name).

* Restore license headers on regenerated config_models

Local ddev's model regeneration produced these files without the
Datadog license header that the master branch tracks. CI's validate
models check caught the drift.

(cherry picked from commit 3bce4ec)

Co-authored-by: Juanpe Araque <juanpedro.araque@datadoghq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants