feat: add per-monitor configuration to selectively disable monitors#50

Merged
shvbsle merged 12 commits into
aws:mainfrom
shvbsle:monconfig
Mar 2, 2026

@shvbsle shvbsle commented Feb 25, 2026

Issue #, if available:
NA

Description of changes:
Adds a configuration layer for selectively disabling NMA monitors via Helm values. Config flows from values.yaml -> Helm-managed ConfigMap -> volume mount at /etc/nma/config.yaml -> parsed at startup. The per-monitor settings struct is extensible for future fields (e.g., intervals).

Example Usage

nodeAgent:
  monitors:
    networking:
      enabled: false
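
The default-enabled behaviour implied by this example might be modelled roughly as follows. This is a minimal sketch, not the code in pkg/config: the type and method names (`MonitorSettings`, `MonitorConfig`, `IsEnabled`, `Disabled`) are hypothetical, and the pointer-typed `Enabled` field is an assumption made so that an omitted value can be told apart from an explicit `false`.

```go
package main

import (
	"fmt"
	"sort"
)

// MonitorSettings holds per-monitor options (hypothetical shape).
// Enabled is a pointer so "not set" differs from an explicit false.
// Future fields such as intervals could be added here.
type MonitorSettings struct {
	Enabled *bool
}

// MonitorConfig maps a monitor name (e.g. "networking") to its settings.
type MonitorConfig struct {
	Monitors map[string]MonitorSettings
}

// IsEnabled reports whether a monitor should run. Any monitor that is not
// explicitly disabled is treated as enabled.
func (c MonitorConfig) IsEnabled(name string) bool {
	s, ok := c.Monitors[name]
	if !ok || s.Enabled == nil {
		return true
	}
	return *s.Enabled
}

// Disabled returns, in sorted order, those of the given monitor names that
// are turned off by the configuration.
func (c MonitorConfig) Disabled(names []string) []string {
	var out []string
	for _, n := range names {
		if !c.IsEnabled(n) {
			out = append(out, n)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	off := false
	cfg := MonitorConfig{Monitors: map[string]MonitorSettings{
		"networking": {Enabled: &off},
	}}
	names := []string{"networking", "runtime", "nvidia"}
	for _, name := range names {
		fmt.Printf("plugin=%s enabled=%t\n", name, cfg.IsEnabled(name))
	}
	fmt.Println("disabled:", cfg.Disabled(names))
}
```

Keeping the lookup in one method means a monitor with no entry, or an entry without `enabled`, needs no special handling at the call site.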

Testing Done:

  1. Deployed NMA with the following values:
nodeAgent:
  image:
    override: $IMAGE_REGISTRY/$IMAGE_REPO:latest
    pullPolicy: IfNotPresent
  monitors:
    networking:
      enabled: false

Deploy like so:

make docker-build IMAGE_REGISTRY=$IMAGE_REGISTRY/$IMAGE_REPO:latest DOCKER_PLATFORMS=linux/amd64
make deploy HELM_EXTRA_FLAGS='-f local-values.yaml'

Logs:

> kubectl logs eks-node-monitoring-agent-xgmkb -n kube-system | grep configuration
{"level":"info","ts":"2026-03-02T21:03:22Z","msg":"monitor configuration","hostname":"ip-172-31-20-28.us-west-2.compute.internal","plugin":"neuron","enabled":true}
{"level":"info","ts":"2026-03-02T21:03:22Z","msg":"monitor configuration","hostname":"ip-172-31-20-28.us-west-2.compute.internal","plugin":"storage-monitor","enabled":true}
{"level":"info","ts":"2026-03-02T21:03:22Z","msg":"monitor configuration","hostname":"ip-172-31-20-28.us-west-2.compute.internal","plugin":"nvidia","enabled":true}
{"level":"info","ts":"2026-03-02T21:03:22Z","msg":"monitor configuration","hostname":"ip-172-31-20-28.us-west-2.compute.internal","plugin":"networking","enabled":false}
{"level":"info","ts":"2026-03-02T21:03:22Z","msg":"monitor configuration","hostname":"ip-172-31-20-28.us-west-2.compute.internal","plugin":"runtime","enabled":true}
{"level":"info","ts":"2026-03-02T21:03:22Z","msg":"monitor configuration","hostname":"ip-172-31-20-28.us-west-2.compute.internal","plugin":"kernel-monitor","enabled":true}
{"level":"info","ts":"2026-03-02T21:03:22Z","msg":"monitors disabled by configuration","hostname":"ip-172-31-20-28.us-west-2.compute.internal","plugins":["networking"]}

Confirmed that the NetworkingReady condition is still True:

Conditions:
  Type                    Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                    ------  -----------------                 ------------------                ------                       -------
  MemoryPressure          False   Mon, 02 Mar 2026 21:02:39 +0000   Wed, 11 Feb 2026 01:48:46 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure            False   Mon, 02 Mar 2026 21:02:39 +0000   Wed, 11 Feb 2026 01:48:46 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure             False   Mon, 02 Mar 2026 21:02:39 +0000   Wed, 11 Feb 2026 01:48:46 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                   True    Mon, 02 Mar 2026 21:02:39 +0000   Sat, 14 Feb 2026 03:10:01 +0000   KubeletReady                 kubelet is posting ready status
  NetworkingReady         True    Mon, 02 Mar 2026 21:00:41 +0000   Mon, 02 Mar 2026 20:55:41 +0000   NetworkingIsReady            Monitoring for the Networking system is active
  KernelReady             True    Mon, 02 Mar 2026 21:03:22 +0000   Mon, 02 Mar 2026 21:03:22 +0000   KernelIsReady                Monitoring for the Kernel system is active
  ContainerRuntimeReady   True    Mon, 02 Mar 2026 21:03:22 +0000   Mon, 02 Mar 2026 21:03:22 +0000   ContainerRuntimeIsReady      Monitoring for the ContainerRuntime system is active
  StorageReady            True    Mon, 02 Mar 2026 21:03:22 +0000   Mon, 02 Mar 2026 21:03:22 +0000   DiskIsReady                  Monitoring for the Disk system is active
  2. Deleted the ConfigMap. Expected behaviour is that all monitors are enabled:
> kubectl logs eks-node-monitoring-agent-tdfqx -n kube-system | grep config
{"level":"info","ts":"2026-03-02T21:13:17Z","msg":"monitor config file not found, all monitors will be enabled by default","hostname":"ip-172-31-20-28.us-west-2.compute.internal","path":"/etc/nma/config.yaml"}
{"level":"info","ts":"2026-03-02T21:13:17Z","msg":"monitor configuration","hostname":"ip-172-31-20-28.us-west-2.compute.internal","plugin":"networking","enabled":true}
{"level":"info","ts":"2026-03-02T21:13:17Z","msg":"monitor configuration","hostname":"ip-172-31-20-28.us-west-2.compute.internal","plugin":"runtime","enabled":true}
{"level":"info","ts":"2026-03-02T21:13:17Z","msg":"monitor configuration","hostname":"ip-172-31-20-28.us-west-2.compute.internal","plugin":"kernel-monitor","enabled":true}
{"level":"info","ts":"2026-03-02T21:13:17Z","msg":"monitor configuration","hostname":"ip-172-31-20-28.us-west-2.compute.internal","plugin":"neuron","enabled":true}
{"level":"info","ts":"2026-03-02T21:13:17Z","msg":"monitor configuration","hostname":"ip-172-31-20-28.us-west-2.compute.internal","plugin":"storage-monitor","enabled":true}
{"level":"info","ts":"2026-03-02T21:13:17Z","msg":"monitor configuration","hostname":"ip-172-31-20-28.us-west-2.compute.internal","plugin":"nvidia","enabled":true}
  3. Unit tests for pkg/config:
go test ./pkg/config/ -v -run TestLoadMonitorConfig 2>&1
=== RUN   TestLoadMonitorConfig_NonExistentFile
--- PASS: TestLoadMonitorConfig_NonExistentFile (0.00s)
=== RUN   TestLoadMonitorConfig_ValidFileOneDisabled
--- PASS: TestLoadMonitorConfig_ValidFileOneDisabled (0.00s)
=== RUN   TestLoadMonitorConfig_InvalidYAML
--- PASS: TestLoadMonitorConfig_InvalidYAML (0.00s)
=== RUN   TestLoadMonitorConfig_UnknownPluginName
--- PASS: TestLoadMonitorConfig_UnknownPluginName (0.00s)
=== RUN   TestLoadMonitorConfig_EmptyFile
--- PASS: TestLoadMonitorConfig_EmptyFile (0.00s)
=== RUN   TestLoadMonitorConfig_StrictUnmarshalRejectsUnknownFields
--- PASS: TestLoadMonitorConfig_StrictUnmarshalRejectsUnknownFields (0.00s)
PASS
ok      github.com/aws/eks-node-monitoring-agent/pkg/config     0.006s

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@shvbsle shvbsle requested a review from prasad0896 February 25, 2026 04:51
Comment thread cmd/eks-node-monitoring-agent/main.go
Comment thread pkg/config/monitor.go Outdated

@prasad0896 prasad0896 left a comment

Can we also do either of the two:
a. Add a doc to examples/ to provide an actual example of how monitors can be disabled and all the places that need to be updated.
b. Update readme with instructions on configuring monitors for the agent.

@prasad0896

/ci

@github-actions

@prasad0896 roger that! I've dispatched a workflow. 👍

@github-actions

@prasad0896 the workflow that you requested has completed.

| K8s Version | Arch | Instance Type | Result | Details |
| --- | --- | --- | --- | --- |
| 1.29 | amd64 | t3.medium | failure ❌ | logs |
| 1.30 | amd64 | t3.medium | failure ❌ | logs |
| 1.31 | amd64 | t3.medium | failure ❌ | logs |
| 1.32 | amd64 | t3.medium | success ✅ | logs |
| 1.33 | amd64 | t3.medium | failure ❌ | logs |
| 1.34 | amd64 | t3.medium | success ✅ | logs |
| 1.35 | amd64 | t3.medium | success ✅ | logs |

⚠️ 4/7 version(s) failed

@shvbsle

shvbsle commented Feb 26, 2026

The CI is flaky when run against all k8s versions because account-level limits cause some clusters to time out while coming up. I'm going to test on one version only for now and figure out a way to make CI more stable.

@shvbsle

shvbsle commented Feb 26, 2026

/ci
+workflow:k8s_versions 1.34

@github-actions

@shvbsle roger that! I've dispatched a workflow. 👍

@github-actions

@shvbsle the workflow that you requested has completed.

| K8s Version | Arch | Instance Type | Result | Details |
| --- | --- | --- | --- | --- |
| 1.34 | amd64 | t3.medium | success ✅ | logs |

🎉 1/1 version(s) passed

Comment thread charts/eks-node-monitoring-agent/values.yaml
@shvbsle shvbsle merged commit 019a715 into aws:main Mar 2, 2026
2 checks passed
@shvbsle shvbsle deleted the monconfig branch March 2, 2026 23:19