
Collector netclass/bonding leads to scrape timeouts #1841

@sepich

Description


Host operating system: output of uname -a

Linux 4.4.207-1.el7.elrepo.x86_64 #1 SMP Sat Dec 21 08:00:19 EST 2019 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

prom/node-exporter:v1.0.1

node_exporter command line flags

            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            - --path.rootfs=/rootfs
            - --collector.netclass.ignored-devices=^(lo|docker[0-9]|kube-ipvs0|dummy0|kube-dummy-if|veth.+|br\-.+|cali\w{11}|tunl0|tun\-.+)$
            - --collector.netdev.device-blacklist=^(lo|docker[0-9]|kube-ipvs0|dummy0|kube-dummy-if|veth.+|br\-.+|cali\w{11}|tunl0|tun\-.+)$
            - --collector.filesystem.ignored-mount-points=^/(dev|sys|proc|host|etc|var/lib/kubelet|var/lib/docker/.+|home/.+|data/local-pv/.+)($|/)
            - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|efivarfs|tmpfs|nsfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rootfs|rpc_pipefs|securityfs|sysfs|tracefs)$
            - --collector.diskstats.ignored-devices=^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p|dm-|sr|nbd)\d+$
            - --collector.netstat.fields=^(.*_(InErrors|InErrs)|Ip_Forwarding|Ip(6|Ext)_(InOctets|OutOctets)|Icmp6?_(InMsgs|OutMsgs)|TcpExt_(Listen.*|Syncookies.*|TCPSynRetrans|TCPRcvCollapsed|PruneCalled|RcvPruned)|Tcp_(ActiveOpens|InSegs|OutSegs|PassiveOpens|RetransSegs|CurrEstab)|Udp6?_(InDatagrams|OutDatagrams|NoPorts|RcvbufErrors|SndbufErrors))$
            - --no-collector.systemd
            - --no-collector.bcache
            - --no-collector.infiniband
            - --no-collector.wifi
            - --no-collector.ipvs

Are you running node_exporter in Docker?

Yes, in k8s as a DaemonSet
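
The relevant part of the manifest looks roughly like this (a sketch; object names, labels, and volume names are illustrative, and most of the args listed above are elided):

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: node-exporter              # illustrative name
    spec:
      selector:
        matchLabels: {app: node-exporter}
      template:
        metadata:
          labels: {app: node-exporter}
        spec:
          hostNetwork: true
          hostPID: true
          containers:
            - name: node-exporter
              image: prom/node-exporter:v1.0.1
              args:
                - --path.procfs=/host/proc
                - --path.sysfs=/host/sys
                - --path.rootfs=/rootfs
                # ... remaining flags as listed above ...
              ports:
                - containerPort: 9100
              volumeMounts:            # host filesystems matching the --path.* flags
                - {name: proc, mountPath: /host/proc, readOnly: true}
                - {name: sys, mountPath: /host/sys, readOnly: true}
                - {name: root, mountPath: /rootfs, readOnly: true}
          volumes:
            - {name: proc, hostPath: {path: /proc}}
            - {name: sys, hostPath: {path: /sys}}
            - {name: root, hostPath: {path: /}}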

What did you do that produced an error?

We're using scrape_interval: 15s and scrape_timeout: 15s on the Prometheus side, and noticed that some nodes have holes in their graphs:
[screenshot: node graphs with gaps]
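
For reference, the relevant part of the Prometheus config is roughly this (job name and target are placeholders; we actually use Kubernetes service discovery):

    scrape_configs:
      - job_name: node-exporter        # illustrative
        scrape_interval: 15s
        scrape_timeout: 15s            # equal to the interval, so one slow collector fails the whole scrape
        static_configs:
          - targets: ['node1:9100']    # placeholder target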
This turns out to be caused by long scrape times in the bonding and netclass collectors:
[graph: node_scrape_collector_duration_seconds by collector]
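
A per-collector duration alert makes the slow nodes easy to spot; a minimal sketch of a rule file (the 10s threshold and all names are our own choices):

    groups:
      - name: node-exporter-scrape-health
        rules:
          - alert: NodeCollectorSlow
            # a single collector eating most of the 15s scrape budget
            expr: node_scrape_collector_duration_seconds > 10
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: 'collector {{ $labels.collector }} on {{ $labels.instance }} takes {{ $value }}s'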
Sometimes it even gets as bad as this:

# time curl -s localhost:9100/metrics >/dev/null

real	0m42.589s
user	0m0.003s
sys	0m0.005s

If we disable these collectors:

            - --no-collector.bonding
            - --no-collector.netclass

then the holes disappear (visible in the graphs above after 17:30).

What did you expect to see?

The bonding collector's metrics are very valuable for us; currently we have to produce the same metrics via the textfile collector and a custom script.
Would it be possible to add a configurable timeout to node_exporter, so that at least the metrics that are ready get returned instead of the whole scrape failing?
In that case, collectors that hit the timeout should probably also set node_scrape_collector_success=0 so the issue is not hidden.
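
That would keep the failure visible and alertable, e.g. with a rule like this sketch (names are illustrative):

    groups:
      - name: node-exporter-collectors
        rules:
          - alert: NodeCollectorFailed
            expr: node_scrape_collector_success == 0
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: 'collector {{ $labels.collector }} failing on {{ $labels.instance }}'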
Thank you.
