Problem
cAdvisor emits duplicate Prometheus samples for container_fs_* disk I/O metrics when multiple block devices cannot be resolved from major/minor IDs to device names.
In that case, several per-disk stats are exported with the same labelset, usually:
device="", id="/", image="", name=""
The Prometheus client rejects the scrape/gather with:
collected metric "... " was collected before with the same name and label values
This causes the entire gather to fail, so unrelated metrics such as memory and CPU may not be exported by consumers using cAdvisor as a library.
Environment
- cAdvisor version: v0.56.2
- Runtime: Docker
- Host: Linux VM
- cgroup mode: cgroup v2
- Deployment: cAdvisor embedded in an agent process, running inside Docker with host PID/network and Docker socket mounted
The issue was observed when the agent was run in Docker on a VM.
Docker run flags:
--pid=host \
--net=host \
--cgroupns=host \
--privileged \
-v /:/rootfs:ro \
-v /proc:/host/proc:ro \
-v /sys:/sys:ro \
-v /var/run:/host/var/run:ro \
-v /var/run/docker.sock:/var/run/docker.sock:ro \
-v /var/lib/docker:/var/lib/docker:ro \
-v /dev/disk:/dev/disk:ro
The same agent does not reproduce this warning in a containerd Kubernetes environment.
Actual Behavior
The Prometheus gather fails with duplicate container_fs_* samples. Example error:
gather failed: 44 error(s) occurred:
* collected metric "container_fs_reads_bytes_total" {
label:{name:"device" value:""}
label:{name:"id" value:"/"}
label:{name:"image" value:""}
label:{name:"name" value:""}
label:{name:"runtime_container_id" value:""}
counter:{value:762880}
} was collected before with the same name and label values
The same happens for metrics such as:
container_fs_reads_bytes_total
container_fs_reads_total
container_fs_writes_bytes_total
container_fs_writes_total
Because the Prometheus gatherer rejects duplicate series, the scrape/gather fails as a whole.
Expected Behavior
cAdvisor should not emit duplicate Prometheus samples with identical metric name and identical labels.
If cAdvisor cannot resolve a block device major/minor pair to a device path, it should still preserve uniqueness or skip the unresolved stat. For example, unresolved devices could use a stable fallback label value such as:
device="MAJOR:MINOR"
or:
device="unknown:MAJOR:MINOR"
Suspected Cause
The problematic path appears to be:
- Docker disk stats call AssignDeviceNamesToDiskStats
- Device names are resolved from major/minor IDs
- If resolution fails, the device string becomes empty
- Prometheus export uses only device as the extra label for these disk I/O metrics
- Multiple unresolved devices therefore collapse into the same device="" time series
Relevant source locations:
- ioValues() emits stat.Device directly as the device label: https://github.com/google/cadvisor/blob/v0.56.2/metrics/prometheus.go#L53-L75
- container_fs_reads_bytes_total and related metrics only use device as the extra label: https://github.com/google/cadvisor/blob/v0.56.2/metrics/prometheus.go#L609-L637
- Docker stats assign device names via AssignDeviceNamesToDiskStats: https://github.com/google/cadvisor/blob/v0.56.2/container/docker/fs.go#L43-L65
- deviceIdentifierMap.Find() currently caches and returns an empty string when DeviceName() cannot resolve the major/minor pair: https://github.com/google/cadvisor/blob/v0.56.2/container/common/helpers.go#L372-L426
Proposed Fix
When device-name resolution fails, do not return an empty Device.
A possible fix is to make the fallback deterministic and unique per major/minor pair, for example:
s, ok := namer.DeviceName(major, minor)
if !ok || s == "" {
    // Fall back to a stable, unique "major:minor" string.
    s = fmt.Sprintf("%d:%d", major, minor)
}
This would preserve metric uniqueness and keep the data usable.
Alternative fixes could be:
- drop unresolved per-disk stats
- aggregate unresolved stats before exporting
- add major/minor labels to the affected metrics
The least disruptive option seems to be a stable fallback device label because it avoids changing the metric schema while preventing duplicate samples.
Related Issues
This seems related to the broader problem that container_fs_* metrics only expose device as the disk discriminator:
Duplicate samples cause Prometheus gather failure, which can prevent consumers from exporting unrelated metrics such as container memory and CPU.
I am happy to send a PR if the fallback-device-label approach is acceptable.