Skip to content

Redesign metrics system with thread-safety in mind #2899

@alindima

Description

@alindima

The metrics system is not fully thread-safe at the moment, due to some issues:

  1. IncMetrics inner state is mutated on serialisation. This causes race conditions when the write() function is called from multiple threads. See: Fix signal handler race condition on metrics.write() #2893
  2. While SharedIncMetrics use atomics, they always use Relaxed ordering. While on x86 memory access has Acquire-Release semantics, on Arm this is not the case. Hence, the process of writing metrics to file may use outdated values.
  3. Metrics are written from the signal handler, which may cause a deadlock if a thread is preempted by a signal while it was holding the metrics file lock.

Problems 1 and 3 can be fully solved by removing metrics usage from the signal handler. One option here is to have a special file used for logging the exit reason and the latest metrics values, similar to a coredump. We should also enforce that METRICS.write() is called from a single thread (and therefore removing the lazy_static declaration).

Problem 2 could be solved in two ways: by using tighter ordering constraints (need some further dive deep and may incur some overhead due to CPU reordering constraints and prevention of certain compiler optimisations) or by redesigning the metrics system to use per-thread values (this would also solve problem 1).

Another thing to keep in mind is potential need for having device-specific instances of a metric. For example, METRICS.net.tx_bytes_count may have sense to be reported per-device instance instead of being aggregated.

Metadata

Metadata

Assignees

Labels

Status: ParkedIndicates that an issues or pull request will be revisited laterType: EnhancementIndicates new feature requests

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions