-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Description
The metrics system is not fully thread-safe at the moment, due to some issues:
IncMetricsinner state is mutated on serialisation. This causes race conditions when thewrite()function is called from multiple threads. See: Fix signal handler race condition on metrics.write() #2893- While
SharedIncMetricsuse atomics, they always useRelaxedordering. While on x86 memory access has Acquire-Release semantics, on Arm this is not the case. Hence, the process of writing metrics to file may use outdated values. - Metrics are written from the signal handler, which may cause a deadlock if a thread is preempted by a signal while it was holding the metrics file lock.
Problems 1 and 3 can be fully solved by removing metrics usage from the signal handler. One option here is to have a special file used for logging the exit reason and the latest metrics values, similar to a coredump. We should also enforce that METRICS.write() is called from a single thread (and therefore removing the lazy_static declaration).
Problem 2 could be solved in two ways: by using tighter ordering constraints (need some further dive deep and may incur some overhead due to CPU reordering constraints and prevention of certain compiler optimisations) or by redesigning the metrics system to use per-thread values (this would also solve problem 1).
Another thing to keep in mind is potential need for having device-specific instances of a metric. For example, METRICS.net.tx_bytes_count may have sense to be reported per-device instance instead of being aggregated.