Reliability Dashboard - Align rpc_transaction histogram buckets with substrate_sub_txpool_timing_event buckets for gossip metrics analysis #10067

@olliecorbisiero

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Motivation

substrate_sub_txpool_timing_event metrics represent total (RPC + gossip) transactions; added via #7505.

rpc_transaction metrics represent RPC transactions only; added via #8345.

To isolate and observe latencies for gossip transactions, the bucket boundaries of substrate_sub_txpool_timing_event and rpc_transaction must be identical for each event type. Only then can the two histograms be subtracted bucket by bucket (gossip_histogram = total_histogram - rpc_histogram).

Current Impact: Misaligned buckets prevent histogram subtraction, making it impossible to isolate gossip transaction latencies for observability and debugging.
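A minimal sketch of the intended subtraction, assuming each histogram is available as cumulative bucket counts keyed by upper bound (`le`), which is how Prometheus exposes `_bucket` series. Because cumulative counts are additive, the gossip count at each boundary is simply total minus RPC, but only when both histograms have exactly the same boundaries. The helper name `subtract_histograms` and all sample counts are hypothetical.

```python
import math
from typing import Dict

# Cumulative histogram: upper bound ("le") -> number of observations <= that bound.
Histogram = Dict[float, int]


def subtract_histograms(total: Histogram, rpc: Histogram) -> Histogram:
    """Return gossip = total - rpc, computed bucket by bucket.

    Raises ValueError when the two histograms do not share identical
    bucket boundaries, since there is then no bucket-wise correspondence.
    """
    if set(total) != set(rpc):
        mismatched = sorted(set(total) ^ set(rpc))
        raise ValueError(f"bucket boundaries differ: {mismatched}")
    return {le: total[le] - rpc[le] for le in sorted(total)}


# Hypothetical counts with aligned boundaries: subtraction works.
total = {3.0: 10, 6.0: 15, math.inf: 20}
rpc = {3.0: 4, 6.0: 5, math.inf: 8}
print(subtract_histograms(total, rpc))  # {3.0: 6, 6.0: 10, inf: 12}

# Hypothetical counts with a 0.01 offset on the RPC side: subtraction is refused.
rpc_offset = {3.01: 4, 6.01: 5, math.inf: 8}
try:
    subtract_histograms(total, rpc_offset)
except ValueError as err:
    print(err)
```

The same reasoning applies to per-second rates of the bucket counters, so the check is about boundaries, not about the values themselves.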

Bucket Alignment Analysis

| Event Type | Substrate Buckets | RPC Buckets | Aligned? | Issues |
|---|---|---|---|---|
| dropped | 0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 75, 90, 120, 180, +Inf | 0.01, 3.01, 6.01, 9.01, 12.01, 15.01, 18.01, 21.01, 24.01, 27.01, 30.01, 33.01, 36.01, 39.01, 42.01, 45.01, 48.01, 51.01, 54.01, 57.01, +Inf | ❌ NO | 0.01 offset; Missing: 60, 75, 90, 120, 180 |
| finalized | 0, 5, 10, 15, 20, 25, 30, 35, 40, 80, 120, 160, 200, 240, 280, 320, 360, 400, 440, 480, 520, 560, 600, 640, 680, 720, 760, +Inf | 0.01, 40.01, 80.01, 120.01, 160.01, 200.01, 240.01, 280.01, 320.01, 360.01, 400.01, 440.01, 480.01, 520.01, 560.01, 600.01, 640.01, 680.01, 720.01, 760.01, +Inf | ❌ NO | 0.01 offset bug; Missing: 5, 10, 15, 20, 25, 30, 35 |
| in_block | 0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 75, 90, 120, 180, +Inf | 0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, +Inf | ⚠️ PARTIAL | Core buckets match; Missing: 60, 75, 90, 120, 180 |
| invalid | 0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 75, 90, 120, 180, +Inf | 0.01, 3.01, 6.01, 9.01, 12.01, 15.01, 18.01, 21.01, 24.01, 27.01, 30.01, 33.01, 36.01, 39.01, 42.01, 45.01, 48.01, 51.01, 54.01, 57.01, +Inf | ❌ NO | 0.01 offset; Missing: 60, 75, 90, 120, 180 |
| ready/validation | 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24, 20.48, 40.96, 81.92, 163.84, 327.68, +Inf | 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24, 20.48, 40.96, 81.92, 163.84, 327.68, +Inf | ✅ YES | Perfect match |
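The "Issues" column of the bucket alignment table can be reproduced mechanically. The sketch below uses a hypothetical helper, `classify_alignment`, which sorts every Substrate boundary into an exact match, a match only after adding the 0.01 offset, or missing. The bucket lists are transcribed from that table (+Inf omitted, since both sides have it), and rounding to six decimals absorbs floating-point noise such as `57 + 0.01`.

```python
from typing import List, Tuple


def classify_alignment(
    substrate: List[float], rpc: List[float], offset: float = 0.01
) -> Tuple[List[float], List[float], List[float]]:
    """Split Substrate boundaries into (exact, offset-only, missing) relative to RPC."""
    rpc_set = {round(b, 6) for b in rpc}
    exact, shifted, missing = [], [], []
    for b in substrate:
        if round(b, 6) in rpc_set:
            exact.append(b)    # same boundary on both sides
        elif round(b + offset, 6) in rpc_set:
            shifted.append(b)  # present in RPC only as b + offset
        else:
            missing.append(b)  # no RPC counterpart at all
    return exact, shifted, missing


# Substrate list shared by dropped, in_block and invalid in the table (+Inf omitted).
SUBSTRATE_LINEAR_3S = [3.0 * i for i in range(20)] + [60.0, 75.0, 90.0, 120.0, 180.0]
RPC_DROPPED = [3.0 * i + 0.01 for i in range(20)]
RPC_IN_BLOCK = [3.0 * i for i in range(20)]
SUBSTRATE_FINALIZED = [5.0 * i for i in range(8)] + [40.0 * i for i in range(1, 20)]
RPC_FINALIZED = [0.01] + [40.0 * i + 0.01 for i in range(1, 20)]

for name, sub, rpc_bounds in [
    ("dropped", SUBSTRATE_LINEAR_3S, RPC_DROPPED),
    ("in_block", SUBSTRATE_LINEAR_3S, RPC_IN_BLOCK),
    ("finalized", SUBSTRATE_FINALIZED, RPC_FINALIZED),
]:
    exact, shifted, missing = classify_alignment(sub, rpc_bounds)
    print(f"{name}: exact={len(exact)} offset={len(shifted)} missing={missing}")
```

For dropped this reports 20 offset-only boundaries and 60, 75, 90, 120, 180 as missing; for finalized it reports 5 through 35 as missing, matching the table.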

Additional Bucket Alignment Issue - Rejected Latencies

Beyond separating RPC from total latencies, we also need to aggregate dropped and finality_timeout into a single "rejected" latency measure.

| Event Type | Buckets | Use Case Impact |
|---|---|---|
| dropped | 0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, +Inf | Cannot combine with finality_timeout to show aggregate "rejected transaction" latencies |
| finality_timeout | 0, 40, 80, 120, 160, 200, 240, 280, 320, 360, 400, 440, 480, 520, 560, 600, 640, 680, 720, 760, +Inf | Different bucket structure prevents histogram addition in Prometheus |

Current Impact: Because the buckets differ, these two events cannot be combined into a single rejected-latency reading.
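To make the impact concrete: with cumulative bucket counts, two histograms can only be added exactly at boundaries they both have. The minimal sketch below uses a hypothetical helper, `add_on_common_buckets`, and made-up counts; with the current dropped and finality_timeout layouts the only shared boundaries are `0` and `+Inf`, so the combined "rejected" histogram carries little more than a total count.

```python
import math
from typing import Dict

Histogram = Dict[float, int]  # upper bound ("le") -> cumulative count


def add_on_common_buckets(a: Histogram, b: Histogram) -> Histogram:
    """Add two cumulative histograms, keeping only boundaries present in both.

    Summing cumulative counts at a shared boundary is exact; boundaries that
    only one side has cannot be reconstructed and are dropped.
    """
    return {le: a[le] + b[le] for le in sorted(set(a) & set(b))}


# Current boundaries from the rejected-latency table.
DROPPED_LE = [3.0 * i for i in range(20)] + [math.inf]
TIMEOUT_LE = [40.0 * i for i in range(20)] + [math.inf]

# Made-up, monotonically non-decreasing cumulative counts for illustration only.
dropped = {le: i for i, le in enumerate(DROPPED_LE)}
timeout = {le: 2 * i for i, le in enumerate(TIMEOUT_LE)}

rejected = add_on_common_buckets(dropped, timeout)
print(rejected)  # {0.0: 0, inf: 60}
```

If both histograms used one shared layout, the intersection would be the full boundary list and no resolution would be lost.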

Request

  1. Please align the following rpc_transaction histogram buckets to match their corresponding substrate_sub_txpool_timing_event buckets:
| RPC Metric | Should Match | RPC Bucket Changes Required |
|---|---|---|
| rpc_transaction_dropped_time_bucket | substrate_sub_txpool_timing_event_dropped_bucket | Remove 0.01 offset; Add: 60, 75, 90, 120, 180 |
| rpc_transaction_finalized_time_bucket | substrate_sub_txpool_timing_event_finalized_bucket | Remove 0.01 offset; Add: 5, 10, 15, 20, 25, 30, 35 |
| rpc_transaction_in_block_time_bucket | substrate_sub_txpool_timing_event_in_block_bucket | Add: 60, 75, 90, 120, 180 |
| rpc_transaction_invalid_time_bucket | substrate_sub_txpool_timing_event_invalid_bucket | Remove 0.01 offset; Add: 60, 75, 90, 120, 180 |
  2. Please align the substrate_sub_txpool_timing_event_finality_timeout_bucket histogram buckets with the substrate_sub_txpool_timing_event_dropped_bucket buckets:
| Metric | Current Buckets |
|---|---|
| substrate_sub_txpool_timing_event_dropped_bucket | 0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, +Inf |
| substrate_sub_txpool_timing_event_finality_timeout_bucket | 0, 40, 80, 120, 160, 200, 240, 280, 320, 360, 400, 440, 480, 520, 560, 600, 640, 680, 720, 760, +Inf |

Suggested unified bucket structure: 0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 70, 80, 120, 160, 200, 240, 280, 320, 360, 400, 440, 480, 520, 560, 600, 640, 680, 720, 760, +Inf
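For request 1, the target RPC boundaries can be generated rather than typed out. The sketch below uses a hypothetical Python helper, `linear_buckets`, intended to follow the semantics of the Rust `prometheus` crate's `linear_buckets(start, width, count)`; +Inf is left out because Prometheus client libraries normally append it automatically. The dictionary keys are the metric names from the request table without the `_bucket` suffix, and the resulting lists are meant to equal the Substrate columns of the bucket alignment table.

```python
from typing import Dict, List


def linear_buckets(start: float, width: float, count: int) -> List[float]:
    """Return `count` upper bounds starting at `start`, spaced `width` apart.

    Hypothetical mirror of the Rust prometheus crate's linear_buckets;
    it rejects count < 1 and width <= 0.
    """
    if count < 1 or width <= 0:
        raise ValueError("count must be >= 1 and width must be > 0")
    return [start + width * i for i in range(count)]


# Extra bounds above 57 requested for dropped, in_block and invalid.
LONG_TAIL = [60.0, 75.0, 90.0, 120.0, 180.0]

# Target boundaries per RPC metric (+Inf implicit).
TARGET_BUCKETS: Dict[str, List[float]] = {
    "rpc_transaction_dropped_time": linear_buckets(0.0, 3.0, 20) + LONG_TAIL,
    "rpc_transaction_in_block_time": linear_buckets(0.0, 3.0, 20) + LONG_TAIL,
    "rpc_transaction_invalid_time": linear_buckets(0.0, 3.0, 20) + LONG_TAIL,
    # 0, 5, ..., 35 followed by 40, 80, ..., 760
    "rpc_transaction_finalized_time": linear_buckets(0.0, 5.0, 8)
    + linear_buckets(40.0, 40.0, 19),
}

for metric, bounds in TARGET_BUCKETS.items():
    print(metric, len(bounds), bounds[0], bounds[-1])
```

Deriving all four lists from the same generator calls as the Substrate side would keep the two families from drifting apart again.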

Please also clarify the following metric mappings:

  1. Does rpc_transaction_validation_time_bucket correspond to substrate_sub_txpool_timing_event_ready_bucket? (Their buckets currently match exactly.)

  2. Which substrate_sub_txpool_timing_event event type does rpc_transaction_error_time_bucket correspond to?

  3. Do the following substrate_sub_txpool_timing_event metrics have RPC equivalents?

    • substrate_sub_txpool_timing_event_future_bucket
    • substrate_sub_txpool_timing_event_finality_timeout_bucket
    • substrate_sub_txpool_timing_event_retracted_bucket
    • substrate_sub_txpool_timing_event_usurped_bucket

Are you willing to help with this request?

Yes!

Metadata

Labels

I10-unconfirmed: Issue might be valid, but it's not yet known.
I5-enhancement: An additional feature request.
