feat(gossipsub): upgrade internal Behaviour Handler message queue #570

Merged
jxs merged 36 commits into sigp:sigp-gossipsub from jxs:rework-behaviour-handler-message-dispatch on Aug 20, 2025

Conversation

@jxs
Member

@jxs jxs commented Mar 3, 2025

Description

This started with an attempt to solve libp2p#5751 using the previous internal async-channel.
After multiple ideas were discussed out of band, replacing the async-channel with a more tailored internal priority queue seemed inevitable.
This priority queue lets us cancel in-flight messages that a peer has flagged with IDONTWANT very cleanly, using the retain_mut function.
Clearing stale messages likewise becomes simpler, as it also uses retain_mut.
It has the added advantage that we only need a single priority queue, which makes the code simpler.
If a peer is not making progress, we can assume it's not delivering High priority messages and we can penalize it.
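
To make the idea above concrete, here is a minimal sketch of a single priority queue that supports cancellation. It is not the code from this PR: `Msg`, `Entry`, `Queue` and `cancel` are hypothetical names, and it uses the standard library's `BinaryHeap::retain` (Rust 1.70+) in place of the PR's `retain_mut`.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Hypothetical message type; the real gossipsub RPC type has more variants.
#[derive(Debug, PartialEq, Eq)]
enum Msg {
    /// Control message: high priority, never cancelled.
    Graft(String),
    /// Data message: low priority, may be cancelled via IDONTWANT.
    Publish { id: u64 },
}

#[derive(Debug, PartialEq, Eq)]
struct Entry {
    high: bool,
    seq: u64, // insertion counter, keeps FIFO order within a priority level
    msg: Msg,
}

impl Ord for Entry {
    fn cmp(&self, other: &Self) -> Ordering {
        // High priority sorts first; within a level, the older entry wins.
        self.high
            .cmp(&other.high)
            .then_with(|| other.seq.cmp(&self.seq))
    }
}

impl PartialOrd for Entry {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

#[derive(Default)]
struct Queue {
    heap: BinaryHeap<Entry>,
    next_seq: u64,
}

impl Queue {
    fn push(&mut self, msg: Msg) {
        let high = matches!(msg, Msg::Graft(_));
        let seq = self.next_seq;
        self.next_seq += 1;
        self.heap.push(Entry { high, seq, msg });
    }

    fn pop(&mut self) -> Option<Msg> {
        self.heap.pop().map(|entry| entry.msg)
    }

    /// Drop queued data messages whose ids the peer said it does not want.
    fn cancel(&mut self, unwanted: &[u64]) {
        self.heap.retain(|entry| match &entry.msg {
            Msg::Publish { id } => !unwanted.contains(id),
            Msg::Graft(_) => true,
        });
    }
}
```

Calling `cancel` walks every queued entry, which is the iteration cost raised in the open questions below.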

Notes & open questions

I haven't performance tested this, but plan to do so with lighthouse if you agree this should be the path forward.
I am curious whether iterating over all the messages to remove the IDONTWANT'ed and stale ones affects overall performance.
Will also add tests to the queue once the design is finished.

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

Member

@AgeManning AgeManning left a comment

Nice! I like the new queue being able to modify it from the behaviour side.

However I think there is a drawback to this approach and also I didn't understand how this queue can remove the priority/non-priority logic.

In the current version, we have a priority and a non-priority queue. The priority queue is reserved for messages that simply cannot fail, and for which timing is not important. For example, GRAFT/PRUNE/SUBSCRIBE/UNSUBSCRIBE. It's fine if these messages go out late, but it's not okay if we just have some internal error and we never send them.

For example, if we have PRUNED someone from our mesh but never tell them, the bi-directionality of the mesh is broken, peers can never know whether we are in other people's meshes, and a lot of the principles of the network break down.

If I've understood this PR, we are now grouping these messages into the same queue as normal publish/forward messages, and this queue is bounded. We can now drop these priority messages if, for example, the user is sending lots of messages. This wasn't possible before and I think this is a big problem. I think we still need the priority queue, which is unbounded and cannot fail, so that these very important messages always get sent, albeit possibly late.

The second drawback to this approach is that I don't think we can actually stop true in-flight messages. We can remove messages that have been sent from the behaviour and are waiting for the handler to send them out, but for large messages that we have already started sending, we can't cancel them in the behaviour. I don't think this is a big issue tho, maybe it's the queue that is the concern and not the actual sending of the messages.
When we were discussing this problem, I was imagining that when the handler reaches:

Some(OutboundSubstreamState::PendingFlush(mut substream)) => {

and that message has been canceled, we close the substream and stop sending the in-flight message. However, now that I think about it, closing the substream would probably constitute an error, so perhaps there is no actual way of stopping partially sent messages with the current gossipsub spec.

Comment thread protocols/gossipsub/src/behaviour.rs
Comment thread protocols/gossipsub/src/behaviour.rs
Comment thread protocols/gossipsub/src/queue.rs Outdated
Comment thread protocols/gossipsub/src/queue.rs Outdated
Comment thread protocols/gossipsub/src/queue.rs Outdated
Comment thread protocols/gossipsub/src/queue.rs Outdated
Comment thread protocols/gossipsub/src/behaviour.rs
Comment thread protocols/gossipsub/src/behaviour.rs Outdated
Comment thread protocols/gossipsub/src/handler.rs Outdated
@AgeManning
Member

I went back to look at this and realized that push() on the binary heap has O(1) expected complexity, which is really nice. It does the prioritization for us, negating the need for a second queue. 😍

The only thing I think we might need to modify is to allow priority messages to ignore the queue's capacity. We shouldn't be generating these messages in volumes that would cause significant memory concerns. If memory is a concern and we want to cap the queue, we should drop the peer at some limit.

I.e., we are never in a state where we are connected to a peer and have thrown away a priority message. If we are worried about memory, we should in the worst case kick/drop/ban the peer before we throw away a priority message.

If we go this route, we should be able to bring back the unreachable statement regarding a failed priority message.
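
A rough sketch of the capacity rule suggested above, under the assumption that only low-priority pushes are checked against the bound. `Priority`, `BoundedQueue` and `try_push` are hypothetical names for illustration, not the PR's API, and the peer-dropping limit is left out.

```rust
use std::collections::VecDeque;

/// Hypothetical two-level priority; not the PR's actual type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Priority {
    High,
    Low,
}

#[derive(Debug)]
struct BoundedQueue<T> {
    items: VecDeque<(Priority, T)>,
    /// Bound applied to low-priority pushes only.
    capacity: usize,
}

impl<T> BoundedQueue<T> {
    fn new(capacity: usize) -> Self {
        Self {
            items: VecDeque::new(),
            capacity,
        }
    }

    /// High-priority items are always accepted. Low-priority items are
    /// handed back to the caller when the queue is already at capacity.
    fn try_push(&mut self, priority: Priority, item: T) -> Result<(), T> {
        if priority == Priority::Low && self.items.len() >= self.capacity {
            return Err(item);
        }
        self.items.push_back((priority, item));
        Ok(())
    }

    fn len(&self) -> usize {
        self.items.len()
    }
}
```

With this shape a failed push can only ever return a low-priority item, which is what would make an unreachable branch for failed priority messages defensible.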


@elenaf9 elenaf9 left a comment

I like the approach of filtering the IDONTWANT-ed messages from the queue directly!

But I am wondering if we really need a priority queue if only two priority levels are used.
What's the advantage of it compared to having two separate FIFO queues for priority and non-priority messages? The retain logic could still be implemented for them, but the push/pop operations would be faster, and we could directly use VecDeque::retain_mut.
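
For comparison, a minimal sketch of the two-FIFO alternative described above, using the standard library's `VecDeque::retain_mut`. The `Msg` and `TwoQueues` types and the `ttl` field are assumptions made for illustration and do not come from the PR.

```rust
use std::collections::VecDeque;

/// Hypothetical queued message; `ttl` counts remaining heartbeats.
#[derive(Debug, PartialEq)]
struct Msg {
    id: u64,
    ttl: u8,
}

#[derive(Default)]
struct TwoQueues {
    priority: VecDeque<Msg>,
    non_priority: VecDeque<Msg>,
}

impl TwoQueues {
    /// O(1): drain the priority queue first, then the non-priority one.
    fn pop(&mut self) -> Option<Msg> {
        self.priority
            .pop_front()
            .or_else(|| self.non_priority.pop_front())
    }

    /// Age non-priority messages and drop the ones that are stale
    /// (ttl reached zero) or that the peer flagged as unwanted.
    fn prune(&mut self, unwanted: &[u64]) {
        self.non_priority.retain_mut(|msg| {
            msg.ttl = msg.ttl.saturating_sub(1);
            msg.ttl > 0 && !unwanted.contains(&msg.id)
        });
    }
}
```

Push and pop are constant time here, at the cost of maintaining two queues; the retain pass is still linear in the number of queued data messages.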

Comment thread protocols/gossipsub/src/handler.rs Outdated
Comment thread protocols/gossipsub/src/queue.rs
Comment thread protocols/gossipsub/src/types.rs Outdated
@jxs jxs force-pushed the sigp-gossipsub branch from 3e24b1b to 7a36e4c Compare March 13, 2025 14:46
@jxs jxs changed the title from "feature(gossipsub): switch internal async-channel," to "feat(gossipsub): switch internal async-channel," Mar 21, 2025
@jxs jxs force-pushed the sigp-gossipsub branch from 7a36e4c to 2500889 Compare March 21, 2025 14:42
Member

@AgeManning AgeManning left a comment

This is looking good to me.

From our discussions tho, just to make sure my understanding is correct:

Grouping the priority and non-priority messages into a single queue makes the code a bit nicer, but it costs us O(log(n)) when pop'ing elements vs O(1) with two queues, right?

I'm fine with the trade-off if it's intended and you guys are too.

Comment thread protocols/gossipsub/src/metrics.rs
Member

@AgeManning AgeManning left a comment

Couldn't seem to link it in the review.
But line 371 in metrics.rs and below need to be removed:

        let non_priority_queue_size = Histogram::new(linear_buckets(0.0, 25.0, 100));
        registry.register(
            "non_priority_queue_size",
            "Histogram of observed non-priority queue sizes",
            non_priority_queue_size.clone(),
        );

@jxs jxs force-pushed the rework-behaviour-handler-message-dispatch branch from 4d02fea to 36b3e4d Compare May 20, 2025 11:03
@jxs jxs force-pushed the sigp-gossipsub branch from 61b2820 to 77aa836 Compare June 22, 2025 18:26
@AgeManning
Member

I think this mostly looks good to me. Just got to update it and we can think about merging?

Comment thread protocols/gossipsub/src/types.rs Outdated
Comment thread protocols/gossipsub/src/queue.rs Outdated
@jxs jxs closed this Aug 5, 2025
@jxs jxs force-pushed the sigp-gossipsub branch 2 times, most recently from 06acb4c to 20e6ade Compare August 13, 2025 15:44
@jxs jxs force-pushed the rework-behaviour-handler-message-dispatch branch 3 times, most recently from 3ed3a84 to 5640680 Compare August 15, 2025 07:30
and record them during heartbeat.
@jxs jxs force-pushed the rework-behaviour-handler-message-dispatch branch from 5640680 to c3d4a2b Compare August 15, 2025 08:24
@jxs jxs force-pushed the rework-behaviour-handler-message-dispatch branch from 62a6993 to 4263216 Compare August 18, 2025 23:03
@jxs jxs force-pushed the rework-behaviour-handler-message-dispatch branch 3 times, most recently from 2b85d67 to ec90b87 Compare August 20, 2025 12:18
@jxs jxs force-pushed the rework-behaviour-handler-message-dispatch branch from ec90b87 to 119bef5 Compare August 20, 2025 14:41
Member Author

@jxs jxs left a comment

Thanks all for the reviews!

@jxs jxs changed the title from "feat(gossipsub): switch internal async-channel," to "feat(gossipsub): upgrade internal Behaviour Handler message queue" Aug 20, 2025
@jxs jxs merged commit 7c2a458 into sigp:sigp-gossipsub Aug 20, 2025
68 checks passed
mergify Bot pushed a commit to libp2p/rust-libp2p that referenced this pull request Oct 7, 2025
This is the upstreaming of sigp#570, which has been used by https://github.com/sigp/lighthouse/ for some weeks now:

This started with an attempt to solve #5751 using the previous internal async-channel.
After multiple ideas were discussed out of band, replacing the async-channel with a more tailored internal priority queue seemed inevitable. This priority queue lets us cancel in-flight messages that a peer has flagged with IDONTWANT very cleanly, using the `remove_data_messages` function. Clearing the stale messages likewise becomes simpler, as we also make use of `remove_data_messages`.

Pull-Request: #6175.
@jxs jxs mentioned this pull request Nov 3, 2025
4 tasks
5 participants