Connection Manager Overhaul #744

@vasco-santos

Connection Manager Overhaul

This issue is an EPIC to track the work related to the Connection Manager Overhaul. The context and initial thoughts for each milestone are described below.

Background

As we land new features like auto-relay and rendezvous as part of improving connectivity and discoverability in libp2p (libp2p/js-libp2p#703), the connection manager overhaul becomes an important work stream to guarantee that these protocols work as expected. This work will also benefit some already implemented features/protocols like webrtc-star and bootstrap. Finally, it is important for enabling the DHT work.

This overhaul should be an initial step towards the future ConnMgr v2.

Milestones Overview

| Milestone | Issue | PR | State |
| --- | --- | --- | --- |
| 0) Documentation - Baseline | NA | #757 | WIP |
| 1) Watermarks Observation - Proactive Dial | TODO | TODO | TODO |
| 2) Keep Alive | TODO | TODO | TODO |
| 3) Protect Connections - Connection Tags | TODO | TODO | TODO |
| 4) Protect Connections - Decaying Tags | TODO | TODO | TODO |
| 5) Watermarks Observation - Trimming | TODO | TODO | TODO |
| 6) Connection Gater | TODO | #1142 | Done |
| 7) Dial retry | TODO | TODO | TODO |
| 8) Disconnect message | TODO | TODO | TODO |

These milestones do not need to be worked on in the displayed sequence. For instance, Connection Tags, Connection Gater and Keep Alive can each be implemented in isolation.

Context

The Connection manager is responsible for managing all the connections a peer has over time. It allows users to enforce an upper bound on the total number of open connections. To avoid possible service disruptions, connections can be tagged with metadata and optionally "protected" to guarantee that essential connections are kept alive.

0) Documentation - Connection flows

Create a DISCOVERABILITY_AND_CONNECTIVITY.md document as a follow-up to the GETTING_STARTED document. Once someone has got up to speed with how to configure and start libp2p in the getting started document, they should move on to how to set up their peer/network according to their use case/environment, so that peers can be discovered and connections to them can be established.

This will be divided into two categories:

1) Watermarks observation

Proactive dial

The connection manager proactively dials known peers, in order to have a meaningful set of connections to enable a node to work as expected, according to each use case/environment.

We have been relying on the connection manager low watermark, so that the peer keeps a reasonable number of arbitrary connections. Once we introduce protected connections, as well as tagging important peers, the proactive dial strategy can be modified to keep trying to dial more meaningful peers.

Proactive dial strategies

The following dial strategies should exist:

  1. Find our closest peers on the network, and attempt to stay connected to n of them. If peers from the previous search are no longer our closest peers, we should untag those connections, or just let decaying tags handle this.
  2. Find, connect to and protect our gossipsub peers (same topics search)
  3. Find and bind to relays with AutoRelay
  4. Find and bind to application protocol peers (as needed via MulticodecTopology) -- We should clarify what libp2p will handle intrinsically and what users need to do. Ideally, I think libp2p should search for multicodecs for registered topologies automatically.
  5. ...

The above dial strategies should have sane defaults, but users should also be able to override them.
We should have an interval to double check that we are connected to the most meaningful peers, and we should also proactively dial on some events like Peer discovery/disconnect.

TODO: different strategy for Startup/Persistence?

Subsystems should be able to ask the connection manager for a slice of the connection pool. A connection that belongs in my gossipsub mesh should probably be protected.

TODO: Figure out API for interaction between subsystems/topologies and connMgr
Subsystems might want to provide a selector function to choose the peers they care about. For example, AutoRelay will want to check if a peer has metadata with hop = true.
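
As a starting point for that API discussion, here is a minimal sketch of what a subsystem-provided selector could look like. All names here (`KnownPeer`, `PeerSelector`, `selectDialCandidates`) are hypothetical and not existing libp2p APIs, and storing `hop` as the string `'true'` in the peer metadata is an assumption made only for illustration.

```typescript
// Hypothetical types for illustration only; not existing libp2p APIs.
interface KnownPeer {
  id: string // stand-in for a PeerId
  connected: boolean
  metadata: Map<string, string> // assumed shape of the peer's metadata
}

// A subsystem supplies a selector to say which peers it cares about.
type PeerSelector = (peer: KnownPeer) => boolean

// AutoRelay-style selector: only peers whose metadata advertises hop support.
const hopSelector: PeerSelector = (peer) => peer.metadata.get('hop') === 'true'

// Pick up to `wanted` peers that are not yet connected and pass the selector.
function selectDialCandidates (
  peers: KnownPeer[],
  selector: PeerSelector,
  wanted: number
): KnownPeer[] {
  return peers
    .filter((p) => !p.connected && selector(p))
    .slice(0, Math.max(0, wanted))
}

const known: KnownPeer[] = [
  { id: 'relay-a', connected: true, metadata: new Map([['hop', 'true']]) },
  { id: 'relay-b', connected: false, metadata: new Map([['hop', 'true']]) },
  { id: 'peer-c', connected: false, metadata: new Map<string, string>() },
  { id: 'relay-d', connected: false, metadata: new Map([['hop', 'true']]) }
]

// AutoRelay asks for two more relays to dial.
const candidates = selectDialCandidates(known, hopSelector, 2)
```

With this shape the connection manager stays unaware of subsystem details: it only needs the selector and the number of connections the subsystem wants.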

Trim Connections

The connection manager trims less useful connections to keep the number of open connections below the high watermark.

  • New connections should be given a grace period before they are subject to trimming - Short ttl decay tags
  • Trimming runs automatically on demand
    • Verification on every Peer connect event
    • Attempt to keep a balance between subsystems connections and their needs
    • If a subsystem is exceeding its agreed allocation of connections, then we would look at disconnecting peers from it that no other system is using.
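
The grace period and high watermark rules above can be sketched as follows. This is only an illustration under stated assumptions: `OpenConnection` and `selectConnectionsToTrim` are hypothetical names, and "oldest first" is a placeholder ordering for "least useful", since the real ranking is expected to come from tags and subsystem allocations.

```typescript
// Hypothetical shape for illustration; timestamps are in milliseconds.
interface OpenConnection {
  peerId: string
  openedAt: number
}

// Returns the connections that could be closed to get back to the high watermark.
// Connections still inside their grace period are never selected, so the result
// may contain fewer entries than the excess.
function selectConnectionsToTrim (
  conns: OpenConnection[],
  highWatermark: number,
  gracePeriodMs: number,
  now: number
): OpenConnection[] {
  const excess = conns.length - highWatermark
  if (excess <= 0) return []

  return conns
    .filter((c) => now - c.openedAt >= gracePeriodMs) // skip new connections
    .sort((a, b) => a.openedAt - b.openedAt) // placeholder: oldest first
    .slice(0, excess)
}

const openConns: OpenConnection[] = [
  { peerId: 'a', openedAt: 0 },
  { peerId: 'b', openedAt: 1000 },
  { peerId: 'c', openedAt: 9500 },
  { peerId: 'd', openedAt: 9900 }
]

// Would be called on every peer connect event; here with a fixed clock.
const toTrim = selectConnectionsToTrim(openConns, 2, 1000, 10000)
```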

2) Keep Alive

Currently, if a connection does not have anything going on for a while, it will time out and close.
Libp2p should guarantee that specific connections are kept alive. This is important for staying connected to peers that matter to us, at either the infrastructure or the application layer. Remote listening (webrtc-star, relay, etc) is really important in this context.

Keep Alive should apply to peers protected via the API (Milestone 3) and to peers provided in the configuration.

In most cases, a ping on the connection should be enough, but this needs to be tested for each transport.
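
A minimal sketch of the decision step, assuming the connection manager tracks the last activity time per connection: it only works out which keep-alive peers are due for a ping and leaves the actual ping to the caller. `TrackedConnection`, `peersNeedingPing` and the idle threshold are hypothetical names and parameters, not existing libp2p APIs.

```typescript
// Hypothetical shape for illustration; timestamps are in milliseconds.
interface TrackedConnection {
  peerId: string
  lastActivityAt: number
}

// Returns the keep-alive peers whose connection has been idle for at least
// `idleThresholdMs`. The threshold would need to be shorter than the idle
// timeout of the transport in use.
function peersNeedingPing (
  conns: TrackedConnection[],
  keepAlivePeers: Set<string>,
  idleThresholdMs: number,
  now: number
): string[] {
  return conns
    .filter((c) => keepAlivePeers.has(c.peerId))
    .filter((c) => now - c.lastActivityAt >= idleThresholdMs)
    .map((c) => c.peerId)
}

const tracked: TrackedConnection[] = [
  { peerId: 'relay-1', lastActivityAt: 0 },
  { peerId: 'app-peer', lastActivityAt: 9000 },
  { peerId: 'other', lastActivityAt: 0 }
]
const keepAlive = new Set<string>(['relay-1', 'app-peer'])

// Only 'relay-1' is both a keep-alive peer and idle for 5 seconds or more.
const duePings = peersNeedingPing(tracked, keepAlive, 5000, 10000)
```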

3) Protect important connections

ConnManager tracks connections to peers, and allows consumers to associate metadata with each peer. This enables connections to be trimmed based on implementation-defined metadata per peer.

See: #369

Connection tags

API

(based on go interface: https://github.com/libp2p/go-libp2p-core/blob/master/connmgr/manager.go)

  • Tag a peer with a string, associating a weight with the tag.
    • tagPeer (peerId: PeerId, tag: string, weight: number) : void
  • Untag removes the tagged value from the peer.
    • untagPeer (peerId: PeerId, tag: string) : void
  • Get the metadata associated with the peer connection
    • getTagInfo (peerId: PeerId) : TagInfo
    • tagInfo should be stored in the metadataBook
  • Protect a peer from having its connection(s) pruned.
    • protect (peerId: PeerId, tag: string)
      • This would need to return a boolean or throw
  • Unprotect a peer from having its connection(s) pruned.
    • unProtect (peerId: PeerId, tag: string)
  • Check if a peer connection is protected.
    • isProtected (peerId: PeerId, tag: string)
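
To make the proposed API concrete, here is a minimal in-memory sketch. It is not the planned implementation: peer ids are plain strings instead of `PeerId`, nothing is persisted to the metadataBook, and `firstSeen` is set on the first tag rather than on connection establishment. Having `unProtect` return whether the peer is still protected by another tag, and making the `tag` argument of `isProtected` optional, are assumptions loosely based on the Go interface linked above.

```typescript
type PeerIdStr = string // stand-in for PeerId

interface SketchTagInfo {
  tags: Map<string, number>
  firstSeen: number
}

class InMemoryConnectionTags {
  private readonly infos = new Map<PeerIdStr, SketchTagInfo>()
  private readonly protections = new Map<PeerIdStr, Set<string>>()

  tagPeer (peerId: PeerIdStr, tag: string, weight: number): void {
    let info = this.infos.get(peerId)
    if (info === undefined) {
      info = { tags: new Map<string, number>(), firstSeen: Date.now() }
      this.infos.set(peerId, info)
    }
    info.tags.set(tag, weight)
  }

  untagPeer (peerId: PeerIdStr, tag: string): void {
    this.infos.get(peerId)?.tags.delete(tag)
  }

  getTagInfo (peerId: PeerIdStr): SketchTagInfo | undefined {
    return this.infos.get(peerId)
  }

  protect (peerId: PeerIdStr, tag: string): void {
    const tags = this.protections.get(peerId) ?? new Set<string>()
    tags.add(tag)
    this.protections.set(peerId, tags)
  }

  // Returns true if the peer is still protected under another tag.
  unProtect (peerId: PeerIdStr, tag: string): boolean {
    const tags = this.protections.get(peerId)
    if (tags === undefined) return false
    tags.delete(tag)
    if (tags.size === 0) this.protections.delete(peerId)
    return tags.size > 0
  }

  // Without a tag, checks whether the peer is protected under any tag.
  isProtected (peerId: PeerIdStr, tag?: string): boolean {
    const tags = this.protections.get(peerId)
    if (tags === undefined) return false
    return tag === undefined ? tags.size > 0 : tags.has(tag)
  }
}

const connTags = new InMemoryConnectionTags()
connTags.tagPeer('peer-a', 'gossipsub-mesh', 20)
connTags.protect('peer-a', 'gossipsub-mesh')
```

Keeping protections per tag means two subsystems can protect the same peer independently, and the peer only becomes trimmable once both have released it.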

Data structures

/**
 * TagInfo object stores metadata associated with a peer
 * @typedef {Object} TagInfo
 * @property {Map<string, number>} tags map with tags and their current weight
 * @property {number} firstSeen timestamp of first connection establishment.
 * @property {number} weight seq counter.
 */

Integration with Trim connections

Connection tags will allow the trimming to become more intelligent at this stage. Peers should be iterated and the weight of their tags should be used as the first criterion.
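
One way to use tag weights as the first criterion, as a sketch: sum the weights of each peer's tags, skip protected peers, and trim from the lowest total upwards. `PeerTagSummary` and `rankForTrimming` are hypothetical names, and summing the weights is an assumption about how multiple tags combine.

```typescript
// Hypothetical shape for illustration only.
interface PeerTagSummary {
  peerId: string
  tags: Map<string, number>
  isProtected: boolean
}

// Sum of all tag weights for a peer; untagged peers score 0.
function totalWeight (tags: Map<string, number>): number {
  let sum = 0
  tags.forEach((weight) => { sum += weight })
  return sum
}

// Returns peer ids in the order they should be considered for trimming:
// lowest total weight first, protected peers never included.
function rankForTrimming (peers: PeerTagSummary[]): string[] {
  return peers
    .filter((p) => !p.isProtected)
    .map((p) => ({ peerId: p.peerId, weight: totalWeight(p.tags) }))
    .sort((a, b) => a.weight - b.weight)
    .map((p) => p.peerId)
}

const summaries: PeerTagSummary[] = [
  { peerId: 'mesh', tags: new Map([['gossipsub', 20], ['dht', 10]]), isProtected: false },
  { peerId: 'relay', tags: new Map([['relay', 1]]), isProtected: true },
  { peerId: 'bootstrap', tags: new Map([['bootstrap', 5]]), isProtected: false },
  { peerId: 'idle', tags: new Map<string, number>(), isProtected: false }
]

const trimOrder = rankForTrimming(summaries)
```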

4) Decaying tags

Note: Inspired by go-libp2p https://github.com/libp2p/go-libp2p-core/blob/master/connmgr/decay.go

A decaying tag is one whose value automatically decays over time. The decay behaviour is encapsulated in a user-provided decaying function (DecayFn). The function is called on every tick (determined by the interval parameter), and returns either the new value of the tag, or whether it should be erased altogether.

We do not set values on a decaying tag directly. Instead, we "bump" decaying tags by a delta value. This calls the BumpFn with the old value and the delta to determine the new value.

While users should be able to provide their own functions, we should provide some preset functions to be used. Behaviours that are straightforward to implement include:

  • Decay a tag by -1, or by half its current value, on every tick.
  • Every time a value is bumped, sum it to its current value.
  • Exponentially boost a score with every bump.
  • Sum the incoming score, but keep it within min, max bounds.

This is particularly important for scenarios like the Bootstrap discovery. When a node starts, these connections are really important to get to know other peers. But as time passes and new connections are established, peers should disconnect from the bootstrap nodes.

API

  • setDecayingTag(tag: string, interval: time, decayFn: function, bumpFn: function)

The go-libp2p definitions of the two function types, for reference:

// DecayFn applies a decay to the peer's score. The implementation must call
// DecayFn at the interval supplied when registering the tag.
//
// It receives a copy of the decaying value, and returns the score after
// applying the decay, as well as a flag to signal if the tag should be erased.
type DecayFn func(value DecayingValue) (after int, rm bool)

// BumpFn applies a delta onto an existing score, and returns the new score.
//
// Non-trivial bump functions include exponential boosting, moving averages,
// ceilings, etc.
type BumpFn func(value DecayingValue, delta int) (after int)
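
A possible TypeScript rendering of these function types, together with a few of the preset behaviours listed earlier, is sketched below. The names (`decayFixed`, `decayHalf`, `bumpSumUnbounded`, `bumpSumBounded`, `runTicks`) are illustrative, `DecayingValue` is reduced to just the current value, and treating a value of zero or less as "erase the tag" is an assumption.

```typescript
// Simplified: only the current value of the tag is carried.
interface DecayingValue {
  value: number
}

// Returns the value after decay and whether the tag should be erased.
type DecayFn = (current: DecayingValue) => { after: number, remove: boolean }

// Returns the new value after applying a delta to the current one.
type BumpFn = (current: DecayingValue, delta: number) => number

// Decay by a fixed amount on every tick.
const decayFixed = (amount: number): DecayFn => (current) => {
  const after = current.value - amount
  return { after, remove: after <= 0 }
}

// Decay by half the current value (rounded down) on every tick.
const decayHalf: DecayFn = (current) => {
  const after = Math.floor(current.value / 2)
  return { after, remove: after <= 0 }
}

// Sum the delta into the current value.
const bumpSumUnbounded: BumpFn = (current, delta) => current.value + delta

// Sum the delta, but keep the result within [min, max].
const bumpSumBounded = (min: number, max: number): BumpFn => (current, delta) =>
  Math.min(max, Math.max(min, current.value + delta))

// Simulates `ticks` decay intervals for one tag.
// Returns the remaining value, or undefined once the tag has been erased.
function runTicks (start: number, decayFn: DecayFn, ticks: number): number | undefined {
  let value = start
  for (let i = 0; i < ticks; i++) {
    const { after, remove } = decayFn({ value })
    if (remove) return undefined
    value = after
  }
  return value
}

// A bootstrap tag that starts at 10 and loses 1 per tick is erased on the 10th tick.
const bootstrapAfterNine = runTicks(10, decayFixed(1), 9)
const bootstrapAfterTen = runTicks(10, decayFixed(1), 10)
```

For the Bootstrap scenario, this means a bootstrap connection would start with a meaningful weight and become an ordinary trimming candidate once its tag has decayed away, unless something bumps it in the meantime.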

5) Connection Gater

TODO: https://github.com/libp2p/go-libp2p-core/blob/master/connmgr/gater.go

Related: #175

6) Connection Retry

Retry a dial if it fails on a first attempt.

7) Disconnect

Sometimes a peer A will want to disconnect from peer B because it has a lot of connections, all of them more important than the connection with peer B, while peer B still wants to stay connected to peer A. A message should be exchanged so that peer B understands that it should not retry the connection (for a given time?). This could eventually also include a peer exchange. This needs to be spec'ed. Initial discussion at libp2p/go-libp2p#238

Notes

  • Subsystems, such as pubsub and auto-relay, should provide a function to rank which peers they would like to have connections with.
