
[SignalR] Seamless Reconnect #48338

Merged · 25 commits into main · May 26, 2023

Conversation

BrennanConroy
Member

@BrennanConroy BrennanConroy commented May 19, 2023

Initial implementation of #46691

High-level details

  • Reconnect has been added to the transport layer
  • Added IReconnectFeature to let higher layers know when a reconnect occurs (used for the protocol)
  • Two new SignalR message types
    • AckMessage, sent when acking messages so the other side can remove them from its buffer
    • SequenceMessage, sent on reconnect to tell the other side the ID at which the resent messages start
  • The MessageBuffer type is where 90% of the implementation lives; it's a ring buffer with knowledge of the ack and sequence messages (a toy sketch of the bookkeeping follows the lists below)
    • Shared between Client and Server
    • Stores buffered messages until they are acked
    • Resends buffered messages on reconnect
  • Only WebSockets on the .NET Client are supported in this PR
    • Other transports and clients will come in later previews; it's better to figure out all the details with one client + transport before implementing everywhere
    • Adds an opt-in UseAcks option to HttpConnectionOptions (see the sketch below)
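A minimal sketch of the client-side opt-in, assuming UseAcks lands on HttpConnectionOptions as described above; the URL is illustrative, and the builder and WithUrl overload are the existing client APIs:

```csharp
using Microsoft.AspNetCore.SignalR.Client;

// Opt in to seamless reconnect (acks) on the .NET client.
// UseAcks is the new opt-in flag on HttpConnectionOptions.
var connection = new HubConnectionBuilder()
    .WithUrl("https://example.com/chathub", options =>
    {
        options.UseAcks = true;
    })
    .Build();

await connection.StartAsync();
```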

We'll probably add a server option in future previews to disallow acks, along with versioning for the ack protocol.

Follow-up work:

  • Configurable buffer limit(s)
  • Versioning, probably increment HubProtocol and add "ack protocol"
  • Try to push feature fully into SignalR layer, i.e. get new ConnectionContext from lower layer and map onto existing connection
    • If not, finalize IReconnectFeature API
    • Fix StopAsync race in WebSocketsTransport
  • Pool buffers
  • Finalize options for enabling the feature, both client and server side
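To make the ack/sequence bookkeeping above concrete, here is a toy sketch of the idea; it is not the PR's MessageBuffer (which is a ring buffer of serialized hub messages), and all names are illustrative:

```csharp
using System.Collections.Generic;
using System.Linq;

// Each sent message gets a monotonically increasing sequence ID. An
// AckMessage carrying ID N lets the sender drop everything with ID <= N;
// on reconnect, a SequenceMessage announces the ID the resent stream
// starts at so the receiver can skip duplicates it already processed.
internal sealed class AckBufferSketch
{
    private readonly Queue<(long SequenceId, byte[] Payload)> _buffered = new();
    private long _nextSequenceId = 1;

    public long Buffer(byte[] payload)
    {
        var id = _nextSequenceId++;
        _buffered.Enqueue((id, payload));
        return id;
    }

    // Handle an incoming AckMessage: the other side has everything
    // up to and including ackedSequenceId, so those can be dropped.
    public void Ack(long ackedSequenceId)
    {
        while (_buffered.Count > 0 && _buffered.Peek().SequenceId <= ackedSequenceId)
        {
            _buffered.Dequeue();
        }
    }

    // On reconnect: the first item's ID becomes the SequenceMessage value,
    // and every unacked payload is resent over the new transport.
    public (long StartingSequenceId, IEnumerable<byte[]> Payloads) GetResend()
    {
        var start = _buffered.Count > 0 ? _buffered.Peek().SequenceId : _nextSequenceId;
        return (start, _buffered.Select(m => m.Payload));
    }
}
```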

@BrennanConroy BrennanConroy added the area-signalr Includes: SignalR clients and servers label May 19, 2023
Member

@mgravell mgravell left a comment

looked through it; couldn't see anything actively questionable, but there were large parts of it that went over my head; maybe a little more "here's what we're doing and why" on the PR description might help there?

/// <summary>
///
/// </summary>
public interface IReconnectFeature
Member

couldn't see one linked - has this gone through API review?

Member

also: intellisense

/// <summary>
///
/// </summary>
public Action NotifyOnReconnect { get; set; }
Member

is there any useful context that would be meaningful on a per-invoke basis?

Member

just double-checking: should this be an event rather than a property?

Member

I don't like how there's a race where you don't know if you're writing to the original or new connection, since NotifyOnReconnect fires at some arbitrary point after ConnectionContext.Transport has been swapped to the new connection. The write may throw sometimes, but it also might not. With the way we use it, that's okay because we can ignore anything prior to the sequence message on the reading side, but it feels like an unnecessary weakness in the design of the feature.

Can we move the reconnect logic to be more above the transport layer? I'm thinking a new ConnectionContext with a new connection ID. This would still require a new feature to correlate the new connection with the old connection ID based on the token on the server, and we'd still want to make it seamlessly use the old connection ID from the perspective of code using Hub APIs. But since there's no seamless deduping at the transport layer, I think it's best to avoid any magical potentially leaky abstractions there.

I know this would be a major redesign, so I'm okay with shipping this as a WebSocket transport feature at first. But if we can relayer this, we could probably avoid a bunch of transport-specific logic.

just double-checking: should this be an event rather than a property?

In SignalR and most of the rest of ASP.NET Core we prefer plain old Funcs and Actions over events.
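For illustration, this is how a hypothetical consumer might hook the feature with the plain-Action shape, chaining onto any existing handler rather than subscribing to an event (connectionContext and ResendBufferedMessages are illustrative names):

```csharp
var feature = connectionContext.Features.Get<IReconnectFeature>();
if (feature is not null)
{
    var previous = feature.NotifyOnReconnect;
    feature.NotifyOnReconnect = () =>
    {
        previous?.Invoke();       // preserve any earlier handler in the chain
        ResendBufferedMessages(); // hypothetical: replay unacked messages
    };
}
```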

@@ -1055,6 +1074,20 @@ private async Task SendWithLock(ConnectionState expectedConnectionState, HubMess
Log.ReceivedPing(_logger);
// timeout is reset above, on receiving any message
break;
case AckMessage ackMessage:
_logger.LogInformation("Received Ack with ID {id}", ackMessage.SequenceId);
Member

subjective, but I'd say that this (and possibly some of the others) sound more like "debug" than "info"; but: that's literally as much energy as I have for that topic, so: if you disagree - just hit "mark resolved" and ignore me - totally fine

Member Author

Thanks for taking a look at the PR Marc 😃

I'm updating this to Trace and switching to source-generated logging right now. I only made them Info and non-source-gen to ease debugging; now that I've fixed the bugs I was chasing and am doing cleanup, this is being updated.
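For reference, a sketch of what the source-generated form looks like, assuming a partial Log class in the style SignalR already uses (the message text is illustrative):

```csharp
using Microsoft.Extensions.Logging;

internal static partial class Log
{
    // The generator emits the logging body; Trace replaces the
    // Information level used in the diff above.
    [LoggerMessage(Level = LogLevel.Trace, Message = "Received AckMessage with Sequence ID {SequenceId}.")]
    public static partial void ReceivedAckMessage(ILogger logger, long sequenceId);
}

// Call site, replacing the inline _logger.LogInformation(...):
// Log.ReceivedAckMessage(_logger, ackMessage.SequenceId);
```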

public async ValueTask<FlushResult> WriteAsync(SerializedHubMessage hubMessage, CancellationToken cancellationToken)
{
// TODO: Backpressure based on message count and total message size
if (_buffer[_bufferIndex].Message is not null)
Member

just checking: is this only accessed inside the write lock (maybe via a re-entrant callback from _protocol.WriteMessage)? As a public method, I'm unclear how this interacts with the lock.

Member Author

WriteAsync is always called in a lock by the calling code, so there will never be parallel WriteAsync calls; however, we can't really express that here since this is a separate class.

We can add a comment to the method noting that it assumes the calling code holds the lock.
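A sketch of the calling pattern being described, with illustrative names; the point is that the caller's lock is what makes the unsynchronized WriteAsync safe:

```csharp
using System.Threading;
using System.Threading.Tasks;

internal sealed class SenderSketch
{
    private readonly SemaphoreSlim _writeLock = new(1, 1);
    private readonly MessageBuffer _messageBuffer;

    public SenderSketch(MessageBuffer messageBuffer) => _messageBuffer = messageBuffer;

    public async Task SendAsync(SerializedHubMessage message, CancellationToken token)
    {
        await _writeLock.WaitAsync(token);
        try
        {
            // Only one WriteAsync is ever in flight: the invariant
            // MessageBuffer assumes but cannot enforce itself.
            await _messageBuffer.WriteAsync(message, token);
        }
        finally
        {
            _writeLock.Release();
        }
    }
}
```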

// Or in exceptional cases we could miss multiple messages, but the next ack will clear them
var index = _bufferIndex;
var finalIndex = -1;
for (var i = 0; i < _buffer.Length; i++)
Member

ditto re locking semantics

Member Author

This method is also only called once at a time (again enforced by the calling code), though it can run in parallel with other methods on MessageBuffer. I think it's safe as currently written, unless the ValueTuple assignment to the array can tear, since it's technically two values.

Member Author

OK, tearing is probably possible here. We'll need to lock, but it shouldn't be contended much.
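To spell out the tearing concern with a hypothetical example: a two-field struct write is not atomic in .NET, so a concurrent reader could observe one field from the old value and one from the new. A small lock on both sides makes the slot update atomic:

```csharp
internal sealed class SlotSketch
{
    private readonly object _lock = new();
    private (string? Message, long SequenceId)[] _buffer = new (string?, long)[10];

    public void Clear(int index)
    {
        lock (_lock)
        {
            // Both fields swap together under the lock; without it, a reader
            // could see (null, oldSequenceId) or (oldMessage, newSequenceId).
            _buffer[index] = (null, long.MinValue);
        }
    }

    public (string? Message, long SequenceId) Read(int index)
    {
        lock (_lock)
        {
            return _buffer[index];
        }
    }
}
```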

Member

@halter73 halter73 left a comment

This is looking good so far.


public MessageBuffer(ConnectionContext connection, IHubProtocol protocol)
{
// Arbitrary size, we can figure out defaults and configurability later
const int bufferSize = 10;
Member

I think the primary or only limit should be based on the total size of the serialized buffer in bytes. 10 messages before observing backpressure could be really bad for any app that sends a lot of small messages over high latency connections.

I think the default should probably be on the order of 100 KB or 1 MB on the server (for comparison, SocketTransportOptions.MaxReadBufferSize and Http2Limits.InitialConnectionWindowSize are both 1 MB) and more on the client to avoid unnecessary backpressure.

On the server, backpressure could slow down some server logic and punish faster connections in the same group because of one slow connection. This is already an issue with socket backpressure, but socket writes usually involve large amounts of buffering at lower layers, and this is all app level.

This wouldn't account for the fact that some of these serialized messages have multiple references, but I think it's okay to overcount here. At least it's simple to explain. It also wouldn't account for the size of the slots in the collection referencing the SerializedHubMessage, but we can probably mitigate that with smart collection design. Worst case, we could have a separate limit for the number of buffered hub messages, but it should also be large by default. 10 feels too small; 1000 seems more reasonable, but this should definitely be discussed as part of the threat modeling.
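A rough sketch of byte-based backpressure along the lines proposed here; the limit, names, and drain policy are illustrative, not the PR's implementation:

```csharp
using System.Threading.Tasks;

internal sealed class ByteBackpressureSketch
{
    private const long Limit = 100 * 1024; // e.g. 100 KB, per the suggestion above

    private readonly object _lock = new();
    private long _bufferedByteCount;
    private TaskCompletionSource? _drained;

    // Write path: count the serialized bytes; if over the limit, hand the
    // writer a task that completes once acks have drained enough bytes.
    public Task WaitToWriteAsync(int messageSize)
    {
        lock (_lock)
        {
            _bufferedByteCount += messageSize;
            if (_bufferedByteCount <= Limit)
            {
                return Task.CompletedTask;
            }
            _drained ??= new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
            return _drained.Task;
        }
    }

    // Ack path: remove the acked bytes and release any waiting writer.
    public void OnAcked(long ackedBytes)
    {
        lock (_lock)
        {
            _bufferedByteCount -= ackedBytes;
            if (_bufferedByteCount <= Limit && _drained is not null)
            {
                _drained.SetResult();
                _drained = null;
            }
        }
    }
}
```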


// TODO: Handle TCP connection errors
// https://github.com/SignalR/SignalR/blob/1fba14fa3437e24c204dfaf8a18db3fce8acad3c/src/Microsoft.AspNet.SignalR.Core/Owin/WebSockets/WebSocketHandler.cs#L248-L251
Running = ProcessSocketAsync(_webSocket);
Running = ProcessSocketAsync(_webSocket, url, ignoreFirstCanceled);
Member

So this just overwrites the Running task when this is called a second time for a seamless reconnect? What happens if StopAsync is already awaiting the first one, which will now complete once the first call to ProcessSocketAsync ends? StopAsync would finish while the new connection has already taken over, right? At least the _stopCts should still stop everything eventually. I could see this causing problems for the fallback to a normal reconnect, though.

if (_useAck && !_gracefulClose)
{
UpdateConnectionPair();
await StartAsync(url, _webSocketMessageType == WebSocketMessageType.Binary ? TransferFormat.Binary : TransferFormat.Text, default).ConfigureAwait(false);
Member

We should retry the seamless reconnect attempts until we hit a configured limit. It probably makes sense to use an IRetryPolicy for this like we do for normal reconnects.
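A sketch of what driving the retries off an IRetryPolicy could look like. IRetryPolicy and RetryContext are the existing client types used for normal reconnects, and UpdateConnectionPair/StartAsync appear in the diff above; the loop itself (and the hardcoded transfer format) is illustrative:

```csharp
// Inside the transport (illustrative): keep attempting a seamless
// reconnect until the policy returns null, then fall back.
private async Task TrySeamlessReconnectAsync(IRetryPolicy policy, Uri url, CancellationToken token)
{
    long attempts = 0;
    var elapsed = System.Diagnostics.Stopwatch.StartNew();

    while (true)
    {
        var delay = policy.NextRetryDelay(new RetryContext
        {
            PreviousRetryCount = attempts,
            ElapsedTime = elapsed.Elapsed,
        });

        if (delay is null)
        {
            return; // policy gave up; fall back to a normal reconnect
        }

        await Task.Delay(delay.Value, token);

        try
        {
            UpdateConnectionPair();
            await StartAsync(url, TransferFormat.Binary, token).ConfigureAwait(false);
            return; // seamless reconnect succeeded
        }
        catch
        {
            attempts++; // ask the policy for the next delay
        }
    }
}
```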

@BrennanConroy BrennanConroy marked this pull request as ready for review May 24, 2023 22:39
@BrennanConroy BrennanConroy requested a review from Tratcher as a code owner May 24, 2023 22:39
@BrennanConroy BrennanConroy requested a review from JamesNK as a code owner May 24, 2023 22:39
Comment on lines +49 to +50
private int _bufferedByteCount;

Member

Consider, for the preview release, making this cheaply configurable (e.g. via environment variable) or at least just making it much larger.
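One cheap way to do that, as a hypothetical sketch (the environment variable name is made up):

```csharp
using System;

internal static class BufferLimits
{
    // Read an override from the environment at startup, falling back to
    // the ~100 KB default discussed above.
    public static readonly long MaxBufferedBytes =
        long.TryParse(Environment.GetEnvironmentVariable("SIGNALR_ACK_BUFFER_BYTES"), out var configured)
            && configured > 0
            ? configured
            : 100 * 1024;
}
```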

@adityamandaleeka
Member

Looks good to me overall. There are a lot of TODOs in there for future previews. If you don't already have a good way to track those, I recommend opening a follow-up checklist issue (or separate issues).

Member

@halter73 halter73 left a comment

The sooner people can try out the feature the better. This will have to go through API review before the "rtm" release, but I would like to see this in preview5.

I don't think there's much of a risk since it's opt-in on the client and a preview release. I'll be interested to see if anyone runs into issues with the 10-item buffer 😆 I think we'll eventually want to make this at least opt-out on the server too.

Edit: I see we've already implemented the 100 KB limit!

@BrennanConroy
Member Author

/backport to release/8.0-preview5

@github-actions
Contributor

Started backporting to release/8.0-preview5: https://github.com/dotnet/aspnetcore/actions/runs/5082466603

@BrennanConroy BrennanConroy merged commit f56c242 into main May 26, 2023
@BrennanConroy BrennanConroy deleted the brecon/ack branch May 26, 2023 00:27
@ghost ghost added this to the 8.0-preview6 milestone May 26, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Dec 8, 2023