
[SignalR] Seamless Reconnect #48338

Merged · 25 commits into main · May 26, 2023

Conversation

BrennanConroy
Member

@BrennanConroy BrennanConroy commented May 19, 2023

Initial implementation of #46691

High-level details

  • Reconnect has been added to the transport layer
  • Added IReconnectFeature to let higher layers know when a reconnect occurs (used for the protocol)
  • Two new SignalR message types
    • AckMessage, sent when acking messages so the other side can remove them from its buffer
    • SequenceMessage, sent on reconnect to tell the other side the ID at which the resent messages start
  • The MessageBuffer type is where 90% of the implementation lives; it's a ring buffer with knowledge of the ack and sequence messages (a toy sketch of the bookkeeping follows the lists below)
    • Shared between Client and Server
    • Stores buffered messages until they are acked
    • Resends buffered messages on reconnect
  • Only WebSockets on the .NET Client are supported in this PR
    • Other transports and clients will come in later previews; it's better to figure out all the details with one client + transport before implementing everywhere
    • Adds an opt-in UseAcks option to HttpConnectionOptions (see the sketch below)
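A minimal sketch of the client-side opt-in, assuming UseAcks lands on HttpConnectionOptions as described above; the URL is illustrative, and the builder and WithUrl overload are the existing client APIs:

```csharp
using Microsoft.AspNetCore.SignalR.Client;

// Opt in to seamless reconnect (acks) on the .NET client.
// UseAcks is the new opt-in flag on HttpConnectionOptions.
var connection = new HubConnectionBuilder()
    .WithUrl("https://example.com/chathub", options =>
    {
        options.UseAcks = true;
    })
    .Build();

await connection.StartAsync();
```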

We'll probably add a server option in future previews to disallow acks, along with versioning for the ack protocol.

Follow-up work:

  • Configurable buffer limit(s)
  • Versioning, probably increment HubProtocol and add "ack protocol"
  • Try to push feature fully into SignalR layer, i.e. get new ConnectionContext from lower layer and map onto existing connection
    • If not, finalize IReconnectFeature API
    • Fix StopAsync race in WebSocketsTransport
  • Pool buffers
  • Finalize options for enabling the feature, both client and server side
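To make the ack/sequence bookkeeping above concrete, here is a toy sketch of the idea; it is not the PR's MessageBuffer (which is a ring buffer of serialized hub messages), and all names are illustrative:

```csharp
using System.Collections.Generic;
using System.Linq;

// Each sent message gets a monotonically increasing sequence ID. An
// AckMessage carrying ID N lets the sender drop everything with ID <= N;
// on reconnect, a SequenceMessage announces the ID the resent stream
// starts at so the receiver can skip duplicates it already processed.
internal sealed class AckBufferSketch
{
    private readonly Queue<(long SequenceId, byte[] Payload)> _buffered = new();
    private long _nextSequenceId = 1;

    public long Buffer(byte[] payload)
    {
        var id = _nextSequenceId++;
        _buffered.Enqueue((id, payload));
        return id;
    }

    // Handle an incoming AckMessage: the other side has everything
    // up to and including ackedSequenceId, so those can be dropped.
    public void Ack(long ackedSequenceId)
    {
        while (_buffered.Count > 0 && _buffered.Peek().SequenceId <= ackedSequenceId)
        {
            _buffered.Dequeue();
        }
    }

    // On reconnect: the first item's ID becomes the SequenceMessage value,
    // and every unacked payload is resent over the new transport.
    public (long StartingSequenceId, IEnumerable<byte[]> Payloads) GetResend()
    {
        var start = _buffered.Count > 0 ? _buffered.Peek().SequenceId : _nextSequenceId;
        return (start, _buffered.Select(m => m.Payload));
    }
}
```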

@BrennanConroy BrennanConroy added the area-signalr Includes: SignalR clients and servers label May 19, 2023
Member

@mgravell mgravell left a comment

looked through it; couldn't see anything actively questionable, but there were large parts of it that went over my head; maybe a little more "here's what we're doing and why" on the PR description might help there?

/// <summary>
///
/// </summary>
public interface IReconnectFeature
Member

couldn't see one linked - has this gone through API review?

Member

also: intellisense

/// <summary>
///
/// </summary>
public Action NotifyOnReconnect { get; set; }
Member

is there any useful context that would be meaningful on a per-invoke basis?

Member

just double-checking: should this be an event rather than a property?

Member

I don't like how there's a race where you don't know if you're writing to the original or new connection, since NotifyOnReconnect fires at some arbitrary point after ConnectionContext.Transport has been swapped to the new connection. The write may throw sometimes, but it also might not. With the way we use it, that's okay because we can ignore anything prior to the sequence message on the reading side, but it feels like an unnecessary weakness in the design of the feature.

Can we move the reconnect logic to be more above the transport layer? I'm thinking a new ConnectionContext with a new connection ID. This would still require a new feature to correlate the new connection with the old connection ID based on the token on the server, and we'd still want to make it seamlessly use the old connection ID from the perspective of code using Hub APIs. But since there's no seamless deduping at the transport layer, I think it's best to avoid any magical potentially leaky abstractions there.

I know this would be a major redesign, so I'm okay with shipping this as a WebSocket transport feature at first. But if we can relayer this, we could probably avoid a bunch of transport-specific logic.

just double-checking: should this be an event rather than a property?

In SignalR and most of the rest of ASP.NET Core we prefer plain old Funcs and Actions over events.
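For illustration, this is how a hypothetical consumer might hook the feature with the plain-Action shape, chaining onto any existing handler rather than subscribing to an event (connectionContext and ResendBufferedMessages are illustrative names):

```csharp
var feature = connectionContext.Features.Get<IReconnectFeature>();
if (feature is not null)
{
    var previous = feature.NotifyOnReconnect;
    feature.NotifyOnReconnect = () =>
    {
        previous?.Invoke();       // preserve any earlier handler in the chain
        ResendBufferedMessages(); // hypothetical: replay unacked messages
    };
}
```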

@@ -1055,6 +1074,20 @@ private async Task SendWithLock(ConnectionState expectedConnectionState, HubMess
Log.ReceivedPing(_logger);
// timeout is reset above, on receiving any message
break;
case AckMessage ackMessage:
_logger.LogInformation("Received Ack with ID {id}", ackMessage.SequenceId);
Member

subjective, but I'd say that this (and possibly some of the others) sound more like "debug" than "info"; but: that's literally as much energy as I have for that topic, so: if you disagree - just hit "mark resolved" and ignore me - totally fine

Member Author

Thanks for taking a look at the PR Marc 😃

I'm updating this to Trace and switching to source-generated logging right now. I only made them Info and non-source-gen to ease debugging; now that I've fixed the bugs I was chasing and am doing cleanup, this is being updated.
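For reference, a sketch of what the source-generated form looks like, assuming a partial Log class in the style SignalR already uses (the message text is illustrative):

```csharp
using Microsoft.Extensions.Logging;

internal static partial class Log
{
    // The generator emits the logging body; Trace replaces the
    // Information level used in the diff above.
    [LoggerMessage(Level = LogLevel.Trace, Message = "Received AckMessage with Sequence ID {SequenceId}.")]
    public static partial void ReceivedAckMessage(ILogger logger, long sequenceId);
}

// Call site, replacing the inline _logger.LogInformation(...):
// Log.ReceivedAckMessage(_logger, ackMessage.SequenceId);
```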

public async ValueTask<FlushResult> WriteAsync(SerializedHubMessage hubMessage, CancellationToken cancellationToken)
{
// TODO: Backpressure based on message count and total message size
if (_buffer[_bufferIndex].Message is not null)
Member

just checking: is this only accessed inside the write lock (maybe via a re-entrant callback from _protocol.WriteMessage)? As a public method, I'm unclear how this interacts with the lock.

Member Author

WriteAsync is always called in a lock by the calling code, so there will never be parallel WriteAsync calls; however, we can't really express that here since this is a separate class.

We can add a comment to the method noting that it assumes the calling code holds the lock.
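A sketch of the calling pattern being described, with illustrative names; the point is that the caller's lock is what makes the unsynchronized WriteAsync safe:

```csharp
using System.Threading;
using System.Threading.Tasks;

internal sealed class SenderSketch
{
    private readonly SemaphoreSlim _writeLock = new(1, 1);
    private readonly MessageBuffer _messageBuffer;

    public SenderSketch(MessageBuffer messageBuffer) => _messageBuffer = messageBuffer;

    public async Task SendAsync(SerializedHubMessage message, CancellationToken token)
    {
        await _writeLock.WaitAsync(token);
        try
        {
            // Only one WriteAsync is ever in flight: the invariant
            // MessageBuffer assumes but cannot enforce itself.
            await _messageBuffer.WriteAsync(message, token);
        }
        finally
        {
            _writeLock.Release();
        }
    }
}
```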

// Or in exceptional cases we could miss multiple messages, but the next ack will clear them
var index = _bufferIndex;
var finalIndex = -1;
for (var i = 0; i < _buffer.Length; i++)
Member

ditto re locking semantics

Member Author

This method is also only called once at a time (again enforced by the calling code), though it can run in parallel with other methods on MessageBuffer. I think it's safe as currently written, unless the ValueTuple assignment to the array can tear, since it's technically two values.

Member Author

OK, tearing is probably possible here. We'll need to lock, but it shouldn't be contended much.
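To spell out the tearing concern with a hypothetical example: a two-field struct write is not atomic in .NET, so a concurrent reader could observe one field from the old value and one from the new. A small lock on both sides makes the slot update atomic:

```csharp
internal sealed class SlotSketch
{
    private readonly object _lock = new();
    private (string? Message, long SequenceId)[] _buffer = new (string?, long)[10];

    public void Clear(int index)
    {
        lock (_lock)
        {
            // Both fields swap together under the lock; without it, a reader
            // could see (null, oldSequenceId) or (oldMessage, newSequenceId).
            _buffer[index] = (null, long.MinValue);
        }
    }

    public (string? Message, long SequenceId) Read(int index)
    {
        lock (_lock)
        {
            return _buffer[index];
        }
    }
}
```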

Member

@halter73 halter73 left a comment

This is looking good so far.


public MessageBuffer(ConnectionContext connection, IHubProtocol protocol)
{
// Arbitrary size, we can figure out defaults and configurability later
const int bufferSize = 10;
Member

I think the primary or only limit should be based on the total size of the serialized buffer in bytes. 10 messages before observing backpressure could be really bad for any app that sends a lot of small messages over high latency connections.

I think the default should probably be on the order of 100 KB or 1 MB on the server (for comparison, SocketTransportOptions.MaxReadBufferSize and Http2Limits.InitialConnectionWindowSize are both 1 MB) and more on the client to avoid unnecessary backpressure.

On the server, backpressure could slow down some server logic and punish faster connections in the same group because of one slow connection. This is already an issue with socket backpressure, but socket writes usually involve large amounts of buffering at lower layers, and this is all app level.

This wouldn't account for the fact that some of these serialized messages have multiple references, but I think it's okay to overcount here. At least it's simple to explain. It also wouldn't account for the size of the slots in the collection referencing the SerializedHubMessage, but we can probably mitigate that with smart collection design. Worst case, we could have a separate limit for the number of buffered hub messages, but it should also be large by default. 10 feels too small; 1000 seems more reasonable, but this should definitely be discussed as part of the threat modeling.
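A rough sketch of byte-based backpressure along the lines proposed here; the limit, names, and drain policy are illustrative, not the PR's implementation:

```csharp
using System.Threading.Tasks;

internal sealed class ByteBackpressureSketch
{
    private const long Limit = 100 * 1024; // e.g. 100 KB, per the suggestion above

    private readonly object _lock = new();
    private long _bufferedByteCount;
    private TaskCompletionSource? _drained;

    // Write path: count the serialized bytes; if over the limit, hand the
    // writer a task that completes once acks have drained enough bytes.
    public Task WaitToWriteAsync(int messageSize)
    {
        lock (_lock)
        {
            _bufferedByteCount += messageSize;
            if (_bufferedByteCount <= Limit)
            {
                return Task.CompletedTask;
            }
            _drained ??= new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
            return _drained.Task;
        }
    }

    // Ack path: remove the acked bytes and release any waiting writer.
    public void OnAcked(long ackedBytes)
    {
        lock (_lock)
        {
            _bufferedByteCount -= ackedBytes;
            if (_bufferedByteCount <= Limit && _drained is not null)
            {
                _drained.SetResult();
                _drained = null;
            }
        }
    }
}
```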


// TODO: Handle TCP connection errors
// https://github.com/SignalR/SignalR/blob/1fba14fa3437e24c204dfaf8a18db3fce8acad3c/src/Microsoft.AspNet.SignalR.Core/Owin/WebSockets/WebSocketHandler.cs#L248-L251
Running = ProcessSocketAsync(_webSocket);
Running = ProcessSocketAsync(_webSocket, url, ignoreFirstCanceled);
Member

So this just overwrites the Running task when this is called a second time for a seamless reconnect? What happens if StopAsync is already awaiting the first one, which will now complete once the first call to ProcessSocketAsync ends? StopAsync would finish while the new connection has already taken over, right? At least the _stopCts should still stop everything eventually. I could see this causing problems for the fallback to a normal reconnect, though.

if (_useAck && !_gracefulClose)
{
UpdateConnectionPair();
await StartAsync(url, _webSocketMessageType == WebSocketMessageType.Binary ? TransferFormat.Binary : TransferFormat.Text, default).ConfigureAwait(false);
Member

We should retry the seamless reconnect attempts until we hit a configured limit. It probably makes sense to use an IRetryPolicy for this like we do for normal reconnects.
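A sketch of what driving the retries off an IRetryPolicy could look like. IRetryPolicy and RetryContext are the existing client types used for normal reconnects, and UpdateConnectionPair/StartAsync appear in the diff above; the loop itself (and the hardcoded transfer format) is illustrative:

```csharp
// Inside the transport (illustrative): keep attempting a seamless
// reconnect until the policy returns null, then fall back.
private async Task TrySeamlessReconnectAsync(IRetryPolicy policy, Uri url, CancellationToken token)
{
    long attempts = 0;
    var elapsed = System.Diagnostics.Stopwatch.StartNew();

    while (true)
    {
        var delay = policy.NextRetryDelay(new RetryContext
        {
            PreviousRetryCount = attempts,
            ElapsedTime = elapsed.Elapsed,
        });

        if (delay is null)
        {
            return; // policy gave up; fall back to a normal reconnect
        }

        await Task.Delay(delay.Value, token);

        try
        {
            UpdateConnectionPair();
            await StartAsync(url, TransferFormat.Binary, token).ConfigureAwait(false);
            return; // seamless reconnect succeeded
        }
        catch
        {
            attempts++; // ask the policy for the next delay
        }
    }
}
```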

@BrennanConroy BrennanConroy marked this pull request as ready for review May 24, 2023 22:39
@BrennanConroy BrennanConroy requested a review from Tratcher as a code owner May 24, 2023 22:39
@BrennanConroy BrennanConroy requested a review from JamesNK as a code owner May 24, 2023 22:39
Comment on lines +49 to +50
private int _bufferedByteCount;

Member

Consider, for the preview release, making this cheaply configurable (e.g. via environment variable) or at least just making it much larger.
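One cheap way to do that, as a hypothetical sketch (the environment variable name is made up):

```csharp
using System;

internal static class BufferLimits
{
    // Read an override from the environment at startup, falling back to
    // the ~100 KB default discussed above.
    public static readonly long MaxBufferedBytes =
        long.TryParse(Environment.GetEnvironmentVariable("SIGNALR_ACK_BUFFER_BYTES"), out var configured)
            && configured > 0
            ? configured
            : 100 * 1024;
}
```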

@adityamandaleeka
Member

Looks good to me overall. There are a lot of TODOs in there for future previews. If you don't already have a good way to track those, I recommend opening a follow-up checklist issue (or separate issues).

Member

@halter73 halter73 left a comment

The sooner people can try out the feature the better. This will have to go through API review before the "rtm" release, but I would like to see this in preview5.

I don't think there's much of a risk since it's opt-in on the client and a preview release. I'll be interested to see if anyone runs into issues with the 10-item buffer 😆 I think we'll eventually want to make this at least opt-out on the server too.

Edit: I see we've already implemented the 100 KB limit!

@BrennanConroy
Member Author

/backport to release/8.0-preview5

@github-actions
Contributor

Started backporting to release/8.0-preview5: https://github.com/dotnet/aspnetcore/actions/runs/5082466603

@BrennanConroy BrennanConroy merged commit f56c242 into main May 26, 2023
@BrennanConroy BrennanConroy deleted the brecon/ack branch May 26, 2023 00:27
@ghost ghost added this to the 8.0-preview6 milestone May 26, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Dec 8, 2023