[SignalR] Seamless Reconnect #48338
Conversation
looked through it; couldn't see anything actively questionable, but there were large parts of it that went over my head; maybe a little more "here's what we're doing and why" on the PR description might help there?
/// <summary>
///
/// </summary>
public interface IReconnectFeature
couldn't see one linked - has this gone through API review?
also: IntelliSense (the XML summary is empty)
/// <summary>
///
/// </summary>
public Action NotifyOnReconnect { get; set; }
is there any useful context that would be meaningful on a per-invoke basis?
just double-checking: should this be an event rather than a property?
I don't like how there's a race where you don't know if you're writing to the original or new connection, since NotifyOnReconnect fires at some arbitrary point after ConnectionContext.Transport has been swapped to the new connection. The write might throw sometimes, but it also might not. With the way we use it, that's okay because we can ignore anything prior to the sequence message on the reading side, but it feels like an unnecessary weakness in the design of the feature.
Can we move the reconnect logic to be more above the transport layer? I'm thinking a new ConnectionContext with a new connection ID. This would still require a new feature to correlate the new connection with the old connection ID based on the token on the server, and we'd still want it to seamlessly use the old connection ID from the perspective of code using the Hub APIs. But since there's no seamless deduping at the transport layer, I think it's best to avoid any magical, potentially leaky abstractions there.
I know this would be a major redesign, so I'm okay with shipping this as a WebSocket transport feature at first. But if we can relayer this, we could probably avoid a bunch of transport-specific logic.
just double-checking: should this be an event rather than a property?
In SignalR and most of the rest of ASP.NET Core we prefer plain old Funcs and Actions over events.
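To make that preference concrete, here is a minimal sketch of how a consumer might hook the Action-based property quoted above; the feature shape comes from this PR's diff, but the chaining and the callback body are illustrative assumptions.

```csharp
// Sketch only: chain the delegate so any callback another component already
// registered on the feature keeps firing.
var feature = connection.Features.Get<IReconnectFeature>();
if (feature is not null)
{
    var previous = feature.NotifyOnReconnect;
    feature.NotifyOnReconnect = () =>
    {
        previous?.Invoke();
        // React to the transport swap here, e.g. queue a SequenceMessage so the
        // other side can discard duplicates (hypothetical consumer logic).
    };
}
```

With an event, subscribers would just use +=; with a plain Action property the consumer does this chaining manually, which is part of the trade-off being asked about.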
@@ -1055,6 +1074,20 @@ private async Task SendWithLock(ConnectionState expectedConnectionState, HubMess
    Log.ReceivedPing(_logger);
    // timeout is reset above, on receiving any message
    break;
case AckMessage ackMessage:
    _logger.LogInformation("Received Ack with ID {id}", ackMessage.SequenceId);
subjective, but I'd say that this (and possibly some of the others) sound more like "debug" than "info"; but: that's literally as much energy as I have for that topic, so: if you disagree - just hit "mark resolved" and ignore me - totally fine
Thanks for taking a look at the PR Marc 😃
I'm updating these to Trace and switching to source-generated logging right now. I only put them in as Info and non-source-gen to make debugging easier, and since I've fixed the bugs I was looking for and am doing cleanup now, they're being updated.
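For reference, a sketch of the Trace-level, source-generated form being described; the event id, message text, and method name below are placeholders rather than the PR's actual logger.

```csharp
using Microsoft.Extensions.Logging;

internal static partial class Log
{
    // Source-generated logger: the generator emits the implementation at compile time.
    [LoggerMessage(EventId = 1, Level = LogLevel.Trace, Message = "Received AckMessage with Sequence ID {SequenceId}.")]
    public static partial void ReceivedAckMessage(ILogger logger, long sequenceId);
}

// Call site replaces the inline LogInformation:
// Log.ReceivedAckMessage(_logger, ackMessage.SequenceId);
```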
public async ValueTask<FlushResult> WriteAsync(SerializedHubMessage hubMessage, CancellationToken cancellationToken)
{
    // TODO: Backpressure based on message count and total message size
    if (_buffer[_bufferIndex].Message is not null)
just checking: is this only accessed inside the write lock? (maybe via a re-entrant callback from _protocol.WriteMessage?) just: as a public method I'm unclear how this interacts with the lock
WriteAsync is always called in a lock by the calling code, so there will never be parallel WriteAsync calls; however, we can't really express that here since this is a separate class.
We can add a comment to the method saying that it assumes the calling code is doing the needful.
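As an illustration of that comment, one cheap option is a debug-only guard that documents and surfaces the single-caller assumption without adding a real lock; the _writeActive field and FlushCoreAsync helper below are hypothetical, not the PR's code.

```csharp
// Requires System.Diagnostics (Debug) and System.Threading (Interlocked, Volatile).
// 0 = idle, 1 = a WriteAsync call is in flight. Debug-only; no real synchronization.
private int _writeActive;

public async ValueTask<FlushResult> WriteAsync(SerializedHubMessage hubMessage, CancellationToken cancellationToken)
{
    // The hub connection takes a write lock before calling, so overlapping calls are a caller bug.
    Debug.Assert(Interlocked.Exchange(ref _writeActive, 1) == 0,
        "MessageBuffer.WriteAsync must not be called concurrently; callers must hold the write lock.");
    try
    {
        // Buffer the message and flush to the transport (hypothetical helper).
        return await FlushCoreAsync(hubMessage, cancellationToken).ConfigureAwait(false);
    }
    finally
    {
        Volatile.Write(ref _writeActive, 0);
    }
}
```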
// Or in exceptional cases we could miss multiple messages, but the next ack will clear them
var index = _bufferIndex;
var finalIndex = -1;
for (var i = 0; i < _buffer.Length; i++)
ditto re locking semantics
This method is also only called once at a time (again enforced by the calling code), but it can run in parallel with other methods on MessageBuffer. I think it's safe as currently written, unless the ValueTuple assignment to the array can tear, since it's technically two values.
Ok tearing is probably possible here. Will need to lock, but it shouldn't be contended too much.
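For illustration, the locking shape being discussed might look roughly like this; the field names, tuple layout, and helper methods are assumptions based on the quoted snippets, not the final code.

```csharp
// The buffer pairs each serialized message with its sequence id. Writing a
// multi-field tuple into an array element is not atomic, so without a lock a
// reader could see the Message from one write and the SequenceId from another.
private readonly object _bufferLock = new();
private readonly (SerializedHubMessage? Message, long SequenceId)[] _buffer =
    new (SerializedHubMessage?, long)[10];

private void SetSlot(int index, SerializedHubMessage message, long sequenceId)
{
    lock (_bufferLock)
    {
        _buffer[index] = (message, sequenceId);
    }
}

private void AckThrough(long sequenceId)
{
    lock (_bufferLock)
    {
        for (var i = 0; i < _buffer.Length; i++)
        {
            // Release every buffered message the remote side has acknowledged.
            if (_buffer[i].Message is not null && _buffer[i].SequenceId <= sequenceId)
            {
                _buffer[i] = (null, 0);
            }
        }
    }
}
```

An uncontended lock is cheap, which matches the expectation above that it shouldn't be contended much.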
This is looking good so far.
public MessageBuffer(ConnectionContext connection, IHubProtocol protocol)
{
    // Arbitrary size, we can figure out defaults and configurability later
    const int bufferSize = 10;
I think the primary or only limit should be based on the total size of the serialized buffer in bytes. 10 messages before observing backpressure could be really bad for any app that sends a lot of small messages over high latency connections.
I think the default should probably be on the order of 100 KB or 1 MB on the server (for comparison, SocketTransportOptions.MaxReadBufferSize and Http2Limits.InitialConnectionWindowSize are both 1 MB) and larger on the client to avoid unnecessary backpressure.
On the server, backpressure could slow down some server logic and punish faster connections in the same group because of one slow connection. This is already an issue with socket backpressure, but socket writes usually involve large amounts of buffering at lower layers, and this is all app level.
This wouldn't account for the fact that some of these serialized messages have multiple references, but I think it's okay to overcount for this. At least it's simple to explain. It also wouldn't account for the size of the slots in the collection referencing the SerializedHubMessage, but we can probably mitigate that with smart collection design. Worst case, we could have a separate limit for the number of buffered hub messages, but it should also be large by default. 10 feels too small; 1000 seems more reasonable, but this should definitely be discussed as part of the threat modeling.
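A rough sketch of the byte-based limit being suggested, assuming a simple wait/release scheme; the constant, field, and method names here are illustrative only.

```csharp
// Writers wait while the unacked bytes exceed the limit; acks free space and wake them.
private const long BufferLimitBytes = 100 * 1024; // illustrative default, ~100 KB

private long _bufferedByteCount;
private readonly object _sync = new();
private TaskCompletionSource _spaceAvailable = new(TaskCreationOptions.RunContinuationsAsynchronously);

private async ValueTask WaitForSpaceAsync(long incomingBytes, CancellationToken cancellationToken)
{
    while (true)
    {
        Task wait;
        lock (_sync)
        {
            if (_bufferedByteCount + incomingBytes <= BufferLimitBytes)
            {
                _bufferedByteCount += incomingBytes;
                return;
            }
            wait = _spaceAvailable.Task;
        }
        // Backpressure: the write does not complete until an ack frees enough space.
        await wait.WaitAsync(cancellationToken).ConfigureAwait(false);
    }
}

private void OnMessagesAcked(long freedBytes)
{
    lock (_sync)
    {
        _bufferedByteCount -= freedBytes;
        var toRelease = _spaceAvailable;
        _spaceAvailable = new(TaskCreationOptions.RunContinuationsAsynchronously);
        toRelease.TrySetResult();
    }
}
```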
// TODO: Handle TCP connection errors
// https://github.com/SignalR/SignalR/blob/1fba14fa3437e24c204dfaf8a18db3fce8acad3c/src/Microsoft.AspNet.SignalR.Core/Owin/WebSockets/WebSocketHandler.cs#L248-L251
Running = ProcessSocketAsync(_webSocket);
Running = ProcessSocketAsync(_webSocket, url, ignoreFirstCanceled);
So this just overwrites the Running task when this is called a second time for a seamless reconnect? What happens if StopAsync is already awaiting the first one, which will now stop once this method ends at the end of the first call to ProcessSocketAsync? StopAsync would finish while the new connection has already taken over, right? At least the _stopCts should still stop everything eventually. I could see this causing problems for the fallback to a normal reconnect though.
if (_useAck && !_gracefulClose)
{
    UpdateConnectionPair();
    await StartAsync(url, _webSocketMessageType == WebSocketMessageType.Binary ? TransferFormat.Binary : TransferFormat.Text, default).ConfigureAwait(false);
We should retry the seamless reconnect attempts until we hit a configured limit. It probably makes sense to use an IRetryPolicy for this, like we do for normal reconnects.
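For illustration, the retry loop might look roughly like this; IRetryPolicy and RetryContext are the existing client types used for normal reconnects, while retryPolicy and AttemptSeamlessReconnectAsync are hypothetical stand-ins for the configured policy and the StartAsync call in the quoted code.

```csharp
// Sketch: keep trying the seamless reconnect until the policy gives up (returns null),
// then fall through to the normal reconnect/closed path.
var retryCount = 0L;
var startedAt = DateTime.UtcNow;

while (true)
{
    try
    {
        await AttemptSeamlessReconnectAsync().ConfigureAwait(false); // hypothetical helper
        break; // reconnected successfully
    }
    catch (Exception ex)
    {
        var delay = retryPolicy.NextRetryDelay(new RetryContext
        {
            PreviousRetryCount = retryCount++,
            ElapsedTime = DateTime.UtcNow - startedAt,
            RetryReason = ex,
        });

        if (delay is null)
        {
            // Policy exhausted; give up on seamless reconnect.
            throw;
        }

        await Task.Delay(delay.Value).ConfigureAwait(false);
    }
}
```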
private int _bufferedByteCount;
Consider, for the preview release, making this cheaply configurable (e.g. via environment variable) or at least just making it much larger.
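A minimal sketch of that kind of cheap knob; the environment variable name is made up for illustration and is not an actual ASP.NET Core setting.

```csharp
// Let testers override the buffer limit without adding new public API surface.
private static long GetBufferLimitBytes()
{
    var raw = Environment.GetEnvironmentVariable("SIGNALR_SEAMLESS_RECONNECT_BUFFER_LIMIT");
    return long.TryParse(raw, out var configured) && configured > 0
        ? configured
        : 100 * 1024; // fall back to the ~100 KB default discussed above
}
```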
Looks good to me overall. There are a lot of TODOs in there for future previews. If you don't already have a good way to track those, I recommend opening a follow-up checklist issue (or separate issues).
The sooner people can try out the feature the better. This will have to go through API review before the "rtm" release, but I would like to see this in preview5.
I don't think there's much of a risk since it's opt-in on the client and a preview release. I'll be interested to see if anyone runs into issues with the 10 item buffer 😆 I think we'll eventually want to make this at least opt-out on the server too.
Edit: I see we've already implemented the 100 KB limit!
/backport to release/8.0-preview5
Started backporting to release/8.0-preview5: https://github.com/dotnet/aspnetcore/actions/runs/5082466603
Initial implementation of #46691
High level details
There will probably be a server option in future previews to disallow acks, along with versioning of the ack protocol.
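For context, a rough sketch of the client opt-in shape; UseStatefulReconnect is the option name that eventually shipped in .NET 8 and may not match what this preview exposes, so treat it as an assumption.

```csharp
using Microsoft.AspNetCore.SignalR.Client;

// Client opt-in sketch: seamless reconnect stays off unless the client asks for it.
var connection = new HubConnectionBuilder()
    .WithUrl("https://example.com/chathub", options =>
    {
        options.UseStatefulReconnect = true; // name from the shipped .NET 8 API; may differ here
    })
    .Build();
```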
Follow-up work: