GODRIVER-2037 Don't clear the connection pool on client-side connect timeout errors. #688

Merged
matthewdale merged 9 commits into master from godriver2037 on Jun 30, 2021

Conversation

matthewdale
Collaborator

@matthewdale matthewdale commented Jun 17, 2021

Update topology.Server#ProcessHandshakeError so that it does not clear the server connection pool on client-side timeout errors that occur while creating a connection in-line with an operation. This is a targeted patch for specific, identified conditions that currently cause the driver to clear the server connection pool unnecessarily; it is not intended to handle every error that could occur during handshake.
Note that these changes are likely irrelevant once all connections are established in the background with the connectTimeoutMS deadline (GODRIVER-2038).

Possible cases:

  1. server.Connection(context.Background())
    Get a connection to the server using connectTimeoutMS. The server connection pool will be cleared on a timeout during handshake. This case covers connections created by the MinPoolSize maintenance goroutine and by operations that pass context.Background().
  2. server.Connection(context.WithTimeout(..., 1*time.Second))
    Get a connection to the server using an operation timeout shorter than connectTimeoutMS. The server connection pool will not be cleared on a timeout during handshake. This case covers connections created by operations that use context.WithTimeout() with a timeout shorter than connectTimeoutMS.
  3. server.Connection(context.WithTimeout(..., 10*time.Minute))
    Get a connection to the server using an operation timeout longer than connectTimeoutMS. The server connection pool will be cleared on a timeout during handshake, because connectTimeoutMS expires first. This case covers connections created by operations that use context.WithTimeout() with a timeout longer than connectTimeoutMS.
  4. server.Connection(context.WithCancel(...))
    Get a connection to the server using a cancellation specified on an operation. The server connection pool will not be cleared when the handshake is interrupted by that cancellation. This case covers connections created by operations that use context.WithCancel() for cancellation.

GODRIVER-2037

@matthewdale matthewdale force-pushed the godriver2037 branch 2 times, most recently from b84c70d to 859168b Compare June 17, 2021 06:11
@matthewdale matthewdale requested review from kevinAlbs and iwysiu June 17, 2021 06:30
@matthewdale matthewdale marked this pull request as ready for review June 17, 2021 06:31
@kevinAlbs kevinAlbs requested a review from divjotarora June 17, 2021 14:45
Contributor

@kevinAlbs kevinAlbs left a comment

Looking great!

Contributor

@kevinAlbs kevinAlbs left a comment

Fantastic work, LGTM!

Contributor

@divjotarora divjotarora left a comment

Looks great! Just some minor comments about testing.

@matthewdale matthewdale requested a review from divjotarora June 25, 2021 05:42
Contributor

@divjotarora divjotarora left a comment

Looks great! The new operationTimeout parameter for tests is much clearer IMO. I responded to a previous thread about testing context.Canceled errors so I'll hold off on approving until that's resolved, but everything else pretty much LGTM.

…ake error handling, improve SDAM error handling tests, add handshake cancellation error test.
@matthewdale matthewdale requested a review from divjotarora June 28, 2021 18:51
@matthewdale
Collaborator Author

@kevinAlbs and @divjotarora I've made 2 somewhat significant changes since the last time you reviewed this:

  1. Pass the operation-scoped context to the ProcessHandshakeError function and check whether the actual deadline has passed (if there is one) instead of just checking ctx.Err(). This handles a race condition in which some net functions extract the deadline from a context and time out on that deadline before the goroutine that closes ctx.Done() (and causes ctx.Err() to return an error) has run.
  2. Use individual instances of the new testPoolMonitor type to replace the package-global poolMonitor in the mongo/integration package tests. This fixes unexpected interactions between the pool monitors of different test cases that result in assertion failures and test timeouts (caused by infinitely blocking channel sends).

@kevinAlbs kevinAlbs self-requested a review June 28, 2021 19:20
@matthewdale matthewdale requested a review from iwysiu June 28, 2021 23:03
Contributor

@divjotarora divjotarora left a comment

LGTM! Thanks for investigating this issue and all of the approaches so thoroughly!

}
}

func TestServerConnectionCancellation(t *testing.T) {
Contributor

nit: For readability, can you write a comment at the beginning of this test that explains what we're testing?

Collaborator Author

Will add comment

Contributor

@kevinAlbs kevinAlbs left a comment

LGTM mod optional comments.

@@ -23,7 +24,65 @@ const (
errorInterruptedAtShutdown int32 = 11600
)

// testPoolMonitor exposes an *event.PoolMonitor and collects all events logged to that
// *event.PoolMonitor. It is safe to use from multiple concurrent goroutines.
Contributor

Oh very nice. Optional: should this be in a separate file, like connection_pool_helpers_test.go? Though isPoolCleared was used by other files, the name of this file suggests it is only for the tests specified in https://github.com/mongodb/specifications/blob/master/source/connections-survive-step-down/tests/README.rst

Collaborator Author

@matthewdale matthewdale Jun 29, 2021

Yeah, I wanted to locate it "next to" the existing poolMonitor so that there aren't two implementations of the same thing in different places. I think the testPoolMonitor could actually be a good candidate for a "test utilities" package because numerous test packages need to record events (including the Server tests added in this PR). However, I'm not sure it's worth moving into a separate file until we want to move it into a test utilities package, which probably shouldn't be part of this PR.

I've added investigating moving testPoolMonitor to a shared test utilities package to the description of GODRIVER-2068.

return true
}

// In some networking functions, the deadline from the context is used to determine timeouts
Contributor

Thank you for the helpful explanation! That is interesting, and it sounds like it could be buggy behavior in the net package if it sets context errors inconsistently.

// Create a connection pool event monitor that sends all events to an events channel
// so we can assert on the connection pool events later.
WithConnectionPoolMonitor(func(_ *event.PoolMonitor) *event.PoolMonitor {
return &event.PoolMonitor{
Contributor

Optional: IIUC this could use the new testPoolMonitor.

Collaborator Author

Yes it could, although it would require moving testPoolMonitor into a test util package like internal/testutil (or even an external test utilities package if that makes sense). I'll leave a comment to indicate that change should be considered as part of GODRIVER-2068.

@matthewdale matthewdale merged commit 5199a0b into master Jun 30, 2021
matthewdale added a commit that referenced this pull request Jun 30, 2021
@matthewdale matthewdale deleted the godriver2037 branch July 13, 2021 05:25
faem pushed a commit to kubedb/mongo-go-driver that referenced this pull request Mar 17, 2022