Fix race condition in ConntrackConnectionStore and FlowExporter #3655
Merged
Conversation
e00f8c1 to
a6093c3
Compare
Codecov Report
@@ Coverage Diff @@
## main #3655 +/- ##
==========================================
- Coverage 64.65% 57.10% -7.55%
==========================================
Files 278 392 +114
Lines 39363 54823 +15460
==========================================
+ Hits 25449 31306 +5857
- Misses 11939 21060 +9121
- Partials 1975 2457 +482
antoninbas
reviewed
Apr 19, 2022
Contributor
antoninbas
left a comment
Thanks for the helpful PR description.
A couple of typos there, e.g. "We fix it by hold the lock until finish 1&2." instead of "We fix it by holding the lock until we finish 1&2."
44d2030 to
fd726be
Compare
Contributor
Author
/test-all
6737581 to
c4e0568
Compare
Contributor
Author
Squash the commits and rewrite the commit message...
/test-all
c4e0568 to
31b8a81
Compare
Contributor
Author
/test-all
antoninbas
previously approved these changes
Apr 19, 2022
Contributor
Author
/test-networkpolicy
Contributor
Author
/test-e2e
annakhm
reviewed
Apr 20, 2022
The conntrack connection store's polling goroutine and the flow exporter both access the conntrack connection store, which leads to a race condition. In the polling goroutine, `deleteIfStaleOrResetConn` and `AddOrUpdateConn` each grab the lock, modify the `conn.IsPresent` field, and release the lock. Between the execution of these two functions, the FlowExporter's timer may be triggered, in which case it reads a wrong `conn.IsPresent` value in an intermediate state. We fix it by holding the lock until both functions have finished executing.
Fixes: antrea-io#3650
Signed-off-by: heanlan <hanlan@vmware.com>
31b8a81 to
323ba53
Compare
Contributor
Author
/test-all
Contributor
Author
/test-e2e
Contributor
Author
/test-e2e
annakhm
approved these changes
Apr 25, 2022
The conntrack connection store's polling goroutine and the flow exporter both access the conntrack connection store, which leads to a race condition.

In `func (cs *ConntrackConnectionStore) Poll()`, two things happen in sequence:

1. `deleteIfStaleOrResetConn`: we acquire the lock, reset `conn.IsPresent = false` for all the connections in the connection map, and then release the lock (`conn.IsPresent` describes whether the connection exists in the conntrack table or not).
2. `AddOrUpdateConn`: we acquire the lock, set `conn.IsPresent = true`, then release the lock.

If the flow exporter's timer is triggered between 1 and 2, which is likely to happen, it will grab the lock and read a connection with `IsPresent` set to false. In the corresponding exported flow record, `flowEndReason` will be set to 3, indicating that the flow has ended. Here's an antrea-agent log to verify the existence of this error: log

We fix it by holding the lock until we finish 1&2.

The observation comes from the error log of the flowaggregator e2e test. In the record, `flowEndReason` was set to 3, so the test treated the record as the last record. Pointer to test code. It computed the throughput value by totalByteCount/iperfTimeSec, without reading from the throughput field in the record, which has the correct value.

Fixes: #3650
Signed-off-by: heanlan hanlan@vmware.com
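The two-step sequence in `Poll()` and the fix can be modeled with a small standalone sketch. This is not Antrea's code: the `store` and `conn` types, `pollRacy`, `pollFixed`, `countAbsent`, and the `between` hook are all hypothetical names introduced for illustration, and the hook only stands in for the flow exporter's timer firing between steps 1 and 2.

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a simplified, hypothetical stand-in for a connection entry.
type conn struct {
	IsPresent bool
}

// store is a hypothetical minimal model of a connection map guarded by one mutex.
type store struct {
	mu    sync.Mutex
	conns map[string]*conn
}

// resetAll marks every connection as absent. The caller must hold s.mu.
func (s *store) resetAll() {
	for _, c := range s.conns {
		c.IsPresent = false
	}
}

// markPresent marks the given keys as present. The caller must hold s.mu.
func (s *store) markPresent(keys []string) {
	for _, k := range keys {
		if c, ok := s.conns[k]; ok {
			c.IsPresent = true
		} else {
			s.conns[k] = &conn{IsPresent: true}
		}
	}
}

// pollRacy models the original behavior: the lock is released between the
// reset step and the update step. The between hook runs inside that window.
func (s *store) pollRacy(keys []string, between func()) {
	s.mu.Lock()
	s.resetAll()
	s.mu.Unlock()

	if between != nil {
		between() // a reader running here sees IsPresent == false
	}

	s.mu.Lock()
	s.markPresent(keys)
	s.mu.Unlock()
}

// pollFixed models the fix: one critical section covers both steps, so no
// reader can observe the intermediate state.
func (s *store) pollFixed(keys []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.resetAll()
	s.markPresent(keys)
}

// countAbsent plays the role of a reader such as an exporter: it takes the
// lock and counts connections whose IsPresent flag is false.
func (s *store) countAbsent() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	n := 0
	for _, c := range s.conns {
		if !c.IsPresent {
			n++
		}
	}
	return n
}

func main() {
	s := &store{conns: map[string]*conn{"a": {IsPresent: true}}}

	observed := -1
	s.pollRacy([]string{"a"}, func() { observed = s.countAbsent() })
	fmt.Println("racy poll, absent connections seen mid-poll:", observed)

	s.pollFixed([]string{"a"})
	fmt.Println("fixed poll, absent connections seen afterwards:", s.countAbsent())
}
```

In this model, a reader that takes the lock during `pollRacy` can see a live connection flagged as absent, while any reader contending with `pollFixed` only ever sees the state before or after the whole poll.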