[Multicast] Remove stale local members in the group cache by wenyingd · Pull Request #7154 · antrea-io/antrea

wenyingd · 2025-05-08T12:27:15Z

This change resolves an issue where the same receiver may fail to receive multicast packets after it rejoins the group in encap mode. The issue happens when the last local member has left the multicast group but receivers still exist on other Nodes in the cluster.

The issue was introduced because the Multicast controller added the group directly to the worker queue without updating its status in the cache, which caused the re-join event from the same Pod to be ignored.

The fix is to generate an IGMP leave message for each stale member, even if it is the last member on the current Node. It also calls checkLastMember in function clearStaleGroups when no local members are left, rather than directly adding the group to the worker queue. It also removes the field "lastIGMPReport" from struct GroupMemberStatus.

Fix: #7140
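
To make the failure mode in the description concrete, here is a minimal toy model in Go. It is not Antrea's code: `groupCache`, `handleJoin`, `handleLeave`, and `rejoinTriggersResync` are hypothetical names, and the rule that a join from an already-cached member is treated as a refresh only is an assumption made for illustration. Under that assumption, a member that times out without a leave event stays in the cache, so its later re-join does not trigger a resync.

```go
package main

import "fmt"

// groupCache is a toy stand-in for the controller's cache:
// group address -> set of local member names. Hypothetical type.
type groupCache map[string]map[string]bool

// handleJoin records a join and reports whether it would trigger a resync.
// Assumption: a join from a member already in the cache is only a refresh.
func (c groupCache) handleJoin(group, member string) (resync bool) {
	members, ok := c[group]
	if !ok {
		members = map[string]bool{}
		c[group] = members
	}
	if members[member] {
		return false // already known, nothing to resync
	}
	members[member] = true
	return true
}

// handleLeave removes the member from the cache.
func (c groupCache) handleLeave(group, member string) {
	delete(c[group], member)
}

// rejoinTriggersResync simulates join -> timeout -> re-join for one Pod and
// reports whether the re-join triggers a resync.
func rejoinTriggersResync(sendLeaveOnTimeout bool) bool {
	c := groupCache{}
	c.handleJoin("225.1.2.3", "pod-a")
	if sendLeaveOnTimeout {
		// Behavior described by the fix: a leave event updates the cache.
		c.handleLeave("225.1.2.3", "pod-a")
	}
	return c.handleJoin("225.1.2.3", "pod-a")
}

func main() {
	fmt.Println("without leave on timeout:", rejoinTriggersResync(false))
	fmt.Println("with leave on timeout:   ", rejoinTriggersResync(true))
}
```

In this model, only the variant that sends a leave event on timeout lets the re-join be noticed, which matches the motivation given for the fix.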

wenyingd · 2025-05-08T12:32:29Z

/test-all

antoninbas

The PR description reads:

The fix is to generate IGMP leave message for the stale members even if it is the last member on the current Node. It also calls checkLastMember in function clearStaleGroups when no local members are left, rather than directly adds the group into the worker queue.

This would imply that we are making 2 changes as part of this PR, but really I only see one (replacing c.queue.Add with a call to c.checkLastMember). I don't see the first change described: "generate IGMP leave message for the stale members even if it is the last member on the current Node". What am I missing?

antoninbas · 2025-05-08T20:29:55Z

@@ -1393,6 +1461,6 @@ func createIGMPJoinMessage(groups []net.IP, version uint8) []util.Message {
 }

 func TestMain(m *testing.M) {


I wonder why we have a TestMain specifically for this package

I can't recall the exact reason; maybe it was to reset igmpMaxResponseTime. I removed the function in the latest change and introduced a dedicated function to reset igmpMaxResponseTime instead.
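
A dedicated reset function of the kind mentioned in the reply could look roughly like the sketch below. The helper name `overrideMaxResponseTime`, the default value of 10 seconds, and `demoOverride` are all assumptions for illustration; only the variable name `igmpMaxResponseTime` comes from the discussion.

```go
package main

import (
	"fmt"
	"time"
)

// igmpMaxResponseTime stands in for the package-level variable discussed
// above. The 10s default here is arbitrary, not taken from Antrea.
var igmpMaxResponseTime = 10 * time.Second

// overrideMaxResponseTime (hypothetical helper) sets the variable for the
// duration of one test and returns a function that restores the old value.
func overrideMaxResponseTime(d time.Duration) (restore func()) {
	orig := igmpMaxResponseTime
	igmpMaxResponseTime = d
	return func() { igmpMaxResponseTime = orig }
}

// demoOverride reports the value in seconds while overridden and after restore.
func demoOverride() (duringSec, afterSec int) {
	restore := overrideMaxResponseTime(1 * time.Second)
	duringSec = int(igmpMaxResponseTime / time.Second)
	restore()
	afterSec = int(igmpMaxResponseTime / time.Second)
	return duringSec, afterSec
}

func main() {
	during, after := demoOverride()
	fmt.Printf("during test: %ds, after restore: %ds\n", during, after)
}
```

Inside a test, the returned function can be registered with `t.Cleanup(overrideMaxResponseTime(time.Second))`, which scopes the override to that test without needing a package-wide TestMain.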

wenyingd · 2025-05-09T02:27:37Z

This would imply that we are making 2 changes as part of this PR, but really I only see one (replacing c.queue.Add with a call to c.checkLastMember). I don't see the first change described: "generate IGMP leave message for the stale members even if it is the last member on the current Node". What am I missing?

I added an additional check to the if condition that no local members exist: https://github.com/antrea-io/antrea/pull/7154/files#diff-8e302e747c71db4b020ddf40f1debcb088af5aaea45266888f5f153c1be1b67bR184

The issue actually occurred because groupStatus.lastIGMPReport had timed out while local members still existed in the cache. The old logic ignored the existence of those local members and directly enqueued the group.

wenyingd · 2025-05-09T09:00:14Z

/test-multicast-e2e
/test-all

wenyingd · 2025-05-12T11:16:47Z

/test-multicast-e2e
/test-all

antoninbas · 2025-05-13T22:52:37Z

generate IGMP leave message for the stale members even if it is the last member on the current Node

I think I get it now. Before this change, when there was a single member left and it was timing out, we would always hit the first case (diff > c.mcastGroupTimeout) and not send the leave message.
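
The shadowing described in this comment can be sketched with a simplified model. This is not the controller code: `groupStatus`, `sweepOld`, `sweepNew`, `staleMembers`, and `demoSweeps` are hypothetical, and `sweepNew` is only one plausible shape of the fixed logic (the spot where the PR calls checkLastMember is represented by a boolean). The point is that a group-level timeout check placed first prevents the per-member check from ever producing a leave event for the last member.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// groupStatus is a simplified stand-in for the cached group state.
type groupStatus struct {
	lastIGMPReport time.Time            // used only by the old logic
	localMembers   map[string]time.Time // member -> time of last report
}

// staleMembers returns the members whose last report is older than timeout.
func staleMembers(s *groupStatus, now time.Time, timeout time.Duration) []string {
	var stale []string
	for member, lastUpdate := range s.localMembers {
		if now.Sub(lastUpdate) > timeout {
			stale = append(stale, member)
		}
	}
	sort.Strings(stale)
	return stale
}

// sweepOld models the pre-fix shape: the group-level check comes first,
// so a timed-out group is enqueued and no leave events are generated.
func sweepOld(s *groupStatus, now time.Time, timeout time.Duration) (leaves []string, enqueued bool) {
	if now.Sub(s.lastIGMPReport) > timeout {
		return nil, true
	}
	return staleMembers(s, now, timeout), false
}

// sweepNew models one plausible post-fix shape: members are always checked
// individually, and the last-member path is taken only when none are left.
func sweepNew(s *groupStatus, now time.Time, timeout time.Duration) (leaves []string, lastMemberCheck bool) {
	if len(s.localMembers) == 0 {
		return nil, true // where checkLastMember would be called
	}
	return staleMembers(s, now, timeout), false
}

// demoSweeps runs both variants on a group whose single member has timed out.
func demoSweeps() (oldLeaves, newLeaves int, oldEnqueued bool) {
	now := time.Date(2025, 5, 13, 0, 0, 0, 0, time.UTC)
	last := now.Add(-5 * time.Minute)
	s := &groupStatus{
		lastIGMPReport: last,
		localMembers:   map[string]time.Time{"pod-a": last},
	}
	timeout := 3 * time.Minute
	o, e := sweepOld(s, now, timeout)
	n, _ := sweepNew(s, now, timeout)
	return len(o), len(n), e
}

func main() {
	o, n, e := demoSweeps()
	fmt.Printf("old: leaves=%d enqueued=%v; new: leaves=%d\n", o, e, n)
}
```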

antrea-bot · 2025-05-13T22:52:39Z

Can one of the admins verify this patch?

antoninbas

does the definition of the groupIsStale function need to change as well?

wenyingd · 2025-05-14T01:46:27Z

does the definition of the groupIsStale function need to change as well?

Thanks for catching it. Yes, we should update it.

Update: thinking about it more, I would keep "len(status.localMembers) == 0" as the only condition that decides whether a group is stale, since syncGroup does not need to wait until the multicast group has timed out after the last local receiver has left. We can leave the additional condition checks to clearStaleGroups and eventHandler, which decide which multicast groups should be synced. Since groupIsStale was called only by syncGroup, I removed it and use the check condition directly in the caller.
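
The difference between the two staleness conditions can be shown with a small sketch. The time-gated variant below is a guess at what a timeout-dependent check might look like, not the removed groupIsStale function; `groupStatus`, `lastReport`, `staleByTimeout`, `staleByMembers`, and `demoStale` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// groupStatus is a simplified, hypothetical view of a cached group.
type groupStatus struct {
	localMembers map[string]time.Time
	lastReport   time.Time
}

// staleByTimeout is an assumed time-gated check: the group counts as stale
// only once it has no local members AND the timeout has elapsed.
func staleByTimeout(s groupStatus, now time.Time, timeout time.Duration) bool {
	return len(s.localMembers) == 0 && now.Sub(s.lastReport) > timeout
}

// staleByMembers is the single condition described in the comment above.
func staleByMembers(s groupStatus) bool {
	return len(s.localMembers) == 0
}

// demoStale evaluates both checks ten seconds after the last member left.
func demoStale() (byTimeout, byMembers bool) {
	now := time.Date(2025, 5, 14, 0, 0, 0, 0, time.UTC)
	s := groupStatus{
		localMembers: map[string]time.Time{},
		lastReport:   now.Add(-10 * time.Second),
	}
	return staleByTimeout(s, now, 3*time.Minute), staleByMembers(s)
}

func main() {
	byTimeout, byMembers := demoStale()
	fmt.Println("time-gated check:", byTimeout)
	fmt.Println("member-count check:", byMembers)
}
```

With the member-count condition, a sync that runs right after the last local receiver leaves can act immediately instead of waiting for the group timeout.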

wenyingd · 2025-05-14T10:59:51Z

/test-multicast-e2e
/test-all

wenyingd · 2025-05-15T04:26:55Z

Any other comments on this PR, @antoninbas @tnqn?

antoninbas

LGTM, please backport as needed

tnqn

The fix LGTM, but I wonder if status.lastIGMPReport is still useful.

tnqn · 2025-05-16T15:35:34Z

-		if diff > c.mcastGroupTimeout {
-			// Notify worker to remove the group from groupCache if all its members are not updated before mcastGroupTimeout.
-			c.queue.Add(status.group.String())
+		if diff > c.mcastGroupTimeout && len(status.localMembers) == 0 {


question: is it still meaningful to track when the last IGMP report was received and to have the check here, since every member will be checked individually?

Removed the field lastIGMPReport in struct GroupMemberStatus

wenyingd · 2025-05-19T03:56:43Z

/test-multicast-e2e
/test-all

tnqn

LGTM

tnqn · 2025-05-19T05:59:34Z

 			// Create a "leave" event for a local member if it is not updated before mcastGroupTimeout.
 			for member, lastUpdate := range status.localMembers {
-				if now.Sub(lastUpdate) > c.mcastGroupTimeout {
+				containerDiff := now.Sub(lastUpdate)


Could it just be named diff now, since there is no name conflict? containerDiff makes people wonder what container represents here.

tnqn

LGTM

tnqn · 2025-05-19T08:20:25Z

/test-multicast-e2e
/test-all

This change resolves an issue where the same receiver may fail to receive multicast packets after it rejoins the group in encap mode. The issue happens when the last local member has left the multicast group but receivers still exist on other Nodes in the cluster.

The issue was introduced because the Multicast controller added the group directly to the worker queue without updating its status in the cache, which caused the re-join event from the same Pod to be ignored.

The fix is to generate an IGMP leave message for each stale member, even if it is the last member on the current Node. It also calls checkLastMember in function `clearStaleGroups` when no local members are left, rather than directly adding the group to the worker queue. It also removes the field "lastIGMPReport" from struct GroupMemberStatus.

Fixes #7140

Signed-off-by: Wenying Dong <wenying.dong@broadcom.com>

wenyingd requested review from antoninbas and tnqn May 8, 2025 12:27

antoninbas reviewed May 8, 2025

wenyingd force-pushed the bugfix_stalegroup branch from ce3572e to 8144d73 Compare May 9, 2025 02:22

wenyingd force-pushed the bugfix_stalegroup branch 3 times, most recently from 0101102 to 9f98601 Compare May 9, 2025 08:05

wenyingd force-pushed the bugfix_stalegroup branch from 9f98601 to e9789db Compare May 12, 2025 06:40

wenyingd requested a review from antoninbas May 12, 2025 11:17

antoninbas reviewed May 13, 2025

Comment thread pkg/agent/multicast/mcast_controller.go Outdated

Comment thread pkg/agent/multicast/mcast_controller.go Outdated

wenyingd force-pushed the bugfix_stalegroup branch 2 times, most recently from ec788e3 to 0359504 Compare May 14, 2025 03:28

wenyingd requested a review from antoninbas May 14, 2025 03:44

antoninbas previously approved these changes May 15, 2025

antoninbas added the action/backport (indicates a PR that requires backports) and action/release-note (indicates a PR that should be included in release notes) labels May 15, 2025

antoninbas added this to the Antrea v2.4 release milestone May 15, 2025

tnqn reviewed May 16, 2025

wenyingd dismissed antoninbas’s stale review via 7af7928 May 19, 2025 03:19

wenyingd force-pushed the bugfix_stalegroup branch from 0359504 to 7af7928 Compare May 19, 2025 03:19

wenyingd requested a review from tnqn May 19, 2025 03:56

tnqn previously approved these changes May 19, 2025

wenyingd dismissed tnqn’s stale review via b8b9277 May 19, 2025 07:59

wenyingd force-pushed the bugfix_stalegroup branch from 7af7928 to b8b9277 Compare May 19, 2025 07:59

wenyingd requested a review from tnqn May 19, 2025 08:00

tnqn approved these changes May 19, 2025

antoninbas merged commit fa5ba17 into antrea-io:main May 19, 2025
60 of 63 checks passed

wenyingd mentioned this pull request May 20, 2025

Automated cherry pick of #7154: Remove stale local members in the group cache #7179

Merged

wenyingd mentioned this pull request May 20, 2025

Automated cherry pick of #7154: Remove stale local members in the group cache #7180

Merged
