Skip to content

recalculate network policies for established connections#253

Merged
aojea merged 4 commits into
kubernetes-sigs:mainfrom
aojea:reevaluate_connections
Oct 8, 2025
Merged

recalculate network policies for established connections#253
aojea merged 4 commits into
kubernetes-sigs:mainfrom
aojea:reevaluate_connections

Conversation

@aojea

@aojea aojea commented Sep 24, 2025

Copy link
Copy Markdown
Contributor

Fixes: #246

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 24, 2025
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 24, 2025
@aojea aojea force-pushed the reevaluate_connections branch from 71ae57a to 4f32f27 Compare September 25, 2025 10:07
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 25, 2025
@aojea aojea force-pushed the reevaluate_connections branch 2 times, most recently from e39599d to dda56ff Compare September 25, 2025 10:17
@aojea aojea changed the title [WIP] recalculate network policies for established connections recalculate network policies for established connections Sep 25, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 25, 2025
@aojea aojea force-pushed the reevaluate_connections branch 2 times, most recently from f1f3f84 to 6e69983 Compare September 25, 2025 11:01

@danwinship danwinship left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I was imagining something more like: "when policies/namespaces/pods change, figure out what IPs are no longer allowed, and then drop conntrack entries for those IPs". But I guess k-n-p doesn't work like that at all so that wouldn't make sense.

I don't remember all the details of how k-n-p currently works, but one thing to note is that some events that require updating NetworkPolicy enforcement are guaranteed not to require deleting conntrack entries. In particular, adding or deleting a Pod or Namespace does not require deleting conntrack entries.

Comment thread pkg/dataplane/controller.go Outdated
Comment thread pkg/dataplane/controller.go Outdated
Comment thread pkg/dataplane/controller.go Outdated
Comment thread tests/e2e_standard.bats
Comment thread pkg/dataplane/controller.go Outdated
Comment thread pkg/dataplane/controller.go Outdated
Comment thread pkg/dataplane/controller.go Outdated
@aojea aojea changed the title recalculate network policies for established connections [WIP] recalculate network policies for established connections Sep 25, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 25, 2025
@aojea

aojea commented Sep 25, 2025

Copy link
Copy Markdown
Contributor Author

it does not work flushing the conntrack entry, the connection is not sent through the queue, subsequent packets are treated as established

@aojea aojea force-pushed the reevaluate_connections branch from 6e69983 to dbb232f Compare September 27, 2025 09:32
Comment thread pkg/dataplane/controller_test.go Outdated
Comment on lines +285 to +289
chain forward-established {
type filter hook forward priority filter - 5; policy accept;
ct state established ip saddr . th sport . ip daddr . th dport . meta l4proto @block-tuple-v4 reject with icmp port-unreachable
ct state established ip6 saddr . th sport . ip6 daddr . th dport . meta l4proto @block-tuple-v6 reject with icmpv6 addr-unreachable
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to do something like this , is the only way I find out we can "break" existing connections https://lore.kernel.org/netfilter-devel/20250313231341.3040002-1-aojea@google.com/T/#m44c410cd56bdc62cc717798a0e28ee4dcaaab9ad

@aojea aojea force-pushed the reevaluate_connections branch 2 times, most recently from 135c13a to 54eeaef Compare September 27, 2025 10:16
@aojea

aojea commented Sep 28, 2025

Copy link
Copy Markdown
Contributor Author

Ok, I have a better understanding now of the problem, assume TCP and UDP only by now

For UDP just clean up the conntrack entry, next one will be sent to userspace with the nfqueu
For TCP, things get complicated here as there area a lot of ifs and deleting the conntrack entry directly does not work

  • if nf_conntrack_tcp_loose is enabled (default), conntrack picks up already established connections, so current logic to only evaluate the first packet we'll not work if we just flush the conntrack entry because it will be considered ESTABLISHED. Conntrack stores the tcp sequence numbers and windows to understand if a packet belongs to the connection
  • Good old TCP RST attack as in https://man.archlinux.org/man/extra/dsniff/tcpkill.8.en, it will require for us to have the sequence number and window to forge a packet, storing all the information is a no go, but we can try to get it from the conntrack , no idea if is already exposed or there is some ebpf magic that can return it, but injecting packets seems very intrusive.
  • Add nftables reject rules per tuple, but this does not look like scales well and also does not handle connections that are idle, so we do have to monitor the conntrack entry to understand for how long we should leave the nftables rule to reject that connection
  • sock_destroy , https://lwn.net/Articles/666220/ , this seems to be exactly what we are looking for, ss -K already implements it., the challenge will be to find the socket, and it is also exposed via netlink,

https://lwn.net/Articles/1037388/
https://lpc.events/event/16/contributions/1358/attachments/1015/2101/socket-termination-lpc2022.pdf
https://docs.ebpf.io/linux/kfuncs/bpf_sock_destroy/

It seems cilium folks went through this already and have pretty good description of the problem and are adding the necessary tooling to the stack cilium/cilium#25169

@danwinship

danwinship commented Sep 28, 2025

Copy link
Copy Markdown
Contributor

There's not always a socket though (eg, NodePort connection proxied to a pod on another node).

EDIT: oh, wait, that doesn't matter for NetworkPolicy purposes; you can postpone evaluation until the destination node

@aojea

aojea commented Oct 1, 2025

Copy link
Copy Markdown
Contributor Author

EDIT: oh, wait, that doesn't matter for NetworkPolicy purposes; you can postpone evaluation until the destination node

yeah, since we want to break existing connections, and it seems we need to be "invasive" we need to make local decisions only ... current code can optimize making decisions for the flow in both directions in the originating node ...

By the way, talked with one team mate working on the ebpf socket destroy problem internally, and he told me that there is no better solution than iterating through namespaces to find the sockets ... I need to check if we can get the information from the conntrack entry or we can store the relation Pod/IP to network namespace and use it, then the good news is that we can just use netlink, is what ss -K uses

@aojea aojea force-pushed the reevaluate_connections branch 3 times, most recently from 5c73879 to 49fcf83 Compare October 4, 2025 18:28
@aojea aojea force-pushed the reevaluate_connections branch 2 times, most recently from ae73d94 to 487a91a Compare October 4, 2025 23:22
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 4, 2025
@aojea aojea force-pushed the reevaluate_connections branch from 487a91a to 1eccd32 Compare October 5, 2025 20:18
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 5, 2025
@aojea aojea force-pushed the reevaluate_connections branch 9 times, most recently from e6dc0bf to 125a9d0 Compare October 6, 2025 09:03
aojea added 2 commits October 6, 2025 10:13
is it possible for a network policy or namespace or pod labels to change
during the lifetime of this objects.

Since these changes can impact the policy over the connectons already
established we need to reevaluate them, in order to do that, we just
check the existing conntrack entries and we clean them to force the
subsequent packets to go over the nfqueue process for non TCP
connections.

For TCP connections we try to find the socket and kill it.
@aojea aojea force-pushed the reevaluate_connections branch from 125a9d0 to 1ddc43a Compare October 6, 2025 10:13
@aojea aojea changed the title [WIP] recalculate network policies for established connections recalculate network policies for established connections Oct 6, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 6, 2025
@aojea

aojea commented Oct 6, 2025

Copy link
Copy Markdown
Contributor Author

@danwinship it took me more than I expected, found also some bugs on the libraries, but now is working as expected, as we can see in the logs of the e2e test added

2025-10-06T10:28:49.002179755Z stderr F I1006 10:28:49.002083       1 networkpolicy.go:302] "Pod has limited all ingress traffic" pod="dev/client" policy="prod/default-deny-ingress"
2025-10-06T10:28:49.002191927Z stderr F I1006 10:28:49.002116       1 engine.go:67] "Packet denied by ingress policy"
2025-10-06T10:28:49.002197247Z stderr F I1006 10:28:49.002129       1 controller.go:407] Connection no longer allowed by network policiespacket[0] 10.244.1.14:47600 10.244.1.13:8080 TCP
2025-10-06T10:28:49.002216743Z stderr F I1006 10:28:49.002139       1 logging.go:67] "Evaluating packet" direction="Egress" srcPod="prod/webserver" dstPod="dev/client" packet=<
2025-10-06T10:28:49.002219518Z stderr F 	[0] 10.244.1.13:8080 10.244.1.14:47600 TCP
2025-10-06T10:28:49.002221853Z stderr F  >
2025-10-06T10:28:49.002236761Z stderr F I1006 10:28:49.002153       1 logging.go:67] "Evaluating packet" direction="Ingress" srcPod="prod/webserver" dstPod="dev/client" packet=<
2025-10-06T10:28:49.002240167Z stderr F 	[0] 10.244.1.13:8080 10.244.1.14:47600 TCP
2025-10-06T10:28:49.002242962Z stderr F  >
2025-10-06T10:28:49.002245156Z stderr F I1006 10:28:49.002170       1 networkpolicy.go:190] "Ingress NetworkPolicies does not apply"
2025-10-06T10:28:49.002247581Z stderr F I1006 10:28:49.002175       1 engine.go:71] "Packet accepted by policy"
2025-10-06T10:28:49.002253492Z stderr F I1006 10:28:49.002184       1 controller.go:383] Skipping conntrack entry not involving managed IPsflowtcp	6 src=172.18.0.3 dst=172.18.0.2 sport=46466 dport=6443 packets=0 bytes=0	src=172.18.0.2 dst=172.18.0.3 sport=6443 dport=46466 packets=0 bytes=0 mark=0x0 start=1970-01-01 00:00:00 +0000 UTC stop=1970-01-01 00:00:00 +0000 UTC timeout=86397(sec)
2025-10-06T10:28:49.002256277Z stderr F I1006 10:28:49.002204       1 controller.go:443] "Destroying TCP connections" logger="firewall-enforcer" connections=1
2025-10-06T10:28:49.002298074Z stderr F I1006 10:28:49.002218       1 socketterminator.go:62] Finding route for IP: 10.244.1.14
2025-10-06T10:28:49.002407918Z stderr F I1006 10:28:49.002325       1 socketterminator.go:72] Found routes for IP 10.244.1.14: [{Ifindex: 15 Dst: 10.244.1.14/32 Src: 10.244.1.1 Gw: <nil> Flags: [] Table: 254 Realm: 0}]
2025-10-06T10:28:49.002561144Z stderr F I1006 10:28:49.002490       1 socketterminator.go:91] Found veth interface veth598fa711 with NetNsID 2
2025-10-06T10:28:49.002755444Z stderr F I1006 10:28:49.002693       1 socketterminator.go:124] Executing callback in namespace ID 2
2025-10-06T10:28:49.002760464Z stderr F I1006 10:28:49.002705       1 socketterminator.go:130] Attempting to destroy socket for packet [0] 10.244.1.14:47600 10.244.1.13:8080 TCP
2025-10-06T10:28:49.002763569Z stderr F I1006 10:28:49.002712       1 socketterminator.go:139] Attempting to destroy socket: 10.244.1.14:47600 -> 10.244.1.13:8080
2025-10-06T10:28:49.061181812Z stderr F I1006 10:28:49.061004       1 socketterminator.go:149] Successfully destroyed socket: 10.244.1.14:47600 -> 10.244.1.13:8080
2025-10-06T10:28:49.061205656Z stderr F I1006 10:28:49.061055       1 controller.go:342] "Enforcing firewall policies on existing connections" logger="firewall-enforcer" elapsed="68.161332ms"

please let me know if you want to review

@aojea aojea force-pushed the reevaluate_connections branch from 89c1d8d to 94ef330 Compare October 6, 2025 12:59
@aojea aojea force-pushed the reevaluate_connections branch from 94ef330 to 0ca0559 Compare October 6, 2025 13:01
@aojea

aojea commented Oct 8, 2025

Copy link
Copy Markdown
Contributor Author

talked offline, explained current algorithm and implementation and we are good on merging and iterating

@aojea aojea merged commit 46d939d into kubernetes-sigs:main Oct 8, 2025
17 of 18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

re-evaluate active connections when policies change

3 participants