Description
We run an almost BGP-only Calico setup (VXLAN is used only to tunnel the service network, no DSR) in which each node peers with its ToR switch.
Suddenly, some of the cluster nodes (8, 13, or 23 nodes at once) started to kernel panic every few days.
Expected Behavior
The node should not kernel panic.
Current Behavior
[ 96.620212] ------------[ cut here ]------------
[ 96.620215] kernel BUG at net/core/skbuff.c:4306!
[ 96.620423] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 96.620650] CPU: 0 PID: 3089 Comm: napi/eth0-8276 Kdump: loaded Not tainted 6.2.0-34-generic #34~22.04.1-Ubuntu
[ 96.621105] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 03/08/2022
[ 96.621494] RIP: 0010:skb_segment+0xc80/0xeb0
[ 96.621695] Code: 00 44 89 6b 70 48 29 d1 89 c8 44 01 e9 89 8b bc 00 00 00 e9 a5 fe ff ff 39 44 24 7c 0f 87 ca fe ff ff 44 89 c2 e9 f5 fe ff ff <0f> 0b 48 8d 42 ff e9 e0 fb ff ff 0f 0b 0f 0b 83 f9 01 74 e4 31 d2
[ 96.622546] RSP: 0018:ffffa9291e433638 EFLAGS: 00010293
[ 96.622782] RAX: 00000000000001f4 RBX: ffff8eb9924fde00 RCX: 0000000000000011
[ 96.623108] RDX: 00000000000013ec RSI: ffff8eb985c3a700 RDI: 00000000000021a8
[ 96.623429] RBP: ffffa9291e433700 R08: 0000000000002176 R09: 0000000000000000
[ 96.623753] R10: 00000000000001c2 R11: 000000000000236a R12: 00000000ffffdecc
[ 96.624077] R13: 00000000000001c2 R14: ffff8eb9924fde00 R15: ffff8eb9924fcf00
[ 96.624404] FS: 0000000000000000(0000) GS:ffff8f35ff600000(0000) knlGS:0000000000000000
[ 96.624768] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 96.625024] CR2: 00007f2fd2b5fbb0 CR3: 000000be9de10006 CR4: 00000000007706f0
[ 96.625346] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 96.625671] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 96.625993] PKRU: 55555554
[ 96.626110] Call Trace:
[ 96.626220] <TASK>
[ 96.626312] ? show_regs+0x72/0x90
[ 96.626461] ? die+0x38/0xb0
[ 96.626592] ? do_trap+0xe3/0x100
[ 96.631513] ------------[ cut here ]------------
[ 96.634794] ? do_error_trap+0x75/0xb0
[ 96.643171] kernel BUG at net/core/skbuff.c:4306!
[ 96.651306] ? skb_segment+0xc80/0xeb0
[ 96.651309] ? exc_invalid_op+0x53/0x80
[ 96.651315] ? skb_segment+0xc80/0xeb0
[ 96.651317] ? asm_exc_invalid_op+0x1b/0x20
[ 96.691600] ? skb_segment+0xc80/0xeb0
[ 96.699308] ? skb_segment+0x7f7/0xeb0
[ 96.706914] tcp_gso_segment+0x104/0x540
[ 96.714264] ? mlx5e_txwqe_complete+0x9f/0x280 [mlx5_core]
[ 96.722277] tcp4_gso_segment+0x5f/0xf0
[ 96.729982] inet_gso_segment+0x168/0x3e0
[ 96.738006] skb_mac_gso_segment+0xa1/0x120
[ 96.746586] __skb_udp_tunnel_segment+0x1ef/0x530
[ 96.755290] skb_udp_tunnel_segment+0x74/0xc0
[ 96.763088] udp4_ufo_fragment+0x17f/0x1e0
[ 96.771060] inet_gso_segment+0x168/0x3e0
[ 96.778805] skb_mac_gso_segment+0xa1/0x120
[ 96.787034] __skb_gso_segment+0xc5/0x190
[ 96.794863] ? netif_skb_features+0x9c/0x2d0
[ 96.802980] validate_xmit_skb+0x177/0x2d0
[ 96.811174] __dev_queue_xmit+0x14e/0x6b0
[ 96.819502] __bpf_redirect+0x10b/0x1c0
[ 96.827631] skb_do_redirect+0x117/0x130
[ 96.835673] sch_handle_ingress.constprop.0+0x225/0x2b0
[ 96.843726] ? packet_rcv+0x54/0x4f0
[ 96.851210] __netif_receive_skb_core.constprop.0+0x60d/0xe20
[ 96.859001] ? __netif_receive_skb_core.constprop.0+0x60d/0xe20
[ 96.866392] __netif_receive_skb_list_core+0xfa/0x250
[ 96.873979] netif_receive_skb_list_internal+0x197/0x2c0
[ 96.881148] napi_gro_complete.constprop.0+0x130/0x180
[ 96.888530] dev_gro_receive+0x1eb/0x390
[ 96.896195] napi_gro_receive+0x70/0x210
[ 96.903456] mlx5e_handle_rx_cqe+0xd1/0x1d0 [mlx5_core]
[ 96.911347] mlx5e_rx_cq_process_basic_cqe_comp+0x27a/0x310 [mlx5_core]
[ 96.919211] mlx5e_poll_rx_cq+0x52/0xd0 [mlx5_core]
[ 96.927025] mlx5e_napi_poll+0xff/0x790 [mlx5_core]
[ 96.935769] ? __pfx_napi_threaded_poll+0x10/0x10
[ 96.943908] __napi_poll+0x30/0x1f0
[ 96.951134] ? __pfx_napi_threaded_poll+0x10/0x10
[ 96.958772] napi_threaded_poll+0x167/0x180
[ 96.965854] kthread+0xeb/0x120
[ 96.973070] ? __pfx_kthread+0x10/0x10
[ 96.980265] ret_from_fork+0x29/0x50
[ 96.987675] </TASK>
[ 96.994743] Modules linked in: ipt_REJECT nf_reject_ipv4 xt_CT ip_set cls_bpf sch_ingress xt_nat xt_tcpudp xt_addrtype nft_chain_nat xt_MASQUERADE nf_nat xt_mark veth wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel xt_conntrack xt_comment tcp_diag inet_diag 8021q garp mrp overlay xt_DSCP xt_multiport nft_compat nf_tables nfnetlink binfmt_misc intel_rapl_msr ipmi_ssif intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd cryptd rapl intel_cstate nls_iso8859_1 drm_shmem_helper drm_kms_helper i2c_algo_bit syscopyarea mei_me sysfillrect sysimgblt hpilo ioatdma mei acpi_ipmi intel_pch_thermal dca ipmi_si ipmi_devintf ipmi_msghandler acpi_tad mac_hid acpi_power_meter sch_fq_codel ip_vs_sh
[ 96.994797] ip_vs_wrr ip_vs_rr ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc drm efi_pstore ip_tables x_tables autofs4 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear mlx5_ib ib_uverbs ib_core raid10 ses enclosure uas usb_storage raid1 mlx5_core crc32_pclmul mlxfw psample smartpqi tls scsi_transport_sas pci_hyperv_intf tg3 ahci xhci_pci lpc_ich libahci xhci_pci_renesas wmi
Possible Solution
Unknown.
Steps to Reproduce (for bugs)
Unknown.
Context
Our default FelixConfiguration (typical for our clusters) is as follows:
bpfDataIfacePattern: ^((en|wl|ww|sl|ib)[Popsx].*|(wlan|wwan).*|tunl0$|vxlan.calico$|vxlan-v6.calico$|wireguard.cali$|wg-v6.cali$|egress.calico$|(eth|bond)[0-9]+.[0-9]+$)
bpfEnabled: true
bpfKubeProxyEndpointSlicesEnabled: true
bpfKubeProxyIptablesCleanupEnabled: false
bpfLogLevel: ""
bpfMapSizeConntrack: 6144000
bpfMapSizeNATBackend: 1048576
bpfRedirectToPeer: Disabled
floatingIPs: Disabled
logSeverityScreen: Warning
prometheusGoMetricsEnabled: false
prometheusProcessMetricsEnabled: false
reportingInterval: 0s
usageReportingEnabled: false
vxlanPort: 4790
xdpEnabled: false
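To make it easier to reason about which host interfaces the `bpfDataIfacePattern` above selects, here is a minimal Python sketch that tests a few interface names against it. The pattern is copied verbatim from the configuration as pasted here (page rendering may have altered it). The interface names other than `eth0`, which appears in the trace as `napi/eth0-8276`, are hypothetical examples. The helper `split_by_pattern` is an illustrative name, not part of Calico. Felix evaluates the pattern with Go's regexp engine, so this only approximates its behaviour, though this particular pattern uses no syntax that differs between the two.

```python
import re

# Copied verbatim from the bpfDataIfacePattern line in the configuration above.
BPF_DATA_IFACE_PATTERN = (
    r"^((en|wl|ww|sl|ib)[Popsx].*|(wlan|wwan).*|tunl0$|vxlan.calico$"
    r"|vxlan-v6.calico$|wireguard.cali$|wg-v6.cali$|egress.calico$"
    r"|(eth|bond)[0-9]+.[0-9]+$)"
)


def split_by_pattern(pattern, names):
    """Return (matched, unmatched) interface names for the given regex.

    Hypothetical helper for illustration only; it is not a Calico API.
    """
    rx = re.compile(pattern)
    matched = [name for name in names if rx.search(name)]
    unmatched = [name for name in names if not rx.search(name)]
    return matched, unmatched


if __name__ == "__main__":
    # Only eth0 is taken from the report; the other names are assumptions.
    sample = ["eth0", "eth0.100", "bond0.200", "ens1f0", "vxlan.calico"]
    matched, unmatched = split_by_pattern(BPF_DATA_IFACE_PATTERN, sample)
    print("matched:  ", matched)
    print("unmatched:", unmatched)
```

On a real node, the list of names could be taken from `ip -o link` instead of the hard-coded sample.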
Your Environment
- Calico version: 3.29.4
- Calico dataplane: ebpf
- Orchestrator version: kubernetes-1.31.10
- Operating System and version: Ubuntu 22.04 LTS
- Kernel version: 6.2.0-34-generic