
Commit 3a8a670

Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking changes from Jakub Kicinski:
 "WiFi 7 and sendpage changes are the biggest pieces of work for this
  release. The latter will definitely require fixes but I think that we
  got it to a reasonable point.

  Core:

   - Rework the sendpage & splice implementations

     Instead of feeding data into sockets page by page extend sendmsg
     handlers to support taking a reference on the data, controlled by a
     new flag called MSG_SPLICE_PAGES

     Rework the handling of unexpected-end-of-file to invoke an
     additional callback instead of trying to predict what the right
     combination of MORE/NOTLAST flags is

     Remove the MSG_SENDPAGE_NOTLAST flag completely

   - Implement SCM_PIDFD, a new type of CMSG type analogous to
     SCM_CREDENTIALS, but it contains pidfd instead of plain pid

   - Enable socket busy polling with CONFIG_RT

   - Improve reliability and efficiency of reporting for ref_tracker

   - Auto-generate a user space C library for various Netlink families

  Protocols:

   - Allow TCP to shrink the advertised window when necessary, prevent
     sk_rcvbuf auto-tuning from growing the window all the way up to
     tcp_rmem[2]

   - Use per-VMA locking for "page-flipping" TCP receive zerocopy

   - Prepare TCP for device-to-device data transfers, by making sure
     that payloads are always attached to skbs as page frags

   - Make the backoff time for the first N TCP SYN retransmissions
     linear. Exponential backoff is unnecessarily conservative

   - Create a new MPTCP getsockopt to retrieve all info
     (MPTCP_FULL_INFO)

   - Avoid waking up applications using TLS sockets until we have a
     full record

   - Allow using kernel memory for protocol ioctl callbacks, paving the
     way to issuing ioctls over io_uring

   - Add nolocalbypass option to VxLAN, forcing packets to be fully
     encapsulated even if they are destined for a local IP address

   - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
     in-kernel ECMP implementation (e.g. Open vSwitch) select the same
     link for all packets. Support L4 symmetric hashing in Open vSwitch

   - PPPoE: make number of hash bits configurable

   - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
     (ipconfig)

   - Add layer 2 miss indication and filtering, allowing higher layers
     (e.g. ACL filters) to make forwarding decisions based on whether
     packet matched forwarding state in lower devices (bridge)

   - Support matching on Connectivity Fault Management (CFM) packets

   - Hide the "link becomes ready" IPv6 messages by demoting their
     printk level to debug

   - HSR: don't enable promiscuous mode if device offloads the proto

   - Support active scanning in IEEE 802.15.4

   - Continue work on Multi-Link Operation for WiFi 7

  BPF:

   - Add precision propagation for subprogs and callbacks. This allows
     maintaining verification efficiency when subprograms are used, or
     in fact passing the verifier at all for complex programs,
     especially those using open-coded iterators

   - Improve BPF's {g,s}setsockopt() length handling. Previously BPF
     assumed the length is always equal to the amount of written data.
     But some protos allow passing a NULL buffer to discover what the
     output buffer *should* be, without writing anything

   - Accept dynptr memory as memory arguments passed to helpers

   - Add routing table ID to bpf_fib_lookup BPF helper

   - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands

   - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
     maps as read-only)

   - Show target_{obj,btf}_id in tracing link fdinfo

   - Addition of several new kfuncs (most of the names are
     self-explanatory):
      - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
        bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
        and bpf_dynptr_clone().
      - bpf_task_under_cgroup()
      - bpf_sock_destroy() - force closing sockets
      - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs

  Netfilter:

   - Relax set/map validation checks in nf_tables. Allow checking
     presence of an entry in a map without using the value

   - Increase ip_vs_conn_tab_bits range for 64BIT builds

   - Allow updating size of a set

   - Improve NAT tuple selection when connection is closing

  Driver API:

   - Integrate netdev with LED subsystem, to allow configuring HW
     "offloaded" blinking of LEDs based on link state and activity
     (i.e. packets coming in and out)

   - Support configuring rate selection pins of SFP modules

   - Factor Clause 73 auto-negotiation code out of the drivers, provide
     common helper routines

   - Add more fool-proof helpers for managing lifetime of MDIO devices
     associated with the PCS layer

   - Allow drivers to report advanced statistics related to Time Aware
     scheduler offload (taprio)

   - Allow opting out of VF statistics in link dump, to allow more VFs
     to fit into the message

   - Split devlink instance and devlink port operations

  New hardware / drivers:

   - Ethernet:
      - Synopsys EMAC4 IP support (stmmac)
      - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
      - Marvell 88E6250 7 port switches
      - Microchip LAN8650/1 Rev.B0 PHYs
      - MediaTek MT7981/MT7988 built-in 1GE PHY driver

   - WiFi:
      - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
      - Realtek RTL8723DS (SDIO variant)
      - Realtek RTL8851BE

   - CAN:
      - Fintek F81604

  Drivers:

   - Ethernet NICs:
      - Intel (100G, ice):
         - support dynamic interrupt allocation
         - use meta data match instead of VF MAC addr on slow-path
      - nVidia/Mellanox:
         - extend link aggregation to handle 4, rather than just 2 ports
         - spawn sub-functions without any features by default
      - OcteonTX2:
         - support HTB (Tx scheduling/QoS) offload
         - make RSS hash generation configurable
         - support selecting Rx queue using TC filters
      - Wangxun (ngbe/txgbe):
         - add basic Tx/Rx packet offloads
         - add phylink support (SFP/PCS control)
      - Freescale/NXP (enetc):
         - report TAPRIO packet statistics
      - Solarflare/AMD:
         - support matching on IP ToS and UDP source port of outer
           header
         - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
         - add devlink dev info support for EF10

   - Virtual NICs:
      - Microsoft vNIC:
         - size the Rx indirection table based on requested
           configuration
         - support VLAN tagging
      - Amazon vNIC:
         - try to reuse Rx buffers if not fully consumed, useful for ARM
           servers running with 16kB pages
      - Google vNIC:
         - support TCP segmentation of >64kB frames

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - enable USXGMII (88E6191X)
      - Microchip:
         - lan966x: add support for Egress Stage 0 ACL engine
         - lan966x: support mapping packet priority to internal switch
           priority (based on PCP or DSCP)

   - Ethernet PHYs:
      - Broadcom PHYs:
         - support for Wake-on-LAN for BCM54210E/B50212E
         - report LPI counter
      - Microsemi PHYs: support RGMII delay configuration (VSC85xx)
      - Micrel PHYs: receive timestamp in the frame (LAN8841)
      - Realtek PHYs: support optional external PHY clock
      - Altera TSE PCS: merge the driver into Lynx PCS which it is a
        variant of

   - CAN: Kvaser PCIEcan:
      - support packet timestamping

   - WiFi:
      - Intel (iwlwifi):
         - major update for new firmware and Multi-Link Operation (MLO)
         - configuration rework to drop test devices and split the
           different families
         - support for segmented PNVM images and power tables
         - new vendor entries for PPAG (platform antenna gain) feature
      - Qualcomm 802.11ax (ath11k):
         - Multiple Basic Service Set Identifier (MBSSID) and Enhanced
           MBSSID Advertisement (EMA) support in AP mode
         - support factory test mode
      - RealTek (rtw89):
         - add RSSI based antenna diversity
         - support U-NII-4 channels on 5 GHz band
      - RealTek (rtl8xxxu):
         - AP mode support for 8188f
         - support USB RX aggregation for the newer chips"

* tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits)
  net: scm: introduce and use scm_recv_unix helper
  af_unix: Skip SCM_PIDFD if scm->pid is NULL.
  net: lan743x: Simplify comparison
  netlink: Add __sock_i_ino() for __netlink_diag_dump().
  net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses
  Revert "af_unix: Call scm_recv() only after scm_set_cred()."
  phylink: ReST-ify the phylink_pcs_neg_mode() kdoc
  libceph: Partially revert changes to support MSG_SPLICE_PAGES
  net: phy: mscc: fix packet loss due to RGMII delays
  net: mana: use vmalloc_array and vcalloc
  net: enetc: use vmalloc_array and vcalloc
  ionic: use vmalloc_array and vcalloc
  pds_core: use vmalloc_array and vcalloc
  gve: use vmalloc_array and vcalloc
  octeon_ep: use vmalloc_array and vcalloc
  net: usb: qmi_wwan: add u-blox 0x1312 composition
  perf trace: fix MSG_SPLICE_PAGES build error
  ipvlan: Fix return value of ipvlan_queue_xmit()
  netfilter: nf_tables: fix underflow in chain reference counter
  netfilter: nf_tables: unbind non-anonymous set if rule construction fails
  ...
2 parents 6a8cbd9 + ae23064 commit 3a8a670


1,491 files changed: +98688 -25412 lines changed


Documentation/ABI/testing/sysfs-class-led-trigger-netdev

Lines changed: 89 additions & 0 deletions
@@ -13,6 +13,11 @@ Description:
 Specifies the duration of the LED blink in milliseconds.
 Defaults to 50 ms.

+With hw_control ON, the interval value MUST be set to the
+default value and cannot be changed.
+Trying to set any value in this specific mode will return
+an EINVAL error.
+
 What: /sys/class/leds/<led>/link
 Date: Dec 2017
 KernelVersion: 4.16
@@ -39,6 +44,9 @@ Description:
 If set to 1, the LED will blink for the milliseconds specified
 in interval to signal transmission.

+With hw_control ON, the blink interval is controlled by hardware
+and won't reflect the value set in interval.
+
 What: /sys/class/leds/<led>/rx
 Date: Dec 2017
 KernelVersion: 4.16
@@ -50,3 +58,84 @@ Description:

 If set to 1, the LED will blink for the milliseconds specified
 in interval to signal reception.
+
+With hw_control ON, the blink interval is controlled by hardware
+and won't reflect the value set in interval.
+
+What: /sys/class/leds/<led>/hw_control
+Date: Jun 2023
+KernelVersion: 6.5
+
+Description:
+Communicate whether the LED trigger modes are driven by hardware
+or software fallback is used.
+
+If 0, the LED is using software fallback to blink.
+
+If 1, the LED is using hardware control to blink and signal the
+requested modes.
+
+What: /sys/class/leds/<led>/link_10
+Date: Jun 2023
+KernelVersion: 6.5
+
+Description:
+Signal the link speed state of 10Mbps of the named network device.
+
+If set to 0 (default), the LED's normal state is off.
+
+If set to 1, the LED's normal state reflects the link state
+speed of 10MBps of the named network device.
+Setting this value also immediately changes the LED state.
+
+What: /sys/class/leds/<led>/link_100
+Date: Jun 2023
+KernelVersion: 6.5
+
+Description:
+Signal the link speed state of 100Mbps of the named network device.
+
+If set to 0 (default), the LED's normal state is off.
+
+If set to 1, the LED's normal state reflects the link state
+speed of 100Mbps of the named network device.
+Setting this value also immediately changes the LED state.
+
+What: /sys/class/leds/<led>/link_1000
+Date: Jun 2023
+KernelVersion: 6.5
+
+Description:
+Signal the link speed state of 1000Mbps of the named network device.
+
+If set to 0 (default), the LED's normal state is off.
+
+If set to 1, the LED's normal state reflects the link state
+speed of 1000Mbps of the named network device.
+Setting this value also immediately changes the LED state.
+
+What: /sys/class/leds/<led>/half_duplex
+Date: Jun 2023
+KernelVersion: 6.5
+
+Description:
+Signal the link half duplex state of the named network device.
+
+If set to 0 (default), the LED's normal state is off.
+
+If set to 1, the LED's normal state reflects the link half
+duplex state of the named network device.
+Setting this value also immediately changes the LED state.
+
+What: /sys/class/leds/<led>/full_duplex
+Date: Jun 2023
+KernelVersion: 6.5
+
+Description:
+Signal the link full duplex state of the named network device.
+
+If set to 0 (default), the LED's normal state is off.
+
+If set to 1, the LED's normal state reflects the link full
+duplex state of the named network device.
+Setting this value also immediately changes the LED state.
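
The attributes documented in this file are ordinary sysfs files, so userspace can drive them with plain file writes. The following is a minimal sketch in C, not code from the commit: the helper names `led_attr_path()` and `led_attr_write()` are hypothetical, and the error reporting assumes that a rejected sysfs write surfaces as an errno when the stream is flushed.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: build "/sys/class/leds/<led>/<attr>" into buf.
 * Returns 0 on success, -ENAMETOOLONG if buf is too small. */
static int led_attr_path(char *buf, size_t size,
                         const char *led, const char *attr)
{
    int n = snprintf(buf, size, "/sys/class/leds/%s/%s", led, attr);

    return (n < 0 || (size_t)n >= size) ? -ENAMETOOLONG : 0;
}

/* Hypothetical helper: write a decimal value to an attribute file.
 * Returns 0 on success or a negative errno value. Per the ABI text,
 * writing "interval" while hw_control reads 1 should give -EINVAL. */
static int led_attr_write(const char *path, unsigned int value)
{
    FILE *f = fopen(path, "w");
    int err = 0;

    if (!f)
        return -errno;
    if (fprintf(f, "%u\n", value) < 0)
        err = -EIO;
    /* Buffered data reaches the kernel on close, so check it too. */
    if (fclose(f) != 0 && !err)
        err = -errno;
    return err;
}
```

For a LED named, say, `lan0` (an assumed name), enabling the 1000Mbps mode would be `led_attr_path(path, sizeof(path), "lan0", "link_1000")` followed by `led_attr_write(path, 1)`.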

Documentation/admin-guide/sysctl/net.rst

Lines changed: 2 additions & 2 deletions
@@ -386,8 +386,8 @@ Default : 0 (for compatibility reasons)
 txrehash
 --------

-Controls default hash rethink behaviour on listening socket when SO_TXREHASH
-option is set to SOCK_TXREHASH_DEFAULT (i. e. not overridden by setsockopt).
+Controls default hash rethink behaviour on socket when SO_TXREHASH option is set
+to SOCK_TXREHASH_DEFAULT (i. e. not overridden by setsockopt).

 If set to 1 (default), hash rethink is performed on listening socket.
 If set to 0, hash rethink is not performed.

Documentation/bpf/bpf_iterators.rst

Lines changed: 2 additions & 5 deletions
@@ -238,11 +238,8 @@ The following is the breakdown for each field in struct ``bpf_iter_reg``.
 that the kernel function cond_resched() is called to avoid other kernel
 subsystem (e.g., rcu) misbehaving.
 * - seq_info
-- Specifies certain action requests in the kernel BPF iterator
-infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means
-that the kernel function cond_resched() is called to avoid other kernel
-subsystem (e.g., rcu) misbehaving.
-
+- Specifies the set of seq operations for the BPF iterator and helpers to
+initialize/free the private data for the corresponding ``seq_file``.

 `Click here
 <https://lore.kernel.org/bpf/[email protected]/>`_

Documentation/bpf/cpumasks.rst

Lines changed: 3 additions & 2 deletions
@@ -351,14 +351,15 @@ In addition to the above kfuncs, there is also a set of read-only kfuncs that
 can be used to query the contents of cpumasks.

 .. kernel-doc:: kernel/bpf/cpumask.c
-   :identifiers: bpf_cpumask_first bpf_cpumask_first_zero bpf_cpumask_test_cpu
+   :identifiers: bpf_cpumask_first bpf_cpumask_first_zero bpf_cpumask_first_and
+                 bpf_cpumask_test_cpu

 .. kernel-doc:: kernel/bpf/cpumask.c
    :identifiers: bpf_cpumask_equal bpf_cpumask_intersects bpf_cpumask_subset
                  bpf_cpumask_empty bpf_cpumask_full

 .. kernel-doc:: kernel/bpf/cpumask.c
-   :identifiers: bpf_cpumask_any bpf_cpumask_any_and
+   :identifiers: bpf_cpumask_any_distribute bpf_cpumask_any_and_distribute

 ----
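
The hunk above adds `bpf_cpumask_first_and` to the documented read-only kfuncs. The kfunc itself only runs inside a BPF program, so the following is a plain userspace model of the idea rather than its implementation: the function name `mask_first_and()` and the one-bit-per-CPU `uint64_t` representation are assumptions made for illustration, and the "return the CPU count when nothing matches" convention is assumed to mirror the kernel's `cpumask_first_and()`.

```c
#include <stdint.h>

/* Userspace model (hypothetical): bit N of each word stands for CPU N.
 * Returns the lowest CPU set in both masks, or nr_cpus if there is none. */
static unsigned int mask_first_and(uint64_t a, uint64_t b,
                                   unsigned int nr_cpus)
{
    uint64_t both = a & b;

    for (unsigned int cpu = 0; cpu < nr_cpus && cpu < 64; cpu++)
        if (both & (UINT64_C(1) << cpu))
            return cpu;
    return nr_cpus; /* no CPU is present in both masks */
}
```

Callers therefore compare the result against the CPU count before using it as an index.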

Documentation/bpf/instruction-set.rst

Lines changed: 6 additions & 3 deletions
@@ -163,13 +163,13 @@ BPF_MUL 0x20 dst \*= src
 BPF_DIV   0x30   dst = (src != 0) ? (dst / src) : 0
 BPF_OR    0x40   dst \|= src
 BPF_AND   0x50   dst &= src
-BPF_LSH   0x60   dst <<= src
-BPF_RSH   0x70   dst >>= src
+BPF_LSH   0x60   dst <<= (src & mask)
+BPF_RSH   0x70   dst >>= (src & mask)
 BPF_NEG   0x80   dst = ~src
 BPF_MOD   0x90   dst = (src != 0) ? (dst % src) : dst
 BPF_XOR   0xa0   dst ^= src
 BPF_MOV   0xb0   dst = src
-BPF_ARSH  0xc0   sign extending shift right
+BPF_ARSH  0xc0   sign extending dst >>= (src & mask)
 BPF_END   0xd0   byte swap operations (see `Byte swap instructions`_ below)
 ========  =====  ==========================================================

@@ -204,6 +204,9 @@ for ``BPF_ALU64``, 'imm' is first sign extended to 64 bits and the result
 interpreted as an unsigned 64-bit value. There are no instructions for
 signed division or modulo.

+Shift operations use a mask of 0x3F (63) for 64-bit operations and 0x1F (31)
+for 32-bit operations.
+
 Byte swap instructions
 ~~~~~~~~~~~~~~~~~~~~~~
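
The shift masking and the divide-by-zero rules in the table above can be modelled in ordinary C. This is an illustrative sketch only: the function names are made up for this example and are not part of any BPF header, and the code emulates the documented arithmetic rather than any particular interpreter or JIT.

```c
#include <stdint.h>

/* 64-bit (BPF_ALU64) shifts: the shift amount is masked with 0x3F. */
static uint64_t bpf_lsh64(uint64_t dst, uint64_t src)
{
    return dst << (src & 0x3F);
}

static uint64_t bpf_rsh64(uint64_t dst, uint64_t src)
{
    return dst >> (src & 0x3F);
}

/* 32-bit (BPF_ALU) shifts: the shift amount is masked with 0x1F. */
static uint32_t bpf_lsh32(uint32_t dst, uint32_t src)
{
    return dst << (src & 0x1F);
}

static uint32_t bpf_rsh32(uint32_t dst, uint32_t src)
{
    return dst >> (src & 0x1F);
}

/* BPF_ARSH: sign-extending shift right, built from unsigned operations
 * so that it does not rely on how the compiler shifts negative values. */
static uint64_t bpf_arsh64(uint64_t dst, uint64_t src)
{
    unsigned int n = src & 0x3F;
    uint64_t fill = (dst >> 63) ? ~(~UINT64_C(0) >> n) : 0;

    return (dst >> n) | fill;
}

/* BPF_DIV yields 0 and BPF_MOD leaves dst unchanged when src is 0. */
static uint64_t bpf_div64(uint64_t dst, uint64_t src)
{
    return src ? dst / src : 0;
}

static uint64_t bpf_mod64(uint64_t dst, uint64_t src)
{
    return src ? dst % src : dst;
}
```

One consequence of the masking is that a 64-bit shift by 64 behaves as a shift by 0, which plain C leaves undefined.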

Documentation/bpf/kfuncs.rst

Lines changed: 54 additions & 7 deletions
@@ -100,7 +100,7 @@ Hence, whenever a constant scalar argument is accepted by a kfunc which is not a
 size parameter, and the value of the constant matters for program safety, __k
 suffix should be used.

-2.2.2 __uninit Annotation
+2.2.3 __uninit Annotation
 -------------------------

 This annotation is used to indicate that the argument will be treated as
@@ -117,6 +117,27 @@ Here, the dynptr will be treated as an uninitialized dynptr. Without this
 annotation, the verifier will reject the program if the dynptr passed in is
 not initialized.

+2.2.4 __opt Annotation
+-------------------------
+
+This annotation is used to indicate that the buffer associated with an __sz or __szk
+argument may be null. If the function is passed a nullptr in place of the buffer,
+the verifier will not check that length is appropriate for the buffer. The kfunc is
+responsible for checking if this buffer is null before using it.
+
+An example is given below::
+
+        __bpf_kfunc void *bpf_dynptr_slice(..., void *buffer__opt, u32 buffer__szk)
+        {
+        ...
+        }
+
+Here, the buffer may be null. If buffer is not null, it at least of size buffer_szk.
+Either way, the returned buffer is either NULL, or of size buffer_szk. Without this
+annotation, the verifier will reject the program if a null pointer is passed in with
+a nonzero size.
+
+
 .. _BPF_kfunc_nodef:

 2.3 Using an existing kernel function
@@ -206,23 +227,49 @@ absolutely no ABI stability guarantees.

 As mentioned above, a nested pointer obtained from walking a trusted pointer is
 no longer trusted, with one exception. If a struct type has a field that is
-guaranteed to be valid as long as its parent pointer is trusted, the
-``BTF_TYPE_SAFE_NESTED`` macro can be used to express that to the verifier as
-follows:
+guaranteed to be valid (trusted or rcu, as in KF_RCU description below) as long
+as its parent pointer is valid, the following macros can be used to express
+that to the verifier:
+
+* ``BTF_TYPE_SAFE_TRUSTED``
+* ``BTF_TYPE_SAFE_RCU``
+* ``BTF_TYPE_SAFE_RCU_OR_NULL``
+
+For example,
+
+.. code-block:: c
+
+        BTF_TYPE_SAFE_TRUSTED(struct socket) {
+                struct sock *sk;
+        };
+
+or

 .. code-block:: c

-        BTF_TYPE_SAFE_NESTED(struct task_struct) {
+        BTF_TYPE_SAFE_RCU(struct task_struct) {
                 const cpumask_t *cpus_ptr;
+                struct css_set __rcu *cgroups;
+                struct task_struct __rcu *real_parent;
+                struct task_struct *group_leader;
         };

 In other words, you must:

-1. Wrap the trusted pointer type in the ``BTF_TYPE_SAFE_NESTED`` macro.
+1. Wrap the valid pointer type in a ``BTF_TYPE_SAFE_*`` macro.

-2. Specify the type and name of the trusted nested field. This field must match
+2. Specify the type and name of the valid nested field. This field must match
    the field in the original type definition exactly.

+A new type declared by a ``BTF_TYPE_SAFE_*`` macro also needs to be emitted so
+that it appears in BTF. For example, ``BTF_TYPE_SAFE_TRUSTED(struct socket)``
+is emitted in the ``type_is_trusted()`` function as follows:
+
+.. code-block:: c
+
+        BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED(struct socket));
+
+
 2.4.5 KF_SLEEPABLE flag
 -----------------------
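
The `__opt` hunk above moves the NULL check from the verifier into the kfunc body. The contract can be illustrated with a small userspace model; this is a sketch under stated assumptions, not kernel code: `struct two_frags` and `frag_slice()` are hypothetical names, and the two-fragment source merely stands in for data that may not be contiguous.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical source whose bytes are split across two fragments. */
struct two_frags {
    const unsigned char *a;
    size_t a_len;
    const unsigned char *b;
    size_t b_len;
};

/*
 * Return a pointer to buffer__szk bytes starting at offset.
 * The scratch buffer is only needed when the range straddles both
 * fragments, so buffer__opt may be NULL; the callee checks it.
 */
static const void *frag_slice(const struct two_frags *src, size_t offset,
                              void *buffer__opt, size_t buffer__szk)
{
    size_t len = buffer__szk;
    size_t total = src->a_len + src->b_len;

    if (offset > total || len > total - offset)
        return NULL;                           /* out of range */
    if (offset + len <= src->a_len)
        return src->a + offset;                /* contiguous in a */
    if (offset >= src->a_len)
        return src->b + (offset - src->a_len); /* contiguous in b */
    if (!buffer__opt)
        return NULL;                           /* copy needed, no buffer */

    size_t from_a = src->a_len - offset;

    memcpy(buffer__opt, src->a + offset, from_a);
    memcpy((unsigned char *)buffer__opt + from_a, src->b, len - from_a);
    return buffer__opt;
}
```

As in the documented contract, the result is either NULL or a pointer to exactly `buffer__szk` readable bytes, whether or not the optional buffer was supplied.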

Documentation/bpf/llvm_reloc.rst

Lines changed: 10 additions & 8 deletions
@@ -48,7 +48,7 @@ the code with ``llvm-objdump -dr test.o``::
 14: 0f 10 00 00 00 00 00 00 r0 += r1
 15: 95 00 00 00 00 00 00 00 exit

-There are four relations in the above for four ``LD_imm64`` instructions.
+There are four relocations in the above for four ``LD_imm64`` instructions.
 The following ``llvm-readelf -r test.o`` shows the binary values of the four
 relocations::

@@ -79,14 +79,16 @@ The following is the symbol table with ``llvm-readelf -s test.o``::
 The 6th entry is global variable ``g1`` with value 0.

 Similarly, the second relocation is at ``.text`` offset ``0x18``, instruction 3,
-for global variable ``g2`` which has a symbol value 4, the offset
-from the start of ``.data`` section.
-
-The third and fourth relocations refers to static variables ``l1``
-and ``l2``. From ``.rel.text`` section above, it is not clear
-which symbols they really refers to as they both refers to
+has a type of ``R_BPF_64_64`` and refers to entry 7 in the symbol table.
+The second relocation resolves to global variable ``g2`` which has a symbol
+value 4. The symbol value represents the offset from the start of ``.data``
+section where the initial value of the global variable ``g2`` is stored.
+
+The third and fourth relocations refer to static variables ``l1``
+and ``l2``. From the ``.rel.text`` section above, it is not clear
+to which symbols they really refer as they both refer to
 symbol table entry 4, symbol ``sec``, which has ``STT_SECTION`` type
-and represents a section. So for static variable or function,
+and represents a section. So for a static variable or function,
 the section offset is written to the original insn
 buffer, which is called ``A`` (addend). Looking at
 above insn ``7`` and ``11``, they have section offset ``8`` and ``12``.

Documentation/bpf/map_hash.rst

Lines changed: 52 additions & 1 deletion
@@ -1,5 +1,6 @@
 .. SPDX-License-Identifier: GPL-2.0-only
 .. Copyright (C) 2022 Red Hat, Inc.
+.. Copyright (C) 2022-2023 Isovalent, Inc.

 ===============================================
 BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants
@@ -29,7 +30,16 @@ will automatically evict the least recently used entries when the hash
 table reaches capacity. An LRU hash maintains an internal LRU list that
 is used to select elements for eviction. This internal LRU list is
 shared across CPUs but it is possible to request a per CPU LRU list with
-the ``BPF_F_NO_COMMON_LRU`` flag when calling ``bpf_map_create``.
+the ``BPF_F_NO_COMMON_LRU`` flag when calling ``bpf_map_create``. The
+following table outlines the properties of LRU maps depending on the a
+map type and the flags used to create the map.
+
+========================  =========================  ================================
+Flag                      ``BPF_MAP_TYPE_LRU_HASH``  ``BPF_MAP_TYPE_LRU_PERCPU_HASH``
+========================  =========================  ================================
+**BPF_F_NO_COMMON_LRU**   Per-CPU LRU, global map    Per-CPU LRU, per-cpu map
+**!BPF_F_NO_COMMON_LRU**  Global LRU, global map     Global LRU, per-cpu map
+========================  =========================  ================================

 Usage
 =====
@@ -206,3 +216,44 @@ Userspace walking the map elements from the map declared above:
 cur_key = &next_key;
 }
 }
+
+Internals
+=========
+
+This section of the document is targeted at Linux developers and describes
+aspects of the map implementations that are not considered stable ABI. The
+following details are subject to change in future versions of the kernel.
+
+``BPF_MAP_TYPE_LRU_HASH`` and variants
+--------------------------------------
+
+Updating elements in LRU maps may trigger eviction behaviour when the capacity
+of the map is reached. There are various steps that the update algorithm
+attempts in order to enforce the LRU property which have increasing impacts on
+other CPUs involved in the following operation attempts:
+
+- Attempt to use CPU-local state to batch operations
+- Attempt to fetch free nodes from global lists
+- Attempt to pull any node from a global list and remove it from the hashmap
+- Attempt to pull any node from any CPU's list and remove it from the hashmap
+
+This algorithm is described visually in the following diagram. See the
+description in commit 3a08c2fd7634 ("bpf: LRU List") for a full explanation of
+the corresponding operations:
+
+.. kernel-figure:: map_lru_hash_update.dot
+   :alt: Diagram outlining the LRU eviction steps taken during map update.
+
+   LRU hash eviction during map update for ``BPF_MAP_TYPE_LRU_HASH`` and
+   variants. See the dot file source for kernel function name code references.
+
+Map updates start from the oval in the top right "begin ``bpf_map_update()``"
+and progress through the graph towards the bottom where the result may be
+either a successful update or a failure with various error codes. The key in
+the top right provides indicators for which locks may be involved in specific
+operations. This is intended as a visual hint for reasoning about how map
+contention may impact update operations, though the map type and flags may
+impact the actual contention on those locks, based on the logic described in
+the table above. For instance, if the map is created with type
+``BPF_MAP_TYPE_LRU_PERCPU_HASH`` and flags ``BPF_F_NO_COMMON_LRU`` then all map
+properties would be per-cpu.
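
The flag/type table added in this file reduces to two independent yes/no properties. The sketch below encodes it in C for illustration; the enum, the `NO_COMMON_LRU` constant, and `lru_props_for()` are local stand-ins invented for this example (real programs would use `BPF_MAP_TYPE_LRU_HASH`, `BPF_MAP_TYPE_LRU_PERCPU_HASH`, and `BPF_F_NO_COMMON_LRU` from `<linux/bpf.h>`).

```c
#include <stdbool.h>

/* Local stand-ins so the example builds without kernel headers. */
enum lru_map_type { LRU_HASH, LRU_PERCPU_HASH };
#define NO_COMMON_LRU (1U << 1) /* assumed to mirror BPF_F_NO_COMMON_LRU */

struct lru_props {
    bool per_cpu_lru;    /* LRU list kept per CPU instead of globally */
    bool per_cpu_values; /* "per-cpu map": values kept per CPU */
};

/* The flag selects the LRU list scope; the map type selects value scope. */
static struct lru_props lru_props_for(enum lru_map_type type,
                                      unsigned int flags)
{
    struct lru_props p = {
        .per_cpu_lru = (flags & NO_COMMON_LRU) != 0,
        .per_cpu_values = (type == LRU_PERCPU_HASH),
    };

    return p;
}
```

With `LRU_PERCPU_HASH` and `NO_COMMON_LRU` both fields are true, matching the closing remark that all map properties are then per-cpu.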
