Impact
The OCI runtime specification has a maskedPaths feature that allows for files
or directories to be "masked" by placing a mount on top of them to conceal
their contents. This is primarily intended to protect against privileged users
in non-user-namespaced containers from being able to write to files or access directories
that would either provide sensitive information about the host to containers or
allow containers to perform destructive or other privileged operations on the
host (examples include /proc/kcore, /proc/timer_list, /proc/acpi, and
/proc/keys).
maskedPaths can be used to either mask a directory or a file -- directories
are masked using a new read-only tmpfs instance that is mounted on top of the
masked path, while files are masked by bind-mounting the container's
/dev/null on top of the masked path.
In all known versions of runc, when using the container's /dev/null to mask
files, runc would not perform sufficient verification that the source of the
bind-mount (i.e., the container's /dev/null) was actually a real /dev/null
inode. While /dev/null is usually created by runc during container creation,
it is possible for an attacker to create a /dev/null or modify the
/dev/null inode created by runc through race conditions with other containers
sharing mounts (we have also verified this attack is possible to exploit using
a standard Dockerfile with docker buildx build as that also permits
triggering parallel execution of containers with custom shared mounts
configured).
This could lead to two separate issues:
Attack 1: Arbitrary Mount Gadget (leading to Host Information Disclosure, Host Denial of Service, or Container Escape)
By replacing /dev/null with a symlink to an attacker-controlled path, an
attacker could cause runc to bind-mount an arbitrary source path to a path
inside the container. This could lead to:
- Host Denial of Service: By bind-mounting files such as /proc/sysrq-trigger,
the attacker can gain access to a read-write version of files which can be
destructive to write to (/proc/sysrq-trigger would allow an attacker to
trigger a kernel panic, shut down the machine, or cause the machine to
freeze without rebooting).
- Container Escape: By bind-mounting /proc/sys/kernel/core_pattern, the
attacker can reconfigure a coredump helper -- as kernel upcalls are not
namespaced, the configured binary (which could be a container binary or a
host binary with a malicious command-line) will run with full privileges on
the host system. Thus, the attacker can simply trigger a coredump and gain
complete root privileges over the host.
Note that while config.json allows users to bind-mount arbitrary paths (and
thus an attacker that can modify config.json arbitrarily could gain the same
access as this exploit), because maskedPaths is applied by almost all
higher-level container runtimes (and thus provides a guaranteed mount source)
this flaw effectively allows any attacker that can spawn containers (with some
degree of control over what kinds of containers are being spawned) to achieve
the above goals.
This attack was analysed as having a CVSSv4 severity of 7.3 (High) using the
vector CVSS:4.0/AV:L/AC:L/AT:P/PR:L/UI:A/VC:H/VI:H/VA:H/SC:H/SI:H/SA:H.
Attack 2: Bypassing maskedPaths
While investigating Attack 1, we discovered that our validation mechanism when
bind-mounting /dev/null for maskedPaths would ignore ENOENT errors --
meaning that if an attacker deleted /dev/null before we did the bind-mount,
we would silently skip applying maskedPaths for the container. (The original
purpose of this ENOENT-ignore behaviour was to permit configurations where
maskedPaths references non-existent files, but we did not consider that the
source path could also not exist in this kind of race-attack scenario.)
With maskedPaths rendered inoperative, an attacker would be able to access
sensitive host information from files in /proc that would usually be masked
(such as /proc/kcore). However, note that /proc/sys and
/proc/sysrq-trigger are mounted read-only rather than being masked with
files, so this attack variant will not allow the same breakout or host denial
of service attacks as in Attack 1.
This attack was analysed as having a CVSSv4 severity of 5.6 (Moderate) using
the vector CVSS:4.0/AV:L/AC:L/AT:P/PR:L/UI:A/VC:H/VI:N/VA:N/SC:H/SI:N/SA:N.
Patches
This advisory is being published as part of a set of three advisories.
The patches fixing this issue have accordingly been combined into a single
patchset. The following patches from that patchset resolve the issues in this
advisory:
- db19bbe ("internal/sys: add VerifyInode helper")
- 8476df8 ("libct: add/use isDevNull, verifyDevNull")
- 1a30a8f ("libct: maskPaths: only ignore ENOENT on mount dest")
- 5d7b242 ("libct: maskPaths: don't rely on ENOTDIR for mount")
runc 1.2.8, 1.3.3, and 1.4.0-rc.3 have been released and all contain fixes for these
issues. As per our new release model, runc 1.1.x and earlier are
no longer supported and thus have not been patched.
Mitigations
- Use containers with user namespaces (with the host root user not mapped into
the container's user namespace). This will block most of the serious
aspects of these attacks, as the procfs files used for the container
breakout use Unix DAC permissions and user namespaced users will not have
access to the relevant files.
We would also like to take this opportunity to re-iterate that we
strongly recommend all users use user namespaced containers. They have
proven to be one of the best security hardening mechanisms against container
breakouts, and the kernel applies additional restrictions to user namespaced
containers above and beyond the user remapping functionality provided. With
the advent of id-mapped mounts (Linux 5.12), there is very little reason to
not use user namespaces for most applications. Note that using user
namespaces to configure your container does not mean you have to enable
unprivileged user namespace creation inside the container -- most
container runtimes apply a seccomp-bpf profile which blocks
unshare(CLONE_NEWUSER) inside containers regardless of whether the
container itself uses user namespaces.
Rootless containers can provide even more protection if your configuration
can use them -- by having runc itself be an unprivileged process, in general
you would expect the impact scope of a runc bug to be less severe as it
would only have the privileges afforded to the host user which spawned runc.
- For non-user namespaced containers, configure all containers you spawn to
not permit processes to run with root privileges. In most cases this would
require configuring the container to use a non-root user and enabling
noNewPrivileges to disable any setuid or set-capability binaries. (Note
that this is our general recommendation for a secure container setup -- it
is very difficult, if not impossible, to run an untrusted program with root
privileges safely.) If you need to use ping in your containers, there is a
net.ipv4.ping_group_range sysctl that can be used to allow unprivileged
users to ping without requiring setuid or set-capability binaries.
- Do not run untrusted container images from unknown or unverified sources.
- Depending on the configuration of maskedPaths, an AppArmor profile (such
as the default one applied by higher level runtimes including Docker and
Podman) can block write attempts to most of /proc and /sys. This means
that even with a procfs file maliciously bind-mounted to a maskedPaths
target, all of the targets of maskedPaths in the default configuration of
runtimes such as Docker or Podman will still not permit write access to said
files. However, if a container is configured with a maskedPaths that is
not protected by AppArmor then the same attack can be carried out.
Please note that CVE-2025-52881 allows an attacker to bypass LSM labels,
and so this mitigation is of limited value when considered in combination
with CVE-2025-52881.
- Based on our analysis, SELinux policies have a limited effect when
trying to protect against this attack. The reason is that the /dev/null
bind-mount gets implicitly relabelled with context=... set to the
container's SELinux context, and thus the container process will have access
to the source of the bind-mount even if they otherwise wouldn't.
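The first two mitigations above (user namespaces without host root mapped in, or a non-root user with noNewPrivileges) can be checked against a container's config.json. The Go sketch below is one way to do that. The JSON field names follow the OCI runtime specification, while the audit function and the wording of its findings are hypothetical and written only for this example. It does not handle every configuration (for instance, a user namespace joined by path), so treat it as a starting point rather than a complete check.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ociConfig holds only the config.json fields this sketch looks at.
type ociConfig struct {
	Process struct {
		NoNewPrivileges bool `json:"noNewPrivileges"`
		User            struct {
			UID uint32 `json:"uid"`
		} `json:"user"`
	} `json:"process"`
	Linux struct {
		Namespaces []struct {
			Type string `json:"type"`
		} `json:"namespaces"`
		UIDMappings []struct {
			ContainerID uint32 `json:"containerID"`
			HostID      uint32 `json:"hostID"`
			Size        uint32 `json:"size"`
		} `json:"uidMappings"`
	} `json:"linux"`
}

// audit is a hypothetical helper that returns one finding per
// mitigation that the given config.json does not appear to apply.
func audit(raw []byte) ([]string, error) {
	var cfg ociConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, err
	}
	userNS := false
	for _, ns := range cfg.Linux.Namespaces {
		if ns.Type == "user" {
			userNS = true
		}
	}
	var findings []string
	if !userNS {
		findings = append(findings, "no user namespace configured")
		if cfg.Process.User.UID == 0 {
			findings = append(findings, "process runs as root without a user namespace")
		}
		if !cfg.Process.NoNewPrivileges {
			findings = append(findings, "noNewPrivileges is not enabled")
		}
		return findings, nil
	}
	for _, m := range cfg.Linux.UIDMappings {
		if m.HostID == 0 && m.Size > 0 {
			findings = append(findings, "host root (uid 0) is mapped into the container")
		}
	}
	return findings, nil
}

func main() {
	raw := []byte(`{"process":{"user":{"uid":0}},"linux":{"namespaces":[{"type":"pid"}]}}`)
	findings, err := audit(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, f := range findings {
		fmt.Println("finding:", f)
	}
}
```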
Other Runtimes
As this vulnerability boils down to a fairly easy-to-make logic bug, we have
provided information to other OCI (crun, youki) and non-OCI (LXC) container
runtimes about this vulnerability.
Based on discussions with other runtimes, it seems that crun and youki may have
similar security issues and will publish a co-ordinated security release along
with runc. LXC appears to also be vulnerable in some aspects, but their
security stance is (understandably) that non-user-namespaced
containers are fundamentally insecure by design.
Credits
Thanks to Lei Wang (@ssst0n3 from Huawei) for finding and reporting the
original vulnerability (Attack 1), and Li Fubang (@lifubang from acmcoder.com,
CIIC) for discovering another attack vector (Attack 2) based on @ssst0n3's
initial findings.