
Commit b5b3ba9

hikerockies authored and vijay-suman committed
mm: Allow userspace to reserve VA range for use by userspace only
Add support for ELF binaries to reserve address ranges. An address
range can be reserved at load time by adding an ELF NOTE section, or at
run time with mprotect() with the PROT_RESERVED flag. Reserved ranges
can be allocated with mmap(..., MAP_FIXED, ...) and shmat(..., SHM_REMAP)
later. Any reserved address ranges are annotated with "[rsvd]" in
/proc/<pid>/maps output. A binary can check whether the kernel supports
VA range reservation by checking the value of the auxiliary vector
entry AT_VA_RESERVATION.

VA reservation is done by adding a special NOTE section to the binary
using declarations similar to the following:

	.section .note.rsvd_range, "a", @note
	.p2align 2
	.long 1f - 0f		# name size (not including padding)
	.long 3f - 2f		# desc size (not including padding)
	.long 0x07c10001
0:	.asciz "Reserved VA"	# name
1:	.p2align 2
2:	.quad 0x7f2000000000
	.quad 0x7f2000e00000
	.quad 0x7f5000200000
	.quad 0x7f500d000000
3:	.p2align 2

Each reserved range is specified as a pair of addresses (start and
end). This note section is read by the kernel ELF loader and the
address ranges are reserved for the lifetime of the process. A maximum
of 64 such entries can be made in the NOTE section. Execution of a
binary file with more than 64 pairs of addresses in this note section
will be terminated with ENOEXEC.

NOTE: The kernel cannot guarantee that all VA ranges in the NOTE
section will be reserved. If an address range is valid but is already
in use (possibly by a shared library loaded earlier), execution of the
binary will be terminated with ENOMEM.

NOTE: This feature needs two VMA flag bits. There are no free bits
available in the lower 32 bits. As a result this feature can only be
supported on architectures that support high VMA flag bits (bits
32-63).

NOTE: Due to limitations in the implementation, when mapping a range
over one or more reserved ranges, the range must be entirely contained
within a reserved range or a contiguous set of reserved ranges. mmap()
will fail and set errno to EINVAL if the range to map is only partly
reserved.

-----------------------------
Upstream status of this patch
-----------------------------

This patch will not be submitted upstream. It solves a specific problem
for the database in a way that mostly works for the DB. It will be very
difficult to get this patch accepted upstream. There are two issues
that make it difficult: (1) these changes do not solve the problem
fully, and (2) there are other ways to solve this problem without
kernel changes.

The reason these changes do not solve the problem fully is that the
kernel does not get called to load the ELF binary until after the
loader has already loaded all the libraries. It is possible that one of
the libraries might get loaded at the address we want to reserve, and
by the time the kernel gets a chance to reserve the address range, it
is already too late.

There are three other ways to solve this problem besides modifying the
kernel:

1. Create a binary that gets started before the DB starts, reserves
   address ranges using mmap(MAP_FIXED) and then launches the DB as a
   child process with the address ranges reserved. This still leaves
   open the possibility that address ranges were already consumed by
   libraries, but I believe this is roughly the solution used on
   Solaris.

2. Use LD_PRELOAD to preload a special library which reserves address
   ranges in its init routine. When the DB starts, it can call into
   this special library and get addresses for all the address ranges
   that have been reserved. This can work conceptually, but it needs to
   be prototyped and tested to see if LD_PRELOAD libraries get loaded
   before other libraries and if address reservation using mmap in the
   special library survives.

3. A custom loader which reserves address ranges first before loading
   any other libraries. This is the only solution that can guarantee
   the DB will get the address ranges it wants to reserve.

This feature became more or less a requirement for the DB to be able to
enable ASLR, which customers were asking for. A custom loader can
provide a potential solution even with ASLR.

Orabug: 30135230

Signed-off-by: Khalid Aziz <[email protected]>
Signed-off-by: Anthony Yznaga <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: Khalid Aziz <[email protected]>
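For illustration only (not part of this patch): a minimal userspace sketch of the load-time path described above, assuming the binary was linked with the example NOTE section so that its first address pair is already reserved when it runs. The range addresses come from the note above; the program itself is hypothetical, and feature support can also be probed through AT_VA_RESERVATION (see the auxvec.h hunk below).

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/* First start/end pair from the example NOTE section above */
	unsigned long start = 0x7f2000000000UL;
	unsigned long end   = 0x7f2000e00000UL;
	void *p;

	/*
	 * A reserved range can be claimed with MAP_FIXED; until then it
	 * shows up as "[rsvd]" in /proc/<pid>/maps.
	 */
	p = mmap((void *)start, end - start, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("mapped %lu bytes at %p\n", end - start, p);
	return 0;
}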
1 parent 23406d7 commit b5b3ba9

7 files changed (+377, -2 lines)

arch/x86/Kconfig (+1)

@@ -35,6 +35,7 @@ config X86_64
 	select ARCH_HAS_ELFCORE_COMPAT
 	select ZONE_DMA32
 	select EXECMEM if DYNAMIC_FTRACE
+	select ARCH_USES_HIGH_VMA_FLAGS

 config FORCE_DYNAMIC_FTRACE
 	def_bool y

fs/binfmt_elf.c (+121)

@@ -90,6 +90,8 @@ static int elf_core_dump(struct coredump_params *cprm);
 #define ELF_MIN_ALIGN	PAGE_SIZE
 #endif

+#define MAX_FILE_NOTE_SIZE (4*1024*1024)
+
 #ifndef ELF_CORE_EFLAGS
 #define ELF_CORE_EFLAGS	0
 #endif
@@ -274,6 +276,7 @@ create_elf_tables(struct linux_binprm *bprm, const struct elfhdr *exec,
 	NEW_AUX_ENT(AT_RSEQ_FEATURE_SIZE, offsetof(struct rseq, end));
 	NEW_AUX_ENT(AT_RSEQ_ALIGN, __alignof__(struct rseq));
 #endif
+	NEW_AUX_ENT(AT_VA_RESERVATION, 1);
 #undef NEW_AUX_ENT
 	/* AT_NULL is zero; clear the rest too */
 	memset(elf_info, 0, (char *)mm->saved_auxv +
@@ -816,6 +819,106 @@ static int parse_elf_properties(struct file *f, const struct elf_phdr *phdr,
 	return ret == -ENOENT ? 0 : ret;
 }

+#define MAX_RSVD_VA_RANGES	64
+#define RSVD_VA_STRING		"Reserved VA"
+#define SZ_RSVD_VA_STRING	sizeof(RSVD_VA_STRING)
+
+static int reserve_va_range(struct elf_phdr *elf_ppnt,
+			    struct linux_binprm *bprm)
+{
+	char *note_seg = NULL;
+	struct elf_note *note;
+	loff_t pos = elf_ppnt->p_offset;
+	int retval = 0;
+	size_t note_size = elf_ppnt->p_filesz;
+
+	note_seg = kvmalloc(note_size, GFP_KERNEL);
+	if (!note_seg) {
+		retval = -ENOMEM;
+		return retval;
+	}
+
+	retval = kernel_read(bprm->file, note_seg, note_size, &pos);
+	if (retval != note_size) {
+		if (retval >= 0)
+			retval = -EIO;
+		goto out;
+	}
+
+	note = (struct elf_note *)note_seg;
+	while ((char *)note + sizeof(struct elf_note) <
+	       (char *)(note_seg + note_size)) {
+		char *name;
+		unsigned long *val;
+		unsigned long nentry, i;
+
+		if (note->n_type != 0x07c10001)
+			goto cont_loop;
+
+		/* Sanity check for malformed note entry */
+		if (note->n_namesz > SZ_RSVD_VA_STRING) {
+			retval = -ENOEXEC;
+			goto out;
+		}
+
+		name = (char *)note + sizeof(struct elf_note);
+		if (strncmp(name, RSVD_VA_STRING, SZ_RSVD_VA_STRING) == 0) {
+			nentry = note->n_descsz/sizeof(void *);
+			val = (unsigned long *)(name +
+					roundup(note->n_namesz, 4));
+			/*
+			 * Check if right number of address
+			 * entries exist in note section
+			 */
+			if (((nentry % 2) != 0) ||
+			    ((nentry / 2) > MAX_RSVD_VA_RANGES)) {
+				retval = -ENOEXEC;
+				goto out;
+			}
+			for (i = 0 ; i < nentry; i += 2) {
+				unsigned long range1, range2, size;
+				struct mm_struct *mm = current->mm;
+
+				/*
+				 * Ensure we can access two address entries
+				 * in this note segment safely
+				 */
+				if ((char *)(val + 1) >=
+				    ((char *)note_seg + note_size)) {
+					retval = -ENOEXEC;
+					goto out;
+				}
+				range1 = PAGE_ALIGN((*val++) - PAGE_SIZE + 1);
+				range2 = PAGE_ALIGN(*val++);
+				size = range2 - range1;
+
+				/* Validate the address range being reserved */
+				if ((range2 <= range1) ||
+				    (!access_ok((void *)range1, size))) {
+					retval = -ENOEXEC;
+					goto out;
+				}
+
+				mmap_write_lock(mm);
+				retval = install_rsvd_mapping(mm, range1,
+							      (range2-range1));
+				mmap_write_unlock(mm);
+				if (retval < 0)
+					goto out;
+			}
+		}
+cont_loop:
+		note = (struct elf_note *)((char *)note +
+				sizeof(struct elf_note) +
+				roundup(note->n_namesz, 4) +
+				roundup(note->n_descsz, 4));
+	}
+
+out:
+	kvfree(note_seg);
+	return retval;
+}
+
 static int load_elf_binary(struct linux_binprm *bprm)
 {
 	struct file *interpreter = NULL; /* to shut gcc up */
@@ -1022,6 +1125,24 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	start_data = 0;
 	end_data = 0;

+	/*
+	 * Read the notes segment to find notes to reserve address space
+	 */
+	elf_ppnt = elf_phdata;
+	for (i = 0; i < elf_ex->e_phnum; i++, elf_ppnt++)
+		if (elf_ppnt->p_type == PT_NOTE) {
+			/* Sanity check for bogus note segment */
+			if ((elf_ppnt->p_filesz > MAX_FILE_NOTE_SIZE) ||
+			    (elf_ppnt->p_filesz < sizeof(struct elf_note))) {
+				retval = -ENOEXEC;
+				goto out_free_ph;
+			}
+			retval = reserve_va_range(elf_ppnt, bprm);
+			if (retval < 0)
+				goto out_free_ph;
+		}
+
+
 	/* Now we do a little grungy work by mmapping the ELF image into
 	   the correct location in memory. */
 	for(i = 0, elf_ppnt = elf_phdata;
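For illustration only: the note that reserve_va_range() above walks does not have to be written in assembly. Below is a sketch of the equivalent layout emitted from C, assuming GCC/Clang section attributes and a linker that places .note.* sections into a PT_NOTE segment; the n_type value, the "Reserved VA" name and the addresses come from this commit, while the struct and variable names are made up here.

#include <stdint.h>

struct rsvd_va_note {
	uint32_t n_namesz;	/* name size, not including padding */
	uint32_t n_descsz;	/* desc size, not including padding */
	uint32_t n_type;	/* 0x07c10001 marks a reserved-VA note */
	char	 name[12];	/* "Reserved VA" plus NUL, already 4-byte aligned */
	uint64_t ranges[4];	/* start/end pairs; at most 64 pairs allowed */
};

__attribute__((used, section(".note.rsvd_range")))
static const struct rsvd_va_note rsvd_note = {
	.n_namesz = sizeof("Reserved VA"),	/* 12 */
	.n_descsz = 4 * sizeof(uint64_t),	/* 32 */
	.n_type   = 0x07c10001,
	.name     = "Reserved VA",
	.ranges   = {
		0x7f2000000000, 0x7f2000e00000,	/* first reserved range */
		0x7f5000200000, 0x7f500d000000,	/* second reserved range */
	},
};

Since the parser skips notes whose n_type does not match 0x07c10001, an entry like this can coexist with the usual GNU notes in the same PT_NOTE segment.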

include/linux/mm.h (+17)

@@ -321,12 +321,16 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_16	48	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_17	49	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
 #define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
+#define VM_HIGH_ARCH_16	BIT(VM_HIGH_ARCH_BIT_16)
+#define VM_HIGH_ARCH_17	BIT(VM_HIGH_ARCH_BIT_17)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */

 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -381,6 +385,17 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MTE_ALLOWED	VM_NONE
 #endif

+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+# define VM_RSVD_VA		VM_HIGH_ARCH_16	/* Reserved VA range */
+# define VM_RSVD_NORELINK	VM_HIGH_ARCH_17	/* VA range unmapped by
+						 * userspace but still reserved
+						 * for use by userspace only
+						 */
+#else
+# define VM_RSVD_VA		VM_NONE
+# define VM_RSVD_NORELINK	VM_NONE
+#endif
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif
@@ -3319,6 +3334,8 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
	 avc; avc = anon_vma_interval_tree_iter_next(avc, start, last))

 /* mmap.c */
+extern int install_rsvd_mapping(struct mm_struct *mm, unsigned long addr,
+				unsigned long len);
 extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
 extern int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
		      unsigned long start, unsigned long end, pgoff_t pgoff,

include/uapi/asm-generic/mman-common.h (+1)

@@ -16,6 +16,7 @@
 #define PROT_NONE	0x0		/* page can not be accessed */
 #define PROT_GROWSDOWN	0x01000000	/* mprotect flag: extend change to start of growsdown vma */
 #define PROT_GROWSUP	0x02000000	/* mprotect flag: extend change to end of growsup vma */
+#define PROT_RESERVED	0x10000000	/* Reserve this VA range */

 /* 0x01 - 0x03 are defined in linux/mman.h */
 #define MAP_TYPE	0x0f		/* Mask for type of mapping */
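For illustration only: the commit message also describes a run-time reservation path through mprotect() with the PROT_RESERVED flag added above. The exact preconditions of that call are not visible in this hunk, so the sketch below is an assumption about the call shape rather than a confirmed API contract; PROT_RESERVED is defined locally because libc headers do not carry it, and the address used is purely hypothetical.

#include <stdio.h>
#include <sys/mman.h>

#ifndef PROT_RESERVED
#define PROT_RESERVED 0x10000000	/* value added by this patch */
#endif

/* Ask the kernel to set [addr, addr + len) aside for this process */
static int reserve_range(void *addr, size_t len)
{
	if (mprotect(addr, len, PROT_RESERVED) != 0) {
		perror("mprotect(PROT_RESERVED)");
		return -1;
	}
	/* The range can later be backed with mmap(..., MAP_FIXED, ...) */
	return 0;
}

int main(void)
{
	/* Hypothetical 2 MB range this process wants to keep for itself */
	return reserve_range((void *)0x7f8000000000UL, 0x200000);
}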

include/uapi/linux/auxvec.h (+2)

@@ -41,4 +41,6 @@
 #define AT_MINSIGSTKSZ	51	/* minimal stack size for signal delivery */
 #endif

+#define AT_VA_RESERVATION	71	/* VA reservation support */
+
 #endif /* _UAPI_LINUX_AUXVEC_H */
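For illustration only: a small sketch of how a process might confirm the feature end to end, assuming the getauxval() wrapper from libc. It checks the AT_VA_RESERVATION entry added above (defined locally since libc headers do not know the value) and then lists the ranges that the kernel annotates with "[rsvd]" in /proc/<pid>/maps.

#include <stdio.h>
#include <string.h>
#include <sys/auxv.h>

#ifndef AT_VA_RESERVATION
#define AT_VA_RESERVATION 71	/* value added by this patch */
#endif

int main(void)
{
	char line[512];
	FILE *maps;

	/* Non-zero only on a kernel built with this feature */
	if (!getauxval(AT_VA_RESERVATION)) {
		fprintf(stderr, "kernel does not advertise VA reservation\n");
		return 1;
	}

	maps = fopen("/proc/self/maps", "r");
	if (!maps) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), maps)) {
		/* Reserved-but-unmapped ranges carry the "[rsvd]" annotation */
		if (strstr(line, "[rsvd]"))
			fputs(line, stdout);
	}
	fclose(maps);
	return 0;
}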
