| 1 | +MOTIVATION |
| 2 | + |
| 3 | +The idle page tracking feature allows tracking which memory pages are being |
| 4 | +accessed by a workload and which are idle. This information can be useful for |
| 5 | +estimating the workload's working set size, which, in turn, can be taken into |
| 6 | +account when configuring the workload parameters, setting memory cgroup limits, |
| 7 | +or deciding where to place the workload within a compute cluster. |
| 8 | + |
| 9 | +It is enabled by CONFIG_IDLE_PAGE_TRACKING=y. |
| 10 | + |
| 11 | +USER API |
| 12 | + |
| 13 | +The idle page tracking API is located at /sys/kernel/mm/page_idle. Currently, |
| 14 | +it consists of a single read-write file, /sys/kernel/mm/page_idle/bitmap. |
| 15 | + |
| 16 | +The file implements a bitmap where each bit corresponds to a memory page. The |
| 17 | +bitmap is represented by an array of 8-byte integers, and the page at PFN #i is |
| 18 | +mapped to bit #i%64 of array element #i/64; byte order is native. When a bit is |
| 19 | +set, the corresponding page is idle. |
| 20 | + |
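The PFN-to-bit mapping described above can be sketched in a few lines of
Python. This is an illustration only, not kernel code; the helper names
bitmap_position and word_has_pfn_idle are hypothetical.

```python
import struct

WORD_BYTES = 8    # the bitmap is an array of 8-byte integers
WORD_BITS = 64    # each array element covers 64 consecutive PFNs


def bitmap_position(pfn):
    """Return (byte offset of the array element, bit number) for a PFN."""
    return (pfn // WORD_BITS) * WORD_BYTES, pfn % WORD_BITS


def word_has_pfn_idle(raw_word, pfn):
    """Decode one 8-byte element (native byte order) and test the PFN's bit."""
    (word,) = struct.unpack("=Q", raw_word)
    return bool((word >> (pfn % WORD_BITS)) & 1)
```

For example, PFN 130 lives in array element 2 (byte offset 16), bit 2.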
| 21 | +A page is considered idle if it has not been accessed since it was marked idle |
| 22 | +(for more details on what "accessed" actually means see the IMPLEMENTATION |
| 23 | +DETAILS section). To mark a page idle one has to set the bit corresponding to |
| 24 | +the page by writing to the file. A value written to the file is OR-ed with the |
| 25 | +current bitmap value. |
| 26 | + |
| 27 | +Only accesses to user memory pages are tracked. These are pages mapped to a |
| 28 | +process address space, page cache and buffer pages, and swap cache pages. For other |
| 29 | +page types (e.g. SLAB pages) an attempt to mark a page idle is silently ignored, |
| 30 | +and hence such pages are never reported idle. |
| 31 | + |
| 32 | +For huge pages the idle flag is set only on the head page, so one has to read |
| 33 | +/proc/kpageflags in order to correctly count idle huge pages. |
| 34 | + |
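One possible way to account for huge pages is sketched below. It assumes the
caller has already read, for a run of consecutive PFNs, the flag words from
/proc/kpageflags and the corresponding idle bits, and that the compound head
and tail flags are bits 15 and 16 as listed in the /proc/kpageflags
documentation. The function name count_idle_with_huge_pages is hypothetical.

```python
KPF_COMPOUND_HEAD = 15   # assumed bit numbers, as listed for /proc/kpageflags
KPF_COMPOUND_TAIL = 16


def count_idle_with_huge_pages(kpageflags, idle_bits):
    """Count idle base pages over a run of consecutive PFNs.

    kpageflags: per-PFN 64-bit flag words read from /proc/kpageflags.
    idle_bits:  per-PFN booleans decoded from the idle bitmap.
    Tail pages take the idle state of the head page that precedes them.
    """
    idle = 0
    head_idle = False
    for flags, bit in zip(kpageflags, idle_bits):
        if (flags >> KPF_COMPOUND_TAIL) & 1:
            idle += int(head_idle)     # tail page: use the head's state
            continue
        if (flags >> KPF_COMPOUND_HEAD) & 1:
            head_idle = bool(bit)      # remember state for following tails
        idle += int(bool(bit))
    return idle
```

If the run starts in the middle of a huge page, the leading tail pages are
counted as not idle, so it is safer to start the run at a head page.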
| 35 | +Reading from or writing to /sys/kernel/mm/page_idle/bitmap will return |
| 36 | +-EINVAL if you are not starting the read/write on an 8-byte boundary, or |
| 37 | +if the size of the read/write is not a multiple of 8 bytes. Writing to |
| 38 | +this file beyond max PFN will return -ENXIO. |
| 39 | + |
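The alignment rule can be satisfied by always working in whole 8-byte array
elements. The sketch below computes an offset and length that cover a PFN
range; aligned_span and is_valid_request are hypothetical helper names, and
the second function only mirrors the -EINVAL condition stated above.

```python
WORD_BYTES = 8
WORD_BITS = 64


def aligned_span(first_pfn, last_pfn):
    """Return (offset, length) in bytes covering PFNs first_pfn..last_pfn.

    Both values are multiples of 8, as the bitmap file requires.
    """
    first_word = first_pfn // WORD_BITS
    last_word = last_pfn // WORD_BITS
    return first_word * WORD_BYTES, (last_word - first_word + 1) * WORD_BYTES


def is_valid_request(offset, length):
    """Mirror the documented -EINVAL rule for a read or write request."""
    return offset % WORD_BYTES == 0 and length % WORD_BYTES == 0
```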
| 40 | +To estimate the number of pages that are not used by a workload, one |
| 41 | +should: |
| 42 | + |
| 43 | + 1. Mark all the workload's pages as idle by setting corresponding bits in |
| 44 | + /sys/kernel/mm/page_idle/bitmap. The pages can be found by reading |
| 45 | + /proc/pid/pagemap if the workload is represented by a process, or by |
| 46 | + filtering out alien pages using /proc/kpagecgroup in case the workload is |
| 47 | + placed in a memory cgroup. |
| 48 | + |
| 49 | + 2. Wait until the workload accesses its working set. |
| 50 | + |
| 51 | + 3. Read /sys/kernel/mm/page_idle/bitmap and count the number of bits set. To |
| 52 | + ignore certain types of pages, e.g. mlocked pages since they are not |
| 53 | + reclaimable, filter them out using /proc/kpageflags. |
| 54 | + |
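The three steps above can be sketched as follows for a workload represented
by a single process. This is a minimal illustration under stated assumptions:
the pagemap entry layout (bit 63 = present, bits 0-54 = PFN) is taken from
the /proc/pid/pagemap documentation, the function names are hypothetical, and
the bitmap argument is any seekable binary file object.

```python
import io
import struct

PM_PRESENT = 1 << 63          # pagemap: page is present in RAM
PM_PFN_MASK = (1 << 55) - 1   # pagemap: bits 0-54 hold the PFN


def pfns_from_pagemap(raw):
    """Extract PFNs of present pages from raw /proc/pid/pagemap bytes."""
    count = len(raw) // 8
    entries = struct.unpack(f"={count}Q", raw[:count * 8])
    return [e & PM_PFN_MASK for e in entries if e & PM_PRESENT]


def words_for(pfns):
    """Group PFNs into {array index: 64-bit mask} for the idle bitmap."""
    words = {}
    for pfn in pfns:
        words[pfn // 64] = words.get(pfn // 64, 0) | (1 << (pfn % 64))
    return words


def mark_idle(bitmap, pfns):
    """Step 1: set the bits for pfns. Written values are OR-ed with the
    current bitmap, so only the bits of interest are set in each element."""
    for index, mask in sorted(words_for(pfns).items()):
        bitmap.seek(index * 8)
        bitmap.write(struct.pack("=Q", mask))


def count_idle(bitmap, pfns):
    """Step 3: count how many of pfns still have their idle bit set."""
    idle = 0
    for index, mask in sorted(words_for(pfns).items()):
        bitmap.seek(index * 8)
        (word,) = struct.unpack("=Q", bitmap.read(8))
        idle += bin(word & mask).count("1")
    return idle
```

On a real system the bitmap object would typically be opened unbuffered,
e.g. open("/sys/kernel/mm/page_idle/bitmap", "r+b", buffering=0), so that
every access is an aligned 8-byte read or write. Step 2 is simply a delay
between mark_idle() and count_idle().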
| 55 | +See Documentation/vm/pagemap.txt for more information about /proc/pid/pagemap, |
| 56 | +/proc/kpageflags, and /proc/kpagecgroup. |
| 57 | + |
| 58 | +IMPLEMENTATION DETAILS |
| 59 | + |
| 60 | +The kernel internally keeps track of accesses to user memory pages in order to |
| 61 | +reclaim unreferenced pages first under memory shortage. A page is |
| 62 | +considered referenced if it has been recently accessed via a process address |
| 63 | +space, in which case one or more PTEs it is mapped to will have the Accessed bit |
| 64 | +set, or marked accessed explicitly by the kernel (see mark_page_accessed()). The |
| 65 | +latter happens when: |
| 66 | + |
| 67 | + - a userspace process reads or writes a page using a system call (e.g. read(2) |
| 68 | + or write(2)) |
| 69 | + |
| 70 | + - a page that is used for storing filesystem buffers is read or written, |
| 71 | + because a process needs filesystem metadata stored in it (e.g. lists a |
| 72 | + directory tree) |
| 73 | + |
| 74 | + - a page is accessed by a device driver using get_user_pages() |
| 75 | + |
| 76 | +When a dirty page is written to swap or disk as a result of memory reclaim or |
| 77 | +exceeding the dirty memory limit, it is not marked referenced. |
| 78 | + |
| 79 | +The idle memory tracking feature adds a new page flag, the Idle flag. This flag |
| 80 | +is set manually, by writing to /sys/kernel/mm/page_idle/bitmap (see the USER API |
| 81 | +section), and cleared automatically whenever a page is referenced as defined |
| 82 | +above. |
| 83 | + |
| 84 | +When a page is marked idle, the Accessed bit must be cleared in all PTEs it is |
| 85 | +mapped to, otherwise we will not be able to detect accesses to the page coming |
| 86 | +from a process address space. To avoid interference with the reclaimer, which, |
| 87 | +as noted above, uses the Accessed bit to promote actively referenced pages, one |
| 88 | +more page flag is introduced, the Young flag. When the PTE Accessed bit is |
| 89 | +cleared as a result of setting or updating a page's Idle flag, the Young flag |
| 90 | +is set on the page. The reclaimer treats the Young flag as an extra PTE |
| 91 | +Accessed bit and therefore will consider such a page as referenced. |
| 92 | + |
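The interplay of the PTE Accessed bit, the Idle flag and the Young flag can
be illustrated with a toy model. The sketch below is not kernel code: it
models a page with a single mapping, and every name in it is hypothetical.
It assumes, as described above, that clearing a set Accessed bit on behalf
of idle tracking sets Young, and that the reclaimer treats Young as an extra
Accessed bit.

```python
from dataclasses import dataclass


@dataclass
class Page:
    pte_accessed: bool = False   # Accessed bit in the (single) PTE
    idle: bool = False           # Idle page flag
    young: bool = False          # Young page flag


def touch(page):
    """The workload accesses the page through its address space."""
    page.pte_accessed = True


def clear_pte_refs(page):
    """Clear the Accessed bit for idle tracking. If it was set, the page
    is no longer idle, and Young keeps the reference for the reclaimer."""
    if page.pte_accessed:
        page.pte_accessed = False
        page.idle = False
        page.young = True


def mark_idle(page):
    """Model of setting the page's bit in the idle bitmap."""
    clear_pte_refs(page)
    page.idle = True


def read_idle(page):
    """Model of reading the page's bit from the idle bitmap."""
    clear_pte_refs(page)
    return page.idle


def reclaimer_referenced(page):
    """Model of the reclaimer's check: Young counts as an Accessed bit."""
    accessed = page.pte_accessed
    if accessed:
        page.idle = False        # a real access ends idleness
    referenced = accessed or page.young
    page.pte_accessed = False
    page.young = False
    return referenced
```

In this model, marking a recently accessed page idle does not hide the access
from the reclaimer, and a reclaimer scan that only consumes Young does not
clear the Idle flag.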
| 93 | +Since the idle memory tracking feature is based on the memory reclaimer logic, |
| 94 | +it only works with pages that are on an LRU list; other pages are silently |
| 95 | +ignored. That means it will ignore a user memory page if it is isolated, but |
| 96 | +since there are usually not many of them, it should not affect the overall |
| 97 | +result noticeably. In order not to stall scanning of the idle page bitmap, |
| 98 | +locked pages may be skipped too. |