Description
With `mmapfs`, search queries load more data than necessary into the page cache. By default, every memory mapping done with `mmap()` (and `FileChannel.map()` in Java) on Linux is expected to be read (almost) sequentially. However, when a search request is executed, the inverted index and other data structures seem to be read randomly, so the system loads extra memory pages before and after the pages that are actually needed.
This results in a lot of I/O from the storage to warm up the file cache. In addition, the cache fills up with unnecessary data and can evict the hot pages more quickly, slowing down subsequent requests.
The problem is more visible with big indices (~1TB in our case).
To avoid this, Linux provides the `madvise()` syscall to change the prefetching behavior of memory maps: you can tell the system to avoid loading extra pages by calling it with the `MADV_RANDOM` flag.
Unfortunately, Java doesn't expose this syscall. Lucene provides a native library for it, `org.apache.lucene.store.NativePosixUtil`, but it doesn't seem to be used.
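For illustration, here is a minimal sketch of what this looks like when done by hand with that native library (assuming `libNativePosixUtil.so`, built from the Lucene `misc` module as shown further down, is on `java.library.path`; the file path is just a placeholder):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

import org.apache.lucene.store.NativePosixUtil;

public class MadviseSketch {
  public static void main(String[] args) throws IOException {
    // Placeholder path: any large index file that is accessed randomly.
    try (FileChannel fc = FileChannel.open(Paths.get("/data/index/_0.cfs"),
                                           StandardOpenOption.READ)) {
      // A single mapping is limited to 2 GiB; MMapDirectory maps big files in chunks.
      long size = Math.min(fc.size(), Integer.MAX_VALUE);
      MappedByteBuffer buffer = fc.map(MapMode.READ_ONLY, 0, size);
      // Equivalent of madvise(addr, len, MADV_RANDOM): tells the kernel not to
      // read ahead on this mapping, so only the pages actually touched are loaded.
      NativePosixUtil.madvise(buffer, NativePosixUtil.RANDOM);
      // ... random reads on `buffer` now pull far fewer pages into the cache ...
    }
  }
}
```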
To illustrate this, I ran some tests on read-only indices (~60GB) with a batch of search requests (bool queries on some fields with `size=0`, so only the document count is returned; a sketch of such a request is shown after the test setup below). Each index has been optimized with `_forcemerge`:
|                       | Warm: madv | Warm: mmap | Warm: nio | Cold: madv | Cold: mmap | Cold: nio |
|-----------------------|-----------:|-----------:|----------:|-----------:|-----------:|----------:|
| query 1               | 8276       | 9100       | 9422      | 13967      | 10487      | 10769     |
| query 2               | 9          | 10         | 9         | 95         | 1031       | 28        |
| query 3               | 403        | 774        | 753       | 1019       | 1267       | 839       |
| query 4               | 428        | 852        | 739       | 702        | 1025       | 857       |
| query 5               | 4003       | 5591       | 5580      | 7970       | 6778       | 5947      |
| query 6               | 1608       | 2237       | 2567      | 2611       | 2511       | 2594      |
| query 7               | 5154       | 7193       | 7476      | 7890       | 7204       | 7943      |
| query 8               | 438        | 705        | 707       | 1110       | 1211       | 793       |
| query 9               | 2824       | 3922       | 4377      | 4143       | 4400       | 4237      |
| query 10              | 2313       | 3235       | 3073      | 3086       | 3262       | 3471      |
| average               | 2545       | 3361       | 3470      | 4259       | 3917       | 3747      |
| consumed cache (MiB)  | -          | -          | -         | 1607       | 7659       | 4687      |
| storage I/O (MiB/s)   | 0          | 0          | 0         | ~30        | ~250       | ~150      |
Each column represents a single test and results are in ms:
- "cold" runs are made after a fresh startup with empty caches (`echo 3 > /proc/sys/vm/drop_caches`)
- "warm" runs are the same tests repeated right after the cold ones

The query cache and the request cache have been disabled.
The storage is made of spinning disks.
Elasticsearch version: 5.5.0.
OS: RHEL 7.3
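The exact benchmark queries aren't included here; as a rough sketch only, a request of the shape described above (bool query on a few fields, `size=0`, request cache disabled) looks like this with the 5.x transport client, where the index name, fields and values are made-up placeholders:

```java
import java.net.InetAddress;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

public class CountOnlyQuery {
  public static void main(String[] args) throws Exception {
    Settings settings = Settings.builder().put("cluster.name", "my-cluster").build();
    try (TransportClient client = new PreBuiltTransportClient(settings)
        .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("localhost"), 9300))) {
      SearchResponse resp = client.prepareSearch("big-index")          // placeholder index
          .setQuery(QueryBuilders.boolQuery()
              .filter(QueryBuilders.termQuery("status", "published"))  // placeholder fields/values
              .filter(QueryBuilders.rangeQuery("timestamp").gte("now-30d")))
          .setSize(0)               // no hits returned, only the total document count
          .setRequestCache(false)   // keep the shard request cache out of the measurement
          .get();
      System.out.println(resp.getHits().getTotalHits() + " docs in " + resp.getTook());
    }
  }
}
```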
You can see that `mmapfs` consumes more cache and I/O than `niofs`.
For the `madv` column, I patched Lucene (`MMapDirectory`) to execute `madvise(MADV_RANDOM)` on each mapped file. This further reduces the file cache and I/O consumption, and in addition the searches are faster on warmed data.
To do this, I just added a single line in `MMapDirectory.java`:
```java
final ByteBuffer[] map(String resourceDescription, FileChannel fc, long offset, long length) throws IOException {
  ...
  try {
    buffer = fc.map(MapMode.READ_ONLY, offset + bufferStart, bufSize);
    NativePosixUtil.madvise(buffer, NativePosixUtil.RANDOM); // here!
  } catch (IOException ioe) {
    throw convertMapFailedIOException(ioe, resourceDescription, bufSize);
  }
  ...
  return buffers;
}
```
Then I compiled the shared native library `libNativePosixUtil.so` (from the Lucene sources):

```
cd lucene-6.6.0/misc
ant build-native-unix
```

And finally, I started Elasticsearch with `-Djava.library.path=/.../lucene-6.6.0/build/native/NativePosixUtil.so` in `jvm.options`.
I don't know if this solution can be applied in all cases, and I didn't test everything (replication, merging, other queries...), but it could explain why `mmapfs` performs badly on large setups for searching. Some users have reported similar behavior, like here, here or here.
I don't know if there is a similar problem on Windows, since its memory management is different.