
Avoid file cache thrashing on Linux with mmapfs by using madvise()? #27748

Closed
@micoq

Description


With mmapfs, search queries load more data than necessary into the page cache. By default, Linux assumes that every memory mapping created with mmap() (and FileChannel.map() in Java) will be read (almost) sequentially. However, when a search request is executed, the inverted index and other data structures are read mostly randomly, so the system prefetches extra memory pages before and after the page that is actually needed.
This results in a lot of I/O from storage just to warm up the file cache. In addition, the cache fills up with unnecessary data and can evict hot pages more quickly, slowing down subsequent requests.
The problem is more visible with big indices (~1TB in our case).

To avoid this, Linux provides the madvise() syscall to change the prefetching behavior of memory mappings. Calling it with the MADV_RANDOM flag tells the kernel not to read ahead around the pages that are actually accessed.
Unfortunately, Java doesn't expose this syscall. Lucene ships a native library that does (org.apache.lucene.store.NativePosixUtil), but it doesn't seem to be used anywhere.
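
For illustration, here is a minimal standalone sketch of what that call looks like, assuming lucene-misc is on the classpath and libNativePosixUtil.so is on java.library.path (the class name and the file argument are just examples, not part of Lucene):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

import org.apache.lucene.store.NativePosixUtil;

public class MadviseDemo {
  public static void main(String[] args) throws IOException {
    // Map a file read-only, the same way MMapDirectory does internally
    // (a single mapping is limited to 2 GB).
    try (FileChannel fc = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
      MappedByteBuffer buffer = fc.map(MapMode.READ_ONLY, 0, fc.size());
      // madvise(MADV_RANDOM): tell the kernel this mapping will be accessed
      // randomly, so it should not read ahead around each page fault.
      NativePosixUtil.madvise(buffer, NativePosixUtil.RANDOM);
      // Reads through 'buffer' now trigger far less readahead I/O.
    }
  }
}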

To illustrate this, I ran some tests on read-only indices (~60GB) with a batch of search requests (bool queries on some fields with size=0, so only the document count is returned). Each index has been optimized with _forcemerge:

                            Warm                  Cold
                       madv   mmap    nio   madv   mmap   nio

query  1               8276   9100   9422  13967  10487 10769
query  2                  9     10      9     95   1031    28
query  3                403    774    753   1019   1267   839
query  4                428    852    739    702   1025   857
query  5               4003   5591   5580   7970   6778  5947
query  6               1608   2237   2567   2611   2511  2594
query  7               5154   7193   7476   7890   7204  7943
query  8                438    705    707   1110   1211   793
query  9               2824   3922   4377   4143   4400  4237
query 10               2313   3235   3073   3086   3262  3471
average                2545   3361   3470   4259   3917  3747

consumed cache (MiB)      -      -      -   1607   7659  4687
storage I/O (MiB/s)       0      0      0    ~30   ~250  ~150

Each column represents a separate test run and the query times are in ms:

  • "cold" is made after a fresh startup and empty caches (echo 3 > /proc/sys/vm/drop_caches)
  • "warm" is the same test made right after the first one

The query cache and the request cache have been disabled.
The storage is made of spinning disks.
Elasticsearch version: 5.5.0.
OS: RHEL 7.3

You can see that mmapfs consumes more cache and I/O than niofs.

In the madv column, I patched Lucene (MMapDirectory) to execute madvise(MADV_RANDOM) on each mapped file. This further reduces the file cache and I/O consumption. In addition, the searches are faster on warmed data.
To do this, I just added a single line in MMapDirectory.java:

final ByteBuffer[] map(String resourceDescription, FileChannel fc, long offset, long length) throws IOException {
...
  try {
    buffer = fc.map(MapMode.READ_ONLY, offset + bufferStart, bufSize);
    // The added line: advise the kernel that this mapping will be read
    // randomly (madvise(MADV_RANDOM)), disabling readahead around page faults.
    NativePosixUtil.madvise(buffer, NativePosixUtil.RANDOM);
  } catch (IOException ioe) {
    throw convertMapFailedIOException(ioe, resourceDescription, bufSize);
  }
...
  return buffers;
}

Then I compiled the shared native library libNativePosixUtil.so (from the Lucene sources):

cd lucene-6.6.0/misc
ant build-native-unix

And finally, I started Elasticsearch with -Djava.library.path=/.../lucene-6.6.0/build/native/NativePosixUtil.so set in jvm.options.
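
If something like this were applied more generally, the madvise() call would probably have to be optional so that nodes without the native library keep working. A rough sketch of such a guard, e.g. inside MMapDirectory (the helper name is mine, not Lucene's):

// Hypothetical helper: best-effort madvise that is silently skipped when
// libNativePosixUtil.so cannot be loaded or the syscall fails.
static void adviseRandom(ByteBuffer buffer) {
  try {
    NativePosixUtil.madvise(buffer, NativePosixUtil.RANDOM);
  } catch (IOException | LinkageError e) {
    // Native library missing (UnsatisfiedLinkError/ExceptionInInitializerError)
    // or madvise() failed: keep the default kernel readahead behavior.
  }
}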

I don't know if this solution can be applied in all cases and I didn't test everything (replication, merging, other queries...), but it could explain why mmapfs performs badly for searching on large setups. Some users have reported a similar behavior, e.g. here, here or here.

I don't know if there is a similar problem on Windows since its memory management is different.
