Description
With `mmapfs`, search queries load more data than necessary into the page cache. By default, every memory mapping done with `mmap()` (and `FileChannel.map()` in Java) on Linux is expected to be read (almost) sequentially. However, when a search request is executed, the inverted index and other data structures seem to be read randomly, so the system loads extra memory pages before and after the pages that are actually needed.
This results in a lot of I/O from the storage to warm up the file cache. In addition, the cache fills up with unnecessary data and can evict the hot pages more quickly, slowing down subsequent requests.
The problem is more visible with big indices (~1TB in our case).
To avoid this, Linux provides the `madvise()` syscall to change the prefetching behavior of memory maps: you can tell the system to avoid loading extra pages by calling it with the `MADV_RANDOM` flag.
Unfortunately, Java doesn't expose this syscall. Lucene provides a native library for it, `org.apache.lucene.store.NativePosixUtil`, but it doesn't seem to be used.
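For illustration, here is a minimal sketch of what this looks like when done by hand with that native library (assuming `libNativePosixUtil.so`, built from the Lucene `misc` module as shown further down, is on `java.library.path`; the file path is just a placeholder):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

import org.apache.lucene.store.NativePosixUtil;

public class MadviseSketch {
  public static void main(String[] args) throws IOException {
    // Placeholder path: any large index file that is accessed randomly.
    try (FileChannel fc = FileChannel.open(Paths.get("/data/index/_0.cfs"),
                                           StandardOpenOption.READ)) {
      // A single mapping is limited to 2 GiB; MMapDirectory maps big files in chunks.
      long size = Math.min(fc.size(), Integer.MAX_VALUE);
      MappedByteBuffer buffer = fc.map(MapMode.READ_ONLY, 0, size);
      // Equivalent of madvise(addr, len, MADV_RANDOM): tells the kernel not to
      // read ahead on this mapping, so only the pages actually touched are loaded.
      NativePosixUtil.madvise(buffer, NativePosixUtil.RANDOM);
      // ... random reads on `buffer` now pull far fewer pages into the cache ...
    }
  }
}
```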
To illustrate this, I ran some tests on read-only indices (~60GB) with a batch of search requests (bool queries on some fields with `size=0`, so only the document count is returned; a sketch of such a request is shown after the test setup below). Each index has been optimized with `_forcemerge`:
|                       | Warm: madv | Warm: mmap | Warm: nio | Cold: madv | Cold: mmap | Cold: nio |
|-----------------------|-----------:|-----------:|----------:|-----------:|-----------:|----------:|
| query 1               | 8276       | 9100       | 9422      | 13967      | 10487      | 10769     |
| query 2               | 9          | 10         | 9         | 95         | 1031       | 28        |
| query 3               | 403        | 774        | 753       | 1019       | 1267       | 839       |
| query 4               | 428        | 852        | 739       | 702        | 1025       | 857       |
| query 5               | 4003       | 5591       | 5580      | 7970       | 6778       | 5947      |
| query 6               | 1608       | 2237       | 2567      | 2611       | 2511       | 2594      |
| query 7               | 5154       | 7193       | 7476      | 7890       | 7204       | 7943      |
| query 8               | 438        | 705        | 707       | 1110       | 1211       | 793       |
| query 9               | 2824       | 3922       | 4377      | 4143       | 4400       | 4237      |
| query 10              | 2313       | 3235       | 3073      | 3086       | 3262       | 3471      |
| average               | 2545       | 3361       | 3470      | 4259       | 3917       | 3747      |
| consumed cache (MiB)  | -          | -          | -         | 1607       | 7659       | 4687      |
| storage I/O (MiB/s)   | 0          | 0          | 0         | ~30        | ~250       | ~150      |
Each column represents a single test and results are in ms:
- "cold" runs are made after a fresh startup with empty caches (`echo 3 > /proc/sys/vm/drop_caches`)
- "warm" runs are the same tests repeated right after the cold ones

The query cache and the request cache have been disabled.
The storage is made of spinning disks.
Elasticsearch version: 5.5.0.
OS: RHEL 7.3
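The exact benchmark queries aren't included here; as a rough sketch only, a request of the shape described above (bool query on a few fields, `size=0`, request cache disabled) looks like this with the 5.x transport client, where the index name, fields and values are made-up placeholders:

```java
import java.net.InetAddress;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

public class CountOnlyQuery {
  public static void main(String[] args) throws Exception {
    Settings settings = Settings.builder().put("cluster.name", "my-cluster").build();
    try (TransportClient client = new PreBuiltTransportClient(settings)
        .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("localhost"), 9300))) {
      SearchResponse resp = client.prepareSearch("big-index")          // placeholder index
          .setQuery(QueryBuilders.boolQuery()
              .filter(QueryBuilders.termQuery("status", "published"))  // placeholder fields/values
              .filter(QueryBuilders.rangeQuery("timestamp").gte("now-30d")))
          .setSize(0)               // no hits returned, only the total document count
          .setRequestCache(false)   // keep the shard request cache out of the measurement
          .get();
      System.out.println(resp.getHits().getTotalHits() + " docs in " + resp.getTook());
    }
  }
}
```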
You can see that `mmapfs` consumes more cache and I/O than `niofs`.
For the `madv` column, I patched Lucene (`MMapDirectory`) to execute `madvise(MADV_RANDOM)` on each mapped file. This further reduces the file cache and I/O consumption, and in addition the searches are faster on warmed data.
To do this, I just added a single line in `MMapDirectory.java`:
```java
final ByteBuffer[] map(String resourceDescription, FileChannel fc, long offset, long length) throws IOException {
  ...
  try {
    buffer = fc.map(MapMode.READ_ONLY, offset + bufferStart, bufSize);
    NativePosixUtil.madvise(buffer, NativePosixUtil.RANDOM); // here!
  } catch (IOException ioe) {
    throw convertMapFailedIOException(ioe, resourceDescription, bufSize);
  }
  ...
  return buffers;
}
```
Then I compiled the shared native library `libNativePosixUtil.so` (from the Lucene sources):

```
cd lucene-6.6.0/misc
ant build-native-unix
```

And finally, I started Elasticsearch with `-Djava.library.path=/.../lucene-6.6.0/build/native/NativePosixUtil.so` in `jvm.options`.
I don't know if this solution can be applied in all cases, and I didn't test everything (replication, merging, other queries...), but it could explain why `mmapfs` performs badly on large setups for searching. Some users have reported similar behavior, like here, here or here.
I don't know if there is a similar problem on Windows, since its memory management is different.