unnecessary fsync in perf tracing code #13812

Closed
wfurt opened this issue Nov 21, 2019 · 7 comments

@wfurt
Member

wfurt commented Nov 21, 2019

I was trying to profile some networking code on Linux using perfcollect. I noticed a large chunk of time spent in fsync and related code. In my test of 200 SslHandshake operations I saw 7000+ fsync calls.
After some digging, this points to:

    int LogMethod(void* pCode, size_t codeSize, const char* symbol, void* debugInfo, void* unwindInfo)
    {
        int result = 0;
        if (enabled)
        {
            size_t symbolLen = strlen(symbol);
            JitCodeLoadRecord record;
            record.vma = (uint64_t) pCode;
            record.code_addr = (uint64_t) pCode;
            record.code_size = codeSize;
            record.code_index = ++codeIndex;
            record.header.total_size = sizeof(JitCodeLoadRecord) + symbolLen + 1 + codeSize;
            result = pthread_mutex_lock(&mutex);
            if (result != 0)
                return FatalError(false);
            if (!enabled)
                goto exit;
            // ToDo write debugInfo and unwindInfo immediately before the JitCodeLoadRecord (while lock is held).
            record.header.timestamp = GetTimeStampNS();
            result = write(fd, &record, sizeof(JitCodeLoadRecord));
            if (result == -1)
                return FatalError(true);
            result = write(fd, symbol, symbolLen + 1);
            if (result == -1)
                return FatalError(true);
            result = write(fd, pCode, codeSize);
            if (result == -1)
                return FatalError(true);
            result = fsync(fd);
            if (result == -1)
                return FatalError(true);
    exit:
            result = pthread_mutex_unlock(&mutex);
            if (result != 0)
                return FatalError(false);
        }
        return 0;
    }


The fsync seems unnecessary, as we don't expect correct behavior on abnormal exits anyway (https://github.com/dotnet/diagnostics/issues/570). Once write() completes, the data will be written out even if the .NET process crashes, because the OS will close all handles and flush all buffers.
To make this worse, the fsync is currently done under the lock, which can significantly lengthen the critical section.

There seem to be other issues with this code. We only check for a negative return value from write(). However, write() can write fewer bytes than requested and return the number of bytes actually written. We would miss such a case and could write corrupted entries unnoticed.

Since we write a fixed number of buffers per record, we should be able to build an iovec array on the stack and use a single writev() call, as sketched below. That would limit context switches into the kernel and make this less intrusive to the traced code.
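
A minimal sketch of what the combined write could look like, reusing the record, symbol, symbolLen, pCode, fd, and FatalError names from the snippet above (the partial-write handling discussed later in this thread is omitted here):

    // needs <sys/uio.h> for struct iovec and writev()
    // Gather the record header, the NUL-terminated symbol name, and the code bytes
    // into one writev() call so a single syscall covers the whole entry.
    struct iovec iov[3];
    iov[0].iov_base = &record;
    iov[0].iov_len  = sizeof(JitCodeLoadRecord);
    iov[1].iov_base = (void*)symbol;
    iov[1].iov_len  = symbolLen + 1;
    iov[2].iov_base = pCode;
    iov[2].iov_len  = codeSize;

    ssize_t written = writev(fd, iov, 3);
    if (written == -1)
        return FatalError(true);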

cc @sdmaclea @janvorli @noahfalk @jkotas

@sdmaclea
Contributor

The fsync seems unnecessary
the OS will close all handles and flush all buffers

When profiling a single app in development it is not necessary. When collecting a system-wide profile of long-running processes it is more important, because the flush is not guaranteed to happen until the process exits. This means that when profiling a long-running process, I cannot completely analyze the system until I terminate the .NET Core apps.

use a single writev() call. That would limit context switches

👍

Also, switching to fdatasync might reduce the impact.

I suspect a dedicated jitdump file writer thread might be the best answer for minimal perf impact. The iovec structs could easily be passed through a concurrent queue to the writer thread, and the atomic write to the queue could serve as the locking mechanism. This would probably reduce the impact on the JIT threads. The perf thread id could then be used to ignore the jitdump file IO, fsync, etc. A rough sketch follows.
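
A rough sketch of that shape, using hypothetical names (JitDumpEntry, EnqueueEntry, WriterThread) and a simple mutex/condvar queue standing in for a lock-free one; full-queue handling, shutdown, and buffer lifetime are deliberately omitted:

    #include <pthread.h>
    #include <sys/uio.h>

    #define QUEUE_CAPACITY 256

    typedef struct
    {
        struct iovec iov[3];    // header, symbol, code
    } JitDumpEntry;

    static JitDumpEntry    s_queue[QUEUE_CAPACITY];
    static size_t          s_head, s_tail;
    static pthread_mutex_t s_lock     = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  s_notEmpty = PTHREAD_COND_INITIALIZER;

    // Called on JIT threads: enqueue the entry and return without doing any IO.
    static void EnqueueEntry(const JitDumpEntry* entry)
    {
        pthread_mutex_lock(&s_lock);
        s_queue[s_tail % QUEUE_CAPACITY] = *entry;   // real code must handle a full queue
        s_tail++;
        pthread_cond_signal(&s_notEmpty);
        pthread_mutex_unlock(&s_lock);
    }

    // Dedicated writer thread: drains the queue and does all jitdump file IO,
    // so its thread id can be filtered out of the perf data.
    static void* WriterThread(void* arg)
    {
        int fd = *(int*)arg;
        for (;;)
        {
            pthread_mutex_lock(&s_lock);
            while (s_head == s_tail)
                pthread_cond_wait(&s_notEmpty, &s_lock);
            JitDumpEntry entry = s_queue[s_head % QUEUE_CAPACITY];
            s_head++;
            pthread_mutex_unlock(&s_lock);

            writev(fd, entry.iov, 3);   // partial writes still need the retry handling below
        }
        return NULL;
    }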

@sdmaclea
Contributor

sdmaclea commented Nov 22, 2019

There seem to be other issues with this code. We only check for a negative return value from write().

👍

If a write() is interrupted by a signal handler before any bytes are written, then the call fails with the error EINTR; if it is interrupted after at least one byte has been written, the call succeeds, and returns the number of bytes written.

We need a retry loop here to handle interruption by signal handlers and the resulting short writes.
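
A sketch of what such a loop might look like (the WriteFully name is made up here; fd is the jitdump file descriptor from the snippet above):

    #include <errno.h>
    #include <unistd.h>

    // Write the whole buffer, retrying on EINTR and continuing after short writes.
    static int WriteFully(int fd, const void* buffer, size_t count)
    {
        const char* p = (const char*)buffer;
        while (count > 0)
        {
            ssize_t written = write(fd, p, count);
            if (written == -1)
            {
                if (errno == EINTR)
                    continue;       // interrupted before any byte was written; retry
                return -1;          // genuine error
            }
            p += written;           // short write; advance past what was written
            count -= (size_t)written;
        }
        return 0;
    }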

@sdmaclea
Contributor

In my test of 200 SslHandshake operations I saw 7000+ fsync calls

This really should only impact app startup. To get a good profile of long-term performance you may need a significantly longer profile.

#13539 would also help eliminate startup anomalies.

@sdmaclea
Contributor

To make this worse, the fsync is currently done under the lock, which can significantly lengthen the critical section.

Looking at the POSIX standard, write and writev operations are required to be atomic, so the fsync could easily move out of the critical section.

I was wondering whether using writev would let us eliminate the lock entirely. Unfortunately, given the possibility of a partial write described above, we would still need a lock to handle the recoverable partial-write case.

@sdmaclea
Contributor

The fsync seems unnecessary

The spec is not immediately clear to me on whether write and writev are required to be visible to another process. However, as I think about the POSIX atomicity requirements of write and writev... I strongly suspect the fsync is not necessary for other processes to see these writes in finite time.

I will remove the fsync.

@sdmaclea sdmaclea self-assigned this Nov 22, 2019
@wfurt
Member Author

wfurt commented Nov 24, 2019

I think there is no reason for the kernel to hold onto the buffers once the data crosses the kernel boundary. The write to disk may not happen immediately, since the kernel tries to optimize disk writes, but AFAIK the data becomes visible to other processes as soon as it hits the disk cache, even if it has not reached the physical medium yet.

@sdmaclea
Contributor

Removing the fsync, along with the other changes, made perf record runs much faster. While user and system time increased slightly, the elapsed time decreased dramatically.

In one benchmark the elapsed time dropped from about 2 minutes to about 2 seconds.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020