unnecessary fsync in perf tracing code #13812

Closed
wfurt opened this issue Nov 21, 2019 · 7 comments

@wfurt
Member

wfurt commented Nov 21, 2019

I was trying to profile some networking code on Linux using perfcollect. I noticed a large chunk of time spent in fsync and related code. In my test of 200 SslHandshake operations I saw 7000+ fsync calls.
After some digging, this points to:

    int LogMethod(void* pCode, size_t codeSize, const char* symbol, void* debugInfo, void* unwindInfo)
    {
        int result = 0;
        if (enabled)
        {
            size_t symbolLen = strlen(symbol);
            JitCodeLoadRecord record;
            record.vma = (uint64_t) pCode;
            record.code_addr = (uint64_t) pCode;
            record.code_size = codeSize;
            record.code_index = ++codeIndex;
            record.header.total_size = sizeof(JitCodeLoadRecord) + symbolLen + 1 + codeSize;
            result = pthread_mutex_lock(&mutex);
            if (result != 0)
                return FatalError(false);
            if (!enabled)
                goto exit;
            // ToDo write debugInfo and unwindInfo immediately before the JitCodeLoadRecord (while lock is held).
            record.header.timestamp = GetTimeStampNS();
            result = write(fd, &record, sizeof(JitCodeLoadRecord));
            if (result == -1)
                return FatalError(true);
            result = write(fd, symbol, symbolLen + 1);
            if (result == -1)
                return FatalError(true);
            result = write(fd, pCode, codeSize);
            if (result == -1)
                return FatalError(true);
            result = fsync(fd);
            if (result == -1)
                return FatalError(true);
    exit:
            result = pthread_mutex_unlock(&mutex);
            if (result != 0)
                return FatalError(false);
        }
        return 0;
    }


The fsync seems unnecessary, as we don't expect correct behavior on abnormal exits anyway (https://github.com/dotnet/diagnostics/issues/570). Once write() completes, the data will be written out even if the .NET process crashes, because the OS will close all handles and flush all buffers.
To make this worse, the fsync is currently done under the lock, which can significantly lengthen the critical section.

There seem to be other issues with this code. We only check for a negative return value from write(). However, write() can write fewer bytes than requested and return the number of bytes actually written. We would miss such a case and could write corrupted entries unnoticed.

Since we write a fixed number of buffers per record, we should be able to build an iovec array on the stack and use a single writev() call, as sketched below. That would limit context switches into the kernel and make this less intrusive to the traced code.
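
A minimal sketch of what the combined write could look like, reusing the record, symbol, symbolLen, pCode, fd, and FatalError names from the snippet above (the partial-write handling discussed later in this thread is omitted here):

    // needs <sys/uio.h> for struct iovec and writev()
    // Gather the record header, the NUL-terminated symbol name, and the code bytes
    // into one writev() call so a single syscall covers the whole entry.
    struct iovec iov[3];
    iov[0].iov_base = &record;
    iov[0].iov_len  = sizeof(JitCodeLoadRecord);
    iov[1].iov_base = (void*)symbol;
    iov[1].iov_len  = symbolLen + 1;
    iov[2].iov_base = pCode;
    iov[2].iov_len  = codeSize;

    ssize_t written = writev(fd, iov, 3);
    if (written == -1)
        return FatalError(true);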

cc @sdmaclea @janvorli @noahfalk @jkotas

@sdmaclea
Contributor

The fsync seems unnecessary
the OS will close all handles and flush all buffers

When profiling a single app in development it is not necessary. When collecting a system-wide profile of long-running processes it is more important, because the flush is not guaranteed to happen until the process exits. This means that when profiling a long-running process, I cannot completely analyze the system until I terminate the .NET Core apps.

use a single writev() call. That would limit context switches

👍

Also, switching to fdatasync might reduce the impact.

I suspect a dedicated jitdump file writer thread might be the best answer for minimal perf impact. The iovec structs could easily be passed through a concurrent queue to the writer thread, and the atomic write to the queue could serve as the locking mechanism. This would probably reduce the impact on the JIT threads. The perf thread id could then be used to ignore the jitdump file IO, fsync, etc. A rough sketch follows.
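
A rough sketch of that shape, using hypothetical names (JitDumpEntry, EnqueueEntry, WriterThread) and a simple mutex/condvar queue standing in for a lock-free one; full-queue handling, shutdown, and buffer lifetime are deliberately omitted:

    #include <pthread.h>
    #include <sys/uio.h>

    #define QUEUE_CAPACITY 256

    typedef struct
    {
        struct iovec iov[3];    // header, symbol, code
    } JitDumpEntry;

    static JitDumpEntry    s_queue[QUEUE_CAPACITY];
    static size_t          s_head, s_tail;
    static pthread_mutex_t s_lock     = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  s_notEmpty = PTHREAD_COND_INITIALIZER;

    // Called on JIT threads: enqueue the entry and return without doing any IO.
    static void EnqueueEntry(const JitDumpEntry* entry)
    {
        pthread_mutex_lock(&s_lock);
        s_queue[s_tail % QUEUE_CAPACITY] = *entry;   // real code must handle a full queue
        s_tail++;
        pthread_cond_signal(&s_notEmpty);
        pthread_mutex_unlock(&s_lock);
    }

    // Dedicated writer thread: drains the queue and does all jitdump file IO,
    // so its thread id can be filtered out of the perf data.
    static void* WriterThread(void* arg)
    {
        int fd = *(int*)arg;
        for (;;)
        {
            pthread_mutex_lock(&s_lock);
            while (s_head == s_tail)
                pthread_cond_wait(&s_notEmpty, &s_lock);
            JitDumpEntry entry = s_queue[s_head % QUEUE_CAPACITY];
            s_head++;
            pthread_mutex_unlock(&s_lock);

            writev(fd, entry.iov, 3);   // partial writes still need the retry handling below
        }
        return NULL;
    }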

@sdmaclea
Contributor

sdmaclea commented Nov 22, 2019

There seem to be other issues with this code. We only check for a negative return value from write().

👍

If a write() is interrupted by a signal handler before any bytes are written, then the call fails with the error EINTR; if it is interrupted after at least one byte has been written, the call succeeds, and returns the number of bytes written.

We need a retry loop here to handle interruption by signal handlers and the resulting short writes.
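
A sketch of what such a loop might look like (the WriteFully name is made up here; fd is the jitdump file descriptor from the snippet above):

    #include <errno.h>
    #include <unistd.h>

    // Write the whole buffer, retrying on EINTR and continuing after short writes.
    static int WriteFully(int fd, const void* buffer, size_t count)
    {
        const char* p = (const char*)buffer;
        while (count > 0)
        {
            ssize_t written = write(fd, p, count);
            if (written == -1)
            {
                if (errno == EINTR)
                    continue;       // interrupted before any byte was written; retry
                return -1;          // genuine error
            }
            p += written;           // short write; advance past what was written
            count -= (size_t)written;
        }
        return 0;
    }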

@sdmaclea
Contributor

In my test of 200 SslHandshake operations I saw 7000+ fsync calls

This really should only impact app startup. To get a good profile of long-term performance you may need a significantly longer profile.

#13539 would also help eliminate startup anomalies.

@sdmaclea
Contributor

To make this worse, the fsync is currently done under the lock, which can significantly lengthen the critical section.

Looking at the POSIX standard, write and writev operations are required to be atomic, so the fsync could easily move out of the critical section.

I was wondering whether using writev would let us eliminate the lock entirely. Unfortunately, given the possibility of a partial write described above, we would still need a lock to handle the recoverable partial-write case.

@sdmaclea
Contributor

The fsync seems unnecessary

The spec is not immediately clear to me on whether write and writev are required to be visible to another process. However, as I think about the POSIX atomicity requirements of write and writev... I strongly suspect the fsync is not necessary for other processes to see these writes in finite time.

I will remove the fsync.

@sdmaclea sdmaclea self-assigned this Nov 22, 2019
@wfurt
Member Author

wfurt commented Nov 24, 2019

I think there is no reason for the kernel to hold onto the buffers once the data crosses the kernel boundary. The write to disk may not happen immediately, since the kernel tries to optimize disk writes, but AFAIK the data becomes visible to other processes as soon as it hits the disk cache, even if it has not reached the physical medium yet.

@sdmaclea
Contributor

Removing the fsync, along with the other changes, made perf record runs much faster. While user and system time increased slightly, the elapsed time decreased dramatically.

In one benchmark the elapsed time dropped from about 2 minutes to about 2 seconds.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020