virtio: sync data on virtio flush request #2185
Conversation
When the virtio block device receives a flush (VIRTIO_BLK_T_FLUSH) it must ensure that the data in the host page cache is actually flushed out to media. Otherwise, data can be lost in the event of a power fail.

This commit changes the underlying request from a flush() to a sync_all(), forcing the data to be flushed to disk. This has been verified by running a 'fio' test with sync_on_close in the VM and verifying via 'perf' that fsync() on the host is actually invoked. The bandwidth in the VM is also now in line with the underlying hardware.

The invocation of sync_all() has been refactored to be in accordance with other function invocations in the request handler.

This change also requires that the sys_fsync system call is in the list of approved seccomp calls.

Signed-off-by: Peter Lawthers <[email protected]>
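For context, a minimal standalone sketch (not Firecracker code; the file path is a placeholder) of the distinction the commit relies on: Write::flush on a std::fs::File does not push page-cache data to media, while File::sync_all issues fsync(2) and blocks until the kernel has written the data out.

```rust
use std::fs::OpenOptions;
use std::io::Write;

fn main() -> std::io::Result<()> {
    // "/tmp/disk.img" is a placeholder path, not a Firecracker artifact.
    let mut f = OpenOptions::new()
        .create(true)
        .write(true)
        .open("/tmp/disk.img")?;
    f.write_all(b"guest data")?;

    // flush() on a File only drains userspace-side buffers; it does not ask
    // the kernel to write back the page cache, so the data can still be lost
    // if power is cut.
    f.flush()?;

    // sync_all() maps to fsync(2): it blocks until the kernel has written the
    // data (and metadata) to the underlying device.
    f.sync_all()?;
    Ok(())
}
```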
Can we also have an integration test for this?
```rust
RequestType::Flush => {
    diskfile.sync_all().map_err(ExecuteError::SyncAll)?;
    METRICS.block.flush_count.inc();
    return Ok(0);
```
Is this needed?
What exactly are you referring to? The code itself, or the question about the integration test?
I did not do any additional tests yet, as they "should" have already been covered by the existing test framework. I will look at adding a specific test somehow for data loss between VM and host when the virtio flush fails or is non-existent.
Oh, no, I was referring to that return statement. We return Ok(0) anyway at the end.
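For illustration, a minimal sketch with stubbed types (not the actual Firecracker handler) of the shape being discussed, where the Flush arm simply falls through to the common Ok(0) at the end:

```rust
use std::fs::File;

// Stub types for the sketch; the real RequestType/ExecuteError live in the
// block device module.
enum RequestType {
    Flush,
    Other,
}

enum ExecuteError {
    SyncAll(std::io::Error),
}

struct Request {
    request_type: RequestType,
}

impl Request {
    // The Flush arm falls through to the shared Ok(0) below, so an early
    // `return Ok(0)` inside the arm is redundant.
    fn execute(&self, diskfile: &mut File) -> Result<u32, ExecuteError> {
        match self.request_type {
            RequestType::Flush => {
                diskfile.sync_all().map_err(ExecuteError::SyncAll)?;
                // metrics update elided in this sketch
            }
            RequestType::Other => { /* reads and writes handled here */ }
        }
        Ok(0)
    }
}

fn main() {}
```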
Not sure if this helps (see the Functional Approach/Writing section), but here is an example of how we can force flushing to disk within an integration test. I'm not sure whether it can be used as a regression test, though.
Sorry, I misunderstood. Yes, you are correct, that return is not needed. I will remove it.
With regard to a functional test, it will be a bit tricky. Without the sync, we would have to read the data on the host from the device itself, bypassing the page cache, and verify that it differs from what was written in the guest. I don't know whether O_DIRECT on an ext4 filesystem will cause dirty pages to be flushed before accessing, but that might be one way.

I think madvise(MADV_REMOVE) will flush dirty pages before punching a hole, so that most likely will not work. madvise(MADV_FREE) might work if it really drops dirty pages without writing; then the data could be refreshed from the physical media and compared to some known pattern.
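As a rough sketch of the O_DIRECT idea only (hypothetical helper and path, requiring the libc crate; it does not settle whether ext4 flushes dirty pages before a direct read), reading the backing file on the host while bypassing the page cache could look like:

```rust
use std::fs::OpenOptions;
use std::io::Read;
use std::os::unix::fs::OpenOptionsExt;

// Hypothetical helper: read the first 512-byte block of the guest's backing
// file on the host with O_DIRECT. O_DIRECT requires the buffer, offset, and
// length to be aligned to the logical block size (512 bytes assumed here).
fn read_block_direct(path: &str) -> std::io::Result<Vec<u8>> {
    const BLOCK: usize = 512;
    let mut file = OpenOptions::new()
        .read(true)
        .custom_flags(libc::O_DIRECT)
        .open(path)?;

    // Over-allocate and take a 512-byte-aligned window into the buffer.
    let mut raw = vec![0u8; 2 * BLOCK];
    let offset = raw.as_ptr().align_offset(BLOCK);
    let buf = &mut raw[offset..offset + BLOCK];

    file.read_exact(buf)?;
    Ok(buf.to_vec())
}

fn main() -> std::io::Result<()> {
    // "/path/to/backing-file.img" is a placeholder; compare the bytes read
    // here against the pattern written inside the guest.
    let block = read_block_direct("/path/to/backing-file.img")?;
    println!("first block starts with {:?}", &block[..8]);
    Ok(())
}
```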
@pclesr thanks for the PR, the fix looks good! I would also like us to add a regression test - obviously our current tests missed this.
Let us know if you can help here. Maybe the fio scenario you found this in could be turned into a test? Any other test approach that fails before the fix and passes after would also be good.
Hmm, it seems this increases boot time. Are there many VIRTIO_BLK_T_FLUSH requests during boot?
There are a few -- here are the metrics from booting a minimal alpine-based VM.
Since boot time is critical, is a mechanism needed so that the syncs are only done in a certain mode? In other words, the cache policy would be write-back by default, and it could be switched to write-through if so desired. Also, in my example metrics, my root disk was obviously not read-only in my config.json.
@pclesr thanks, this looks like a good fix. I completely agree that the current implementation breaks the virtio spec, since it does not fsync() on flush. I think there are two options here:
The first looks like a clean cut to the heart of the technical issue, but based on the boot time degradation from just 6 fsync() calls, I am concerned. I think the right approach is to run disk I/O performance test workloads against the current implementation and this PR. Firecracker has a very simple device model with just a single emulation thread for all virtio devices. Since fsync() blocks until the data is flushed to the hardware, each call would stall emulation for all virtio devices, leading to high latencies and jitter in I/O-bound workloads.

The second would be an even simpler fix, but we still need to make sure that this is the right trade-off for Firecracker's mission.

To summarize, I think we need more data on the performance and data loss impact before we can commit to a fix.
@sandreim thank you for your comments. I understand your viewpoint. Given the workloads that Firecracker is designed for, I/O is probably not a target priority. Thank you for the clarification about the thread model -- I did not know that there was only a single thread shared by all virtio devices. That shows that target VMs are not expected to do much, if any, I/O.

I will still work on developing a test that shows how data can be lost in the event of a power fail, since I think it would be helpful (and an interesting challenge). I will post any additional findings in the issue itself, not in this PR.
Respond to comments in PR firecracker-microvm#2185 Signed-off-by: Peter Lawthers <[email protected]>
I have another version that I will post next week that adds an optional field to the drive configuration named "want_flush". Only drives that have this field configured will advertise VIRTIO_BLK_F_FLUSH, and therefore receive flush commands. I have both Rust tests that verify the configuration sets the appropriate flag in the features vector, and Python tests that verify no flushes are received when the flag is not present. I am using the flush_count metric as the indicator of receiving flushes, as it seems to be accurate.

This approach should provide backward compatibility with the existing model and not affect performance. Only those environments that really want to ensure coherency between guest and host will need to set the flag.
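A rough sketch of what such gating could look like (struct and function names are assumptions, not code from that follow-up): the VIRTIO_BLK_F_FLUSH feature bit (bit 9 in the virtio spec) is only advertised when the drive is configured with want_flush, so the guest never issues flush requests otherwise.

```rust
// Sketch only: field and function names here are assumptions.
const VIRTIO_BLK_F_FLUSH: u64 = 9; // feature bit defined by the virtio spec

struct BlockConfig {
    /// When false, the device does not advertise flush support and the
    /// guest will not send VIRTIO_BLK_T_FLUSH requests.
    want_flush: bool,
}

fn avail_features(cfg: &BlockConfig) -> u64 {
    let mut features = 0u64;
    if cfg.want_flush {
        features |= 1u64 << VIRTIO_BLK_F_FLUSH;
    }
    features
}

fn main() {
    let cfg = BlockConfig { want_flush: false };
    assert_eq!(avail_features(&cfg), 0); // no flush advertised by default
}
```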
A new pull request at #2223 has been created that keeps the existing behaviour (no sync to disk), which should address the boot time performance issues.
When the virtio block device receives a flush (VIRTIO_BLK_T_FLUSH)
it must ensure that the data in the host page cache is actually
flushed out to media. Otherwise, data can be lost in the event of
a power fail.
This commit changes the underlying request from a flush() to a
sync_all(), forcing the data to be flushed to disk. This has
been verified by running a 'fio' test with sync_on_close in the
VM and verifying via 'perf' that fsync() on the host is actually
invoked. The bandwidth in the VM is also now in line with the
underlying hardware.
The invocation of sync_all() has been refactored to be in accordance
with other function invocations in the request handler.
This change also requires that the sys_fsync system call is in the
list of approved seccomp calls.
Signed-off-by: Peter Lawthers [email protected]
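Regarding the seccomp note above, a sketch for illustration only (a plain syscall-number allowlist using the libc crate on x86_64 Linux; Firecracker's actual seccomp filter format is not reproduced here):

```rust
// Sketch only: the real filter is built with Firecracker's seccomp tooling.
fn allowed_syscalls() -> Vec<libc::c_long> {
    vec![
        libc::SYS_read,
        libc::SYS_write,
        libc::SYS_fsync, // newly required because flush now issues fsync(2)
    ]
}

fn main() {
    println!("allowing {} syscalls", allowed_syscalls().len());
}
```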
Reason for This PR
[Author TODO: add issue # or explain reasoning.]
Description of Changes
[Author TODO: add description of changes.]
This functionality can be added in rust-vmm.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.

PR Checklist

[Author TODO: Meet these criteria.]
[Reviewer TODO: Verify that these criteria are met. Request changes if not.]

- All commits in this PR are signed (git commit -s).
- Any newly added unsafe code is properly documented.
- Any API changes are reflected in firecracker/swagger.yaml.
- Any user-facing changes are mentioned in CHANGELOG.md.