virtio: sync data on virtio flush request #2185
Conversation
When the virtio block device receives a flush (VIRTIO_BLK_T_FLUSH) it must ensure that the data in the host page cache is actually flushed out to media. Otherwise, data can be lost in the event of a power fail.

This commit changes the underlying request from a flush() to a sync_all(), forcing the data to be flushed to disk. This has been verified by running a 'fio' test with sync_on_close in the VM and verifying via 'perf' that fsync() on the host is actually invoked. The bandwidth in the VM is also now in line with the underlying hardware.

The invocation of sync_all() has been refactored to be in accordance with other function invocations in the request handler.

This change also requires that the sys_fsync system call is in the list of approved seccomp calls.

Signed-off-by: Peter Lawthers <[email protected]>
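For context, a minimal standalone sketch (not Firecracker code; the file path is a placeholder) of the distinction the commit relies on: Write::flush on a std::fs::File does not push page-cache data to media, while File::sync_all issues fsync(2) and blocks until the kernel has written the data out.

```rust
use std::fs::OpenOptions;
use std::io::Write;

fn main() -> std::io::Result<()> {
    // "/tmp/disk.img" is a placeholder path, not a Firecracker artifact.
    let mut f = OpenOptions::new()
        .create(true)
        .write(true)
        .open("/tmp/disk.img")?;
    f.write_all(b"guest data")?;

    // flush() on a File only drains userspace-side buffers; it does not ask
    // the kernel to write back the page cache, so the data can still be lost
    // if power is cut.
    f.flush()?;

    // sync_all() maps to fsync(2): it blocks until the kernel has written the
    // data (and metadata) to the underlying device.
    f.sync_all()?;
    Ok(())
}
```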
Can we also have an integration test for this?
```rust
RequestType::Flush => {
    diskfile.sync_all().map_err(ExecuteError::SyncAll)?;
    METRICS.block.flush_count.inc();
    return Ok(0);
```
Is this needed?
What exactly are you referring to? The code itself, or the question about the integration test?
I did not do any additional tests yet, as they "should" have already been covered by the existing test framework. I will look at adding a specific test somehow for data loss between VM and host when the virtio flush fails or is non-existent.
Oh, no, I was referring to that return statement. We return Ok(0) anyway at the end.
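For illustration, a minimal sketch with stubbed types (not the actual Firecracker handler) of the shape being discussed, where the Flush arm simply falls through to the common Ok(0) at the end:

```rust
use std::fs::File;

// Stub types for the sketch; the real RequestType/ExecuteError live in the
// block device module.
enum RequestType {
    Flush,
    Other,
}

enum ExecuteError {
    SyncAll(std::io::Error),
}

struct Request {
    request_type: RequestType,
}

impl Request {
    // The Flush arm falls through to the shared Ok(0) below, so an early
    // `return Ok(0)` inside the arm is redundant.
    fn execute(&self, diskfile: &mut File) -> Result<u32, ExecuteError> {
        match self.request_type {
            RequestType::Flush => {
                diskfile.sync_all().map_err(ExecuteError::SyncAll)?;
                // metrics update elided in this sketch
            }
            RequestType::Other => { /* reads and writes handled here */ }
        }
        Ok(0)
    }
}

fn main() {}
```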
Not sure if this helps (see the Functional Approach/Writing section), but here is an example of how we can force flushing to disk within an integration test. I'm not sure whether it can be used as a regression test, though.
Sorry, I misunderstood. Yes, you are correct, that return is not needed. I will remove it.
With regard to a functional test, it will be a bit tricky. Without the sync, we would have to read the data on the host from the device itself, bypassing the page cache, and verify that it differs from what was written in the guest. I don't know whether O_DIRECT on an ext4 filesystem will cause dirty pages to be flushed before accessing, but that might be one way.

I think madvise(MADV_REMOVE) will flush dirty pages before punching a hole, so that most likely will not work. madvise(MADV_FREE) might work if it really drops dirty pages without writing; then the data could be refreshed from the physical media and compared to some known pattern.
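As a rough sketch of the O_DIRECT idea only (hypothetical helper and path, requiring the libc crate; it does not settle whether ext4 flushes dirty pages before a direct read), reading the backing file on the host while bypassing the page cache could look like:

```rust
use std::fs::OpenOptions;
use std::io::Read;
use std::os::unix::fs::OpenOptionsExt;

// Hypothetical helper: read the first 512-byte block of the guest's backing
// file on the host with O_DIRECT. O_DIRECT requires the buffer, offset, and
// length to be aligned to the logical block size (512 bytes assumed here).
fn read_block_direct(path: &str) -> std::io::Result<Vec<u8>> {
    const BLOCK: usize = 512;
    let mut file = OpenOptions::new()
        .read(true)
        .custom_flags(libc::O_DIRECT)
        .open(path)?;

    // Over-allocate and take a 512-byte-aligned window into the buffer.
    let mut raw = vec![0u8; 2 * BLOCK];
    let offset = raw.as_ptr().align_offset(BLOCK);
    let buf = &mut raw[offset..offset + BLOCK];

    file.read_exact(buf)?;
    Ok(buf.to_vec())
}

fn main() -> std::io::Result<()> {
    // "/path/to/backing-file.img" is a placeholder; compare the bytes read
    // here against the pattern written inside the guest.
    let block = read_block_direct("/path/to/backing-file.img")?;
    println!("first block starts with {:?}", &block[..8]);
    Ok(())
}
```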
@pclesr thanks for the PR, the fix looks good! I would also like us to add a regression test - obviously our current tests missed this.
Let us know if you can help here. Maybe the fio scenario you found this in could be turned into a test? Any other test approach that fails before the fix and passes after would also be good.
Hmm, it seems this increases boot time. Are there many VIRTIO_BLK_T_FLUSH requests during boot?
There are a few -- here are the metrics from booting a minimal alpine-based VM.
Since boot time is critical, is a mechanism needed so that the syncs are only done in a certain mode? In other words, the cache policy would be write-back by default, and it could be switched to write-through if so desired. Also, in my example metrics, my root disk was obviously not read-only in my config.json.
@pclesr thanks, this looks like a good fix. I completely agree that the current implementation breaks the virtio spec, since it does not fsync() on flush. I think there are two options here:
The first looks like a clean cut to the heart of the technical issue, but based on the boot time degradation from just 6 fsync() calls, I am concerned. I think the right approach is to run disk I/O performance test workloads against the current implementation and this PR. Firecracker has a very simple device model with just a single emulation thread for all virtio devices. Since fsync() blocks until the data is flushed to the hardware, each call would stall emulation for all virtio devices, leading to high latencies and jitter in I/O-bound workloads.

The second would be an even simpler fix, but we still need to make sure that this is the right trade-off for Firecracker's mission.

To summarize, I think we need more data on the performance and data loss impact before we can commit to a fix.
@sandreim thank you for your comments. I understand your viewpoint. Given the workloads that Firecracker is designed for, I/O is probably not a target priority. Thank you for the clarification about the thread model -- I did not know that there was only a single thread shared by all virtio devices. That shows that target VMs are not expected to do much, if any, I/O.

I will still work on developing a test that shows how data can be lost in the event of a power fail, since I think it would be helpful (and an interesting challenge). I will post any additional findings in the issue itself, not in this PR.
Respond to comments in PR firecracker-microvm#2185 Signed-off-by: Peter Lawthers <[email protected]>
I have another version that I will post next week that adds an optional field to the drive configuration named "want_flush". Only drives that have this field configured will advertise VIRTIO_BLK_F_FLUSH, and therefore receive flush commands. I have both Rust tests that verify the configuration sets the appropriate flag in the features vector, and Python tests that verify no flushes are received when the flag is not present. I am using the flush_count metric as the indicator of receiving flushes, as it seems to be accurate.

This approach should provide backward compatibility with the existing model and not affect performance. Only those environments that really want to ensure coherency between guest and host will need to set the flag.
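A rough sketch of what such gating could look like (struct and function names are assumptions, not code from that follow-up): the VIRTIO_BLK_F_FLUSH feature bit (bit 9 in the virtio spec) is only advertised when the drive is configured with want_flush, so the guest never issues flush requests otherwise.

```rust
// Sketch only: field and function names here are assumptions.
const VIRTIO_BLK_F_FLUSH: u64 = 9; // feature bit defined by the virtio spec

struct BlockConfig {
    /// When false, the device does not advertise flush support and the
    /// guest will not send VIRTIO_BLK_T_FLUSH requests.
    want_flush: bool,
}

fn avail_features(cfg: &BlockConfig) -> u64 {
    let mut features = 0u64;
    if cfg.want_flush {
        features |= 1u64 << VIRTIO_BLK_F_FLUSH;
    }
    features
}

fn main() {
    let cfg = BlockConfig { want_flush: false };
    assert_eq!(avail_features(&cfg), 0); // no flush advertised by default
}
```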
A new pull request at #2223 has been created that keeps the existing behaviour (no sync to disk), which should address the boot time performance issues.
When the virtio block device receives a flush (VIRTIO_BLK_T_FLUSH)
it must ensure that the data in the host page cache is actually
flushed out to media. Otherwise, data can be lost in the event of
a power fail.
This commit changes the underlying request from a flush() to a
sync_all(), forcing the data to be flushed to disk. This has
been verified by running a 'fio' test with sync_on_close in the
VM and verifying via 'perf' that fsync() on the host is actually
invoked. The bandwidth in the VM is also now in line with the
underlying hardware.
The invocation of sync_all() has been refactored to be in accordance
with other function invocations in the request handler.
This change also requires that the sys_fsync system call is in the
list of approved seccomp calls.
Signed-off-by: Peter Lawthers [email protected]
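Regarding the seccomp note above, a sketch for illustration only (a plain syscall-number allowlist using the libc crate on x86_64 Linux; Firecracker's actual seccomp filter format is not reproduced here):

```rust
// Sketch only: the real filter is built with Firecracker's seccomp tooling.
fn allowed_syscalls() -> Vec<libc::c_long> {
    vec![
        libc::SYS_read,
        libc::SYS_write,
        libc::SYS_fsync, // newly required because flush now issues fsync(2)
    ]
}

fn main() {
    println!("allowing {} syscalls", allowed_syscalls().len());
}
```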
Reason for This PR
[Author TODO: add issue # or explain reasoning.]
Description of Changes
[Author TODO: add description of changes.]
This functionality can be added in rust-vmm.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.

PR Checklist

[Author TODO: Meet these criteria.]
[Reviewer TODO: Verify that these criteria are met. Request changes if not.]

- All commits in this PR are signed (git commit -s).
- Any newly added unsafe code is properly documented.
- Any API changes are reflected in firecracker/swagger.yaml.
- Any user-facing changes are mentioned in CHANGELOG.md.