virtio: sync data on virtio flush request #2185

Closed
pclesr wants to merge 2 commits

Conversation


@pclesr pclesr commented Oct 15, 2020

When the virtio block device receives a flush (VIRTIO_BLK_T_FLUSH)
it must ensure that the data in the host page cache is actually
flushed out to media. Otherwise, data can be lost in the event of
a power failure.

This commit changes the underlying request from a flush() to a
sync_all(), forcing the data to be flushed to disk. This has
been verified by running a 'fio' test with sync_on_close in the
VM and verifying via 'perf' that fsync() on the host is actually
invoked. The bandwidth in the VM is also now in line with the
underlying hardware.
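
For reference (not part of the patch), here is a minimal standalone sketch of why the two calls behave differently for a std::fs::File; the backing-file path below is made up:

use std::fs::OpenOptions;
use std::io::Write;

fn main() -> std::io::Result<()> {
    let mut diskfile = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open("/tmp/example-backing-file.img")?; // hypothetical backing file

    diskfile.write_all(b"guest data")?;

    // flush() only drains user-space buffers; for a raw std::fs::File it is
    // effectively a no-op and issues no syscall, so the written data can sit
    // in the host page cache indefinitely.
    diskfile.flush()?;

    // sync_all() calls fsync(2), forcing data and metadata out of the host
    // page cache and onto the underlying device.
    diskfile.sync_all()?;

    Ok(())
}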

The invocation of sync_all() has been refactored to be in accordance
with other function invocations in the request handler.

This change also requires that the sys_fsync system call is in the
list of approved seccomp calls.

Signed-off-by: Peter Lawthers [email protected]

Reason for This PR

[Author TODO: add issue # or explain reasoning.]

Description of Changes

[Author TODO: add description of changes.]

  • This functionality can be added in rust-vmm.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.

PR Checklist

[Author TODO: Meet these criteria.]
[Reviewer TODO: Verify that these criteria are met. Request changes if not]

  • All commits in this PR are signed (git commit -s).
  • The reason for this PR is clearly provided (issue no. or explanation).
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this PR.
  • Any newly added unsafe code is properly documented.
  • Any API changes are reflected in firecracker/swagger.yaml.
  • Any user-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.

@iulianbarbu iulianbarbu left a comment

Can we also have an integration test for this?

RequestType::Flush => {
    diskfile.sync_all().map_err(ExecuteError::SyncAll)?;
    METRICS.block.flush_count.inc();
    return Ok(0);
}

@iulianbarbu iulianbarbu

Is this needed?

@pclesr pclesr Oct 15, 2020


Is this needed?

What exactly are you referring to? The code itself, or the question about the integration test?

I have not added any additional tests yet, as this "should" already be covered by the existing test framework. I will look at adding a specific test for data loss between the VM and the host when the virtio flush fails or is non-existent.

@iulianbarbu iulianbarbu

Oh, no, I was referring to that return statement. We return Ok(0) anyway at the end.

@iulianbarbu iulianbarbu Oct 15, 2020


Not sure if this helps - see the Functional Approach/Writing section - but here is an example of how we can force flushing to disk within an integration test. Not sure, though, whether it can be used as a regression test.

@pclesr pclesr

Sorry, I did not understand. Yes, you are correct that is not needed. I will remove it.

@pclesr pclesr

With regard to a functional test, it will be a bit tricky. What has to happen is that, without the sync, we read the data on the host from the device itself, bypassing the page cache, and verify that it is different from what was written in the guest. I don't know if O_DIRECT on an ext4 filesystem will cause dirty pages to be flushed before accessing, but that might be one way.

I think madvise(MADV_REMOVE) will flush dirty pages before punching a hole, so that most likely will not work. madvise(MADV_FREE) might work if it really drops dirty pages without writing; then the data could be refreshed from the physical media and compared to some known pattern.
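
As a rough illustration of the O_DIRECT idea (a sketch only, assuming the libc crate; the device path and block size are made up), the host-side check could look something like this:

use std::fs::OpenOptions;
use std::io::Read;
use std::os::unix::fs::OpenOptionsExt;

const BLOCK_SIZE: usize = 4096; // assumed logical block size

fn main() -> std::io::Result<()> {
    // O_DIRECT bypasses the host page cache, so the read reflects what the
    // kernel hands back from the media path rather than cached guest writes.
    // Whether ext4 writes back dirty pages first is the open question above.
    let mut dev = OpenOptions::new()
        .read(true)
        .custom_flags(libc::O_DIRECT)
        .open("/dev/vdX")?; // hypothetical backing device or file

    // O_DIRECT requires the buffer, offset, and length to be block-aligned.
    let layout = std::alloc::Layout::from_size_align(BLOCK_SIZE, BLOCK_SIZE).unwrap();
    let ptr = unsafe { std::alloc::alloc_zeroed(layout) };
    assert!(!ptr.is_null());
    let buf = unsafe { std::slice::from_raw_parts_mut(ptr, BLOCK_SIZE) };

    dev.read_exact(buf)?;
    // ... compare buf against the pattern the guest claims to have written ...

    unsafe { std::alloc::dealloc(ptr, layout) };
    Ok(())
}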

@acatangiu acatangiu left a comment


@pclesr thanks for the PR, the fix looks good! I would also like us to add a regression test - obviously our current tests missed this.

Let us know if you can help here. Maybe the fio scenario you found this with could be turned into a test? Any other test approach that fails before the fix and passes after it would also work.

@acatangiu

Hmm, it seems this increases boot-time, are there many VIRTIO_BLK_T_FLUSH requests during boot?

@pclesr pclesr commented Oct 15, 2020

Hmm, it seems this increases boot-time, are there many VIRTIO_BLK_T_FLUSH requests during boot?

There are a few -- here are the metrics from booting a minimal Alpine-based VM:

  "block": {
    "activate_fails": 0,
    "cfg_fails": 0,
    "no_avail_buffer": 15,
    "event_fails": 0,
    "execute_fails": 0,
    "invalid_reqs_count": 0,
    "flush_count": 6,
    "queue_event_count": 69,
    "rate_limiter_event_count": 0,
    "update_count": 0,
    "update_fails": 0,
    "read_bytes": 1463296,
    "write_bytes": 14336,
    "read_count": 278,
    "write_count": 14,
    "rate_limiter_throttled_events": 0
  },

@pclesr pclesr commented Oct 15, 2020

Since boot time is critical, is a mechanism needed so that the syncs are only done in a certain mode? In other words, the cache policy would be write-back by default and could be switched to write-through if desired.

Also, in my example metrics, my root disk was obviously not read-only in my config.json.

@sandreim sandreim commented Oct 15, 2020

@pclesr thanks, this looks like a good fix.

I completely agree that the current implementation breaks the virtio spec since it does not fsync() on VIRTIO_BLK_T_FLUSH but still offers VIRTIO_BLK_F_FLUSH.

I think there are two options here:

  • the solution from this PR
  • not offering VIRTIO_BLK_F_FLUSH and making a statement that the device is not resilient to data loss in case of power failures.

The first looks like a clean cut to the heart of the technical issue, but I am concerned about the boot-time degradation from just 6 fsync() calls. I think the right approach is to run disk I/O performance test workloads against the current implementation and against this PR. Firecracker has a very simple device model with just a single emulation thread for all virtio devices. Since fsync() blocks until the data is flushed to the hardware, each call would stall emulation for all virtio devices, leading to high latencies and jitter in I/O-bound workloads.

The second would be an even simpler fix, but we still need to make sure that this is the right trade-off for Firecracker's mission.

To summarize, I think we need more data on performance and data loss impact before we can commit to a fix.

@pclesr pclesr commented Oct 15, 2020

@sandreim thank you for your comments. I understand your viewpoint. Given the workloads that Firecracker is designed for, I/O is probably not a target priority.

Thank you for the clarification about the thread model -- I did not know that there was only a single thread shared by all virtio devices. That shows that target VMs are not expected to do much, if any, I/O.

I will still work on developing a test that shows how data can be lost in the event of a power fail, since I think it would be helpful (and an interesting challenge).

I will post any additional findings in the issue itself, not in this PR.

Respond to comments in PR
firecracker-microvm#2185

Signed-off-by:  Peter Lawthers <[email protected]>
@pclesr pclesr commented Oct 23, 2020

I have another version that I will post next week that adds an optional field to the drive configuration named "want_flush".

Only drives that have this field configured will advertise VIRTIO_BLK_F_FLUSH, and therefore receive flush commands. I have both Rust tests that verify the configuration sets the appropriate flag in the features vector and Python tests that verify no flushes are received when the flag is not present. I am using the flush_count metric as the indicator of receiving flushes, as it seems to be accurate.

This approach should provide backward compatibility with the existing model and not affect performance. Only those environments that really want to ensure coherency between guest and host will need to set the flag.
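
Conceptually, the gating could look like the sketch below (hypothetical types and field, mirroring the want_flush idea above rather than the actual Firecracker code; VIRTIO_BLK_F_FLUSH is feature bit 9 in the virtio spec):

// Advertise VIRTIO_BLK_F_FLUSH only when the drive config opts in, so a
// guest that never negotiates the feature never sends VIRTIO_BLK_T_FLUSH
// and the device never issues fsync(2).
const VIRTIO_BLK_F_FLUSH: u64 = 9;

struct DriveConfig {
    want_flush: bool, // optional per-drive field proposed above
}

fn avail_features(cfg: &DriveConfig) -> u64 {
    let mut features = 0u64;
    if cfg.want_flush {
        features |= 1u64 << VIRTIO_BLK_F_FLUSH;
    }
    features
}

fn main() {
    let cfg = DriveConfig { want_flush: false };
    // Default behaviour: the flush feature is not offered to the guest.
    assert_eq!(avail_features(&cfg) & (1 << VIRTIO_BLK_F_FLUSH), 0);
}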

@pclesr pclesr closed this Oct 27, 2020
@pclesr pclesr commented Oct 27, 2020

A new pull request, #2223, has been created that keeps the existing behaviour (no sync to disk) and should address the boot-time performance concerns.
