Fix/vsock reset race#5882
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5882 +/- ##
==========================================
+ Coverage 82.88% 82.99% +0.11%
==========================================
Files 277 277
Lines 30086 30106 +20
==========================================
+ Hits 24937 24987 +50
+ Misses 5149 5119 -30
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
a16431d to
d7a3b0b
Compare
Manciukic
left a comment
There was a problem hiding this comment.
mostly LGTM, just a few minor issues.
d2bec53 to
afcfd09
Compare
e6a4d24 to
54173cb
Compare
Manciukic
left a comment
There was a problem hiding this comment.
LGTM, just one clarification needed on the enabling of notifications on the evq.
25f0b12 to
0cba3a6
Compare
JamesC1305
left a comment
There was a problem hiding this comment.
Few minor comments. The potential for unsoundness one is the one I'd like to clarify. The other two are just a suggestion and a nit (neither of which are blocking)
Add `__enter__`/`__exit__` to `HostEchoWorker` so callers can register workers with `contextlib.ExitStack` and have their UDS sockets closed deterministically on any exit path - including assertion failures and exceptions raised mid-loop. The exit hook joins the worker thread if still alive before closing the socket. Convert the existing call sites in `check_host_connections` and `test_vsock.py` to the `ExitStack` pattern. No behavior change on the happy path; the only observable difference is that a failure no longer leaks UDS sockets (and the `socat` echo server file descriptors they hold open inside the guest). Signed-off-by: Jack Thomson <jackabt@amazon.com>
Drop the leading underscore from `_vsock_connect_to_guest` so other test modules can call it directly to open a vsock connection to the guest echo server. Also harden the helper to close the socket if either `connect()` or the CONNECT/ACK handshake raises - so callers never receive a half-open descriptor that they then have to remember to close on error paths. The returned socket is still a plain `socket.socket`, so it works as a context manager out of the box and can be registered with `contextlib.ExitStack` for deterministic cleanup. Signed-off-by: Jack Thomson <jackabt@amazon.com>
`test_vsock_transport_reset_g2h` spawns a host-side `socat` listener with `subprocess.Popen` and kills it at the end of each loop iteration. If any of the assertions or API calls between `Popen` and the cleanup raise, the loop variable rebinds on the next iteration and the previous `socat` is orphaned - it keeps the listener bound to the chroot socket path and is reported by the post-test process leak check, masking the real assertion failure. Wrap the iteration body in `try/finally` so the `host_socat.kill()` and `new_vm.kill()` calls always run, even on failure. No happy-path behavior change. Signed-off-by: Jack Thomson <jackabt@amazon.com>
Add a method to unconditionally arm `avail_event` at the driver's
current `avail.idx`, so the next driver publish to the avail ring is
guaranteed to produce a notification when EVENT_IDX is negotiated.
This complements the existing `try_enable_notification`, which is
designed for the drain-then-arm pattern used by descriptor-consuming
queues: it only arms when the queue is empty, so it can pair with a
subsequent pop of any descriptors the driver added concurrently.
Some queues are not normally drained by the device — the vsock event
virtqueue is the motivating case: the device only writes to it, and
the driver replenishes descriptors as the device consumes them. There
the next driver publish, not the next-after-drain one, is what the
device must learn about. `try_enable_notification` does not fit:
* it returns early when `len() != 0`, leaving `avail_event`
unchanged, and
* it sets `avail_event = next_avail`, which on a partially-popped
queue is behind `avail.idx` and would suppress the upcoming
publish anyway.
The new method skips the drain check and writes the current
`avail.idx` directly. A Release fence orders the avail_event store
before any subsequent used-ring update or interrupt the caller
delivers; the caller is responsible for ensuring the driver does not
add to the avail ring until it has observed that update, since this
path does not recheck `avail.idx`.
Signed-off-by: Jack Thomson <jackabt@amazon.com>
The event virtqueue is only ever written by the device; avail descriptors are never popped except for an event publish, so `avail_event` is never advanced through the normal `pop_or_enable_notification` path. With VIRTIO_RING_F_EVENT_IDX, that leaves `avail_event = 0` and any guest kick after `avail.idx` first exceeds 1 evaluates `vring_need_event(0, new, old)` to false and is suppressed. The pre-existing logic does not depend on receiving the post-event refill kick, so the suppression has no observable effect today. The next commit gates RX delivery on that kick, at which point the suppression becomes a deadlock. Fix it at the source: after publishing the TRANSPORT_RESET event into the used ring, call the new `Queue::enable_notification` to set `avail_event` to the current `avail.idx` so the driver's refill of the consumed head reliably notifies the host. Per virtio 1.2 §5.10.6.3, the driver SHOULD replenish event virtqueue buffers promptly, so we expect this notification to arrive shortly. Signed-off-by: Jack Thomson <jackabt@amazon.com>
After a snapshot restore, `kick()` signals the event virtqueue so the
guest driver processes the queued VIRTIO_VSOCK_EVENT_TRANSPORT_RESET
event, which iterates the connected-socket list and resets every
transport-bound socket. With virtio-pci + MSI-X each virtqueue has its
own interrupt vector, so the RX vq IRQ and EVQ IRQ can be delivered to
different vCPUs and the guest's `virtio_transport_rx_work` and
`virtio_transport_event_work` end up running concurrently on different
per-CPU workers.
That permits the following interleaving:
rx_work: recv_listen
-> vsock_insert_connected(child) // takes lock
event_work: vsock_for_each_connected_socket(reset) // takes lock,
// sees the
// child
The child is set to TCP_CLOSE/ECONNRESET *after* the virtio handshake
has already returned VSOCK_OP_RESPONSE to the host. The host then sends
data on what it believes is an established connection, the guest looks
up the (now closed) child and replies with VSOCK_OP_RST -- the symptom
seen in production. With virtio-mmio there is a single shared interrupt
so both work_structs queue onto the same per-CPU worker and run
strictly in FIFO order, hiding the race.
Close the window from the device side: while a TRANSPORT_RESET is
in-flight, do not deliver any RX packets. The guest driver always
re-arms the event vq (and so issues a virtqueue_kick) at the end of
`virtio_transport_event_work`, after the connected-socket sweep is
complete. We use that kick as the "event handled" signal: clear the
gate in `handle_evq_event` and drain any RX that accumulated in the
muxer while the gate was up.
Signed-off-by: Jack Thomson <jackabt@amazon.com>
Add `test_vsock_post_restore_connect_storm`, a regression test for the race fixed in the previous commit. Without the fix, on virtio-pci + MSI-X the guest's `virtio_transport_rx_work` and `virtio_transport_event_work` can run on different vCPUs; a fresh CONNECT inserted into the connected-socket table just before the TRANSPORT_RESET sweep iterates it gets reset to TCP_CLOSE/ECONNRESET *after* the device has already returned OP_RESPONSE to the host. The host then either sees a truncated read or a ConnectionResetError on a connection it just opened. Signed-off-by: Jack Thomson <jackabt@amazon.com>
Add unit tests for the vsock device testing the new behaviour with the locking on reset. Signed-off-by: Jack Thomson <jackabt@amazon.com>
0cba3a6 to
f99114c
Compare
Add changelog item for the vsock reset race fix. Signed-off-by: Jack Thomson <jackabt@amazon.com>
micz010
left a comment
There was a problem hiding this comment.
Approved, reviewed only the changelog change
There's a certain race which can happen on vsock on resume, with a storm of connections going through which can arrive before the device reset has completed leading to dropped connections.
Update FC to now block these pending connections until the guest has acked that the reset has been completed.
We were allocating an empty vec every call anyway, so no new allocations added other than when we fill it
Reason
...
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.PR Checklist
tools/devtool checkbuild --allto verify that the PR passesbuild checks on all supported architectures.
tools/devtool checkstyleto verify that the PR passes theautomated style checks.
how they are solving the problem in a clear and encompassing way.
in the PR.
CHANGELOG.md.Runbook for Firecracker API changes.
integration tests.
TODO.rust-vmm.