
[SCTP] Association lock contention #363


Merged
9 commits merged on Jan 2, 2023

Conversation

KillingSpark (Contributor)

As discussed in #360, the lock on the internal association is heavily contended in high send-bandwidth situations. This PR achieves two things:

  1. Pull the marshalling of packets out of the critical section, reducing the time the write loop holds the lock.
  2. Spawn the marshalling and sending as a new tokio task, which lets tokio schedule it on another thread so the read loop can make progress in parallel while the write loop is doing that work.

This in itself does not really increase the bandwidth, but with improvements to the marshalling code itself, gains can now be had that were previously (at least partially) blocked by the lock contention.

It should also improve situations where both sides send a lot of data, because then the write and read loops are both very busy and fight over the lock even more.
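
A minimal sketch of the intended write-loop shape is below; the types and helpers are simplified stand-ins rather than the actual code in this PR:

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

// Stand-in types; the real crate has its own association internals and packet type.
struct AssociationInternal;
struct Packet;

impl AssociationInternal {
    async fn gather_outbound(&mut self) -> Vec<Packet> {
        Vec::new() // placeholder: would collect the chunks that are ready to go out
    }
}

impl Packet {
    fn marshal(&self) -> Vec<u8> {
        Vec::new() // placeholder: would serialize the packet
    }
}

async fn send_raw(_raw: &[u8]) {
    // placeholder: would write to the underlying transport
}

async fn write_loop_iteration(association_internal: Arc<Mutex<AssociationInternal>>) {
    // 1. Critical section: only gather packets, do not marshal them while holding the lock.
    let packets = {
        let mut ai = association_internal.lock().await;
        ai.gather_outbound().await
    }; // the association lock is released here

    // 2. Marshalling and sending run in a spawned task, which tokio can put on another
    //    worker thread while the read loop re-acquires the lock.
    tokio::spawn(async move {
        for packet in packets {
            let raw = packet.marshal();
            send_raw(&raw).await;
        }
    });
}
```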


codecov bot commented Dec 14, 2022

Codecov Report

Base: 60.18% // Head: 59.99% // Decreases project coverage by 0.18% ⚠️

Coverage data is based on head (afbd0bf) compared to base (20b59b7).
Patch coverage: 0.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #363      +/-   ##
==========================================
- Coverage   60.18%   59.99%   -0.19%     
==========================================
  Files         503      504       +1     
  Lines       47983    48073      +90     
  Branches    12509    12527      +18     
==========================================
- Hits        28879    28842      -37     
- Misses       9913    10009      +96     
- Partials     9191     9222      +31     
| Impacted Files | Coverage Δ |
|---|---|
| sctp/examples/throughput.rs | 0.00% <0.00%> (ø) |
| media/src/track/constraint/numeric.rs | 80.45% <0.00%> (-6.90%) ⬇️ |
| data/src/message/message_channel_open.rs | 79.51% <0.00%> (-6.03%) ⬇️ |
| rtp/src/header.rs | 64.43% <0.00%> (-4.64%) ⬇️ |
| webrtc/src/mux/mux_func.rs | 75.00% <0.00%> (-3.58%) ⬇️ |
| sctp/src/timer/rtx_timer.rs | 80.00% <0.00%> (-3.52%) ⬇️ |
| rtcp/src/header.rs | 63.93% <0.00%> (-3.28%) ⬇️ |
| sdp/src/lexer/mod.rs | 66.66% <0.00%> (-3.04%) ⬇️ |
| ice/src/agent/agent_gather_test.rs | 62.88% <0.00%> (-1.54%) ⬇️ |
| turn/src/client/transaction.rs | 63.39% <0.00%> (-1.31%) ⬇️ |
| ... and 10 more | |


@KillingSpark (Contributor, Author)

I really don't get how tokio schedules things... I wanted to make the sending a third long-running loop, but that just results in all three loops occupying the same thread again. I would really like a cleaner solution than spawning a new task this frequently. Maybe someone with more tokio knowledge can help here.

But for now this seems to be an improvement performance-wise :)

KillingSpark marked this pull request as draft on December 16, 2022, 19:55
@KillingSpark (Contributor, Author)

Converted to draft. I just tried this out at home, and here tokio's scheduling is even worse with this PR.

I think I'll need to figure out a way to get these tasks to cooperate better; this is apparently not a real solution :(

KillingSpark marked this pull request as ready for review on December 18, 2022, 11:51
@KillingSpark (Contributor, Author) commented Dec 19, 2022

I don't really get why this test fails now. I can get it to succeed by limiting the semaphore to 1 permit, which is essentially the same as just awaiting the spawned task.

Maybe the test setup doesn't drive these tasks as fast as one would wish, and that's why the association ends up in an unexpected state? The failure isn't always on the same check, so something non-deterministic is happening...

Println debugging to the rescue: the culprit is that this PR introduces random reordering of packets, because (of course) the spawned tasks run concurrently/in parallel. This messes up the test.

This raises the question: should it mess up the test? I'll need to look into how reordering is supposed to affect the SCTP state machine and protocol parameters. If reordering is fine, we could limit the semaphore to 1 permit in the test scenario and let it have up to X permits in real scenarios to maximize performance?
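
For illustration, here is a sketch of that permit idea, assuming a `tokio::sync::Semaphore` gates the spawned send tasks; `MAX_INFLIGHT_SENDS` and `marshal_and_send` are made-up names, not code from this PR:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical cap on concurrent marshal/send tasks (the "X permits" above).
// With 1 permit the behaviour degenerates to the old sequential order (no
// reordering); with more permits throughput rises, but packets can be
// reordered because the spawned tasks race each other.
const MAX_INFLIGHT_SENDS: usize = 8;

fn new_send_semaphore() -> Arc<Semaphore> {
    Arc::new(Semaphore::new(MAX_INFLIGHT_SENDS))
}

async fn spawn_send(sem: Arc<Semaphore>, packets: Vec<Vec<u8>>) {
    // Wait for a free slot before spawning, so at most MAX_INFLIGHT_SENDS
    // send tasks are in flight at any time.
    let permit = sem
        .clone()
        .acquire_owned()
        .await
        .expect("semaphore closed");

    tokio::spawn(async move {
        marshal_and_send(packets).await;
        drop(permit); // release the slot once this batch is on the wire
    });
}

async fn marshal_and_send(_packets: Vec<Vec<u8>>) {
    // placeholder for marshalling each packet and writing it to the transport
}
```

With this shape, running the tests with a 1-permit semaphore while real scenarios use a larger value would only be a matter of how the semaphore is constructed.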

    //log::debug!("[{}] gather_outbound begin", name);
-   let (raw_packets, mut ok) = {
+   let (packets, continue_loop) = {
        let mut ai = association_internal.lock().await;
Contributor


I feel like there has to be a way for us to remove this lock completely... we just need a bounded channel or lock-free queue to add and remove items. No need to lock the entire association.

@KillingSpark (Contributor, Author) commented Dec 21, 2022

I agree, and it doesn't even need to be bounded: the number of packets that can sit in this channel is naturally limited by the number of bytes the normal SCTP mechanisms allow to be in flight.

I just didn't want to change the architecture too much.
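
To illustrate that direction, here is a sketch of an outbound queue built on `tokio::sync::mpsc` (made-up types, not a concrete proposal for this crate's API): the association side pushes packets through the sender, and the write loop drains the receiver without ever taking the association lock.

```rust
use tokio::sync::mpsc;

// Stand-in for the crate's packet/chunk type.
struct Packet(Vec<u8>);

// Created once during association setup; the sender lives inside the
// association, the receiver is handed to the write loop.
fn new_outbound_queue() -> (mpsc::UnboundedSender<Packet>, mpsc::UnboundedReceiver<Packet>) {
    // Unbounded is acceptable here because the amount of queued data is already
    // limited by SCTP's own in-flight / congestion-control accounting.
    mpsc::unbounded_channel()
}

// Association side: enqueue a packet as soon as it is ready to go out.
fn queue_outbound(tx: &mpsc::UnboundedSender<Packet>, packet: Packet) {
    let _ = tx.send(packet);
}

// Write loop: no association lock needed, just drain the queue.
async fn write_loop(mut rx: mpsc::UnboundedReceiver<Packet>) {
    while let Some(packet) = rx.recv().await {
        // marshalling + sending would happen here, fully outside any lock
        let _raw = packet.0;
    }
}
```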

@melekes (Contributor) left a comment

👍
