
[SCTP] Association lock contention #363


Merged
9 commits merged on Jan 2, 2023

Conversation

KillingSpark (Contributor)

As discussed in #360, the lock on the internal association is heavily contended in high send-bandwidth situations. This PR achieves two things:

  1. Pull the marshalling of packets out of the critical section, reducing the time the write loop holds the lock.
  2. Spawn the marshalling and sending as a new tokio task, which lets tokio schedule it on another thread so the read loop can make progress in parallel while the write loop is doing that work.

This in itself does not really increase the bandwidth, but with improvements to the marshalling code itself, gains can now be had that were previously (at least partially) blocked by the lock contention.

It should also improve situations where both sides send a lot of data, because then the write and read loops are both very busy and fight over the lock even more.
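
A minimal sketch of the intended write-loop shape is below; the types and helpers are simplified stand-ins rather than the actual code in this PR:

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

// Stand-in types; the real crate has its own association internals and packet type.
struct AssociationInternal;
struct Packet;

impl AssociationInternal {
    async fn gather_outbound(&mut self) -> Vec<Packet> {
        Vec::new() // placeholder: would collect the chunks that are ready to go out
    }
}

impl Packet {
    fn marshal(&self) -> Vec<u8> {
        Vec::new() // placeholder: would serialize the packet
    }
}

async fn send_raw(_raw: &[u8]) {
    // placeholder: would write to the underlying transport
}

async fn write_loop_iteration(association_internal: Arc<Mutex<AssociationInternal>>) {
    // 1. Critical section: only gather packets, do not marshal them while holding the lock.
    let packets = {
        let mut ai = association_internal.lock().await;
        ai.gather_outbound().await
    }; // the association lock is released here

    // 2. Marshalling and sending run in a spawned task, which tokio can put on another
    //    worker thread while the read loop re-acquires the lock.
    tokio::spawn(async move {
        for packet in packets {
            let raw = packet.marshal();
            send_raw(&raw).await;
        }
    });
}
```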


codecov bot commented Dec 14, 2022

Codecov Report

Base: 60.18% // Head: 59.99% // Decreases project coverage by 0.18% ⚠️

Coverage data is based on head (afbd0bf) compared to base (20b59b7).
Patch coverage: 0.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #363      +/-   ##
==========================================
- Coverage   60.18%   59.99%   -0.19%     
==========================================
  Files         503      504       +1     
  Lines       47983    48073      +90     
  Branches    12509    12527      +18     
==========================================
- Hits        28879    28842      -37     
- Misses       9913    10009      +96     
- Partials     9191     9222      +31     
| Impacted Files | Coverage Δ |
|---|---|
| sctp/examples/throughput.rs | 0.00% <0.00%> (ø) |
| media/src/track/constraint/numeric.rs | 80.45% <0.00%> (-6.90%) ⬇️ |
| data/src/message/message_channel_open.rs | 79.51% <0.00%> (-6.03%) ⬇️ |
| rtp/src/header.rs | 64.43% <0.00%> (-4.64%) ⬇️ |
| webrtc/src/mux/mux_func.rs | 75.00% <0.00%> (-3.58%) ⬇️ |
| sctp/src/timer/rtx_timer.rs | 80.00% <0.00%> (-3.52%) ⬇️ |
| rtcp/src/header.rs | 63.93% <0.00%> (-3.28%) ⬇️ |
| sdp/src/lexer/mod.rs | 66.66% <0.00%> (-3.04%) ⬇️ |
| ice/src/agent/agent_gather_test.rs | 62.88% <0.00%> (-1.54%) ⬇️ |
| turn/src/client/transaction.rs | 63.39% <0.00%> (-1.31%) ⬇️ |
| ... and 10 more | |


@KillingSpark (Contributor, Author)

I really don't get how tokio schedules things... I wanted to make the sending a third long-running loop, but that just results in all three loops occupying the same thread again. I would really like a cleaner solution than spawning a new task this frequently. Maybe someone with more tokio knowledge can help here.

But for now this seems to be an improvement performance-wise :)

KillingSpark marked this pull request as draft on December 16, 2022, 19:55
@KillingSpark (Contributor, Author)

Converted to draft. I just tried this out at home, and here tokio's scheduling is even worse with this PR.

I think I'll need to figure out a way to get these tasks to cooperate better; this is apparently not a real solution :(

KillingSpark marked this pull request as ready for review on December 18, 2022, 11:51
@KillingSpark (Contributor, Author) commented Dec 19, 2022

I don't really get why this test fails now. I can get it to succeed by limiting the semaphore to 1 permit, which is essentially the same as just awaiting the spawned task.

Maybe the test setup doesn't drive these tasks as fast as one would wish, and that's why the association ends up in an unexpected state? The failure isn't always on the same check, so something non-deterministic is happening...

Println debugging to the rescue: the culprit is that this PR introduces random reordering of packets, because (of course) the spawned tasks run concurrently/in parallel. This messes up the test.

This raises the question: should it mess up the test? I'll need to look into how reordering is supposed to affect the SCTP state machine and protocol parameters. If reordering is fine, we could limit the semaphore to 1 permit in the test scenario and let it have up to X permits in real scenarios to maximize performance?
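
For illustration, here is a sketch of that permit idea, assuming a `tokio::sync::Semaphore` gates the spawned send tasks; `MAX_INFLIGHT_SENDS` and `marshal_and_send` are made-up names, not code from this PR:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical cap on concurrent marshal/send tasks (the "X permits" above).
// With 1 permit the behaviour degenerates to the old sequential order (no
// reordering); with more permits throughput rises, but packets can be
// reordered because the spawned tasks race each other.
const MAX_INFLIGHT_SENDS: usize = 8;

fn new_send_semaphore() -> Arc<Semaphore> {
    Arc::new(Semaphore::new(MAX_INFLIGHT_SENDS))
}

async fn spawn_send(sem: Arc<Semaphore>, packets: Vec<Vec<u8>>) {
    // Wait for a free slot before spawning, so at most MAX_INFLIGHT_SENDS
    // send tasks are in flight at any time.
    let permit = sem
        .clone()
        .acquire_owned()
        .await
        .expect("semaphore closed");

    tokio::spawn(async move {
        marshal_and_send(packets).await;
        drop(permit); // release the slot once this batch is on the wire
    });
}

async fn marshal_and_send(_packets: Vec<Vec<u8>>) {
    // placeholder for marshalling each packet and writing it to the transport
}
```

With this shape, running the tests with a 1-permit semaphore while real scenarios use a larger value would only be a matter of how the semaphore is constructed.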

    //log::debug!("[{}] gather_outbound begin", name);
-   let (raw_packets, mut ok) = {
+   let (packets, continue_loop) = {
        let mut ai = association_internal.lock().await;
Contributor


I feel like there has to be a way for us to remove this lock completely... we just need a bounded channel or lock-free queue to add and remove items. No need to lock the entire association.

@KillingSpark (Contributor, Author) commented Dec 21, 2022

I agree, and it doesn't even need to be bounded: the number of packets that can sit in this channel is naturally limited by the number of bytes the normal SCTP mechanisms allow to be in flight.

I just didn't want to change the architecture too much.
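
To illustrate that direction, here is a sketch of an outbound queue built on `tokio::sync::mpsc` (made-up types, not a concrete proposal for this crate's API): the association side pushes packets through the sender, and the write loop drains the receiver without ever taking the association lock.

```rust
use tokio::sync::mpsc;

// Stand-in for the crate's packet/chunk type.
struct Packet(Vec<u8>);

// Created once during association setup; the sender lives inside the
// association, the receiver is handed to the write loop.
fn new_outbound_queue() -> (mpsc::UnboundedSender<Packet>, mpsc::UnboundedReceiver<Packet>) {
    // Unbounded is acceptable here because the amount of queued data is already
    // limited by SCTP's own in-flight / congestion-control accounting.
    mpsc::unbounded_channel()
}

// Association side: enqueue a packet as soon as it is ready to go out.
fn queue_outbound(tx: &mpsc::UnboundedSender<Packet>, packet: Packet) {
    let _ = tx.send(packet);
}

// Write loop: no association lock needed, just drain the queue.
async fn write_loop(mut rx: mpsc::UnboundedReceiver<Packet>) {
    while let Some(packet) = rx.recv().await {
        // marshalling + sending would happen here, fully outside any lock
        let _raw = packet.0;
    }
}
```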

@melekes (Contributor) left a comment

👍
