The bug
While looking at HyStart++ performance I saw that most of our connections exit slow start via loss with a very low cwnd (75th percentile at 25k, ca. 2x initial cwnd), i.e. one where no heuristic could've ever kicked in because the exit is so early.
https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_slow_start_exit_cwnd/explore (this only records connections that grow past the initial cwnd, i.e. are NOT app-limited at some point)
Looking further, ca. 18% of our connections never grow their cwnd BUT still exit slow start. This doesn't necessarily have an impact (because the connection didn't send much anyway), but it points at something being wrong. We see loss despite never leaving the app-limited state, meaning this can't really be loss due to congestion.
https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_congestion_window_growth/explore?activeBuckets=%5B%22no_growth%22%2C%22no_growth_but_exit%22%2C%22had_growth%22%2C%22no_growth_then_exit_then_growth%22%5D (no_growth_but_exit label)
Fix and remaining issues
I already identified one bug in #3520. We were triggering congestion events on non-ack-eliciting "lost" packets. This should hopefully make a big difference, but I wanted to open this issue to track progress and impact beyond that PR.
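To illustrate the kind of check involved, here is a minimal sketch. It is not the actual code from #3520; the `LostPacket` type and the `should_trigger_congestion_event` function are hypothetical stand-ins. The idea is simply that a batch of "lost" packets only counts as a congestion signal if at least one of them was ack-eliciting.

```rust
/// Hypothetical, simplified stand-in for a sent-packet record.
/// The real type in the codebase carries much more state.
#[derive(Debug, Clone, Copy)]
struct LostPacket {
    /// True if the packet carried frames that require an acknowledgement.
    ack_eliciting: bool,
}

/// Decide whether a batch of packets declared lost should be treated as a
/// congestion signal. Packets that are not ack-eliciting (e.g. pure ACKs)
/// are ignored, so a batch consisting only of such packets never triggers
/// a congestion event.
fn should_trigger_congestion_event(lost: &[LostPacket]) -> bool {
    lost.iter().any(|p| p.ack_eliciting)
}
```

With this check, a "lost" pure-ACK packet on an otherwise idle connection no longer forces a slow start exit.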
While test-driving said patch (I'm writing this issue from a Firefox build including it) I couldn't observe those very low cwnd exits anymore. What I am still seeing is some no_growth_but_exit, though at a much lower ratio than in live telemetry (2-3% on my current instance).
I was able to get a log of those and dug into it. As far as I can see, those are all real losses on the network, most likely just random wifi loss.
Strategy suggestion
Right now I suggest shipping with #3520 and keeping an eye on telemetry. I expect both metrics above to improve. Maybe we can even see an impact in loss_ratio, but because the fixed losses are often just a single packet, I suspect that won't have much of an impact compared to the bulk losses we see on real congestion. I don't expect an impact in higher-level aggregated telemetry, because only a small subset of our connections ever experiences growth at all.
I'm thinking about adding another probe, congestion_event_while_underutilized, capturing congestion events while bytes in flight < cwnd/10 or something like that. Presumably a loss with so little in flight is not caused by actual congestion, so if we see a lot of those it might be worthwhile to think about doing something about it. E.g. quiche has a heuristic where they roll back losses (similar to spurious recovery) when only a few packets were lost, under the assumption that random loss is always just a few packets while real congestion loss is many packets in bulk. Thoughts?
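A rough sketch of what the probe condition could look like, under stated assumptions: the 1/10 divisor is only the placeholder value floated above, and `is_underutilized` and `CongestionEventCounts` are hypothetical names for illustration, not existing APIs.

```rust
/// Placeholder divisor taken from the "cwnd/10 or something like that"
/// suggestion; the real threshold would need tuning.
const UNDERUTILIZED_DIVISOR: usize = 10;

/// True if the bytes in flight are below cwnd / UNDERUTILIZED_DIVISOR.
/// Uses integer division, so the threshold is rounded down.
fn is_underutilized(bytes_in_flight: usize, cwnd: usize) -> bool {
    bytes_in_flight < cwnd / UNDERUTILIZED_DIVISOR
}

/// Hypothetical per-connection counters that a telemetry probe could report.
#[derive(Debug, Default, PartialEq, Eq)]
struct CongestionEventCounts {
    /// All congestion events seen on the connection.
    total: u64,
    /// Congestion events that happened while the window was underutilized.
    while_underutilized: u64,
}

impl CongestionEventCounts {
    /// Record one congestion event, given the state at the time of the event.
    fn record(&mut self, bytes_in_flight: usize, cwnd: usize) {
        self.total += 1;
        if is_underutilized(bytes_in_flight, cwnd) {
            self.while_underutilized += 1;
        }
    }
}
```

The ratio of `while_underutilized` to `total` would then indicate how often a loss is unlikely to be caused by real congestion.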