Investigate early slow start exits, random loss and loss while underutilized #3526

## The bug

While looking at HyStart++ performance I saw that **most of our connections exit slow start via loss with a very low cwnd** (75th percentile at 25k, ca. `2x initial cwnd`), i.e. one where no heuristic could ever have kicked in because the exit is so early.

https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_slow_start_exit_cwnd/explore (this only records connections that grow past the initial cwnd, i.e. are NOT app-limited at some point)

Looking further, **ca. 18% of our connections never grow their cwnd BUT exit slow start.** This doesn't necessarily have an impact (because the connection didn't send much anyway), but it points at something being wrong. We see loss despite never leaving the app-limited state, meaning this can't really be loss due to congestion.

https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_congestion_window_growth/explore?activeBuckets=%5B%22no_growth%22%2C%22no_growth_but_exit%22%2C%22had_growth%22%2C%22no_growth_then_exit_then_growth%22%5D (`no_growth_but_exit` label)

## Fix and remaining issues

**I already identified one bug in #3520**. We were triggering congestion events on non-ack-eliciting "lost" packets. **This should hopefully make a big difference, but I wanted to open this issue to track progress and impact beyond that PR.**
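To illustrate the class of bug described above, here is a minimal sketch of a loss handler that only treats a loss as a congestion signal when at least one of the lost packets was ack-eliciting. It is not the actual patch from #3520; the `SentPacket` type and the `should_trigger_congestion_event` function are hypothetical names made up for this example.

```rust
/// Hypothetical, simplified view of a sent packet (not the real type).
#[derive(Debug, Clone, Copy)]
struct SentPacket {
    /// True if the packet carried frames that require an ACK from the peer.
    ack_eliciting: bool,
}

/// Decide whether a batch of packets declared lost should count as a
/// congestion signal. Assumption for this sketch: packets that are not
/// ack-eliciting never count towards a congestion event on their own.
fn should_trigger_congestion_event(lost: &[SentPacket]) -> bool {
    lost.iter().any(|p| p.ack_eliciting)
}
```

With a check like this, a "lost" ACK-only packet on an otherwise idle connection would no longer force a slow start exit.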

While test-driving said patch (I'm writing this issue from a Firefox build including it) I couldn't observe those very low cwnd exits anymore. What I am still seeing is some `no_growth_but_exit`, though at a much lower ratio than in live telemetry (2-3% on my current instance).

I was able to get a log of those and dug into it. As far as I can see, those are all real losses on the network, most likely just random Wi-Fi loss.

## Strategy suggestion

**Right now I suggest shipping with #3520 and keeping an eye on telemetry.** I expect both metrics above to improve. Maybe we can even see an impact in [loss_ratio](https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_loss_ratio/explore?timeHorizon=QUARTER&visiblePercentiles=%5B99.9%2C99%2C95%2C75%2C50%2C25%2C5%5D), but because the fixed losses are often just a single packet I suspect that won't have much of an impact compared to the bulk losses we see on real congestion. I don't expect an impact in higher-level aggregated telemetry because only a small subset of our connections ever experiences growth at all.

**I'm thinking about adding another probe `congestion_event_while_underutilized`** capturing congestion events while `bif < cwnd/10` (with `bif` being bytes in flight) or something like that. Presumably a loss with so little in flight will not be because of actual congestion, so if we see a lot of those it might be worthwhile to think about doing something about that. E.g. quiche has a heuristic where they roll back losses (similar to spurious recovery) when only a few packets were lost, under the assumption that random loss is always just a few packets while real congestion loss is many packets in bulk. **Thoughts?**
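A rough sketch of how the proposed probe could classify congestion events, assuming the `bif < cwnd/10` threshold floated above. All names here (`UNDERUTILIZED_DIVISOR`, `is_underutilized`, `CongestionEventStats`) are hypothetical and only meant to make the idea concrete; the threshold itself is still up for discussion.

```rust
/// Hypothetical threshold: a connection counts as underutilized when
/// bytes in flight are below cwnd / 10.
const UNDERUTILIZED_DIVISOR: usize = 10;

/// True if bytes in flight are below the assumed fraction of the cwnd.
fn is_underutilized(bytes_in_flight: usize, cwnd: usize) -> bool {
    bytes_in_flight < cwnd / UNDERUTILIZED_DIVISOR
}

/// Hypothetical per-connection counters that could feed the probe.
#[derive(Debug, Default)]
struct CongestionEventStats {
    /// All congestion events seen on the connection.
    total: u64,
    /// Congestion events that happened while underutilized.
    while_underutilized: u64,
}

impl CongestionEventStats {
    /// Record one congestion event, given the bytes in flight and the
    /// cwnd observed at the time the loss was detected.
    fn record(&mut self, bytes_in_flight: usize, cwnd: usize) {
        self.total += 1;
        if is_underutilized(bytes_in_flight, cwnd) {
            self.while_underutilized += 1;
        }
    }
}
```

The ratio `while_underutilized / total` would then indicate how often we react to loss that is unlikely to be caused by congestion. Whether bytes in flight should be sampled before or after the lost packets are removed is an open detail this sketch does not settle.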
