
GroupBy "time gap" Issue #844

Closed
@benjchristensen


The groupBy operator has a "time gap" issue when used with subscribeOn and observeOn. The same issue exists in Rx.Net and is described at http://blogs.msdn.com/b/rxteam/archive/2012/06/14/testing-rx-queries-using-virtual-time-scheduling.aspx:

> However, if you introduce asynchrony in the pipeline – e.g. by adding an ObserveOn operator to the mix – you’re effectively introducing a time gap during which we’ve handed out the sequence to you, control has been released on the OnNext channel, but subscription happens at a later point in time, causing you to miss elements. We can’t do any caching of elements because we don’t know when – if ever – someone will subscribe to the inner sequence, so the cache could grow in an unbounded fashion.
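
The time gap can be illustrated without any Rx library. The following is a minimal sketch in plain Java, assuming a hypothetical `HotGroup` class that stands in for a hot inner sequence (it is not the RxJava `GroupedObservable`): a value pushed before the subscriber attaches is silently dropped, which is what happens when observeOn or subscribeOn delays the subscription.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration only; none of these names come from RxJava.
final class TimeGapDemo {

    // A hot source with no buffering: values are pushed whether or not anyone listens.
    static final class HotGroup {
        private Consumer<Integer> subscriber; // null until someone subscribes

        void subscribe(Consumer<Integer> s) {
            subscriber = s;
        }

        // Delivers the value if a subscriber is attached; otherwise the value is lost.
        void onNext(int value) {
            if (subscriber != null) {
                subscriber.accept(value);
            }
        }
    }

    // Emits 1 before the subscription and 2, 3 after it; returns what the subscriber saw.
    static List<Integer> run() {
        HotGroup group = new HotGroup();
        List<Integer> received = new ArrayList<>();
        group.onNext(1);                // emitted during the time gap: dropped
        group.subscribe(received::add); // the late subscription
        group.onNext(2);
        group.onNext(3);
        return received;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [2, 3]
    }
}
```

In this sketch the loss is deterministic because everything runs on one thread. With real schedulers the number of dropped values depends on thread timing, which is why the loss appears random.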

In discussion with @headinthebox I have decided to alter the behavior to remove this "time gap" issue so that non-deterministic data loss does not happen for the common use cases of using observeOn and subscribeOn with GroupedObservables from groupBy.

Why? It is common to want to use observeOn or subscribeOn with GroupedObservable to process different groups in parallel.

It comes with a trade-off though: all GroupedObservable instances emitted by groupBy must be subscribed to, otherwise groupBy will block. The reason is that solving the "time gap" requires one of two things:

a) use unbounded buffering (such as ReplaySubject)
b) block the onNext calls until GroupedObservable is subscribed to and receiving the data

We cannot choose (a), for the reasons given in the Rx.Net blog post: it breaks backpressure, and the buffer could grow until the system fails.
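
The difference between options (a) and (b) can be sketched with plain `java.util.concurrent` types. This is an illustration under stated assumptions, not the groupBy implementation: the class `GapStrategies` and its methods are hypothetical, a `ConcurrentLinkedQueue` stands in for an unbounded replay buffer, and a `SynchronousQueue` stands in for an onNext call that blocks until the group is subscribed to.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only; none of these names come from RxJava.
final class GapStrategies {

    // (a) Unbounded buffering: every value emitted before a subscription is retained.
    static int bufferedWithoutSubscriber(int emissions) {
        ConcurrentLinkedQueue<Integer> buffer = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < emissions; i++) {
            buffer.add(i);
        }
        return buffer.size(); // grows with every emission while nobody subscribes
    }

    // (b) Blocking handoff: the producer's put() cannot return until a consumer takes.
    // Returns {delivered before the consumer arrived, delivered afterwards, value intact}.
    static boolean[] blockingHandoff() {
        SynchronousQueue<Integer> handoff = new SynchronousQueue<>();
        AtomicBoolean delivered = new AtomicBoolean(false);
        Thread producer = new Thread(() -> {
            try {
                handoff.put(42);     // blocks here, like onNext waiting for a subscriber
                delivered.set(true);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        try {
            Thread.sleep(100);                       // give the producer time to reach put()
            boolean deliveredEarly = delivered.get(); // still false: nobody has taken the value
            int value = handoff.take();              // the "subscription" arrives
            producer.join();
            return new boolean[] { deliveredEarly, delivered.get(), value == 42 };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(bufferedWithoutSubscriber(1000)); // prints 1000
        boolean[] r = blockingHandoff();
        System.out.println(r[0] + " " + r[1] + " " + r[2]);  // prints false true true
    }
}
```

In the blocking variant nothing is lost and nothing accumulates, but if the consumer never arrives the producer thread waits forever. That mirrors the requirement that every emitted group be subscribed to.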

In general it is reasonable to expect people to subscribe to all groups, with one exception where skipping groups will be expected to work: using filter.

Here we can special-case filter to be aware of GroupedObservable. It's not decoupled or elegant, but it solves the common problem.

Thus, the options are:

  1. Allow for non-deterministic data loss if observeOn/subscribeOn are used and expect people to learn about this by reading docs.

  2. Behave deterministically when observeOn/subscribeOn are used but block if groups are manually skipped.

The blocking in option 2 seems easier for developers to run into during development and fix than the data loss in option 1, which could show up randomly in production and be difficult to diagnose.
