Skip to content

Improve plan shutdown speed on cancel (improve performance on the cancellation benchmark) #15314

Closed
@carols10cents

Description

@carols10cents

Describe the bug

Sometimes datafusion queries can't be cancelled in a reasonable amount of time because datafusion is doing CPU-bound work for too long without yielding.

Ideally, as documented on ExecutionPlan::execute, datafusion should yield every so often in a manner appropriate for the situation to enable queries to be cancelled.

The current behavior is captured in the cancellation benchmark, and the benchmark shows there are improvements to be made.

To Reproduce

Run the benchmark (and create the needed data) with the default wait time of 100ms (the time after starting a query that the benchmark waits before attempting to cancel) that measures the time between requesting cancellation and the tokio runtime is actually dropped:

./benchmarks/bench.sh run cancellation
[...]
     Running `target/release/dfbench cancellation --iterations 5 --path benchmarks/data/cancellation -o benchmarks/results/main/cancellation.json`
Using 7 files found on disk
Starting to load data into in-memory object store
Done loading data into in-memory object store
in main, sleeping
Starting spawned
Creating logical plan...
Creating physical plan...
Executing physical plan...
Getting results...
cancelling thread
done dropping runtime in 31.251709ms
Iteration 0 cancelled in 31.251709 ms
[...]

These results show datafusion is not yielding regularly; ideally the 31ms would be lower.

Vary the wait time to try cancelling at different points in the query's execution to see that there are likely multiple places that need to be yielding more often:

Waiting 200ms:

./target/release/dfbench cancellation --iterations 5 --path benchmarks/data/cancellation -o benchmarks/results/main/cancellation.json --wait-time 200
Using 7 files found on disk
Starting to load data into in-memory object store
Done loading data into in-memory object store
Starting spawned
in main, sleeping
Creating logical plan...
Creating physical plan...
Executing physical plan...
Getting results...
cancelling thread
done dropping runtime in 68.639709ms
Iteration 0 cancelled in 68.639709 ms
[...]

Waiting 300 ms:

./target/release/dfbench cancellation --iterations 5 --path benchmarks/data/cancellation -o benchmarks/results/main/cancellation.json --wait-time 300

Using 7 files found on disk
Starting to load data into in-memory object store
Done loading data into in-memory object store
in main, sleeping
Starting spawned
Creating logical plan...
Creating physical plan...
Executing physical plan...
Getting results...
cancelling thread
done dropping runtime in 100.60675ms
Iteration 0 cancelled in 100.60674999999999 ms
[...]

Expected behavior

Ideally, any query cancelled at any point in the query should stop within ~1ms.

Additional context

Related issues:

This PR may be a fix, but there's some disagreement about whether this is the right way to fix this issue.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions