
Improve performance of extremely large datasets #5641

Closed
pepijndevos opened this issue May 7, 2021 · 6 comments

Comments

@pepijndevos

pepijndevos commented May 7, 2021

When I search around for plotting libraries for a large number of data points, people are talking 50k, I'm talking 10M. I made a simple React demo that generates 10 random sinusoids of 1M points each. 10 * 100k works fine, but 10 * 1M becomes unusable.

  import Plot from 'react-plotly.js';

  // shared x values 0..numpoints-1
  const numpoints = 1e6;
  const time: number[] = [];
  for (let i = 0; i < numpoints; i++) {
    time.push(i);
  }

  // 10 random sinusoids of 1M points each, one per subplot
  const traces: object[] = [];
  for (let i = 0; i < 10; i++) {
    const points: number[] = [];
    const freq = Math.random() / 1000;
    for (let j = 0; j < numpoints; j++) {
      points.push(Math.sin(j * freq));
    }
    traces.push({
      x: time,
      y: points,
      type: 'scatter',
      mode: 'lines',
      yaxis: `y${i + 1}`,
      xaxis: `x${i + 1}`,
    });
  }

  // ...

  <Plot
    data={traces}
    layout={{
      width: 2000,
      height: 1000,
      grid: { rows: 5, columns: 2, pattern: 'independent' },
      title: 'A Fancy Plot',
    }}
  />

I did a bit of profiling, and there are two issues. The first one is relatively simple: hovering over the plot lags because the hover handler loops over every single data point, and the comment in the source already describes what needs to be done.

// apply the distance function to each data point
// this is the longest loop... if this bogs down, we may need
// to create pre-sorted data (by x or y), not sure how to
// do this for 'closest'

The second issue is the drawing of the plot itself after a drag or zoom action; it spends all its time in plot, plotOne and linePoints.
What's interesting is that even when you zoom in, so that only a small subset of the line would have to be drawn, it's still just as slow.

// loop over ALL points in this trace

So it seems like both problems could be solved with some sort of index to avoid looping over all the data points. Some suggestions to jumpstart the discussion:

  • For the common case of a monotonic x axis, implement a simple binary/interpolation search (monotonicity could be detected or specified; see the sketch after this list)
  • Store points in a quadtree, to allow fast spatial indexing for any type of data (such as https://github.com/plotly/point-cluster)
  • Automatic downsampling. If I have 10M points, when zoomed out all the detail is lost anyway, but I still want to be able to zoom in and inspect it.
  • Offload operations to a webworker. At some point you're going to need to do a thing 10M times, but don't freeze the UI to do it.
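
For the monotonic case, the search I have in mind is roughly the following (just a sketch; lowerBound, upperBound and visibleRange are illustrative helpers, not existing Plotly internals):

  // Find the index range of points whose x falls inside the visible [x0, x1],
  // assuming xs is sorted ascending: O(log n) instead of scanning every point.
  function lowerBound(xs: number[], target: number): number {
    // first index i with xs[i] >= target
    let lo = 0, hi = xs.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (xs[mid] < target) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }

  function upperBound(xs: number[], target: number): number {
    // first index i with xs[i] > target
    let lo = 0, hi = xs.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (xs[mid] <= target) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }

  function visibleRange(xs: number[], x0: number, x1: number): [number, number] {
    return [lowerBound(xs, x0), upperBound(xs, x1)];
  }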

If I end up using Plotly in production I'd be happy to try and contribute towards this, but for now these are just some suggestions to see how the maintainers feel about the issue and what their preferred approach would be.

@nicolaskruchten
Contributor

Thanks for this issue! We'd love any help to improve performance (in a backwards-compatible way) anywhere in the library.

One quick note re the code above is that you're using the scatter trace, which we don't recommend past a few thousand points, partly because browsers don't handle hundreds of thousands or millions of SVG elements very well. At this scale we recommend the scattergl trace, which is WebGL-based and scales much better.
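
For the snippet above that is roughly a one-line change per trace:

    traces.push({
      x: time,
      y: points,
      type: 'scattergl',  // WebGL-backed trace instead of SVG 'scatter'
      mode: 'lines',
      yaxis: `y${i + 1}`,
      xaxis: `x${i + 1}`,
    });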

@pepijndevos
Author

Thanks for the recommendation! When using scattergl on my example, panning and zooming are in fact a lot smoother. The hover behavior is still a problem, but with hovermode: false it becomes entirely usable.
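
For reference, that is just the layout from my example with hover disabled:

    layout={{
      width: 2000,
      height: 1000,
      grid: { rows: 5, columns: 2, pattern: 'independent' },
      title: 'A Fancy Plot',
      hovermode: false,  // skip the per-point hover search entirely
    }}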

It'd still be nice to have faster hover though. Do you have any preferred solution?

@szkjn

szkjn commented Jan 21, 2022

Following up on this.

Is there a straightforward way to perform dynamic downsampling depending on the zoom range? So far, selectedData (selection tool) provides both points and range, but relayoutData (zoom/pan tool) only returns the latter.

This has been mentioned in thread #145 but not yet solved as far as I know.

Would appreciate any lead on this!

🙏

@pepijndevos
Author

For extremely large datasets it would be ideal if data could be provided on demand, so you don't end up downloading gigabytes of data to the client. There is no use in having more data points than there are pixels on the screen.
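
As a rough illustration of the zoom-driven downsampling @szkjn asked about (only a sketch, assuming react-plotly.js's onRelayout callback, full-resolution arrays fullX/fullY already in memory, and hypothetical React state viewX/viewY with setters setViewX/setViewY; this is not an existing Plotly feature):

  const MAX_POINTS = 2000;  // never send many more points than there are pixels

  // Pick ~MAX_POINTS strided samples from the part of the data inside [x0, x1].
  // fullX is assumed sorted ascending; a binary search (as sketched above)
  // would avoid the linear findIndex scans.
  function downsample(fullX: number[], fullY: number[], x0: number, x1: number) {
    let lo = fullX.findIndex((x) => x >= x0);
    if (lo < 0) lo = 0;
    let hi = fullX.findIndex((x) => x > x1);
    if (hi < 0) hi = fullX.length;
    const step = Math.max(1, Math.floor((hi - lo) / MAX_POINTS));
    const xs: number[] = [], ys: number[] = [];
    for (let i = lo; i < hi; i += step) {
      xs.push(fullX[i]);
      ys.push(fullY[i]);
    }
    return { xs, ys };
  }

  // In the component: re-feed the downsampled arrays into state on every zoom/pan.
  <Plot
    data={[{ x: viewX, y: viewY, type: 'scattergl', mode: 'lines' }]}
    layout={layout}
    onRelayout={(e: any) => {
      const x0 = e['xaxis.range[0]'], x1 = e['xaxis.range[1]'];
      if (x0 !== undefined && x1 !== undefined) {
        const { xs, ys } = downsample(fullX, fullY, x0, x1);
        setViewX(xs);
        setViewY(ys);
      }
    }}
  />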

@jonasvdd

jonasvdd commented Dec 4, 2022

Hi, we created the functionality that @szkjn describes, available for plotly.py, through the plotly-resampler toolkit!

@gvwilson
Contributor

gvwilson commented Jul 4, 2024

Hi - we are trying to tidy up the stale issues and PRs in Plotly's public repositories so that we can focus on things that are still important to our community. Since this one has been sitting for several years, I'm going to close it; if it is still a concern, please add a comment letting us know what recent version of our software you've checked it with so that I can reopen it and add it to our backlog. Thanks for your help - @gvwilson

gvwilson closed this as completed Jul 4, 2024