Skip to content

Ploting POINT sf with geom_sf() is slow #2718

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
yutannihilation opened this issue Jun 26, 2018 · 21 comments
Closed

Ploting POINT sf with geom_sf() is slow #2718

yutannihilation opened this issue Jun 26, 2018 · 21 comments

Comments

@yutannihilation
Copy link
Member

yutannihilation commented Jun 26, 2018

In my environment, it takes 10 seconds for plotting 1000 POINTs by geom_sf().
I feel this is a bit too long for exploring the data.

https://gist.github.com/yutannihilation/479cb5f254826915a7a3d36ff84b4b43

I suspect this is because draw_panel() applies sf_as_grob() on every row. Is it possible to do this more efficiently, e.g. st_combine() per group so that it plots one MULTIPOINT that contains 1000 points?

@clauswilke
Copy link
Member

Could you check whether there is any relation to issue #2655?

@yutannihilation
Copy link
Member Author

Thanks, I roughly think this is a different issue because I'm using Windows (sorry I forgot to write this). But, I will check further.

@yutannihilation
Copy link
Member Author

yutannihilation commented Jun 27, 2018

Here's the result of profvis of the code that plots 1000 POINTs. This shows st_grob() takes most of the time.

image

@yutannihilation
Copy link
Member Author

Here's a reprex. This shows st_combine() significantly save time.

library(ggplot2)
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, proj.4 4.9.3

set.seed(1)
n <- 1000

d <- data.frame(
  x = runif(n) * 360,
  y = runif(n) * 180
)

d_sf <- st_multipoint(as.matrix(d)) %>%
  st_sfc() %>% 
  st_cast("POINT")

system.time({
  p <- ggplot(d_sf) +
    geom_sf()
  print(p)
})

#>    user  system elapsed 
#>    7.36    0.10    7.71


system.time({
  d_sf_combined <- st_combine(d_sf)
  p <- ggplot(d_sf_combined) +
    geom_sf()
  print(p)
})

#>    user  system elapsed 
#>    0.27    0.05    0.31

Created on 2018-06-28 by the reprex package (v0.2.0).

Session info
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.5.0 (2018-04-23)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  Japanese_Japan.932          
#>  tz       Asia/Tokyo                  
#>  date     2018-06-28                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version     date       source                            
#>  assertthat    0.2.0       2017-04-11 CRAN (R 3.5.0)                    
#>  backports     1.1.2       2017-12-13 CRAN (R 3.5.0)                    
#>  bindr         0.1.1       2018-03-13 CRAN (R 3.5.0)                    
#>  bindrcpp      0.2.2       2018-03-29 CRAN (R 3.5.0)                    
#>  callr         2.0.4       2018-05-15 CRAN (R 3.5.0)                    
#>  class         7.3-14      2015-08-30 CRAN (R 3.5.0)                    
#>  classInt      0.2-3       2018-04-16 CRAN (R 3.5.0)                    
#>  cli           1.0.0       2017-11-05 CRAN (R 3.5.0)                    
#>  clisymbols    1.2.0       2017-05-21 CRAN (R 3.5.0)                    
#>  colorspace    1.3-2       2016-12-14 CRAN (R 3.5.0)                    
#>  crayon        1.3.4       2017-09-16 CRAN (R 3.5.0)                    
#>  curl          3.2         2018-03-28 CRAN (R 3.5.0)                    
#>  DBI           1.0.0       2018-05-02 CRAN (R 3.5.0)                    
#>  debugme       1.1.0       2017-10-22 CRAN (R 3.5.0)                    
#>  desc          1.2.0       2018-05-01 CRAN (R 3.5.0)                    
#>  devtools      1.13.5.9000 2018-05-17 Github (r-lib/devtools@13ee56b)   
#>  digest        0.6.15      2018-01-28 CRAN (R 3.5.0)                    
#>  dplyr         0.7.5       2018-05-19 CRAN (R 3.5.0)                    
#>  e1071         1.6-8       2017-02-02 CRAN (R 3.5.0)                    
#>  evaluate      0.10.1      2017-06-24 CRAN (R 3.5.0)                    
#>  ggplot2     * 2.2.1.9000  2018-06-26 Github (tidyverse/ggplot2@348b26f)
#>  glue          1.2.0       2017-10-29 CRAN (R 3.5.0)                    
#>  gtable        0.2.0       2016-02-26 CRAN (R 3.5.0)                    
#>  htmltools     0.3.6       2017-04-28 CRAN (R 3.5.0)                    
#>  httr          1.3.1       2017-08-20 CRAN (R 3.5.0)                    
#>  knitr         1.20.2      2018-05-10 local                             
#>  lazyeval      0.2.1       2017-10-29 CRAN (R 3.5.0)                    
#>  magrittr      1.5         2014-11-22 CRAN (R 3.5.0)                    
#>  memoise       1.1.0       2018-06-13 Github (hadley/memoise@06d16ec)   
#>  mime          0.5         2016-07-07 CRAN (R 3.5.0)                    
#>  munsell       0.5.0       2018-06-12 CRAN (R 3.5.0)                    
#>  pillar        1.2.3       2018-05-25 CRAN (R 3.5.0)                    
#>  pkgbuild      1.0.0       2018-05-17 Github (r-lib/pkgbuild@0457039)   
#>  pkgconfig     2.0.1       2017-03-21 CRAN (R 3.5.0)                    
#>  pkgload       1.0.0       2018-05-17 Github (r-lib/pkgload@35efedd)    
#>  plyr          1.8.4       2016-06-08 CRAN (R 3.5.0)                    
#>  processx      3.1.0       2018-05-15 CRAN (R 3.5.0)                    
#>  purrr         0.2.5       2018-05-29 CRAN (R 3.5.0)                    
#>  R6            2.2.2       2017-06-17 CRAN (R 3.5.0)                    
#>  Rcpp          0.12.17     2018-05-18 CRAN (R 3.5.0)                    
#>  rlang         0.2.1       2018-05-30 CRAN (R 3.5.0)                    
#>  rmarkdown     1.10        2018-06-11 CRAN (R 3.5.0)                    
#>  rprojroot     1.3-2       2018-01-03 CRAN (R 3.5.0)                    
#>  scales        0.5.0.9000  2018-06-26 Github (hadley/scales@9e5e4d4)    
#>  sessioninfo   1.0.0       2017-06-21 CRAN (R 3.5.0)                    
#>  sf          * 0.6-3       2018-05-17 CRAN (R 3.5.0)                    
#>  spData        0.2.9.0     2018-06-17 CRAN (R 3.5.0)                    
#>  stringi       1.2.3       2018-06-12 CRAN (R 3.5.0)                    
#>  stringr       1.3.1       2018-05-10 CRAN (R 3.5.0)                    
#>  testthat      2.0.0       2017-12-13 CRAN (R 3.5.0)                    
#>  tibble        1.4.2       2018-01-22 CRAN (R 3.5.0)                    
#>  tidyselect    0.2.4       2018-02-26 CRAN (R 3.5.0)                    
#>  units         0.6-0       2018-06-09 CRAN (R 3.5.0)                    
#>  usethis       1.3.0       2018-02-24 CRAN (R 3.5.0)                    
#>  withr         2.1.2       2018-06-26 Github (jimhester/withr@fe56f20)  
#>  xml2          1.2.0       2018-01-24 CRAN (R 3.5.0)                    
#>  yaml          2.1.19      2018-05-01 CRAN (R 3.5.0)

@yutannihilation
Copy link
Member Author

Here's a rough implementation for this. Does this make sense? If so, I will try to make the code ready to merge.

(Sorry, I just wanted to make a PR on my repo to show codes, but I mistakenly created here... 🙄 )

https://github.com/tidyverse/ggplot2/pull/2722/files

@jschoeley
Copy link
Contributor

jschoeley commented Jun 28, 2018

Furthermore ggplot(xy_points_dataframe) + geom_point() + coord_map() renders massively faster on my system compared to ggplot(point_sf) + geom_sf(). Dot-density maps of population counts are completely out of the question using geom_sf() on my system. They work wonderfully with geom_point() though...

# Draw a point at the centroid coordinates of each European region

library(tidyverse)
library(eurostat)
library(sf)

# using geom_sf
system.time({
  p1 <-
    eurostat_geodata_60 %>%
    filter(LEVL_CODE == 3) %>%
    st_centroid() %>%
    ggplot() + geom_sf() + coord_sf(crs = st_crs(3785))
  print(p1)
})

##   user  system elapsed 
## 10.634   0.179   6.710 

# using geom_point
system.time({
  p2 <-
    eurostat_geodata_60 %>%
    filter(LEVL_CODE == 3) %>%
    st_centroid() %>% st_coordinates() %>% as_data_frame() %>%
    ggplot() + geom_point(aes(x = X, y = Y)) + coord_map(projection = 'mercator')
  print(p2)
})

##   user  system elapsed 
##  0.889   0.004   0.896 

@thomasp85
Copy link
Member

Personally I’m not sure if it is the goal of ggplot2 to optimise the input data. It might lead to unexpected results, e.g. in the stacking order of the different features...

If a fail-safe approach could be found it might be worth it but otherwise just documenting that multiple POINT will render slowly...

There’s no problem combining geom_sf and geom_point btw

@yutannihilation
Copy link
Member Author

Basically I agree with @thomasp85, but I feel this is unacceptably slow and should be fixed.
I don't want to let ggplot2 be the tool that only those who know how to "optimize" the data can use.

In the case of #2655, it was not the ggplot2's fault and it seems there's nothing we can do. But, in this case, considering plot.sfc_POINT() and plot.sfc_MULTIPOINT() can plot this 10+ times faster, I think something is wrong with the current implementation.

That said, I don't think my PR is a good way to fix this. Maybe is it better to implement st_as_grob() methods for sfc classes first on sf's side...?

@clauswilke
Copy link
Member

At some point, we may have to bite the bullet and see if we can speed up grid. It seems absurd that we cannot draw thousands of points as individual grobs without experiencing extreme speed penalties. The same applies to drawing many individual text labels. Otherwise we will have to keep making design choices in ggplot2 that are driven by limitations of grid rather than by what would be a clean design.

E.g., I find this comment on geom_label() quite frustrating:
https://ggplot2.tidyverse.org/reference/geom_text.html#geom-label
We shouldn't need two different geoms to draw text labels, one that is fast and one that can draw a colored background. It's not that complicated a problem.

@thomasp85
Copy link
Member

I agree - as I’ve stated before I really hope to get some time to review the whole stack with performance in mind

My problem in combining POINT to multipoint is that it will result in all points being the same colour - also if a line or polygon were mixed in with the points, it would loose its ordering in the drawing stack

@yutannihilation
Copy link
Member Author

Hmm, thanks. I didn't notice the point that we can speed up grid. It's reasonable to wait for it and avoid adding complex tweaks. OK, please close this issue anytime.

One thing, this is NOT true, if you are talking about combining POINTs in ggplot2:

combining POINT to multipoint is that it will result in all points being the same colour

If this is about combining them outside ggplot2, you are right. That's why I felt this should happen inside draw_panel() to map aesthetics like colour.... (About the other concerns, I think you are right.)

@yutannihilation
Copy link
Member Author

Sorry, may I confirm one more thing? I wonder if this is really true.

It seems absurd that we cannot draw thousands of points as individual grobs without experiencing extreme speed penalties.

As I showed above (#2718 (comment)), most of the time is spent for sf_grob(), which is to build grobs, not to draw grobs. IIUC, improving grid can save time for draw.grid(), but this looks not the case to me.

@jschoeley
Copy link
Contributor

I‘ve experienced the performance hit in the drawing, not the building stage.

@clauswilke
Copy link
Member

clauswilke commented Jun 28, 2018

I think many things are slow. If I look at the code for utils::modifyList() I want to cry. I think ggplot2 needs a clean and fast replacement of it for the purpose for which it is used. grid::gpar() also does tons of things that slow it down.

Considering that ultimately all gpar() does is create a list this is pretty awful:

microbenchmark::microbenchmark(
  grid::gpar(color = "blue"), list(color = "blue")
)
#> Unit: nanoseconds
#>                        expr   min    lq     mean  median      uq      max
#>  grid::gpar(color = "blue") 18715 19526 220562.0 20031.0 21431.5 19759145
#>        list(color = "blue")   143   161    308.2   185.5   217.0     8740
#>  neval
#>    100
#>    100

Created on 2018-06-28 by the reprex package (v0.2.0).

@clauswilke
Copy link
Member

clauswilke commented Jun 28, 2018

Another example. I realize that my code reorders the list, but that may or may not matter. For aesthetics or other graphical parameters that are always named it doesn't matter. And I bet by implementing the entire function in C++ we could have similar or better speed without the reordering problem.

(Not sure why the maximum is so high though for update_list(). Maybe I'm missing something.)

update_list <- function(old, new) {
  names_new <- names(new)
  names_old <- names(old)
  old[intersect(names_old, names_new)] <- NULL
  c(old, new)
}

l1 <- list(a = 1, b = 2, c = 3)
l2 <- list(a = 5, c = 3, d = 4)
update_list(l1, l2)
#> $b
#> [1] 2
#> 
#> $a
#> [1] 5
#> 
#> $c
#> [1] 3
#> 
#> $d
#> [1] 4
utils::modifyList(l1, l2)
#> $a
#> [1] 5
#> 
#> $b
#> [1] 2
#> 
#> $c
#> [1] 3
#> 
#> $d
#> [1] 4

microbenchmark::microbenchmark(
  update_list(l1, l2),
  utils::modifyList(l1, l2)
)
#> Unit: microseconds
#>                       expr    min     lq      mean  median      uq
#>        update_list(l1, l2)  7.202  8.393 215.79596  9.3245 10.4400
#>  utils::modifyList(l1, l2) 33.227 35.609  38.01464 37.4455 38.8585
#>        max neval
#>  20604.255   100
#>     77.789   100

Created on 2018-06-28 by the reprex package (v0.2.0).

@dpseidel
Copy link
Collaborator

I just wanted to loop @tmastny in here. We're both interested in improving performance and have been recently profiling ggplot2 and grid code. Helpfully Tim has more background in compiled code than I do. Likely this thread is not the place to organize them but Tim and I would be happy to talk more about known slowpoints or a wishlist of things to investigate.

@clauswilke
Copy link
Member

clauswilke commented Jun 28, 2018

One more example. Just by circumventing grid error checking and creating grobs directly we can get a ~3 fold speed increase for points.

library(grid)
library(purrr)
library(dplyr)

# fast version of gpar
gpar_fast <- function(...) {
  structure(
    list(...),
    class = "gpar"
  )  
}

# fast version of pointsGrob()
pointsGrob_fast <- function(xu = unit(0.5, "native"), yu = unit(0.5, "native"),
                            pch = 1, size = unit(1, "char"),
                            name = "none", gp = gpar_fast(), vp = NULL) {
  structure(
    list(
      x = xu,
      y = yu,
      pch = as.integer(pch),
      size = size,
      name = name,
      gp = gp,
      vp = vp
    ),
    class = c("points", "grob", "gDesc")
  )
}

# fast and slow grobs are identical
identical(
  pointsGrob_fast(unit(0.5, "npc"), unit(0.5, "npc"), name = "none"),
  pointsGrob(unit(0.5, "npc"), unit(0.5, "npc"), name = "none")
)
#> [1] TRUE

n <- 50
df <- tibble(
  x = runif(n),
  y = runif(n),
  color = sample(c("red", "green", "blue"), n, replace = TRUE)
) %>% mutate(
  xu = map(x, ~unit(.x, "native")),
  yu = map(y, ~unit(.x, "native")),
)

# standard grid
f1 <- function() {
  pmap(
    df,
    function(x, y, color, ...) {
      gp <- gpar(col = color)
      pointsGrob(x, y, gp = gp)    
    }
  )
}

# fast points, slow gpar
f2 <- function() {
  pmap(
    df,
    function(xu, yu, color, ...) {
      gp <- gpar(col = color)
      pointsGrob_fast(xu, yu, gp = gp)    
    }
  )
}

# fast points, fast gpar
f3 <- function() {
  pmap(
    df,
    function(xu, yu, color, ...) {
      gp <- gpar_fast(col = color)
      pointsGrob_fast(xu, yu, gp = gp)    
    }
  )
}

# plot output is the same
grid.newpage()
pushViewport(viewport())
grid.draw(do.call(gList, f1()))

grid.newpage()
pushViewport(viewport())
grid.draw(do.call(gList, f3()))

microbenchmark::microbenchmark(
  f1(),
  f2(),
  f3(),
  times = 50
)
#> Unit: milliseconds
#>  expr      min       lq     mean   median       uq       max neval
#>  f1() 6.785814 6.999505 8.152342 7.623538 9.108686 12.428292    50
#>  f2() 2.305470 2.485765 3.349848 2.674786 3.282183 11.510736    50
#>  f3() 1.866726 1.994537 2.521616 2.113111 2.633927  6.555309    50

Created on 2018-06-28 by the reprex package (v0.2.0).

@yutannihilation
Copy link
Member Author

yutannihilation commented Jun 28, 2018

Thanks so much for the nice codes and explanations!

If I look at the code for utils::modifyList() I want to cry.

I didn't notice that, but you are right. In fact, my profvis shows utils::modifyList() takes 3/4 of the time of lapply(..., sf_grob()) (The rest 1/4 is the overhead of the loop by lapply()...?).

image

image

But, for gpar() and pointsGrob(), which is called inside st_as_grob.POINT(), it seems to matter little here. So, to me, this is rather the problem on ggplot2's side for this particular issue. Generally saying, I agree with you in that grid can be improved significantly and it does benefit the performance of ggplot2 :)

Likely this thread is not the place to organize them

I'm not sure if this is the typical performance issue of ggplot2, but I'm happy to keep this issue open for that kind of discussion. 👍

@lgtm70b
Copy link

lgtm70b commented Aug 18, 2018

I understand less about ggplot2 internals than the other members of this thread, but just in case this says anything useful about implementation / communication with graphics devices, I ran a recent performance benchmark comparing ggplot2 rendering to tmap --

Plotting the same plot using the same spatial point data & graphics device (RStudioGD macOS / Quartz), rendering the tmap plot was 80x faster but also used an 80x larger object.
Is there any non-obvious difference between the two package implementations that could explain this outcome? Reprex here: https://stackoverflow.com/questions/51126014/why-does-tmap-render-80-times-faster-than-ggplot2-plotting-shapefiles-in-r-wit

@yutannihilation
Copy link
Member Author

I believe this is closed by #3164 👍

@lock
Copy link

lock bot commented Sep 2, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Sep 2, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants