Documentation for plotting bin proportion in geom_histogram #3522

Closed
kuriwaki opened this issue Sep 7, 2019 · 12 comments · Fixed by #4130
Labels
documentation feature a feature request or enhancement layers 📈

Comments

@kuriwaki

kuriwaki commented Sep 7, 2019

Showing the proportion of observations in each bin of a histogram (instead of counts or density) is a common exercise, but it is not documented. Most Stack Overflow suggestions are either conflicting or outdated.

The solution below by @clauswilke in a now-locked thread seems like the best practice. Could this trick, as well as the width computed variable, be documented in the geom_histogram manual and on the ggplot2 website?


library(ggplot2)

# instead of plotting the count statistic in geom_histogram, 
# plot the proportion of a bin, which is equivalent to the product of the 
# density and the width statistic
ggplot(diamonds, aes(x = carat, y = stat(density*width))) +
  geom_histogram(binwidth = 0.01)

Created on 2019-09-07 by the reprex package (v0.3.0)
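
To see why density times width gives a proportion, here is a minimal base R sketch that does not use ggplot2 at all. It relies on `hist()`, whose `density` is defined as count / (n * bin width); the simulated data and the seed are assumptions made for illustration and are not from the thread.

```r
# Base R sketch: density * width recovers the proportion of observations per bin.
set.seed(1)               # arbitrary seed, only for reproducibility
x <- rnorm(1000)          # simulated data (illustrative, not from the thread)

h <- hist(x, breaks = 20, plot = FALSE)
width <- diff(h$breaks)   # width of each bin
prop  <- h$density * width

# The proportions equal counts / n and sum to 1.
all.equal(prop, h$counts / length(x))
all.equal(sum(prop), 1)
```

The same identity is what `stat(density*width)` relies on, assuming the binning stat scales density so that it integrates to 1.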

The solution is to use aes(y = stat(width*density)). This converts the density back into a proportion (the fraction of observations falling in each bin). The reprex is shown above.

Originally posted by @clauswilke in #2499 (comment)

@yutannihilation
Member

Hmm... I'm wondering whether using stat(width) is really part of ggplot2's API; it's not a calculated variable of StatBin. I hadn't noticed that stat() can access not only the official calculated variables but also other internal variables. Another example:

library(ggplot2)

ggplot(data.frame(x = 1, id = factor(1:3)), aes(x, x)) +
  geom_text(aes(label = stat(PANEL)), size = 20) +
  facet_wrap(vars(id))

Created on 2019-09-11 by the reprex package (v0.3.0)

In my opinion, width is just an internal variable and should not be recommended, because it could be renamed in the future.

@clauswilke
What do you think?

@clauswilke
Member

I'm not sure I understand what "it's not a calculated variable of StatBin" means.

These are all the variables that StatBin calculates:

ggplot2/R/bin.R

Lines 168 to 175 in 7f317d4

count = count,
x = x,
xmin = xmin,
xmax = xmax,
width = width,
density = density,
ncount = count / max(abs(count)),
ndensity = density / max(abs(density))

I don't see how width is in any way different than, say, density.
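
As a rough illustration of how these variables relate to each other, the base R sketch below computes the same columns for a tiny input. `bin_summary()` is a hypothetical helper written for this example, not ggplot2's implementation; the density formula is an assumption based on the documented description "scaled to integrate to 1".

```r
# Hypothetical helper mirroring the variables listed above; NOT ggplot2's code.
bin_summary <- function(x, breaks) {
  # Count observations per bin (right-closed bins, lowest break included).
  count <- as.vector(table(cut(x, breaks, include.lowest = TRUE)))
  xmin  <- head(breaks, -1)
  xmax  <- tail(breaks, -1)
  width <- xmax - xmin
  # Assumed definition: density integrates to 1 over all bins.
  density <- count / width / sum(abs(count))
  data.frame(
    count    = count,
    x        = (xmin + xmax) / 2,
    xmin     = xmin,
    xmax     = xmax,
    width    = width,
    density  = density,
    ncount   = count / max(abs(count)),
    ndensity = density / max(abs(density))
  )
}

b <- bin_summary(c(0.1, 0.2, 0.2, 0.7, 0.9), breaks = c(0, 0.5, 1))
b$density * b$width   # proportion per bin: 0.6 0.4
```

In this sketch, width is computed alongside density and is needed to turn density back into a proportion, which is the relationship the proportion trick depends on.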

@yutannihilation
Member

Yes, width is a calculated variable in that sense. The reasons I thought it wasn't officially exposed were:

  1. it's not documented in the Computed Variables section, and

    ggplot2/R/stat-bin.r

    Lines 30 to 36 in 23e3241

    #' @section Computed variables:
    #' \describe{
    #' \item{count}{number of points in bin}
    #' \item{density}{density of points in bin, scaled to integrate to 1}
    #' \item{ncount}{count, scaled to maximum of 1}
    #' \item{ndensity}{density, scaled to maximum of 1}
    #' }

  2. it's not included in the data layer_data() returns.

library(ggplot2)

p <- ggplot(diamonds, aes(carat)) +
  geom_histogram()

head(layer_data(p))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#>       y count         x       xmin      xmax    density     ncount
#> 1   573   573 0.1658621 0.08293103 0.2487931 0.06404668 0.03769737
#> 2 15200 15200 0.3317241 0.24879310 0.4146552 1.69896944 1.00000000
#> 3  8165  8165 0.4975862 0.41465517 0.5805172 0.91263720 0.53717105
#> 4  6096  6096 0.6634483 0.58051724 0.7463793 0.68137617 0.40105263
#> 5  4138  4138 0.8293103 0.74637931 0.9122414 0.46252208 0.27223684
#> 6  7465  7465 0.9951724 0.91224138 1.0781034 0.83439519 0.49111842
#>     ndensity PANEL group ymin  ymax colour   fill size linetype alpha
#> 1 0.03769737     1    -1    0   573     NA grey35  0.5        1    NA
#> 2 1.00000000     1    -1    0 15200     NA grey35  0.5        1    NA
#> 3 0.53717105     1    -1    0  8165     NA grey35  0.5        1    NA
#> 4 0.40105263     1    -1    0  6096     NA grey35  0.5        1    NA
#> 5 0.27223684     1    -1    0  4138     NA grey35  0.5        1    NA
#> 6 0.49111842     1    -1    0  7465     NA grey35  0.5        1    NA

Created on 2019-09-12 by the reprex package (v0.3.0)
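
Even though width is missing from this output, it can be reconstructed as xmax - xmin. The sketch below copies the first three rows of the output above by hand and checks that density times the reconstructed width matches count / n. The value n = 53940 (the number of rows in ggplot2's diamonds data) is not stated in the thread, and the loose tolerance only accounts for rounding in the printed values.

```r
# First three rows of the layer_data() output above, copied by hand.
ld <- data.frame(
  count   = c(573, 15200, 8165),
  xmin    = c(0.08293103, 0.24879310, 0.41465517),
  xmax    = c(0.2487931, 0.4146552, 0.5805172),
  density = c(0.06404668, 1.69896944, 0.91263720)
)

ld$width <- ld$xmax - ld$xmin        # reconstruct the dropped width column
ld$prop  <- ld$density * ld$width    # proportion of observations per bin

n <- 53940  # rows in diamonds; an assumption not stated in the thread
all.equal(ld$prop, ld$count / n, tolerance = 1e-5)
```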

But I haven't checked this in detail yet. Maybe I'm just confusing this with the case of width as an aesthetic, where we want to hide width from users to avoid confusion... (cf. #3194)

@clauswilke
Member

Weird, I have no idea why width doesn't make it into the layer data.

@yutannihilation
Member

IIRC, this is somewhat related to our effort to treat width as a param.

@thomasp85
Member

width is not present because it gets dropped between the stat and geom AFAIK.

@kuriwaki
Author

Apologies for not following all the nuances, but does this imply y = stat(density*width) is a safe way to generate proportion histograms moving forward?

@thomasp85
Member

It should be safe, but I think it may make sense for stat_bin() to provide it explicitly... @clauswilke do you have any input on this?

@thomasp85 thomasp85 added documentation feature a feature request or enhancement layers 📈 labels Jan 21, 2020
@clauswilke
Member

I think we should document width and also add a unit test that uses it to make sure it doesn't randomly disappear at some point in the future.

@yutannihilation
Member

width is not present because it gets dropped between the stat and geom AFAIK.

Sorry, I think I now see the rule behind this a bit more clearly. Let me confirm.

Is my understanding correct?

@clauswilke
Member

Yes, I think that's correct.

@yutannihilation
Member

Thanks, then I agree with documenting width and adding tests.
