Computing stats between groups #6476

Closed
kieran-mace opened this issue May 22, 2025 · 7 comments

@kieran-mace

kieran-mace commented May 22, 2025

When we want to calculate a group stat that requires knowledge of other groups, it would be useful for compute_group to have access to the rest of the data.

I would like to be able to create a new property, bin_prop, on StatBin that returns the proportion of the data in each bin that belongs to the group.

In the example below, I want to analyze the number of plays by each player on the Lakers. I will use geom_freqpoly to show the counts, but what I really want is the proportion of plays per player within each bin.

Set up data

library(lubridate)
library(ggplot2) 
library(dplyr)


# set up data
laker_player_plays = lakers |> 
  tibble::as_tibble() |> 
  filter(team == 'LAL', stringr::str_length(player) > 0) |> 
  mutate(date = ymd(date))

Just counts. This is close to what I want, but I would love to use after_stat(bin_prop) instead.

# I'd like to do this, but instead create a new property `bin_prop` that shows the percentage of plays by that player
ggplot(laker_player_plays) +
  geom_freqpoly(aes(x = date,
                    color = player,
                    y = after_stat(count)
  ),
  binwidth = 31)

Side note

I do see that something equivalent can be done with geom_histogram + position = 'fill', but I don't believe this is computed in the stat layer. Maybe it happens in the scales layer?

# I do notice this is done to some extent using geom_histogram + position = fill, but I believe this position is not computed during the stat step
ggplot(laker_player_plays) +
  geom_histogram(aes(x = date, fill = player), position = 'fill', binwidth = 31)
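
One way to check where the rescaling happens is to inspect the built layer data. The sketch below uses the built-in `mpg` dataset rather than the Lakers data (an assumption made only to keep it self-contained). As far as I can tell, the `count` column still holds the raw per-group bin counts from StatBin, while `ymin`/`ymax` have been rescaled by the position adjustment, so the proportions are not available to after_stat().

```r
library(ggplot2)

# Stand-in data: `mpg` ships with ggplot2, so no extra packages are needed.
p_fill <- ggplot(mpg, aes(x = displ, fill = drv)) +
  geom_histogram(position = "fill", binwidth = 1)

# layer_data() returns the data after stats and position adjustments ran.
built <- layer_data(p_fill)

# `count` is still the raw number of rows per group and bin ...
head(built[, c("fill", "x", "count", "ymin", "ymax")])

# ... while the stacked heights were rescaled to the [0, 1] range.
range(built$ymax, na.rm = TRUE)
```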

Desired output

Here is an example of what I'd like to achieve, but by using stats instead of precomputing proportion_of_plays ahead of time.

# This is the type of plot I think we should be able to create, without having to pre-calculate the proportions (should be computed in StatBin)
# calculate breaks, for solutions that can't use stat_bin

breaks = seq(min(laker_player_plays$date), max(laker_player_plays$date)+31, by = 31)

laker_player_plays |> 
  mutate(date_group = cut(date, breaks = breaks)) |>
  group_by(player, date_group) |> 
  count(name = 'plays') |> 
  group_by(date_group) |> 
  mutate(proportion_of_plays = plays/sum(plays)) |> 
  ggplot(aes(x = date_group, 
             y = proportion_of_plays,
             color = player,
             group = player)) +
  geom_point() +
  geom_line() +
  scale_y_continuous(labels=scales::percent)

Created on 2025-05-22 with reprex v2.1.1

Suggested API

ggplot(laker_player_plays) +
  geom_freqpoly(aes(x = date,
                    color = player,
                    y = after_stat(bin_prop)
  ),
  binwidth = 31)

I've attempted to create a PR for this, but noticed that each group is calculated independently. Is there a solution, or workaround that you propose to create a PR that enables the calculation of bin_prop in StatBin that requires calculation of proportions between groups? I do see that after_stat(prop) is available for geom_bar so I suspect this pattern has been solved for before?
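
For context on after_stat(prop): to my understanding, stat_count computes `prop` within each group (the group's count in a bar divided by that group's total), not across groups within a bar. The sketch below, using the built-in `mpg` data as a stand-in, checks this by summing `prop` per group. If that reading is right, `prop` does not by itself solve the between-group case asked about here.

```r
library(ggplot2)

# Group explicitly by `drv`; otherwise the discrete x would also split groups.
p_prop <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(aes(y = after_stat(prop), group = drv))

d_prop <- layer_data(p_prop)

# Summing `prop` over all bars of one group gives 1 for every group,
# i.e. the proportions are relative to the group, not to the bar.
tapply(d_prop$prop, d_prop$group, sum)
```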

@kieran-mace
Author

I'd love to write the PR to add this to ggplot2 but could use guidance on how to calculate such a stat within StatBin

@teunbrand
Collaborator

Thanks for the report! If you want to compute stats per panel or per layer, you can use the compute_layer() or compute_panel() methods instead of the compute_group() method. These exist for the express purpose of offering this level of granularity. Do these not work out for your case?
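
A minimal sketch of what the compute_panel() route could look like for the bin_prop idea. `StatBinProp` and its `binwidth` handling are hypothetical and not part of ggplot2; the binning is deliberately simplistic (fixed-width bins starting at the smallest x) and does not reuse StatBin's logic. It uses the built-in `mpg` data so that it runs on its own.

```r
library(ggplot2)

# Hypothetical stat, not part of ggplot2: bins x on breaks shared by all
# groups in a panel, then divides each group's count by the bin total.
StatBinProp <- ggproto(
  "StatBinProp", Stat,
  required_aes = "x",
  default_aes = aes(y = after_stat(bin_prop)),
  compute_panel = function(data, scales, binwidth = 1) {
    # Assign every row to a fixed-width bin, ignoring group membership.
    origin <- min(data$x)
    bin <- floor((data$x - origin) / binwidth) + 1
    n_bins <- max(bin)

    # Count rows for every group/bin combination, keeping zero counts.
    tab <- table(
      group = factor(data$group),
      bin = factor(bin, levels = seq_len(n_bins))
    )
    counts <- as.data.frame(tab, responseName = "count")
    counts$group <- as.integer(as.character(counts$group))
    counts$bin <- as.integer(as.character(counts$bin))

    # Proportion of each bin that belongs to each group (0 for empty bins).
    bin_total <- ave(counts$count, counts$bin, FUN = sum)
    counts$bin_prop <- ifelse(bin_total > 0, counts$count / bin_total, 0)
    counts$x <- origin + (counts$bin - 0.5) * binwidth

    # Carry over per-group columns such as colour and PANEL.
    keys <- data[!duplicated(data$group), setdiff(names(data), "x"), drop = FALSE]
    out <- merge(counts, keys, by = "group")
    out[order(out$group, out$x), ]
  }
)

p <- ggplot(mpg, aes(x = displ, colour = drv)) +
  geom_line(stat = StatBinProp, binwidth = 1)

d <- layer_data(p)
```

Because compute_panel() receives all groups of a panel at once, the bin totals can be shared across groups, which compute_group() cannot do. Within every non-empty bin the `bin_prop` values in `d` should sum to 1.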

@clauswilke
Member

Kieran, as an example, stat_density_2d() implements its own compute_layer() function as it needs to do some things that can't happen at the group level: https://github.com/tidyverse/ggplot2/blob/main/R/stat-density-2d.R

I'm going to close this issue because what you request is already possible. This may not be widely known, even among people that write ggplot2 extensions, but there is no fundamental limit that you're running up against.

@kieran-mace
Author

kieran-mace commented May 22, 2025

@clauswilke that is true. I'm converging on the opinion that stat_bin is a special case of stat_count (where binning is applied to a continuous variable to construct a categorical one that is then passed to stat_count), and therefore properties like after_stat(prop) should actually be available there too.

I do see that they have completely different implementations, and maybe that's for the best, but it seems non-DRY to me.

@teunbrand I do see your PR on moving things to S7, and this could potentially be a use case for a class extension (stat_bin is an extension of stat_count). But I definitely don't understand ggplot2 well enough to know this for sure.

@clauswilke
Member

@kieran-mace It's important to be precise with what exactly an issue is about. This issue started out as being about computing stats between groups. If you think stat_bin needs some additional features then please file an issue specifically about that.

I would also suggest not mixing feature requests with proposed implementation details. Whether stat_bin should be implemented as a special case of stat_count is again a completely separate point.

@teunbrand
Collaborator

In the S7 PR, I don't replace the ggproto classes that ggplot2 and its extensions are based on. I replace the most prominent S3 classes with S7 classes. If you would like help with implementing an extension (besides Stack Overflow), the discussions at https://github.com/ggplot2-extenders/ggplot-extension-club/discussions are open.

@kieran-mace
Author

@clauswilke thank you very much for the feedback.

New issue opened: #6478
New PR submitted: #6477
