Add 'row.names' into ggplot_build(...)$data very useful for grouped geom_boxplot #4912

Closed
DiegoJArg opened this issue Jul 22, 2022 · 7 comments

Comments

@DiegoJArg

Hi.
After seeing in my boxplot that I may have some incomplete groups of data under certain customer_ids, I wanted to see how filtering them out would change the behavior of the rest.

My first idea was to gather the already calculated values in the drawn boxplot and filter the dataframe in order to draw it again.

Having successfully gathered the $data, I realized that my customer_ids are not listed in any column of the resulting data frame. Instead, I saw a $y column with numeric values, which I am guessing represents the order of the grouping variable.

The plot object probably contains the grouping labels, but the "data" structure does not, and it would help to have them there.
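For the underlying goal of dropping incomplete groups, one possible route that avoids the built plot data entirely is to count rows per group in the raw data and filter there, which keeps the group labels throughout. The following is only a sketch: the data set (with a deliberately short group E) and the threshold `min_n` are made up for illustration and are not from the original data.

```r
set.seed(111)

# Illustrative data: groups A-D have 20 rows each, group E only 3.
DF <- data.frame(
  id  = factor(c(rep(LETTERS[1:4], each = 20), rep("E", 3)), levels = LETTERS[1:5]),
  COL = sample(1:20, 83, replace = TRUE)
)

min_n  <- 10                              # hypothetical cut-off for a "complete" group
counts <- table(DF$id)                    # rows per group, named by level
keep   <- names(counts)[counts >= min_n]  # labels of the groups to keep

# Keep only complete groups and drop the now-unused factor levels.
DF_complete <- droplevels(DF[DF$id %in% keep, ])

# Per-group medians, labelled by id rather than by position.
aggregate(COL ~ id, data = DF_complete, FUN = median)
```

`DF_complete` can then be passed to `ggplot()` in place of the original data frame to redraw the boxplot.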


The following code is meant to confirm that no labels are shown in $data, and then to assign them.

I had to confirm that the sequence of $y matched the order of the groups, which are stored as a factor().
I did that by fixing the seed and inspecting the resulting order in RStudio.
However, this is not optimal, as I don't know how sorting and data types are handled internally.
Ideally, the original grouping names would be preserved.

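library(ggplot2)
library(magrittr)  # provides %>%, used further down

set.seed(111)
DF = data.frame(
  id  = factor( rep(LETTERS[1:5], 100), levels=LETTERS[1:5] ),
  COL = sample(1:20, 100, replace=TRUE)   # 100 values, recycled to match the 500 rows of id
)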

# A     B       C     D      E
# 6.0   12.0    8.0   12.5   10.0  <-- xmiddle / median?

bp = ggplot( DF , aes( COL, id ) ) + geom_boxplot()
bp

# === SEARCH FOR Categorical-labels + Median values ===

# Getting the built plot data: a list with one data frame per layer
Qggbp  = ggplot_build( bp )$data
typeof(Qggbp)    # list
Qggbp            # prints the data frame of each layer
row.names(Qggbp) # -> NULL (nothing is printed)
Qggbp$y          # -> NULL

# Extracting the data frame of the first (and only) layer
Qggbp  = Qggbp[[1]]
typeof(Qggbp)    # list (a data frame is a list of columns)
Qggbp            # a data frame with one row per box
row.names(Qggbp) # -> [1] "1" "2" "3" "4" "5"

# Realising that the groups are numbered instead of labeled
Qggbp$y          # -> [1] 1 2 3 4 5   /  attr(,"class")  /  [1] "mapped_discrete" "numeric"
Qggbp$y %>% as.numeric # -> [1] 1 2 3 4 5

# Setting the row names to the labels they are associated with.
row.names(Qggbp) <- levels( DF$id )
Qggbp

While writing this, I found this question from 5 years ago.

@yutannihilation
Member

I might not understand your request, but, as the last example in the geom_boxplot() documentation shows, you can calculate the necessary statistics beforehand if you need more control over the process. ggplot_build(...)$data just exposes the internal data, mainly for debugging, not for usability.

https://ggplot2.tidyverse.org/reference/geom_boxplot.html#ref-examples
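A minimal sketch of that approach, assuming the same kind of data as in the opening post: the box statistics are computed up front, kept in a data frame alongside the group labels, and drawn with `stat = "identity"`. Two choices here are assumptions rather than anything from the thread: the boxes are drawn vertically, as in the documentation example, and the statistics come from base R's `boxplot.stats()`, whose hinges can differ slightly from the quartiles that `geom_boxplot()` computes by default. The names `five` and `box_stats` are illustrative, and outliers are not drawn in this sketch.

```r
library(ggplot2)

set.seed(111)
DF <- data.frame(
  id  = factor(rep(LETTERS[1:5], each = 20), levels = LETTERS[1:5]),
  COL = sample(1:20, 100, replace = TRUE)
)

# One row per group: lower whisker, lower hinge, median, upper hinge, upper whisker.
groups <- split(DF$COL, DF$id)
five   <- t(vapply(groups, function(v) boxplot.stats(v)$stats, numeric(5)))

# The group label travels with the statistics as an ordinary column.
box_stats <- data.frame(
  id     = factor(rownames(five), levels = levels(DF$id)),
  ymin   = five[, 1],
  lower  = five[, 2],
  middle = five[, 3],
  upper  = five[, 4],
  ymax   = five[, 5]
)

# Draw the boxes from the precomputed values instead of letting the stat compute them.
p <- ggplot(box_stats, aes(x = id)) +
  geom_boxplot(
    aes(ymin = ymin, lower = lower, middle = middle, upper = upper, ymax = ymax),
    stat = "identity"
  )
p
```

Because `box_stats` is a plain data frame keyed by `id`, it can be filtered or joined back to the raw data before plotting, without relying on the order of rows in the built plot.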

@DiegoJArg
Author

Hi @yutannihilation,
I am not sure why you closed this; nothing changed.
ggplot_build(...)$data is the core of the feature request.

The feature request is just to add the row names by default to ggplot_build(...)$data according to the grouping variable. Or a column with associated grouping values. It isn't an intrusive feature.

BPdata = ggplot_build(...)$data[[1]]          # BPdata has rows without grouping labels
row.names(BPdata) <- levels( group_factor )   # Workaround: each row now has its respective group label

Also, the help page doesn't mention "only debugging, not for usability".

@yutannihilation
Member

I meant, ggplot_build(...)$data is the as-is data that is used internally. It's not where we modify or add features. If the internal data has no row names, we don't add row names (c.f. #4868 (comment)), sorry.

@DiegoJArg
Author

I have to say that I don't understand the reasoning.
And it won't hurt me if the feature is not accepted. :)

But you already have ggplot_build(...)$data$y, which, I am guessing, is a scalar row identifier.
It is probably meant for ordering, and the ordering probably comes from the grouping variable.
The grouping variable is also there in the coordinates.

I checked whether $y and the default $data order match the order of the grouping variable's levels, levels( group_factor ), and they did.

My main objective was to get a guaranteed identifier for the rows in $data that maps back to the initial variable group_factor.

Right now, row.names(BPdata) <- levels( group_factor ) is the simplest way I have found, but it relies on hoping that $data won't come back in a different order.
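One way to reduce the dependence on row order, sketched here under an assumption: if the default discrete scale places the factor levels that are actually present at positions 1, 2, 3, ... in level order (which matches what was observed above), then the integer stored in $y can be used to index into those levels row by row, wherever each row happens to sit. `layer_data(bp, 1)` is used as a shorthand that returns the same data frame as `ggplot_build(bp)$data[[1]]`; `used_levels` is an illustrative name.

```r
library(ggplot2)

set.seed(111)
DF <- data.frame(
  id  = factor(rep(LETTERS[1:5], each = 20), levels = LETTERS[1:5]),
  COL = sample(1:20, 100, replace = TRUE)
)
bp <- ggplot(DF, aes(COL, id)) + geom_boxplot()

# Built data of the first layer, one row per box.
BPdata <- layer_data(bp, 1)

# Assumption: position k on the discrete y scale is the k-th level in use.
# droplevels() removes unused levels, which the default scale also omits.
used_levels <- levels(droplevels(DF$id))

# Look the label up from each row's own $y value, not from its row position.
BPdata$id <- used_levels[as.integer(unclass(BPdata$y))]

BPdata[, c("id", "y", "xmiddle")]
```

The label is stored as a regular column instead of as row names, so it survives sorting and subsetting of `BPdata`. This still relies on the scale's ordering behavior, which is not a documented guarantee of the built data.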

@thomasp85
Member

$y is the y aesthetic, nothing more, nothing less. The internal data structure of ggplot2 is without row names because they are 100% unneeded and give a huge performance penalty in R. So, we won't add this to support a niche case like this.

@DiegoJArg
Author

My "y" aesthetic is a factor, not a double.
But that's OK, as long as it is related to it.
Thanks for answering

@yutannihilation
Member

Again, it's internal data. Numeric is the internal representation. You might feel OK or not OK about it, but it is what it is.
