Skip to content

Use vctrs internally #4868

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
Jul 5, 2022
Merged

Use vctrs internally #4868

merged 25 commits into from
Jul 5, 2022

Conversation

thomasp85
Copy link
Member

@thomasp85 thomasp85 commented Jun 10, 2022

Supersedes #4342

Fixes #4175

This PR aims to bring the internals of ggplot2 into a state that is compliant with vctrs data types, as well as clean up the implementation different places. The aim is not to convert every base function that has a vctrs equivalent - only that which is necessary or beneficial.

Current state includes:

  • removal of rbind_dfs() in favour of vec_rbind()
  • Use vec_cbind() instead of cbind() in the places it makes sense
  • Use vec_interleave() in our own interleave() function
  • Convert mapped_discrete class into a vctrs based class
  • Change all do.call() to rlang::inject() (not strictly vctrs related but in line with our move to splicing lists into calls)

@thomasp85 thomasp85 marked this pull request as ready for review June 15, 2022 07:19
@thomasp85
Copy link
Member Author

I've come to believe that it would be most beneficial to have this PR merged in with the current scope and only bring in more vctrs functions as we find need for it.

Copy link
Member

@DavisVaughan DavisVaughan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my main comment is that I want to make sure you know about vctrs::new_data_frame(). You switch from your custom new_data_frame() over to data_frame() in a large number of places, but I wonder if you might have been happier just getting the benefits of vctrs::new_data_frame() instead.

In particular:

  • In cases where you return an empty data frame, like data_frame(.name_repair = "minimal"), I think new_data_frame() is cleaner
  • new_data_frame() doesn't do any name repair, so you don't pay that cost at all and there is no .name_repair = "minimal" to set. So I think you could have kept some of your old code that looked like new_data_frame(list(...)) and been pretty happy

@lionel-
Copy link
Member

lionel- commented Jun 16, 2022

@DavisVaughan Regarding the switch of new_data_frame() to data_frame(), I think it's the right move if there is no performance issue because the former can easily cause internal errors in the vctrs code. In general it seems that we should disable input checking and validation only in performance critical code (with comments describing why the assumptions are safe) or in very low level data creation code.

@thomasp85
Copy link
Member Author

Thanks for the thorough review. Regarding the data_frame()choice over new_data_frame() this was chosen mainly because we have a lot of places where the data frame is constructed from scalars and vectors and new_data_frame() would then require it be wrapped in df_list(). While there are places where it may not be needed (e.g. for the empty data frame construction) It seems cleaner for the code base to use a single constructor function. It may make sense to create a thin wrapper around data_frame(..., .name_repair = "minimal") as suggested. I'm definitely open to that.

Is for the comments on the ggplot2_mapped_discrete class, the intent is really to make it behave like a numeric vector with all the quirks it may entail (for instance the weird coercion rules of factor and character). The only purpose of this class is to carry around information that this column was once a discrete data type.

@thomasp85
Copy link
Member Author

This PR prompted a review of the use of unrowname(). During this it occurred to me that it would make sense to remove row names at the point of injection i.e. in layer_data() in order to make sure that row names are not present internally.

This means that layer_data() no longer returns an unmodified version of the input data.frame hence the changed unit tests, but I feel this is a fair bargain... @clauswilke @yutannihilation @hadley please raise your voice if this is contentious to you:-)

@yutannihilation
Copy link
Member

I think that's fine. The conclusion of a discussion on layer_data() (#3018 (comment)) was that layer_data() is for debugging purposes and should return what's actually used internally. So, if we use unnamed data frames internally, we should just return it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ggplot breaks when facet variable is POSIXct with tz attribute specified
5 participants