-
Notifications
You must be signed in to change notification settings - Fork 73
flattening out the parse table #111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
flattening out the parse table #111
Conversation
Uses a specialised visitor-like approach.
Codecov Report
@@ Coverage Diff @@
## master #111 +/- ##
=========================================
Coverage ? 88.29%
=========================================
Files ? 18
Lines ? 658
Branches ? 0
=========================================
Hits ? 581
Misses ? 77
Partials ? 0
Continue to review full report at Codecov.
|
feb2873
to
069b81a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I'm not in love with the readr import (need to declare it in DESCRIPTION
, too), but we can postpone changing the recursive calls to visitors until later if it helps.
R/visit.R
Outdated
#' @return An updated parse table. | ||
#' @seealso context_to_terminals | ||
context_towards_terminals <- function(pd_nested, | ||
passed_lag_newlines, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe outer_
or parent_
instead of passed_
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
#' relative in `pd_nested`) will be converted into absolute. | ||
#' @inherit context_towards_terminals | ||
#' @seealso context_towards_terminals visitors | ||
context_to_terminals <- function(pd_nested, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we need to change to use this with pre_visit()
instead of a "manual" recursive call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that but I could not find an easy way to do that. In contrast to pre_visit()
, we pass scalars from one level to the other, not functions. Also, unlike in pre_visit()
, we do not map over the children only, but simultaneously over other columns too (pmap()
instead of map()
). You could probably create a function that can accommodate both (a pmap
visitor and for the usual case, we just have p = 1), but I felt it would not make things clearer since the tasks are not very similar in their implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
visit_context_to_terminals <- function(pd) {
pd <- context_towards_terminals(pd, pd$outer_lag_newlines, ...)
pd$child <- map2(pd$child, ..., function(x, y) {
x[["outer_lag_newlines"]] <- y
})
...
pd
}
This stores the information for the next stage of the visitor in the children, where it is then picked up as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this:
context_to_terminals <- function(pd_nested) {
if (is.null(pd_nested)) return()
pd_transformed <- context_towards_terminals(
pd_nested,
pd_nested$outer_lag_newlines,
pd_nested$outer_indent,
pd_nested$outer_spaces
)
pd_transformed$child <- pmap(
list(
pd_transformed$child,
pd_transformed$lag_newlines,
pd_transformed$indent,
pd_transformed$spaces),
function(child, lag_newlines, indent, spaces) {
if (is.null(child)) return(NULL)
child[1, "outer_lag_newlines"] <- lag_newlines
child[["outer_indent"]] <- indent
child[nrow(child), "outer_spaces"] <- spaces
child
})
pd_transformed
}
And changed context_towards_terminals
to
context_towards_terminals <- function(pd_nested,
outer_lag_newlines,
outer_indent,
outer_spaces) {
pd_nested$indent <- pd_nested$indent + outer_indent
pd_nested$lag_newlines <- pd_nested$lag_newlines + outer_lag_newlines
pd_nested$spaces <-
pd_nested$spaces + outer_spaces
pd_nested
}
Which could probably done more elegantly with a map_if()
.
In addition, we need to initialise the new columns in create_filler
pd_flat$outer_lag_newlines <- 0
pd_flat$outer_indent <- 0
pd_flat$outer_spaces <- 0
And in all functions that create new tokens such as add_brackets_in_pipe
.
All tests pass.
What do you think? I think initial version was more clear, did not need three new columns and had less (almost-)code duplication. Maybe there is some more simplification I missed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would only add these columns temporarily in this visitor, and then remove them from the final result.
Anyway, I thought we should keep using visitors instead of recursive calls, but if the code is more difficult to understand with the visitor, let's keep the recursive calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another advantage of using visitors though is that we can allow the user to take control over those transformations if the transformers are passed via the transformers
argument. Maybe we can change it later if we consider this to be a manipulation the user should control.
R/visit.R
Outdated
function(terminal, token, text, lag_newlines, spaces, indent, id, | ||
parent, line1, child) { | ||
if (terminal) { | ||
c(lag_newlines, indent, token, text, spaces, id, parent, line1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use list()
here to preserve types. (Don't need it at all if we switch to a post visitor.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, then we can forget about the type conversion with readr
.
R/visit.R
Outdated
#' Helper to extract terminals | ||
#' | ||
#' @param pd_nested A nested parse table. | ||
extract_terminals_helper <- function(pd_nested) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I think a "post" visitor might do a good job here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just tried that and it works.
extract_terminals <- function(pd_nested) {
flattened_pd <- post_visit(pd_nested, list(extract_terminals_helper))
flattened_pd
}
extract_terminals_helper <- function(pd_nested) {
is_terminal <- pd_nested$terminal
if (!any(pd_nested$terminal)) return(bind_rows(pd_nested$child))
bound <- bind_rows(
pd_nested$child[!is_terminal],
pd_nested[is_terminal,]
)
child_terminal <- map(pd_nested$child, function(x) {
if (is.null(x)) return(FALSE)
rep(TRUE, nrow(x))
})%>% flatten_lgl()
bound[order(c(which(child_terminal), which(!child_terminal))),]
}
As you can see, the problem is the sorting. We cannot sort according to line1
and col1
anymore because we have a rule that adds ()
to pipe calls, so in particular col1
become outdated (and I don't think it's worth updating all line and column information since we do that once flattened out anyways (and there it's easy).
Do you see a better way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just found a bug in add_brackets_in_pipe
. We can arrange according to line1
and col1
if there is no colision, ie.
a %>%
b
Will become
a %>%
b()
And since there is no token after ()
on that line, there are not two tokens with the same col1
and col2
value. If there were we might run into trouble. So I think it's not really a good idea to do so. I think that's more of a hack than anything else...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of sorting by (line1, col1), we could create our own sequential sort key; new tokens could then get intermediate values (e.g., for the closing parens, key + 1/3 and key + 2/3).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's true. If you consider that to be more elegant than the initial solution, I can implement that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can leave it for now and file a separate issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Now a post visitor is implemented. Should i still create the issue just that we have a sorting key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think this would be useful, but low prio.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reference (#112)
R/visit.R
Outdated
cumsum(flattened_pd$lag_newlines) + flattened_pd$line1[1] | ||
|
||
flattened_pd$newlines <- lead(flattened_pd$lag_newlines, default = 0) | ||
flattened_pd$nchar <- nchar(flattened_pd$text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to use nchar(..., type = "width")
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
R/visit.R
Outdated
#' `line1`. The same applies for `col1` and `col2`. | ||
#' @inheritParams choose_indention | ||
enrich_terminals <- function(flattened_pd, use_raw_indention = FALSE) { | ||
flattened_pd$lag_spaces <- lag(flattened_pd$spaces, default = 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps = 0L
to remain in the domain of integers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uups, changed it elsewhere. Now I think all default = 0
are changed to default = 0L
.
R/visit.R
Outdated
#' @importFrom readr type_convert col_integer cols | ||
extract_terminals <- function(pd_nested) { | ||
flat_vec <- extract_terminals_helper(pd_nested) %>% | ||
unlist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the helper uses list()
, this result can be simply combined using bind_rows()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure whether I understand you correctly. I used list()
instead of c()
in the helper in if (terminal)
. This still gives me some nested list, and if I call bind_rows()
on that, I get an error:
'getCharCE' must be called on a CHARSXP
Which I think is due to the fact that we try to bind nested lists. I tried to use purrr::transpose()
but I could not find a quick solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vaguely remember problems with bind_rows()
that might be resolved in the dev version, but that's not the main point.
Maybe a visitor solution along the following lines would work around the need to sort:
visit_extract_terminals <- function(pd_nested) {
pd_split <- split(pd_nested, seq_len(nrow(pd_nested)))
bind_rows(ifelse(pd_nested$terminal, pd_split, pd_nested$child))
}
It may be slow, but we can optimize later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, that's a good idea
Use visitor for terminal extraction instead of complicated and pmap / matrix construct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks good. Agree to your last comment about converting to a visitor, can you please file an issue?
NAMESPACE
Outdated
@@ -21,4 +21,7 @@ importFrom(purrr,pmap) | |||
importFrom(purrr,pwalk) | |||
importFrom(purrr,reduce) | |||
importFrom(purrr,when) | |||
importFrom(readr,col_integer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ups no.
R/serialize.R
Outdated
@@ -66,6 +68,28 @@ serialize_parse_data_flat <- function(pd_flat) { | |||
collapse = "")) %>% | |||
.[["text_ws"]] %>% | |||
strsplit("\n", fixed = TRUE) %>% | |||
.[[1L]] | |||
.[[1L]] %>% | |||
trimws(which = "right") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to trim whitespace after serializing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we don't do it elsewhere anymore and I think it's easiest done here. I think we once had a transformer for it but that does not work if we do not want to touch indention (next topic after this PR is merged) because there, the spaces after the token are the spaces after the token and the line breaks (so spaces = indention for last token on a line). In your initial version of styler, we trimmed the initial text and replaced tabs with two white spaces. Why did we remove it there again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We did the trimming in the initial text exclusively for the roundtrip test, which we currently don't have. For serialization, we're inserting spaces ourselves, and I'd rather avoid inserting extra space (by adapting newlines_and_spaces()
if necessary) instead of trimming after the fact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, you are right. In fact, it already works like that, just not for one edge case: empty comments. This is because we have the rule that adds a space at the beginning of each comment.
#'
So now I changed the rule to only add a space if the comment is non-empty (see adapted help file) and reverted 3b193f0.
R/serialize.R
Outdated
.[["text_ws"]] %>% | ||
strsplit("\n", fixed = TRUE) %>% | ||
.[[1L]] %>% | ||
trimws(which = "right") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
R/visit.R
Outdated
#' `line1`. The same applies for `col1` and `col2`. | ||
#' @inheritParams choose_indention | ||
enrich_terminals <- function(flattened_pd, use_raw_indention = FALSE) { | ||
flattened_pd$lag_spaces <- lag(flattened_pd$spaces, default = 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
R/visit.R
Outdated
#' @importFrom readr type_convert col_integer cols | ||
extract_terminals <- function(pd_nested) { | ||
if (is.null(pd_nested)) return(pd) | ||
pd_splitted <- split(pd_nested, seq_len(nrow(pd_nested))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
splitted -> split
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uuups
Yes, issue #113 filed regarding the advantage of using visitors instead of recursive function calls for enhanced user control. |
8ddc1c1
to
bcc2218
Compare
This is the first part of #106.
The nested parse table is turned into a flat parse table (called
flattened_pd
so it can be distinguished frompd_flat
) by propagating all important attributes to the terminals first and then extracting the terminals from the nested structure. Serialization is done on that flat table, very similar to the serialization of the flat approach.