Skip to content

flattening out the parse table #111

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Aug 7, 2017

Conversation

lorenzwalthert
Copy link
Collaborator

This is the first part of #106.

The nested parse table is turned into a flat parse table (called flattened_pd so it can be distinguished from pd_flat) by propagating all important attributes to the terminals first and then extracting the terminals from the nested structure. Serialization is done on that flat table, very similar to the serialization of the flat approach.

@lorenzwalthert lorenzwalthert requested a review from krlmlr August 3, 2017 17:29
@codecov
Copy link

codecov bot commented Aug 3, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@93e529b). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #111   +/-   ##
=========================================
  Coverage          ?   88.29%           
=========================================
  Files             ?       18           
  Lines             ?      658           
  Branches          ?        0           
=========================================
  Hits              ?      581           
  Misses            ?       77           
  Partials          ?        0
Impacted Files Coverage Δ
R/visit.R 100% <100%> (ø)
R/serialize.R 41.3% <100%> (ø)
R/nested.R 98.27% <100%> (ø)
R/serialized_tests.R 100% <100%> (ø)
R/transform.R 86.66% <100%> (ø)
R/rules-other.R 100% <100%> (ø)
R/parsed.R 97.72% <100%> (ø)
R/rules-spacing.R 94.33% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 93e529b...bcc2218. Read the comment docs.

@lorenzwalthert lorenzwalthert force-pushed the pr106_1_flattening_out branch from feb2873 to 069b81a Compare August 3, 2017 18:17
Copy link
Member

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I'm not in love with the readr import (need to declare it in DESCRIPTION, too), but we can postpone changing the recursive calls to visitors until later if it helps.

R/visit.R Outdated
#' @return An updated parse table.
#' @seealso context_to_terminals
context_towards_terminals <- function(pd_nested,
passed_lag_newlines,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe outer_ or parent_ instead of passed_?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

#' relative in `pd_nested`) will be converted into absolute.
#' @inherit context_towards_terminals
#' @seealso context_towards_terminals visitors
context_to_terminals <- function(pd_nested,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we need to change to use this with pre_visit() instead of a "manual" recursive call?

Copy link
Collaborator Author

@lorenzwalthert lorenzwalthert Aug 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that but I could not find an easy way to do that. In contrast to pre_visit(), we pass scalars from one level to the other, not functions. Also, unlike in pre_visit(), we do not map over the children only, but simultaneously over other columns too (pmap() instead of map()). You could probably create a function that can accommodate both (a pmap visitor and for the usual case, we just have p = 1), but I felt it would not make things clearer since the tasks are not very similar in their implementation.

Copy link
Member

@krlmlr krlmlr Aug 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

visit_context_to_terminals <- function(pd) {
  pd <- context_towards_terminals(pd, pd$outer_lag_newlines, ...)
  pd$child <- map2(pd$child, ..., function(x, y) { 
    x[["outer_lag_newlines"]] <- y
  })
  ...
  pd
}

This stores the information for the next stage of the visitor in the children, where it is then picked up as needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this:

context_to_terminals <- function(pd_nested) {

  if (is.null(pd_nested)) return()

  pd_transformed <- context_towards_terminals(
    pd_nested,
    pd_nested$outer_lag_newlines,
    pd_nested$outer_indent,
    pd_nested$outer_spaces
  )

  pd_transformed$child <- pmap(
    list(
      pd_transformed$child,
      pd_transformed$lag_newlines,
      pd_transformed$indent,
      pd_transformed$spaces),
    function(child, lag_newlines, indent, spaces) {
      if (is.null(child)) return(NULL)
      child[1, "outer_lag_newlines"] <- lag_newlines
      child[["outer_indent"]] <- indent
      child[nrow(child), "outer_spaces"] <- spaces
      child
    })
  pd_transformed
}

And changed context_towards_terminals to

context_towards_terminals <- function(pd_nested,
                                      outer_lag_newlines,
                                      outer_indent,
                                      outer_spaces) {
  pd_nested$indent <- pd_nested$indent + outer_indent
  pd_nested$lag_newlines <- pd_nested$lag_newlines + outer_lag_newlines
  pd_nested$spaces <-
    pd_nested$spaces + outer_spaces
  pd_nested
}

Which could probably done more elegantly with a map_if().
In addition, we need to initialise the new columns in create_filler

pd_flat$outer_lag_newlines <- 0
  pd_flat$outer_indent <- 0
  pd_flat$outer_spaces <- 0

And in all functions that create new tokens such as add_brackets_in_pipe.
All tests pass.
What do you think? I think initial version was more clear, did not need three new columns and had less (almost-)code duplication. Maybe there is some more simplification I missed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would only add these columns temporarily in this visitor, and then remove them from the final result.

Anyway, I thought we should keep using visitors instead of recursive calls, but if the code is more difficult to understand with the visitor, let's keep the recursive calls.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another advantage of using visitors though is that we can allow the user to take control over those transformations if the transformers are passed via the transformers argument. Maybe we can change it later if we consider this to be a manipulation the user should control.

R/visit.R Outdated
function(terminal, token, text, lag_newlines, spaces, indent, id,
parent, line1, child) {
if (terminal) {
c(lag_newlines, indent, token, text, spaces, id, parent, line1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use list() here to preserve types. (Don't need it at all if we switch to a post visitor.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, then we can forget about the type conversion with readr.

R/visit.R Outdated
#' Helper to extract terminals
#'
#' @param pd_nested A nested parse table.
extract_terminals_helper <- function(pd_nested) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I think a "post" visitor might do a good job here as well.

Copy link
Collaborator Author

@lorenzwalthert lorenzwalthert Aug 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tried that and it works.

extract_terminals <- function(pd_nested) {
  flattened_pd <- post_visit(pd_nested, list(extract_terminals_helper))
  flattened_pd
}
extract_terminals_helper <- function(pd_nested) {
  is_terminal <- pd_nested$terminal
  if (!any(pd_nested$terminal)) return(bind_rows(pd_nested$child))
  bound <- bind_rows(
    pd_nested$child[!is_terminal],
    pd_nested[is_terminal,]
  )
  
  child_terminal <- map(pd_nested$child, function(x) {
    if (is.null(x)) return(FALSE)
    rep(TRUE, nrow(x))
  })%>% flatten_lgl()
  bound[order(c(which(child_terminal), which(!child_terminal))),]
}

As you can see, the problem is the sorting. We cannot sort according to line1 and col1 anymore because we have a rule that adds () to pipe calls, so in particular col1 become outdated (and I don't think it's worth updating all line and column information since we do that once flattened out anyways (and there it's easy).
Do you see a better way?

Copy link
Collaborator Author

@lorenzwalthert lorenzwalthert Aug 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just found a bug in add_brackets_in_pipe. We can arrange according to line1 and col1 if there is no colision, ie.

a %>%
  b

Will become

a %>%
  b()

And since there is no token after () on that line, there are not two tokens with the same col1 and col2 value. If there were we might run into trouble. So I think it's not really a good idea to do so. I think that's more of a hack than anything else...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of sorting by (line1, col1), we could create our own sequential sort key; new tokens could then get intermediate values (e.g., for the closing parens, key + 1/3 and key + 2/3).

Copy link
Collaborator Author

@lorenzwalthert lorenzwalthert Aug 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's true. If you consider that to be more elegant than the initial solution, I can implement that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can leave it for now and file a separate issue.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Now a post visitor is implemented. Should i still create the issue just that we have a sorting key?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this would be useful, but low prio.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reference (#112)

R/visit.R Outdated
cumsum(flattened_pd$lag_newlines) + flattened_pd$line1[1]

flattened_pd$newlines <- lead(flattened_pd$lag_newlines, default = 0)
flattened_pd$nchar <- nchar(flattened_pd$text)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to use nchar(..., type = "width") here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

R/visit.R Outdated
#' `line1`. The same applies for `col1` and `col2`.
#' @inheritParams choose_indention
enrich_terminals <- function(flattened_pd, use_raw_indention = FALSE) {
flattened_pd$lag_spaces <- lag(flattened_pd$spaces, default = 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps = 0L to remain in the domain of integers?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uups, changed it elsewhere. Now I think all default = 0 are changed to default = 0L.

R/visit.R Outdated
#' @importFrom readr type_convert col_integer cols
extract_terminals <- function(pd_nested) {
flat_vec <- extract_terminals_helper(pd_nested) %>%
unlist()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the helper uses list(), this result can be simply combined using bind_rows().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure whether I understand you correctly. I used list() instead of c() in the helper in if (terminal). This still gives me some nested list, and if I call bind_rows() on that, I get an error:

'getCharCE' must be called on a CHARSXP

Which I think is due to the fact that we try to bind nested lists. I tried to use purrr::transpose() but I could not find a quick solution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vaguely remember problems with bind_rows() that might be resolved in the dev version, but that's not the main point.

Maybe a visitor solution along the following lines would work around the need to sort:

visit_extract_terminals <- function(pd_nested) {
  pd_split <- split(pd_nested, seq_len(nrow(pd_nested)))
  bind_rows(ifelse(pd_nested$terminal, pd_split, pd_nested$child))
}

It may be slow, but we can optimize later.

Copy link
Collaborator Author

@lorenzwalthert lorenzwalthert Aug 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that's a good idea

@lorenzwalthert lorenzwalthert requested a review from krlmlr August 7, 2017 07:17
Copy link
Member

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good. Agree to your last comment about converting to a visitor, can you please file an issue?

NAMESPACE Outdated
@@ -21,4 +21,7 @@ importFrom(purrr,pmap)
importFrom(purrr,pwalk)
importFrom(purrr,reduce)
importFrom(purrr,when)
importFrom(readr,col_integer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ups no.

R/serialize.R Outdated
@@ -66,6 +68,28 @@ serialize_parse_data_flat <- function(pd_flat) {
collapse = "")) %>%
.[["text_ws"]] %>%
strsplit("\n", fixed = TRUE) %>%
.[[1L]]
.[[1L]] %>%
trimws(which = "right")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to trim whitespace after serializing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we don't do it elsewhere anymore and I think it's easiest done here. I think we once had a transformer for it but that does not work if we do not want to touch indention (next topic after this PR is merged) because there, the spaces after the token are the spaces after the token and the line breaks (so spaces = indention for last token on a line). In your initial version of styler, we trimmed the initial text and replaced tabs with two white spaces. Why did we remove it there again?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did the trimming in the initial text exclusively for the roundtrip test, which we currently don't have. For serialization, we're inserting spaces ourselves, and I'd rather avoid inserting extra space (by adapting newlines_and_spaces() if necessary) instead of trimming after the fact.

Copy link
Collaborator Author

@lorenzwalthert lorenzwalthert Aug 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, you are right. In fact, it already works like that, just not for one edge case: empty comments. This is because we have the rule that adds a space at the beginning of each comment.

#' 

So now I changed the rule to only add a space if the comment is non-empty (see adapted help file) and reverted 3b193f0.

R/serialize.R Outdated
.[["text_ws"]] %>%
strsplit("\n", fixed = TRUE) %>%
.[[1L]] %>%
trimws(which = "right")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

R/visit.R Outdated
#' `line1`. The same applies for `col1` and `col2`.
#' @inheritParams choose_indention
enrich_terminals <- function(flattened_pd, use_raw_indention = FALSE) {
flattened_pd$lag_spaces <- lag(flattened_pd$spaces, default = 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

R/visit.R Outdated
#' @importFrom readr type_convert col_integer cols
extract_terminals <- function(pd_nested) {
if (is.null(pd_nested)) return(pd)
pd_splitted <- split(pd_nested, seq_len(nrow(pd_nested)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

splitted -> split

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uuups

@lorenzwalthert
Copy link
Collaborator Author

Yes, issue #113 filed regarding the advantage of using visitors instead of recursive function calls for enhanced user control.

@lorenzwalthert lorenzwalthert requested a review from krlmlr August 7, 2017 08:56
@lorenzwalthert lorenzwalthert force-pushed the pr106_1_flattening_out branch from 8ddc1c1 to bcc2218 Compare August 7, 2017 09:18
@lorenzwalthert lorenzwalthert merged commit 5f57d9f into r-lib:master Aug 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants