Pass lexer data to nodes #5045


Closed · wants to merge 10 commits

Conversation

@GeoffreyBooth (Collaborator) commented Apr 29, 2018

Along the lines of #5044 and #4984, this PR adds a way to pass arbitrary token data from the lexer to the node classes. Just as every node class since #4572 has a @comments property when comments should be output before or after that node, with this PR there’s now also a @data property where we can stash extra data from the lexer or rewriter to sneak through the parser and be accessible in the node classes.

In particular, this should make #5019 solvable. Besides creating the new code to allow this data property on tokens to sneak through the parser, this PR adds some data to the StringLiteral token that should allow us to move a lot of the string manipulation logic out of lexer.coffee and into nodes.coffee. This is more of an example at this point, with the actual moving of that logic saved for a future PR. If you set a breakpoint in the compileNode method of the StringLiteral class in nodes.coffee, you should see @data present with the values we set in the lexer.
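To make the flow concrete, here is a minimal JavaScript sketch of the idea (all names and the token shape are illustrative, not the actual lexer.coffee/nodes.coffee code): the lexer stashes extra info on a token’s data property, and the node class later reads it as @data.

```javascript
// Lexer side: a token is roughly [tag, value, locationData]; we attach `data`.
function makeStringToken(value, locationData) {
  const token = ['STRING', value, locationData];
  token.data = {
    indent: '  ', // hypothetical example fields
    quote: '"',
  };
  return token;
}

// Node side: after token data is reattached as `this.data`, compileNode can
// use it, e.g. deindenting multiline strings here rather than in the lexer.
class StringLiteral {
  constructor(value, data) {
    this.value = value;
    this.data = data; // reattached token data
  }
  compileNode() {
    const { indent } = this.data || {};
    return indent != null
      ? this.value.replace(new RegExp(`\\n${indent}`, 'g'), '\n')
      : this.value;
  }
}
```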

@zdenko I think the code at lexer.coffee#L307-L315 is where multiline ' and " strings get deindented; thanks to the extra data I’m attaching to StringLiterals in lexer.coffee#L770-L773, the former code block should be able to move into the StringLiteral nodes class. (A lot of the string code could move over there, like the lexer functions about escapes and so on.)

Even if we didn’t bother moving whatever we could out of the lexer into nodes (though we should do that), simply being able to pass raw data from the lexer to the nodes should make building a complete AST possible.

There’s one caveat to this PR. Technically this new data property isn’t passing through the parser, just as the old comments property isn’t; we compare the location data (start row/column, end row/column) of each token with a data or comments property against the location data of parser-generated nodes, and the properties are reattached when we find matches. This is only an issue when multiple tokens share the same location data, which is rare but does happen; the only example I can find is a generated JS token added to the start or end of a file to hold comments, but presumably similar “special case” or generated tokens will also share location data with user-generated ones. For comments, overlapping isn’t an issue; I just combine all the comments together. For token data that isn’t an option, but I don’t think it should be a problem, since no “special” tokens should ever have data that needs to be preserved. If I’m wrong about this, we’ll need to patch Jison to truly allow extra data to be passed through the parser. cc @helixbass
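Roughly, the reattachment described here amounts to indexing token data by location and looking nodes up under the same key. A hypothetical JavaScript sketch, not the actual compiler code (the location fields follow Jison’s first_line/first_column/last_line/last_column convention):

```javascript
// Build a stable key from Jison-style location data.
function locationKey(loc) {
  return [loc.first_line, loc.first_column, loc.last_line, loc.last_column].join(':');
}

// Match tokens carrying `data` to parser-generated nodes by location,
// since the data itself doesn't survive the parser.
function reattachTokenData(tokens, nodes) {
  const dataByLocation = new Map();
  for (const token of tokens) {
    if (token.data) dataByLocation.set(locationKey(token[2]), token.data);
  }
  for (const node of nodes) {
    const data = dataByLocation.get(locationKey(node.locationData));
    if (data) node.data = data; // caveat: two tokens sharing a location collide
  }
}
```

The collision caveat in the comment is exactly the overlap problem described above: a Map keyed on location can hold only one token’s data per position.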

@helixbass (Collaborator)

@GeoffreyBooth I've been starting to expose AST types (on my prettier branch) and correspondingly add formatting rules for those types in Prettier (on my fork of your coffeescript branch). So far I've just done some simple cases but I could try something trickier using this branch's token data mechanism - what are the known cases where we currently wouldn't have enough data in our initialized nodes to be able to properly emit an AST?

Eg I'm aware of non-preservation of heredoc indents, but I'd be inclined to try and rework that so that the whole string literal makes it directly through the grammar intact. In general, it seems like we should only use this token data as a fallback "escape hatch" if it's not feasible/worth it to rework the lexer/grammar to pass the missing information through more directly?

A couple notes on what I've been working on:
I didn't directly use code from your nodes branch but started doing something along similar lines to how you had a generic recursive toPlainObject(). I used different naming - I don't see a reason to "overwrite" the existing --nodes behavior (which I find useful), so I exposed AST generation via a --ast/ast option, and I called the method toAst() rather than toPlainObject()

There is clearly lots of potential for reuse from the work I've done for generating JS AST. I've started fleshing out a generic way for the JS AST compilation to default to reusing the Coffeescript AST generation code for that type, with an optional precompile() hook (see eg Base::_compileToBabylon()). Since we're trying to stick close to the corresponding JS AST types, a lot of the logic I wrote for generating JS AST has been fully shareable

Also fleshing out a generic structure for how to specify AST generation for a given node type as declaratively as possible (see Base::toAst()/Base::_toAst()):

  • for the AST type it'll look for a declared astType: (defaulting to the node class name), which can be a simple string or a function with logic based on the current instance (returning a string, the type)
  • for AST "children" (ie full-fledged sub-nodes), it'll look for astChildren: (defaulting to @children), which can be either a list of prop names (like @children, if the AST field naming matches ours) or an object which can specify field-name mapping and compilation level
  • for other AST "props" (ie fields that aren't full sub-nodes) besides type (and location data), it looks for astProps: which can similarly be a list, object, or function
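Assuming the scheme described above, a stripped-down JavaScript sketch of how such a declarative toAst() might resolve astType/astChildren/astProps (the surrounding machinery is invented for illustration; the real code lives on the prettier branch):

```javascript
class Base {
  toAst() {
    // astType: string or function, defaulting to the class name.
    const astType =
      typeof this.astType === 'function'
        ? this.astType()
        : this.astType || this.constructor.name;
    const ast = { type: astType };
    // astChildren: list of prop names holding full sub-nodes, default @children.
    for (const name of this.astChildren || this.children || []) {
      const child = this[name];
      if (child != null) ast[name] = child.toAst ? child.toAst() : child;
    }
    // astProps: extra non-node fields for the AST.
    Object.assign(ast, typeof this.astProps === 'function' ? this.astProps() : this.astProps || {});
    return ast;
  }
}

class NumberLiteral extends Base {
  constructor(value) { super(); this.value = value; }
}
NumberLiteral.prototype.astType = 'NumericLiteral';
NumberLiteral.prototype.astProps = function () { return { value: Number(this.value) }; };
```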

As far as the corresponding work getting Prettier to generate formatted Coffeescript, I used your coffeescript branch as a starting point (thanks for getting that all wired up), and per your suggestion swapped out the coffeescript dependency for a local dependency pointing at my prettier branch-in-progress. I've been mimicking the corresponding code in printer-estree.js - I'm not sure whether it'd be worth trying to get any actual code reuse there but I'm sure there will be lots of benefits to sticking closely to the way they've structured the code for JS generation

Getting Prettier to be able to gradually emit more and more formatted Coffeescript constructs is a nice way to be able to measure progress. There are some obvious bigger targets like being able to run Coffeescript test files through Prettier and have the reformatted version pass. But it'd probably be worth figuring out how to start building an actual Prettier test suite that I can add examples to and use to check for regressions as I'm adding new types and refining formatting

Thus far my process has been to modify a simple test.coffee and run eg ./bin/prettier.js --parser coffeescript --print-width 20 test.coffee to dump the Prettier-formatted version of test.coffee (ie artificially constricting/modifying the line width via --print-width to make it easier to check Prettier's linebreaking behavior for different constructs)

@GeoffreyBooth (Collaborator, Author)

Hey @helixbass, thanks for all this. I think you and @zdenko and I should get on a call together soon to coordinate. Some thoughts:

Eg I’m aware of non-preservation of heredoc indents, but I’d be inclined to try and rework that so that the whole string literal makes it directly through the grammar intact. In general, it seems like we should only use this token data as a fallback “escape hatch” if it’s not feasible/worth it to rework the lexer/grammar to pass the missing information through more directly?

Yes, that makes sense. The indent is the big one; I’m going to update this branch so that the data property is created for all tokens, with a few universal properties (such as indent, plus the token tag name and token value) always attached. Basically any manipulation you see happening in the lexer now is potentially stuff that we can’t do in the nodes without extra token data; if it could be done in the nodes, it would’ve been implemented in the node classes in the first place (unless the feature was added by a more novice contributor). But sure, if there’s a way to do everything without needing data, that would be the ideal solution. I think in the short term I want to just add data, and then refactoring individual nodes to not need it can be a piecemeal process.
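A sketch of what “data on every token” could look like, using the universal fields named here (tag, value, indent); the helper and token shape are hypothetical:

```javascript
// Attach a `data` object to every token, always carrying the tag, value, and
// any indentation info, while preserving data a token may already have.
function attachUniversalData(tokens) {
  for (const token of tokens) {
    const [tag, value] = token;
    token.data = { tag, value, indent: token.indent ?? '', ...token.data };
  }
  return tokens;
}
```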

I didn’t directly use code from your nodes branch but started doing something along similar lines to how you had a generic recursive toPlainObject(). I used different naming - I don’t see a reason to “overwrite” the existing --nodes behavior (which I find useful), so I exposed AST generation via a --ast/ast option, and I called the method toAst() rather than toPlainObject()

In #5044 @jashkenas asked to rename toPlainObject to toJSON, and I see no reason not to use that name. As for overriding --nodes, we’re trying to avoid creating new flags, so I repurposed that one. We occasionally use environment variables to override things, so maybe NODES_OUTPUT=plain coffee --nodes could generate the current output while coffee --nodes generates the JSON AST.

There is clearly lots of potential for reuse from the work I’ve done for generating JS AST.

For sure.

Thus far my process has been to modify a simple test.coffee and run eg ./bin/prettier.js --parser coffeescript --print-width 20 test.coffee to dump the Prettier-formatted version of test.coffee (ie artificially constricting/modifying the line width via --print-width to make it easier to check Prettier’s linebreaking behavior for different constructs)

I was planning to not have Prettier break lines, as CoffeeScript is a significant-whitespace language and the concept of a column limit clashes with the idea that indentation is meaningful. You can see this in the CoffeeScript codebase itself, which I feel really shouldn’t be contorting itself to mostly keep inside a column limit: it’s lots of hacks, of knowing that “and” will trigger an implied line continuation, and stuff like that. I don’t think this is a good style to encourage, and Prettier should be outputting a recommended style.

At the very least, I would focus on just getting Prettier to output legal source code, and worry about line breaks later.

@jashkenas (Owner)

I think if you're going to keep the existing --nodes output as well, then adding a new flag for this (and a new option in the programmatic API) is totally fine. coffee --json or coffee --ast sound good.

The intended effect of the main prohibition against adding flags is to avoid having CoffeeScript parse/compile/evaluate different programs differently, depending on the flags passed. I want to avoid a situation where if you use a certain technique in your code, you need to pass a certain flag — hurting interoperability and copy-pasting.

@vendethiel (Collaborator)

I realize it’s something that’s done already, but the idea of adding such « recognition mechanisms » to nodes seems wrong.

@jashkenas (Owner) commented May 1, 2018

@vendethiel — how do you mean, exactly?

Edit: I mean, the token data dictionary does feel like a nasty kludge ... but I'm not sure what the clean alternative would be for what Geoffrey is trying to do here.

@GeoffreyBooth (Collaborator, Author)

the token data dictionary does feel like a nasty kludge

Yes, I pored over parser.js for a few hours to see if there was a way to patch it to simply pass through the token data, along the lines of (pseudocode):

tokens = [ ... array of token objects, some with data properties ... ]
nodes = parser(tokens)
code = nodes.compile()

As in, the token objects in the tokens array seem to be passing through the parser on their way to becoming nodes, so shouldn’t there be some way to attach a property to those objects that can stowaway through the parser and be used in the node classes? But the more I look at the parser, the more I think it might not be possible, because a token object is not one-to-one with a node. Nodes are groups of tokens:

o 'IMPORT String', -> new ImportDeclaration null, $2

An ImportDeclaration node is therefore made up of two tokens, IMPORT and String (let’s assume for the sake of simplicity that String is always just a StringLiteral). StringLiteral is its own node class, so there’s data that could be passed to it; and in this particular case you could attach data to IMPORT that affects ImportDeclaration. But there’s no direct parallel between the tokens and ImportDeclaration, and this is one of the simplest cases; in the more complex ones, I don’t know how I would map token data onto nodes in meaningful ways.
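A toy JavaScript illustration of the mismatch: the production consumes two tokens but yields one node, so the IMPORT token, and any data attached to it, has no obvious home on the result (the node class and reduction function are invented stand-ins for what the grammar action does):

```javascript
class ImportDeclaration {
  constructor(clause, source) {
    this.clause = clause;
    this.source = source;
  }
}

// Pseudo-reduction for: o 'IMPORT String', -> new ImportDeclaration null, $2
function reduceImport(importToken, stringToken) {
  // $1 (the IMPORT token) is discarded entirely; any `importToken.data`
  // is lost here unless we invent a convention for where it should land.
  return new ImportDeclaration(null, stringToken[1]);
}
```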

Comments are so low priority that the kludge felt fine, but if there’s a better way to attach lexer/token data to get it more sensibly through to the nodes, I’m open to suggestions.

@jashkenas (Owner)

I think that the easiest way to find out would be to simply ask @zaach. We could open a question ticket on https://github.com/zaach/jison/issues

Otherwise, a Jison patch would probably be needed...

@vendethiel (Collaborator) commented May 1, 2018

Yes, I pored over parser.js for a few hours to see if there was a way to patch it to simply pass through the token data, along the lines of (psuedocode):

We already do "pass data" in the sense of $1 $2 $3.... Could some of these be made objects so that we can append .data to them?
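A sketch of that suggestion: wrap each semantic value in an object so .data can ride along (the wrapper is hypothetical; what Jison actually passes as $1/$2 is whatever value the lexer hands it):

```javascript
// Instead of passing a bare string/value as the token's semantic value,
// pass an object carrying both the value and any extra lexer data.
function wrapValue(value, data) {
  return { value, data };
}

// A grammar action could then read both, e.g.:
// o 'IMPORT String', -> new ImportDeclaration null, $2.value, $2.data
```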

@helixbass (Collaborator)

@GeoffreyBooth

I think you and @zdenko and I should get on a call together soon to coordinate

Sure

In #5044 @jashkenas asked to rename toPlainObject to toJSON, and I see no reason not to use that name

.toJSON() seems ok to me as long as it's not weird to conflate its meaning in overriding JSON.stringify() behavior with its significance as "this generates an AST (object) representation". I guess that makes sense given that the AST object is supposed to be an "equivalent" "serialization" of the actual node object
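For context on the overlap being discussed: JSON.stringify calls an object's toJSON method when one exists, so naming the AST serializer toJSON would make nodes stringify into their AST form automatically (FakeNode is an invented example class):

```javascript
class FakeNode {
  constructor() {
    this.internal = 'compiler bookkeeping'; // never serialized
  }
  // JSON.stringify will call this and serialize its return value instead
  // of the node's raw fields.
  toJSON() {
    return { type: 'FakeNode' }; // the AST representation
  }
}
```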

I was planning to not have Prettier break lines, as CoffeeScript is a significant-whitespace language and the concept of a column limit clashes with the idea that indentation is meaningful

Ya there are a lot of interesting questions here. I also instinctively don't really see Coffeescript as fitting inside Prettier's "box" (and, per @jashkenas' comment, the logistics of guaranteeing correctness of auto-formatted Coffeescript may be more intricate than Javascript). However, line-breaking smartly is at the core of Prettier (see the paper its algorithm is based on and its description of its formatting primitives) so I don't think it's a question of punting completely on specifying line-breaking. And I still see Prettier as the logical initial choice (at least as far as what I'm interested in working on) for an engine for auto-formatting a Coffeescript AST

Thus far I've been specifying basic line-breaking rules like "you can line-break array elements/function params/call args" or "you can line-break after an = in an assignment". I think you raise a wise word of caution that we don't want to get overzealous with linebreaking that's not good Coffeescript style. I will surely have a more nuanced grasp of how to use Prettier's formatting primitives to wrangle certain stylized formatting choices as time goes on

But to take your example, if we really didn't think a line should ever break after an and/or, you just don't specify an optional line (break) at that position when formatting a binary operation. I don't personally have a big problem with picturing it automatically splitting up complicated logic conditions after and/or (maybe with some guiding parens). But I think we can definitely punt on arguments about what specifically should be considered good style or not, that should be the fun part once we're in a pretty good position as far as having it working reasonably

I started looking at the work-in-progress Prettier Python formatter (linked from the Prettier homepage), curious about what choices another whitespace-indented language would make. I did notice some line-breaking backslashes, which is something I'd instinctively avoid

But actually the main thing I came away with for now was that they seem to be taking advantage of a Prettier plugin architecture to allow their development to be in a separate repo from Prettier proper. I'm guessing this is recommended (eg the Ruby one was set up similarly) so I mimicked its setup and yanked the core parsing/formatting code I've been working on from my coffeescript Prettier branch into a new prettier-plugin-coffeescript repo. So now that should be set up for starting to add Prettier-formatted snapshots (the recommended way to test Prettier)

@GeoffreyBooth (Collaborator, Author)

We do have a coffeescript organization, perhaps we should put the prettier-plugin-coffeescript repo there?

I would say we should at least focus on using Prettier as a code generator first, before worrying about line breaks. And if Prettier lacks an option to disable line breaking, we should add one. I understand that that functionality is part of its core mission, but whitespace-dependent languages are different. I think inevitably we’d end up with line-continuation backslashes all over the place, like Python, unless we took a less strict approach about line breaks (like maybe put objects inline when they’re short, and expand to multiline syntax only when they’re too long for a single line, for example). What does Ruby do?

@GeoffreyBooth (Collaborator, Author)

Okay, I updated this branch so that all tokens have metadata attached to them, with at least the common properties (tag, value, and indentation info). I also did some inspection on which node classes end up with this token data.

The following 23 node classes get data passed from tokens:

  • BooleanLiteral
  • CSXTag
  • Elision
  • Expansion
  • ExportSpecifier
  • FuncGlyph
  • IdentifierLiteral
  • ImportDefaultSpecifier
  • ImportSpecifier
  • InfinityLiteral
  • Literal
  • NaNLiteral
  • NullLiteral
  • NumberLiteral
  • PassthroughLiteral
  • PropertyName
  • RegexLiteral
  • RegexWithInterpolations
  • StatementLiteral
  • StringLiteral
  • StringWithInterpolations
  • ThisLiteral
  • UndefinedLiteral

And the following 16 node classes don’t get any token data:

  • Arr
  • Assign
  • AwaitReturn
  • Existence
  • ExportNamedDeclaration
  • For
  • If
  • ImportDeclaration
  • In
  • Obj
  • Parens
  • Splat
  • SuperCall
  • Switch
  • Throw
  • While

And the following 14 node classes sometimes get token data:

  • Access
  • Block
  • Call
  • Class
  • Code
  • Index
  • Op
  • Param
  • Range
  • Return
  • Slice
  • Super
  • Try
  • Value

Most if not all of the latter two groups have child nodes that do have token data. But I’m starting to wonder how useful this is. What data is there in the lexer that we might need for a “complete” AST that isn’t already available in the nodes? If we can solve #5019 without this data, is there any reason we need the token data?

@GeoffreyBooth (Collaborator, Author)

Closing for now, in the hope that we won’t need this to be able to produce a “complete” AST. Will reopen if it turns out we need this workaround.
