Pass lexer data to nodes #5045
@GeoffreyBooth I've been starting to expose AST types (on my …). Eg I'm aware of non-preservation of heredoc indents, but I'd be inclined to try and rework that so that the whole string literal makes it directly through the grammar intact. In general, it seems like we should only use this token …

A couple notes on what I've been working on:

There is clearly lots of potential for reuse from the work I've done for generating JS AST. I've started fleshing out a generic way for the JS AST compilation to default to reusing the Coffeescript AST generation code for that type, with an optional … Also fleshing out a generic structure for how to specify AST generation for a given node type as declaratively as possible (see …)
As far as the corresponding work getting Prettier to generate formatted Coffeescript, I used your …

Getting Prettier to be able to gradually emit more and more formatted Coffeescript constructs is a nice way to be able to measure progress. There are some obvious bigger targets, like being able to run Coffeescript test files through Prettier and have the reformatted version pass. But it'd probably be worth figuring out how to start building an actual Prettier test suite that I can add examples to and use to check for regressions as I'm adding new types and refining formatting.

Thus far my process has been to modify a simple …
Hey @helixbass, thanks for all this. I think you and @zdenko and I should get on a call together soon to coordinate. Some thoughts:
Yes, that makes sense. The indent is the big one; I’m going to update this branch so that the …
In #5044 @jashkenas asked to rename …
For sure.
I was planning to not have Prettier break lines, as CoffeeScript is a significant-whitespace language and the concept of a column limit clashes with the idea that indentation is meaningful. You can see this in the CoffeeScript codebase itself, which I feel really shouldn’t be contorting itself to mostly keep inside a column limit: it’s lots of hacks, of knowing that …

At the very least, I would focus on just getting Prettier to output legal source code, and worry about line breaks later.
I think if you're going to keep the existing …

The intended effect of the main prohibition against adding flags is to avoid having CoffeeScript parse/compile/evaluate different programs differently, depending on the flags passed. I want to avoid a situation where if you use a certain technique in your code, you need to pass a certain flag — hurting interoperability and copy-pasting.
I realize it’s something that’s done already, but the idea of adding such « recognition mechanisms » to nodes seems wrong.
@vendethiel — how do you mean, exactly?

Edit: I mean, the token data dictionary does feel like a nasty kludge ... but I'm not sure what the clean alternative would be for what Geoffrey is trying to do here.
Yes, I pored over …
As in, the token objects in the …

`o 'IMPORT String', -> new ImportDeclaration null, $2`

So an … Comments are so low priority that the kludge felt fine, but if there’s a better way to attach lexer/token data to get it more sensibly through to the nodes, I’m open to suggestions.
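To make the point concrete, here is a simplified sketch (the name `parserLexer` is illustrative, and this is not the exact CoffeeScript source) of the kind of lexer shim Jison consumes. Each rewritten token is roughly `[tag, value, locationData]`, and only the value survives as `yytext`, so extra properties hung off the token object never reach a grammar action like `-> new ImportDeclaration null, $2`:

```coffee
# Illustrative shim feeding rewritten tokens to a Jison parser; not the
# actual coffeescript.coffee code.
parserLexer =
  pos: 0
  tokens: []
  setInput: (tokens) ->
    @tokens = tokens
    @pos = 0
  lex: ->
    token = @tokens[@pos++]
    if token
      # Only tag, value and location data are handed to the parser;
      # anything else attached to the token object (e.g. token.data) is lost.
      [tag, @yytext, @yylloc] = token
    else
      tag = ''
    tag
```

That gap is what the location-data matching in this PR works around.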
I think that the easiest way to find out would be to simply ask @zaach. We could open a question ticket on https://github.com/zaach/jison/issues. Otherwise, a Jison patch would probably be needed...
We already do "pass data" in the sense of |
Sure
Ya there are a lot of interesting questions here. I also instinctively don't really see Coffeescript as fitting inside Prettier's "box" (and, per @jashkenas' comment, the logistics of guaranteeing correctness of auto-formatted Coffeescript may be more intricate than Javascript). However, line-breaking smartly is at the core of Prettier (see the paper its algorithm is based on and its description of its formatting primitives), so I don't think it's a question of punting completely on specifying line-breaking. And I still see Prettier as the logical initial choice (at least as far as what I'm interested in working on) for an engine for auto-formatting a Coffeescript AST.

Thus far I've been specifying basic line-breaking rules like "you can line-break array elements/function params/call args" or "you can line-break after an = in an assignment". I think you raise a wise word of caution that we don't want to get overzealous with line-breaking that's not good Coffeescript style. I will surely have a more nuanced grasp of how to use Prettier's formatting primitives to wrangle certain stylized formatting choices as time goes on. But to take your example, if we really didn't think a line should ever break after an …

I started looking at the work-in-progress Prettier Python formatter (linked from the Prettier homepage), curious about what choices another whitespace-indented language would make. I did notice some line-breaking backslashes, which is something I'd instinctively avoid. But actually the main thing I came away with for now was that they seem to be taking advantage of a Prettier plugin architecture to allow their development to be in a separate repo from Prettier proper. I'm guessing this is recommended (eg the Ruby one was set up similarly), so I mimicked its setup and yanked the core parsing/formatting code I've been working on from my …
We do have a coffeescript organization, perhaps we should put the …

I would say we should at least focus on using Prettier as a code generator first, before worrying about line breaks. And if Prettier lacks an option to disable line breaking, we should add one. I understand that that functionality is part of its core mission, but whitespace-dependent languages are different. I think inevitably we’d end up with line-continuation backslashes all over the place, like Python, unless we took a less strict approach about line breaks (like maybe put objects inline when they’re short, and expand to multiline syntax only when they’re too long for a single line, for example). What does Ruby do?
Okay, I updated this branch so that all tokens have metadata attached to them, with at least the common properties (tag, value, and indentation info). I also did some inspection on which node classes end up with this token data. The following 23 node classes get data passed from tokens:
And the following 16 node classes don’t get any token data:
And the following 14 node classes sometimes get token data:
Most if not all of the latter two groups have child nodes that have token data. But I’m starting to wonder how useful this is. What data is there in the lexer that we might need for a “complete” AST, that isn’t already available in the nodes? If we can solve #5019 without this data, is there any reason we need the token data?
Closing for now, in the hope that we won’t need this to be able to produce a “complete” AST. Will reopen if it turns out we need this workaround.
Along the lines of #5044 and #4984, this PR adds a way to pass arbitrary token data from the lexer to the node classes. Just as inside every node class since #4572 there’s a `@comments` property if comments should be output before or after that node, per this PR there’s now also a `@data` property where we can stash extra data from the lexer or rewriter to sneak through the parser and be accessible in the nodes classes.

In particular, this should make #5019 solvable. Besides creating the new code to allow this `data` property on tokens to sneak through the parser, this PR adds some data to the `StringLiteral` token that should allow us to move a lot of the string manipulation logic out of `lexer.coffee` and into `nodes.coffee`. This is more just an example at this point, with the actual moving of that logic saved for a future PR. If you set a breakpoint in the `nodes.coffee` `StringLiteral` class `compileNode` method, you should see `@data` present with the values we set in the lexer.

@zdenko I think the code at lexer.coffee#L307-L315 is where multiline `'` and `"` strings get deindented; thanks to the extra data I’m attaching to `StringLiteral`s in lexer.coffee#L770-L773, the former code block should be able to move into the `StringLiteral` nodes class. (A lot of the string code could move over there, like the lexer functions about escapes and so on.)

Even if we didn’t bother moving whatever we could out of the lexer into nodes (though we should do that), simply being able to pass raw data from the lexer to the nodes should make building a complete AST possible.
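As a rough illustration of the intended division of labor, here is a runnable sketch; the field names (`quote`, `indent`) and the standalone `deindent` function are assumptions for the example, not necessarily what the PR actually attaches or how `StringLiteral::compileNode` would be written:

```coffee
# Sketch only; not the PR's actual code.

# Lexer side: a STRING token is an array of [tag, value, locationData];
# extra info rides along on an assumed `data` property.
locationData = {first_line: 0, first_column: 0, last_line: 2, last_column: 7}
token = ['STRING', '"one\n  two\n  three"', locationData]
token.data =
  quote: '"'
  indent: '  '    # indentation that lexer.coffee currently strips itself

# Nodes side: once @data is reattached, StringLiteral could do the
# deindenting in compileNode (shown here as a plain function for clarity).
deindent = (value, data) ->
  return value unless data?.indent
  value.replace ///\n#{data.indent}///g, '\n'

console.log deindent token[1], token.data   # '"one\ntwo\nthree"'
```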
There’s one caveat to this PR. Technically this new `data` property isn’t passing through the parser, just as the old `comments` property isn’t; we compare the location data (start row/column, end row/column) of each token with `data` or `comments` properties against the location data of parser-generated nodes, and the properties are reattached when we find matches. This is only an issue when there are multiple tokens with the same location data, which is rare but does happen; the only example I can find is a generated `JS` token added to the start or end of a file to hold comments, but presumably similar “special case” or generated tokens will also share location data with user-generated ones. For comments, overlapping isn’t an issue; I just combine all the comments together. For token data, that isn’t an option; but I don’t think it should be an issue, since no “special” tokens should ever have data that needs to be preserved. If I’m wrong about this, we’ll need to patch Jison to truly allow extra data to be passed through the parser. cc @helixbass
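To illustrate the location-data matching described above, here is a minimal, self-contained sketch under assumed structures (array tokens of `[tag, value, locationData]`, nodes carrying a `locationData` object); it is not the PR’s actual implementation:

```coffee
# Reattach token properties to parser output by comparing location data.
sameLocation = (a, b) ->
  a.first_line is b.first_line and a.first_column is b.first_column and
  a.last_line is b.last_line and a.last_column is b.last_column

attachTokenData = (tokens, nodes) ->
  for token in tokens when token.data?
    for node in nodes when sameLocation(token[2], node.locationData)
      # If two tokens share location data (e.g. a generated JS token),
      # their data gets merged onto the same node: the caveat noted above.
      node.data = Object.assign {}, node.data, token.data
  nodes

# Toy usage:
loc   = {first_line: 0, first_column: 0, last_line: 0, last_column: 4}
token = ['STRING', "'abc'", loc]
token.data = quote: "'"
nodes = [{locationData: loc}]
console.log attachTokenData([token], nodes)[0].data   # { quote: "'" }
```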