Skip to content

Question: why is the Visitor trait limited to statements, relations & expressions? #934

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
freshtonic opened this issue Jul 25, 2023 · 25 comments

Comments

@freshtonic
Copy link

freshtonic commented Jul 25, 2023

What is the reason for that particular design decision versus providing a more general Visitor implementation?

Two options for a generalised Visitor trait come to mind:

  1. expose pre + post trait method variants for every AST node type, or
  2. expose only two trait methods (pre_visit + post_visit) with signatures like fn pre_visit(&mut self, node: &AstNode) -> ControlFlow<Self::Break> - where AstNode is an enum with a wrapper variant for every AST node type found in src/ast/mod.rs and can be matched against.

Would the maintainers be interested in a PR that implements one of the above two approaches?

My preference would be for option 2 because it would not break the trait when node types are added/removed.

Suggested approach:

  1. Define a new RawVisitor trait (and RawVisitorMut trait) like this:
pub trait RawVisitor {
  type Break;
  fn pre_visit(&mut self, node: &AstNode) -> ControlFlow<Self::Break>;
  fn post_visit(&mut self, node: &AstNode) -> ControlFlow<Self::Break>;
}
  1. Define an adapter type (RawVisitorAdapter ?) that accepts a V: Visitor generic argument and implements RawVisitor & RawVisitorMut, which calls the appropriate method on V (or none at all)
struct RawVisitorAdapter<V: Visitor>(v);

impl<V: Visitor> RawVisitor for RawVisitorAdapter<V> {
  type Break = V::Break;

  fn pre_visit(&mut self, node: &AstNode) -> ControlFlow<Self::Break> {
     match node {
        AstNode(Statement) => self.0.pre_visit_statement(...),
        // etc
     }
  }

  fn post_visit(&mut self, node: &AstNode) -> ControlFlow<Self::Break>;
}
  1. Change the Visit derivation macros to generate code in terms of RawVisitor & RawVisitorMut instead of Visitor, like this:
pub trait Visit {
    fn visit_raw<V: RawVisitor>(&self, visitor: &mut V) -> ControlFlow<V::Break>;

    // This has an identical signature to the existing trait, but has a default implementation
    fn visit<V: Visitor>(&self, visitor: &mut V) -> ControlFlow<V::Break> {
       self.visit_raw(RawVisitorAdapter::new(visitor))
    }
}
@alamb
Copy link
Contributor

alamb commented Aug 7, 2023

What is the reason for that particular design decision versus providing a more general Visitor implementation?

The core reasons are to keep this crate easy to extend with new sql syntax by casual contributors and limited maintenance bandwidth.

The current visitor pattern / macros require a single new #[derive] call

In terms of your proposed implementation, it seems reasonable. I would be interested in knowing how it would work when a contributor added a new node, like #897

How would we ensure that these new nodes were added to the AstNode enum? What would happen if we forgot to do so?

perhaps @lovasoa and @tustvold have some other perspectives

Basically I think a general purpose rewriter / visitor would be a great addition, as long as we don't make it too hard to add new code to this repo

@lovasoa
Copy link
Contributor

lovasoa commented Aug 7, 2023

@freshtonic AST visiting in sqlparser-rs has a long history. It took very long for a design to be agreed on and merged. I had proposed an alternative, more general, approach that would have allowed to do what you wanted: #601

But it was rejected, by fear it would take too much work to maintain.

My proposition to embark new maintainers for the feature (myself and @nikis05) was rejected.

sqlparser-rs is at the very core of SQLPage, so I can reiterate my proposition: I wouldn't mind being added as a maintainer here :)

@alamb
Copy link
Contributor

alamb commented Aug 7, 2023

sqlparser-rs is at the very core of SQLPage, so I can reiterate my proposition: I wouldn't mind being added as a maintainer here :)

Indeed -- I am looking for help maintaining it for sure. I think you can do all the time consuming parts of maintenance (PR reviews, answering questions / tickets, etc) without having write access to the repository.

I am waiting for someone to actually show the initiative to do this work (there are plenty of people willing to write code).

I had a promising candidate in @AugustoFKL (see #808) but sadly there were some circumstances that required them to stop working here

I will write up a discussion explaining the current state of affairs

Update: https://github.com/sqlparser-rs/sqlparser-rs/discussions/940

@freshtonic
Copy link
Author

freshtonic commented Aug 7, 2023

In terms of your proposed implementation, it seems reasonable. I would be interested in knowing how it would work when a contributor added a new node, like #897

How would we ensure that these new nodes were added to the AstNode enum? What would happen if we forgot to do so?

That's a great question. Assuming the parent node of the new node has a derived Visitor instance then compilation would fail due to the derived code not being able to find a matching AstNode variant, which is not ideal.

Ideally, the AstNode enum would be generated automatically via a proc macro, by walking the types starting from sqlparser::ast::Statement as the root type, but that would require some deep macro voodoo. I'm not even sure it's possible without using something like rust-analyzer as a library to discover all of the types recursively. Someone please let me know if it's achievable.

@lovasoa
Copy link
Contributor

lovasoa commented Aug 8, 2023

You don't need to create a huge enum. You can use the built in Any trait. Have a look at my PR ;)

@freshtonic
Copy link
Author

@lovasoa depends on if you want to match on it. I'll take a look at you PR though.

@alamb
Copy link
Contributor

alamb commented Aug 22, 2023

FWIW #951 added visiting TableFactor as well

@freshtonic
Copy link
Author

Update: I've been experimenting with various implementation strategies to generalise the AST visitor. Nothing to ready to show yet, but this is business critical for my company so it's a high priority to figure out something that works well.

I'll update this issue with a link to a draft PR hopefully in a matter of days.

@freshtonic
Copy link
Author

Progress update - I have a working generalised visitor implementation. I'll put up a PR later this week when I've done some more testing.

@osaton
Copy link

osaton commented Feb 10, 2024

Is this still happening? Generalised visitor implementation would be appreciated.

@freshtonic
Copy link
Author

freshtonic commented Feb 11, 2024

@osaton @alamb yes this is still happening. Apologies for the lack of comms.

TL;DR I ended up taking quite a different approach to what I originally planned, due to our (the company I work for) evolving needs.

The comprehensive Visitor implementation will be initially released as its own crate. My original intention was to release a PR against sqlparser itself, but the design space is fairly large and I'd like to retain the freedom to iterate on it after releasing the code (at least before committing to a version 1 release).

Summary of features so far:

  • A Visitor implementation is notified of every type in the sqlparser AST (200+ types when you take into account the Vec, Box and Option combinations, plus primitives).
  • The Visitor trait itself is generic over the AST node type of interest (there aren't 200+ trait methods - in fact there are just two: enter and exit).
  • The AST traversal is performed in an optimal order for semantic analysis workloads (instead of the order of AST fields in the source). Specifically, AST nodes that can refer to other AST nodes are visited after the AST node that introduces the referent - e.g. a FROM clause is visited before the a projection that refers to columns in the tables introduced by the FROM clause. This cuts down on the number of passes required for semantic analysis workloads.

In order to pull that off, I needed to write a build.rs script to analyse sqlparser's sources because hand writing AST traversal in semantically-useful order for all of the AST types would have been a lot of work (and error prone).

I need to write up some docs but I would love some feedback on the code and general approach before I publicly publish the repo. I can share it privately if either of you are interested in taking a look.

@freshtonic
Copy link
Author

freshtonic commented Feb 11, 2024

Three features that I really wanted from sqlparser while writing the enhanced Visitor code:

  1. Optional feature to have a unique ID baked into all AST nodes

It's possible for multiple AST nodes at different places in the tree that have the same type to have the same hash and compare as equal. This seems counter intuitive, but two syntactically identical nodes can be different semantically.

Not having robust (and cheap) means of uniquely identifying a node makes building HashMaps of derived metadata keyed by node particularly tricky.

  1. Optional feature to have span information baked into all AST nodes

This would be super useful for debugging and reporting errors nicely (like miette does).

  1. Type per enum variant (for Statement and Expr at least, maybe some additional other types)

It would be nice to not have to match an enum in order to pluck out a variant. Instead I'd like be able to implement Visitor<InsertStatement>.

1 & 2 could be solved by wrapping every struct field or enum variant field in a Node<T> type that Derefs to T but also contains a unique ID and span info. Because of the Deref it might be possible to do this without breaking backwards compatibility but I haven't actually tried it.

@osaton
Copy link

osaton commented Feb 12, 2024

@freshtonic Your visitor crate sounds perfect for my use case. I would love to try it but I'm not sure if I'm able to give any constructive criticism about the code as I'm fairly new to Rust and systems programming languages in general.

  1. Optional feature to have a unique ID baked into all AST nodes

It's possible for multiple AST nodes at different places in the tree that have the same type to have the same hash and compare as equal. This seems counter intuitive, but two syntactically identical nodes can be different semantically.

Not having robust (and cheap) means of uniquely identifying a node makes building HashMaps of derived metadata keyed by node particularly tricky.

  1. Optional feature to have span information baked into all AST nodes

This would be super useful for debugging and reporting errors nicely (like miette does).

  1. Type per enum variant (for Statement and Expr at least, maybe some additional other types)

I don't know if this can be done without breaking changes, but having both 1 and 2 would be amazing.

@ryb73
Copy link

ryb73 commented Apr 9, 2024

+1 to this – would be curious to try out your crate @freshtonic if/when it gets to a publishable state :)

@ramnes
Copy link

ramnes commented Apr 7, 2025

@alamb In the meantime, would you be open to a PR that basically just adds visit(with = "visit_...") for every node type?

@freshtonic
Copy link
Author

@ryb73 I built sqltk on top of sqlparser which provides an alternate take on the Visitor trait. You can find it here sqltk.

Note that the docs building currently fails on docs.rs due to very clunky build step to work around being unable to derive traits for foreign types in Rust.

The workaround I'm considering is to fork sqlparser as sqltk-parser in the workspace (which gratuitous credit given) and maintain compatibility with upstream sqlparser.

We're using it in production at CipherStash so it's "production ready" in that sense but we'd appreciate some community feedback and PRs to fix/improve it :)

@alamb
Copy link
Contributor

alamb commented Apr 10, 2025

Sorry for the late reply -- Option 2 above seems like a reasonable idea to me. @iffyio does that sound reasonable to you?

@iffyio
Copy link
Contributor

iffyio commented Apr 10, 2025

Yeah I think option 2 above sounds reasonable!

@ramnes
Copy link

ramnes commented Apr 10, 2025

FWIW I quickly implemented option 1 here in the meantime, if you want to take a look. That's indeed a large chunk of boilerplate code.

Anyone wants to work on option 2? Otherwise I may be able to land something next week, but I don't want to guarantee anything.

@freshtonic
Copy link
Author

freshtonic commented Apr 29, 2025

@ramnes anything I can do to help you make your Visitor changes into a PR?

@alamb is there a realistic chance of @ramnes work being merged?

If it lands, it potentially means sqltk no longer needs to exist. The only other unique feature of sqltk is a traversal order guarantee, which it turns out I don't need.

@ramnes
Copy link

ramnes commented Apr 29, 2025

I can open a PR for this commit, but is this the direction we want to go in? I thought folks here wanted option 2 (which I didn't have the time to work on so far.)

@freshtonic
Copy link
Author

I can open a PR for this commit, but is this the direction we want to go in? I thought folks here wanted option 2 (which I didn't have the time to work on so far.)

@ramnes I just realised it's not 100% clear to me what you mean by "Option 2" - are you referring to my comment about baked in span information in all nodes? Because that's something that's being gradually rolled out already.

@ramnes
Copy link

ramnes commented Apr 30, 2025

Two options for a generalised Visitor trait come to mind:

  1. expose pre + post trait method variants for every AST node type, or
  2. expose only two trait methods (pre_visit + post_visit) with signatures like fn pre_visit(&mut self, node: &AstNode) -> ControlFlowSelf::Break - where AstNode is an enum with a wrapper variant for every AST node type found in src/ast/mod.rs and can be matched against.

I meant option 2 here. Isn't it what @alamb and @iffyio meant as well? 😅

@freshtonic
Copy link
Author

freshtonic commented Apr 30, 2025

@ramnes thank you for clarifying!

I'm 100% in favour of any style of Visitor implementation that covers all AST node types in sqlparser. I'm less fussy than I used to be about the particular approach taken. These are the "obvious" approaches to me:

  1. Big Visitor trait (enter/exit or pre/post methods for every AST node type)
  2. Small Visitor trait (just enter/exit methods, e.g. fn enter<N: Visitable>(&mut self, node: &N) -> ControlFlow
  3. Small generic Visitor<N: Visitable> trait (just enter/exit methods, e.g. fn enter(&mut self, node: &N) -> ControlFlow

Option 1 is (subjectively) ugly but gets the job done.

Option 2 has a cleaner trait but implementations must downcast the node to do anything useful with it.

Option 3 is just as clean as 2 and downcasting is no longer required in every implementation, but Visitor dispatch no longer works because the AST traversal has to be able to dispatch any node type to a Visitor, so you still need an additional dispatch mechanism. This is where the enum based dispatch could help.

The approach to Visitor I ended up going with in sqltk is essentially Option 2 above and looks like this.

I have been thinking about alternative approach to avoid all of the explicit casting and just generate the code.

Here's the playground


Some background. 20 months ago(!) when I created this issue I was tasked with creating a type-inferencer for the CRUD portion of the SQL grammar as part of my job at CipherStash.

What I didn't know then (but I know now) is that full AST coverage was not necessary for our particular problem. Currently the only AST nodes that we need to visit are: Statement,Query, Insert, Delete, Expr, SetExpr, Select, Vec<SelectItem>, Function, Values, Value.

But I can see a need for access to more AST nodes as we extend our product into other areas of SQL analysis. There would always be workarounds: just implement a Visitor for the nearest parent node type of the node type you're actually interested in and then match against its sub-tree. That's not very nice though and why I'm a proponent of a general Visitor trait.

@freshtonic
Copy link
Author

freshtonic commented Apr 30, 2025

@ramnes oh and one more thing that's very important.

For workloads such as analysing an AST (in my case, deriving type information for a subset of node types) a Visitor implementation has to be able to associate information with AST nodes, so an implementation would have, say, a field like derived_node_info: HashMap<&Expr, SomeDerivedInfo>.

The problem is that &Expr must have a lifetime and that lifetime must not outlive the parsed AST.

That means the Visitor trait has to have some lifetime associated with the trait itself (e.g. Visitor<'ast>) and the enter/exit methods must use that lifetime for the node references, e.g. fn enter(&mut self, node: &'ast N) ....

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants