Skip to content

Parse column constraints in any order #93

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 9, 2019

Conversation

benesch
Copy link
Contributor

@benesch benesch commented Jun 3, 2019

CREATE TABLE t (a INT NOT NULL DEFAULT 1 PRIMARY KEY) is as valid as CREATE TABLE t (a INT DEFAULT 1 PRIMARY KEY NOT NULL).

@nickolay I got annoyed by the tests in sqlparser_postgres and adjusted them so that the first test asserts every column specification, and the remainder simply verify that the SQL roundtrips. Let me know if you'd prefer to be more precise with the later tests.

@benesch benesch requested a review from nickolay June 3, 2019 02:57
@coveralls
Copy link

coveralls commented Jun 3, 2019

Pull Request Test Coverage Report for Build 313

  • 246 of 272 (90.44%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at ?%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/sqlast/ddl.rs 45 46 97.83%
src/sqlparser.rs 41 42 97.62%
tests/sqlparser_common.rs 73 84 86.9%
tests/sqlparser_postgres.rs 87 100 87.0%
Totals Coverage Status
Change from base Build 307: 0.0%
Covered Lines:
Relevant Lines: 0

💛 - Coveralls

@benesch benesch force-pushed the col-constr-order branch 2 times, most recently from c42250d to bf7d50b Compare June 4, 2019 14:53
@nickolay
Copy link
Contributor

nickolay commented Jun 5, 2019

This got rather long, sorry!

  1. The CHECK constraint is only partially implemented: it's in the AST, but the parser doesn't handle it. Please make a note if you want to leave it unimplemented for now.

  2. A thought on parsing of the DEFAULT expression:

We want to allow as many DEFAULT expressions as possible, but calling self.parse_expr() will choke on expressions like "DEFAULT 1 NOT NULL". Typically this would be a syntax error, as "1 NOT NULL" is not a valid SQL expression, but in this case we know that NOT NULL is part of the next column constraint.

A syntax error happens whenever a "token" (or - with the change from #82 - a token pair, like "NOT IN") following the DEFAULT expression has a get_next_precedence > 0. The Pratt parser then attempts to parse it as an "infix", and in the case of NOT NULL fails with "Expected IN or BETWEEN after NOT, found: null".

As it turns out, the only expressions that cause trouble all have precedence less than or equal to BETWEEN, so we pass BETWEEN_PREC to stop parsing on tokens with precedence less than or equal to BETWEEN.
The same trick is employed by PostgreSQL

I've looked through the docs for a few dialects (below) and I can't think of anything other than NOT (and COLLATE - see below) that's under a risk of being incorrectly treated as a DEFAULT expression's infix, while being a valid keyword starting a new constraint or a column attribute definition.

After #82 get_next_precedence is only used to get the left-binding precedence of the peeked token (unlike before when get_precedence(make_keyword("NOT")) was used to get the right-binding precedence to pass to parse_subexpr - we now use constants for that).

I don't see why get_next_precedence needs to treat NOT not followed by IN/BETWEEN any differently than other keywords that are not "infix" operators. It currently returns lbp=UNARY_NOT_PREC for NOT in infix position instead of 0, so you have to call parse_subexpr with rbp=BETWEEN_PREC to make it stop (per rbp>=lbp) when encountering NOT. If we defined lbp(NOT) = 0, the parser invoked with the default rbp=0 would stop on NOT followed by NULL just like it stops on any unknown keyword. (And the unreachable! removed in #71 could be restored then.)

What do you think?

  1. On COLLATE:
  • ANSI puts COLLATE as the last clause in a column definition, but it doesn't allow arbitrary expressions in DEFAULT, so it doesn't conflict.
  • PG/MS both seem to expect COLLATE before DEFAULT in column definition, presumably to avoid the ambiguity of DEFAULT 'foo' COLLATE bar.

As we parse COLLATE as an optional suffix at the end of parse_prefix, the precedence trick doesn't help with this anyway, and to support ANSI syntax we'd need to use a different radically simpler parser for the DEFAULT clause. This is not in the scope of this PR, of course.

  1. On the AST design:

ANSI allows CONSTRAINT <name> to be used with NOT NULL, UNIQUE/PRIMARY KEY, and REFERENCES. MSSQL doesn't seem to support it for NOT NULL, but supports it for DEFAULT instead. Postgres allows both NOT NULL and DEFAULT, plus NULL and GENERATED as constraints.

All three dialects agree that COLLATE isn't a constraint, but expect it in different positions. MS has a number of custom options interleaved with the standard ones.

Maybe this could be modeled as a Vec<ColumnOption> where a ColumnOption is either a Constraint { name: Option<String>, constraint: Box<ColumnOption>, deferred, deferrable, etc }, one of standard options (including constraints, COLLATE, DEFAULT, and GENERATED), or a dialect-specific String?

  1. I didn't get to the tests, but given that you're being very thorough with testing, I'll probably agree with whatever approach you decided on.

  2. Relevant parts of the docs:

  • ANSI
    <column definition> ::= <column name> [ <data type or domain name> ]
        [ <default clause> | <identity column specification> | <generation clause>
          | <system time period start column specification>
          | <system time period end column specification>
        ]
        [ <column constraint definition>... ]
        [ <collate clause> ]
    
    <column constraint definition> ::= [CONSTRAINT ..]
      { NOT NULL
        | <unique specification>
        | <references specification>
        | <check constraint definition>
      }
      { in any order:
        [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
        [ [ NOT ] DEFERRABLE ]
      }
      [ [ NOT ] ENFORCED ]
    
  • Postgres
    column_name data_type [ COLLATE collation ] [ column_constraint [ ... ] ]
    and column_constraint is:
    [ CONSTRAINT constraint_name ]
    { NOT NULL | NULL | CHECK | DEFAULT .. | GENERATED .. |
    UNIQUE .. | PRIMARY KEY .. | REFERENCES .. }
    [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
    
  • MSSQL
    <column_definition> ::= column_name <data_type>
      [ FILESTREAM ]
      [ COLLATE collation_name ]
      [ custom stuff ]
      [ CONSTRAINT constraint_name [ DEFAULT constant_expression ] ]
      [ GENERATED .. ]
      [ NULL | NOT NULL ]
      [ more custom stuff ]
      [ <column_constraint> [, ...n ] ]
      [ <column_index> ]
    
    <column_constraint> ::= [ CONSTRAINT constraint_name ]
      { CHECK .. 
      | FOREIGN KEY .. 
      | { PRIMARY KEY | UNIQUE } [ custom ] }
    

@benesch
Copy link
Contributor Author

benesch commented Jun 5, 2019

After #82 get_next_precedence is only used to get the left-binding precedence of the peeked token (unlike before when get_precedence(make_keyword("NOT")) was used to get the right-binding precedence to pass to parse_subexpr - we now use constants for that).

I don't see why get_next_precedence needs to treat NOT not followed by IN/BETWEEN any differently than other keywords that are not "infix" operators. It currently returns lbp=UNARY_NOT_PREC for NOT in infix position instead of 0, so you have to call parse_subexpr with rbp=BETWEEN_PREC to make it stop (per rbp>=lbp) when encountering NOT. If we defined lbp(NOT) = 0, the parser invoked with the default rbp=0 would stop on NOT followed by NULL just like it stops on any unknown keyword. (And the unreachable! removed in #71 could be restored then.)

This is brilliant! You're entirely right that there's a bug here. We should absolutely be returning 0 if we see a unary NOT in get_next_precedence, because unary NOT is not an infix operator and therefore has no left binding power.

I'm going to split that change out into its own PR.

@benesch
Copy link
Contributor Author

benesch commented Jun 6, 2019

Ok, I've created a separate PR for the unary NOT issue (#107), and rebased this PR on top of it.

  1. The CHECK constraint is only partially implemented: it's in the AST, but the parser doesn't handle it. Please make a note if you want to leave it unimplemented for now.

Oops, thanks! This is fixed now, and I've added a test.

  1. A thought on parsing of the DEFAULT expression...

This was spot on, and is addressed in #107.

  1. On COLLATE...
  2. On the AST design...

I took a shortcut here and added COLLATE as an Option<SQLObjectName> on the SQLColumnDef struct. I'll try to find time soon to rework the structs as you've suggested.

@benesch benesch force-pushed the col-constr-order branch from 58dabda to 1af39db Compare June 6, 2019 17:45
@nickolay
Copy link
Contributor

nickolay commented Jun 6, 2019

I'll try to find time soon to rework the structs as you've suggested.

Did you mean you wanted to merge this PR before reworking? I’ll try look at this over the weekend, but please let me know if it blocks you.

@benesch
Copy link
Contributor Author

benesch commented Jun 6, 2019

Did you mean you wanted to merge this PR before reworking? I’ll try look at this over the weekend, but please let me know if it blocks you.

Not blocked! Just wanted to write down which chunks of your feedback I've addressed and which chunks I haven't, so that I remember what's left when I get back to it.

@benesch benesch force-pushed the col-constr-order branch from 1af39db to b47438d Compare June 6, 2019 21:33
Copy link
Contributor

@nickolay nickolay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! I think future improvements can be done when/if someone needs them.

pub enum ColumnConstraint {
/// `NULL`
///
/// The ANSI specification technically allows NULL constraints to have a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for nit-picking: I don't believe ANSI SQL allows CONSTRAINT foo NULL - it's Postgres that seems to allow it judging from its docs...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh man, good point. What a nightmare. Fixed.

}

#[derive(Debug, Clone, PartialEq, Hash)]
pub enum ColumnConstraint {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current AST seems to be partially modeled after Postgres in which any column option (except COLLATE) is called a "column constraint", even though it doesn't constrain anything. Shall we add "GENERATED" to this struct when it's implemented? Other options, like MS-specific ones? Would we keep the ColumnConstraint name then? If we plan to rename, it might be better to rename it before landing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points. I've pretty substantially reworked the design of the AST. I've renamed the main enum to ColumnOption, to better reflect the fact that it might not contain only constraints. I've also moved the optional name to a wrapping ColumnOptionDef struct, which better allows any of the options to take on a name, to allow for Postgres syntax, where practically everything can have a name.

I think this should provide the right amount of future flexibility. GENERATED naturally fits as a new enum variant in ColumnOption, and hopefully it's less grating now that the struct doesn't refer to "constraints." MSSQL-specific options can also go into ColumnOption, if that makes sense, or they can live as fields on SQLColumnDef, if they're the kind of options (like SPARSE) that can't be named and must appear in a fixed order.

@benesch benesch force-pushed the col-constr-order branch 2 times, most recently from 55988c3 to 2943e05 Compare June 8, 2019 16:11
CREATE TABLE t (a INT NOT NULL DEFAULT 1 PRIMARY KEY) is as valid as
CREATE TABLE t (a INT DEFAULT 1 PRIMARY KEY NOT NULL).
@benesch benesch force-pushed the col-constr-order branch from 2943e05 to ffa1c8f Compare June 8, 2019 16:14
@nickolay
Copy link
Contributor

nickolay commented Jun 8, 2019

Looks great, thanks!

@benesch benesch merged commit 7e96d81 into apache:master Jun 9, 2019
@benesch benesch deleted the col-constr-order branch June 21, 2019 22:34
nickolay added a commit to nickolay/sqlparser-rs that referenced this pull request Jul 31, 2020
Since PR apache#93
`parse_column_def` parses a set of column options in a loop, e.g. given:

```
                  _______ column_def _______
CREATE TABLE foo (bar INT NOT NULL DEFAULT 1, )
                          -------- ---------
                          option 1  option 2
````

it parses column options until it encounters one of the delimiter tokens

First when we only supported `CREATE TABLE`, the set of delimiters that
stopped the parsing used to be `Token::Comma | Token::RParen`.

Then we added support for `ALTER TABLE ADD COLUMN <column_def>`. Turns
out the parser started to bail if the statement ended with a semicolon,
while attempting to parse the semicolon as a column option, as we forgot
to add it to the set of delimiter tokens.

This was recently fixed in apache#246
by including Token::SemiColon to the list, but it felt wrong to have 
to update this list, and to have a common list of delimiters for two
different contexts (CREATE TABLE with parens vs ALTER TABLE ADD COLUMN
without parens).

Also our current approach cannot handle multiple statements NOT
separated by a semicolon, as is common in MS SQL DDL. We don't
explicitly support it in `parse_statements`, but that's a use-case
like to keep in mind nevertheless.
nickolay added a commit to nickolay/sqlparser-rs that referenced this pull request Jul 31, 2020
Since PR apache#93
`parse_column_def` parses a set of column options in a loop, e.g. given:

```
                  _______ column_def _______
CREATE TABLE foo (bar INT NOT NULL DEFAULT 1, )
                          -------- ---------
                          option 1  option 2
````

it parses column options until it encounters one of the delimiter tokens

First when we only supported `CREATE TABLE`, the set of delimiters that
stopped the parsing used to be `Token::Comma | Token::RParen`.

Then we added support for `ALTER TABLE ADD COLUMN <column_def>`. Turns
out the parser started to bail if the statement ended with a semicolon,
while attempting to parse the semicolon as a column option, as we forgot
to add it to the set of delimiter tokens.

This was recently fixed in apache#246
by including Token::SemiColon to the list, but it felt wrong to have
to update this list, and to have a common list of delimiters for two
different contexts (CREATE TABLE with parens vs ALTER TABLE ADD COLUMN
without parens).

Also our current approach cannot handle multiple statements NOT
separated by a semicolon, as is common in MS SQL DDL. We don't
explicitly support it in `parse_statements`, but that's a use-case
like to keep in mind nevertheless.
nickolay added a commit to nickolay/sqlparser-rs that referenced this pull request Aug 10, 2020
Since PR apache#93
`parse_column_def` parses a set of column options in a loop, e.g. given:

```
                  _______ column_def _______
CREATE TABLE foo (bar INT NOT NULL DEFAULT 1, )
                          -------- ---------
                          option 1  option 2
````

it parses column options until it encounters one of the delimiter tokens

First when we only supported `CREATE TABLE`, the set of delimiters that
stopped the parsing used to be `Token::Comma | Token::RParen`.

Then we added support for `ALTER TABLE ADD COLUMN <column_def>`. Turns
out the parser started to bail if the statement ended with a semicolon,
while attempting to parse the semicolon as a column option, as we forgot
to add it to the set of delimiter tokens.

This was recently fixed in apache#246
by including Token::SemiColon to the list, but it felt wrong to have
to update this list, and to have a common list of delimiters for two
different contexts (CREATE TABLE with parens vs ALTER TABLE ADD COLUMN
without parens).

Also our current approach cannot handle multiple statements NOT
separated by a semicolon, as is common in MS SQL DDL. We don't
explicitly support it in `parse_statements`, but that's a use-case
like to keep in mind nevertheless.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants