Skip to content

New Rule: Detect common Unicode character substitutions. #981

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
gbuktenica opened this issue May 3, 2018 · 6 comments
Open

New Rule: Detect common Unicode character substitutions. #981

gbuktenica opened this issue May 3, 2018 · 6 comments

Comments

@gbuktenica
Copy link

When PowerShell code is copied into Visual Studio code from external sources some characters have been converted to an incorrect Unicode equivalent.
Examples are:
Double quote U+0022 <"> converted to left or right double quote U+201C and U+201D
Apostrophe U+0027 <'> converted to left or right apostrophe U+2018 U+2019
Dash U+002D <-> converted to En dash U+2013

The vscode-powershell add on displays these characters as valid while PowerShell ISE either displays these characters as invalid or automatically changes them during cut and paste.

@gbuktenica gbuktenica changed the title New Rule: Detect common Unicode character subsitutions. New Rule: Detect common Unicode character substitutions. May 3, 2018
@bergmeister
Copy link
Collaborator

Very good idea, I have had similar problems in the past as well. However, PSSA usually relies on the parser, therefore this might become interesting to implement. I will try to look at some example cases and see if there is some good heuristics to detect such cases reliably

@gbuktenica
Copy link
Author

As this is a common issue across languages and no other linter appears to pick this up I have started a new extension to cover this.
It is a VERY early release and it isn't published in the market place yet so it requires a manual installation:
https://github.com/gbuktenica/Unicode-Substitutions

@FISHMANPET
Copy link

So RE: relying on the parser, I've actually written a pester test that uses the AST parser to find these characters and generate an error if it finds them, I've written about here but this is the important part:

Describe "General project validation" {

    $predicate = {
        param ( $ast )

        if ($ast -is [System.Management.Automation.Language.BinaryExpressionAst] -or
            $ast -is [System.Management.Automation.Language.CommandParameterAst] -or
            $ast -is [System.Management.Automation.Language.AssignmentStatementAst]) {

            if ($ast.ErrorPosition.Text[0] -in 0x2013, 0x2014, 0x2015) {return $true}

        }
        if ($ast -is [System.Management.Automation.Language.CommandAst] -and
            $ast.GetCommandName() -match '\u2013|\u2014|\u2015') {return $true}

        if (($ast -is [System.Management.Automation.Language.StringConstantExpressionAst] -or
                $ast -is [System.Management.Automation.Language.ExpandableStringExpressionAst]) -and
            $ast.Parent -is [System.Management.Automation.Language.CommandExpressionAst]) {
            if ($ast.Parent -match '^[\u2018-\u201e]|[\u2018-\u201e]$') {return $true}
        }
    }

    # TestCases are splatted to the script so we need hashtables
    $testCase = $scripts | Foreach-Object {@{file = $_}}
    It "Script <file> should be valid powershell" -TestCases $testCase {
        param (
            $file
        )
        $script = Get-Content -Raw -Encoding UTF8 -Path $file
        $tokens = $errors = @()
        $ast = [System.Management.Automation.Language.Parser]::ParseInput($Script, [Ref]$tokens, [Ref]$errors)
        $elements = $ast.FindAll($predicate, $true)

        $elements | Should -BeNullOrEmpty -Because $elements
        $errors | Should -BeNullOrEmpty -Because $errors
    }
}

Looking at it again with a clear mind I think I missed a case, something like $date.AddDays(―1), need to look for a UnaryExpressionAst with a Parent that contains one of 0x2013, 0x2014, 0x2015, but I think this is a good baseline for how a rule could be written. I don't know enough about C# or how rules are implemented generally so unfortunately I can't contribute much beyond this.

@FISHMANPET
Copy link

If nobody minds I'm going to use this as a place to dump my ideas on this as I notice any problems with my $predicate, in the hopes these notes are useful to someone who decides to implement this in C#. You can of course get the ast in the terminal but i've found the Show-Ast module very useful to give me a window I can scroll through and more easily poke around the AST. It's rudimentary but helpful

I had this line that should have been detected but wasn't:
if ($result.SideIndicator -eq “=>”) {
Dug into the AST for it and found that it was a StringConstantExpressionAst (which I am looking for) but with a parent of BinaryExpressionAst (which I wasn't looking for. So now my last check looks like this:

        if (($ast -is [System.Management.Automation.Language.StringConstantExpressionAst] -or
                $ast -is [System.Management.Automation.Language.ExpandableStringExpressionAst]) -and
            (($ast.Parent -is [System.Management.Automation.Language.CommandExpressionAst]) -or
                $ast.Parent -is [System.Management.Automation.Language.BinaryExpressionAst])) {
            if ($ast.Parent -match '^[\u2018-\u201e]|[\u2018-\u201e]$') {
                return $true
            }
        }

Another idea I had about quote matching in general. Both types of quoted strings, StringConstantExpressionAst and ExpandableStringExpressionAst have a StringConstantType property that will specify if it's a BareWord (probably a command name) or a quoted string, and perhaps from there all of those strings could be searched for bad quote characters. Extra work may have to be done to ensure that there aren't any special quotes being actually quoted, maybe just look if the first or last character in the string matches one of the bad characters.

@FISHMANPET
Copy link

The ―1 in $date.AddDays(―1) is a ConstantExpressionAst not a UnaryExpressionAst but regardless I'm not catching it.

A UnaryExpressionAst is something like $temp ―― which I'm ALSO not catching

The ConstantExpressionAst is easy, just add it to the -or list. The Unary is a little harder. And in writing that I wouldn't catch 12 ―eq 12. So I fixed both by adding this if statement:

if (($ast -is [System.Management.Automation.Language.UnaryExpressionAst] -or
            $ast -is [System.Management.Automation.Language.BinaryExpressionAst]) -and
            $ast.Extent.Text -match '\u2013|\u2014|\u2015') {
            return $true
        }

Both of these have a TokenKind which could be looked at similair to StringConstantType above but the list of possible operators is huge and may not make it any easier.

@FISHMANPET
Copy link

I should also add in case you didn't catch it in the blog post, the reason for these convoluted searches rather than just finding a character is that these special characters are allowed in quoted strings, so it needs to be smart about parsing.

The blog I linked to from 2009 includes a snippet of the PS source code that does the on the fly replacement, and I found that same code in the PS repository now. I didn't look at it very closely, but maybe looking at the code it could be easier to figure out how the engine tells the difference between special characters inside a quoted string and outside a quoted string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants