New Rule: Detect common Unicode character substitutions. #981
Very good idea; I have had similar problems in the past as well. However, PSSA usually relies on the parser, so this might become interesting to implement. I will try to look at some example cases and see whether there are good heuristics to detect such cases reliably.
As this is a common issue across languages and no other linter appears to pick this up, I have started a new extension to cover this.
So, regarding relying on the parser: I've actually written a Pester test that uses the AST parser to find these characters and generate an error if it finds them. I've written about it here, but this is the important part:
Describe "General project validation" {
    $predicate = {
        param ( $ast )
        # Operators, parameters and assignments introduced by a Unicode dash
        if ($ast -is [System.Management.Automation.Language.BinaryExpressionAst] -or
            $ast -is [System.Management.Automation.Language.CommandParameterAst] -or
            $ast -is [System.Management.Automation.Language.AssignmentStatementAst]) {
            if ($ast.ErrorPosition.Text[0] -in 0x2013, 0x2014, 0x2015) {return $true}
        }
        # Command names containing a Unicode dash
        if ($ast -is [System.Management.Automation.Language.CommandAst] -and
            $ast.GetCommandName() -match '\u2013|\u2014|\u2015') {return $true}
        # Strings that start or end with a smart quote
        if (($ast -is [System.Management.Automation.Language.StringConstantExpressionAst] -or
             $ast -is [System.Management.Automation.Language.ExpandableStringExpressionAst]) -and
            $ast.Parent -is [System.Management.Automation.Language.CommandExpressionAst]) {
            if ($ast.Parent -match '^[\u2018-\u201e]|[\u2018-\u201e]$') {return $true}
        }
    }
    # TestCases are splatted to the script so we need hashtables
    $testCase = $scripts | Foreach-Object {@{file = $_}}
    It "Script <file> should be valid powershell" -TestCases $testCase {
        param (
            $file
        )
        $script = Get-Content -Raw -Encoding UTF8 -Path $file
        $tokens = $errors = @()
        $ast = [System.Management.Automation.Language.Parser]::ParseInput($script, [Ref]$tokens, [Ref]$errors)
        $elements = $ast.FindAll($predicate, $true)
        $elements | Should -BeNullOrEmpty -Because $elements
        $errors | Should -BeNullOrEmpty -Because $errors
    }
}
Looking at it again with a clear mind, I think I missed a case, something like
If nobody minds, I'm going to use this as a place to dump my ideas on this as I notice any problems with my $predicate, in the hope that these notes are useful to someone who decides to implement this in C#.
You can of course get the AST in the terminal, but I've found the Show-Ast module very useful: it gives me a window I can scroll through, which makes it easier to poke around the AST. It's rudimentary but helpful.
I had this line that should have been detected but wasn't:
# String check, now also matching strings whose parent is a BinaryExpressionAst
if (($ast -is [System.Management.Automation.Language.StringConstantExpressionAst] -or
     $ast -is [System.Management.Automation.Language.ExpandableStringExpressionAst]) -and
    (($ast.Parent -is [System.Management.Automation.Language.CommandExpressionAst]) -or
     $ast.Parent -is [System.Management.Automation.Language.BinaryExpressionAst])) {
    if ($ast.Parent -match '^[\u2018-\u201e]|[\u2018-\u201e]$') {
        return $true
    }
}
Another idea I had about quote matching in general. Both types of quoted strings,
The A The
if (($ast -is [System.Management.Automation.Language.UnaryExpressionAst] -or
     $ast -is [System.Management.Automation.Language.BinaryExpressionAst]) -and
    $ast.Extent.Text -match '\u2013|\u2014|\u2015') {
    return $true
}
Both of these have a
I should also add, in case you didn't catch it in the blog post, that the reason for these convoluted searches rather than just finding a character is that these special characters are allowed in quoted strings, so the check needs to be smart about parsing. The blog I linked to from 2009 includes a snippet of the PS source code that does the on-the-fly replacement, and I found that same code in the PS repository now. I didn't look at it very closely, but looking at that code might make it easier to figure out how the engine tells the difference between special characters inside a quoted string and outside a quoted string.
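One possible way to separate the two cases without walking the AST is to look at the token stream the parser returns, since everything inside a quoted string arrives as a single string token. The following is only a rough sketch under that assumption, not how PSSA or the engine does it; the function name Find-SuspiciousUnicodeToken is hypothetical, and the character ranges are the ones used in the predicate above.

```powershell
# Hypothetical helper: flag tokens that use Unicode dashes or smart quotes
# as syntax, while ignoring the same characters inside string contents.
function Find-SuspiciousUnicodeToken {
    param(
        [Parameter(Mandatory)]
        [string]$Code
    )

    $tokens = $errors = $null
    [void][System.Management.Automation.Language.Parser]::ParseInput(
        $Code, [ref]$tokens, [ref]$errors)

    $stringKinds = 'StringLiteral', 'StringExpandable',
                   'HereStringLiteral', 'HereStringExpandable'

    foreach ($token in $tokens) {
        $kind = $token.Kind.ToString()

        # Comments may contain any text, so skip them entirely.
        if ($kind -eq 'Comment') { continue }

        if ($kind -in $stringKinds) {
            # Inside a string the characters are legal; only the
            # delimiters themselves are suspicious.
            if ($token.Text -match '^[\u2018-\u201e]|[\u2018-\u201e]$') {
                $token
            }
        }
        elseif ($token.Text -match '[\u2013-\u2015]') {
            # Any other token (operator, parameter, command name)
            # containing an en dash, em dash or horizontal bar.
            $token
        }
    }
}

# Usage: the en dash is built from its code point so that the example
# itself does not depend on the file encoding.
$sample = "Get-ChildItem " + [char]0x2013 + "Path ."
Find-SuspiciousUnicodeToken -Code $sample |
    ForEach-Object { '{0} at line {1}' -f $_.Kind, $_.Extent.StartLineNumber }
```

This sketch only inspects top-level tokens, so anything nested inside an expandable string (for example a subexpression) would need an extra pass over the nested tokens.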
When PowerShell code is copied into Visual Studio Code from external sources, some characters arrive converted to an incorrect Unicode equivalent.
Examples are:
Double quote U+0022 <"> converted to left or right double quote U+201C and U+201D
Apostrophe U+0027 <'> converted to left or right single quote U+2018 and U+2019
Dash U+002D <-> converted to En dash U+2013
The vscode-powershell extension displays these characters as valid, while PowerShell ISE either displays them as invalid or automatically changes them during cut and paste.
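To make the examples above concrete, here is a minimal sketch of a plain text scan that reports where those code points occur. The function name Find-UnicodeSubstitution and the output shape are assumptions made for illustration; the lookup table contains only the characters listed in this issue. Note that a scan like this knows nothing about PowerShell syntax, so it would also report characters that sit legitimately inside quoted strings.

```powershell
# Hypothetical helper: report each listed substitution with its position.
function Find-UnicodeSubstitution {
    param(
        [Parameter(Mandatory)]
        [AllowEmptyString()]
        [string]$Text
    )

    # Code point -> ASCII character it most likely replaced
    # (only the examples given in this issue).
    $substitutions = @{
        0x201C = '"'   # left double quote
        0x201D = '"'   # right double quote
        0x2018 = "'"   # left single quote
        0x2019 = "'"   # right single quote
        0x2013 = '-'   # en dash
    }

    $lineNumber = 0
    foreach ($line in $Text -split "`r?`n") {
        $lineNumber++
        for ($i = 0; $i -lt $line.Length; $i++) {
            $codePoint = [int]$line[$i]
            if ($substitutions.ContainsKey($codePoint)) {
                [pscustomobject]@{
                    Line      = $lineNumber
                    Column    = $i + 1
                    CodePoint = 'U+{0:X4}' -f $codePoint
                    Suggested = $substitutions[$codePoint]
                }
            }
        }
    }
}

# Usage: a pasted line whose string is delimited by smart double quotes.
$pasted = "Write-Host " + [char]0x201C + "hi" + [char]0x201D
Find-UnicodeSubstitution -Text $pasted | Format-Table
```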