Support equivalent words in license detection #4190 #4215
Handle similar words in license detection by allowing multiple "legalese words" to have the same token id.
Regenerate the token ids accordingly.
Convert Index.tokens_by_tid to a computed property, available on demand. Convert tokens_by_tid from a list to a dictionary. Ensure that all code relying on tokens_by_tid is updated as needed. All locations were used only for testing and debugging.
Deprecate all rules that are duplicated under this new regime, where tokens like "license" and "licence" are now treated as identical.
Update the test suite to test the detection of all deprecated licenses and rules as a sanity check. A rule with "relevance" set to 0 is not tested if deprecated, as some rules are deprecated because they are false positives and should no longer be detected. Also improve the validation and loading of rule relevance, including the case of zero relevance.
Update ambiguous or conflicting rules as needed. In particular, ensure that all rules in the style of "MIT or GPL" without a GPL version are now reported consistently as: "mit or gpl-1.0-plus"
Add new rules as needed to resolve failing tests and improve accuracy.
Improve deprecation support for rules and licenses, adding a new "replaced_by" list attribute that lists the new expressions that must be detected from scanning the deprecated license or rule text.
Reference: #4190
Signed-off-by: Philippe Ombredanne <[email protected]>
Use correct function names
Remove duplicated license
Remove duplicated rules
Update failed merges
Adjust and rename rules as needed
Signed-off-by: Philippe Ombredanne <[email protected]>
@AyanSinhaMahapatra here are the notes from our review
Use a JSON assertion on full scan results Signed-off-by: Philippe Ombredanne <[email protected]>
Reporting a full stack trace and reraising an exception is helpful in debug mode. Signed-off-by: Philippe Ombredanne <[email protected]>
This was making the Alpine tests fail massively Signed-off-by: Philippe Ombredanne <[email protected]>
Provide details on each step of the Alpine expression cleanups Signed-off-by: Philippe Ombredanne <[email protected]>
There are upcoming PRs in develop that would use the same rule file names. Signed-off-by: Philippe Ombredanne <[email protected]>
Update rule to adopt the "replaced_by" attribute
Update tests from renaming MIT license rule files
Signed-off-by: Philippe Ombredanne <[email protected]>
@AyanSinhaMahapatra all feedback is addressed. Follow up PRs are pending ASAP once this is merged:
Thanks++ @pombredanne, LGTM!
See comments for a few small issues, will handle them separately.
@@ -76,7 +76,7 @@ def add_sequence(automaton, tids, rid, start=0, with_duplicates=False):
MATCH_AHO_EXACT = '2-aho'
It does not make sense to change MATCH_AHO_EXACT_ORDER from 2 to 1 while keeping the MATCH_AHO_EXACT value as 2-aho, as this is quite confusing. We should either change the MATCH_AHO_EXACT value to 1-aho or remove the numbers from these entirely.
@@ -1,2 +1,2 @@
license_expressions:
- odc-by-1.0
- ppl
This is potentially a regression. Maybe this rule was not added correctly, @pombredanne?
See [1]. Adjust a license score which now has higher confidence (probably due to [2]). [1]: https://github.com/aboutcode-org/scancode-toolkit/releases/tag/v32.4.0 [2]: aboutcode-org/scancode-toolkit#4215 Signed-off-by: Sebastian Schuberth <[email protected]>
This PR improves license detection with multiple new features, driven by treating the alternative spellings "license" and "licence" as equivalent.
Handle similar words in license detection by allowing multiple "legalese words" to have the same token id.
Tag interesting similar words with the same token id including license/licence and more.
Regenerate the token ids accordingly.
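As a rough illustration of the shared token id idea, here is a minimal sketch. It is not the toolkit's actual code: `EQUIVALENT_WORDS` and `build_token_ids` are hypothetical names, and the only equivalence taken from this PR is license/licence.

```python
# Hypothetical groups of equivalent "legalese" words. Only license/licence
# is named in this PR; the real word list lives in the toolkit itself.
EQUIVALENT_WORDS = [
    ("license", "licence"),
]


def build_token_ids(words, equivalents=EQUIVALENT_WORDS):
    """Return a {token string: token id} mapping in which every word of an
    equivalence group shares a single id."""
    # Map each word of a group to the first word of that group.
    canonical = {}
    for group in equivalents:
        for word in group:
            canonical[word] = group[0]

    token_ids = {}
    ids_by_canonical = {}
    for word in words:
        key = canonical.get(word, word)
        if key not in ids_by_canonical:
            # Ids are assigned in order of first appearance.
            ids_by_canonical[key] = len(ids_by_canonical)
        token_ids[word] = ids_by_canonical[key]
    return token_ids


token_ids = build_token_ids(["license", "licence", "copyright", "warranty"])
```

With this mapping, a rule text spelled with "licence" and a scanned text spelled with "license" produce the same id sequence, which is why rules that differed only by that spelling become duplicates.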
Convert Index.tokens_by_tid to a computed property, available on demand. Convert tokens_by_tid to a dictionary from a list. Ensure that all code relying on the tokens_by_tid is updated as needed. All locations were used only for testing and debugging.
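The shape of this change might look like the sketch below. `MiniIndex` is a hypothetical stand-in for the real index class, and the way several spellings sharing one id are resolved (keeping the alphabetically first one) is an assumption made for the example, not something stated in this PR.

```python
class MiniIndex:
    """Hypothetical stand-in for the license index, for illustration only."""

    def __init__(self, dictionary):
        # Mapping of token string -> token id; several tokens may share an id.
        self.dictionary = dictionary

    @property
    def tokens_by_tid(self):
        """Reverse mapping of token id -> token string, rebuilt on each access
        instead of being stored on the index."""
        by_tid = {}
        for token in sorted(self.dictionary):
            tid = self.dictionary[token]
            # Assumption: keep the alphabetically first spelling per id.
            by_tid.setdefault(tid, token)
        return by_tid


idx = MiniIndex({"license": 0, "licence": 0, "copyright": 1})
tokens_by_tid = idx.tokens_by_tid
```

A list indexed by token id assumes one token per position, which no longer holds once ids are shared. A dictionary built on demand avoids storing that reverse mapping when it is only needed for testing and debugging.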
Deprecate all rules that are duplicated under this new regime, where tokens like "license" and "licence" are now treated as identical.
Update test suite to test the detection of all deprecated licenses and rules as a sanity check. A rule with "relevance" set to 0 is not tested if deprecated, as some rules are deprecated because they are false positive and should no longer be detected. Also improved the validation and loading of rules relevance, including the case for zero relevance.
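The zero-relevance case is easy to get wrong with a truthiness check, so a small sketch may help. Everything here is an assumption made for illustration: rules are plain dicts, relevance is an integer from 0 to 100, the default is 100, and the function names are hypothetical.

```python
def load_relevance(rule_data, default=100):
    """Return the rule relevance as an int, assuming a 0..100 range.

    An explicit 0 is a valid value and must not be replaced by the default,
    so the check is "is None" rather than a truthiness test.
    """
    raw = rule_data.get("relevance")
    if raw is None:
        return default
    relevance = int(raw)
    if not 0 <= relevance <= 100:
        raise ValueError(f"relevance must be between 0 and 100: {raw!r}")
    return relevance


def should_test_deprecated(rule_data):
    """Return True if a deprecated rule should go through the detection
    sanity check; deprecated rules with zero relevance are skipped."""
    if not rule_data.get("is_deprecated"):
        return False
    return load_relevance(rule_data) > 0


false_positive = {"is_deprecated": True, "relevance": 0}
duplicate = {"is_deprecated": True, "relevance": 90}
```

Here `false_positive` is left out of the sanity check, while `duplicate` is still expected to be detected.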
Update ambiguous or conflicting rules as needed. In particular ensure that all rules in the style of "MIT or GPL" without a GPL version are now reported consistently as: "mit or gpl-1.0-plus"
Add new rules as needed to resolve failing tests and improve accuracy.
Improve deprecated support for rules and licenses, adding a new "replaced_by" list attribute that lists the new expressions that must be detected from scanning the deprecated license or rule text.
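One way to picture the "replaced_by" contract is the check sketched below. It is a hypothetical illustration: the rule is a plain dict, `fake_detect` is a lookup table standing in for a real scan, and the sample text is invented. Only the "replaced_by" attribute name and the "mit or gpl-1.0-plus" expression come from this PR.

```python
def replaced_by_is_satisfied(rule, detect):
    """Return True when scanning the deprecated rule text detects exactly
    the expressions listed in its "replaced_by" attribute."""
    detected = detect(rule["text"])
    return sorted(detected) == sorted(rule["replaced_by"])


def fake_detect(text):
    # Hypothetical detector: a lookup table standing in for a real scan.
    known = {"Dual licensed under MIT or GPL": ["mit or gpl-1.0-plus"]}
    return known.get(text, [])


deprecated_rule = {
    "text": "Dual licensed under MIT or GPL",
    "is_deprecated": True,
    "replaced_by": ["mit or gpl-1.0-plus"],
}

ok = replaced_by_is_satisfied(deprecated_rule, fake_detect)
```

The point of the attribute is that a deprecated rule carries its own expected outcome, so the test suite can verify that the replacement rules cover the deprecated text.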
Reference: #4190
Fixes: #4190
Tasks
Run tests locally to check for errors.