Support equivalent words in license detection #4190 #4215

Merged
merged 28 commits into from
Apr 22, 2025

Member

@pombredanne pombredanne commented Mar 26, 2025

This PR improves license detection with multiple new features, motivated by treating the alternative spellings "license" and "licence" as equivalent.

  • Handle similar words in license detection by allowing multiple "legalese words" to have the same token id.

  • Tag interesting similar words with the same token id including license/licence and more.

  • Regenerate the token ids accordingly.

  • Convert Index.tokens_by_tid to a computed property, available on demand. Convert tokens_by_tid to a dictionary from a list. Ensure that all code relying on the tokens_by_tid is updated as needed. All locations were used only for testing and debugging.

  • Deprecate all rules that are duplicated under this new regime, where tokens like "license" and "licence" are now treated as identical.

  • Update the test suite to test the detection of all deprecated licenses and rules as a sanity check. A deprecated rule with "relevance" set to 0 is not tested, as some rules are deprecated because they are false positives and should no longer be detected. Also improve the validation and loading of rule relevance, including the case of zero relevance.

  • Update ambiguous or conflicting rules as needed. In particular ensure that all rules in the style of "MIT or GPL" without a GPL version are now reported consistently as: "mit or gpl-1.0-plus"

  • Add new rules as needed to resolve failing tests and improve accuracy.

  • Improve support for deprecated rules and licenses by adding a new "replaced_by" list attribute that lists the new expressions that must be detected when scanning the deprecated license or rule text.
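
The idea of letting several "legalese words" share one token id can be illustrated with a minimal sketch. This is not ScanCode's actual code: `EQUIVALENT_WORDS`, `build_token_ids` and `tokenize` are hypothetical names, and only the license/licence pair is taken from this PR.

```python
# Hypothetical sketch: these names are illustrative, not ScanCode's API.
# Each tuple groups spellings that should be treated as the same token.
EQUIVALENT_WORDS = [
    ("license", "licence"),
]


def build_token_ids(words, equivalents=EQUIVALENT_WORDS):
    """Return a {word: token id} mapping where equivalent words share an id."""
    # Map every spelling to the first spelling of its group.
    canonical = {}
    for group in equivalents:
        for word in group:
            canonical[word] = group[0]

    token_ids = {}
    ids_by_canonical = {}
    for word in words:
        key = canonical.get(word, word)
        if key not in ids_by_canonical:
            # Assign the next free id the first time a group is seen.
            ids_by_canonical[key] = len(ids_by_canonical)
        token_ids[word] = ids_by_canonical[key]
    return token_ids


def tokenize(text, token_ids):
    """Convert text to a list of token ids, skipping unknown words."""
    return [token_ids[w] for w in text.lower().split() if w in token_ids]


ids = build_token_ids(["license", "mit", "licence", "gpl"])
# Both spellings produce the same id sequence, so one rule matches either.
same = tokenize("MIT License", ids) == tokenize("mit licence", ids)
```

With ids assigned this way, two rule texts that differ only in spelling become identical token sequences, which is why such rules end up duplicated and are deprecated here.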

Reference: #4190
Fixes: #4190

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 📑 and links the original issue above 🔗
  • Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
    Run tests locally to check for errors.
  • Commits are in a uniquely named feature branch and have no merge conflicts 📁
  • Updated documentation pages (if applicable)
  • Updated CHANGELOG.rst (if applicable)

Handle similar words in license detection by allowing multiple
"legalese words" to have the same token id.

Regenerate the token ids accordingly.

Convert Index.tokens_by_tid to a computed property, available on demand.
Convert tokens_by_tid to a dictionary from a list.
Ensure that all code relying on the tokens_by_tid is updated as needed.
All locations were used only for testing and debugging.
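
A minimal sketch of what an on-demand `tokens_by_tid` could look like follows. The `Index` class here is a toy stand-in, not ScanCode's actual class, and mapping each id to a list of tokens is an assumption made for illustration; the source only states that the attribute becomes a dictionary computed on demand.

```python
class Index:
    """Toy stand-in for a license index; not ScanCode's actual class."""

    def __init__(self, dictionary):
        # {token string: token id}; equivalent words may share one id.
        self.dictionary = dictionary

    @property
    def tokens_by_tid(self):
        """Compute a {token id: [tokens]} mapping when it is requested.

        Assumption: each id maps to a list of tokens, since several
        words can now share the same id.
        """
        by_tid = {}
        for token, tid in self.dictionary.items():
            by_tid.setdefault(tid, []).append(token)
        return by_tid


index = Index({"license": 0, "licence": 0, "mit": 1})
mapping = index.tokens_by_tid
```

Since the mapping is reported as used only for testing and debugging, recomputing it on each access keeps it out of the stored index at little practical cost.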

Deprecate all rules that are duplicated under this new regime, where
tokens like "license" and "licence" are now treated as identical.

Update the test suite to test the detection of all deprecated licenses
and rules as a sanity check. A deprecated rule with "relevance" set to 0
is not tested, as some rules are deprecated because they are false
positives and should no longer be detected. Also improve the validation
and loading of rule relevance, including the case of zero relevance.
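
The test selection described above could be sketched as follows. The helper names, the dict-based rule representation, the 0 to 100 range and the default of 100 for an unset relevance are all assumptions made for illustration, not ScanCode's actual implementation.

```python
def validate_relevance(value):
    """Return relevance as an int in 0..100, keeping an explicit 0.

    Assumption: an unset relevance defaults to 100.
    """
    if value is None:
        return 100
    # Note: "value or 100" would wrongly turn an explicit 0 into 100.
    relevance = int(value)
    if not 0 <= relevance <= 100:
        raise ValueError(f"relevance must be between 0 and 100: {value!r}")
    return relevance


def should_test_detection(rule):
    """Return False only for deprecated rules with zero relevance."""
    is_deprecated = rule.get("is_deprecated", False)
    relevance = validate_relevance(rule.get("relevance"))
    return not (is_deprecated and relevance == 0)


false_positive = {"is_deprecated": True, "relevance": 0}
duplicate = {"is_deprecated": True, "relevance": 100}
skip_first = not should_test_detection(false_positive)
test_second = should_test_detection(duplicate)
```

The zero case is the one most easily lost when loading data, since 0 is falsy in Python; checking for `None` explicitly avoids that.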

Update ambiguous or conflicting rules as needed.
In particular ensure that all rules in the style of "MIT or GPL"
without a GPL version are now reported consistently as:
"mit or gpl-1.0-plus"

Add new rules as needed to resolve failing tests and improve accuracy.

Improve support for deprecated rules and licenses by adding a new
"replaced_by" list attribute that lists the new expressions that must be
detected when scanning the deprecated license or rule text.
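
One way to read the "replaced_by" check is sketched below, under stated assumptions: rules are plain dicts, `detect` is a hypothetical callable returning the license expressions found in a text, and a rule without "replaced_by" falls back to its own expression. None of these names are confirmed by the source.

```python
def check_replaced_by(rule, detect):
    """Return True if scanning the rule text yields the expected expressions.

    For a deprecated rule, the expected expressions are those listed in
    "replaced_by"; otherwise (assumption) the rule's own expression.
    """
    expected = rule.get("replaced_by") or [rule["license_expression"]]
    detected = detect(rule["text"])
    return all(expression in detected for expression in expected)


# A deprecated rule that should now be detected as "mit".
rule = {
    "license_expression": "old-key",  # hypothetical deprecated key
    "is_deprecated": True,
    "replaced_by": ["mit"],
    "text": "MIT licence",
}

# Stand-in detectors for illustration only.
ok = check_replaced_by(rule, lambda text: ["mit"])
stale = check_replaced_by(rule, lambda text: ["old-key"])
```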

Reference: #4190
Signed-off-by: Philippe Ombredanne <[email protected]>
Use correct function names
Remove duplicated license
Remove duplicated rules
Update failed merges
Adjust and rename rules as needed

Signed-off-by: Philippe Ombredanne <[email protected]>
@pombredanne pombredanne force-pushed the 4190-license-licence branch from 1394a76 to 1b508c8 April 12, 2025 13:47
Signed-off-by: Philippe Ombredanne <[email protected]>
Member Author

@pombredanne pombredanne left a comment

@AyanSinhaMahapatra here are the notes from our review

Signed-off-by: Philippe Ombredanne <[email protected]>
Use a JSON assertion on full scan results

Signed-off-by: Philippe Ombredanne <[email protected]>
Reporting a full stack trace and reraising an
exception is helpful in debug mode.

Signed-off-by: Philippe Ombredanne <[email protected]>
@pombredanne pombredanne mentioned this pull request Apr 21, 2025
6 tasks
This was making the Alpine tests fail massively

Signed-off-by: Philippe Ombredanne <[email protected]>
Provide details on each step of the Alpine expression cleanups

Signed-off-by: Philippe Ombredanne <[email protected]>
There are upcoming PRs in develop that would use the same rule file
names.

Signed-off-by: Philippe Ombredanne <[email protected]>
Update rule to adopt the "replaced_by" attribute
Update tests from renaming MIT license rule files

Signed-off-by: Philippe Ombredanne <[email protected]>
@pombredanne
Member Author

@AyanSinhaMahapatra all feedback is addressed. Follow-up PRs will be submitted as soon as this is merged:

Member

@AyanSinhaMahapatra AyanSinhaMahapatra left a comment

Thanks++ @pombredanne, LGTM!
See comments for a few small issues, will handle them separately.

@@ -76,7 +76,7 @@ def add_sequence(automaton, tids, rid, start=0, with_duplicates=False):


MATCH_AHO_EXACT = '2-aho'
Member

It does not make sense to change MATCH_AHO_EXACT_ORDER from 2 to 1 while keeping the MATCH_AHO_EXACT value as '2-aho'; this is quite confusing. We should either rename the MATCH_AHO_EXACT value to '1-aho' or remove the numbers from these names entirely.

@@ -1,2 +1,2 @@
license_expressions:
- odc-by-1.0
- ppl
Member

This is potentially a regression. Maybe this rule was not added correctly, @pombredanne?

@AyanSinhaMahapatra AyanSinhaMahapatra merged commit 427e8ae into develop Apr 22, 2025
43 checks passed
@pombredanne pombredanne deleted the 4190-license-licence branch April 23, 2025 06:24
sschuberth added a commit to oss-review-toolkit/ort that referenced this pull request Jun 27, 2025
See [1]. Adjust a license score which now has higher confidence
(probably due to [2]).

[1]: https://github.com/aboutcode-org/scancode-toolkit/releases/tag/v32.4.0
[2]: aboutcode-org/scancode-toolkit#4215

Signed-off-by: Sebastian Schuberth <[email protected]>
Development

Successfully merging this pull request may close these issues.

False Positive Detection of LGPL-2.0-plus and Other Licenses
2 participants