Support equivalent words in license detection #4190 #4215

pombredanne · 2025-03-26T16:19:58Z

This PR improves license detection with multiple new features, driven by treating the alternative spellings "license" and "licence" as equivalent.

Handle similar words in license detection by allowing multiple "legalese words" to share the same token id.
Tag interesting similar words with the same token id, including license/licence and more.
Regenerate the token ids accordingly.
Convert Index.tokens_by_tid to a computed property, available on demand, and convert it from a list to a dictionary. Ensure that all code relying on tokens_by_tid is updated as needed; all such call sites are used only for testing and debugging.
Deprecate all rules that are duplicated under this new regime, where tokens like "license" and "licence" are now treated as identical.
Update the test suite to test the detection of all deprecated licenses and rules as a sanity check. A deprecated rule with "relevance" set to 0 is not tested, as some rules are deprecated because they are false positives and should no longer be detected. Also improve the validation and loading of rule relevance, including the case of zero relevance.
Update ambiguous or conflicting rules as needed. In particular, ensure that all rules in the style of "MIT or GPL" without a GPL version are now reported consistently as "mit or gpl-1.0-plus".
Add new rules as needed to resolve failing tests and improve accuracy.
Improve support for deprecated rules and licenses by adding a new "replaced_by" list attribute that lists the new expressions that must be detected when scanning the deprecated license or rule text.
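
To make the token id sharing concrete, here is a minimal sketch of the idea, not the code from this PR. The names `EQUIVALENT_WORDS`, `build_dictionary` and `ToyIndex` are hypothetical, the single word pair is only the example named above, and the shape of the `tokens_by_tid` values (a sorted list of words per id) is an assumption.

```python
# Hypothetical sketch: equivalent "legalese words" share one token id.
# None of these names come from the ScanCode code base.
EQUIVALENT_WORDS = [
    ("license", "licence"),
]


def build_dictionary(words, equivalents=EQUIVALENT_WORDS):
    """Return a {word: token id} mapping where equivalent words share an id."""
    # Map every word of a group to the first word of that group.
    canonical = {}
    for group in equivalents:
        for word in group:
            canonical[word] = group[0]

    ids_by_canonical = {}
    dictionary = {}
    for word in sorted(words):
        key = canonical.get(word, word)
        # A new id is assigned only the first time a canonical word is seen.
        tid = ids_by_canonical.setdefault(key, len(ids_by_canonical))
        dictionary[word] = tid
    return dictionary


class ToyIndex:
    def __init__(self, dictionary):
        self.dictionary = dictionary

    @property
    def tokens_by_tid(self):
        """Computed on demand: {token id: sorted words sharing that id}."""
        by_tid = {}
        for word, tid in self.dictionary.items():
            by_tid.setdefault(tid, []).append(word)
        return {tid: sorted(ws) for tid, ws in by_tid.items()}


index = ToyIndex(build_dictionary(["license", "licence", "copyright", "warranty"]))
```

Once several words can map to one id, a plain list indexed by token id can no longer hold a single word per position, which is one way to read the move to a dictionary. Since the reverse mapping is only needed for tests and debugging, computing it on demand avoids storing it in the index.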

Reference: #4190
Fixes: #4190
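
The deprecation handling described above can be sketched roughly as follows. This is an illustration under assumptions, not the test code from this PR: `ToyRule`, `expected_expressions`, the rule identifiers and the old expression are hypothetical, and the 0 to 100 relevance range is assumed. Only the "relevance" and "replaced_by" attribute names and the "mit or gpl-1.0-plus" expression come from the description.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRule:
    """Hypothetical stand-in for a rule; field names other than "relevance"
    and "replaced_by" are assumptions."""
    identifier: str
    license_expression: str
    relevance: int = 100
    is_deprecated: bool = False
    replaced_by: list = field(default_factory=list)

    def __post_init__(self):
        # Check the range explicitly rather than truthiness, so that 0 is
        # accepted as a real value instead of being mistaken for "unset".
        if not isinstance(self.relevance, int) or not 0 <= self.relevance <= 100:
            raise ValueError(
                f"{self.identifier}: invalid relevance {self.relevance!r}"
            )


def expected_expressions(rule):
    """Return the expressions a scan of the rule text should detect,
    or None when the rule should be skipped by the sanity check."""
    if rule.is_deprecated:
        if rule.relevance == 0:
            # Deprecated false positive: it should no longer be detected.
            return None
        if rule.replaced_by:
            # Scanning the deprecated text must yield the replacements.
            return list(rule.replaced_by)
    return [rule.license_expression]


# Hypothetical rules for illustration only.
old = ToyRule(
    "old-mit-or-gpl.RULE",
    "hypothetical-old-expression",
    is_deprecated=True,
    replaced_by=["mit or gpl-1.0-plus"],
)
noisy = ToyRule(
    "old-false-positive.RULE",
    "hypothetical-old-expression",
    relevance=0,
    is_deprecated=True,
)
```

Here `expected_expressions(old)` gives the replacement expressions to assert against, while `expected_expressions(noisy)` returns None so the sanity check skips that rule.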

Tasks

Reviewed contribution guidelines
PR is descriptively titled 📑 and links the original issue above 🔗
Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
Run tests locally to check for errors.
Commits are in a uniquely-named feature branch and have no merge conflicts 📁
Updated documentation pages (if applicable)
Updated CHANGELOG.rst (if applicable)

src/licensedcode/models.py

tests/licensedcode/test_match.py

pombredanne

@AyanSinhaMahapatra here are the notes from our review

tests/licensedcode/test_plugin_license_detection.py

tests/scancode/test_cli.py

tests/packagedcode/test_maven.py

src/licensedcode/data/rules/agpl-3.0_26.RULE

pombredanne · 2025-04-22T09:53:29Z

@AyanSinhaMahapatra all feedback is addressed. Follow-up PRs are pending and will come as soon as this is merged:

Maven: Improve maven license detection #4261
Required phrases: Improve license required phrase generation #4237
- (I just merged the PR "Refine required phrases with stopwords #4238" #4241 into this branch)

AyanSinhaMahapatra

Thanks++ @pombredanne, LGTM!
See comments for a few small issues, will handle them separately.

AyanSinhaMahapatra · 2025-04-22T14:20:58Z

src/licensedcode/match_aho.py

@@ -76,7 +76,7 @@ def add_sequence(automaton, tids, rid, start=0, with_duplicates=False):


 MATCH_AHO_EXACT = '2-aho'


It does not make sense to change MATCH_AHO_EXACT_ORDER from 2 to 1 while keeping the MATCH_AHO_EXACT value as '2-aho'; this is quite confusing. We should either rename the MATCH_AHO_EXACT value to '1-aho' or remove the numbers from these names entirely.
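
For illustration, the second option (no numbers in the names) could look something like the sketch below. This is hypothetical and not a proposed patch: `MATCH_OTHER`, `MATCHER_ORDERS` and `by_matcher_order` are invented names, and only MATCH_AHO_EXACT and its new order of 1 come from this PR.

```python
# Hypothetical sketch: the matcher label carries no number, and the
# precedence lives in one separate place.
MATCH_AHO_EXACT = 'aho'
MATCH_AHO_EXACT_ORDER = 1

# Placeholder second matcher, only here to make the ordering visible.
MATCH_OTHER = 'other'
MATCH_OTHER_ORDER = 2

MATCHER_ORDERS = {
    MATCH_AHO_EXACT: MATCH_AHO_EXACT_ORDER,
    MATCH_OTHER: MATCH_OTHER_ORDER,
}


def by_matcher_order(matcher_names):
    """Return matcher names sorted by their configured precedence."""
    return sorted(matcher_names, key=MATCHER_ORDERS.__getitem__)
```

With this shape, changing an order value never makes the label look stale, since the label no longer encodes a position.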

AyanSinhaMahapatra · 2025-04-22T19:08:30Z

tests/licensedcode/data/datadriven/lic3/odc-1.0.text.yml

@@ -1,2 +1,2 @@
 license_expressions:
-  - odc-by-1.0
+  - ppl


This is potentially a regression. Maybe this rule was not added correctly, @pombredanne?

See [1]. Adjust a license score which now has higher confidence (probably due to [2]). [1]: https://github.com/aboutcode-org/scancode-toolkit/releases/tag/v32.4.0 [2]: aboutcode-org/scancode-toolkit#4215 Signed-off-by: Sebastian Schuberth <[email protected]>

pombredanne mentioned this pull request Mar 26, 2025

False Positive Detection of LGPL-2.0-plus and Other Licenses #4190

Closed

pombredanne added 10 commits April 10, 2025 22:59

Merge latest develop branch

0d8151b

Signed-off-by: Philippe Ombredanne <[email protected]>

Adjust licenses and rules post-merge

471ccc2

Use correct function names
Remove duplicated license
Remove duplicated rules
Update failed merges
Adjust and rename rules as needed
Signed-off-by: Philippe Ombredanne <[email protected]>

Improve license rules and tests

8889ab5

Signed-off-by: Philippe Ombredanne <[email protected]>

Add new and improved rules

7bdd64c

Signed-off-by: Philippe Ombredanne <[email protected]>

Add new and improved rules

648c7db

Signed-off-by: Philippe Ombredanne <[email protected]>

Add new and improved rules

1d7cda6

Signed-off-by: Philippe Ombredanne <[email protected]>

Add new license tests

6944487

Signed-off-by: Philippe Ombredanne <[email protected]>

Fix typo

f61a5c8

Signed-off-by: Philippe Ombredanne <[email protected]>

Correct license tests

7b17bb1

Signed-off-by: Philippe Ombredanne <[email protected]>

Build licenserules with no referenced_filenames

1b508c8

Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne force-pushed the 4190-license-licence branch from 1394a76 to 1b508c8 Compare April 12, 2025 13:47

Make license rules more selective

c85f0f6

Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne requested a review from AyanSinhaMahapatra April 13, 2025 14:36

pombredanne commented Apr 14, 2025

src/licensedcode/models.py (outdated)

pombredanne commented Apr 14, 2025

tests/licensedcode/test_match.py

pombredanne commented Apr 14, 2025

This was referenced Apr 14, 2025

Update various license rules #4093

Merged

Improve license required phrase generation #4237

Merged

pombredanne added 10 commits April 18, 2025 09:48

Fix typo in doc string

1afbeab

Signed-off-by: Philippe Ombredanne <[email protected]>

Explain weird looking expected license test result

5854716

Signed-off-by: Philippe Ombredanne <[email protected]>

Add extended license check

7185ecd

Use a JSON assertion on full scan results Signed-off-by: Philippe Ombredanne <[email protected]>

Create correct Python version variables

8af86ce

Signed-off-by: Philippe Ombredanne <[email protected]>

Simplify matches_have_unknown license function

2829ccc

Signed-off-by: Philippe Ombredanne <[email protected]>

Refine debugging output in packages

f2f36d6

Reporting a full stack trace and reraising an exception is helpful in debug mode. Signed-off-by: Philippe Ombredanne <[email protected]>
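
A minimal sketch of that debug-mode pattern, under stated assumptions: `TRACE`, `run_step` and the choice to return None when not in debug mode are hypothetical and are not taken from the packagedcode module.

```python
import traceback

# Hypothetical debug flag; the real code may use a different switch.
TRACE = False


def run_step(func, *args, trace=TRACE, log=print):
    """Call func(*args). In debug mode, report the full stack trace and
    re-raise; otherwise swallow the error and return None (an assumption)."""
    try:
        return func(*args)
    except Exception:
        if trace:
            # format_exc() returns the traceback of the exception being handled.
            log(traceback.format_exc())
            # A bare raise keeps the original exception and its traceback.
            raise
        return None
```

Logging before the bare `raise` means the trace is reported even if an outer handler later catches the exception and discards it.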

Simplify unknown license presence check

ed4fbf9

Signed-off-by: Philippe Ombredanne <[email protected]>

Remove unused import

0b340bd

Signed-off-by: Philippe Ombredanne <[email protected]>

Sort imports

c4db8f9

Signed-off-by: Philippe Ombredanne <[email protected]>

Add comments and improve docstrings

5b6998d

Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne mentioned this pull request Apr 21, 2025

Improve maven license detection #4261

Merged

6 tasks

pombredanne added 2 commits April 21, 2025 19:33

Revert changes to licensing.matches_have_unknown

16c0e65

This was making alpine test fail massively Signed-off-by: Philippe Ombredanne <[email protected]>

Improve alpine license tests debuggability

1ff605e

Provide details on each step of the Alpine expression cleanups Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne mentioned this pull request Apr 22, 2025

Use the rule identifier for replaced_by #4270

Open

pombredanne added 4 commits April 22, 2025 09:03

Rename rules to avoid merge conflicts

9a58a12

There are upcoming PRs in develop that would use the same rule file names. Signed-off-by: Philippe Ombredanne <[email protected]>

Merge latest develop

12e8f4c

Signed-off-by: Philippe Ombredanne <[email protected]>

Update changelog

2e46bd9

Signed-off-by: Philippe Ombredanne <[email protected]>

Update tests after merge and rename

d81f2b5

Update rule to adopt the "replaced_by" attribute
Update tests from renaming MIT license rule files
Signed-off-by: Philippe Ombredanne <[email protected]>

AyanSinhaMahapatra approved these changes Apr 22, 2025

AyanSinhaMahapatra merged commit 427e8ae into develop Apr 22, 2025
43 checks passed

pombredanne deleted the 4190-license-licence branch April 23, 2025 06:24
