Update rules with required phrases automatically #3924

Merged
merged 91 commits into from
Apr 10, 2025

Conversation

AyanSinhaMahapatra
Member

@AyanSinhaMahapatra AyanSinhaMahapatra commented Sep 17, 2024

This is a continuation of #3254 with added required phrases in license rules after review and further manual curations. Also contains improvements in required phrase collection and marking.

Reference: #2637 #2878

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 📑 and links the original issue above 🔗
  • Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
    Run tests locally to check for errors.
  • Commits are in uniquely-named feature branch and has no merge conflicts 📁

Add a script which can add required phrases in already existing rules
automatically from required phrases already present in other rules and
license field names. This can be done one license expression at a time.

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
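As a rough illustration of the idea in this commit message, here is a minimal sketch of how known required phrases could be marked in an existing rule text. The function name `add_required_phrase_markers`, the regex-based matching, and the `{{phrase}}` marker spacing are assumptions made for this example; this is not the script added by the PR.

```python
import re

# Existing {{...}} spans are kept untouched so the operation stays idempotent.
MARKED = re.compile(r"(\{\{.*?\}\})", re.DOTALL)


def add_required_phrase_markers(rule_text, phrases):
    """
    Return rule_text with each occurrence of a known phrase wrapped in
    {{ }} markers. Text that is already inside markers is left as-is.
    Hypothetical helper for illustration only.
    """
    if not phrases:
        return rule_text

    # Prefer the longest phrase when several candidates overlap.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in ordered) + r")\b",
        re.IGNORECASE,
    )

    # re.split with a capturing group puts the already-marked spans at
    # odd indexes and the unmarked text at even indexes.
    parts = MARKED.split(rule_text)
    result = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            result.append(part)
        else:
            result.append(pattern.sub(lambda m: "{{" + m.group(1) + "}}", part))
    return "".join(result)


marked = add_required_phrase_markers(
    "Licensed under the MIT license.",
    ["MIT license"],
)
print(marked)
```

Running the function a second time on its own output should leave the text unchanged, since already-marked spans are skipped.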
@AyanSinhaMahapatra AyanSinhaMahapatra changed the title Update rules with required phrases auto Update rules with required phrases automatically Sep 17, 2024
@AyanSinhaMahapatra AyanSinhaMahapatra force-pushed the update-rules-with-required-phrases-auto branch from ea221d4 to 518116d Compare September 18, 2024 13:43
@pombredanne
Member

I am pushing shortly a few updates:

  • decouple the creation of new rules from updating existing rules in a separate CLI
  • ensure we skip more rules in the whole process: not only tiny rules, but any rule that cannot be matched approximately, and also false positives
  • ensure that no rule gets a required phrase addition that would break in the middle of a URL, email, or copyright. This will be done by checking that no required phrase injection changes the set of ignorables of a rule, for instance by making a URL no longer a proper URL.
  • extend the flag for "skipping" the collection of required phrases so that it skips a rule from both required phrases collection AND injection. This allows handling exceptions more easily.
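The ignorables check described in the list above could look roughly like the following sketch, limited here to URLs. The regex and the `injection_preserves_urls` helper are assumptions for illustration; the actual check would rely on the rule's own ignorables detection rather than a hand-written pattern.

```python
import re

# Simplified URL pattern (an assumption): braces are excluded so that a
# {{ or }} marker placed inside a URL truncates the detected URL.
URL = re.compile(r"https?://[^\s<>\"'{}]+")


def injection_preserves_urls(original_text, marked_text):
    """
    Return True if the URLs detected in marked_text are exactly the URLs
    detected in original_text, in the same order.
    Hypothetical helper for illustration only.
    """
    return URL.findall(original_text) == URL.findall(marked_text)


original = "See https://www.gnu.org/licenses/ for the GNU General Public License."
safe = "See https://www.gnu.org/licenses/ for the {{GNU General Public License}}."
broken = "See https://www.{{gnu}}.org/licenses/ for the GNU General Public License."

print(injection_preserves_urls(original, safe))
print(injection_preserves_urls(original, broken))
```

An injection that fails this check would be rejected and the rule left unmodified.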

Do not damage rules with URLs

Signed-off-by: Philippe Ombredanne <[email protected]>
Ensure that the leading /usr is not broken with {{ required phrase }}
markers.

Signed-off-by: Philippe Ombredanne <[email protected]>
Ensure that /usr paths are not broken with {{ required phrase }}
markers.

Signed-off-by: Philippe Ombredanne <[email protected]>
Ensure that URLs are not broken with {{ required phrase }} markers.

Signed-off-by: Philippe Ombredanne <[email protected]>
This is code that belongs to required_phrase.py, not to tokenize.py

Signed-off-by: Philippe Ombredanne <[email protected]>
This creates many false positives.

Signed-off-by: Philippe Ombredanne <[email protected]>
This helps with required phrases handling, addition and generation

* Add new Rule.source attribute to track the "source" of a license rule
  like when adding a new required phrase to a rule
* Add new Rule.is_tiny computed attribute to track tiny, very small
  rules
* Add new Rule.is_approx_matchable property for rules that can only be
  matched exactly
* Add new Rule.is_generic for rules that contain "generic" licenses
* Support required_phrases-related fields in Rule.validate()
* Update index.py accordingly

Signed-off-by: Philippe Ombredanne <[email protected]>
Only process stopwords for "is_continuous" rules

Signed-off-by: Philippe Ombredanne <[email protected]>
Some rules now have a "is_required_phrase" flag

Signed-off-by: Philippe Ombredanne <[email protected]>
@petergardfjall
Contributor

petergardfjall commented Feb 19, 2025

Hey! This PR looks promising as I think it has the potential to greatly reduce false positives.
Just to drop an example, having required phrases throughout the rule set would avoid "silly" mistakes like this:

"rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/lgpl-2.1_250.RULE",
"matched_text": "// Licensed under the MIT license. \nnamespace Microsoft.OpenApi.Readers.V2"

And I do see quite a lot of these "ghost" detections.

I was just checking in to ask if there is any hope of seeing this PR merged. It has been open for some time.

@pombredanne
Member

@petergardfjall re:

I was just checking in to ask if there is any hope of seeing this PR merged. It has been open for some time.

Yes! I am about to have a chat with @AyanSinhaMahapatra on just that.

@pombredanne
Member

The longer story is that there are reconciliations to do wrt. the changes I made and changes that @AyanSinhaMahapatra has not yet pushed.

@AyanSinhaMahapatra AyanSinhaMahapatra force-pushed the update-rules-with-required-phrases-auto branch from eba882d to de19fbe Compare March 31, 2025 10:31
This was referenced Apr 4, 2025
Signed-off-by: Philippe Ombredanne <[email protected]>
The test for a key phrase uninterrupted continuity must be done in all
cases, whether a matched rule is "continuous", or a full "required
phrase".

Fix test and add new test.

Signed-off-by: Philippe Ombredanne <[email protected]>
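To make the continuity requirement concrete, here is a minimal sketch of such a check over aligned token positions. The data model (parallel lists of matched rule and query token positions) and the name `is_phrase_continuous` are assumptions for this example and do not reflect the project's actual matching code.

```python
def is_phrase_continuous(rule_positions, query_positions, phrase_start, phrase_end):
    """
    Return True if every rule token position in the inclusive range
    phrase_start..phrase_end was matched, and the matched query token
    positions for that range are consecutive, with no gap in between.
    rule_positions and query_positions are aligned lists of matched
    token positions. Hypothetical helper for illustration only.
    """
    query_by_rule = dict(zip(rule_positions, query_positions))
    try:
        matched = [query_by_rule[pos] for pos in range(phrase_start, phrase_end + 1)]
    except KeyError:
        # At least one token of the required phrase was not matched at all.
        return False
    # Consecutive query positions mean nothing interrupts the phrase.
    return all(nxt - cur == 1 for cur, nxt in zip(matched, matched[1:]))


# Required phrase covers rule tokens 3 and 4.
print(is_phrase_continuous([0, 1, 2, 3, 4], [10, 11, 12, 13, 14], 3, 4))
print(is_phrase_continuous([0, 1, 2, 3, 4], [10, 11, 12, 13, 15], 3, 4))
```

In this model the same check applies regardless of whether the rule as a whole is flagged as continuous or is itself a required phrase.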
Some tests running with a single process and threading are randomly
failing in the CI, which runs various node-based wrapper environments
that then spawn lower-level tools like our test runners.
This sometimes makes the Python process and threading run into corner
cases that fail. This commit moves these issues out of the way to
avoid failing the test run when this happens.

Signed-off-by: Philippe Ombredanne <[email protected]>
@pombredanne pombredanne force-pushed the update-rules-with-required-phrases-auto branch from 681bbee to f837a38 Compare April 9, 2025 21:34
@pombredanne
Member

@AyanSinhaMahapatra all green... I am merging now 🙇

@pombredanne pombredanne merged commit e054254 into develop Apr 10, 2025
43 checks passed
@pombredanne pombredanne deleted the update-rules-with-required-phrases-auto branch April 10, 2025 07:45
4 participants