RFC: Revamp "unknown" license detection #1675

### [Context and Problem](#context-and-problem)

The whole idea of detecting unknown licenses and reporting them together with the actual license detections is a bit weird and confusing.

We should make this area clearer and cleaner by cleaning up the license keys in use and by changing the way "unknown" licenses are reported, named and detected. Since we may detect many short mentions and do diffs, we may at times catch too many of these "unknown" licenses, or not enough, and returning them in the same data structure as the detected licenses may be confusing.


I think we can do better.

First let's refine what we mean by unknown: this is a detected text that is highly likely to be a license text or notice, but that cannot be properly matched to a known, `named` license (one with a proper, not `unknown`, license key).

When doing a full scrub of a codebase, these are important because they are the things that need a detailed review.

This is a proposal to revamp and clarify the way we detect, process and report these.

There are several things to consider:
- improve the data model for the License "YAML" data
- use fewer license keys for unknown things
- improve the way we deal with multiple "low score" detected licenses in a single file
- report unknowns in a separate section of the scan results, not mixed with the main license detection
- improve the detection of unknown licenses
- follow license references such as "see COPYING for license" to report the license detected in the referenced file, if any.
- properly handle "intro texts" that are used generically to introduce license terms


### [Improve the License data model definition](#license-data-model-definition) (e.g. the License "YAML" data)

Unknown means for now that this is something matched to a license rule tagged with the "unknown" license key. It would help to clarify which licenses are "special" licenses by having proper attributes. Today we can only identify some "unknown" license based on its key.

Beyond the regular "named" licenses, we have a few "special" license keys that could be made more explicit.

They could be separated into these categories:

1.  "generic" licenses: `commercial-license` `proprietary-license` `generic-cla` `other-copyleft` `other-permissive` `public-domain` `public-domain-disclaimer` `warranty-disclaimer`

All of these are used as an alternative to a named license (and sometimes for rarer licenses not worthy of a named entry). These are all generic "catch-all" license keys. Each match can reference a different text or notice: they do not have a self-standing reference license text stored with the .LICENSE file. We could make these more explicit by adding an `is_generic` license attribute, to mark them as being different from the default "named" licenses.

2.  "unknown" licenses: `unknown` `free-unknown` `unknown-spdx` `unknown-license-reference`

These are used to depict a detection that is likely about licenses but is NOT matched to a proper rule for a named or generic license.
Instead this is matched to a rule tagged with one of these `unknown` license keys.

3.  :heavy_check_mark: We also have `unknown-license-reference` rules that are more like generic intro texts about licensing. Their .RULE file names have been prefixed with "lead-in".
For instance, "Software License agreement" would be such an intro text.



##### So to handle all of these, I suggest we add these [new model attributes](#new-model-attributes):

~~In the License (and also to the License Rule):~~

- ~~`is_generic` tells that a license is generic (case 1.)~~
- ~~`is_unknown` tells that a license is about some unnamed license. (case 2.)~~

~~In the License Rule:~~

- ~~`is_license_intro` to tell that a license detection rule is some license intro text. (case 3.). This would be used in conjunction with the `unknown` license key.~~

All done:

- :heavy_check_mark: **implemented with the `is_generic` flag**
- :heavy_check_mark: **implemented with the `is_unknown` flag for licenses and the `has_unknown` property for rules**
- :heavy_check_mark: **implemented with the `is_license_intro` flag**
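
As an illustration only, the three attributes could relate to each other roughly as in the sketch below. The `License` and `Rule` classes here are simplified, hypothetical stand-ins and not the actual ScanCode models; the rule file names are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class License:
    """Hypothetical, simplified stand-in for the License "YAML" data."""
    key: str
    is_generic: bool = False  # case 1: catch-all key without a reference text
    is_unknown: bool = False  # case 2: license-like text with no named license


@dataclass
class Rule:
    """Hypothetical, simplified stand-in for a license detection rule."""
    identifier: str
    licenses: list = field(default_factory=list)
    is_license_intro: bool = False  # case 3: generic intro ("lead-in") text

    @property
    def has_unknown(self):
        # A rule has an unknown if any license it points to is unknown.
        return any(lic.is_unknown for lic in self.licenses)


unknown = License(key="unknown", is_unknown=True)
other_permissive = License(key="other-permissive", is_generic=True)
mit = License(key="mit")

# Made-up rule identifiers, for illustration only.
intro_rule = Rule("lead-in_example.RULE", licenses=[unknown], is_license_intro=True)
mit_rule = Rule("mit_example.RULE", licenses=[mit])
```

With this shape, "special" licenses can be identified from their flags rather than from their key.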

### [License keys cleanup and retirement](#license-keys-cleanup-and-retirement) : use fewer license keys for unknown things.

- `free-unknown` is a weird one that should be retired: we should return either `unknown`, `other-copyleft` or `other-permissive` instead.

- `unknown-spdx` is used to report some unknown license in an SPDX license expression. Since this is only produced by the SPDX expression detection, it is unlikely to be part of other "unknown" detections. It should be tagged with `is_unknown`.

- `unknown-license-reference` is special if and only if it has `referenced_filenames`. So it could be folded into `unknown` too.
The ones that have `referenced_filenames` will be dealt with accordingly. Otherwise any rule can be tagged with `is_license_reference` to the same effect.

- We also have a category of unknown license rules, renamed for now with the `lead-in_` prefix, that do not really depict unknown licenses but are rather intro or title texts used before a license text or notice. For instance `Software license agreement`, `As a special exception` or `is under the following license`. These are really special and are of interest if and only if they show up with otherwise unknown license texts. When they are detected just before an actual known, named license text or notice, reporting them as unknown is noisy; instead, their match should be folded into the main match they introduce. We should tag these as `unknown` and `is_license_intro`.

### Improve the way we [deal with multiple "low score"](#deal-with-multiple-low-score) detected licenses in a single file

We could merge matches to different detected "rules" in the same text region into one match when they point to related licenses.

We could also handle the cases where we have some mentions of a bare GPL (detected as `gpl-1.0-plus`) together with some GPL version: in some cases, these could be merged into one match to the versioned GPL. This is mostly for FSF licenses.
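
A minimal sketch of the bare GPL case, assuming the license keys detected in one text region are available as a plain list. The `merge_bare_gpl` function is hypothetical and works on keys only; a real merge would also need to combine match positions and scores, and would only apply in some cases.

```python
def merge_bare_gpl(keys):
    """
    Return the license keys detected in one text region, dropping the bare
    ``gpl-1.0-plus`` key when a versioned GPL key is also present.
    """
    bare = "gpl-1.0-plus"
    versioned = [k for k in keys if k.startswith("gpl-") and k != bare]
    if bare in keys and versioned:
        # The bare mention is folded into the versioned GPL detection.
        return [k for k in keys if k != bare]
    return list(keys)


merged = merge_bare_gpl(["gpl-1.0-plus", "gpl-2.0"])
untouched = merge_bare_gpl(["gpl-1.0-plus", "mit"])
```

Here `merged` keeps only the versioned key, while `untouched` is returned as-is because no versioned GPL is present.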


### [Report unknowns separately](#report-unknowns-separately)

For clarity, we should use a separate section of the scan results to report "unknown" license detections, not mixed with the main license detection.
This could be a new section named "unknown_license_references" where we report the matched text and positions, but where we cannot report a specific matched license rule or key.
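
To make the idea concrete, here is a hedged sketch of how results could be split. Only the `unknown_license_references` section name comes from this proposal; the `licenses` section name and the `matched_text`, `start_line`, `end_line` and `is_unknown` fields are assumptions made for the example.

```python
def split_results(matches):
    """Separate matches into regular detections and unknown references."""
    results = {"licenses": [], "unknown_license_references": []}
    for match in matches:
        if match["is_unknown"]:
            # No rule or key is reported for unknowns: only text and positions.
            results["unknown_license_references"].append({
                "matched_text": match["matched_text"],
                "start_line": match["start_line"],
                "end_line": match["end_line"],
            })
        else:
            results["licenses"].append(match)
    return results


example = split_results([
    {"key": "mit", "is_unknown": False,
     "matched_text": "Permission is hereby granted", "start_line": 1, "end_line": 1},
    {"key": "unknown", "is_unknown": True,
     "matched_text": "licensed under our custom terms", "start_line": 5, "end_line": 5},
])
```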


### [Improve the detection of unknown licenses](#improve-the-detection-of-unknown-licenses)

Besides, or as a replacement for, the current detection of the "unknown" license rules, we should have a new and more efficient way to detect unknown licenses using ngrams. The process would be roughly:

- run the regular license detection.
- remove from the results and keep aside any match with a coverage below a threshold (possibly merging them too)
- collect all the spans of scanned text that are not matched to any license
- run these spans through an automaton index that contains ngrams from all regular license texts and rules
- merge all these matched spans (and possibly also the weak matches) into a single matched span.
  Consider also including any "unknown" rule matches.
- split this into multiple spans wherever the match has large enough gaps.
- report these as "unknown_license_references"
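
The ngram lookup, merge and split steps above could be sketched as follows. This is a simplification under stated assumptions: a plain Python set of word ngrams stands in for the automaton index, tokenization is whitespace splitting, and the function names and the `n` and `max_gap` parameters are hypothetical.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token tuples in tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(license_texts, n=3):
    """Collect the ngrams of all known license texts and rules in a set."""
    index = set()
    for text in license_texts:
        index.update(ngrams(text.lower().split(), n))
    return index


def unknown_spans(tokens, index, n=3, max_gap=2):
    """
    Return (start, end) inclusive token positions of the regions of tokens
    covered by ngrams from index, split where a gap exceeds max_gap tokens.
    """
    covered = set()
    for pos, gram in enumerate(ngrams(tokens, n)):
        if gram in index:
            covered.update(range(pos, pos + n))

    spans = []
    for pos in sorted(covered):
        if spans and pos - spans[-1][1] - 1 <= max_gap:
            spans[-1][1] = pos  # close enough: extend the current span
        else:
            spans.append([pos, pos])  # gap too large: start a new span
    return [tuple(span) for span in spans]


index = build_index(["permission is hereby granted free of charge to any person"])
# Tokens of a text span left unmatched by the regular detection.
tokens = ("foo bar permission is hereby granted to use this "
          "baz qux quux corge free of charge").split()
spans = unknown_spans(tokens, index)
```

In this example `spans` holds two regions, `(2, 5)` and `(13, 15)`, because the seven unmatched tokens between them exceed `max_gap`.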


### [Follow license references to another file](#follow-license-references-to-another-file)

This is for references such as "see COPYING for license", e.g. mentions to look for license details in another file: we should report the license detected in the referenced file, if any.

We should have a way to follow license references using the `referenced_filenames` attribute and find the detected license in these filenames. When such a reference is resolved conclusively, we should update the match to use the referenced license(s) that were detected.

This is tracked in https://github.com/nexB/scancode-toolkit/issues/1364

Note that for now, this does NOT include following URLs which would imply having network access.
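
A minimal sketch of the lookup, assuming detections are available as a mapping of file path to detected keys and `referenced_filenames`. The `follow_references` function, the data layout and the same-directory lookup are assumptions for the example; the actual resolution strategy is not defined here.

```python
def follow_references(detections):
    """
    Return a mapping of path -> license keys where a file whose match has
    referenced filenames takes over the keys detected in the referenced file.

    detections maps a path to a dict with "keys" (detected license keys)
    and "referenced_filenames" (file names mentioned by the matched rule).
    """
    resolved = {}
    for path, detection in detections.items():
        keys = list(detection["keys"])
        directory = path.rpartition("/")[0]
        for name in detection.get("referenced_filenames", []):
            # Assumption: a reference is looked up in the same directory only.
            target = f"{directory}/{name}" if directory else name
            referenced = detections.get(target)
            if referenced and referenced["keys"]:
                keys = list(referenced["keys"])
                break
        resolved[path] = keys
    return resolved


resolved = follow_references({
    "src/COPYING": {"keys": ["gpl-2.0"], "referenced_filenames": []},
    "src/main.c": {"keys": ["unknown-license-reference"],
                   "referenced_filenames": ["COPYING"]},
    "src/other.c": {"keys": ["unknown-license-reference"],
                    "referenced_filenames": ["LICENSE"]},
})
```

In this example `src/main.c` ends up with the keys of `src/COPYING`, while `src/other.c` keeps its original keys because no `src/LICENSE` was scanned.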

### [Properly handle "intro texts"](#properly-handle-intro-texts) that are used generically to introduce license terms

These are "lead-in"-like intro texts that are detected alone and qualified as "unknown" today. For instance, "License agreement" is a rule that would be detected as such.

We should have a way to detect such a text fragment and to discard or merge such a detection when it is immediately tied to an actual "named" license detection. This would avoid many unknown detections.
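
The discard case could look roughly like the sketch below. It assumes matches are dicts sorted by position that carry `is_license_intro` and `is_unknown` flags plus line numbers; the `drop_license_intros` function and its `max_distance` parameter are hypothetical, and merging the intro into the following match is left out.

```python
def drop_license_intros(matches, max_distance=1):
    """
    Return matches without the intro matches that directly precede a match
    to a named license. Matches are dicts sorted by start_line.
    """
    kept = []
    for i, match in enumerate(matches):
        if match["is_license_intro"] and i + 1 < len(matches):
            following = matches[i + 1]
            is_adjacent = following["start_line"] - match["end_line"] <= max_distance
            if is_adjacent and not following["is_unknown"]:
                continue  # the intro only introduces a known license: drop it
        kept.append(match)
    return kept


filtered = drop_license_intros([
    {"key": "unknown", "is_license_intro": True, "is_unknown": True,
     "start_line": 1, "end_line": 1},
    {"key": "mit", "is_license_intro": False, "is_unknown": False,
     "start_line": 2, "end_line": 20},
    {"key": "unknown", "is_license_intro": True, "is_unknown": True,
     "start_line": 30, "end_line": 30},
    {"key": "unknown", "is_license_intro": False, "is_unknown": True,
     "start_line": 31, "end_line": 40},
])
```

The first intro is dropped because it sits right before the `mit` match; the second one is kept because the text it introduces is itself unknown.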



