Description
Context and Problem
The whole idea of detecting unknown licenses and reporting these with the real detection is a bit weird and confusing.
We should make this area clearer and cleaner by cleaning up the license keys in use and changing the way "unknown" licenses are reported, named and detected. Since we detect many short mentions and do diffs, we may at times catch too many such "unknown" licenses, or not enough, and returning these in the same data structure as the detected licenses may be confusing.
I think we can do better.
First let's refine what we mean by unknown: this is a detected text that is highly likely to be a license text or notice, but that cannot be properly matched to a known, named license (one with a proper, not unknown, license key).
When doing a full scrub of a codebase these are important, as they are the things that need a detailed review.
This is a proposal to revamp and clarify the way we detect, process and report these.
There are several things to consider:
- improve the data model for the License "YAML" data
- use fewer license keys for unknown things
- improve the way we deal with multiple "low score" detected licenses in a single file
- report unknowns in a separate section of the scan results, not mixed with the main license detection
- improve the detection of unknown licenses
- follow license references such as "see COPYING for license" to report the detected license in the referenced file, if any
- properly handle "intro texts" that are used generically to introduce license terms
Improve the License data model definition (e.g. the License "YAML" data)
Unknown means for now that this is something matched to a license rule tagged with the "unknown" license key. It would help to clarify which licenses are "special" licenses by having proper attributes. Today we can only identify some "unknown" license based on its key.
Beyond the regular "named" licenses, we have a few "special" license keys that could be made more explicit.
They could be separated in these categories:
- "generic" licenses:
commercial-license
proprietary-license
generic-cla
other-copyleft
other-permissive
public-domain
public-domain-disclaimer
warranty-disclaimer
All of these are used as an alternative to a named license (and sometimes for rarer licenses not worthy of a named entry). These are all generic "catch-all" license keys. Each match can reference a different text or notice: they do not have a self-standing reference license text stored with the .LICENSE file. We could make these more explicit by adding an is_generic license attribute, to mark them as being different from the default "named" licenses.
- "unknown" licenses:
unknown
free-unknown
unknown-spdx
unknown-license-reference
These are used to depict a detection that is likely about licenses but is NOT matched to a proper rule for a named or generic license. Instead it is matched to a rule tagged with one of these unknown license keys.
- ✔️ Also we have unknown-license-reference rules that are more like generic intro texts about licensing. The .RULE file names have been prefixed with "lead-in".
For instance "Software License agreement" would be such an intro text.
So to handle all of these, I suggest we add these new model attributes:
In the License (and also in the License Rule):
- is_generic tells that a license is generic (case 1.)
- is_unknown tells that a license is about some unnamed license (case 2.)
In the License Rule:
- is_license_intro tells that a license detection rule is some license intro text (case 3.). This would be used in conjunction with the unknown license key.
All done:
- ✔️ implemented with the is_generic flag
- ✔️ implemented with the is_unknown flag for licenses and the has_unknown property for rules
- ✔️ implemented with the is_license_intro flag
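As an illustration of how these flags let code tell the three cases apart without inspecting key names, here is a minimal sketch. The License and Rule classes below are hypothetical, pared-down stand-ins and not the actual ScanCode models; only the attribute names is_generic, is_unknown and is_license_intro come from this proposal.

```python
from dataclasses import dataclass


@dataclass
class License:
    # Hypothetical, simplified stand-in for the License "YAML" data.
    key: str
    is_generic: bool = False
    is_unknown: bool = False


@dataclass
class Rule:
    # Hypothetical, simplified stand-in for a license detection rule.
    license_key: str
    text: str
    is_license_intro: bool = False


def classify(lic: License) -> str:
    """Return "unknown", "generic" or "named" from the flags, not from the key."""
    if lic.is_unknown:
        return "unknown"
    if lic.is_generic:
        return "generic"
    return "named"


licenses = [
    License("mit"),
    License("other-permissive", is_generic=True),
    License("unknown", is_unknown=True),
]
kinds = {lic.key: classify(lic) for lic in licenses}

# An intro rule is tagged with the unknown key plus the intro flag.
intro_rule = Rule("unknown", "Software License agreement", is_license_intro=True)
```

With explicit flags, adding or retiring a special key does not require touching code that currently compares key strings.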
License keys cleanup and retirement: use fewer license keys for unknown things
- The free-unknown key is a weird one that should be retired: we should return either unknown or other-copyleft or other-permissive instead.
- unknown-spdx is used to report some unknown license in an SPDX license expression. Since this is only produced by the SPDX expression detection, it is unlikely to be part of other "unknown" detections. It should be tagged with is_unknown.
- unknown-license-reference is special if and only if it has a referenced_filenames attribute. So it could be folded into unknown too. The ones that have referenced_filenames will be dealt with accordingly. Otherwise any rule can be tagged with is_license_reference to the same effect.
- We also have a category of unknown license rules, for now renamed with the prefix lead-in_, that do not really depict unknown licenses but rather are intro or title texts used before a license text or notice. For instance "Software license agreement" or "As a special exception" or "is under the following license", etc. These are really special and are of interest if and only if they show up with otherwise unknown license texts. When they are detected just before an actual known, named license text or notice, reporting them as unknown is noisy and instead their match should be folded into the main match they introduce. We should tag these as unknown and is_license_intro.
Improve the way we deal with multiple "low score" detected licenses in a single file
We could merge matches to different detected "rules" in the same text region into one match when they point to the same or a related license.
We could also handle the cases where we have some mentions of bare GPL (detected as gpl-1.0-plus) together with some GPL version: these could be merged in some cases into one match to the versioned GPL. This is mostly for FSF licenses.
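A minimal sketch of the bare-GPL merge follows. The policy used here (drop gpl-1.0-plus only when exactly one versioned GPL key is detected in the same region) is an assumption made for illustration, and merge_bare_gpl is a hypothetical helper, not an existing function.

```python
BARE_GPL = "gpl-1.0-plus"


def merge_bare_gpl(keys):
    """Drop the bare GPL key when a single versioned GPL key is also present.

    `keys` is the list of license keys detected in one file or text region.
    Assumption: versioned GPL keys start with "gpl-".
    """
    versioned = {k for k in keys if k.startswith("gpl-") and k != BARE_GPL}
    if BARE_GPL in keys and len(versioned) == 1:
        # The bare mention is treated as referring to the versioned GPL.
        return [k for k in keys if k != BARE_GPL]
    # Ambiguous or unrelated: keep everything for review.
    return list(keys)
```

When several GPL versions appear together the bare mention is left alone, since it is not clear which version it refers to.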
Report unknowns separately
We should use a separate section of the scan results to report "unknown" license detections, not mixed with the main license detection for clarity.
This could be a new section named "unknown_license_references" where we report the matched text and positions, but where we cannot report a specific matched license rule or key.
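One possible shape for such a split report is sketched below. The match fields (matched_text, start_line, end_line, is_unknown) and the "licenses" section name are assumptions for illustration; only the "unknown_license_references" name comes from this proposal.

```python
def build_report(matches):
    """Split matches into regular detections and unknown references."""
    report = {"licenses": [], "unknown_license_references": []}
    for match in matches:
        if match.get("is_unknown"):
            # No rule or key is reported for unknowns: only text and positions.
            report["unknown_license_references"].append({
                "matched_text": match["matched_text"],
                "start_line": match["start_line"],
                "end_line": match["end_line"],
            })
        else:
            report["licenses"].append(match)
    return report


sample = [
    {"license_key": "mit", "matched_text": "MIT License",
     "start_line": 1, "end_line": 1},
    {"license_key": "unknown", "is_unknown": True,
     "matched_text": "licensed under our terms",
     "start_line": 10, "end_line": 10},
]
report = build_report(sample)
```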
Improve the detection of unknown licenses
Besides, or as a replacement for, the current detection of the "unknown" license rules, we should have a new and more efficient way to detect unknown licenses using ngrams. The process would be roughly:
- run the regular license detection.
- remove from the results and keep aside any match with a coverage below a threshold (possibly merging them too)
- collect all the spans of scanned text that are not matched to any license
- run these spans through an automaton index that will contain ngrams from all regular license texts and rules
- merge all these matched spans (and possibly also the weak matches) into a single matched span. Consider also including any "unknown" rule matches.
- split this into multiple spans wherever the match has large enough gaps.
- report these as "unknown_license_references"
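The steps above can be sketched roughly as follows. This is a simplified illustration, not the proposed implementation: it uses a plain Python set of word ngrams in place of an automaton index, naive whitespace tokenization, and hypothetical names (build_index, find_unknown_spans, max_gap).

```python
def ngrams(tokens, n):
    """Return all contiguous n-token tuples of `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(license_texts, n=3):
    """Collect ngrams from all regular license texts and rules."""
    index = set()
    for text in license_texts:
        index.update(ngrams(text.lower().split(), n))
    return index


def find_unknown_spans(tokens, matched_positions, index, n=3, max_gap=5):
    """Return (start, end) token spans that look like license text.

    `matched_positions` holds token positions already covered by regular
    matches; any ngram window touching them is skipped.
    """
    hits = set()
    for i in range(len(tokens) - n + 1):
        window = range(i, i + n)
        if any(pos in matched_positions for pos in window):
            continue
        if tuple(tokens[i:i + n]) in index:
            hits.update(window)

    # Merge hit positions into spans, starting a new span whenever the
    # distance from the previous hit exceeds `max_gap`.
    spans = []
    for pos in sorted(hits):
        if spans and pos - spans[-1][1] <= max_gap:
            spans[-1][1] = pos
        else:
            spans.append([pos, pos])
    return [tuple(span) for span in spans]


index = build_index(["permission is hereby granted to use this software"])
tokens = ("foo bar permission is hereby granted baz qux quux corge "
          "grault garply waldo to use this software").split()
spans = find_unknown_spans(tokens, set(), index)
```

Each resulting span would then be reported under "unknown_license_references". A real automaton over the ngrams would avoid rebuilding a tuple per window, which matters on large codebases.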
Follow license references to another file
This is for references such as "see COPYING for license", i.e. mentions to look for license details in another file: we would report the license detected in the referenced file, if any.
We should have a way to follow license references using the referenced_filenames attribute and find the detected license in these files. And when such a conclusive reference is found, we should update the match to use the referenced license(s) that were detected.
This is tracked in #1364
Note that for now, this does NOT include following URLs which would imply having network access.
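A minimal sketch of this reference-following step is shown below. It assumes scan results are available as a mapping of file path to a list of match dicts, and that a referenced file sits in the same directory as the file that mentions it; both are simplifying assumptions, and follow_license_references and the from_file field are hypothetical names.

```python
import posixpath


def follow_license_references(scan):
    """Return a copy of `scan` where reference matches are resolved."""
    resolved = {}
    for path, matches in scan.items():
        new_matches = []
        for match in matches:
            found = []
            for name in match.get("referenced_filenames", []):
                # Assumption: the referenced file is next to the referencing file.
                target = posixpath.join(posixpath.dirname(path), name)
                for other in scan.get(target, []):
                    # Only conclusive detections count: skip unknowns and
                    # do not chain through further references.
                    if other.get("is_unknown") or other.get("referenced_filenames"):
                        continue
                    found.append({"license_key": other["license_key"],
                                  "from_file": target})
            # Keep the original match when nothing conclusive was found.
            new_matches.extend(found if found else [match])
        resolved[path] = new_matches
    return resolved


scan = {
    "src/COPYING": [{"license_key": "gpl-2.0"}],
    "src/main.c": [{"license_key": "unknown-license-reference",
                    "referenced_filenames": ["COPYING"]}],
    "src/other.c": [{"license_key": "unknown-license-reference",
                     "referenced_filenames": ["LICENSE"]}],
}
result = follow_license_references(scan)
```

No network access is involved: only files present in the scan are considered.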
Properly handle "intro texts" that are used generically to introduce license terms
These are "lead-in"-like intro texts that are detected alone and qualified as "unknown" today. For instance "License agreement" is a rule that would be detected as such.
We should have a way to detect such a text fragment and to discard or merge such a detection if this is immediately tied to an actual "named" license detection. This would avoid many unknown detections.
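The folding of an intro match into the named license match it introduces could look like the sketch below. Matches are assumed to be dicts with start and end positions plus the is_license_intro and is_unknown flags; fold_license_intros and max_gap are hypothetical names chosen for this illustration.

```python
def fold_license_intros(matches, max_gap=1):
    """Fold intro matches into the named license match that follows them."""
    ordered = sorted(matches, key=lambda m: m["start"])
    folded = []
    i = 0
    while i < len(ordered):
        current = ordered[i]
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        if (
            current.get("is_license_intro")
            and nxt is not None
            and not nxt.get("is_unknown")
            and not nxt.get("is_license_intro")
            and nxt["start"] - current["end"] <= max_gap
        ):
            # Extend the named match backwards to cover the intro text.
            folded.append(dict(nxt, start=current["start"]))
            i += 2
        else:
            # A lone intro, or one followed by an unknown, is kept as-is.
            folded.append(current)
            i += 1
    return folded


intro = {"license_key": "unknown", "start": 1, "end": 1,
         "is_license_intro": True, "is_unknown": True}
mit = {"license_key": "mit", "start": 2, "end": 20}
folded = fold_license_intros([intro, mit])
```

An intro that is not immediately followed by a named license stays in the results, since that is exactly the case that deserves review.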