
RFC: Revamp "unknown" license detection #1675

Closed
@pombredanne


Context and Problem

The whole idea of detecting unknown licenses and reporting these with the real detection is a bit weird and confusing.

We should make this area clearer and cleaner by cleaning up the license keys in use and changing the way "unknown" licenses are reported, named and detected. Since we eventually detect many short mentions and do diffs, we may at times catch too many of these "unknowns" or not enough, and returning them in the same data structure as the detected licenses may be confusing.

I think we can do better.

First let's refine what we mean by unknown: this is a detected text that is highly likely to be a license text or notice, but that cannot be properly matched to a known, named license (one with a proper, not unknown, license key).

When doing a full scrub of a codebase, these are important because they are the things that need a detailed review.

This is a proposal to revamp and clarify the way we detect, process and report these.

There are several things to consider:

  • improve the data model for the License "YAML" data
  • use fewer license keys for unknown things
  • improve the way we deal with multiple "low score" detected licenses in a single file
  • report unknowns in a separate section of the scan results, not mixed with the main license detection
  • improve the detection of unknown licenses
  • follow license references such as "see COPYING for license" to report the license detected in the referenced file, if any
  • properly handle "intro texts" that are used generically to introduce license terms

Improve the License data model definition (e.g. the License "YAML" data)

Unknown means for now that this is something matched to a license rule tagged with the "unknown" license key. It would help to clarify which licenses are "special" licenses by having proper attributes. Today we can only identify some "unknown" license based on its key.

Beyond the regular "named" licenses, we have a few "special" license keys that could be made more explicit.

They could be separated into these categories:

  1. "generic" licenses: commercial-license, proprietary-license, generic-cla, other-copyleft, other-permissive, public-domain, public-domain-disclaimer, warranty-disclaimer

All of these are used as an alternative to a named license (and sometimes for rarer licenses not worthy of a named entry). These are all generic "catch-all" license keys. Each match can reference a different text or notice: they do not have a self-standing reference license text stored with the .LICENSE file. We could make these more explicit by adding an is_generic license attribute to mark them as different from the default "named" licenses.

  2. "unknown" licenses: unknown, free-unknown, unknown-spdx, unknown-license-reference

These are used to depict a detection that is likely about licenses but is NOT matched to a proper rule for a named or generic license.
Instead it is matched to a rule tagged with one of these unknown license keys.

  3. ✔️ We also have unknown-license-reference rules that are more like generic intro texts about licensing. The .RULE file names have been prefixed with "lead-in".
    For instance "Software License agreement" would be such an intro text.

To handle all of these, I suggest we add these new model attributes:

In the License (and also to the License Rule):

  • is_generic tells that a license is generic (case 1.)
  • is_unknown tells that a license is about some unnamed license. (case 2.)

In the License Rule:

  • is_license_intro tells that a license detection rule is some license intro text (case 3.). This would be used in conjunction with the unknown license key.

All done:
✔️ implemented with is_generic flag
✔️ implemented with is_unknown flag for licenses and has_unknown property for rules
✔️ This has been implemented with is_license_intro flag
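
As an illustration only, the three flags could be modeled roughly as below. This is a minimal sketch and not the actual ScanCode classes: the License and Rule dataclasses, the rule identifiers, and the licenses_by_key lookup are hypothetical, and has_unknown is written as a plain method taking that lookup rather than as a property.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class License:
    # Hypothetical, simplified stand-in for a license record.
    key: str
    is_generic: bool = False  # case 1: catch-all key with no single reference text
    is_unknown: bool = False  # case 2: license-like text with no named license


@dataclass
class Rule:
    # Hypothetical, simplified stand-in for a license detection rule.
    identifier: str
    license_keys: List[str]
    is_license_intro: bool = False  # case 3: generic lead-in text

    def has_unknown(self, licenses_by_key: Dict[str, License]) -> bool:
        # True when any license key used by this rule is flagged is_unknown.
        return any(licenses_by_key[key].is_unknown for key in self.license_keys)


licenses_by_key = {
    lic.key: lic
    for lic in [
        License("mit"),
        License("other-permissive", is_generic=True),
        License("unknown", is_unknown=True),
    ]
}

# Rule identifiers below are made up for the example.
intro_rule = Rule("lead-in_example.RULE", ["unknown"], is_license_intro=True)
mit_rule = Rule("mit_example.RULE", ["mit"])
```

With explicit flags, code that needs to treat generic or unknown licenses specially can test an attribute instead of comparing against a hardcoded list of keys.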

License keys cleanup and retirement: use fewer license keys for unknown things

  • The free-unknown is a weird one that should be retired: we should return either unknown or other-copyleft or other-permissive instead.

  • unknown-spdx is used to report an unknown license in an SPDX license expression. Since it is only produced by the SPDX expression detection, it is unlikely to be part of other "unknown" detections. It should be tagged with is_unknown.

  • unknown-license-reference is special if and only if it has referenced_filenames. So it could be folded into unknown too.
    The ones that have referenced_filenames will be dealt with accordingly. Otherwise any rule can be tagged with is_license_reference to the same effect.

  • We also have a category of unknown license rules, for now renamed with the prefix lead-in_, that do not really depict unknown licenses but are rather intro or title texts used before a license text or notice. For instance "Software license agreement", "As a special exception" or "is under the following license". These are really special and are of interest if and only if they show up with otherwise unknown license texts. When they are detected just before an actual known, named license text or notice, reporting them as unknown is noisy; instead their match should be folded into the main match they introduce. We should tag these as unknown and is_license_intro.

Improve the way we deal with multiple "low score" detected licenses in a single file

We could merge matches to different detected "rules" in the same text region into one match when they point to related licenses.

We could also handle the cases where we have some mentions of a bare GPL (detected as gpl-1.0-plus) together with some GPL version: in some cases these could be merged as one match to the versioned GPL. This is mostly for FSF licenses.
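
A rough sketch of the bare GPL case, assuming the license keys detected in one text region are available as a plain list. The merge_bare_gpl helper and the "gpl-" prefix test are illustrative assumptions, not existing ScanCode code, and a real implementation would also have to merge the match spans.

```python
BARE_GPL = "gpl-1.0-plus"


def merge_bare_gpl(license_keys):
    """
    Return a copy of license_keys where the bare GPL key is dropped when a
    versioned GPL key was detected in the same region.
    Hypothetical helper: keys are compared by a simple "gpl-" prefix only.
    """
    versioned = [
        key for key in license_keys
        if key.startswith("gpl-") and key != BARE_GPL
    ]
    if BARE_GPL in license_keys and versioned:
        # The versioned GPL is more specific, so the bare mention is folded into it.
        return [key for key in license_keys if key != BARE_GPL]
    return list(license_keys)
```

For example, merge_bare_gpl(["gpl-1.0-plus", "gpl-2.0"]) keeps only the versioned key, while a bare mention alone, or next to a non-GPL key, is left untouched.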

Report unknowns separately

We should use a separate section of the scan results to report "unknown" license detections, not mixed with the main license detection for clarity.
This could be a new section named "unknown_license_references" where we report the matched text and positions, since we cannot report a specific matched license rule or key.
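
To make the shape of such a section concrete, here is a minimal sketch that splits a flat list of matches into two sections. Only the "unknown_license_references" name comes from this proposal; the match field names (license_key, matched_text, start_line, end_line), the "licenses" section name, and the split_unknowns helper are assumptions made for the example.

```python
UNKNOWN_KEYS = frozenset(
    ["unknown", "free-unknown", "unknown-spdx", "unknown-license-reference"]
)


def split_unknowns(matches, unknown_keys=UNKNOWN_KEYS):
    """
    Split matches into regular detections and unknown references.
    Unknown references keep only text and positions, with no rule or key.
    """
    known = []
    unknown = []
    for match in matches:
        if match["license_key"] in unknown_keys:
            unknown.append(
                {
                    "matched_text": match["matched_text"],
                    "start_line": match["start_line"],
                    "end_line": match["end_line"],
                }
            )
        else:
            known.append(match)
    return {"licenses": known, "unknown_license_references": unknown}
```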

Improve the detection of unknown licenses

Besides, or as a replacement for, the current detection of the "unknown" license rules, we should have a new, more efficient way to detect unknown licenses using ngrams. The process would be roughly:

  • run the regular license detection.
  • remove from the results and keep aside any match with a coverage below a threshold (possibly merge them too)
  • collect all the spans of scanned text that are not matched to any license
  • run these spans through an automaton index that will contain ngrams from all regular license texts and rules
  • merge all these matched spans (and possibly also the weak matches) in a single matched span.
    Consider also including any "unknown" rule matches.
  • split into multiple spans where the match has large enough gaps.
  • report these as "unknown_license_references"
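
The steps above can be sketched as follows. This is a simplified illustration under several assumptions: a plain Python set of word ngrams stands in for the automaton index, tokens are whitespace-separated lowercase words, weak matches are not merged back in, and the function names and the n and max_gap parameters are made up for the example.

```python
def ngrams(tokens, n):
    # All contiguous n-token windows as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(license_texts, n=3):
    # Collect the ngrams of all known license texts and rules in a set.
    index = set()
    for text in license_texts:
        index.update(ngrams(text.lower().split(), n))
    return index


def unknown_spans(tokens, matched_positions, index, n=3, max_gap=2):
    """
    Return inclusive (start, end) token spans of text that is not covered by
    a regular match but contains ngrams found in the index.
    Hits separated by more than max_gap tokens end up in separate spans.
    """
    hits = set()
    for i in range(len(tokens) - n + 1):
        window = range(i, i + n)
        if any(pos in matched_positions for pos in window):
            # Skip text that regular detection already matched.
            continue
        if tuple(tokens[i:i + n]) in index:
            hits.update(window)

    spans = []
    for pos in sorted(hits):
        if spans and pos - spans[-1][1] - 1 <= max_gap:
            # Small gap: extend the current span.
            spans[-1][1] = pos
        else:
            # Large gap: start a new span.
            spans.append([pos, pos])
    return [tuple(span) for span in spans]
```

Each returned span would then be reported under "unknown_license_references" with its text and positions. A set lookup is enough to show the logic; an automaton would mainly matter for speed on a large index.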

Follow license references to another file

This is for references such as "see COPYING for license", i.e. mentions telling the reader to look for license details in another file. The goal is to report the license detected in the referenced file, if any.

We should have a way to follow license references using the referenced_filenames attribute and find the detected license in these filenames. When such a reference is conclusively resolved, we should update the match to use the referenced license(s) that were detected.

This is tracked in #1364

Note that for now, this does NOT include following URLs which would imply having network access.
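
A minimal sketch of the lookup, assuming detections are available as a mapping of POSIX-style paths to detected license keys and that a referenced file is searched only in the directory of the referencing file. The resolve_reference helper, the match dict layout, and the same-directory rule are all assumptions for illustration; only referenced_filenames comes from the proposal.

```python
import posixpath

UNKNOWN_KEYS = frozenset(
    ["unknown", "free-unknown", "unknown-spdx", "unknown-license-reference"]
)


def resolve_reference(path, match, detections_by_path):
    """
    Return the license keys detected in a file referenced by match, or the
    match's own key when no conclusive detection is found.
    """
    directory = posixpath.dirname(path)
    for name in match.get("referenced_filenames", []):
        target = posixpath.join(directory, name)
        keys = [
            key for key in detections_by_path.get(target, [])
            if key not in UNKNOWN_KEYS
        ]
        if keys:
            # Conclusive: the referenced file has a proper license detection.
            return keys
    # Nothing found: keep the original reference as-is.
    return [match["license_key"]]
```

No network access is involved: only files already present in the scanned codebase are consulted.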

Properly handle "intro texts" that are used generically to introduce license terms

These are "lead-in"-like intro texts that are detected alone and qualified as "unknown" today. For instance "License agreement" is a rule that would be detected as such.

We should have a way to detect such a text fragment and to discard or merge such a detection if this is immediately tied to an actual "named" license detection. This would avoid many unknown detections.
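
One possible way to fold intro matches, sketched under the assumption that matches are dicts sorted by start_line, that intro matches carry an is_license_intro flag, and that "immediately tied" means the named match starts on the same line or the line right after the intro ends. The fold_intros helper and the field names are hypothetical.

```python
def fold_intros(matches):
    """
    Merge each license intro match into the named license match that
    immediately follows it; keep intros that stand alone or that precede
    another unknown match.
    """
    result = []
    pending_intro = None
    for match in matches:
        if match.get("is_license_intro"):
            if pending_intro is not None:
                # Two intros in a row: the first one stays as reported.
                result.append(pending_intro)
            pending_intro = match
            continue
        if pending_intro is not None:
            adjacent = match["start_line"] - pending_intro["end_line"] <= 1
            named = match["license_key"] != "unknown"
            if adjacent and named:
                # Extend the named match to cover the intro text.
                match = dict(match, start_line=pending_intro["start_line"])
            else:
                result.append(pending_intro)
            pending_intro = None
        result.append(match)
    if pending_intro is not None:
        result.append(pending_intro)
    return result
```

An intro that is not followed by a named license is kept, since in that case it is one of the unknowns that deserves review.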
