RFC: a plan for false positive license detection #2878

@pombredanne

Context

We are reporting too many false positive license detections. We need to fix this!

Problem

There are several kinds of false positives, but they boil down to these types:

  1. False detections from very short, weak license rules that are matched exactly, such as:

  2. Detection of a license text or notice fragment which is too weak to represent a bona fide license detection alone.

  3. Detection of longer unknown license references such as:

    • a "license introduction" (as in "This is licensed under....") that may be noisy when followed by a bona fide license notice or text.
    • a license reference to the license in a file (as in "See file COPYING for license") where we can follow the reference
  4. Lack of proper detection of a structured license tag found in a package manifest which is returned as an unknown license

  5. Fragments of the same license detected as separate matches when only copyright statements sit between them, as in "license detection: Add the nunit license" #2859

  6. Sequences of SPDX license ids found in license detection tools

  7. Please add yours!
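
To make type 5 concrete, here is a minimal sketch of merging fragments of the same license when only copyright lines (or blank lines) separate them. Everything in it is hypothetical: the `Match` class, the `COPYRIGHT` pattern and the `merge_fragments` function are illustrations, not ScanCode's data model or API, and the regex is a crude stand-in for real copyright detection.

```python
import re
from dataclasses import dataclass


@dataclass
class Match:
    # Hypothetical, simplified license match; not ScanCode's data model.
    license_key: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive


# Assumption: a crude pattern standing in for real copyright detection.
COPYRIGHT = re.compile(r"^\W*(copyright|\(c\)|©)", re.IGNORECASE)


def only_copyrights_between(lines, first, second):
    """True if every line strictly between the two matches is blank or a copyright."""
    gap = lines[first.end_line:second.start_line - 1]
    return all(not line.strip() or COPYRIGHT.match(line) for line in gap)


def merge_fragments(lines, matches):
    """Merge consecutive matches of the same license split only by copyrights."""
    merged = []
    for m in sorted(matches, key=lambda m: m.start_line):
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev.license_key == m.license_key
                and only_copyrights_between(lines, prev, m)):
            merged[-1] = Match(prev.license_key, prev.start_line,
                               max(prev.end_line, m.end_line))
        else:
            merged.append(m)
    return merged


lines = [
    "Permission is hereby granted, free of charge,",
    "Copyright (c) 2002 Example Author",
    "to any person obtaining a copy of this software",
]
matches = [Match("mit", 1, 1), Match("mit", 3, 3)]
merged = merge_fragments(lines, matches)
```

In this toy input the two `mit` fragments collapse into a single match spanning lines 1 to 3, because the only line between them is a copyright statement.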

Solution elements

We could treat and report mere clues, such as this one, separately: they can offer useful insight in some cases, but alone they are too weak to be considered a license detection.
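
As a rough illustration of reporting clues separately, the sketch below splits matches into detections and clues. The `Match` fields, the `split_clues` function and both thresholds are assumptions made up for this example; they are not ScanCode's data model, and real cut-offs would need tuning.

```python
from dataclasses import dataclass


@dataclass
class Match:
    # Hypothetical, simplified license match; not ScanCode's data model.
    license_key: str
    matched_length: int  # number of matched words
    relevance: int       # 0-100, how strongly the rule indicates the license


def split_clues(matches, min_length=4, min_relevance=80):
    """Return (detections, clues); weak matches are set aside as clues.

    The thresholds are illustrative assumptions only.
    """
    detections, clues = [], []
    for m in matches:
        is_weak = m.matched_length < min_length or m.relevance < min_relevance
        (clues if is_weak else detections).append(m)
    return detections, clues


matches = [
    Match("mit", matched_length=160, relevance=100),
    Match("gpl-2.0", matched_length=1, relevance=50),
]
detections, clues = split_clues(matches)
```

The clues are kept rather than dropped, so a report could still show them in their own section without counting them as license detections.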

The upcoming two-step process, where license matches are grouped into a license detection, is another approach to consider. We could detect patterns of license matches that can be resolved into a single detection, for instance a license intro followed by a license notice.
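
The license intro followed by a license notice pattern could be resolved along these lines. This is only a sketch: the `Match` and `Detection` classes, the `kind` labels and the `max_gap` parameter are hypothetical names chosen for illustration, not the planned design.

```python
from dataclasses import dataclass, field


@dataclass
class Match:
    # Hypothetical, simplified license match; not ScanCode's data model.
    license_key: str
    kind: str        # assumed labels: "intro", "notice", "text", "reference"
    start_line: int
    end_line: int


@dataclass
class Detection:
    license_key: str
    matches: list = field(default_factory=list)


def group_matches(matches, max_gap=2):
    """Fold an intro into the match that closely follows it.

    An intro with nothing nearby stays on its own as an "unknown" detection.
    """
    detections = []
    pending_intro = None
    for m in sorted(matches, key=lambda m: m.start_line):
        if m.kind == "intro":
            if pending_intro is not None:
                detections.append(Detection("unknown", [pending_intro]))
            pending_intro = m
            continue
        group = [m]
        if pending_intro is not None:
            if m.start_line - pending_intro.end_line <= max_gap:
                # The intro is absorbed: the following match names the license.
                group.insert(0, pending_intro)
            else:
                detections.append(Detection("unknown", [pending_intro]))
            pending_intro = None
        detections.append(Detection(m.license_key, group))
    if pending_intro is not None:
        detections.append(Detection("unknown", [pending_intro]))
    return detections


matches = [
    Match("unknown", "intro", start_line=1, end_line=1),
    Match("apache-2.0", "notice", start_line=2, end_line=10),
]
detections = group_matches(matches)
```

Here the intro no longer surfaces as a separate unknown license; it becomes part of the single `apache-2.0` detection. Other patterns, such as following a "See file COPYING" reference, would need their own rules.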

The scancode-analyzer heuristics and ML-based detection of false positives are another option.
