Evaluate test scoring methods #567

On https://webstatus.dev/ and feature details pages like https://webstatus.dev/features/dialog we show a test score between 0 and 100% based on WPT results.

The current approach is to divide the number of passing subtests by the number of known subtests, the same as the default wpt.fyi view. Let's evaluate how well that works, and compare it to other scoring methods.

Desirable properties:

- Correlates with implementation quality as judged by web developers
- Correlates with implementation completeness as judged by browser engineers
- Easy to explain and understand

The options are listed below, each with its wpt.fyi URL query parameter. (Note that the example URLs aren't exactly right: they include tentative tests, working around https://github.com/web-platform-tests/wpt.fyi/issues/3930 to make comparison possible.)

## Passing subtests (`view=subtest`)

This method counts all passing subtests and divides by the number of known subtests.

Example: [225 / 258](https://wpt.fyi/results/html/semantics/interactive-elements/the-dialog-element?label=master&label=stable&aligned&view=subtest) = 87%

Pros:

- Matches the default view of wpt.fyi.
- Easy to explain assuming familiarity with WPT's tests/subtests.

Cons:

- The total number of subtests is only known once all subtests are passing, and often differs across browsers. (Not easy to understand.)
- The harness status sometimes counts and sometimes doesn't. (Not easy to understand.)
- Fixing a timeout or subtest can cause new failing subtests to appear, reducing the score. (Does not correlate with improvement.)
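
To make the last con concrete, here is a minimal sketch of subtest scoring. The `Results` shape (test path mapped to a list of subtest pass/fail flags) and the file names are hypothetical, not wpt.fyi's actual data model, and a timed-out test is assumed to show up as a single failing entry.

```python
from typing import Dict, List

# Hypothetical shape: test path -> pass/fail flag for each known subtest.
Results = Dict[str, List[bool]]


def subtest_score(results: Results) -> float:
    """Passing subtests divided by known subtests, across all tests."""
    passed = sum(sum(subtests) for subtests in results.values())
    total = sum(len(subtests) for subtests in results.values())
    return passed / total if total else 0.0


# b.html times out, so only one failing entry is known for it (assumption).
before: Results = {"a.html": [True] * 4, "b.html": [False]}
# With the timeout fixed, b.html's 10 subtests become known; 4 of them pass.
after: Results = {"a.html": [True] * 4, "b.html": [True] * 4 + [False] * 6}

print(f"{subtest_score(before):.2f}")  # 0.80
print(f"{subtest_score(after):.2f}")   # 0.57
```

The implementation got strictly better (four more subtests pass), yet the score dropped because the denominator grew from 5 to 14.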

## Partially passing tests (`view=interop`)

Example: [105.12 / 109](https://wpt.fyi/results/html/semantics/interactive-elements/the-dialog-element?label=master&label=stable&aligned&view=interop) = 96%

Pros:

- Total number of tests (the denominator) is easy to explain and understand.

Cons:

- Fixing a timeout or subtest can cause new failing subtests to appear, reducing the score. (But the effect is smaller than for `view=subtest`.)
- Linking to `view=interop` would likely cause confusion, as the view is named for the Interop project. (Renaming/aliasing the URL query parameter would address this.)
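
A rough sketch of this method, assuming each test contributes the fraction of its subtests that pass (which is how the fractional numerator in the example above reads). The `Results` shape and file names are hypothetical, and `subtest_score` is included only for comparison.

```python
from typing import Dict, List

# Hypothetical shape: test path -> pass/fail flag for each known subtest.
Results = Dict[str, List[bool]]


def interop_score(results: Results) -> float:
    """Average over tests of the fraction of passing subtests in each test."""
    if not results:
        return 0.0
    fractions = [sum(s) / len(s) if s else 0.0 for s in results.values()]
    return sum(fractions) / len(results)


def subtest_score(results: Results) -> float:
    """Passing subtests divided by known subtests (for comparison)."""
    total = sum(len(s) for s in results.values())
    return sum(sum(s) for s in results.values()) / total if total else 0.0


before: Results = {"a.html": [True] * 4, "b.html": [True, False]}
# Fixing b.html's second subtest lets it run further and exposes 6 new failures.
after: Results = {"a.html": [True] * 4, "b.html": [True, True] + [False] * 6}

print(interop_score(before), interop_score(after))  # 0.75 0.625
print(f"{subtest_score(before):.3f} {subtest_score(after):.3f}")  # 0.833 0.500
```

Both scores drop after the fix, but the per-test normalization limits the damage to the one affected test: a drop of 0.125 here, versus 0.333 for subtest counting.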

## Fully passing tests (`view=test`)

Example: [102 / 109](https://wpt.fyi/results/html/semantics/interactive-elements/the-dialog-element?label=master&label=stable&aligned&view=test) = 94%

Pros:

- "A test counts only if it fully passes" is a simple rule.
- Total number of tests (the denominator) is easy to explain and understand.

Cons:

- Fixing a subtest doesn't count unless all subtests pass. (Does not correlate with improvement.)
- Similarly, introducing a single failing subtest in a previously passing test has a large effect.

## Next steps

Evaluate how well each method corresponds with feature completeness/quality, by taking a random sample of features and listing what the scores would be. Things to consider:

- What does the score tend to be for features not supported at all? (Closer to 0 is better.)
- What does the score tend to be for features browser engineers and web developers think are complete? (Closer to 100 is better, and below 80 or 90 is bad.)
- What does the score tend to be for in-development features? (Exact score is not important, but an even progression is better.)

cc @gsnedders @jgraham since we have discussed test scoring many times over the years, most recently in https://github.com/web-platform-tests/rfcs/pull/190.
