Skip to content

[test] Add demo-based axe-core accessibility tests for mui-material#48341

Open
siriwatknp wants to merge 41 commits intomui:masterfrom
siriwatknp:test/a11y-demos
Open

[test] Add demo-based axe-core accessibility tests for mui-material#48341
siriwatknp wants to merge 41 commits intomui:masterfrom
siriwatknp:test/a11y-demos

Conversation

@siriwatknp
Copy link
Copy Markdown
Member

@siriwatknp siriwatknp commented Apr 21, 2026

closes #47887 · supersedes #47895

Summary

Automated axe-core accessibility testing for @mui/material, driven off the real docs demos. Folded into the existing visual-regression Playwright loop (test/regressions/index.test.js) — for each enrolled demo, axe runs against the rendered [data-testid="testcase"] element in the same browser session that takes the screenshot, and the result is written to docs/data/material/components/{slug}/{slug}.a11y.json (one file per slug, keyed by demo name).

Initial scope is buttons only — BasicButtons and ColorButtons — to validate the pipeline end-to-end before broader enrolment in follow-ups. Layout wrappers (Box, Grid, Stack, Paper, Container, Portal, …) will be excluded on principle when added: they contribute no a11y signal of their own; any violation found on them really belongs to their children, already covered by the child's own enrolment.

Config shape

Two independent rule arrays — one per tool — evaluated last-match-wins (no inheritance: an override rule must restate every field it cares about) against the docs path docs/data/material/components/{slug}/{Demo} (minimatch globs). Independent arrays mean editing one tool's config can't stomp the other.

// test/regressions/demoMeta.ts
export const A11Y_RULES: A11yRule[] = [
  { test: 'docs/data/material/components/buttons/{BasicButtons,ColorButtons}', enabled: true },
];

pnpm test:regressions refreshes the *.a11y.json files. CI enforces they're up to date via a git-diff check.

For Reviewers

Why demo-based, not scenario-based

#47895 introduced scenario files (test/a11y/scenarios/Button.a11y.tsx) that hand-authored every axe case per component. Per Olivier's feedback, docs demos already cover every real usage scenario — duplicating them as test fixtures is waste. This PR replaces the scenarios folder with a thin rule list pointed at the demo routes VRT already serves.

Why folded into the VRT loop

@Janpot's earlier concern about duplicate Playwright sessions: rather than a parallel browser, axe now runs inside the existing screenshot loop. Each enrolled demo gets one axe.run against the already-rendered [data-testid="testcase"] element before the screenshot is taken. Failure isolation is preserved by recording the result on ctx.task.meta.a11y first; the assertion fires only on visual rules (color-contrast, link-in-text-block) that depend on real CSS.

Screenshot and a11y opt-outs are independent — a demo screenshot-disabled for being redundant or interaction-heavy can still be audited by axe.

Config follow-up from review

@Janpot's review comment flagged the original dual-Map structure (SLUG_A11Y + DEMO_META) as non-intuitive and at risk of cross-tool stomping. Replaced with two rule arrays + minimatch globs as suggested; per-demo opt-outs are appended after slug-wide rules; the override rule restates every field it wants (no implicit inheritance).

Output shape

One file per enrolled slug at docs/data/material/components/{slug}/{slug}.a11y.json, co-located with the slug's other docs files. Each file is an object keyed by demo name:

{
  "BasicButtons": { "passedRules": [...], "failedRules": [], "testedRules": { "wcag2a": [...], "wcag2aa": [...], "wcag22aa": [...] } },
  "ColorButtons": { ... }
}

Co-locating with {slug}.md and the demo source files makes the data discoverable from the slug folder and avoids the slug-prefix collision problem of a flat per-demo layout.

Code

CI

The existing test:regressions job already builds, serves, and runs Playwright. Adding axe is a no-op on infra. The git-diff guard ensures every *.a11y.json is up-to-date.

Agent docs

  • AGENTS.md — "Accessibility Testing" section covers the rule-array shape, brace-glob narrowing, per-demo opt-outs, and the skipAssertions escape hatch.

Not in scope (follow-ups)

  • Docs presentation (separate PR). A demo-toolbar pill reads each slug's *.a11y.json via the docs render pipeline and shows pass/fail counts plus a per-WCAG-tag rule breakdown on click.
  • Broader enrolment. Buttons is the validation slug; other components onboard via slug-wide rules in A11Y_RULES.
  • Rule graduation (global severity levels). Today every non-visual violation is recorded but not asserted; every visual violation is asserted unless skipAssertions lists it. Fine for a POC; may want finer control once more components enroll.

Runs axe-core against real docs demos in Chromium, reusing the VRT Vite
server. Enrollment is an explicit list in test/a11y/a11y.test.mjs
(POC: Button + Card, 4 demos).

Output a11y-results.json carries both per-component aggregates (for the
docs widget) and per-demo detail (for future per-demo UI).

Supersedes mui#47895 — swaps hand-written scenario modules for the docs
demos themselves, per Olivier's suggestion to mirror Janpot's
html-validate-over-docs-pages pattern in mui#48088.
@code-infra-dashboard
Copy link
Copy Markdown

code-infra-dashboard Bot commented Apr 21, 2026

Deploy preview

https://deploy-preview-48341--material-ui.netlify.app/

Bundle size

Bundle Parsed size Gzip size
@mui/material 0B(0.00%) 0B(0.00%)
@mui/lab 0B(0.00%) 0B(0.00%)
@mui/private-theming 0B(0.00%) 0B(0.00%)
@mui/system 0B(0.00%) 0B(0.00%)
@mui/utils 0B(0.00%) 0B(0.00%)

Details of bundle changes


Check out the code infra dashboard for more information about this PR.

Replaces the hardcoded ENROLLED map in a11y.test.mjs with a full component
roster at packages/mui-material/test/a11y/config.ts. Every component has
an entry; status is 'enabled' (Button, Card) or 'pending' (rest). Adoption
per PR = flip 'pending' to 'enabled' and fill in `demos`.

The config doubles as the rollout ledger — a future docs presentation can
iterate the full array to show both enrolled and pending components.

Drops ComponentAccessibilityStatus.js and the two .md embeds. Presentation
is a separate concern; the JSON shape stays (per-component aggregates +
per-demo detail) so the future widget has the data it needs.
Replaces the single a11y-results.json (383 KB / 14k lines at full rollout)
with packages/mui-material/test/a11y/results/{Component}.json. A future
docs widget can import('.../results/Button.json') and webpack/vite will
code-split per component, so each page only ships its own ~2 KB payload
instead of the whole roster.

Also updates the docs:a11y script to rimraf the directory before each
run, and the AGENTS.md reference.
Flips 45 components from 'pending' to 'enabled' in config.ts. The 26 that
currently trip color-contrast keep a skipRules: ['color-contrast'] entry
so CI stays green while the violations still land in failedRules for the
team to triage.

a11y.test.mjs now falls back to the VRT nav when a config entry omits its
'demos' list, so enrolling is one line (flip the status). Explicit demos
remain the override knob.

The 11 components VRT excludes (need interaction, no demos, flaky) stay
'pending' with inline reasons — nothing observable today.
@Janpot
Copy link
Copy Markdown
Member

Janpot commented Apr 21, 2026

Rendering all the demos, one by one, in a real browser and run tests on them. Sounds an awful lot like the current screenshot tests we have. Would it make sense (be possible?) to fold accessibility testing into the regression tests. It would likely save a ton on circleci credits compared to running another full suite where we render all demos.

`skipAssertions` better describes the behavior — the rule still runs and
still lands in the results JSON, only the test-failing assertion is
suppressed. `skipRules` read as "rule is skipped entirely", which isn't
what happens.

Also drops Box from the roster — nothing to a11y-test.
Layout wrappers (Grid, Masonry, Portal, NoSsr, Stack, Paper) and pure
behavior components (Backdrop, ClickAwayListener, Container, CssBaseline)
have no a11y semantics of their own — axe either finds nothing or
attributes violations to them that actually belong to their inner
content (already covered by the child component's own entry).

Keeps Typography (heading hierarchy), Skeleton (aria-busy), Divider
(caught a real list violation) since those contribute real signal.

44 components in roster → 36 enabled, 8 pending (interaction-only).
The status gate was infrastructure for a presentation UI that isn't in
scope for this PR. The config now lists only components that are actually
enrolled; pending ones live as `// TODO: {Component} — <blocker>` comments
so the reason stays visible without needing a parseable data shape.

If a future docs widget ever wants a "coming soon" list, convert the
comments back to structured entries then.
@siriwatknp siriwatknp marked this pull request as draft April 21, 2026 06:20
@siriwatknp
Copy link
Copy Markdown
Member Author

siriwatknp commented Apr 21, 2026

It would likely save a ton on circleci credits compared to running another full suite where we render all demos.

Infra is already shared — pnpm docs:a11y reuses test:regressions:build and runs the Vitest axe loop concurrently with test:regressions:server. The only new spend is one Chromium cold start + 350 axe runs ≈ 44s (~1/5 of one VRT pass).

Kept it as a separate Vitest file for a standalone dev loop, not because of the infra cost. Happy to fold axe into test/regressions/index.test.js if you'd rather reclaim the 44 s — would need to swap the Vitest reporter for inline aggregation + after(all) write, add a route filter (enrolled /docs-components-* only), and split each route into two its for failure isolation. It's doable.

Do you think saving around ~40s is a hard constraint? Happy to follow your judgement.

@Janpot
Copy link
Copy Markdown
Member

Janpot commented Apr 21, 2026

Do you think saving around ~40s is a hard constraint? Happy to follow your judgement.

It'd cost about 8 euro per month, so cost-wise probably not. It'd increase the current job time (6m 12s/last 30 days) to ~7m, making it the second longest running job, after the type checking. That technically doesn't make it the bottle neck. The bottle neck currently is type checking, we plan to migrate to typescript native soon, so in not so long the test_browser job will likely become the bottleneck. There aren't many hard constraint, only the time the team is willing to spend waiting for jobs to finish during releases and deploys.

@siriwatknp
Copy link
Copy Markdown
Member Author

siriwatknp commented Apr 21, 2026

I pushed a commit to move to regression test, the timing does not differ much but the code is simplified. Net CI wall-clock saving is ~break-even; real wins are ~16 s of duplicate Chromium setup avoided per run and infra simplification (one Vitest suite, one config, one script deleted). No new violations; 2 of 38 result files differ (all pre-existing false-positive color-contrast flags disappearing because the VRT loop parks the mouse and normalizes body.overflow — post-fold is strictly more accurate).

Per Jan's review on mui#48341: run axe inline in test/regressions/index.test.js,
reuse the same Chromium session. Drop the standalone test/a11y/ suite and the
docs:a11y:dev script. CI a11y gate moves from test_browser to test_regressions.

Slider/Table results reflect a cleaner DOM (mouse parked, scrollbar
normalized); all deltas are false-positive color-contrast flags disappearing.
@siriwatknp siriwatknp marked this pull request as ready for review April 21, 2026 07:44
@Janpot
Copy link
Copy Markdown
Member

Janpot commented Apr 21, 2026

Ok, if the win in runtime is marginal then we can go either way, your call.

Did you check how currently screenshots are being excluded? Because if you implement this as enrollment in the existing screenshots fixtures, you will also inherit its exclusion mechanism. If we go the combined route then we should probably import all demos and build separate exclusion mechanisms for a11y and screenshots. e.g.

declare const demoMeta: Map<string, {
  screenshot: {
    enabled: boolean,
    // extra options, e.g. waitForSelector...
  },
  a11y: {
    enabled: boolean,
    rules: {
      // ...
    }
  }
}>()

aside: Thanks for editing the LLM generated comment, it really demotivates me when I have to answer these. Feels like I'm putting in the effort the author should have done when interacting with their LLM.

@zannager zannager added the test label Apr 21, 2026
@zannager zannager requested a review from Janpot April 21, 2026 13:59
Copy link
Copy Markdown
Member

@Janpot Janpot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two things,

  1. My edited comment in #48341 (comment) about screenshot exclusion already being encoded in how we load the test fixtures, is it a problem?
  2. We probably want to generate the data nested under the ./docs folder somewhere, just like we do for other docs generated data?

@siriwatknp
Copy link
Copy Markdown
Member Author

aside: Thanks for editing the LLM generated comment, it really demotivates me when I have to answer these. Feels like I'm putting in the effort the author should have done when interacting with their LLM.

Sorry for that, I usually don't want LLM to write response directly on GH but it was trying to be smart and push the comment directly 🥲. I'm building guard rail around it, bear with me.

Per-tool exclusions now live in test/regressions/demoMeta.ts instead of
glob negations in index.jsx. Screenshot-specific reasons (Redundant,
Flaky image loading) no longer silently drop a11y coverage; autocomplete
Redundant variants, skeleton/Facebook, snackbars/Customized, steppers
image-heavy demos, and AccessibleTabs1/2 now get axe coverage.

Glob keeps only structural exclusions + whole slugs no tool consumes.
DEMO_META accepts both demo-level and slug-level keys; demo wins when
both exist. shouldScreenshot + resolveA11y are the test-runner gates.

Unit tests in demoMeta.test.ts cover current shape + the 3-line
workflow for enabling a11y on an interaction-heavy slug (e.g. dialogs).

a11yConfig.ts removed; AGENTS.md updated.
@siriwatknp
Copy link
Copy Markdown
Member Author

siriwatknp commented Apr 24, 2026

Pushed f337bc7 implementing the separation.

The regression exclusion can be separately configurable in demoMeta.ts file.
Meaning, the screenshot and a11y are independent from each other.

Colocates the generated JSON with other docs data, per Janpot's second
review point. Reporter's OUT_DIR and the test:regressions prettier step
updated to the new path. No other consumers touched the old path.
Migration expanded axe coverage to demos previously screenshot-excluded
(autocomplete Redundant, skeleton/Facebook, etc.). Surfaces color-contrast
near-threshold findings on fab, skeleton, bottom-navigation — same class
of issue already suppressed elsewhere. Added skipAssertions for the three
slugs and regenerated results JSONs at docs/data/material/a11y/.
@github-actions github-actions Bot removed the PR: out-of-date The pull request has merge conflicts and can't be merged. label Apr 27, 2026
@siriwatknp
Copy link
Copy Markdown
Member Author

siriwatknp commented Apr 27, 2026

@Janpot for the result to appear on the docs per demo. Do you think it should be shipped at build-time or better to lazy-loading the json when demo is visible?

Currently, the result is generated per demo .e.g buttons-BasicButton.json

The implementation to show the UI is in separate PR so I want to make sure that this PR introduce the suitable structure for the UI.

@Janpot
Copy link
Copy Markdown
Member

Janpot commented Apr 27, 2026

@Janpot for the result to appear on the docs per demo. Do you think it should be shipped at build-time or better to lazy-loading the json when demo is visible?

Yes, let's ship it at build time. It's static content and we want to make search engines aware of it. It doesn't look like it's huge. Maybe it makes sense to emit those files next to the demo files, like we do for the js version and the preview? If it becomes a problem, we can always move to lazy loading.

Config loading looks a lot better, personally I'd avoid merging. Merging is what makes renovatebot config so harder to reason about imo. Last-one-wins, like netlify redirects feels simpler. Safer to move rules around, fewer opportunities for adding a rule that silently breaks everything that comes after

Copy link
Copy Markdown
Member

@Janpot Janpot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No hard blockers, just a few suggestions.


// Also use some of the demos to avoid code duplication.
//
// Two exclusion layers:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this comment stuill relevant?

test: string;
enabled?: boolean;
/** Playwright waits for this selector before snapshotting. */
waitForSelector?: string;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not yet wired up to

switch (route) {
case '/docs-components-table/ReactVirtualizedTable': {
await page.waitForSelector('[data-index="1"]');
break;
}
default:
break;
}
.

The shouldScreenshot and resolveA11y functions make the demoMeta abstraction a bit leaky. Alternatively we could just export getConfig and parseRoute and let the screenshot and a11y functions themselves decide how to handle the defaults? We're going to need to screenshot config anyway to be able to use waitForSelector.

@@ -0,0 +1,56 @@
{
"BasicButtons": {
"passedRules": [
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend prototyping the docs before generating for all other components. That way we can benchmark this data format first.

Copy link
Copy Markdown
Member Author

@siriwatknp siriwatknp Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The structure is brought from #47895 to be able to show like this https://deploy-preview-47895--material-ui.netlify.app/material-ui/react-card/#accessibility-compliance

image

The thought behind:

  • each component/demo could have different scenario tests, so must show what rules are tested
  • the rule name is used to link to the source for the details of the rule to https://dequeuniversity.com/rules/axe/*
  • the passed/failed are for degbuggin/plan to fix them. we can decide if we want to display them or not.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will there be an overview? As in, as a customer I'd like to know:

  1. If MUI is committed to a certain standard? Can I Hold MUI to that standard?
  2. Which trade-offs are there to be made?

I can't be required to open each and every demo and aggregate myself to answer that question, right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely, there should be an overview but we are not in the stage to do it (at least in this PR).

The main issue (the obvious one) is the color contrast which the team agree on having the test setup first, see how many affected, update the colors and iterate until the tests passed.

At the end, when we are confident about the compliance. We will have that overview page.

This PR is like a first step toward the goal. Let me know if you think any task should be slot in this PR apart from the setup.

@github-actions github-actions Bot added the PR: out-of-date The pull request has merge conflicts and can't be merged. label Apr 28, 2026
@github-actions github-actions Bot removed the PR: out-of-date The pull request has merge conflicts and can't be merged. label May 4, 2026
@github-actions github-actions Bot added the PR: out-of-date The pull request has merge conflicts and can't be merged. label May 4, 2026
Copy link
Copy Markdown
Member

@mj12albert mj12albert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor findings from Codex review:

  • waitForSelector is declared on ScreenshotRule, but the runner only reads screenshotRule?.enabled before calling takeScreenshot. A future rule with waitForSelector would silently no-op. Either wire it before screenshot capture or remove the field until it’s needed.

  • Generated *.a11y.json files aren’t pruned before regeneration. Since the reporter only writes slugs seen in the current run, removing a slug from A11Y_RULES could leave stale tracked JSON behind while CI still passes. Consider deleting existing generated a11y files before the run or pruning stale files in the reporter.

Do you aim to maximize the coverage for the Button component in this PR or just set up the infra? There are a few more WCAG SC that could be covered by existing demos (e.g. size="small" and size="large" for SC 2.5.8 Target size (minimum)), but also some that would require new fixtures (e.g. color="info", warning, inherit) for full coverage of the component's whole API surface

@siriwatknp
Copy link
Copy Markdown
Member Author

Do you aim to maximize the coverage for the Button component in this PR or just set up the infra?

just the infra (with couple of Button demos generated).

siriwatknp added 2 commits May 5, 2026 09:34
Address Codex review on mui#48341: waitForSelector field on ScreenshotRule
was declared but never read. Migrate the hardcoded ReactVirtualizedTable
wait into a rule and apply screenshotRule.waitForSelector before axe +
screenshot in the VRT loop.

Reporter now prunes stale {slug}.a11y.json after writing — gated on the
VRT module having actually run and no -t filter, so unrelated unit-test
runs and filtered slug subsets leave untouched files alone.
@github-actions github-actions Bot removed the PR: out-of-date The pull request has merge conflicts and can't be merged. label May 5, 2026
@siriwatknp
Copy link
Copy Markdown
Member Author

  • waitForSelector is declared on ScreenshotRule, but the runner only reads screenshotRule?.enabled before calling takeScreenshot. A future rule with waitForSelector would silently no-op. Either wire it before screenshot capture or remove the field until it’s needed.
  • Generated *.a11y.json files aren’t pruned before regeneration. Since the reporter only writes slugs seen in the current run, removing a slug from A11Y_RULES could leave stale tracked JSON behind while CI still passes. Consider deleting existing generated a11y files before the run or pruning stale files in the reporter.

nice catch, fixed both of them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[test] Set up automated test for accessibility

4 participants