feat(cli): add stats subcommand - per-agent React Doctor leaderboard #932
Open
aidenybai wants to merge 13 commits into
…d from agent history

Adds `react-doctor stats`, which reads local AI agent history (Claude Code + Codex transcripts, the Cursor composer database), reconstructs the React code each model actually wrote, lints it with the existing engine, and ranks models and providers by a confidence-weighted React Doctor score.

- Reconstructs faithful post-edit file content per provider (Claude snapshots, Cursor `afterContentId` blobs, Codex `apply_patch`), filtered to real React.
- Confidence-weighted ranking: each group's raw score regresses toward the global mean by its evidence (files dominant, lightly discounted by sessions), so a tiny clean sample can't top the board.
- Plain-language terminal leaderboard with color-coded tools (adds an `orange` to the shared highlighter for Claude); `--json` for the machine-readable report.
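
To make the confidence-weighted ranking concrete, here is a minimal shrinkage sketch. It is not the PR's implementation: `PRIOR_STRENGTH`, the session discount formula, and every type and function name are assumptions picked for illustration. It only shows the general idea of pulling each group's raw score toward the global mean in proportion to how little evidence the group has.

```typescript
// Hypothetical shape for one model's or provider's aggregate; not the PR's types.
interface GroupStats {
  name: string;
  rawScore: number; // 0-100 score for the group
  files: number; // scanned files contributing evidence
  sessions: number; // sessions that contributed scanned files
}

// Assumed constant: how many "virtual files" of global-mean evidence to mix in.
const PRIOR_STRENGTH = 20;

// Assumed discount: many files from few sessions are correlated, so they count
// slightly less than the same number of files spread over many sessions.
function effectiveEvidence(files: number, sessions: number): number {
  if (files <= 0 || sessions <= 0) return 0;
  const sessionDiscount = 0.75 + 0.25 * Math.min(1, sessions / files);
  return files * sessionDiscount;
}

// Weighted average of the group's own score and the global mean.
function shrinkTowardMean(group: GroupStats, globalMean: number): number {
  const evidence = effectiveEvidence(group.files, group.sessions);
  return (evidence * group.rawScore + PRIOR_STRENGTH * globalMean) / (evidence + PRIOR_STRENGTH);
}

function rank(groups: GroupStats[]): Array<GroupStats & { weightedScore: number }> {
  const totalFiles = groups.reduce((sum, group) => sum + group.files, 0);
  const globalMean =
    totalFiles === 0
      ? 0
      : groups.reduce((sum, group) => sum + group.rawScore * group.files, 0) / totalFiles;
  return groups
    .map((group) => ({ ...group, weightedScore: shrinkTowardMean(group, globalMean) }))
    .sort((a, b) => b.weightedScore - a.weightedScore);
}
```

With a one-file group scoring 100 next to two 200-file groups scoring 92 and 70, the one-file group lands close to the global mean and ranks below the 92 group, which is the behavior the commit message describes.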
Discovery loaded each candidate session from the Cursor SQLite DB synchronously, blocking the event loop so the ora spinner appeared frozen for a few seconds. Yield to the event loop periodically and report live "(N found)" progress during the history walk.
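
A rough sketch of the yielding pattern described above, assuming a synchronous per-session loader. `discoverWithProgress`, `loadOne`, and the `yieldEvery` batch size are hypothetical names for illustration, not identifiers from this PR.

```typescript
// Resolve on a later macrotask so timers and I/O callbacks (such as a
// spinner's render tick) get a chance to run between batches of sync work.
function yieldToEventLoop(): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, 0));
}

// Hypothetical helper: `loadOne` stands in for a synchronous per-session load
// that returns undefined when the candidate is not usable.
async function discoverWithProgress<T>(
  ids: readonly string[],
  loadOne: (id: string) => T | undefined,
  onProgress: (found: number) => void,
  yieldEvery = 25,
): Promise<T[]> {
  const found: T[] = [];
  let processed = 0;
  for (const id of ids) {
    const loaded = loadOne(id);
    if (loaded !== undefined) found.push(loaded);
    processed += 1;
    if (processed % yieldEvery === 0) {
      onProgress(found.length); // e.g. update the spinner text to "(N found)"
      await yieldToEventLoop();
    }
  }
  onProgress(found.length); // final count after the walk completes
  return found;
}
```

Yielding every N items rather than after each one keeps the overhead small while still letting the spinner repaint; the right batch size depends on how slow each load is.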
Cap the terminal table to the top 5 with a "+ N more" pointer to --json; the full ranking still ships in the JSON report and the best/worst callout.
Consolidate the asString/asRecord/asArray/parseJson coercers (copied across the Claude/Codex/Cursor adapters) into a shared coerce.ts, extract the "most common model" tally into most-common-key.ts, and reuse statMtimeMs in findJsonlFiles. Behavior unchanged.
The static `node:sqlite` import crashed the whole adapter test file on Node 20 (where the module doesn't exist), failing the 20.19 CI matrix job. Load it via a guarded require and skip the Cursor suite when unavailable, mirroring cursor-db.ts's runtime degradation.
- closeCursorDb now closes the underlying node:sqlite database instead of only dropping the cached reference, so the fixture file is unlocked and Windows can unlink the temp dir (was EBUSY in the adapter test teardown).
- The reconstruct test compared emitted absolute paths against hardcoded POSIX strings; on Windows resolveAgainstCwd normalizes to backslashes, so expectations now mirror that normalization. Production code unchanged.

- A failed apply_patch update hunk left the prior in-session buffer in place and still emitted the file as faithfully reconstructed; drop it to unreconstructable so stale content is never linted as the model's output.
- Sessions touching only non-lintable files (e.g. markdown) had zero reconstructed files and zero failures but were counted as "unreconstructable"; require an actual reconstruction failure for that bucket so the skip note stays accurate.
Replace the readFileSync + split("\n") transcript reader with a streaming
node:readline parser so memory stays flat on large Claude/Codex transcripts.
Makes session loading async (SessionCandidate.load + the parse adapters);
the Cursor composer load wraps its sync DB walk to match.
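
For reference, a line-at-a-time JSONL parser can be sketched as an async generator over any async iterable of lines. In Node, `readline.createInterface({ input: fs.createReadStream(path), crlfDelay: Infinity })` is such an iterable. The function name `parseJsonl` and the choice to skip malformed lines are assumptions for this sketch, not taken from the PR.

```typescript
// Parse newline-delimited JSON one record at a time, so only the current line
// is held in memory instead of the whole transcript.
async function* parseJsonl(lines: AsyncIterable<string>): AsyncGenerator<unknown> {
  for await (const line of lines) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // ignore blank lines
    let parsed: unknown;
    try {
      parsed = JSON.parse(trimmed);
    } catch {
      // Assumption: a malformed line is skipped rather than aborting the session.
      continue;
    }
    yield parsed;
  }
}
```

Because the generator is lazy, callers that stop iterating early also stop reading the underlying stream.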
…--since)

- Drop scans that error/skip/lint-fail instead of counting them as clean code, which was inflating the leaderboard.
- Emit structured JSON on failure in --json mode (reuse enableJsonMode), which also silences the incidental score-API stderr warning.
- Exclude unknown-timestamp candidates under --since so the filter is consistent.
- Consolidate the path-inside predicate, move render magic numbers to constants, type-guard the provider flag, throw on invalid --limit, rename op -> operation.
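
Two of the behaviors listed above, sketched in isolation. The `Candidate` shape, both function names, and the rule that `--limit` must be a positive integer are assumptions for illustration; the commit only states that unknown timestamps are excluded under `--since` and that an invalid `--limit` throws.

```typescript
// Hypothetical candidate shape; timestampMs is absent when the history
// entry carries no usable timestamp.
interface Candidate {
  id: string;
  timestampMs?: number;
}

// With no --since, keep everything. With --since, a candidate must have a
// known timestamp at or after the cutoff; unknown timestamps are excluded.
function filterSince(candidates: readonly Candidate[], sinceMs: number | undefined): Candidate[] {
  if (sinceMs === undefined) return [...candidates];
  return candidates.filter(
    (candidate) => candidate.timestampMs !== undefined && candidate.timestampMs >= sinceMs,
  );
}

// Assumption: a valid limit is a positive integer; anything else throws
// instead of silently falling back to a default.
function parseLimit(raw: string): number {
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`Invalid --limit value: ${raw}`);
  }
  return value;
}
```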
… content

A replace/Edit whose oldString isn't in the in-session buffer now marks the file unreconstructable (like a failed apply_patch hunk) rather than keeping the stale snapshot and scoring it as the model's final output.
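
The rule in this commit can be illustrated with a small state machine. This is a sketch under stated assumptions, not the PR's code: the `FileState` type and `applyReplace` name are hypothetical, only the first occurrence of `oldString` is replaced, and an unreconstructable file is assumed to stay that way for later replace edits.

```typescript
// Hypothetical per-file state tracked while replaying a session's edits.
type FileState =
  | { status: "reconstructed"; content: string }
  | { status: "unreconstructable"; reason: string };

function applyReplace(state: FileState, oldString: string, newString: string): FileState {
  // Assumption: once the buffer is untrustworthy, later replace edits cannot fix it.
  if (state.status === "unreconstructable") return state;
  const at = oldString === "" ? -1 : state.content.indexOf(oldString);
  if (at === -1) {
    // The buffer no longer matches what the model saw, so do not keep the
    // stale snapshot as if it were the model's final output.
    return { status: "unreconstructable", reason: "oldString empty or not found in buffer" };
  }
  const content = state.content.slice(0, at) + newString + state.content.slice(at + oldString.length);
  return { status: "reconstructed", content };
}
```

Files that end a session in the `unreconstructable` state would then be excluded from linting rather than scored.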
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 9a20e3d.
Confidence weighting now counts only sessions that contributed scanned files, so non-React/failed/skipped sessions no longer raise session reliability or effective file weight. The reported per-group session count still reflects every analyzed session.
react-doctor has a runtime `deslop-js: workspace:*` dependency, but the Continuous Releases workflow didn't publish deslop-js, so pkg.pr.new couldn't rewrite the ref and `npx https://pkg.pr.new/react-doctor@<pr>` failed with EUNSUPPORTEDPROTOCOL ("workspace:"). Add deslop-js to the publish set.

Summary

Adds `react-doctor stats`: a per-model/per-tool code-quality leaderboard built from your local AI agent history. It answers one question: which agent writes the cleanest React code in my repo?

It reads local agent history (Claude Code + Codex transcripts, the Cursor composer database), reconstructs the file content each model actually wrote, lints it with the existing engine, and ranks models and providers by a confidence-weighted React Doctor score.

- … `afterContentId` full-content blobs from `state.vscdb` (real model attribution, not "Auto"), Codex `apply_patch` replay. Only actual React files (JSX/TSX, `use client`/`use server`, or a React-ecosystem import) are scored, so backend/util/config files don't dilute the result.
- … `--json`.
- … `orange` formatter to the shared `highlighter` (honors `--no-color`).
- … `--global` (all repos), `--since`, `--limit`, `--provider`, `--json`. Default scope is the current repo.

Coverage is honest about limits: Codex shell edits aren't reconstructable (surfaced as skipped), the Cursor DB needs `node:sqlite` (Node 22.13+) and covers GUI composer sessions (not cursor-agent CLI), and the score requires network access.

Test plan

- `pnpm --filter react-doctor test` (32 new stats tests: adapters, reconstruct, apply-patch, aggregate/weighting, is-react-source, render)
- `pnpm typecheck` / `pnpm lint` / `pnpm format:check`
- `react-doctor stats` in a repo with local Cursor/Claude/Codex history renders a sane leaderboard
- `react-doctor stats --json` emits `{ schemaVersion, models, providers, best, worst, … }` with both `score` and `weightedScore`
- `react-doctor stats --global` ranks across repos; `--provider cursor` narrows the source

Note
Medium Risk
Large new CLI surface that reads local agent/SQLite data and spawns many lint subprocesses; scoring still depends on the network score API. Logic is heavily tested and read-only on user data, but reconstruction and ranking edge cases could mis-rank in real histories.
Overview

Adds `react-doctor stats`, a leaderboard that ranks AI agents and models by how clean the React code they wrote is, using local chat history and the existing lint/score pipeline.

The command discovers sessions from Claude Code and Codex JSONL transcripts plus Cursor’s `state.vscdb` composer DB (replacing model-less transcript JSONL for Cursor). It replays edits into file snapshots (Claude tool results, Cursor `afterContentId`, Codex `apply_patch`), keeps only React sources, materializes temp trees with project config, and runs oxlint per session (bounded concurrency). Failed or skipped scans are dropped so they cannot look like “perfect” code.

Ranking groups by model and provider, calls the score API per group, and sorts on a confidence-weighted score (Bayesian shrink toward the global mean; files dominate, sessions lightly discount correlation). Terminal output shows top models, per-tool tables, score bars, and best/worst callouts; `--json` ships the full report. Flags: `--global`, `--since`, `--limit`, `--provider`.

Also adds `highlighter.orange` for Claude branding in stats output, `stats.run` telemetry, CLI flag stripping for `stats`, a changeset, and `deslop-js` in the publish workflow. Broad unit test coverage for adapters, reconstruction, aggregation, and rendering.

Reviewed by Cursor Bugbot for commit 509f229. Bugbot is set up for automated code reviews on this repo.
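
The "keeps only React sources" step can be approximated with a heuristic like the one below. This is a sketch, not the PR's `is-react-source` logic: the package list is an assumed, deliberately short stand-in for "a React-ecosystem import", and the regular expressions are simple text matches rather than a real parser.

```typescript
// Assumed list for illustration; the PR's actual set of React-ecosystem
// packages is not shown on this page.
const REACT_ECOSYSTEM_PACKAGES = ["react", "react-dom", "next", "react-native"];

// A line consisting only of a "use client" or "use server" directive.
const DIRECTIVE_PATTERN = /^\s*["']use (client|server)["'];?\s*$/m;

// Module specifiers in `from "x"`, `import "x"`, and `require("x")` forms.
const IMPORT_PATTERN_SOURCE = "(?:from\\s+|import\\s+|require\\(\\s*)[\"']([^\"']+)[\"']";

function isReactSource(filePath: string, content: string): boolean {
  if (/\.(jsx|tsx)$/i.test(filePath)) return true;
  if (DIRECTIVE_PATTERN.test(content)) return true;
  // Fresh global regex per call so lastIndex state is never shared.
  const importPattern = new RegExp(IMPORT_PATTERN_SOURCE, "g");
  let match: RegExpExecArray | null = importPattern.exec(content);
  while (match !== null) {
    const specifier = match[1] ?? "";
    const isEcosystemImport = REACT_ECOSYSTEM_PACKAGES.some(
      (pkg) => specifier === pkg || specifier.startsWith(`${pkg}/`),
    );
    if (isEcosystemImport) return true;
    match = importPattern.exec(content);
  }
  return false;
}
```

A text-based check like this can misfire on imports that appear inside strings or comments, so it trades some precision for speed compared with parsing each file.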