Jan/December 2025 updates #19

Merged
andrewginns merged 8 commits into main from december-2025-updates on Jan 14, 2026
@andrewginns andrewginns commented Jan 14, 2026

Add model factory + merbench eval updates

Summary

  • Introduce a centralized model factory with explicit provider handling (including OpenRouter) plus a new demo and documentation for multi-provider usage in the basic MCP examples.
  • Expand Merbench evaluation tooling with per-case debug trace capture, new request-usage tracking, and updated evaluation guidance (including OpenRouter examples).
  • Add Merbench utilities for merging processed benchmark outputs and verifying retry orchestration, plus preprocessing updates to normalize model names for cost lookups.
  • Refresh model pricing data, add OpenRouter env configuration, and pin the Mermaid CLI version used by the validator.
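
As a rough illustration of what "explicit provider handling" could look like, the sketch below splits a provider-prefixed model string into its parts. It is not the factory added in this PR: `ModelSpec`, `parse_model_spec`, and `KNOWN_PROVIDERS` are hypothetical names, and only the `openrouter:` prefix is taken from the examples in this description.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of provider prefixes; only "openrouter" appears in this PR.
KNOWN_PROVIDERS = {"openrouter"}


@dataclass(frozen=True)
class ModelSpec:
    provider: Optional[str]  # None means no explicit prefix was given
    model: str


def parse_model_spec(spec: str) -> ModelSpec:
    """Split 'provider:model' into its parts, leaving bare names untouched."""
    spec = spec.strip()
    if not spec:
        raise ValueError("empty model spec")
    prefix, sep, rest = spec.partition(":")
    if sep and prefix in KNOWN_PROVIDERS:
        if not rest:
            raise ValueError(f"missing model name after {prefix!r}")
        return ModelSpec(provider=prefix, model=rest)
    # No recognised prefix: the caller decides which provider to use.
    return ModelSpec(provider=None, model=spec)


print(parse_model_spec("openrouter:anthropic/claude-3.7-sonnet"))
print(parse_model_spec("gemini-3-flash-preview"))
```

Returning `provider=None` for bare names keeps the default-provider decision out of the parser, which is one way to keep the handling explicit.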

Why

  • OpenRouter support makes it possible to run the demos and evaluations against models from multiple providers.
  • Merbench debugging and reporting improvements (debug traces + request usage tracking + cost normalization) make it easier to diagnose tool loops, retries, and cost accuracy across providers.
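
To make the cost-normalization point concrete, here is a minimal sketch of reducing a provider-qualified model name to a bare key before a pricing lookup. It is an assumption about the general approach, not the code in `preprocess_merbench_data.py`: `normalize_model_name` and `PLACEHOLDER_COSTS` are hypothetical, and the cost values are placeholders rather than real pricing.

```python
def normalize_model_name(name: str) -> str:
    """Reduce a provider-qualified name to a bare key for a pricing table."""
    name = name.strip().lower()
    prefix, sep, rest = name.partition(":")
    # Drop a leading "provider:" prefix such as "openrouter:". A prefix that
    # contains "/" is part of the model path, so it is left alone.
    if sep and "/" not in prefix:
        name = rest
    # Drop a vendor namespace such as "anthropic/".
    return name.rsplit("/", 1)[-1]


# Placeholder values for illustration only, not real pricing data.
PLACEHOLDER_COSTS = {"claude-3.7-sonnet": 1.0, "gemini-3-flash-preview": 2.0}

key = normalize_model_name("openrouter:anthropic/claude-3.7-sonnet")
print(key, PLACEHOLDER_COSTS[key])
```

With a step like this, the same pricing entry can serve a model whether it was called directly or through OpenRouter.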

How to Run / Verify

  • Run the new model-factory demo:
    • uv run agents_mcp_usage/basic_mcp/basic_mcp_use/pydantic_mcp_factory.py --list-models
  • Run multi-model Merbench with OpenRouter:
    • uv run agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py --models "openrouter:anthropic/claude-3.7-sonnet,gemini-3-flash-preview" --runs 5 --sequential
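
For readers unfamiliar with the flags above, the following sketch shows how a comma-separated `--models` value, `--runs`, and `--sequential` might be parsed with `argparse`. The flag names come from the command above; the defaults, help text, and the `build_parser` and `split_models` helpers are assumptions and may not match `run_multi_evals.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of a multi-model eval CLI")
    parser.add_argument("--models", required=True, help="comma-separated model names")
    # The default of 1 is an assumption made for this sketch.
    parser.add_argument("--runs", type=int, default=1, help="runs per model")
    parser.add_argument("--sequential", action="store_true", help="run models one at a time")
    return parser


def split_models(raw: str) -> list[str]:
    """Split the comma-separated list, ignoring empty entries and whitespace."""
    return [m.strip() for m in raw.split(",") if m.strip()]


args = build_parser().parse_args(
    [
        "--models",
        "openrouter:anthropic/claude-3.7-sonnet,gemini-3-flash-preview",
        "--runs",
        "5",
        "--sequential",
    ]
)
models = split_models(args.models)
print(models, args.runs, args.sequential)
```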

- Add cost metadata for `gemini-3-pro-preview` to mermaid_evals
- Define pricing for new OpenAI GPT-5 family variants (base, mini,
  nano, pro, chat-latest, codex, 5.1, search API)
- Update the lockfile to pydantic-evals 1.17.0 so evaluations stay
  aligned with the latest model set
- Update dashboard axis and series labels to clearly distinguish
  input, output, and thinking tokens, and add units (USD, seconds)
- Treat missing token counts as 0 instead of NaN when parsing
  Metric_details and computing costs, preventing groupby means from
  being inflated and keeping Deep Dive charts consistent with the
  leaderboard
- Add `OPENROUTER_API_KEY` to `.env.example` and document it in the Mermaid evals README, including an example `openrouter:*` model run
- Introduce `AGENTS.md` with repo structure, commands, and contribution conventions to help onboarding
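
The effect of the missing-token change can be seen with a small worked example. It assumes the aggregation skips NaN values, as pandas `mean` does by default, and it uses made-up token counts with only the standard library, so it illustrates the arithmetic rather than the dashboard code itself.

```python
import math
from statistics import mean

# Hypothetical per-run thinking-token counts; None marks a missing value.
runs = [1200, None, 800, None]

# NaN-style handling: missing values are skipped, so the mean covers 2 runs.
as_nan = [math.nan if v is None else float(v) for v in runs]
mean_skipping_nan = mean(v for v in as_nan if not math.isnan(v))

# Zero-fill handling: every run counts, so the mean covers all 4 runs.
mean_zero_filled = mean(0.0 if v is None else float(v) for v in runs)

print(mean_skipping_nan, mean_zero_filled)
```

Skipping the missing runs averages over fewer rows and gives the higher figure, which is the inflation the commit note describes.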
Repository owner deleted a comment from chatgpt-codex-connector Bot Jan 14, 2026
@andrewginns (Owner, Author)

@codex create a PR description for me based on the changes in this branch.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6ab40f95f9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread on `agents_mcp_usage/evaluations/mermaid_evals/scripts/preprocess_merbench_data.py` (outdated)
@chatgpt-codex-connector

Below is a ready-to-paste PR description based on the changes in this branch.


Proposed PR Title

  • feat: add model factory + merbench eval updates

PR Body

Summary

Why

How to Run / Verify

Schema / Output Impacts


@andrewginns andrewginns merged commit 9125423 into main Jan 14, 2026
1 check passed
@andrewginns andrewginns deleted the december-2025-updates branch January 14, 2026 21:30