Jan/December 2025 updates by andrewginns · Pull Request #19 · andrewginns/agents-mcp-usage

andrewginns · 2026-01-14T21:15:58Z

Add model factory + merbench eval updates

Summary

Introduce a centralized model factory with explicit provider handling (including OpenRouter) plus a new demo and documentation for multi-provider usage in the basic MCP examples.
Expand Merbench evaluation tooling with per-case debug trace capture, new request-usage tracking, and updated evaluation guidance (including OpenRouter examples).
Add Merbench utilities for merging processed benchmark outputs and verifying retry orchestration, plus preprocessing updates to normalize model names for cost lookups.
Refresh model pricing data, add OpenRouter env configuration, and pin the Mermaid CLI version used by the validator.
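
As a rough illustration of what normalizing model names for cost lookups can involve, the sketch below strips a provider prefix and a vendor namespace before consulting a pricing table. The function names, the normalization rules, and the sample prices are assumptions made for this example; the actual logic lives in `preprocess_merbench_data.py` and the real prices in `costs.json`.

```python
# Placeholder prices for illustration only; NOT taken from costs.json.
SAMPLE_COSTS = {
    "claude-3.7-sonnet": 1.0,
    "gemini-3-flash-preview": 2.0,
}


def normalize_model_name(model: str) -> str:
    """Reduce 'provider:vendor/model' to the bare 'model' key (assumed rule)."""
    # Drop an optional provider prefix such as "openrouter:".
    _, sep, rest = model.partition(":")
    without_provider = rest if sep else model
    # Drop an optional vendor namespace such as "anthropic/".
    bare = without_provider.rpartition("/")[2]
    return bare.strip().lower()


def lookup_cost(model: str, costs: dict = SAMPLE_COSTS):
    """Return the price for a model, or None when it is not in the table."""
    return costs.get(normalize_model_name(model))


print(lookup_cost("openrouter:anthropic/claude-3.7-sonnet"))
print(lookup_cost("gemini-3-flash-preview"))
```

Returning `None` for unknown models keeps a missing price visible to the caller instead of silently reporting a cost of zero.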

Why

OpenRouter makes multi-provider demos and evaluation runs possible.
Merbench debugging and reporting improvements (debug traces + request usage tracking + cost normalization) make it easier to diagnose tool loops, retries, and cost accuracy across providers.

How to Run / Verify

Run the new model-factory demo:
- uv run agents_mcp_usage/basic_mcp/basic_mcp_use/pydantic_mcp_factory.py --list-models
Run multi-model Merbench with OpenRouter:
- uv run agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py --models "openrouter:anthropic/claude-3.7-sonnet,gemini-3-flash-preview" --runs 5 --sequential
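
The `--models` value above mixes a provider-prefixed entry (`openrouter:anthropic/claude-3.7-sonnet`) with a bare model name (`gemini-3-flash-preview`). A minimal sketch of how such a string could be split into provider/model pairs follows. `ModelSpec`, `parse_model_specs`, and the `"default"` fallback provider are hypothetical names for illustration, not the repository's actual factory API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    provider: str  # e.g. "openrouter"; "default" is a placeholder assumption
    name: str      # model identifier handed on to the provider


def parse_model_specs(arg: str, default_provider: str = "default") -> list[ModelSpec]:
    """Split a comma-separated --models value into provider/name pairs."""
    specs = []
    for raw in arg.split(","):
        item = raw.strip()
        if not item:
            continue  # tolerate stray or trailing commas
        # Only the first ":" separates the provider; the rest is the model name.
        provider, sep, name = item.partition(":")
        if sep:
            specs.append(ModelSpec(provider, name))
        else:
            specs.append(ModelSpec(default_provider, item))
    return specs


specs = parse_model_specs(
    "openrouter:anthropic/claude-3.7-sonnet,gemini-3-flash-preview"
)
for spec in specs:
    print(spec.provider, spec.name)
```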

- Add cost metadata for `gemini-3-pro-preview` to mermaid_evals
- Define pricing for new OpenAI GPT-5 family variants (base, mini, nano, pro, chat-latest, codex, 5.1, search API)
- Update lockfile to use pydantic-evals 1.17.0 to support and align evaluations with the latest model set

- Update dashboard axis and series labels to clearly distinguish input, output, and thinking tokens, and add units (USD, seconds)
- Treat missing token counts as 0 instead of NaN when parsing Metric_details and computing costs, preventing groupby means from being inflated and keeping Deep Dive charts consistent with the leaderboard
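
The effect of treating missing token counts as 0 can be shown without pandas. pandas aggregations such as `groupby(...).mean()` skip NaN by default, so a run with no recorded token count drops out of the average. The stdlib-only sketch below mimics that behaviour with `None`; the sample values and helper names are made up for illustration.

```python
from statistics import mean

# Hypothetical per-run thinking-token counts; None marks a missing value.
thinking_tokens = [1200, None, None, 800]


def mean_skipping_missing(values):
    """Average only the present values, as NaN-skipping aggregations do."""
    present = [v for v in values if v is not None]
    return mean(present)


def mean_missing_as_zero(values):
    """Average after replacing each missing value with 0."""
    return mean([0 if v is None else v for v in values])


skipped = mean_skipping_missing(thinking_tokens)
zero_filled = mean_missing_as_zero(thinking_tokens)
print(skipped, zero_filled)
```

With two of four runs missing, the skipping mean is computed over only two runs and comes out twice as large as the zero-filled mean over all four.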

- Add `OPENROUTER_API_KEY` to `.env.example` and document it in the Mermaid evals README, including an example `openrouter:*` model run
- Introduce `AGENTS.md` with repo structure, commands, and contribution conventions to help onboarding

andrewginns · 2026-01-14T21:19:46Z

@codex create a PR description for me based on the changes in this branch.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6ab40f95f9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:

- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-01-14T21:23:08Z

Below is a ready-to-paste PR description based on the changes in this branch.

Proposed PR Title

feat: add model factory + merbench eval updates

PR Body

Summary

Introduce a centralized model factory with explicit provider handling (including OpenRouter) plus a new demo and documentation for multi-provider usage in the basic MCP examples. (agents_mcp_usage/factory/model_factory.py L1-L431; agents_mcp_usage/factory/MODEL_FACTORY.md L1-L200; agents_mcp_usage/basic_mcp/basic_mcp_use/pydantic_mcp_factory.py L1-L216; agents_mcp_usage/basic_mcp/README.md L228-L253)
Expand Merbench evaluation tooling with per-case debug trace capture, new request-usage tracking, and updated evaluation guidance (including OpenRouter examples). (agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py L60-L122, L319-L516, L1151-L1219; agents_mcp_usage/evaluations/mermaid_evals/README.md L60-L444)
Add Merbench utilities for merging processed benchmark outputs and verifying retry orchestration, plus preprocessing updates to normalize model names for cost lookups. (agents_mcp_usage/evaluations/mermaid_evals/scripts/merge_benchmark_results.py L1-L240; agents_mcp_usage/evaluations/mermaid_evals/scripts/verify_retry_orchestration.py L1-L126; agents_mcp_usage/evaluations/mermaid_evals/scripts/preprocess_merbench_data.py L1-L145)
Refresh model pricing data, add OpenRouter env configuration, and pin the Mermaid CLI version used by the validator. (agents_mcp_usage/evaluations/mermaid_evals/costs.json L1-L570; .env.example L1-L6; mcp_servers/mermaid_validator.py L50-L88)

Why

Centralizing provider/model creation (including OpenRouter) makes multi-provider demos and evaluation runs consistent and configurable while enabling OpenAI Responses normalization and provider-specific handling in one place. (agents_mcp_usage/factory/model_factory.py L1-L431; agents_mcp_usage/factory/MODEL_FACTORY.md L1-L200)
Merbench debugging and reporting improvements (debug traces + request usage tracking + cost normalization) make it easier to diagnose tool loops, retries, and cost accuracy across providers. (agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py L60-L122, L319-L516; agents_mcp_usage/evaluations/mermaid_evals/scripts/preprocess_merbench_data.py L23-L145; agents_mcp_usage/evaluations/mermaid_evals/README.md L378-L444)

How to Run / Verify

Run the new model-factory demo (agents_mcp_usage/basic_mcp/README.md L234-L246):
- uv run agents_mcp_usage/basic_mcp/basic_mcp_use/pydantic_mcp_factory.py --list-models
Run multi-model Merbench with OpenRouter (agents_mcp_usage/evaluations/mermaid_evals/README.md L80-L90):
- uv run agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py --models "openrouter:anthropic/claude-3.7-sonnet,gemini-3-flash-preview" --runs 5 --sequential
Merge processed Merbench outputs (agents_mcp_usage/evaluations/mermaid_evals/README.md L116-L124):
- uv run agents_mcp_usage/evaluations/mermaid_evals/scripts/merge_benchmark_results.py -i results/run1.json results/run2.json -o results/merged.json --dedup keep-all --report merge_report.json
Verify retry orchestration logic with the offline stub runner (agents_mcp_usage/evaluations/mermaid_evals/scripts/verify_retry_orchestration.py L1-L126):
- uv run agents_mcp_usage/evaluations/mermaid_evals/scripts/verify_retry_orchestration.py
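
To give a sense of what an offline check of retry orchestration can look like, here is a minimal sketch. It does not reproduce the contents of `verify_retry_orchestration.py`; `StubRunner`, `run_with_retries`, and the retry limit of 3 are hypothetical names and values chosen for illustration.

```python
class StubRunner:
    """Offline stand-in for a model call: fails a set number of times, then succeeds."""

    def __init__(self, failures_before_success: int):
        self.failures_left = failures_before_success
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            raise RuntimeError("simulated transient failure")
        return "ok"


def run_with_retries(runner, max_attempts: int = 3) -> str:
    """Call runner until it succeeds or max_attempts is exhausted."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return runner()
        except RuntimeError as exc:
            last_error = exc
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


# Succeeds on the third attempt.
flaky = StubRunner(failures_before_success=2)
result = run_with_retries(flaky)

# Never succeeds within the attempt budget.
hopeless = StubRunner(failures_before_success=5)
try:
    run_with_retries(hopeless)
    exhausted = False
except RuntimeError:
    exhausted = True

print(result, flaky.calls, exhausted, hopeless.calls)
```

Because the stub needs no network access or API key, a check like this can run in CI and still confirm that the attempt budget is respected.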

Schema / Output Impacts

Merbench CSV output now includes a request-usage column (Requests_Used). (agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py L60-L70, L1151-L1219)
When enabled, optional per-case debug traces are written to <output-dir>/debug_traces/. They contain full prompts and tool-call payloads, so treat them as sensitive data. (agents_mcp_usage/evaluations/mermaid_evals/README.md L378-L444)
merge_benchmark_results.py outputs the same schema as preprocess_merbench_data.py (stats/leaderboard/pareto/raw_data/etc.). (agents_mcp_usage/evaluations/mermaid_evals/scripts/merge_benchmark_results.py L1-L18)
Pricing inputs are updated in costs.json, and .env.example now includes the new OpenRouter env var template (OPENROUTER_API_KEY). (agents_mcp_usage/evaluations/mermaid_evals/costs.json L1-L570; .env.example L1-L6)


andrewginns added 7 commits July 20, 2025 08:55

chore: Update costs

800599e

feat: Model factory approach for flexibility

e5bf2cb

chore: Pin mermaid-cli version in MCP server

6f585b7

Merge branch 'main' of github.com:andrewginns/agents-mcp-usage

448b29c

feat: add OpenRouter

6ab40f9


Repository owner deleted a comment from chatgpt-codex-connector Bot Jan 14, 2026

chatgpt-codex-connector Bot reviewed Jan 14, 2026


Comment thread agents_mcp_usage/evaluations/mermaid_evals/scripts/preprocess_merbench_data.py Outdated

fix: Add missing reasoning levels

6c9982d

andrewginns merged commit 9125423 into main Jan 14, 2026
1 check passed

andrewginns deleted the december-2025-updates branch January 14, 2026 21:30


andrewginns merged 8 commits into main from december-2025-updates
