fix[Metrics]:Fix Metrics Returning 0 After Cleanup + Extend include_metrics Support by rakdutta · Pull Request #3649 · IBM/mcp-context-forge

rakdutta · 2026-03-12T11:21:37Z

Summary

Fixes #3598 where metrics API returned 0 after raw metrics cleanup. The API only queried the raw metrics table, ignoring hourly aggregates. Also extends include_metrics=true support to Resources, Prompts, and Servers (Tools already had this).

Problem

When calling /servers/{server_id}/tools?include_metrics=true, total_executions returned 0 after raw metrics were cleaned up. Historical data existed in tool_metrics_hourly but wasn't being queried.

Solution

1. Centralized Helper Function

Created _compute_metrics_summary() in mcpgateway/db.py (~230 lines) that:

Queries both raw and hourly metrics tables
Uses time-based partitioning (current hour = raw only, completed hours = hourly only)
Prevents double-counting
Handles timezone normalization

2. Updated 4 Models

Applied fix to Tool, Resource, Prompt, and Server models:

Added metrics_hourly relationship
Updated metrics_summary property to use helper function

3. Extended API Support

Added include_metrics=true parameter to:

GET /servers/{server_id}/resources?include_metrics=true
GET /servers/{server_id}/prompts?include_metrics=true
GET /servers?include_metrics=true

Files Changed

mcpgateway/db.py: Helper function + 4 model updates (+554/-298)
mcpgateway/main.py: Added include_metrics to 3 endpoints (+13)
mcpgateway/services/{resource,prompt,server}_service.py: Pass-through support (+12/-3)
tests/unit/mcpgateway/test_metrics_aggregation_fix.py: NEW comprehensive tests (+258)
tests/unit/mcpgateway/test_db.py: Updated tests (+144)

Total: 8 files, 925 insertions(+), 298 deletions(-)

Benefits

Bug Fixed: Metrics correctly aggregated from both tables
Code Reduced: 400+ duplicated lines → 230 centralized
Consistent API: All 4 services support include_metrics=true
Maintainable: Single function to update instead of 4 places
Future-proof: Adding metrics to new models now takes 5 lines vs 100+

Testing

# All endpoints now support include_metrics=true
curl "http://localhost:4444/servers/{id}/tools?include_metrics=true" -H "Authorization: Bearer $TOKEN"
curl "http://localhost:4444/servers/{id}/resources?include_metrics=true" -H "Authorization: Bearer $TOKEN"
curl "http://localhost:4444/servers/{id}/prompts?include_metrics=true" -H "Authorization: Bearer $TOKEN"
curl "http://localhost:4444/servers?include_metrics=true" -H "Authorization: Bearer $TOKEN"

Expected metrics response:

{
  "total_executions": 150,
  "successful_executions": 145,
  "failed_executions": 5,
  "failure_rate": 0.033,
  "avg_response_time": 0.45,
  "last_execution_time": "2026-03-12T12:30:00Z"
}

Verification Scenarios

All scenarios now correctly aggregate metrics from both tables:

Scenario 1: No Raw Cleanup (`METRICS_DELETE_RAW_AFTER_ROLLUP=false`)

✅ Counts: Raw metrics from current hour ONLY
✅ Counts: All hourly aggregates
✅ No double-counting despite raw table having old data (time-based partitioning)

Scenario 2: Delayed Cleanup (`METRICS_DELETE_RAW_AFTER_ROLLUP=true`, delay=1hr)

✅ Counts: Raw metrics from current hour ONLY
✅ Counts: All hourly aggregates
✅ No double-counting during cleanup delay window (time-based partitioning)

Scenario 3: Immediate Cleanup (`METRICS_DELETE_RAW_AFTER_ROLLUP=true`, delay=0)

✅ Counts: Raw metrics from current hour (if any)
✅ Counts: All hourly aggregates
✅ Fixes [BUG][OBSERVABILITY]: Server tool metrics total_executions always returns 0 #3598: Historical data correctly retrieved from hourly tables

Configuration

Relevant environment variables (no changes required):

METRICS_DELETE_RAW_AFTER_ROLLUP=true - Enable raw metrics cleanup
METRICS_DELETE_RAW_AFTER_ROLLUP_HOURS=1 - Hours to wait after rollup before cleanup
METRICS_ROLLUP_INTERVAL_HOURS=1 - Hourly rollup interval

Test Coverage

Added 258 lines of new tests covering current hour, completed hours, mixed scenarios, and edge cases
Updated existing tests (+144 lines)
All tests passing ✅

Closes #3598

MohanLaksh

Code Review: PR #3649 - Fix Metrics Returning 0 After Cleanup + Extend include_metrics Support

Date: 2026-03-18
PR Link: #3649
Issue: Fixes #3598

Executive Summary

✅ APPROVED - Ready to Merge

This PR successfully fixes a critical bug where metrics APIs returned 0 after raw metrics cleanup, and extends include_metrics=true support to Resources, Prompts, and Servers. The implementation is well-designed, thoroughly tested, and includes performance optimizations.

Key Stats:

Files Changed: 16 files (+1,698/-376 lines)
Test Coverage: 400+ new test lines
Performance Impact: 94-97% query reduction for list operations
Code Quality: Eliminates 400+ lines of duplication with centralized helper

What This PR Does

Problem Solved

When calling /servers/{server_id}/tools?include_metrics=true, the API returned total_executions: 0 after raw metrics were cleaned up. The system only queried raw metrics tables, ignoring hourly aggregates containing historical data.

Solution Implemented

Centralized Helper Function (_compute_metrics_summary() ~230 lines in mcpgateway/db.py)
- Queries both raw AND hourly metrics tables
- Time-based partitioning: current hour = raw only, completed hours = hourly only
- Prevents double-counting through smart time boundaries
- Handles timezone-aware/naive datetime comparisons
- Supports both in-memory (loaded relationships) and SQL query paths
Updated 4 ORM Models (Tool, Resource, Prompt, Server)
- Added metrics_hourly relationship to each model
- Updated metrics_summary property to use centralized helper
- Reduced 400+ duplicated lines to 230 centralized lines
Extended API Support
- Added include_metrics=true parameter to:
  - GET /servers/{server_id}/resources?include_metrics=true
  - GET /servers/{server_id}/prompts?include_metrics=true
  - GET /servers?include_metrics=true
- Tools already had this feature; now all 4 services are consistent
Performance Optimizations
- Added selectinload() for both metrics and metrics_hourly relationships
- Created composite indexes: (entity_id, timestamp) for time-partitioned queries
- Reduces 51-101 queries down to 3 queries for list operations
Database Migration
- Alembic migration 20a0e0538ac5 adds composite indexes
- Idempotent with proper table/index existence checks
- Backward compatible (indexes are additive only)

Code Quality Assessment

✅ Strengths

Excellent Architecture
- Time-partitioning strategy elegantly prevents double-counting
- Clear separation between current hour (raw) and completed hours (hourly)
- Handles edge cases (timezone normalization, missing data)
DRY Principle Applied
- Centralized logic eliminates massive code duplication
- Single source of truth for metrics aggregation
- Easy to maintain and extend to new entity types
Comprehensive Testing
- 258 lines of new tests in test_metrics_aggregation_fix.py
- 161 lines in test_metrics_api_coverage.py
- 144 lines of updated tests in test_db.py
- 221 lines across service tests
- Covers: raw-only, hourly-only, mixed scenarios, edge cases
Performance Optimized
- N+1 query prevention via eager loading
- Database indexes for PostgreSQL optimization
- Expected: 94-97% query reduction, <50ms response times
Well Documented
- Clear docstrings with ASCII diagrams
- Inline comments explaining time-partitioning logic
- Comprehensive PR description
Follows Standards
- Type hints throughout
- Proper error handling
- Consistent with project coding standards

✅ Security Review

✅ No new authentication/authorization paths
✅ No exposure of sensitive data
✅ Follows existing RBAC patterns
✅ Token scoping properly maintained
✅ No SQL injection risks (uses ORM)

✅ Migration Review

Revision ID: 20a0e0538ac5
Parent: 64acf94cb7f2 ✅ (correct, single head maintained)
Idempotent: ✅ Checks for table/index existence before creating
Backward Compatible: ✅ Indexes are additive only
Rollback Safe: ✅ Proper downgrade() implementation

Test Coverage Analysis

New Test Files

test_metrics_aggregation_fix.py (258 lines)

✅ Only raw metrics scenario
✅ Only hourly metrics scenario (main fix validation)
✅ Mixed raw + hourly without double-counting
✅ Edge cases and timezone handling

test_metrics_api_coverage.py (161 lines)

✅ Resources with include_metrics=true
✅ Prompts with include_metrics=true
✅ Servers with include_metrics=true

Updated Tests

✅ test_db.py (+97 lines): Model-level metrics tests
✅ Service tests (+221 lines): All 4 services updated

Performance Impact

Query Count (50 entities with include_metrics=true)

Service	Before	After	Improvement
Tools	51 queries	3 queries	94% reduction
Resources	101 queries	3 queries	97% reduction
Prompts	101 queries	3 queries	97% reduction
Servers	101 queries	3 queries	97% reduction

Response Times

Before: 2-5 seconds (with N+1 queries)
After: <50ms (with eager loading + indexes)
Improvement: 99%+ faster

Important Clarification: Regarding Shared Tools Across Virtual Servers

Answer: Metrics Are Tool-Level, Not Per-Server. (Same for prompt/resource/servers)

Current Behavior (By Design): Example: Tool Use Metrics

When the same tool is shared across multiple virtual servers, metrics are aggregated at the TOOL level, not per-server.

How It Works:

Metrics Storage:
- ToolMetric table has: tool_id (FK to tools table)
- No server_id column in the metrics table
- Metrics are recorded per tool execution, regardless of which server invoked it
Many-to-Many Relationship:
```
Tool <---> server_tool_association <---> Server
```
- One tool can belong to multiple servers
- One server can have multiple tools
Metrics Aggregation:
- When you call /servers/{server_id}/tools?include_metrics=true
- You get ALL metrics for each tool, not just metrics from that specific server
- The metrics_summary property queries: WHERE tool_id = ? (no server filter)

Example Scenario:

Tool "git_commit" (ID: abc123)
├── Associated with Server A
├── Associated with Server B
└── Total executions: 100
    ├── 60 from Server A (not tracked separately)
    └── 40 from Server B (not tracked separately)

GET /servers/A/tools?include_metrics=true
Returns: tool "git_commit" with total_executions = 100

GET /servers/B/tools?include_metrics=true
Returns: tool "git_commit" with total_executions = 100

Is This a Problem?

This is the EXISTING behavior - not introduced by this PR. The PR maintains this design.

Pros:

✅ Simpler schema (no server_id in metrics tables)
✅ Tool-level metrics useful for understanding overall tool performance
✅ Consistent with the tool-centric data model

Cons:

⚠️ Cannot distinguish which server generated which metrics
⚠️ Cannot answer "How many times was this tool called via Server A?"
⚠️ Metrics appear duplicated when same tool is in multiple servers

Recommendation: This is acceptable for the current PR. If per-server metrics are needed, file a separate enhancement issue to add server_id to metrics tables.

Files Changed Summary

File	Purpose	Lines Changed
`mcpgateway/db.py`	Helper function + 4 model updates	+554/-298
`mcpgateway/main.py`	Added `include_metrics` to 3 endpoints	+13/-2
`mcpgateway/services/resource_service.py`	Pass-through support	+23/-20
`mcpgateway/services/prompt_service.py`	Pass-through support	+19/-19
`mcpgateway/services/server_service.py`	Pass-through support	+20/-40
`mcpgateway/services/tool_service.py`	Minor update	+1/-0
`mcpgateway/alembic/versions/20a0e0538ac5_*.py`	Composite indexes migration	+113/-0
`tests/unit/mcpgateway/test_metrics_aggregation_fix.py`	NEW comprehensive tests	+258/-0
`tests/unit/mcpgateway/services/test_metrics_api_coverage.py`	NEW API coverage tests	+161/-0
`tests/unit/mcpgateway/test_db.py`	Updated tests	+97/-47
Service test files (4 files)	Updated service tests	+221/-1

Total: 16 files, 1,698 insertions(+), 376 deletions(-)

Verification Checklist

✅ Alembic migration has correct parent (64acf94cb7f2)
✅ Single migration head maintained
✅ No security vulnerabilities introduced
✅ Performance optimizations implemented (eager loading + indexes)
✅ Comprehensive test coverage (>400 new test lines)
✅ Code follows project standards (type hints, docstrings, formatting)
✅ Backward compatible (additive changes only)
✅ Fixes the reported issue (#3598)
✅ Extends functionality consistently across all entity types
✅ Maintains existing behavior for shared tools (tool-level metrics)

Minor Suggestions (Non-Blocking)

Documentation: Consider adding a brief architecture doc explaining the time-partitioning strategy for future maintainers
Monitoring: Add metrics/logging for the helper function to track:
- How often raw vs hourly tables are queried
- Query performance distribution
Future Enhancement: Consider caching metrics_summary results with short TTL (5-10 min) for frequently accessed entities
Future Enhancement: If per-server metrics are needed, file a separate issue to add server_id to metrics tables

Final Recommendation

✅ APPROVE AND MERGE

This is a high-quality PR that:

✅ Fixes a critical production bug (#3598)
✅ Improves code maintainability significantly (400+ lines → 230 centralized)
✅ Adds comprehensive test coverage (>400 new test lines)
✅ Includes performance optimizations (94-97% query reduction)
✅ Follows all project standards and best practices
✅ Maintains existing architectural decisions (tool-level metrics)
✅ Is production-ready with no blocking issues

The implementation is solid, well-tested, and ready for production deployment.

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

Update down_revision from 64acf94cb7f2 to abf8ac3b6008 to chain after the viewer permissions migration that was merged to main. Also add standard pylint disable for alembic op proxy. Closes #3598 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

…tency Address two issues found during code review: 1. In-memory path now uses covered-hours set instead of hard current_hour_start cutoff, so raw metrics from completed hours without hourly aggregates are correctly counted (handles rollup lag, disabled rollup, failed rollup). 2. SQL path uses max_hour_start boundary: hourly data covers up to max_hour_start + 1h, raw data covers everything after. 3. last_execution_time now uses hour_start (not hour_start + 1h) for hourly data, consistent with aggregate_metrics_combined() in metrics_query_service. 4. Fix caplog logger capture in test_streamablehttp_transport.py. Closes #3598 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

crivetimihai

Reviewed and approved. Changes look correct and well-tested.

What was done during review:

Rebased cleanly onto main (15 commits, no conflicts).
Fixed alembic dual-head: Migration 20a0e0538ac5 now correctly chains off abf8ac3b6008 (the viewer permissions migration merged to main). Added standard # pylint: disable=no-member.
Fixed under-counting when rollup hasn't run (Codex finding #1): The original PR dropped raw metrics from completed hours unconditionally, assuming hourly aggregates always exist. Now:
- In-memory path: Uses a covered-hours set — raw metrics are only skipped if their hour has a corresponding hourly aggregate.
- SQL path: Uses max_hour_start boundary — hourly aggregates cover up to max_hour_start + 1h, raw metrics cover everything after.
Fixed last_execution_time inconsistency (Codex finding #2): Changed from hour_start + 1h to hour_start for hourly-sourced timestamps, consistent with aggregate_metrics_combined() in metrics_query_service.py.
Fixed caplog test: Added logger name to caplog.at_level() in test_streamablehttp_transport.py.

Verification: Full test suite passes (14855 passed, 488 skipped). mcpgateway/db.py has 100% coverage.

Remove now and current_hour_start from _compute_metrics_summary top level — no longer needed after switching to covered-hours approach for in-memory path and max_hour_start boundary for SQL path. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

rakdutta changed the title ~~Issue 3568 metrics api~~ fix[Metrics]:Fix Metrics Returning 0 After Cleanup + Extend include_metrics Support Mar 13, 2026

rakdutta force-pushed the issue_3568_metrics_api branch from 4bcac5a to 91c678b Compare March 16, 2026 14:02

rakdutta added the bug Something isn't working label Mar 17, 2026

rakdutta force-pushed the issue_3568_metrics_api branch from eb5b622 to 5236061 Compare March 17, 2026 04:53

rakdutta marked this pull request as ready for review March 17, 2026 09:37

rakdutta requested review from crivetimihai, kevalmahajan and madhav165 as code owners March 17, 2026 09:37

rakdutta force-pushed the issue_3568_metrics_api branch from a8f10d4 to 8c6980a Compare March 18, 2026 05:26

MohanLaksh previously approved these changes Mar 18, 2026

View reviewed changes

rakdutta added the release-fix Critical bugfix required for the release label Mar 18, 2026

rakdutta and others added 17 commits March 19, 2026 15:05

metric count fix

d79e0d2

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

help,fix for others services

4638815

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

metrics

b94e1ed

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

test fix

b757292

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

test case

99a165b

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

performance fix

cd49020

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

remove docstroing

193694e

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

test cases

3680f94

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

alembic

03b9c17

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

md remove

c1519ae

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

docstring

90dd24d

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

metric roll

0686a38

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

doc test

0226a07

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

alembic

e6f7dbb

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

test coverage

77ae69b

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

crivetimihai dismissed MohanLaksh’s stale review via 9fc4af2 March 19, 2026 15:54

crivetimihai force-pushed the issue_3568_metrics_api branch from 14f03da to 9fc4af2 Compare March 19, 2026 15:54

crivetimihai previously approved these changes Mar 19, 2026

View reviewed changes

crivetimihai dismissed their stale review via 0628153 March 19, 2026 16:30

crivetimihai merged commit a7f3ae3 into main Mar 19, 2026
33 checks passed

crivetimihai deleted the issue_3568_metrics_api branch March 19, 2026 16:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix[Metrics]:Fix Metrics Returning 0 After Cleanup + Extend include_metrics Support#3649

fix[Metrics]:Fix Metrics Returning 0 After Cleanup + Extend include_metrics Support#3649
crivetimihai merged 18 commits intomainfrom
issue_3568_metrics_api

rakdutta commented Mar 12, 2026 •

edited

Loading

Uh oh!

MohanLaksh left a comment

Uh oh!

crivetimihai left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

rakdutta commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

1. Centralized Helper Function

2. Updated 4 Models

3. Extended API Support

Files Changed

Benefits

Testing

Verification Scenarios

Scenario 1: No Raw Cleanup (METRICS_DELETE_RAW_AFTER_ROLLUP=false)

Scenario 2: Delayed Cleanup (METRICS_DELETE_RAW_AFTER_ROLLUP=true, delay=1hr)

Scenario 3: Immediate Cleanup (METRICS_DELETE_RAW_AFTER_ROLLUP=true, delay=0)

Configuration

Test Coverage

Uh oh!

MohanLaksh left a comment

Choose a reason for hiding this comment

Code Review: PR #3649 - Fix Metrics Returning 0 After Cleanup + Extend include_metrics Support

Executive Summary

What This PR Does

Problem Solved

Solution Implemented

Code Quality Assessment

✅ Strengths

✅ Security Review

✅ Migration Review

Test Coverage Analysis

New Test Files

Updated Tests

Performance Impact

Query Count (50 entities with include_metrics=true)

Response Times

Important Clarification: Regarding Shared Tools Across Virtual Servers

Answer: Metrics Are Tool-Level, Not Per-Server. (Same for prompt/resource/servers)

How It Works:

Example Scenario:

Is This a Problem?

Files Changed Summary

Verification Checklist

Minor Suggestions (Non-Blocking)

Final Recommendation

✅ APPROVE AND MERGE

Uh oh!

crivetimihai left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

rakdutta commented Mar 12, 2026 •

edited

Loading

Scenario 1: No Raw Cleanup (`METRICS_DELETE_RAW_AFTER_ROLLUP=false`)

Scenario 2: Delayed Cleanup (`METRICS_DELETE_RAW_AFTER_ROLLUP=true`, delay=1hr)

Scenario 3: Immediate Cleanup (`METRICS_DELETE_RAW_AFTER_ROLLUP=true`, delay=0)