Skip to content

fix[Metrics]:Fix Metrics Returning 0 After Cleanup + Extend include_metrics Support#3649

Merged
crivetimihai merged 18 commits intomainfrom
issue_3568_metrics_api
Mar 19, 2026
Merged

fix[Metrics]:Fix Metrics Returning 0 After Cleanup + Extend include_metrics Support#3649
crivetimihai merged 18 commits intomainfrom
issue_3568_metrics_api

Conversation

@rakdutta
Copy link
Collaborator

@rakdutta rakdutta commented Mar 12, 2026

Summary

Fixes #3598 where metrics API returned 0 after raw metrics cleanup. The API only queried the raw metrics table, ignoring hourly aggregates. Also extends include_metrics=true support to Resources, Prompts, and Servers (Tools already had this).

Problem

When calling /servers/{server_id}/tools?include_metrics=true, total_executions returned 0 after raw metrics were cleaned up. Historical data existed in tool_metrics_hourly but wasn't being queried.

Solution

1. Centralized Helper Function

Created _compute_metrics_summary() in mcpgateway/db.py (~230 lines) that:

  • Queries both raw and hourly metrics tables
  • Uses time-based partitioning (current hour = raw only, completed hours = hourly only)
  • Prevents double-counting
  • Handles timezone normalization

2. Updated 4 Models

Applied fix to Tool, Resource, Prompt, and Server models:

  • Added metrics_hourly relationship
  • Updated metrics_summary property to use helper function

3. Extended API Support

Added include_metrics=true parameter to:

  • GET /servers/{server_id}/resources?include_metrics=true
  • GET /servers/{server_id}/prompts?include_metrics=true
  • GET /servers?include_metrics=true

Files Changed

  • mcpgateway/db.py: Helper function + 4 model updates (+554/-298)
  • mcpgateway/main.py: Added include_metrics to 3 endpoints (+13)
  • mcpgateway/services/{resource,prompt,server}_service.py: Pass-through support (+12/-3)
  • tests/unit/mcpgateway/test_metrics_aggregation_fix.py: NEW comprehensive tests (+258)
  • tests/unit/mcpgateway/test_db.py: Updated tests (+144)

Total: 8 files, 925 insertions(+), 298 deletions(-)

Benefits

  • Bug Fixed: Metrics correctly aggregated from both tables
  • Code Reduced: 400+ duplicated lines → 230 centralized
  • Consistent API: All 4 services support include_metrics=true
  • Maintainable: Single function to update instead of 4 places
  • Future-proof: Adding metrics to new models now takes 5 lines vs 100+

Testing

# All endpoints now support include_metrics=true
curl "http://localhost:4444/servers/{id}/tools?include_metrics=true" -H "Authorization: Bearer $TOKEN"
curl "http://localhost:4444/servers/{id}/resources?include_metrics=true" -H "Authorization: Bearer $TOKEN"
curl "http://localhost:4444/servers/{id}/prompts?include_metrics=true" -H "Authorization: Bearer $TOKEN"
curl "http://localhost:4444/servers?include_metrics=true" -H "Authorization: Bearer $TOKEN"

Expected metrics response:

{
  "total_executions": 150,
  "successful_executions": 145,
  "failed_executions": 5,
  "failure_rate": 0.033,
  "avg_response_time": 0.45,
  "last_execution_time": "2026-03-12T12:30:00Z"
}

Verification Scenarios

All scenarios now correctly aggregate metrics from both tables:

Scenario 1: No Raw Cleanup (METRICS_DELETE_RAW_AFTER_ROLLUP=false)

  • ✅ Counts: Raw metrics from current hour ONLY
  • ✅ Counts: All hourly aggregates
  • ✅ No double-counting despite raw table having old data (time-based partitioning)

Scenario 2: Delayed Cleanup (METRICS_DELETE_RAW_AFTER_ROLLUP=true, delay=1hr)

  • ✅ Counts: Raw metrics from current hour ONLY
  • ✅ Counts: All hourly aggregates
  • ✅ No double-counting during cleanup delay window (time-based partitioning)

Scenario 3: Immediate Cleanup (METRICS_DELETE_RAW_AFTER_ROLLUP=true, delay=0)

Configuration

Relevant environment variables (no changes required):

  • METRICS_DELETE_RAW_AFTER_ROLLUP=true - Enable raw metrics cleanup
  • METRICS_DELETE_RAW_AFTER_ROLLUP_HOURS=1 - Hours to wait after rollup before cleanup
  • METRICS_ROLLUP_INTERVAL_HOURS=1 - Hourly rollup interval

Test Coverage

  • Added 258 lines of new tests covering current hour, completed hours, mixed scenarios, and edge cases
  • Updated existing tests (+144 lines)
  • All tests passing ✅

Closes #3598

@rakdutta rakdutta changed the title Issue 3568 metrics api fix[Metrics]:Fix Metrics Returning 0 After Cleanup + Extend include_metrics Support Mar 13, 2026
@rakdutta rakdutta force-pushed the issue_3568_metrics_api branch from 4bcac5a to 91c678b Compare March 16, 2026 14:02
@rakdutta rakdutta added the bug Something isn't working label Mar 17, 2026
@rakdutta rakdutta force-pushed the issue_3568_metrics_api branch from eb5b622 to 5236061 Compare March 17, 2026 04:53
@rakdutta rakdutta marked this pull request as ready for review March 17, 2026 09:37
@rakdutta rakdutta force-pushed the issue_3568_metrics_api branch from a8f10d4 to 8c6980a Compare March 18, 2026 05:26
MohanLaksh
MohanLaksh previously approved these changes Mar 18, 2026
Copy link
Collaborator

@MohanLaksh MohanLaksh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: PR #3649 - Fix Metrics Returning 0 After Cleanup + Extend include_metrics Support

Date: 2026-03-18
PR Link: #3649
Issue: Fixes #3598


Executive Summary

APPROVED - Ready to Merge

This PR successfully fixes a critical bug where metrics APIs returned 0 after raw metrics cleanup, and extends include_metrics=true support to Resources, Prompts, and Servers. The implementation is well-designed, thoroughly tested, and includes performance optimizations.

Key Stats:

  • Files Changed: 16 files (+1,698/-376 lines)
  • Test Coverage: 400+ new test lines
  • Performance Impact: 94-97% query reduction for list operations
  • Code Quality: Eliminates 400+ lines of duplication with centralized helper

What This PR Does

Problem Solved

When calling /servers/{server_id}/tools?include_metrics=true, the API returned total_executions: 0 after raw metrics were cleaned up. The system only queried raw metrics tables, ignoring hourly aggregates containing historical data.

Solution Implemented

  1. Centralized Helper Function (_compute_metrics_summary() ~230 lines in mcpgateway/db.py)

    • Queries both raw AND hourly metrics tables
    • Time-based partitioning: current hour = raw only, completed hours = hourly only
    • Prevents double-counting through smart time boundaries
    • Handles timezone-aware/naive datetime comparisons
    • Supports both in-memory (loaded relationships) and SQL query paths
  2. Updated 4 ORM Models (Tool, Resource, Prompt, Server)

    • Added metrics_hourly relationship to each model
    • Updated metrics_summary property to use centralized helper
    • Reduced 400+ duplicated lines to 230 centralized lines
  3. Extended API Support

    • Added include_metrics=true parameter to:
      • GET /servers/{server_id}/resources?include_metrics=true
      • GET /servers/{server_id}/prompts?include_metrics=true
      • GET /servers?include_metrics=true
    • Tools already had this feature; now all 4 services are consistent
  4. Performance Optimizations

    • Added selectinload() for both metrics and metrics_hourly relationships
    • Created composite indexes: (entity_id, timestamp) for time-partitioned queries
    • Reduces 51-101 queries down to 3 queries for list operations
  5. Database Migration

    • Alembic migration 20a0e0538ac5 adds composite indexes
    • Idempotent with proper table/index existence checks
    • Backward compatible (indexes are additive only)

Code Quality Assessment

✅ Strengths

  1. Excellent Architecture

    • Time-partitioning strategy elegantly prevents double-counting
    • Clear separation between current hour (raw) and completed hours (hourly)
    • Handles edge cases (timezone normalization, missing data)
  2. DRY Principle Applied

    • Centralized logic eliminates massive code duplication
    • Single source of truth for metrics aggregation
    • Easy to maintain and extend to new entity types
  3. Comprehensive Testing

    • 258 lines of new tests in test_metrics_aggregation_fix.py
    • 161 lines in test_metrics_api_coverage.py
    • 144 lines of updated tests in test_db.py
    • 221 lines across service tests
    • Covers: raw-only, hourly-only, mixed scenarios, edge cases
  4. Performance Optimized

    • N+1 query prevention via eager loading
    • Database indexes for PostgreSQL optimization
    • Expected: 94-97% query reduction, <50ms response times
  5. Well Documented

    • Clear docstrings with ASCII diagrams
    • Inline comments explaining time-partitioning logic
    • Comprehensive PR description
  6. Follows Standards

    • Type hints throughout
    • Proper error handling
    • Consistent with project coding standards

✅ Security Review

  • ✅ No new authentication/authorization paths
  • ✅ No exposure of sensitive data
  • ✅ Follows existing RBAC patterns
  • ✅ Token scoping properly maintained
  • ✅ No SQL injection risks (uses ORM)

✅ Migration Review

  • Revision ID: 20a0e0538ac5
  • Parent: 64acf94cb7f2 ✅ (correct, single head maintained)
  • Idempotent: ✅ Checks for table/index existence before creating
  • Backward Compatible: ✅ Indexes are additive only
  • Rollback Safe: ✅ Proper downgrade() implementation

Test Coverage Analysis

New Test Files

test_metrics_aggregation_fix.py (258 lines)

  • ✅ Only raw metrics scenario
  • ✅ Only hourly metrics scenario (main fix validation)
  • ✅ Mixed raw + hourly without double-counting
  • ✅ Edge cases and timezone handling

test_metrics_api_coverage.py (161 lines)

  • ✅ Resources with include_metrics=true
  • ✅ Prompts with include_metrics=true
  • ✅ Servers with include_metrics=true

Updated Tests

  • test_db.py (+97 lines): Model-level metrics tests
  • ✅ Service tests (+221 lines): All 4 services updated

Performance Impact

Query Count (50 entities with include_metrics=true)

Service Before After Improvement
Tools 51 queries 3 queries 94% reduction
Resources 101 queries 3 queries 97% reduction
Prompts 101 queries 3 queries 97% reduction
Servers 101 queries 3 queries 97% reduction

Response Times

  • Before: 2-5 seconds (with N+1 queries)
  • After: <50ms (with eager loading + indexes)
  • Improvement: 99%+ faster

Important Clarification: Regarding Shared Tools Across Virtual Servers

Answer: Metrics Are Tool-Level, Not Per-Server. (Same for prompt/resource/servers)

Current Behavior (By Design): Example: Tool Use Metrics

When the same tool is shared across multiple virtual servers, metrics are aggregated at the TOOL level, not per-server.

How It Works:

  1. Metrics Storage:

    • ToolMetric table has: tool_id (FK to tools table)
    • No server_id column in the metrics table
    • Metrics are recorded per tool execution, regardless of which server invoked it
  2. Many-to-Many Relationship:

    Tool <---> server_tool_association <---> Server
    
    • One tool can belong to multiple servers
    • One server can have multiple tools
  3. Metrics Aggregation:

    • When you call /servers/{server_id}/tools?include_metrics=true
    • You get ALL metrics for each tool, not just metrics from that specific server
    • The metrics_summary property queries: WHERE tool_id = ? (no server filter)

Example Scenario:

Tool "git_commit" (ID: abc123)
├── Associated with Server A
├── Associated with Server B
└── Total executions: 100
    ├── 60 from Server A (not tracked separately)
    └── 40 from Server B (not tracked separately)

GET /servers/A/tools?include_metrics=true
Returns: tool "git_commit" with total_executions = 100

GET /servers/B/tools?include_metrics=true
Returns: tool "git_commit" with total_executions = 100

Is This a Problem?

This is the EXISTING behavior - not introduced by this PR. The PR maintains this design.

Pros:

  • ✅ Simpler schema (no server_id in metrics tables)
  • ✅ Tool-level metrics useful for understanding overall tool performance
  • ✅ Consistent with the tool-centric data model

Cons:

  • ⚠️ Cannot distinguish which server generated which metrics
  • ⚠️ Cannot answer "How many times was this tool called via Server A?"
  • ⚠️ Metrics appear duplicated when same tool is in multiple servers

Recommendation: This is acceptable for the current PR. If per-server metrics are needed, file a separate enhancement issue to add server_id to metrics tables.


Files Changed Summary

File Purpose Lines Changed
mcpgateway/db.py Helper function + 4 model updates +554/-298
mcpgateway/main.py Added include_metrics to 3 endpoints +13/-2
mcpgateway/services/resource_service.py Pass-through support +23/-20
mcpgateway/services/prompt_service.py Pass-through support +19/-19
mcpgateway/services/server_service.py Pass-through support +20/-40
mcpgateway/services/tool_service.py Minor update +1/-0
mcpgateway/alembic/versions/20a0e0538ac5_*.py Composite indexes migration +113/-0
tests/unit/mcpgateway/test_metrics_aggregation_fix.py NEW comprehensive tests +258/-0
tests/unit/mcpgateway/services/test_metrics_api_coverage.py NEW API coverage tests +161/-0
tests/unit/mcpgateway/test_db.py Updated tests +97/-47
Service test files (4 files) Updated service tests +221/-1

Total: 16 files, 1,698 insertions(+), 376 deletions(-)


Verification Checklist

  • ✅ Alembic migration has correct parent (64acf94cb7f2)
  • ✅ Single migration head maintained
  • ✅ No security vulnerabilities introduced
  • ✅ Performance optimizations implemented (eager loading + indexes)
  • ✅ Comprehensive test coverage (>400 new test lines)
  • ✅ Code follows project standards (type hints, docstrings, formatting)
  • ✅ Backward compatible (additive changes only)
  • ✅ Fixes the reported issue (#3598)
  • ✅ Extends functionality consistently across all entity types
  • ✅ Maintains existing behavior for shared tools (tool-level metrics)

Minor Suggestions (Non-Blocking)

  1. Documentation: Consider adding a brief architecture doc explaining the time-partitioning strategy for future maintainers

  2. Monitoring: Add metrics/logging for the helper function to track:

    • How often raw vs hourly tables are queried
    • Query performance distribution
  3. Future Enhancement: Consider caching metrics_summary results with short TTL (5-10 min) for frequently accessed entities

  4. Future Enhancement: If per-server metrics are needed, file a separate issue to add server_id to metrics tables


Final Recommendation

APPROVE AND MERGE

This is a high-quality PR that:

  • ✅ Fixes a critical production bug (#3598)
  • ✅ Improves code maintainability significantly (400+ lines → 230 centralized)
  • ✅ Adds comprehensive test coverage (>400 new test lines)
  • ✅ Includes performance optimizations (94-97% query reduction)
  • ✅ Follows all project standards and best practices
  • ✅ Maintains existing architectural decisions (tool-level metrics)
  • ✅ Is production-ready with no blocking issues

The implementation is solid, well-tested, and ready for production deployment.


@rakdutta rakdutta added the release-fix Critical bugfix required for the release label Mar 18, 2026
rakdutta and others added 17 commits March 19, 2026 15:05
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Update down_revision from 64acf94cb7f2 to abf8ac3b6008 to chain after
the viewer permissions migration that was merged to main. Also add
standard pylint disable for alembic op proxy.

Closes #3598

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…tency

Address two issues found during code review:

1. In-memory path now uses covered-hours set instead of hard current_hour_start
   cutoff, so raw metrics from completed hours without hourly aggregates are
   correctly counted (handles rollup lag, disabled rollup, failed rollup).

2. SQL path uses max_hour_start boundary: hourly data covers up to
   max_hour_start + 1h, raw data covers everything after.

3. last_execution_time now uses hour_start (not hour_start + 1h) for hourly
   data, consistent with aggregate_metrics_combined() in metrics_query_service.

4. Fix caplog logger capture in test_streamablehttp_transport.py.

Closes #3598

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
crivetimihai
crivetimihai previously approved these changes Mar 19, 2026
Copy link
Member

@crivetimihai crivetimihai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed and approved. Changes look correct and well-tested.

What was done during review:

  1. Rebased cleanly onto main (15 commits, no conflicts).

  2. Fixed alembic dual-head: Migration 20a0e0538ac5 now correctly chains off abf8ac3b6008 (the viewer permissions migration merged to main). Added standard # pylint: disable=no-member.

  3. Fixed under-counting when rollup hasn't run (Codex finding #1): The original PR dropped raw metrics from completed hours unconditionally, assuming hourly aggregates always exist. Now:

    • In-memory path: Uses a covered-hours set — raw metrics are only skipped if their hour has a corresponding hourly aggregate.
    • SQL path: Uses max_hour_start boundary — hourly aggregates cover up to max_hour_start + 1h, raw metrics cover everything after.
  4. Fixed last_execution_time inconsistency (Codex finding #2): Changed from hour_start + 1h to hour_start for hourly-sourced timestamps, consistent with aggregate_metrics_combined() in metrics_query_service.py.

  5. Fixed caplog test: Added logger name to caplog.at_level() in test_streamablehttp_transport.py.

Verification: Full test suite passes (14855 passed, 488 skipped). mcpgateway/db.py has 100% coverage.

Remove now and current_hour_start from _compute_metrics_summary top
level — no longer needed after switching to covered-hours approach
for in-memory path and max_hour_start boundary for SQL path.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai crivetimihai merged commit a7f3ae3 into main Mar 19, 2026
33 checks passed
@crivetimihai crivetimihai deleted the issue_3568_metrics_api branch March 19, 2026 16:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working release-fix Critical bugfix required for the release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG][OBSERVABILITY]: Server tool metrics total_executions always returns 0

3 participants