feat: e2e integration tests with new mock server all endpoints #377

Merged

ajcasagrande merged 8 commits into main from ajc/new-mock on Oct 18, 2025
Conversation

ajcasagrande (Contributor) commented Oct 17, 2025

Summary by CodeRabbit

Release Notes

  • New Features

    • Added comprehensive integration testing framework with automated CI/CD workflow across multiple OS and Python versions.
    • Expanded mock server with support for multiple OpenAI API endpoints, GPU telemetry metrics, latency simulation, reasoning model support, and error injection capabilities.
  • Tests

    • Added integration tests for chat, completions, embeddings, and rankings endpoints.
    • Added GPU telemetry and mock server component unit tests.
  • Chores

    • Updated test categorization with pytest markers for better test organization.
    • Enhanced make targets for integration and performance test isolation.

ajcasagrande changed the title from "feat: new mock server implementation chat/completion/embedding/rankings" to "feat: e2e integration tests with new mock server all endpoints" on Oct 17, 2025

codecov Bot commented Oct 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

coderabbitai Bot commented Oct 17, 2025

Walkthrough

This pull request significantly refactors the AIPerf Mock Server from a basic integration test server into a production-grade mock supporting multiple OpenAI API endpoints, GPU telemetry via DCGM, latency simulation, reasoning models, error injection, and streaming responses. Accompanying changes include new test infrastructure, configuration framework, and comprehensive unit and integration tests.

Changes

CI/CD Configuration
Files: .github/workflows/run-integration-tests.yml, Makefile, pyproject.toml
Summary: Added GitHub Actions workflow for integration tests across Ubuntu/macOS and Python 3.10-3.12. Added Makefile targets for test-integration and test-integration-verbose. Added pytest markers (integration, performance, ffmpeg) with deselection by default and console output styling options (a marker usage sketch follows this table).

Mock Server Core Redesign
Files: tests/aiperf_mock_server/README.md, tests/aiperf_mock_server/__main__.py, tests/aiperf_mock_server/app.py, tests/aiperf_mock_server/config.py, tests/aiperf_mock_server/models.py
Summary: Redesigned README from integration test server to feature-rich mock server. Updated entry point with expanded configuration. Completely refactored app.py endpoints (chat/completions/embeddings/rankings) with a new RequestContext-driven architecture, streaming support, and a DCGM metrics endpoint. Introduced MockServerConfig with environment variable propagation. Replaced legacy models with a comprehensive Pydantic request/response schema covering multiple API endpoints.

Mock Server Dependencies & Utilities
Files: tests/aiperf_mock_server/pyproject.toml, tests/aiperf_mock_server/dcgm_faker.py, tests/aiperf_mock_server/tokens.py, tests/aiperf_mock_server/utils.py
Summary: Added pydantic-settings and orjson dependencies. Added H200 GPU configuration. Introduced new tokens.py module with deterministic tokenization, reasoning token generation, and budget management. Introduced new utils.py module with LatencySimulator, RequestContext, error injection, and streaming handlers.

Removed Module
Files: tests/aiperf_mock_server/tokenizer_service.py
Summary: Removed the entire tokenizer service module; functionality is replaced by the new tokens.py module.

Test Configuration & Utilities
Files: tests/conftest.py, tests/server/__init__.py, tests/server/conftest.py, tests/integration/conftest.py, tests/integration/models.py, tests/integration/utils.py, tests/integration/README.md
Summary: Removed legacy pytest hooks from the main conftest. Added a comprehensive server conftest with test fixtures for tokenizer, DCGM, GPU config, and test client. Introduced an integration conftest with an AIPerfCLI runner, mock server fixtures, and subprocess management. Added models for AIPerfResults, AIPerfMockServer, and subprocess results. Added a rankings dataset utility.

Server Unit Tests
Files: tests/server/test_app.py, tests/server/test_config.py, tests/server/test_dcgm_faker.py, tests/server/test_models.py, tests/server/test_tokens.py, tests/server/test_utils.py
Summary: Added comprehensive test suites for all mock server components: API endpoints (root, health, chat, completions, embeddings, rankings, DCGM), configuration validation and environment propagation, GPU faker behavior and metrics generation, Pydantic model validation, tokenization and reasoning logic, and streaming and error injection utilities.

Integration Tests
Files: tests/integration/test_chat_endpoint.py, tests/integration/test_completions_endpoint.py, tests/integration/test_default_behavior.py, tests/integration/test_embeddings_endpoint.py, tests/integration/test_gpu_telemetry.py, tests/integration/test_rankings_endpoint.py
Summary: Added six integration test modules validating each API endpoint (chat, completions, embeddings, rankings), default behavior, and GPU telemetry collection against the mock server using the AIPerfCLI harness.
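
For readers unfamiliar with the marker-based deselection mentioned above, here is a minimal sketch of how such markers are typically applied in test modules. The marker names come from the pyproject.toml summary; the exact addopts wiring that deselects them by default is an assumption:

import pytest


@pytest.mark.integration
def test_chat_endpoint_roundtrip():
    """Runs only when integration tests are explicitly selected, e.g. pytest -m integration."""
    ...


@pytest.mark.performance
def test_high_concurrency_profile():
    """Deselected by default; enabled via the new make test-integration style targets."""
    ...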

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Rationale: This refactor introduces substantial architectural changes across multiple interconnected modules (app.py, models.py, config.py) with new complex logic (tokenization, streaming, latency simulation). The changes exhibit high heterogeneity across concerns (API endpoints, configuration, telemetry, tests) requiring separate reasoning for each area. While individual test files are straightforward, the mock server implementation modules contain moderate-to-high logic density. The extensive test coverage and documentation changes increase total scope but provide validation context. No single homogeneous pattern; multiple distinct refactors demand careful review of interactions and backward compatibility implications.

Poem

🐰 Whiskers twitching with glee,
A mock server springs forth—tokens dance free,
DCGM metrics gleam, streaming flows bright,
With reasoning tokens and tests burning right!
From endpoint to endpoint, the rabbits now roam, 🚀

Pre-merge checks

❌ Failed checks (1 warning)

Docstring Coverage: ⚠️ Warning. Docstring coverage is 50.96%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.

Title Check: ✅ Passed. The PR title "feat: new mock server implementation chat/completion/embedding/rankings" directly corresponds to the primary objective of this changeset, which is a comprehensive redesign and reimplementation of the AIPerf Mock Server with new endpoints for chat completions, text completions, embeddings, and rankings. The title is specific and informative, clearly identifying the main components being implemented (the four API endpoints) while avoiding vague or generic language. While the changeset also includes supporting changes like workflow additions, Makefile updates, test infrastructure, and configuration refactoring, these all serve to enable and test the core mock server implementation, making the title's focus appropriate and representative of the central change.

Comment @coderabbitai help to get the list of available commands and usage tips.


coderabbitai Bot left a review comment

Actionable comments posted: 15

🧹 Nitpick comments (19)
tests/integration/README.md (1)

56-65: Use proper heading syntax instead of bold emphasis.

Lines 58 and 62 use bold text as pseudo-headings. Use proper Markdown heading syntax for better semantic structure and accessibility.

Apply this diff:

 
 ## Key Components
 
-**Fixtures (conftest.py)**
+### Fixtures (conftest.py)
+
 - `aiperf_mock_server: AIPerfMockServer` - Mock LLM server instance
 - `cli: AIPerfCLI` - CLI wrapper for running AIPerf commands
 
-**Models (models.py)**
+### Models (models.py)
+
 - `AIPerfResults` - Result wrapper with typed properties for all output artifacts
 - `AIPerfMockServer` - Server connection info
 - `AIPerfSubprocessResult` - Subprocess execution result
tests/integration/test_completions_endpoint.py (1)

16-31: LGTM: Basic completions test is correct.

Consider adding a streaming completions test similar to test_streaming_chat for completeness, though non-streaming coverage is sufficient for now.

tests/server/test_utils.py (3)

27-38: Avoid hard‑coded latency expectations; derive from config.

Coupling tests to 20ms/5ms defaults will break if config changes. Read ttft/itl from server_config to compute expected seconds.

Apply:

@@
-    async def test_wait_for_next_token(self, time_traveler: TimeTraveler):
-        sim = LatencySimulator()
-        with time_traveler.sleeps_for(expected_seconds=0.02):
+    async def test_wait_for_next_token(self, time_traveler: TimeTraveler):
+        from aiperf_mock_server.config import server_config
+        sim = LatencySimulator()
+        with time_traveler.sleeps_for(expected_seconds=server_config.ttft * 0.001):
             await sim.wait_for_next_token()
@@
-    async def test_wait_for_tokens_multiple(self, time_traveler: TimeTraveler):
-        sim = LatencySimulator()
-        with time_traveler.sleeps_for(expected_seconds=0.02 + (0.005 * 5)):
+    async def test_wait_for_tokens_multiple(self, time_traveler: TimeTraveler):
+        from aiperf_mock_server.config import server_config
+        sim = LatencySimulator()
+        with time_traveler.sleeps_for(
+            expected_seconds=(server_config.ttft + server_config.itl * 5) * 0.001
+        ):
             await sim.wait_for_tokens(num_tokens=5)

6-10: Expand request ID coverage for embeddings and rankings.

Test create_request_id for all supported request types.

Apply:

@@
-from aiperf_mock_server.models import (
-    ChatCompletionRequest,
-    CompletionRequest,
-    Message,
-)
+from aiperf_mock_server.models import (
+    ChatCompletionRequest,
+    CompletionRequest,
+    EmbeddingRequest,
+    RankingRequest,
+    Message,
+)
@@
         [
             (CompletionRequest(model="test", prompt="test"), "cmpl-"),
             (
                 ChatCompletionRequest(
                     model="test", messages=[Message(role="user", content="test")]
                 ),
                 "chatcmpl-",
             ),
+            (EmbeddingRequest(model="test", input="hello"), "emb-"),
+            (
+                RankingRequest(
+                    model="test",
+                    query={"text": "q"},
+                    passages=[{"text": "p1"}],
+                ),
+                "rank-",
+            ),
         ],

Also applies to: 43-59


106-131: Strengthen SSE assertions by parsing JSON instead of substring matching.

Validate each non-[DONE] SSE chunk is valid JSON and contains expected keys; assert usage presence robustly.

Apply:

@@
-        chunks = []
+        import json
+        chunks = []
         async for chunk in stream_text_completion(ctx):
             chunks.append(chunk)
@@
-        assert any("data:" in chunk for chunk in chunks)
+        payloads = [
+            json.loads(c[len("data: ") :])
+            for c in chunks
+            if c.startswith("data: ") and c.strip() != "data: [DONE]"
+        ]
+        assert payloads and all("id" in p and "model" in p for p in payloads)
@@
-        chunks = []
+        import json
+        chunks = []
         async for chunk in stream_text_completion(ctx):
             chunks.append(chunk)
@@
-        assert any("usage" in chunk for chunk in chunks)
+        usage_chunks = [
+            json.loads(c[len("data: ") :])
+            for c in chunks
+            if c.startswith("data: ") and c.strip() != "data: [DONE]"
+        ]
+        assert any("usage" in p for p in usage_chunks)
@@
-        chunks = []
+        import json
+        chunks = []
         async for chunk in stream_chat_completion(ctx):
             chunks.append(chunk)
@@
-        assert any("data:" in chunk for chunk in chunks)
+        payloads = [
+            json.loads(c[len("data: ") :])
+            for c in chunks
+            if c.startswith("data: ") and c.strip() != "data: [DONE]"
+        ]
+        assert payloads and all("id" in p and "model" in p for p in payloads)
@@
-        chunks = []
+        import json
+        chunks = []
         async for chunk in stream_chat_completion(ctx):
             chunks.append(chunk)
@@
-        assert any("usage" in chunk for chunk in chunks)
+        usage_chunks = [
+            json.loads(c[len("data: ") :])
+            for c in chunks
+            if c.startswith("data: ") and c.strip() != "data: [DONE]"
+        ]
+        assert any("usage" in p for p in usage_chunks)

Optional: add time_traveler fixture to streaming tests to avoid wall‑clock sleeps.

Also applies to: 136-150, 166-179

tests/server/test_config.py (1)

32-45: Caution: 0.0.0.0 binding (S104).

Using host="0.0.0.0" is valid but exposes the server on all interfaces. If used outside CI, ensure it’s intentional and documented.

tests/aiperf_mock_server/__main__.py (1)

39-46: Keep access_log user‑controlled.

Auto‑enabling access logs when log_level=debug overrides explicit config.access_logs=False. Prefer honoring the flag as source of truth.

Apply:

-        access_log=config.access_logs or config.log_level.lower() == "debug",
+        access_log=config.access_logs,

If you want auto‑enable in debug, document it and add an explicit “--access-logs=auto” mode.

tests/server/test_app.py (1)

13-15: Avoid pinning exact version in test.

Hard‑coding "2.0.0" creates churn on version bumps. Assert presence/format or import a single source of truth.

Apply:

-        assert data["message"] == "AIPerf Mock Server"
-        assert data["version"] == "2.0.0"
+        assert data["message"] == "AIPerf Mock Server"
+        assert "version" in data and isinstance(data["version"], str) and data["version"]
tests/server/test_tokens.py (2)

5-10: Avoid importing private _tokenize in tests unless necessary

Prefer exercising the public surface (Tokenizer.tokenize/count_tokens) to reduce test fragility when internals change. Keep one direct test for _tokenize if it guards a critical behavior, otherwise remove.
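
As a rough illustration, a public-surface version of such a test might look like the following. It assumes, per the later call-style comment, that aiperf_mock_server.tokens exposes a usable Tokenizer object with tokenize()/count_tokens(), and that tokenize() returns an object with a count attribute:

from aiperf_mock_server.tokens import Tokenizer


def test_tokenize_and_count_agree():
    text = "hello world from the mock server"
    tokenized = Tokenizer.tokenize(text)
    # The public count_tokens() should agree with the token list from tokenize(),
    # without reaching into the private _tokenize helper.
    assert Tokenizer.count_tokens(text) == tokenized.count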


103-109: ignore_eos behavior assertion

Finish reason “length” check is useful. Consider asserting count == max_tokens when ignore_eos is true to catch regressions.
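
A hedged sketch of that extra assertion; the field names (ignore_eos, count) are taken from the surrounding review context and may need adjusting to the actual models:

from aiperf_mock_server.models import CompletionRequest
from aiperf_mock_server.tokens import tokenize_request


def test_ignore_eos_spends_full_budget():
    request = CompletionRequest(
        model="test", prompt="short prompt", max_tokens=32, ignore_eos=True
    )
    result = tokenize_request(request)
    # Complements the existing finish_reason == "length" check: with ignore_eos the
    # generator should not stop early, so it should spend the full token budget.
    assert result.count == request.max_tokens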

tests/aiperf_mock_server/app.py (2)

56-61: dcgm_fakers grows across app restarts

On reloads, you append without clearing, leaking instances. Clear before appending.

-    dcgm_fakers.append(_create_dcgm_faker(server_config.dcgm_seed))
+    dcgm_fakers.clear()
+    dcgm_fakers.append(_create_dcgm_faker(server_config.dcgm_seed))

238-244: Path pattern ‘/dcgm{instance_id:int}/metrics’

This intends to match ‘/dcgm1/metrics’. Starlette typically supports {param} but params embedded in path segments can be finicky. Add a companion route /dcgm/{instance_id}/metrics for safety, or add an explicit test ensuring both forms resolve.

 @app.get("/dcgm{instance_id:int}/metrics")
 async def dcgm_metrics(instance_id: int) -> PlainTextResponse:
     ...
+
+@app.get("/dcgm/{instance_id}/metrics")
+async def dcgm_metrics_slash(instance_id: int) -> PlainTextResponse:
+    return await dcgm_metrics(instance_id)
tests/integration/conftest.py (4)

157-167: Tighten health-check backoff; current worst-case ≈200s.

Reduce attempts and per-try timeout for faster failure.

-            for _ in range(100):
+            for _ in range(50):
                 try:
                     async with session.get(
-                        f"{url}/health", timeout=aiohttp.ClientTimeout(total=2)
+                        f"{url}/health", timeout=aiohttp.ClientTimeout(total=0.5)
                     ) as resp:
                         if resp.status == 200:
                             break

Also applies to: 160-165


205-208: Use splat expansion; cleaner and matches Ruff hint (RUF005).

-        full_args = args + ["--artifact-dir", str(temp_output_dir)]
+        full_args = [*args, "--artifact-dir", str(temp_output_dir)]
-        cmd = [python_exe, "-m", "aiperf"] + full_args
+        cmd = [python_exe, "-m", "aiperf", *full_args]

151-153: Consider capturing stderr for failed startup diagnostics.

Silencing server logs complicates debugging when /health never turns green.

-        stdout=asyncio.subprocess.DEVNULL,
-        stderr=asyncio.subprocess.DEVNULL,
+        stdout=asyncio.subprocess.DEVNULL,
+        stderr=asyncio.subprocess.PIPE,  # capture errors for debugging

Optionally surface stderr in the RuntimeError message when startup fails.


236-242: Clarify fixture dependency; silence ARG001.

Param is intentionally unused to ensure server lifecycle; rename it.

-def cli(
-    aiperf_runner: Callable[[list[str], float], AIPerfSubprocessResult],
-    aiperf_mock_server: AIPerfMockServer,
-) -> AIPerfCLI:
+def cli(
+    aiperf_runner: Callable[[list[str], float], AIPerfSubprocessResult],
+    _aiperf_mock_server: AIPerfMockServer,  # ensures server is running
+) -> AIPerfCLI:
tests/aiperf_mock_server/tokens.py (2)

310-314: Make seed stable across runs (Python hash is randomized).

hash() varies per process; prefer a small stable digest.

-        sample = prompt_tokens[:5]
-        return hash(tuple(sample)) % 1000
+        import hashlib
+        sample = "".join(prompt_tokens[:5]).encode("utf-8", "ignore")
+        return int.from_bytes(hashlib.blake2s(sample, digest_size=4).digest(), "big")

254-269: Tighten typing for messages.

Helps static analysis and IDEs.

-    def _extract_chat_messages(self, messages: list) -> str:
+    def _extract_chat_messages(self, messages: list["Message"]) -> str:
tests/server/test_dcgm_faker.py (1)

39-41: Ignore S311 here; randomness is non-crypto test scaffolding.

No action needed.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed31a65 and f8f1e46.

📒 Files selected for processing (32)
  • .github/workflows/run-integration-tests.yml (1 hunks)
  • Makefile (3 hunks)
  • pyproject.toml (2 hunks)
  • tests/aiperf_mock_server/README.md (1 hunks)
  • tests/aiperf_mock_server/__main__.py (1 hunks)
  • tests/aiperf_mock_server/app.py (1 hunks)
  • tests/aiperf_mock_server/config.py (2 hunks)
  • tests/aiperf_mock_server/dcgm_faker.py (1 hunks)
  • tests/aiperf_mock_server/models.py (1 hunks)
  • tests/aiperf_mock_server/pyproject.toml (1 hunks)
  • tests/aiperf_mock_server/tokenizer_service.py (0 hunks)
  • tests/aiperf_mock_server/tokens.py (1 hunks)
  • tests/aiperf_mock_server/utils.py (1 hunks)
  • tests/conftest.py (0 hunks)
  • tests/integration/README.md (1 hunks)
  • tests/integration/conftest.py (1 hunks)
  • tests/integration/models.py (1 hunks)
  • tests/integration/test_chat_endpoint.py (1 hunks)
  • tests/integration/test_completions_endpoint.py (1 hunks)
  • tests/integration/test_default_behavior.py (1 hunks)
  • tests/integration/test_embeddings_endpoint.py (1 hunks)
  • tests/integration/test_gpu_telemetry.py (1 hunks)
  • tests/integration/test_rankings_endpoint.py (1 hunks)
  • tests/integration/utils.py (1 hunks)
  • tests/server/__init__.py (1 hunks)
  • tests/server/conftest.py (1 hunks)
  • tests/server/test_app.py (1 hunks)
  • tests/server/test_config.py (1 hunks)
  • tests/server/test_dcgm_faker.py (1 hunks)
  • tests/server/test_models.py (1 hunks)
  • tests/server/test_tokens.py (1 hunks)
  • tests/server/test_utils.py (1 hunks)
💤 Files with no reviewable changes (2)
  • tests/conftest.py
  • tests/aiperf_mock_server/tokenizer_service.py
🧰 Additional context used
🧬 Code graph analysis (20)
tests/server/test_models.py (1)
tests/aiperf_mock_server/models.py (24)
  • ChatChoice (159-162)
  • ChatCompletionRequest (47-57)
  • ChatCompletionResponse (183-187)
  • ChatDelta (151-156)
  • ChatMessage (143-148)
  • ChatStreamChoice (171-174)
  • ChatStreamCompletionResponse (197-201)
  • CompletionRequest (60-71)
  • EmbeddingRequest (74-87)
  • Message (24-28)
  • Ranking (228-232)
  • RankingRequest (90-110)
  • RankingResponse (235-242)
  • TextChoice (165-168)
  • TextCompletionResponse (190-194)
  • TextStreamChoice (177-180)
  • TextStreamCompletionResponse (204-208)
  • Usage (118-124)
  • include_usage (42-44)
  • prompt_text (67-71)
  • max_output_tokens (55-57)
  • inputs (81-87)
  • passage_texts (103-105)
  • total_tokens (108-110)
tests/server/conftest.py (6)
tests/integration/conftest.py (1)
  • aiperf_mock_server (128-187)
tests/aiperf_mock_server/config.py (2)
  • MockServerConfig (16-123)
  • set_server_config (131-135)
tests/aiperf_mock_server/dcgm_faker.py (1)
  • DCGMFaker (117-148)
tests/aiperf_mock_server/models.py (5)
  • ChatCompletionRequest (47-57)
  • CompletionRequest (60-71)
  • EmbeddingRequest (74-87)
  • Message (24-28)
  • RankingRequest (90-110)
tests/aiperf_mock_server/tokens.py (2)
  • TokenizedText (16-56)
  • content (32-34)
aiperf/common/tokenizer.py (1)
  • Tokenizer (25-161)
tests/integration/test_completions_endpoint.py (2)
tests/integration/conftest.py (4)
  • AIPerfCLI (27-87)
  • cli (237-242)
  • aiperf_mock_server (128-187)
  • run (36-59)
tests/integration/models.py (2)
  • AIPerfMockServer (24-35)
  • request_count (138-142)
tests/server/test_tokens.py (2)
tests/aiperf_mock_server/models.py (4)
  • ChatCompletionRequest (47-57)
  • CompletionRequest (60-71)
  • Message (24-28)
  • total_tokens (108-110)
tests/aiperf_mock_server/tokens.py (9)
  • TokenizedText (16-56)
  • _tokenize (155-179)
  • reasoning_content (37-43)
  • create_usage (45-56)
  • tokenize (185-187)
  • count (27-29)
  • count_tokens (189-191)
  • tokenize_request (193-234)
  • content (32-34)
tests/integration/test_embeddings_endpoint.py (2)
tests/integration/conftest.py (4)
  • AIPerfCLI (27-87)
  • cli (237-242)
  • aiperf_mock_server (128-187)
  • run (36-59)
tests/integration/models.py (2)
  • AIPerfMockServer (24-35)
  • request_count (138-142)
tests/server/test_utils.py (3)
tests/aiperf_mock_server/models.py (3)
  • ChatCompletionRequest (47-57)
  • CompletionRequest (60-71)
  • Message (24-28)
tests/aiperf_mock_server/utils.py (8)
  • LatencySimulator (59-81)
  • RequestContext (84-95)
  • create_request_id (103-115)
  • stream_chat_completion (123-144)
  • stream_text_completion (220-237)
  • with_error_injection (39-51)
  • wait_for_next_token (70-73)
  • wait_for_tokens (75-81)
tests/utils/time_traveler.py (3)
  • time_traveler (114-125)
  • TimeTraveler (20-110)
  • sleeps_for (92-110)
tests/server/test_app.py (1)
tests/server/conftest.py (1)
  • test_client (61-63)
tests/integration/test_rankings_endpoint.py (3)
tests/integration/conftest.py (4)
  • AIPerfCLI (27-87)
  • cli (237-242)
  • aiperf_mock_server (128-187)
  • run (36-59)
tests/integration/models.py (2)
  • AIPerfMockServer (24-35)
  • request_count (138-142)
tests/integration/utils.py (1)
  • create_rankings_dataset (10-30)
tests/integration/test_gpu_telemetry.py (2)
tests/integration/conftest.py (3)
  • AIPerfCLI (27-87)
  • aiperf_mock_server (128-187)
  • run (36-59)
tests/integration/models.py (4)
  • AIPerfMockServer (24-35)
  • dcgm_urls (33-35)
  • request_count (138-142)
  • has_gpu_telemetry (231-233)
tests/integration/test_default_behavior.py (2)
tests/integration/conftest.py (4)
  • AIPerfCLI (27-87)
  • cli (237-242)
  • aiperf_mock_server (128-187)
  • run (36-59)
tests/integration/models.py (2)
  • AIPerfMockServer (24-35)
  • request_count (138-142)
tests/integration/test_chat_endpoint.py (2)
tests/integration/conftest.py (4)
  • AIPerfCLI (27-87)
  • cli (237-242)
  • aiperf_mock_server (128-187)
  • run (36-59)
tests/integration/models.py (3)
  • AIPerfMockServer (24-35)
  • request_count (138-142)
  • has_streaming_metrics (145-154)
tests/server/test_config.py (1)
tests/aiperf_mock_server/config.py (5)
  • MockServerConfig (16-123)
  • _get_env_key (148-150)
  • _propagate_config_to_env (138-145)
  • _serialize_env_value (153-157)
  • set_server_config (131-135)
tests/integration/models.py (3)
aiperf/common/models/dataset_models.py (2)
  • InputsFile (107-114)
  • SessionPayloads (95-104)
aiperf/common/models/export_models.py (1)
  • JsonExportData (73-117)
aiperf/common/models/record_models.py (1)
  • MetricRecordInfo (121-135)
tests/aiperf_mock_server/__main__.py (3)
tests/integration/conftest.py (1)
  • aiperf_mock_server (128-187)
tests/aiperf_mock_server/config.py (2)
  • MockServerConfig (16-123)
  • set_server_config (131-135)
tests/aiperf_mock_server/app.py (1)
  • root (224-230)
tests/integration/conftest.py (2)
tests/utils/time_traveler.py (1)
  • real_sleep (85-86)
tests/integration/models.py (3)
  • AIPerfMockServer (24-35)
  • AIPerfResults (50-239)
  • AIPerfSubprocessResult (16-20)
tests/aiperf_mock_server/utils.py (3)
tests/utils/time_traveler.py (3)
  • time (48-49)
  • perf_counter (51-52)
  • sleep (37-43)
tests/aiperf_mock_server/models.py (11)
  • ChatCompletionRequest (47-57)
  • ChatDelta (151-156)
  • ChatStreamChoice (171-174)
  • ChatStreamCompletionResponse (197-201)
  • CompletionRequest (60-71)
  • EmbeddingRequest (74-87)
  • RankingRequest (90-110)
  • TextStreamChoice (177-180)
  • TextStreamCompletionResponse (204-208)
  • BaseModel (13-16)
  • include_usage (42-44)
tests/aiperf_mock_server/tokens.py (5)
  • tokenize_request (193-234)
  • count (27-29)
  • create_usage (45-56)
  • reasoning_content (37-43)
  • content (32-34)
tests/aiperf_mock_server/tokens.py (1)
tests/aiperf_mock_server/models.py (12)
  • ChatCompletionRequest (47-57)
  • CompletionRequest (60-71)
  • EmbeddingRequest (74-87)
  • RankingRequest (90-110)
  • Usage (118-124)
  • BaseModel (13-16)
  • total_tokens (108-110)
  • max_output_tokens (55-57)
  • prompt_text (67-71)
  • inputs (81-87)
  • query_text (98-100)
  • passage_texts (103-105)
tests/aiperf_mock_server/app.py (5)
tests/server/conftest.py (1)
  • dcgm_faker (49-51)
tests/aiperf_mock_server/dcgm_faker.py (2)
  • DCGMFaker (117-148)
  • generate (143-148)
tests/aiperf_mock_server/models.py (15)
  • ChatChoice (159-162)
  • ChatCompletionRequest (47-57)
  • ChatMessage (143-148)
  • CompletionRequest (60-71)
  • Embedding (211-216)
  • EmbeddingRequest (74-87)
  • EmbeddingResponse (219-225)
  • Ranking (228-232)
  • RankingRequest (90-110)
  • RankingResponse (235-242)
  • TextChoice (165-168)
  • TextCompletionResponse (190-194)
  • inputs (81-87)
  • query_text (98-100)
  • passage_texts (103-105)
tests/aiperf_mock_server/utils.py (5)
  • RequestContext (84-95)
  • stream_chat_completion (123-144)
  • stream_text_completion (220-237)
  • with_error_injection (39-51)
  • wait_until_completion (93-95)
tests/aiperf_mock_server/tokens.py (3)
  • content (32-34)
  • reasoning_content (37-43)
  • create_usage (45-56)
tests/server/test_dcgm_faker.py (2)
tests/server/conftest.py (2)
  • dcgm_faker (49-51)
  • gpu_config (55-57)
tests/aiperf_mock_server/dcgm_faker.py (6)
  • DCGMFaker (117-148)
  • FakeGPU (54-114)
  • GPUConfig (10-19)
  • update (89-114)
  • set_load (130-132)
  • generate (143-148)
tests/aiperf_mock_server/models.py (3)
tests/aiperf_mock_server/tokens.py (2)
  • content (32-34)
  • reasoning_content (37-43)
tests/server/conftest.py (1)
  • reasoning_effort (175-177)
tests/aiperf_mock_server/app.py (1)
  • rankings (186-209)
🪛 actionlint (1.7.8)
.github/workflows/run-integration-tests.yml

24-24: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🪛 markdownlint-cli2 (0.18.1)
tests/integration/README.md

58-58: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


62-62: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🪛 Ruff (0.14.0)
tests/server/test_config.py

35-35: Possible binding to all interfaces

(S104)


41-41: Possible binding to all interfaces

(S104)

tests/integration/conftest.py

174-177: Avoid specifying long messages outside the exception class

(TRY003)


205-205: Consider [*args, "--artifact-dir", str(temp_output_dir)] instead of concatenation

Replace with [*args, "--artifact-dir", str(temp_output_dir)]

(RUF005)


207-207: Consider [python_exe, "-m", "aiperf", *full_args] instead of concatenation

Replace with [python_exe, "-m", "aiperf", *full_args]

(RUF005)


226-226: Avoid specifying long messages outside the exception class

(TRY003)


239-239: Unused function argument: aiperf_mock_server

(ARG001)

tests/server/test_dcgm_faker.py

39-39: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: build (ubuntu-latest, 3.12)
  • GitHub Check: build (macos-latest, 3.11)
  • GitHub Check: integration-tests (macos-latest, 3.11)
  • GitHub Check: build (ubuntu-latest, 3.10)
  • GitHub Check: build (ubuntu-latest, 3.11)
  • GitHub Check: integration-tests (macos-latest, 3.12)
  • GitHub Check: integration-tests (macos-latest, 3.10)
🔇 Additional comments (46)
pyproject.toml (1)

94-111: LGTM! Well-structured pytest configuration.

The new marker definitions and default deselection strategy cleanly separates integration/performance tests from unit tests, aligning with the new test infrastructure introduced in this PR.

Makefile (1)

181-190: LGTM! Well-designed integration test targets.

The dual targets (parallel and sequential verbose) provide good DX—parallel for speed, sequential for debugging with live output.

tests/aiperf_mock_server/pyproject.toml (1)

19-21: LGTM! Dependencies align with the mock server refactor.

The addition of pydantic-settings and orjson supports the new configuration framework and performance-oriented JSON handling introduced in this PR.

tests/integration/test_default_behavior.py (1)

9-29: LGTM! Well-structured integration test.

The test follows the documented pattern and includes a clear docstring explaining the constraint (non-default port requires explicit URL). Good defensive testing of default behavior.

tests/aiperf_mock_server/dcgm_faker.py (1)

27-27: LGTM! H200 configuration matches specifications.

The 141GB memory and other specs align with NVIDIA H200 specifications. Good addition to the GPU configuration catalog.

tests/integration/test_chat_endpoint.py (2)

14-29: LGTM: Clear basic chat test.

The test correctly validates non-streaming chat completions with appropriate assertions.


31-49: LGTM: Streaming test with proper validation.

The test correctly enables streaming and validates both request count and streaming metrics presence.

tests/integration/utils.py (1)

10-30: LGTM: Well-structured dataset generator.

The function correctly generates rankings datasets. The orjson.dumps(entry).decode("utf-8") pattern is appropriate for writing JSON to text files.
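
For illustration, the pattern boils down to something like this; the entry shape and file name here are hypothetical, and the real fields live in create_rankings_dataset:

import orjson

entry = {"query": {"text": "what is aiperf?"}, "passages": [{"text": "a benchmarking tool"}]}
# orjson.dumps returns bytes, so decode before writing to a text-mode JSONL file.
line = orjson.dumps(entry).decode("utf-8")
with open("rankings.jsonl", "a", encoding="utf-8") as f:
    f.write(line + "\n")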

tests/integration/test_gpu_telemetry.py (1)

16-56: LGTM: Comprehensive telemetry validation.

The test thoroughly validates the nested GPU telemetry structure, ensuring data is present at all levels (endpoints → GPUs → metrics → values).

tests/integration/test_rankings_endpoint.py (1)

19-40: LGTM: Rankings test with custom dataset.

The test correctly generates a dataset and validates the rankings endpoint. Creating 5 entries for 10 requests appropriately tests dataset cycling behavior.

tests/integration/test_embeddings_endpoint.py (1)

16-36: LGTM: Embeddings test with appropriate assertions.

The test correctly validates that embeddings complete successfully and appropriately lack time-to-first-token metrics (which are specific to streaming completions).

tests/server/conftest.py (4)

26-34: LGTM: Proper test isolation with autouse fixture.

The autouse fixture correctly resets server configuration before and after each test, ensuring test isolation. Setting error_rate=0.0 and random_seed=42 provides deterministic test behavior.
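
A rough sketch of that isolation pattern, for context; the fixture body is an assumption, and the real version lives in tests/server/conftest.py:

import pytest

from aiperf_mock_server.config import MockServerConfig, set_server_config


@pytest.fixture(autouse=True)
def _reset_server_config():
    # Deterministic, error-free defaults before each test...
    set_server_config(MockServerConfig(error_rate=0.0, random_seed=42))
    yield
    # ...and a fresh default config afterwards so state cannot leak between tests.
    set_server_config(MockServerConfig())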


42-63: LGTM: Clean component fixtures.

The core component fixtures provide essential testing dependencies with sensible defaults.


86-137: LGTM: Comprehensive request fixtures.

The request fixtures cover all endpoint types (completion, chat, embedding, ranking) with both basic and specialized variants (e.g., chat with reasoning).


174-189: LGTM: Useful parametrize helpers.

The parametrize fixtures enable efficient testing across multiple reasoning efforts, GPU counts, and GPU models.

tests/server/test_models.py (7)

28-43: LGTM: Thorough property testing.

The parametrized test correctly validates the include_usage property across different stream_options configurations.


46-51: LGTM: Edge case coverage.

The test validates that empty strings in prompt lists are properly filtered.


54-72: LGTM: Property precedence validation.

The test correctly validates that max_completion_tokens takes precedence over max_tokens when both are present.


75-102: LGTM: Request property tests.

The embedding and ranking request tests properly validate normalization and extraction properties.


118-262: LGTM: Comprehensive response model coverage.

The tests thoroughly validate all response model types (chat, text, streaming variants) with proper structure and attribute assertions.


164-200: LGTM: Message and delta model tests.

The tests cover both basic and advanced scenarios including reasoning content.


265-280: LGTM: Rankings response validation.

The test validates the ranking response structure including multiple rankings with relevance scores.

tests/server/test_tokens.py (8)

26-33: Good check for reasoning_content derivation

Asserting join of reasoning_content_tokens is spot on and stable.


35-44: Usage calculation assertions are precise

Validates prompt/completion/total and absence of details when reasoning_tokens==0.


45-56: Covers completion_tokens_details when reasoning present

Solid coverage of reasoning token accounting.


62-73: Call style relies on Tokenizer being an instance

Tests call Tokenizer.tokenize/count_tokens as if Tokenizer is a module-level instance. Ensure aiperf_mock_server.tokens exposes an instance (not a class) named Tokenizer. If it’s a class, instantiate it in tests or export an instance in the module.
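
If the module currently exports a class, a minimal instance-export sketch along the lines the tests assume could look like this; the class name and placeholder method bodies are illustrative only:

# In aiperf_mock_server/tokens.py (hypothetical shape):
class _Tokenizer:
    def tokenize(self, text: str) -> list[str]:
        return text.split()  # placeholder; the real tokenizer is deterministic and richer

    def count_tokens(self, text: str) -> int:
        return len(self.tokenize(text))


# Exporting a ready-to-use instance lets tests call Tokenizer.tokenize(...) directly.
Tokenizer = _Tokenizer()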


92-102: Reasoning-effort expectations look good

Covers positive reasoning_tokens and non-empty reasoning_content_tokens.


117-123: Determinism test is valuable

Idempotence across identical requests is essential. Nice.


161-166: High max_tokens ‘stop’ finish_reason

Good guard to ensure natural termination.


1-166: Heads‑up: mutable default in TokenizedText (in tokens.py)

The referenced TokenizedText uses reasoning_content_tokens: list[str] = [] (mutable default). Switch to Field(default_factory=list) to avoid shared state across instances.
Apply in aiperf_mock_server/tokens.py:

-    reasoning_content_tokens: list[str] = []
+    reasoning_content_tokens: list[str] = Field(default_factory=list)

(Import Field from pydantic if not already.)

Likely an incorrect or invalid review comment.

tests/aiperf_mock_server/app.py (2)

78-111: Chat non-streaming response assembly looks correct

Uses ctx.tokenized for content, finish_reason, optional reasoning_content, and usage.


118-146: Text non-streaming path is consistent with chat

Good parity and clear separation of streaming vs non-streaming.

tests/aiperf_mock_server/utils.py (5)

39-51: Error injection is simple and effective

Decorator cleanly simulates HTTP 500 at configured rate.
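
For context, the behavior described amounts to roughly the following. This is a sketch only; the real decorator is with_error_injection in tests/aiperf_mock_server/utils.py, and the config access shown is an assumption:

import functools
import random

from fastapi import HTTPException


def with_error_injection(handler):
    @functools.wraps(handler)
    async def wrapper(*args, **kwargs):
        from aiperf_mock_server.config import server_config

        # Fail a configurable fraction of requests with HTTP 500 to exercise error paths.
        if random.random() < server_config.error_rate:
            raise HTTPException(status_code=500, detail="injected error")
        return await handler(*args, **kwargs)

    return wrapper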


59-82: LatencySimulator is minimal and correct

TTFT on first token, ITL thereafter; uses perf_counter and asyncio.sleep.
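
In outline, the pattern is roughly the following; class and attribute names are illustrative, the real implementation is LatencySimulator in utils.py, and the 20 ms / 5 ms defaults come from the earlier test discussion:

import asyncio
import time


class LatencySketch:
    def __init__(self, ttft_ms: float = 20.0, itl_ms: float = 5.0):
        self.ttft_ms = ttft_ms
        self.itl_ms = itl_ms
        self._started_at: float | None = None

    async def wait_for_next_token(self) -> None:
        if self._started_at is None:
            # The first token pays the time-to-first-token cost.
            self._started_at = time.perf_counter()
            await asyncio.sleep(self.ttft_ms / 1000)
        else:
            # Subsequent tokens pay the inter-token latency.
            await asyncio.sleep(self.itl_ms / 1000)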


147-165: Reasoning token streaming

Role assignment and finish_reason handling are correct (no finish_reason on reasoning chunks).


167-194: Output token streaming

Emits role on first non-reasoning token; finish_reason only on the last chunk. Good.


240-243: SSE framing helper is fine

model_dump_json with exclude_none keeps payloads compact.
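
A minimal sketch of that framing helper; the function name is illustrative, and it assumes the chunk models are Pydantic objects supporting model_dump_json:

from pydantic import BaseModel


def sse_frame(chunk: BaseModel) -> str:
    # exclude_none keeps optional fields (e.g. finish_reason on intermediate chunks) out of the payload
    return f"data: {chunk.model_dump_json(exclude_none=True)}\n\n"


SSE_DONE = "data: [DONE]\n\n"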

tests/aiperf_mock_server/README.md (2)

171-193: DCGM endpoints examples align with intended routing

Examples use /dcgm1/metrics and /dcgm2/metrics. Ensure the app also exposes /dcgm/{instance_id}/metrics or verify that the embedded‑param route matches these forms across FastAPI/Starlette versions.
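
One way to verify this, along the lines the comment suggests, is a small route-resolution test. This is a sketch: test_client is the fixture from tests/server/conftest.py, and the non-empty-text assertion is an assumption about the exposition format:

def test_dcgm_embedded_param_route_resolves(test_client):
    resp = test_client.get("/dcgm1/metrics")
    assert resp.status_code == 200
    assert resp.text  # non-empty Prometheus-style exposition text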


106-127: API parameter docs are clear

Good coverage of stream_options/include_usage, min_tokens, ignore_eos.

tests/integration/models.py (4)

56-65: Artifact loading helpers are straightforward

Good defaults and None/empty fallbacks.


116-136: Pydantic model validations add safety

Nice explicit asserts per artifact type.


170-175: Metric presence helper is concise

all metrics check via getattr is clean.


176-214: Media detection covers OpenAI message format and top‑level fields

Solid coverage for images/audio/video presence detection.

tests/server/test_dcgm_faker.py (1)

136-142: Determinism test is solid.

Seeding produces identical metric snapshots; good coverage.

tests/aiperf_mock_server/models.py (2)

31-45: Streaming usage flag logic looks correct.


228-243: Response shapes align with endpoints; fields are minimal yet sufficient.

Comment threads (not expanded here):
  • .github/workflows/run-integration-tests.yml
  • Makefile (outdated)
  • tests/aiperf_mock_server/app.py
  • tests/aiperf_mock_server/app.py (outdated)
  • tests/aiperf_mock_server/config.py
  • tests/aiperf_mock_server/utils.py
  • tests/integration/conftest.py
  • tests/integration/conftest.py
  • tests/integration/models.py
  • tests/server/test_config.py (outdated)
ajcasagrande merged commit c6e9cde into main on Oct 18, 2025
13 checks passed
ajcasagrande deleted the ajc/new-mock branch on October 18, 2025 at 02:28