feat: e2e integration tests with new mock server all endpoints #377

ajcasagrande merged 8 commits into main
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.
Walkthrough

This pull request significantly refactors the AIPerf Mock Server from a basic integration test server into a production-grade mock supporting multiple OpenAI API endpoints, GPU telemetry via DCGM, latency simulation, reasoning models, error injection, and streaming responses. Accompanying changes include new test infrastructure, a configuration framework, and comprehensive unit and integration tests.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Rationale: This refactor introduces substantial architectural changes across multiple interconnected modules (app.py, models.py, config.py) with new complex logic (tokenization, streaming, latency simulation). The changes exhibit high heterogeneity across concerns (API endpoints, configuration, telemetry, tests), requiring separate reasoning for each area. While individual test files are straightforward, the mock server implementation modules contain moderate-to-high logic density. The extensive test coverage and documentation changes increase total scope but provide validation context. There is no single homogeneous pattern; multiple distinct refactors demand careful review of interactions and backward-compatibility implications.
Pre-merge checks

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 15
🧹 Nitpick comments (19)
tests/integration/README.md (1)
56-65: Use proper heading syntax instead of bold emphasis. Lines 58 and 62 use bold text as pseudo-headings. Use proper Markdown heading syntax for better semantic structure and accessibility.
Apply this diff:
```diff
 ## Key Components

-**Fixtures (conftest.py)**
+### Fixtures (conftest.py)
+
 - `aiperf_mock_server: AIPerfMockServer` - Mock LLM server instance
 - `cli: AIPerfCLI` - CLI wrapper for running AIPerf commands

-**Models (models.py)**
+### Models (models.py)
+
 - `AIPerfResults` - Result wrapper with typed properties for all output artifacts
 - `AIPerfMockServer` - Server connection info
 - `AIPerfSubprocessResult` - Subprocess execution result
```

tests/integration/test_completions_endpoint.py (1)

16-31: LGTM: Basic completions test is correct. Consider adding a streaming completions test similar to `test_streaming_chat` for completeness (see the sketch below), though non-streaming coverage is sufficient for now.
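A possible shape for that follow-up, sketched from the chat tests' pattern. The fixture and property names (`cli.run`, `request_count`, `has_streaming_metrics`) follow what this review references elsewhere; the specific CLI flags are assumptions:

```python
import pytest


@pytest.mark.integration
async def test_streaming_completions(cli, aiperf_mock_server):
    # Hypothetical flags; mirror whatever test_streaming_chat passes,
    # swapping the endpoint type to completions.
    results = await cli.run(
        [
            "--url", aiperf_mock_server.url,
            "--endpoint-type", "completions",
            "--streaming",
            "--request-count", "10",
        ]
    )
    assert results.request_count == 10
    assert results.has_streaming_metrics
```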
tests/server/test_utils.py (3)
27-38: Avoid hard-coded latency expectations; derive them from config. Coupling tests to the 20ms/5ms defaults will break if the config changes. Read ttft/itl from server_config to compute the expected seconds.
Apply:
```diff
@@
-    async def test_wait_for_next_token(self, time_traveler: TimeTraveler):
-        sim = LatencySimulator()
-        with time_traveler.sleeps_for(expected_seconds=0.02):
+    async def test_wait_for_next_token(self, time_traveler: TimeTraveler):
+        from aiperf_mock_server.config import server_config
+        sim = LatencySimulator()
+        with time_traveler.sleeps_for(expected_seconds=server_config.ttft * 0.001):
             await sim.wait_for_next_token()
@@
-    async def test_wait_for_tokens_multiple(self, time_traveler: TimeTraveler):
-        sim = LatencySimulator()
-        with time_traveler.sleeps_for(expected_seconds=0.02 + (0.005 * 5)):
+    async def test_wait_for_tokens_multiple(self, time_traveler: TimeTraveler):
+        from aiperf_mock_server.config import server_config
+        sim = LatencySimulator()
+        with time_traveler.sleeps_for(
+            expected_seconds=(server_config.ttft + server_config.itl * 5) * 0.001
+        ):
             await sim.wait_for_tokens(num_tokens=5)
```
6-10: Expand request ID coverage for embeddings and rankings. Test `create_request_id` for all supported request types.
Apply:
```diff
@@
-from aiperf_mock_server.models import (
-    ChatCompletionRequest,
-    CompletionRequest,
-    Message,
-)
+from aiperf_mock_server.models import (
+    ChatCompletionRequest,
+    CompletionRequest,
+    EmbeddingRequest,
+    RankingRequest,
+    Message,
+)
@@
     [
         (CompletionRequest(model="test", prompt="test"), "cmpl-"),
         (
             ChatCompletionRequest(
                 model="test", messages=[Message(role="user", content="test")]
             ),
             "chatcmpl-",
         ),
+        (EmbeddingRequest(model="test", input="hello"), "emb-"),
+        (
+            RankingRequest(
+                model="test",
+                query={"text": "q"},
+                passages=[{"text": "p1"}],
+            ),
+            "rank-",
+        ),
     ],
```
106-131: Strengthen SSE assertions by parsing JSON instead of substring matching. Validate that each non-`[DONE]` SSE chunk is valid JSON and contains the expected keys; assert usage presence robustly.
Apply:
```diff
@@
-        chunks = []
+        import json
+        chunks = []
         async for chunk in stream_text_completion(ctx):
             chunks.append(chunk)
@@
-        assert any("data:" in chunk for chunk in chunks)
+        payloads = [
+            json.loads(c[len("data: ") :])
+            for c in chunks
+            if c.startswith("data: ") and c.strip() != "data: [DONE]"
+        ]
+        assert payloads and all("id" in p and "model" in p for p in payloads)
@@
-        chunks = []
+        import json
+        chunks = []
         async for chunk in stream_text_completion(ctx):
             chunks.append(chunk)
@@
-        assert any("usage" in chunk for chunk in chunks)
+        usage_chunks = [
+            json.loads(c[len("data: ") :])
+            for c in chunks
+            if c.startswith("data: ") and c.strip() != "data: [DONE]"
+        ]
+        assert any("usage" in p for p in usage_chunks)
@@
-        chunks = []
+        import json
+        chunks = []
         async for chunk in stream_chat_completion(ctx):
             chunks.append(chunk)
@@
-        assert any("data:" in chunk for chunk in chunks)
+        payloads = [
+            json.loads(c[len("data: ") :])
+            for c in chunks
+            if c.startswith("data: ") and c.strip() != "data: [DONE]"
+        ]
+        assert payloads and all("id" in p and "model" in p for p in payloads)
@@
-        chunks = []
+        import json
+        chunks = []
         async for chunk in stream_chat_completion(ctx):
             chunks.append(chunk)
@@
-        assert any("usage" in chunk for chunk in chunks)
+        usage_chunks = [
+            json.loads(c[len("data: ") :])
+            for c in chunks
+            if c.startswith("data: ") and c.strip() != "data: [DONE]"
+        ]
+        assert any("usage" in p for p in usage_chunks)
```

Optional: add the time_traveler fixture to streaming tests to avoid wall-clock sleeps.
Also applies to: 136-150, 166-179
tests/server/test_config.py (1)
32-45: Caution: 0.0.0.0 binding (S104). Using `host="0.0.0.0"` is valid but exposes the server on all interfaces. If used outside CI, ensure it's intentional and documented.
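If a loopback default is preferred for local runs, a minimal sketch, assuming the host/port fields on `MockServerConfig`:

```python
from aiperf_mock_server.config import MockServerConfig, set_server_config

# Bind to loopback for local development; reserve "0.0.0.0" for CI
# containers where exposing the port on all interfaces is intentional.
set_server_config(MockServerConfig(host="127.0.0.1", port=8000))
```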
tests/aiperf_mock_server/__main__.py (1)
39-46: Keep access_log user-controlled. Auto-enabling access logs when `log_level=debug` overrides an explicit `config.access_logs=False`. Prefer honoring the flag as the source of truth.
Apply:
```diff
-        access_log=config.access_logs or config.log_level.lower() == "debug",
+        access_log=config.access_logs,
```

If you want auto-enable in debug, document it and add an explicit `--access-logs=auto` mode.
tests/server/test_app.py (1)
13-15: Avoid pinning the exact version in tests. Hard-coding "2.0.0" creates churn on version bumps. Assert presence/format, or import a single source of truth.
Apply:
```diff
-    assert data["message"] == "AIPerf Mock Server"
-    assert data["version"] == "2.0.0"
+    assert data["message"] == "AIPerf Mock Server"
+    assert "version" in data and isinstance(data["version"], str) and data["version"]
```

tests/server/test_tokens.py (2)
5-10: Avoid importing the private `_tokenize` in tests unless necessary. Prefer exercising the public surface (`Tokenizer.tokenize`/`count_tokens`) to reduce test fragility when internals change. Keep one direct test for `_tokenize` if it guards a critical behavior; otherwise remove it.
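A sketch of a test that stays on the public surface, assuming `Tokenizer` is exposed as the module-level instance the later comments describe:

```python
from aiperf_mock_server.tokens import Tokenizer


def test_tokenize_via_public_surface():
    # count_tokens should agree with the token list from tokenize
    tokens = Tokenizer.tokenize("hello world")
    assert Tokenizer.count_tokens("hello world") == len(tokens)
```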
103-109: ignore_eos behavior assertion. The finish reason "length" check is useful. Consider also asserting `count == max_tokens` when `ignore_eos` is true, to catch regressions.
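The suggested regression guard might look like this; the request fields mirror `CompletionRequest` as used in these tests, while the exact return shape of `tokenize_request` (`count`, `finish_reason`) is an assumption:

```python
from aiperf_mock_server.models import CompletionRequest
from aiperf_mock_server.tokens import tokenize_request


def test_ignore_eos_emits_exactly_max_tokens():
    req = CompletionRequest(
        model="test", prompt="hello", max_tokens=32, ignore_eos=True
    )
    tokenized = tokenize_request(req)
    # with ignore_eos, generation must not stop early
    assert tokenized.count == 32
    assert tokenized.finish_reason == "length"
```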
tests/aiperf_mock_server/app.py (2)
56-61: dcgm_fakers grows across app restarts. On reloads, you append without clearing, leaking instances. Clear before appending.
```diff
-    dcgm_fakers.append(_create_dcgm_faker(server_config.dcgm_seed))
+    dcgm_fakers.clear()
+    dcgm_fakers.append(_create_dcgm_faker(server_config.dcgm_seed))
```
238-244: Path pattern `/dcgm{instance_id:int}/metrics`. This intends to match `/dcgm1/metrics`. Starlette typically supports `{param}`, but params embedded mid-segment can be finicky. Add a companion route `/dcgm/{instance_id}/metrics` for safety, or add an explicit test ensuring both forms resolve.

```diff
 @app.get("/dcgm{instance_id:int}/metrics")
 async def dcgm_metrics(instance_id: int) -> PlainTextResponse:
     ...
+
+@app.get("/dcgm/{instance_id}/metrics")
+async def dcgm_metrics_slash(instance_id: int) -> PlainTextResponse:
+    return await dcgm_metrics(instance_id)
```
157-167: Tighten the health-check backoff; the current worst case is ≈200s. Reduce attempts and per-try timeout for faster failure.
```diff
-    for _ in range(100):
+    for _ in range(50):
         try:
             async with session.get(
-                f"{url}/health", timeout=aiohttp.ClientTimeout(total=2)
+                f"{url}/health", timeout=aiohttp.ClientTimeout(total=0.5)
             ) as resp:
                 if resp.status == 200:
                     break
```

Also applies to: 160-165
205-208: Use splat expansion; it is cleaner and matches the Ruff hint (RUF005).

```diff
-    full_args = args + ["--artifact-dir", str(temp_output_dir)]
+    full_args = [*args, "--artifact-dir", str(temp_output_dir)]
-    cmd = [python_exe, "-m", "aiperf"] + full_args
+    cmd = [python_exe, "-m", "aiperf", *full_args]
```
151-153: Consider capturing stderr for failed-startup diagnostics. Silencing server logs complicates debugging when /health never turns green.
```diff
-        stdout=asyncio.subprocess.DEVNULL,
-        stderr=asyncio.subprocess.DEVNULL,
+        stdout=asyncio.subprocess.DEVNULL,
+        stderr=asyncio.subprocess.PIPE,  # capture errors for debugging
```

Optionally surface stderr in the RuntimeError message when startup fails.
236-242: Clarify the fixture dependency; silence ARG001. The parameter is intentionally unused (it ensures the server lifecycle); rename it.
```diff
-def cli(
-    aiperf_runner: Callable[[list[str], float], AIPerfSubprocessResult],
-    aiperf_mock_server: AIPerfMockServer,
-) -> AIPerfCLI:
+def cli(
+    aiperf_runner: Callable[[list[str], float], AIPerfSubprocessResult],
+    _aiperf_mock_server: AIPerfMockServer,  # ensures server is running
+) -> AIPerfCLI:
```

tests/aiperf_mock_server/tokens.py (2)
310-314: Make the seed stable across runs (Python hash is randomized). `hash()` varies per process; prefer a small stable digest.
```diff
-    sample = prompt_tokens[:5]
-    return hash(tuple(sample)) % 1000
+    import hashlib
+    sample = "".join(prompt_tokens[:5]).encode("utf-8", "ignore")
+    return int.from_bytes(hashlib.blake2s(sample, digest_size=4).digest(), "big")
```
254-269: Tighten the typing for messages. This helps static analysis and IDEs.
```diff
-    def _extract_chat_messages(self, messages: list) -> str:
+    def _extract_chat_messages(self, messages: list["Message"]) -> str:
```

tests/server/test_dcgm_faker.py (1)
39-41: Ignore S311 here; the randomness is non-crypto test scaffolding. No action needed.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (32)
- `.github/workflows/run-integration-tests.yml` (1 hunks)
- `Makefile` (3 hunks)
- `pyproject.toml` (2 hunks)
- `tests/aiperf_mock_server/README.md` (1 hunks)
- `tests/aiperf_mock_server/__main__.py` (1 hunks)
- `tests/aiperf_mock_server/app.py` (1 hunks)
- `tests/aiperf_mock_server/config.py` (2 hunks)
- `tests/aiperf_mock_server/dcgm_faker.py` (1 hunks)
- `tests/aiperf_mock_server/models.py` (1 hunks)
- `tests/aiperf_mock_server/pyproject.toml` (1 hunks)
- `tests/aiperf_mock_server/tokenizer_service.py` (0 hunks)
- `tests/aiperf_mock_server/tokens.py` (1 hunks)
- `tests/aiperf_mock_server/utils.py` (1 hunks)
- `tests/conftest.py` (0 hunks)
- `tests/integration/README.md` (1 hunks)
- `tests/integration/conftest.py` (1 hunks)
- `tests/integration/models.py` (1 hunks)
- `tests/integration/test_chat_endpoint.py` (1 hunks)
- `tests/integration/test_completions_endpoint.py` (1 hunks)
- `tests/integration/test_default_behavior.py` (1 hunks)
- `tests/integration/test_embeddings_endpoint.py` (1 hunks)
- `tests/integration/test_gpu_telemetry.py` (1 hunks)
- `tests/integration/test_rankings_endpoint.py` (1 hunks)
- `tests/integration/utils.py` (1 hunks)
- `tests/server/__init__.py` (1 hunks)
- `tests/server/conftest.py` (1 hunks)
- `tests/server/test_app.py` (1 hunks)
- `tests/server/test_config.py` (1 hunks)
- `tests/server/test_dcgm_faker.py` (1 hunks)
- `tests/server/test_models.py` (1 hunks)
- `tests/server/test_tokens.py` (1 hunks)
- `tests/server/test_utils.py` (1 hunks)
💤 Files with no reviewable changes (2)
- tests/conftest.py
- tests/aiperf_mock_server/tokenizer_service.py
🧰 Additional context used
🧬 Code graph analysis (20)
tests/server/test_models.py (1)
- tests/aiperf_mock_server/models.py (24): ChatChoice (159-162), ChatCompletionRequest (47-57), ChatCompletionResponse (183-187), ChatDelta (151-156), ChatMessage (143-148), ChatStreamChoice (171-174), ChatStreamCompletionResponse (197-201), CompletionRequest (60-71), EmbeddingRequest (74-87), Message (24-28), Ranking (228-232), RankingRequest (90-110), RankingResponse (235-242), TextChoice (165-168), TextCompletionResponse (190-194), TextStreamChoice (177-180), TextStreamCompletionResponse (204-208), Usage (118-124), include_usage (42-44), prompt_text (67-71), max_output_tokens (55-57), inputs (81-87), passage_texts (103-105), total_tokens (108-110)

tests/server/conftest.py (6)
- tests/integration/conftest.py (1): aiperf_mock_server (128-187)
- tests/aiperf_mock_server/config.py (2): MockServerConfig (16-123), set_server_config (131-135)
- tests/aiperf_mock_server/dcgm_faker.py (1): DCGMFaker (117-148)
- tests/aiperf_mock_server/models.py (5): ChatCompletionRequest (47-57), CompletionRequest (60-71), EmbeddingRequest (74-87), Message (24-28), RankingRequest (90-110)
- tests/aiperf_mock_server/tokens.py (2): TokenizedText (16-56), content (32-34)
- aiperf/common/tokenizer.py (1): Tokenizer (25-161)

tests/integration/test_completions_endpoint.py (2)
- tests/integration/conftest.py (4): AIPerfCLI (27-87), cli (237-242), aiperf_mock_server (128-187), run (36-59)
- tests/integration/models.py (2): AIPerfMockServer (24-35), request_count (138-142)

tests/server/test_tokens.py (2)
- tests/aiperf_mock_server/models.py (4): ChatCompletionRequest (47-57), CompletionRequest (60-71), Message (24-28), total_tokens (108-110)
- tests/aiperf_mock_server/tokens.py (9): TokenizedText (16-56), _tokenize (155-179), reasoning_content (37-43), create_usage (45-56), tokenize (185-187), count (27-29), count_tokens (189-191), tokenize_request (193-234), content (32-34)

tests/integration/test_embeddings_endpoint.py (2)
- tests/integration/conftest.py (4): AIPerfCLI (27-87), cli (237-242), aiperf_mock_server (128-187), run (36-59)
- tests/integration/models.py (2): AIPerfMockServer (24-35), request_count (138-142)

tests/server/test_utils.py (3)
- tests/aiperf_mock_server/models.py (3): ChatCompletionRequest (47-57), CompletionRequest (60-71), Message (24-28)
- tests/aiperf_mock_server/utils.py (8): LatencySimulator (59-81), RequestContext (84-95), create_request_id (103-115), stream_chat_completion (123-144), stream_text_completion (220-237), with_error_injection (39-51), wait_for_next_token (70-73), wait_for_tokens (75-81)
- tests/utils/time_traveler.py (3): time_traveler (114-125), TimeTraveler (20-110), sleeps_for (92-110)

tests/server/test_app.py (1)
- tests/server/conftest.py (1): test_client (61-63)

tests/integration/test_rankings_endpoint.py (3)
- tests/integration/conftest.py (4): AIPerfCLI (27-87), cli (237-242), aiperf_mock_server (128-187), run (36-59)
- tests/integration/models.py (2): AIPerfMockServer (24-35), request_count (138-142)
- tests/integration/utils.py (1): create_rankings_dataset (10-30)

tests/integration/test_gpu_telemetry.py (2)
- tests/integration/conftest.py (3): AIPerfCLI (27-87), aiperf_mock_server (128-187), run (36-59)
- tests/integration/models.py (4): AIPerfMockServer (24-35), dcgm_urls (33-35), request_count (138-142), has_gpu_telemetry (231-233)

tests/integration/test_default_behavior.py (2)
- tests/integration/conftest.py (4): AIPerfCLI (27-87), cli (237-242), aiperf_mock_server (128-187), run (36-59)
- tests/integration/models.py (2): AIPerfMockServer (24-35), request_count (138-142)

tests/integration/test_chat_endpoint.py (2)
- tests/integration/conftest.py (4): AIPerfCLI (27-87), cli (237-242), aiperf_mock_server (128-187), run (36-59)
- tests/integration/models.py (3): AIPerfMockServer (24-35), request_count (138-142), has_streaming_metrics (145-154)

tests/server/test_config.py (1)
- tests/aiperf_mock_server/config.py (5): MockServerConfig (16-123), _get_env_key (148-150), _propagate_config_to_env (138-145), _serialize_env_value (153-157), set_server_config (131-135)

tests/integration/models.py (3)
- aiperf/common/models/dataset_models.py (2): InputsFile (107-114), SessionPayloads (95-104)
- aiperf/common/models/export_models.py (1): JsonExportData (73-117)
- aiperf/common/models/record_models.py (1): MetricRecordInfo (121-135)

tests/aiperf_mock_server/__main__.py (3)
- tests/integration/conftest.py (1): aiperf_mock_server (128-187)
- tests/aiperf_mock_server/config.py (2): MockServerConfig (16-123), set_server_config (131-135)
- tests/aiperf_mock_server/app.py (1): root (224-230)

tests/integration/conftest.py (2)
- tests/utils/time_traveler.py (1): real_sleep (85-86)
- tests/integration/models.py (3): AIPerfMockServer (24-35), AIPerfResults (50-239), AIPerfSubprocessResult (16-20)

tests/aiperf_mock_server/utils.py (3)
- tests/utils/time_traveler.py (3): time (48-49), perf_counter (51-52), sleep (37-43)
- tests/aiperf_mock_server/models.py (11): ChatCompletionRequest (47-57), ChatDelta (151-156), ChatStreamChoice (171-174), ChatStreamCompletionResponse (197-201), CompletionRequest (60-71), EmbeddingRequest (74-87), RankingRequest (90-110), TextStreamChoice (177-180), TextStreamCompletionResponse (204-208), BaseModel (13-16), include_usage (42-44)
- tests/aiperf_mock_server/tokens.py (5): tokenize_request (193-234), count (27-29), create_usage (45-56), reasoning_content (37-43), content (32-34)

tests/aiperf_mock_server/tokens.py (1)
- tests/aiperf_mock_server/models.py (12): ChatCompletionRequest (47-57), CompletionRequest (60-71), EmbeddingRequest (74-87), RankingRequest (90-110), Usage (118-124), BaseModel (13-16), total_tokens (108-110), max_output_tokens (55-57), prompt_text (67-71), inputs (81-87), query_text (98-100), passage_texts (103-105)

tests/aiperf_mock_server/app.py (5)
- tests/server/conftest.py (1): dcgm_faker (49-51)
- tests/aiperf_mock_server/dcgm_faker.py (2): DCGMFaker (117-148), generate (143-148)
- tests/aiperf_mock_server/models.py (15): ChatChoice (159-162), ChatCompletionRequest (47-57), ChatMessage (143-148), CompletionRequest (60-71), Embedding (211-216), EmbeddingRequest (74-87), EmbeddingResponse (219-225), Ranking (228-232), RankingRequest (90-110), RankingResponse (235-242), TextChoice (165-168), TextCompletionResponse (190-194), inputs (81-87), query_text (98-100), passage_texts (103-105)
- tests/aiperf_mock_server/utils.py (5): RequestContext (84-95), stream_chat_completion (123-144), stream_text_completion (220-237), with_error_injection (39-51), wait_until_completion (93-95)
- tests/aiperf_mock_server/tokens.py (3): content (32-34), reasoning_content (37-43), create_usage (45-56)

tests/server/test_dcgm_faker.py (2)
- tests/server/conftest.py (2): dcgm_faker (49-51), gpu_config (55-57)
- tests/aiperf_mock_server/dcgm_faker.py (6): DCGMFaker (117-148), FakeGPU (54-114), GPUConfig (10-19), update (89-114), set_load (130-132), generate (143-148)

tests/aiperf_mock_server/models.py (3)
- tests/aiperf_mock_server/tokens.py (2): content (32-34), reasoning_content (37-43)
- tests/server/conftest.py (1): reasoning_effort (175-177)
- tests/aiperf_mock_server/app.py (1): rankings (186-209)
🪛 actionlint (1.7.8)
.github/workflows/run-integration-tests.yml
24-24: the runner of "actions/setup-python@v4" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🪛 markdownlint-cli2 (0.18.1)
tests/integration/README.md
58-58: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
62-62: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🪛 Ruff (0.14.0)
tests/server/test_config.py
35-35: Possible binding to all interfaces
(S104)
41-41: Possible binding to all interfaces
(S104)
tests/integration/conftest.py
174-177: Avoid specifying long messages outside the exception class
(TRY003)
205-205: Consider [*args, "--artifact-dir", str(temp_output_dir)] instead of concatenation
Replace with [*args, "--artifact-dir", str(temp_output_dir)]
(RUF005)
207-207: Consider [python_exe, "-m", "aiperf", *full_args] instead of concatenation
Replace with [python_exe, "-m", "aiperf", *full_args]
(RUF005)
226-226: Avoid specifying long messages outside the exception class
(TRY003)
239-239: Unused function argument: aiperf_mock_server
(ARG001)
tests/server/test_dcgm_faker.py
39-39: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: build (ubuntu-latest, 3.12)
- GitHub Check: build (macos-latest, 3.11)
- GitHub Check: integration-tests (macos-latest, 3.11)
- GitHub Check: build (ubuntu-latest, 3.10)
- GitHub Check: build (ubuntu-latest, 3.11)
- GitHub Check: integration-tests (macos-latest, 3.12)
- GitHub Check: integration-tests (macos-latest, 3.10)
🔇 Additional comments (46)
pyproject.toml (1)
94-111: LGTM! Well-structured pytest configuration. The new marker definitions and default deselection strategy cleanly separate integration/performance tests from unit tests, aligning with the new test infrastructure introduced in this PR.
Makefile (1)
181-190: LGTM! Well-designed integration test targets. The dual targets (parallel, and sequential verbose) provide good DX: parallel for speed, sequential for debugging with live output.
tests/aiperf_mock_server/pyproject.toml (1)
19-21: LGTM! Dependencies align with the mock server refactor. The addition of `pydantic-settings` and `orjson` supports the new configuration framework and performance-oriented JSON handling introduced in this PR.

tests/integration/test_default_behavior.py (1)
9-29: LGTM! Well-structured integration test. The test follows the documented pattern and includes a clear docstring explaining the constraint (a non-default port requires an explicit URL). Good defensive testing of default behavior.
tests/aiperf_mock_server/dcgm_faker.py (1)
27-27: LGTM! H200 configuration matches specifications. The 141GB memory and other specs align with NVIDIA H200 specifications. Good addition to the GPU configuration catalog.
tests/integration/test_chat_endpoint.py (2)
14-29: LGTM: Clear basic chat test. The test correctly validates non-streaming chat completions with appropriate assertions.
31-49: LGTM: Streaming test with proper validation. The test correctly enables streaming and validates both request count and streaming metrics presence.
tests/integration/utils.py (1)
10-30: LGTM: Well-structured dataset generator. The function correctly generates rankings datasets. The `orjson.dumps(entry).decode("utf-8")` pattern is appropriate for writing JSON to text files (illustrated below).
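For reference, the pattern amounts to decoding orjson's bytes output before writing line-delimited JSON; a small illustration (the file name and entry shape are arbitrary):

```python
import orjson

entry = {"query": {"text": "q"}, "passages": [{"text": "p1"}]}
with open("rankings.jsonl", "a", encoding="utf-8") as f:
    # orjson.dumps returns bytes, so decode before writing to a text-mode file
    f.write(orjson.dumps(entry).decode("utf-8") + "\n")
```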
tests/integration/test_gpu_telemetry.py (1)
16-56: LGTM: Comprehensive telemetry validation. The test thoroughly validates the nested GPU telemetry structure, ensuring data is present at all levels (endpoints → GPUs → metrics → values).
tests/integration/test_rankings_endpoint.py (1)
19-40: LGTM: Rankings test with a custom dataset. The test correctly generates a dataset and validates the rankings endpoint. Creating 5 entries for 10 requests appropriately tests dataset-cycling behavior.
tests/integration/test_embeddings_endpoint.py (1)
16-36: LGTM: Embeddings test with appropriate assertions. The test correctly validates that embeddings complete successfully and appropriately lack time-to-first-token metrics (which are specific to streaming completions).
tests/server/conftest.py (4)
26-34: LGTM: Proper test isolation with an autouse fixture. The autouse fixture correctly resets server configuration before and after each test, ensuring test isolation. Setting `error_rate=0.0` and `random_seed=42` provides deterministic test behavior.
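The pattern being praised, sketched from the described behavior (the exact reset body is an assumption):

```python
import pytest

from aiperf_mock_server.config import MockServerConfig, set_server_config


@pytest.fixture(autouse=True)
def reset_server_config():
    # deterministic, error-free config before each test...
    set_server_config(MockServerConfig(error_rate=0.0, random_seed=42))
    yield
    # ...and a clean default afterwards
    set_server_config(MockServerConfig())
```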
42-63: LGTM: Clean component fixtures. The core component fixtures provide essential testing dependencies with sensible defaults.
86-137: LGTM: Comprehensive request fixtures. The request fixtures cover all endpoint types (completion, chat, embedding, ranking) with both basic and specialized variants (e.g., chat with reasoning).
174-189: LGTM: Useful parametrize helpers. The parametrize fixtures enable efficient testing across multiple reasoning efforts, GPU counts, and GPU models.
tests/server/test_models.py (7)
28-43: LGTM: Thorough property testing. The parametrized test correctly validates the `include_usage` property across different `stream_options` configurations.
46-51: LGTM: Edge case coverage. The test validates that empty strings in prompt lists are properly filtered.
54-72: LGTM: Property precedence validation. The test correctly validates that `max_completion_tokens` takes precedence over `max_tokens` when both are present.
75-102: LGTM: Request property tests. The embedding and ranking request tests properly validate normalization and extraction properties.
118-262: LGTM: Comprehensive response model coverage. The tests thoroughly validate all response model types (chat, text, streaming variants) with proper structure and attribute assertions.
164-200: LGTM: Message and delta model tests. The tests cover both basic and advanced scenarios, including reasoning content.
265-280: LGTM: Rankings response validation. The test validates the ranking response structure, including multiple rankings with relevance scores.
tests/server/test_tokens.py (8)
26-33: Good check for reasoning_content derivation. Asserting the join of reasoning_content_tokens is spot on and stable.
35-44: Usage calculation assertions are precise. Validates prompt/completion/total counts and the absence of details when reasoning_tokens == 0.
45-56: Covers completion_tokens_details when reasoning is present. Solid coverage of reasoning token accounting.
62-73: Call style relies on Tokenizer being an instance. The tests call `Tokenizer.tokenize`/`count_tokens` as if Tokenizer is a module-level instance. Ensure aiperf_mock_server.tokens exposes an instance (not a class) named Tokenizer. If it's a class, instantiate it in the tests, or export an instance from the module.
92-102: Reasoning-effort expectations look good. Covers positive reasoning_tokens and non-empty reasoning_content_tokens.
117-123: Determinism test is valuable. Idempotence across identical requests is essential. Nice.
161-166: High max_tokens 'stop' finish_reason. Good guard to ensure natural termination.
1-166: Heads-up: mutable default in TokenizedText (in tokens.py). The referenced TokenizedText uses `reasoning_content_tokens: list[str] = []` (a mutable default). Switch to `Field(default_factory=list)` to avoid shared state across instances.
Apply in aiperf_mock_server/tokens.py:

```diff
-    reasoning_content_tokens: list[str] = []
+    reasoning_content_tokens: list[str] = Field(default_factory=list)
```

(Import Field from pydantic if not already.)
Likely an incorrect or invalid review comment.
tests/aiperf_mock_server/app.py (2)
78-111: Chat non-streaming response assembly looks correct. Uses ctx.tokenized for content, finish_reason, optional reasoning_content, and usage.
118-146: Text non-streaming path is consistent with chat. Good parity and clear separation of streaming vs non-streaming.
tests/aiperf_mock_server/utils.py (5)
39-51: Error injection is simple and effective. The decorator cleanly simulates HTTP 500 at the configured rate.
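For readers without the diff open, a rate-based injector can be sketched like this; the PR's actual decorator lives in tests/aiperf_mock_server/utils.py and its signature may differ:

```python
import functools
import random

from fastapi import HTTPException


def with_error_injection(error_rate: float):
    """Fail a handler with HTTP 500 at the given rate (non-crypto randomness)."""
    def decorator(handler):
        @functools.wraps(handler)
        async def wrapper(*args, **kwargs):
            if random.random() < error_rate:
                raise HTTPException(status_code=500, detail="injected error")
            return await handler(*args, **kwargs)
        return wrapper
    return decorator
```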
59-82: LatencySimulator is minimal and correct. TTFT on the first token, ITL thereafter; uses perf_counter and asyncio.sleep.
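The TTFT/ITL behavior the unit tests above encode, as a sketch; the millisecond fields and the `wait_for_tokens` accounting are assumptions inferred from those tests:

```python
import asyncio


class LatencySimulator:
    """Sleep TTFT before the first token, ITL between subsequent tokens."""

    def __init__(self, ttft_ms: float = 20.0, itl_ms: float = 5.0):
        self.ttft_ms = ttft_ms
        self.itl_ms = itl_ms
        self._first = True

    async def wait_for_next_token(self) -> None:
        delay_ms = self.ttft_ms if self._first else self.itl_ms
        self._first = False
        await asyncio.sleep(delay_ms / 1000.0)

    async def wait_for_tokens(self, num_tokens: int) -> None:
        # matches the test expectation of ttft + itl * num_tokens seconds
        await self.wait_for_next_token()
        for _ in range(num_tokens):
            await asyncio.sleep(self.itl_ms / 1000.0)
```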
147-165: Reasoning token streaming. Role assignment and finish_reason handling are correct (no finish_reason on reasoning chunks).
167-194: Output token streaming. Emits the role on the first non-reasoning token; finish_reason only on the last chunk. Good.
240-243: SSE framing helper is fine. model_dump_json with exclude_none keeps payloads compact.
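The framing in question reduces to wrapping the compact JSON dump in the SSE envelope; a sketch (the helper name is illustrative):

```python
from pydantic import BaseModel


def format_sse(chunk: BaseModel) -> str:
    # exclude_none keeps optional fields out of the payload
    return f"data: {chunk.model_dump_json(exclude_none=True)}\n\n"


DONE_FRAME = "data: [DONE]\n\n"  # OpenAI-style terminal sentinel
```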
tests/aiperf_mock_server/README.md (2)
171-193: DCGM endpoint examples align with the intended routing. The examples use `/dcgm1/metrics` and `/dcgm2/metrics`. Ensure the app also exposes `/dcgm/{instance_id}/metrics`, or verify that the embedded-param route matches these forms across FastAPI/Starlette versions.
106-127: API parameter docs are clear. Good coverage of stream_options/include_usage, min_tokens, and ignore_eos.
tests/integration/models.py (4)
56-65: Artifact loading helpers are straightforward. Good defaults and None/empty fallbacks.
116-136: Pydantic model validations add safety. Nice explicit asserts per artifact type.
170-175: Metric presence helper is concise. The all-metrics check via getattr is clean.
176-214: Media detection covers the OpenAI message format and top-level fields. Solid coverage for image/audio/video presence detection.
tests/server/test_dcgm_faker.py (1)
136-142: Determinism test is solid. Seeding produces identical metric snapshots; good coverage.
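The property boils down to seed-equal fakers producing identical output; a sketch, assuming DCGMFaker accepts a seed argument as the fixtures in this PR suggest:

```python
from aiperf_mock_server.dcgm_faker import DCGMFaker


def test_same_seed_same_snapshot():
    a = DCGMFaker(seed=42)  # seed kwarg assumed from this PR's fixtures
    b = DCGMFaker(seed=42)
    assert a.generate() == b.generate()
```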
tests/aiperf_mock_server/models.py (2)
31-45: Streaming usage flag logic looks correct.
228-243: Response shapes align with endpoints; fields are minimal yet sufficient.
Force-pushed from c881f55 to 038e086.
Summary by CodeRabbit
Release Notes
New Features
Tests
Chores